smol-IQ1_KT

#3
by wunderschnitzel - opened

Hi,

sharing the results I got on my home rig, so anyone who is thinking about something like this knows what to expect:
Ryzen 9 9900X + Asus X870E Creator + 192GB @ 6000MT/s + 5090 + 2x3090 + 4070 Ti Super
Just a quick note on the processor choice: it's the cheapest model that has 2 CCDs, thus giving the same RAM bandwidth as more expensive variants. It's also easier to cool!
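For context on why memory bandwidth dominates TG here, a quick back-of-envelope sketch (theoretical peak for dual-channel DDR5-6000; real sustained bandwidth will be lower, and the 2-CCD point is about actually being able to saturate it):

```python
# Theoretical peak bandwidth of dual-channel DDR5-6000 (not a measurement).
transfers_per_sec = 6000e6   # DDR5-6000 = 6000 MT/s
bytes_per_transfer = 8       # each channel has a 64-bit data bus = 8 bytes
channels = 2                 # dual-channel AM5 platform
peak_gb_s = transfers_per_sec * bytes_per_transfer * channels / 1e9
print(f"theoretical peak: {peak_gb_s:.0f} GB/s")  # 96 GB/s
```

Since TG has to stream the active weights from RAM every token, that ~96 GB/s ceiling is the main limit once the experts live on the CPU side.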

./ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /home/llm_models/kimi-k2-0925/smol-IQ1_KT/Kimi-K2-Instruct-0905-smol-IQ1_KT-00001-of-00005.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ngl 99 \
    -ctk q8_0 \
    -mla 3 \
    -ot "blk\.(1|2|3|4|5|6)\.ffn_.*=CUDA0" \
    -ot "blk\.(7|8)\.ffn_.*=CUDA1" \
    -ot "blk\.(9|10|11|12)\.ffn_.*=CUDA2" \
    -ot "blk\.(13|14|15|16)\.ffn_.*=CUDA3" \
    -ot exps=CPU \
    --no-mmap \
    --threads 11 \
    --parallel 1 \
    -b 4096 -ub 4096
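For anyone adapting those `-ot` overrides to their own GPU mix, here's a small sketch of how the regexes above partition tensors, assuming rules are tried in order and the first match wins (the tensor names below are just illustrative):

```python
import re

# The -ot rules from the command above, in the order they were given.
rules = [
    (r"blk\.(1|2|3|4|5|6)\.ffn_.*", "CUDA0"),
    (r"blk\.(7|8)\.ffn_.*",         "CUDA1"),
    (r"blk\.(9|10|11|12)\.ffn_.*",  "CUDA2"),
    (r"blk\.(13|14|15|16)\.ffn_.*", "CUDA3"),
    (r"exps",                       "CPU"),   # catch-all: remaining experts stay on CPU
]

def device_for(tensor_name):
    """Return the device the first matching rule assigns, else the default -ngl placement."""
    for pattern, device in rules:
        if re.search(pattern, tensor_name):
            return device
    return "default (-ngl placement)"

# Hypothetical tensor names, just to show the routing:
print(device_for("blk.3.ffn_up_exps.weight"))   # CUDA0 (layers 1-6)
print(device_for("blk.20.ffn_up_exps.weight"))  # CPU (only the 'exps' catch-all matches)
```

So layers 1-16 get their FFN tensors pinned across the four GPUs (sized to each card's VRAM), and every other expert tensor falls through to the CPU catch-all.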

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 11, n_threads_batch = 11

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   14.478 |   282.91 |  149.151 |     6.87 |
|  4096 |   1024 |   4096 |   14.566 |   281.20 |  150.927 |     6.78 |
|  4096 |   1024 |   8192 |   14.758 |   277.54 |  153.422 |     6.67 |
|  4096 |   1024 |  12288 |   14.958 |   273.83 |  155.219 |     6.60 |
|  4096 |   1024 |  16384 |   15.158 |   270.22 |  156.512 |     6.54 |
|  4096 |   1024 |  20480 |   15.529 |   263.76 |  158.236 |     6.47 |
|  4096 |   1024 |  24576 |   16.423 |   249.40 |  161.957 |     6.32 |
|  4096 |   1024 |  28672 |   17.273 |   237.13 |  163.094 |     6.28 |
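For readers new to sweep-bench tables: the speed columns are just tokens divided by wall time, and comparing the first and last rows shows TG only drops about 9% going from an empty cache to ~28K tokens of context. A quick check against two rows from the table above:

```python
# Reproduce S_PP and S_TG from the first and last rows of the sweep above.
rows = [
    # (PP tokens, TG tokens, T_PP s, T_TG s)
    (4096, 1024, 14.478, 149.151),  # N_KV = 0 (empty cache)
    (4096, 1024, 17.273, 163.094),  # N_KV = 28672 (near-full context)
]
for pp, tg, t_pp, t_tg in rows:
    print(f"S_PP = {pp / t_pp:.2f} t/s, S_TG = {tg / t_tg:.2f} t/s")
```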

What I found very surprising is that a 1-bit quant like this is very usable, at least for conversation and RPing, while 1-bit DeepSeek quants felt like you were talking to a very smart but also very, very sleepy person. This one is very creative and reasonably coherent.

Overall it's mind-blowing that we can run a 1T model on consumer (even if expensive) hardware, even if it's just a 1-bit quant. Feels like when I first played Wing Commander 2 on a 386SX: it lagged, but it felt monumental lol

Really amazing what you can pull out of that AM5 rig with 4x populated DDR5 DIMMs and some 4x GPUs (are they at x4 PCIe lanes each?)

No presh, but some new Ling-1T quants are available now, including a couple "small" enough to fit your RAM+VRAM: https://huggingface.co/ubergarm2/Ling-1T-GGUF. TG will likely be a little slower given A50B (though I tried to shrink the active weights some, so it isn't too bad)...

The smol-IQ1_KT is indeed an odd one, as the KT quants tend to be CPU bottlenecked for TG, since they have to compute the trellis each time around. But I feel like when they're mixed with some other quant types, and with not a ton of RAM bandwidth to begin with, they are still able to pack in a ton of quality without losing much speed.

lol I too played Wing Commander 2 on my friend's 386SX back around 1994 probably haha... that and Privateer were fun on an early flight stick! My old RadioShack Tandy 286 at 10MHz couldn't handle it at all...

cheers!

The GPUs are connected to the three PCIe slots and one M.2 through an adapter, so they run at PCIe 5.0 x8, 4.0 x8, 4.0 x4, and 4.0 x4. The 5090 is in the main slot; the other cards are attached with risers (two 90cm Lian Li PCIe 4.0 risers and another 1m PCIe 4.0 riser), externally mounted on three Lian Li vertical GPU mounts stuck together. All of it is powered by a Seasonic/Noctua 1500W, and inference tops out at 1000W even with the cards unlimited, for now. It's brutal but effective, I guess.

For the memory I was lucky: it's plain Corsair Vengeance and runs almost stable at 6000MT/s, "almost" meaning it throws an error after an hour or two of memtest but gives no problems in day-to-day work. Again, not the best, but it works.

I'm considering an Epyc build, but I'd have to redo the home wiring, and that's a huge hassle!

Thanks for the heads up on Ling-1T, I didn't realize you'd already cooked an IQ1 quant, I'll try it after freeing up some space!
