
Qwen3-1.7B Quantization Comparison Summary

Q3_HIFI (Adaptive/Custom)

Pros:

  • ๐Ÿ† Best quality with lowest perplexity of 17.65 (21.4% better than Q3_K_M, 26.7% better than Q3_K_S)
  • ๐Ÿ“ฆ Smaller than Q3_K_M (993.5 vs 1017.9 MiB) while being significantly better quality
  • ๐ŸŽฏ Uses intelligent layer-sensitive quantization (Q3_HIFI on sensitive layers, mixed q3_K/q4_K elsewhere)
  • ๐Ÿ“Š Most consistent results (lowest standard deviation in perplexity: ยฑ0.16)

Cons:

  • ๐Ÿข Slowest inference at 411.1 TPS (3.4% slower than Q3_K_S)
  • ๐Ÿ”ง Custom quantization may have less community support

Best for: Production deployments where output quality matters, tasks requiring accuracy (reasoning, coding, complex instructions), or when you want the best quality-to-size ratio.

Performance Comparison (Q3_HIFI vs the others)

Q3_K_M

| Metric | Q3_HIFI | Q3_K_M | Difference |
|---|---|---|---|
| Speed (TPS) | 411.11 | 416.70 | -5.59 (1.3% slower) |
| Perplexity | 17.65 | 22.44 | -4.79 (21.4% better) |
| File Size | 993.5 MiB | 1017.9 MiB | -24.4 MiB (2.4% smaller) |
| Bits Per Weight | 4.10 | 4.20 | -0.10 (2.4% less) |
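The percentage deltas in the table can be re-derived from the raw measurements. A quick sanity check in Python (values copied from the table; the last digit may differ slightly from the table's rounding):

```python
# Reproduce the Q3_HIFI vs Q3_K_M deltas from the raw measurements.
def pct(delta, base):
    """Signed percentage change relative to `base`."""
    return 100.0 * delta / base

tps_hifi, tps_km = 411.11, 416.70
ppl_hifi, ppl_km = 17.65, 22.44
mib_hifi, mib_km = 993.5, 1017.9

speed_penalty = pct(tps_km - tps_hifi, tps_km)   # ~1.3% slower
ppl_gain      = pct(ppl_km - ppl_hifi, ppl_km)   # ~21.3-21.4% lower perplexity
size_saving   = pct(mib_km - mib_hifi, mib_km)   # ~2.4% smaller
print(f"{speed_penalty:.1f}% slower, {ppl_gain:.1f}% better, {size_saving:.1f}% smaller")
```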

Pros:

  • โš–๏ธ Traditional "balanced" approach between speed and quality
  • ๐Ÿ“š Well-documented, standard quantization method

Cons:

  • 💾 Largest file size at 1017.9 MiB despite not being the best quality
  • 🐌 Middle-of-the-road speed (416.7 TPS)
  • ❌ Outclassed by Q3_HIFI, which is smaller AND better quality

Best for: Legacy compatibility or when you need a proven, standard quantization approach.

Summary: Q3_HIFI delivers dramatically better quality (21.4% lower perplexity) in a smaller package (2.4% less storage) with only a marginal 1.3% speed penalty.

Q3_K_S

| Metric | Q3_HIFI | Q3_K_S | Difference |
|---|---|---|---|
| Speed (TPS) | 411.11 | 425.64 | -14.53 (3.4% slower) |
| Perplexity | 17.65 | 24.07 | -6.42 (26.7% better) |
| File Size | 993.5 MiB | 948.9 MiB | +44.6 MiB (4.7% larger) |
| Bits Per Weight | 4.10 | 3.92 | +0.18 (4.6% more) |

Pros:

  • ⚡ Fastest inference at 425.6 TPS (~3.4% faster than Q3_HIFI)
  • 💾 Smallest file size at 948.9 MiB
  • ✅ Best choice when speed and storage are critical

Cons:

  • โŒ Worst quality with perplexity of 24.07 (36% higher than Q3_HIFI)
  • Uses only q3_K quantization throughout (no mixed precision)

Best for: Resource-constrained environments, real-time applications where latency matters more than accuracy, or initial prototyping.

Summary: Q3_HIFI trades a 3.4% speed reduction and a 4.7% larger file size for a substantial 26.7% improvement in quality (lower perplexity).
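The same quality gap is quoted two ways in this document (Q3_HIFI is "26.7% better"; Q3_K_S is "36% higher"). Both follow from the same two perplexities and differ only in the baseline:

```python
# Same perplexity gap, two baselines.
ppl_hifi, ppl_ks = 17.65, 24.07
gap = ppl_ks - ppl_hifi                # 6.42

pct_lower  = 100 * gap / ppl_ks        # ~26.7: Q3_HIFI is 26.7% lower
pct_higher = 100 * gap / ppl_hifi      # ~36.4: Q3_K_S is 36.4% higher
print(f"{pct_lower:.1f}% lower vs {pct_higher:.1f}% higher")
```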


Recommendation Matrix

| Priority | Recommended Model | Rationale |
|---|---|---|
| Quality First | Q3_HIFI | 26.7% better perplexity than Q3_K_S with minimal speed loss |
| Speed First | Q3_K_S | 3.4% faster inference, acceptable quality tradeoff for latency-sensitive apps |
| Best Balance | Q3_HIFI | Better quality AND smaller size than Q3_K_M, only 1.3% slower |
| Smallest Size | Q3_K_S | 4.5% smaller than Q3_HIFI, 6.8% smaller than Q3_K_M |

Key Insight

Q3_HIFI represents a clear advancement over the traditional Q3_K_M approach. It achieves:

  • 21.4% lower perplexity (better accuracy)
  • 2.4% smaller file size (993.5 vs 1017.9 MiB)
  • Only 1.3% slower inference (411 vs 417 TPS)
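The reported file sizes are consistent with the bits-per-weight figures. A rough cross-check, assuming roughly 2.03 billion total parameters for Qwen3-1.7B (an assumption inferred from size/bpw, not taken from the model card; the "1.7B" label is nominal):

```python
# Cross-check reported file sizes against bits-per-weight (bpw).
# NOTE: PARAMS is an assumed total parameter count, not an official figure.
PARAMS = 2.03e9

reported = {  # name -> (bpw, reported size in MiB)
    "Q3_HIFI": (4.10, 993.5),
    "Q3_K_M":  (4.20, 1017.9),
    "Q3_K_S":  (3.92, 948.9),
}
for name, (bpw, mib) in reported.items():
    est = PARAMS * bpw / 8 / 2**20  # bits -> bytes -> MiB
    print(f"{name}: estimated {est:.0f} MiB vs reported {mib} MiB")
```

Each estimate lands within a few MiB of the reported size, so the bpw column and the file sizes describe the same models consistently.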

The Q3_K_M quantization is effectively superseded by Q3_HIFI for most use cases. The real choice is between Q3_K_S (maximum speed, acceptable quality) and Q3_HIFI (maximum quality, acceptable speed).

The perplexity improvement on the 1.7B model is even more dramatic than on larger models: Q3_HIFI achieves nearly 27% better quality than Q3_K_S while sacrificing only 3.4% speed. This suggests that importance-matrix-guided quantization is particularly effective at preserving quality in smaller models, where every parameter matters more.

Appendix (Test Environment Details)

| Component | Specification |
|---|---|
| OS | Ubuntu 24.04.3 LTS |
| CPU | AMD EPYC 9254 24-Core Processor |
| CPU Cores | 96 cores (2 threads/core) |
| RAM | 1.0 TiB |
| GPU | NVIDIA L40S × 2 |
| VRAM | 46068 MiB per GPU |
| CUDA | 12.9 |