What LLM Quantization Actually Does to the One Task You Run

Published quantization benchmarks make 4-bit look nearly free, because a model's average score barely moves. The trap is that the capability which degrades first under compression is almost never the one a broad benchmark measures. The fix is boring and reliable: a 50-example test on the task that actually ships, run before and after quantizing.
I went back through the quantization numbers that piled up this spring, and the headline is almost suspiciously calm. A Digital Applied study from late April lined up six open-weight 70B-class models, the usual suspects like Llama 4 70B, Qwen 3 72B, DeepSeek V4-Flash, Mistral Large 2, Command-R+, and Yi 2, then measured what four different compression formats did to their scores. The worst performer, 4-bit GPTQ, landed within 1.9 points of full precision on MMLU-Pro and HumanEval+. FP8 came within 0.4. If anyone stopped reading there, quantization looks like a free lunch: roughly half the memory, a real throughput bump, and a rounding error of quality lost.
That number is true. It is also the most dangerous kind of true.
What quantization in an LLM actually changes
Quantization, stripped of mystique, is just storing a model’s weights in fewer bits. FP16 keeps sixteen bits per weight. INT8 and FP8 use eight. The 4-bit formats, including AWQ, GPTQ, and the Q4 variants of GGUF, use roughly four. Fewer bits means less memory and faster inference, which is the entire reason anyone bothers.
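To make "fewer bits" concrete, here is a minimal sketch of the simplest possible scheme: symmetric round-to-nearest with one scale for the whole tensor. This is an illustration only. Real formats such as GPTQ and AWQ are more sophisticated than this, and the weight values and function names below are made up for the example.

```python
def quantize_roundtrip(weights, bits):
    """Quantize to signed integers of the given width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # the integer that gets stored
        restored.append(q * scale)                   # what the model effectively sees
    return restored


def mean_abs_error(original, restored):
    """Average distance between each original weight and its restored value."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights, purely for illustration.
weights = [0.82, -0.41, 0.07, -0.93, 0.15, 0.58, -0.02, 0.33]

for bits in (8, 4):
    err = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The 4-bit round trip has only 15 levels to land on, so its error comes out noticeably larger than the 8-bit one. That rounding error, repeated across billions of weights, is the quality cost the benchmarks are trying to measure.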
The recent numbers make the appeal concrete.
| Format | Gap to full precision on MMLU-Pro and HumanEval+ (points) |
|---|---|
| FP8 | within 0.4 |
| INT8 | within 0.7 |
| AWQ 4-bit | within 1.6 |
| GPTQ 4-bit | within 1.9 |
On an H100 doing batch-1 decode, that same Digital Applied run clocked FP8 at a 1.4 to 1.7x throughput lift for half the VRAM, INT8 at 1.4x for half the VRAM, and AWQ 4-bit at 3.1x. NVIDIA’s vLLM release notes from June put dynamic FP8 in the same territory: roughly a 2x memory cut and up to 1.6x throughput on Hopper and Ada, with no calibration data required. On the consumer side, an InsiderLLM writeup on the new NVFP4 format landing in llama.cpp measured a 68% prefill jump on a 5090, with Qwen 3.6 27B shrinking from about 17GB to 14GB. Underneath all of it, the established GGUF rule of thumb still holds: Q4_K_M keeps around 95% of full-precision quality while cutting memory by somewhat more than threefold versus FP16, since it averages closer to five bits per weight than four.
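The memory side of those claims is plain arithmetic, sketched below for a round 70B-parameter model. The parameter count is an assumption for illustration, and the function counts weights only. It ignores the KV cache, activations, and the small overhead that real formats add for scales.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # assumption: a round 70B model, not any specific release

for label, bits in [("FP16", 16), ("FP8/INT8", 8), ("4-bit", 4)]:
    print(f"{label:9s} {weight_memory_gb(N_PARAMS, bits):6.1f} GB")
```

That works out to 140 GB at FP16, 70 GB at 8 bits, and 35 GB at 4 bits, which is the difference between needing several accelerators and fitting on one.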
"FP8 lands within 0.4 points of FP16 on MMLU-Pro and HumanEval+, INT8 within 0.7 points, AWQ-4 within 1.6 points, and GPTQ-4 within 1.9 points."
So the appeal is real and the average cost is small. That is the part everyone repeats. The next part is the one that costs teams a quarter.
Where LLM quantization quietly breaks
Here is what the averages bury. A benchmark like MMLU-Pro asks thousands of broad knowledge and reasoning questions. When a 4-bit model loses 1.6 points across all of them, that loss is spread thin and even. But a production workload is not thousands of evenly weighted questions. It is one narrow thing done over and over: structured JSON output, a function call with five arguments, a retrieval step that has to stay faithful to a long context. Those narrow capabilities are exactly the ones that degrade first, and they barely register in a broad average.
The same Digital Applied dataset shows the asymmetry without anyone having to argue it. Drop a model to INT4 weight-only and it loses about 1.6 points on MMLU-Pro. On HumanEval, the coding benchmark, the same quant costs roughly 8 points. Coding is narrower and more brittle than general trivia, so it falls about five times harder (8 points against 1.6) from the identical compression. Now picture a capability narrower than coding, one no public benchmark even tracks, which is to say the capability a product is actually built on.
A quantized model rarely fails on the benchmark. It fails on the one job a team bought it for.
I keep seeing the same sequence. A team picks a model off a leaderboard, quantizes it to fit the card they already own, runs a handful of sanity prompts that look great, and ships. Three weeks later the tool-calling success rate has quietly slipped from 98% into the low 80s, the support queue is filling with notes that say “the bot did something weird,” and nobody connects it to the quant because the model still aces every demo question. The averages said everything was fine. The averages were measuring the wrong thing.
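A little arithmetic shows why a handful of sanity prompts cannot catch that slide. The sketch below assumes each prompt succeeds independently at a fixed rate, and takes 82% as a stand-in for "the low 80s". Both are simplifying assumptions, and the function name is hypothetical.

```python
def prob_all_pass(success_rate, n_prompts):
    """Chance that every one of n independent prompts succeeds."""
    return success_rate ** n_prompts


DEGRADED_RATE = 0.82  # assumption: a stand-in for "the low 80s"

for n in (5, 50):
    p = prob_all_pass(DEGRADED_RATE, n)
    print(f"{n:2d} prompts: {p:.5f} chance the degraded model passes all of them")
```

With five prompts, a model that fails almost one call in five still sails through cleanly about 37% of the time. With fifty, a clean sweep is vanishingly unlikely, so the regression shows up as failures you can count.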
The capability that degrades first under quantization is almost never the one the leaderboard measures.
None of this means 4-bit is a mistake. It means the headline number and the deployment risk live in two different places, and the gap between them is where teams get surprised.
How to read an LLM quantization benchmark without getting fooled
The pattern underneath all of this is simple and a little humbling. A published quantization benchmark describes the model in general, and nobody deploys the model in general. The benchmark is a starting filter, not a verdict.
A few habits make the published numbers more honest to read. Watch which benchmark moved, not just how far; a small average drop hiding a large slide on a code or reasoning subscore is the tell, and it is the failure mode that ships to production. Treat the 4-bit formats, AWQ, GPTQ, and GGUF’s Q4 variants, as roughly a tier below 8-bit on quality, and assume the gap surfaces on the hard, narrow tasks first rather than the easy broad ones. And notice when a model ships with native low-bit weights instead of being squeezed down after training. This spring’s quantization-aware releases, the kind Moonshot shipped with Kimi’s native INT4 weights, treat 4-bit as the intended deployment target rather than a degraded fallback, which is a genuinely different quality profile from taking an FP16 model and crushing it to fit.
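The first habit, watching which benchmark moved, can be sketched as a tiny report that puts the worst slide next to the average. The baseline scores below are made up. Only the sizes of the two drops, 1.6 and 8 points, come from the figures cited earlier, and both `summarize_drops` and the 3-point alert threshold are hypothetical choices.

```python
def summarize_drops(full_scores, quant_scores, alert_points=3.0):
    """Report the average drop alongside the single worst per-benchmark drop."""
    drops = {
        name: round(full_scores[name] - quant_scores[name], 2)
        for name in full_scores
    }
    average = round(sum(drops.values()) / len(drops), 2)
    worst = max(drops, key=drops.get)
    flagged = sorted(name for name, d in drops.items() if d >= alert_points)
    return {
        "average_drop": average,
        "worst": worst,
        "worst_drop": drops[worst],
        "flagged": flagged,
    }


# Hypothetical baselines, chosen so the drops match the 1.6 and 8 point figures.
full_scores = {"MMLU-Pro": 70.0, "HumanEval": 80.0}
quant_scores = {"MMLU-Pro": 68.4, "HumanEval": 72.0}

report = summarize_drops(full_scores, quant_scores)
print(report)
```

Reading the worst drop before the average is the whole point: an average hides exactly the subscore that is most likely to resemble a production task.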
The development that actually helps a team is not a better leaderboard. It is how cheap and fast a real evaluation has become. Building a held-out test of true inputs used to feel like a research luxury. Now it is an afternoon of work, and it is the single highest-leverage afternoon in this whole exercise.
Test a quant on 50 real examples first
If we were sitting down over coffee, here is the whole thing in one move. Before quantizing anything, pull 50 real examples from your actual workload, the messy ones included, and write down what a correct answer looks like for each. Run them on the full-precision model and save the results. Quantize, run the same 50, and compare. If the quant holds, you just bought half the memory back with evidence instead of hope. If it cracks, you found out on a quiet Tuesday afternoon instead of in front of real users.
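Here is one way that before-and-after comparison might look in code. It is a sketch under stated assumptions: the two "models" are stand-in functions rather than real inference calls, the `Example` type and helper names are hypothetical, and the correctness check (valid JSON containing an expected key) is just one example of writing down what a correct answer looks like.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    is_correct: Callable[[str], bool]  # your written-down definition of "correct"


def valid_json_with_key(key):
    """Build a checker that passes only if the output is JSON containing `key`."""
    def check(output):
        try:
            return key in json.loads(output)
        except (json.JSONDecodeError, TypeError):
            return False
    return check


def pass_flags(model, examples):
    """Run every example through a model and record pass/fail for each."""
    return [ex.is_correct(model(ex.prompt)) for ex in examples]


def compare(full_model, quant_model, examples):
    """Compare pass counts and list prompts that passed before but fail after."""
    base = pass_flags(full_model, examples)
    quant = pass_flags(quant_model, examples)
    regressions = [
        ex.prompt for ex, b, q in zip(examples, base, quant) if b and not q
    ]
    return {
        "full_pass": sum(base),
        "quant_pass": sum(quant),
        "total": len(examples),
        "regressions": regressions,
    }


# Stand-in models: replace these with calls to your real inference endpoints.
def full_model(prompt):
    return '{"city": "Oslo"}'


def quant_model(prompt):
    # Simulates a quant that truncates JSON on longer inputs.
    return '{"city": "Oslo"' if "long" in prompt else '{"city": "Oslo"}'


examples = [
    Example("short request", valid_json_with_key("city")),
    Example("long messy request", valid_json_with_key("city")),
]

result = compare(full_model, quant_model, examples)
print(result)
```

In real use the list would hold your 50 examples, and the `regressions` list is the part worth reading line by line, because it names the exact inputs the quant broke.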
That is the entire discipline. Not a bigger benchmark, not a fancier format, just the stubborn habit of testing the one task you actually run. Quantization is genuinely one of the best deals in this field right now. It only turns into a trap when a team lets a leaderboard make a promise the leaderboard was never measuring.
Sources
- Quantization Tradeoffs: 4-bit vs 8-bit vs FP8 Performance Data - Digital Applied, 2026-04-24
- FP4 Just Landed in llama.cpp: NVFP4 vs MXFP4 Explained - InsiderLLM, 2026-04-25
- vLLM Release Notes v26.05 - NVIDIA, 2026-06-10
- Kimi K2.7-Code: Open Weights, 340GB Reality Check - Modem Guides, 2026-06-12