Quantization
In one line
Quantization compresses model weights to lower precision (say, 16-bit down to 4-bit) so the same model fits on smaller GPUs and runs more cheaply.
Going deeper
Quantization stores an LLM's numerical weights in fewer bits, for example 16-bit floats converted to 8-bit or 4-bit integers. Memory and compute drop sharply while answer quality usually holds. At 4 bits per weight, the weights of a 70B model take roughly 35GB, so the model fits on a single 48GB workstation GPU, or runs on a 24GB consumer card with offloading.
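The memory figures above follow from simple arithmetic, sketched here as a rough estimate. The helper name `weight_memory_gb` is made up for illustration, and the calculation counts only the weights themselves; activations, the KV cache and per-group scaling metadata are ignored, so real deployments need some headroom on top.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Illustrative helper, not part of any library. Ignores activations,
    KV cache and quantization metadata.
    """
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# A 70B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: about {weight_memory_gb(70e9, bits):.0f} GB")
```

Under these assumptions the 70B model needs about 140 GB at 16-bit, 70 GB at 8-bit and 35 GB at 4-bit, which is why only the 4-bit version is within reach of a single 48GB card.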
Marketers will not configure it directly, but quantization sits underneath two trends you do feel: on-device AI (assistants running on a phone) and low-cost private deployments (in-house LLMs on modest servers). Both are quantization stories at heart.
Push quantization too far and quality slips, especially on reasoning- or code-heavy tasks. If a deployment feels mysteriously dumber than the same model elsewhere, the quantization level is worth checking before anything else.
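To see why fewer bits cost accuracy, here is a minimal sketch of symmetric round-to-nearest quantization applied to made-up weights. It is an assumption-laden toy, not how any particular library quantizes: production methods typically work per group of weights and use calibration data. The function names and the random weights are invented for illustration.

```python
import random


def quantize_roundtrip(weights, bits):
    """Quantize to signed integers of the given width, then convert back.

    Toy symmetric round-to-nearest scheme with one scale for all weights.
    Assumes at least one weight is non-zero.
    """
    levels = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
# Hypothetical weights: small values centred on zero.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4, 2):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

The rounding error grows as the bit width shrinks, because fewer integer levels are available to represent the same range of values. That growing error is the mechanism behind the quality loss described above.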
Related terms
Model Distillation
Model distillation trains a small 'student' model to imitate the outputs of a large 'teacher' model — the standard way to move expensive-model quality into a cheaper one.
Open-weight Model
An open-weight model is an LLM whose weights are publicly released so anyone can download and run it on their own infrastructure — Llama, Mistral and Qwen are the best-known examples.
Fine-tuning
Fine-tuning takes an already pretrained LLM and trains it further on a narrower dataset to specialise it for a domain, task or voice — the most common path for adapting an LLM to your own data.
LLM
A large language model (LLM) is a neural network trained on massive text corpora to understand and generate human language — the engine behind ChatGPT, Claude, Gemini and similar products.
Speculative Decoding
Speculative decoding speeds up inference by letting a small 'draft' model propose several tokens at once and a big model verify them in one shot — a major lever on latency without losing quality.