Quantization
In one line
Quantization compresses model weights to lower precision (say, 16-bit down to 4-bit) so the same model fits on smaller GPUs and runs more cheaply.
Going deeper
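Quantization stores an LLM's numerical weights in fewer bits: for example, 16-bit floats converted to 8-bit or 4-bit integers. Memory use falls roughly in proportion to the bit width, and inference usually gets cheaper, while answer quality usually holds. A 70B model needs about 140GB for its weights at 16-bit but only about 35GB at 4-bit, so it fits on a single 48GB workstation GPU, or runs on a 24GB consumer card with part of the model offloaded to system memory.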
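The memory arithmetic behind those figures is simple enough to sketch. The snippet below is a rough back-of-envelope estimate, not a sizing tool: the function name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and only the weights are counted. Real deployments also need room for activations, the attention cache, and the small scaling factors that quantized formats store alongside the weights.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


PARAMS_70B = 70e9  # a 70-billion-parameter model

for bits in (16, 8, 4):
    print(f"{bits}-bit: about {weight_memory_gb(PARAMS_70B, bits):.0f} GB of weights")
```

Under these assumptions, the 4-bit estimate of about 35 GB leaves some headroom on a 48GB card, while the 16-bit estimate of about 140 GB would need several GPUs.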
Marketers will not configure it directly, but quantization sits underneath two trends you do feel: on-device AI (assistants running on a phone) and low-cost private deployments (in-house LLMs on modest servers). Both are quantization stories at heart.
Push quantization too far and quality slips, especially on reasoning- or code-heavy tasks. If a deployment feels mysteriously dumber than the same model elsewhere, the quant level is worth checking before anything else.
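To see why quality slips as the bit width shrinks, it can help to look at the rounding error directly. The sketch below uses plain symmetric round-to-nearest quantization on randomly generated stand-in weights. That is an assumption made for illustration: production quantizers are more sophisticated (for example, they use separate scales for small groups of weights), and the helper names `quantize_dequantize` and `mean_abs_error` are hypothetical.

```python
import random


def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization, then conversion back to floats."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        restored.append(q * scale)          # what the model actually "sees"
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Stand-in weights: small values centred on zero (illustrative, not from a real model).
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4, 2):
    err = mean_abs_error(weights, quantize_dequantize(weights, bits))
    print(f"{bits}-bit mean absolute rounding error: {err:.6f}")
```

Each step down in bit width leaves fewer levels to round to, so the restored weights drift further from the originals. Tasks that depend on long chains of precise steps, such as reasoning and code, tend to be the first to show it.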
Related terms
Model Distillation
Model distillation trains a small 'student' model to imitate the outputs of a large 'teacher' model — the standard way to move expensive-model quality into a cheaper one.
Open-weight Model
An open-weight model is an LLM whose weights are publicly released so anyone can download and run it on their own infrastructure — Llama, Mistral and Qwen are the best-known examples.
Fine-tuning
Fine-tuning takes an already pretrained LLM and trains it further on a narrower dataset to specialise it for a domain, task or voice — the most common path for adapting an LLM to your own data.
LLM
A large language model (LLM) is a neural network trained on massive text corpora to understand and generate human language — the engine behind ChatGPT, Claude, Gemini and similar products.
Speculative Decoding
Speculative decoding speeds up inference by letting a small 'draft' model propose several tokens at once and a big model verify them in one shot — a major lever on latency without losing quality.