LLM | Training & Alignment | Updated 2026.04.29

Quantization

Also known as: model quantization, INT8, INT4, GGUF

In one line

Quantization compresses model weights to lower precision (say, 16-bit down to 4-bit) so the same model fits on smaller GPUs and runs more cheaply.

Going deeper

Quantization shrinks an LLM's numerical weights into fewer bits — say, 16-bit floats down to 8-bit or 4-bit integers. Memory and compute drop sharply while answer quality usually holds. With 4-bit quantization, a 70B model fits on a single 48GB workstation GPU, or runs on a 24GB consumer card with offloading.
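The arithmetic behind those GPU sizes is easy to check. The sketch below is a rough estimate that counts the weights only; the function name is made up for illustration, and it ignores the KV cache, activations and the small per-block overhead that real quantized formats carry, so actual usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# A 70B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# prints:
# 16-bit: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

At 4 bits the weights come to roughly 35 GB, which is why they fit on a 48GB card with room left for runtime overhead, while a 24GB card has to offload part of the model to system memory.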

Marketers will not configure it directly, but quantization sits underneath two trends you do feel: on-device AI (assistants running on a phone) and low-cost private deployments (in-house LLMs on modest servers). Both are quantization stories at heart.

Push quantization too far and quality slips, especially on reasoning- or code-heavy tasks. If a deployment feels mysteriously dumber than the same model elsewhere, check the quantization level before anything else.
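A toy experiment shows why fewer bits cost accuracy. The sketch below assumes the simplest possible scheme, symmetric "absmax" rounding with one scale for the whole list, applied to made-up random weights. Production formats such as GGUF variants or the 4-bit scheme in QLoRA use per-block scales and other refinements, so treat this as an illustration of the trend rather than a measurement of any real model.

```python
import random


def quantize_roundtrip(weights, bits):
    """Round weights to signed integers of the given width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax      # one shared scale (a simplifying assumption)
    quantized = [round(w / scale) for w in weights]  # integers in [-qmax, qmax]
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


random.seed(0)
# Hypothetical toy weights: small, roughly bell-shaped values.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {}
for bits in (8, 4, 2):
    restored = quantize_roundtrip(weights, bits)
    errors[bits] = mean_abs_error(weights, restored)
    print(f"{bits}-bit mean abs error: {errors[bits]:.6f}")
```

The rounding error grows as the bit width shrinks, because fewer integer levels have to cover the same range of values. That accumulated error is what eventually shows up as weaker reasoning or code output.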

Sources

QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

Related terms

LLM

Model Distillation

Model distillation trains a small 'student' model to imitate the outputs of a large 'teacher' model — the standard way to move expensive-model quality into a cheaper one.

LLM

Open-weight Model

An open-weight model is an LLM whose weights are publicly released so anyone can download and run it on their own infrastructure — Llama, Mistral and Qwen are the best-known examples.

LLM

Fine-tuning

Fine-tuning takes an already pretrained LLM and trains it further on a narrower dataset to specialise it for a domain, task or voice — the most common path for adapting an LLM to your own data.

LLM

A large language model (LLM) is a neural network trained on massive text corpora to understand and generate human language — the engine behind ChatGPT, Claude, Gemini and similar products.

LLM

Speculative Decoding

Speculative decoding speeds up inference by letting a small 'draft' model propose several tokens at once and a big model verify them in one shot — a major lever on latency without losing quality.
