Speculative Decoding
In one line
Speculative decoding speeds up inference by letting a small 'draft' model propose several tokens at once and a big model verify them in one shot — a major lever on latency without losing quality.
Going deeper
Speculative decoding lets a small helper model cheaply draft several upcoming tokens; the big main model then checks all of them in a single forward pass, accepting the longest prefix that matches what it would have generated itself and regenerating from the first mismatch. The result is what the main model would have produced on its own (with sampling, the same output distribution), in less wall-clock time. It is one of the standard internal optimisations model vendors use to ship faster inference.
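For readers who want to see the accept-and-regenerate loop concretely, here is a minimal sketch of the greedy (non-sampling) case. Everything in it is an assumption for illustration: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, `k` is the number of drafted tokens per round, and the verification step is written as a plain loop where a production system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: given the tokens so far, return the next token (greedy).
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: the big model generates one token per step."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> List[Token]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target model verifies the proposal position by position.
        #    (A real system would do this in one batched forward pass.)
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if expected != tok:
                # First mismatch: keep the target's own token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every drafted token matched; the verification pass also
            # yields one extra token from the target, if there is room.
            if len(out) + len(accepted) < limit:
                accepted.append(target(out + accepted))

        out.extend(accepted)
    return out


# Toy stand-ins (assumptions, not real models).
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] * 3 + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    # Agrees with the target most of the time, but is wrong at every third position.
    guess = toy_target(seq)
    return (guess + 1) % 10 if len(seq) % 3 == 0 else guess


if __name__ == "__main__":
    prompt = [1]
    print(greedy_decode(toy_target, prompt, 12))
    print(speculative_decode(toy_target, toy_draft, prompt, 12, k=4))
```

Both calls print the same sequence: in the greedy case the draft model only affects how many target passes are needed, never which tokens come out. The speed-up in practice depends on how often the draft model's guesses are accepted.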
Marketers will not configure it, but it is a big part of the answer to 'why does AI keep getting faster?'. Lower latency encourages more frequent and longer AI usage, which in turn increases how often your brand can surface in AI-generated answers (its GEO exposure).
Recent variants — multi-token prediction, EAGLE, Medusa — push the idea further, and some products lean on it to make streamed responses feel almost instantaneous from the first character.
Related terms
LLM
A large language model (LLM) is a neural network trained on massive text corpora to understand and generate human language — the engine behind ChatGPT, Claude, Gemini and similar products.
Test-Time Compute
Test-time compute is the paradigm of spending more inference effort per query to improve accuracy — the shift made tangible by reasoning-focused models like OpenAI o1 and DeepSeek R1.
Model Routing
Model routing dispatches each query to the most suitable model based on difficulty or category — the de-facto pattern for balancing cost, accuracy and latency in production AI.
Quantization
Quantization compresses model weights to lower precision (say, 16-bit down to 4-bit) so the same model fits on smaller GPUs and runs more cheaply.
Transformer
The Transformer is the neural network architecture behind almost every modern LLM, using self-attention to weigh relationships between all tokens in a sequence in parallel.