Transformer
In one line
The Transformer is the neural network architecture behind almost every modern LLM, using self-attention to weigh relationships between all tokens in a sequence in parallel.
Going deeper
The Transformer architecture comes from Google's 2017 paper 'Attention Is All You Need'. Unlike the RNNs and LSTMs that came before, it processes tokens in parallel and uses self-attention to weigh how every token relates to every other token in a sequence.
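To make 'self-attention' concrete, here is a minimal sketch of the paper's scaled dot-product attention in NumPy. The function name and the random projection matrices are illustrative stand-ins; in a real model, Wq, Wk and Wv are learned during training.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X is (seq_len, d_model): one embedding vector per token.
    Wq, Wk, Wv are the learned query/key/value projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token's query is scored against every token's key,
    # all at once -- this is the "in parallel" part.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax each row so the scores become attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all value vectors.
    return weights @ V

# Toy run: 4 tokens, 8-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```

The weights matrix is the 'weigh how every token relates to every other token' from the paragraph above; a real Transformer runs many such attention heads per layer and stacks dozens of layers.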
Almost every LLM you have heard of — GPT, Claude, Gemini, Llama — is a variation on the Transformer. The skeleton is the same; the differences come from training data, alignment recipes and tuning know-how.
Marketers will not touch the Transformer directly, but the architecture explains two things you do feel: why LLMs handle long context fairly well, and why pricing scales with token count.
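A back-of-the-envelope illustration of the second point: plain self-attention scores every token against every other, so the score matrix grows with the square of the prompt length. The counts below are illustrative arithmetic, not any vendor's pricing.

```python
# Plain self-attention builds an n-by-n score matrix per head,
# so longer prompts cost disproportionately more to process.
for n in (1_000, 4_000, 16_000):
    print(f"{n:>6} tokens -> {n * n:>15,} pairwise scores")
```

This quadratic growth is why long-context support is a headline feature and why token counts, not requests, are the billing unit.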
Related terms
LLM
A large language model (LLM) is a neural network trained on massive text corpora to understand and generate human language — the engine behind ChatGPT, Claude, Gemini and similar products.
Token
A token is the basic unit an LLM reads and writes — usually a word or piece of a word. LLM pricing and context limits are all measured in tokens.
Context Window
The context window is the maximum number of tokens an LLM can take in at once — it defines how much content the model can consider in a single prompt.
Pretraining
Pretraining is the initial stage where an LLM is trained on huge amounts of text to learn general language capability — the step where the model absorbs most of its 'world knowledge'.
GPT
GPT (Generative Pre-trained Transformer) is OpenAI's family of Transformer-based LLMs — the engine behind ChatGPT and the de facto baseline of the current AI market.