LLM Models & Architecture · Updated 2026.04.28

Tokenization

Also known as: 토큰화 (Korean for "tokenization"), Tokenizer

In one line

Tokenization is the preprocessing step that breaks text into the small pieces a model actually consumes — and it directly drives cost, context length and multilingual performance.

Going deeper

Tokenization breaks text into the small chunks a model actually consumes. GPT-style models lean on variants of byte-pair encoding (BPE); others use SentencePiece. One token corresponds to roughly four English characters on average, but Korean and Japanese text typically needs far more tokens for the same number of characters.
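To see the character-to-token ratio in practice, here is a minimal sketch using the open-source tiktoken package with the cl100k_base encoding; the specific encoding choice and the example strings are illustrative assumptions, not something stated in this article.

```python
# Minimal sketch: how token counts diverge across languages.
# Assumes the `tiktoken` package is installed (pip install tiktoken);
# cl100k_base is one encoding used by GPT-style models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Tokenization breaks text into the small chunks a model consumes."
korean = "토큰화는 텍스트를 모델이 실제로 소비하는 작은 조각으로 나눕니다."

for label, text in [("English", english), ("Korean", korean)]:
    tokens = enc.encode(text)
    # tokens-per-character is the ratio that drives cost and context use
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens "
          f"({len(tokens) / len(text):.2f} tokens/char)")
```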

This matters in practice for very concrete reasons: API pricing is per token, context windows are measured in tokens, and the same Korean message routinely consumes 1.5x to 3x as many tokens as its English version. The same content is therefore not the same cost or the same context length in every language.
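The cost arithmetic is simple once you have token counts. A rough illustration follows; the per-1K-token price and the token counts are hypothetical placeholders, not any provider's actual rates, and the 2x multiplier just picks a point inside the 1.5x to 3x range above.

```python
# Rough illustration of why the same message costs more in Korean.
# Price and token counts below are hypothetical placeholders.
PRICE_PER_1K_INPUT_TOKENS = 0.0005  # USD per 1K tokens, hypothetical

def request_cost(prompt_tokens: int,
                 price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Cost of a single request's prompt, given a per-1K-token price."""
    return prompt_tokens / 1000 * price_per_1k

english_tokens = 120              # hypothetical English prompt
korean_tokens = english_tokens * 2  # same content at ~2x tokens

print(f"English prompt: ${request_cost(english_tokens):.6f}")
print(f"Korean prompt:  ${request_cost(korean_tokens):.6f}")
```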

For GEO and LLMO, it is worth considering how your content tokenizes: whether it splits into clean, quotable units. Sprawling sentences and tables that resist segmentation can quietly become harder for a model to cite.
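One way to sanity-check this is to split content into sentences and flag any that balloon past a token budget. This is only a sketch under stated assumptions: the regex sentence splitter and the 60-token budget are arbitrary illustrative choices, not an established GEO guideline.

```python
# Rough sketch: flag passages that may be too sprawling to quote cleanly.
# The sentence splitter and the 60-token budget are illustrative assumptions.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_BUDGET = 60  # hypothetical per-passage budget

def flag_long_passages(text: str, budget: int = TOKEN_BUDGET) -> list[tuple[int, str]]:
    """Return (token_count, sentence) pairs that exceed the budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(len(enc.encode(s)), s) for s in sentences
            if len(enc.encode(s)) > budget]

sample = "Your page copy goes here. Long, sprawling sentences show up in this list."
for count, sentence in flag_long_passages(sample):
    print(f"{count} tokens: {sentence[:80]}...")
```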

