LLM Training & Alignment · Updated 2026.04.28

Pretraining

Also known as: pre-training (프리트레이닝), foundational learning (기초 학습)

In one line

Pretraining is the initial stage where an LLM is trained on huge amounts of text to learn general language capability — the step where the model absorbs most of its 'world knowledge'.

Going deeper

Pretraining is step one in building an LLM. The model is trained on huge piles of text — web pages, books, code, dialogue — with a simple objective: predict the next token. Along the way it picks up grammar, factual knowledge and reasoning patterns. Most of what 'ChatGPT knows' was learned here.
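To make the objective concrete, here is a minimal sketch of next-token prediction in PyTorch. It uses a tiny recurrent model as a stand-in for a real transformer stack, and every name, size, and hyperparameter here is illustrative rather than any production setup: the point is only that each position in a text sequence is trained to predict the token that comes next.

```python
# Toy next-token prediction step (illustrative only; real LLMs use large
# transformer stacks trained on trillions of tokens).
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 16

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits over the vocabulary at every position

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, seq_len))  # a batch of token IDs from the corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # shift by one: predict the next token

optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```

Everything an LLM "knows" comes from repeating this single step, at enormous scale, over web pages, books, code and dialogue.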

The brand-side implication is that being represented in pretraining data is roughly equivalent to being known by the model. Sources that get scraped repeatedly — Wikipedia, major press, review sites, GitHub — are where you want accurate information about your brand to live.

Pretraining is by far the most expensive stage, typically taking months of compute and costing tens of millions of dollars. That is why the later stages, such as fine-tuning and RLHF, are what most teams actually touch.


How does your brand show up in AI answers?

Villion measures how your brand appears across ChatGPT, Perplexity and AI Overviews, then automates the work that lifts citation rate and share of voice.

Get a free audit