CCBot
Common Crawl Bot
In one line
CCBot is the crawler operated by the nonprofit Common Crawl — and the dataset it produces is the starting point for the training data of many LLMs.
Going deeper
CCBot is the crawler run by the nonprofit Common Crawl. The dataset it produces is freely downloadable, and it has historically been a foundational ingredient in pretraining for many LLMs, including the early GPT family.
What makes CCBot interesting is the indirect effect: many model builders train on Common Crawl's corpus rather than crawling your site themselves, so blocking CCBot tends to reduce your share of voice across LLM training pools, while allowing it lets your content quietly flow into a wide range of model pretraining runs.
Blocking CCBot is not the same as opting out of all LLM training. Model providers run their own crawlers (GPTBot, ClaudeBot and others), so if training-data policy actually matters to you, the work needs to be done bot by bot.
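As a sketch of that bot-by-bot work, the snippet below uses Python's standard `urllib.robotparser` to check how a hypothetical robots.txt treats each crawler. The rules and URL here are illustrative assumptions, not a recommended policy:

```python
from urllib import robotparser

# Hypothetical robots.txt: each AI training crawler is opted out
# individually, because blocking CCBot alone does not cover the rest.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Each training bot is blocked; ordinary crawlers fall through to the
# wildcard group and remain allowed.
for agent in ("CCBot", "GPTBot", "Google-Extended", "Googlebot"):
    print(agent, rp.can_fetch(agent, "https://example.com/page"))
```

Running this shows the training crawlers denied and `Googlebot` still allowed, which is the point: each opt-out is a separate robots.txt group, and omitting one leaves that bot's access unchanged.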
Related terms
GPTBot
GPTBot is OpenAI's official web crawler, used to collect training data for its models and controllable via robots.txt.
ClaudeBot
ClaudeBot is Anthropic's web crawler used for training Claude and grounding its answers — manageable via robots.txt.
Google-Extended
Google-Extended is a standalone robots.txt token (not a separate crawler) that lets site owners control whether their content is used to train Gemini and Vertex AI models, independently of regular search indexing.
Applebot-Extended
Applebot-Extended is the identifier Apple introduced to let site owners opt out of AI training separately from regular Applebot, which still powers Siri and Spotlight indexing.
llms.txt
llms.txt is a proposed text file placed at the site root that tells large language models where the most important content lives — think 'sitemap, but written for LLMs'.