Every AI crawler indexing your website in 2026
AI companies use web crawlers to collect training data and power real-time AI search. Here is a complete list of every known AI crawler, what company runs it, what it does, and how to control access through your robots.txt.
What are AI crawlers?
AI crawlers are automated bots that visit websites to collect content. Some crawlers gather training data for language models. Others power real-time search features where AI generates answers from live web content. They identify themselves through user-agent strings in their HTTP requests.
Unlike traditional search engine crawlers (Googlebot, Bingbot) that build search indexes, AI crawlers serve a different purpose: feeding content into AI models. This distinction matters because you might want Google to index your site for search results while blocking your content from being used as AI training data.
The complete list
OpenAI
| Crawler | Purpose |
|---|---|
| GPTBot | Collects training data for GPT models. Also powers ChatGPT's browsing feature when searching the web. |
| ChatGPT-User | Used when a ChatGPT user explicitly asks the model to visit and read a specific URL. This is browsing on demand, not bulk crawling. |
| OAI-SearchBot | Powers ChatGPT Search (formerly SearchGPT). Crawls pages to generate real-time search answers. |
Anthropic
| Crawler | Purpose |
|---|---|
| ClaudeBot | Collects training data for Claude models. Anthropic has committed to respecting robots.txt directives. |
| anthropic-ai | Older user-agent string used by Anthropic. Some sites still reference it in robots.txt. |
Google
| Crawler | Purpose |
|---|---|
| Google-Extended | Collects data for Gemini and other AI products. Separate from Googlebot, so blocking it does not affect your Google Search rankings. |
Perplexity
| Crawler | Purpose |
|---|---|
| PerplexityBot | Powers Perplexity's AI search engine. Crawls pages to generate real-time answers with source citations. |
Amazon
| Crawler | Purpose |
|---|---|
| Amazonbot | Collects data for Alexa and Amazon's AI services. Respects robots.txt. |
Common Crawl
| Crawler | Purpose |
|---|---|
| CCBot | Builds the Common Crawl open dataset, which is used as training data by many AI companies including those building open-source models. Blocking CCBot is a broad way to reduce training data exposure. |
Apple
| Crawler | Purpose |
|---|---|
| Applebot-Extended | Collects data for Apple Intelligence features. Separate from the main Applebot used for Siri and Spotlight search. |
Meta
| Crawler | Purpose |
|---|---|
| Meta-ExternalAgent | Collects data for Meta AI products. Respects robots.txt since mid-2024. |
| FacebookBot | Crawls public pages to improve language models for Meta's speech recognition technology. Link previews for Facebook and Instagram are handled by the separate facebookexternalhit agent. |
Other notable crawlers
| Crawler | Company | Purpose |
|---|---|---|
| Bytespider | ByteDance | Training data for TikTok and ByteDance AI products |
| cohere-ai | Cohere | Training data for Cohere's enterprise AI models |
| Diffbot | Diffbot | Web data extraction for knowledge graphs |
| Timpibot | Timpi | Decentralized search index |
| YouBot | You.com | AI search engine |
How to control AI crawler access
Your robots.txt file is the standard mechanism. Add User-agent directives for each crawler you want to allow or block:
- Allow all AI crawlers: Do nothing. The default is open access.
- Block specific crawlers: Add a `User-agent: GPTBot` directive followed by `Disallow: /`.
- Allow specific crawlers: If you have a blanket `Disallow`, add specific `Allow` rules for the bots you want to permit.
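Put together, a minimal robots.txt that blocks two training crawlers while leaving everything else open might look like this (the crawler names come from the list above; adjust to your own policy):

```
# Block OpenAI's training crawler and Common Crawl from the whole site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers keep default (open) access
User-agent: *
Allow: /
```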
Tools like agentmarkup automate this at build time, patching your robots.txt without breaking existing rules and validating for conflicts. See the AI crawlers guide for configuration.
Do AI crawlers respect robots.txt?
Compliance is voluntary, not enforced. That said, the major companies have publicly committed to respecting robots.txt:
- OpenAI: Committed to respecting robots.txt for GPTBot since 2023. Published documentation with opt-out instructions.
- Anthropic: ClaudeBot respects robots.txt. Anthropic published a dedicated page for webmasters.
- Google: Google-Extended is fully controlled through robots.txt, separate from Googlebot.
- Perplexity: PerplexityBot respects robots.txt. Perplexity has faced criticism in the past but has since improved compliance.
Smaller or less-known crawlers may not comply. There is no technical enforcement mechanism for robots.txt. It is a social contract.
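Since compliance is voluntary, the only thing you can verify is how a well-behaved bot would interpret your own file. Python's standard `urllib.robotparser` models that behavior; the robots.txt content below is a made-up example:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks GPTBot but allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler calls can_fetch() before requesting a page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # True
```

This is also a quick way to sanity-check a new robots.txt for conflicts before deploying it: run `can_fetch` for each crawler you care about and confirm the answers match your intent.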
Training data vs real-time search
An important distinction: some crawlers collect data for model training (a one-time or periodic process), while others power real-time AI search (your content appears in live answers).
- Training crawlers: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent. Your content becomes part of the model's knowledge.
- Search crawlers: PerplexityBot, OAI-SearchBot, ChatGPT-User. Your content is fetched and cited in real-time answers.
You might want to block training crawlers (you do not want your content used to train models) while allowing search crawlers (you do want your content cited in AI answers). This selective approach is possible because each crawler uses a different user-agent string.
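That selective policy can be sketched in robots.txt like this (user-agent names are from the list above; tailor the groups to your preferences):

```
# Block training-data crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow real-time AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Note that GPTBot serves double duty in OpenAI's documentation (training and browsing), so blocking it affects both; OAI-SearchBot and ChatGPT-User remain separately controllable.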
The bottom line
AI crawlers are a permanent part of the web. The question is not whether they visit your site but whether you control the terms. A clear robots.txt policy, configured intentionally rather than by accident, is the minimum. Combined with llms.txt and JSON-LD structured data, you can make your site both accessible and understandable to AI systems on your terms.