Every AI crawler indexing your website in 2026
AI companies use web crawlers to collect training data and power real-time AI search. Here is a complete list of every known AI crawler, what company runs it, what it does, and how to control access through your robots.txt.
What are AI crawlers?
AI crawlers are automated bots that visit websites to collect content. Some crawlers gather training data for language models. Others power real-time search features where AI generates answers from live web content. They identify themselves through user-agent strings in their HTTP requests.
Unlike traditional search engine crawlers (Googlebot, Bingbot) that build search indexes, AI crawlers serve a different purpose: feeding content into AI models. This distinction matters because you might want Google to index your site for search results while blocking your content from being used as AI training data.
The complete list
OpenAI
| Crawler | Purpose |
|---|---|
| GPTBot | Collects training data for GPT models. Also powers ChatGPT's browsing feature when searching the web. |
| ChatGPT-User | Used when a ChatGPT user explicitly asks the model to visit and read a specific URL. This is browsing on demand, not bulk crawling. |
| OAI-SearchBot | Powers ChatGPT Search (formerly SearchGPT). Crawls pages to generate real-time search answers. |
Anthropic
| Crawler | Purpose |
|---|---|
| ClaudeBot | Collects training data for Claude models. Anthropic has committed to respecting robots.txt directives. |
| anthropic-ai | Older user-agent string used by Anthropic. Some sites still reference it in robots.txt. |
Google
| Crawler | Purpose |
|---|---|
| Google-Extended | Collects data for Gemini and other AI products. Separate from Googlebot, so blocking it does not affect your Google Search rankings. |
Perplexity
| Crawler | Purpose |
|---|---|
| PerplexityBot | Powers Perplexity's AI search engine. Crawls pages to generate real-time answers with source citations. |
Amazon
| Crawler | Purpose |
|---|---|
| Amazonbot | Collects data for Alexa and Amazon's AI services. Respects robots.txt. |
Common Crawl
| Crawler | Purpose |
|---|---|
| CCBot | Builds the Common Crawl open dataset, which is used as training data by many AI companies including those building open-source models. Blocking CCBot is a broad way to reduce training data exposure. |
Apple
| Crawler | Purpose |
|---|---|
| Applebot-Extended | Collects data for Apple Intelligence features. Separate from the main Applebot used for Siri and Spotlight search. |
Meta
| Crawler | Purpose |
|---|---|
| Meta-ExternalAgent | Collects data for Meta AI products. Respects robots.txt since mid-2024. |
| FacebookBot | Crawls public pages to improve language models for Meta's speech recognition technology. Link previews for Facebook and Instagram are handled by the separate facebookexternalhit agent. |
Other notable crawlers
| Crawler | Company | Purpose |
|---|---|---|
| Bytespider | ByteDance | Training data for TikTok and ByteDance AI products |
| cohere-ai | Cohere | Training data for Cohere's enterprise AI models |
| Diffbot | Diffbot | Web data extraction for knowledge graphs |
| Timpibot | Timpi | Decentralized search index |
| YouBot | You.com | AI search engine |
How to control AI crawler access
Your robots.txt file is the standard mechanism. Add User-agent directives for each crawler you want to allow or block:
- Allow all AI crawlers: Do nothing. The default is open access.
- Block specific crawlers: Add a `User-agent: GPTBot` directive followed by `Disallow: /`.
- Allow specific crawlers: If you have a blanket `Disallow`, add specific `Allow` rules for the bots you want to permit.
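Put together, a minimal robots.txt that blocks two training crawlers while leaving everything else open might look like this (the crawler names come from the list above; adjust to your own policy):

```
# Block OpenAI's training crawler and Common Crawl from the whole site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers keep default (open) access
User-agent: *
Allow: /
```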
Tools like agentmarkup automate this at build time, patching your robots.txt without breaking existing rules and validating for conflicts. See the AI crawlers guide for configuration.
Do AI crawlers respect robots.txt?
Compliance is voluntary, not enforced. That said, the major companies have publicly committed to respecting robots.txt:
- OpenAI: Committed to respecting robots.txt for GPTBot since 2023. Published documentation with opt-out instructions.
- Anthropic: ClaudeBot respects robots.txt. Anthropic published a dedicated page for webmasters.
- Google: Google-Extended is fully controlled through robots.txt, separate from Googlebot.
- Perplexity: PerplexityBot respects robots.txt. Perplexity has faced criticism in the past but has since improved compliance.
Smaller or less-known crawlers may not comply. There is no technical enforcement mechanism for robots.txt. It is a social contract.
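Since compliance is voluntary, the only thing you can verify is how a well-behaved bot would interpret your own file. Python's standard `urllib.robotparser` models that behavior; the robots.txt content below is a made-up example:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks GPTBot but allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler calls can_fetch() before requesting a page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # True
```

This is also a quick way to sanity-check a new robots.txt for conflicts before deploying it: run `can_fetch` for each crawler you care about and confirm the answers match your intent.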
Training data vs real-time search
An important distinction: some crawlers collect data for model training (a one-time or periodic process), while others power real-time AI search (your content appears in live answers).
- Training crawlers: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent. Your content becomes part of the model's knowledge.
- Search crawlers: PerplexityBot, OAI-SearchBot, ChatGPT-User. Your content is fetched and cited in real-time answers.
You might want to block training crawlers (you do not want your content used to train models) while allowing search crawlers (you do want your content cited in AI answers). This selective approach is possible because each crawler uses a different user-agent string.
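That selective policy can be sketched in robots.txt like this (user-agent names are from the list above; tailor the groups to your preferences):

```
# Block training-data crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow real-time AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Note that GPTBot serves double duty in OpenAI's documentation (training and browsing), so blocking it affects both; OAI-SearchBot and ChatGPT-User remain separately controllable.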
The bottom line
AI crawlers are a permanent part of the web. The question is not whether they visit your site but whether you control the terms. A clear robots.txt policy, configured intentionally rather than by accident, is the minimum. Combined with llms.txt and JSON-LD structured data, you can make your site both accessible and understandable to AI systems on your terms.