# Every AI crawler indexing your website in 2026 - agentmarkup

> Complete list of AI crawlers in 2026: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, and more. What each does, who runs it, and how to control access via robots.txt.

Source: https://agentmarkup.dev/blog/ai-crawlers-2026/

By [Sebastian Cochinescu](/authors/sebastian-cochinescu/) · March 20, 2026 · 8 min read


AI companies use web crawlers to collect training data and power real-time AI search. Here is a complete list of every known AI crawler, what company runs it, what it does, and how to control access through your robots.txt.

## What are AI crawlers?

AI crawlers are automated bots that visit websites to collect content. Some crawlers gather training data for language models. Others power real-time search features where AI generates answers from live web content. They identify themselves through user-agent strings in their HTTP requests.

Unlike traditional search engine crawlers (Googlebot, Bingbot) that build search indexes, AI crawlers serve a different purpose: feeding content into AI models. This distinction matters because you might want Google to index your site for search results while blocking your content from being used as AI training data.

## The complete list

### OpenAI

| Crawler | Purpose |
| --- | --- |
| GPTBot | Collects training data for GPT models. Also powers ChatGPT's browsing feature when searching the web. |
| ChatGPT-User | Used when a ChatGPT user explicitly asks the model to visit and read a specific URL. This is browsing on demand, not bulk crawling. |
| OAI-SearchBot | Powers ChatGPT Search (formerly SearchGPT). Crawls pages to generate real-time search answers. |

### Anthropic

| Crawler | Purpose |
| --- | --- |
| ClaudeBot | Collects training data for Claude models. Anthropic has committed to respecting robots.txt directives. |
| anthropic-ai | Older user-agent string used by Anthropic. Some sites still reference it in robots.txt. |

### Google

| Crawler | Purpose |
| --- | --- |
| Google-Extended | Collects data for Gemini and other AI products. Separate from Googlebot, so blocking it does not affect your Google Search rankings. |

### Perplexity

| Crawler | Purpose |
| --- | --- |
| PerplexityBot | Powers Perplexity's AI search engine. Crawls pages to generate real-time answers with source citations. |

### Amazon

| Crawler | Purpose |
| --- | --- |
| Amazonbot | Collects data for Alexa and Amazon's AI services. Respects robots.txt. |

### Common Crawl

| Crawler | Purpose |
| --- | --- |
| CCBot | Builds the Common Crawl open dataset, which is used as training data by many AI companies, including those building open-source models. Blocking CCBot is a broad way to reduce training data exposure. |

### Apple

| Crawler | Purpose |
| --- | --- |
| Applebot-Extended | Collects data for Apple Intelligence features. Separate from the main Applebot used for Siri and Spotlight search. |

### Meta

| Crawler | Purpose |
| --- | --- |
| Meta-ExternalAgent | Collects data for Meta AI products. Respects robots.txt since mid-2024. |
| FacebookBot | Primarily renders link previews for Facebook and Instagram. Not used for AI training. |

### Other notable crawlers

| Crawler | Company | Purpose |
| --- | --- | --- |
| Bytespider | ByteDance | Training data for TikTok and ByteDance AI products |
| cohere-ai | Cohere | Training data for Cohere's enterprise AI models |
| Diffbot | Diffbot | Web data extraction for knowledge graphs |
| Timpibot | Timpi | Decentralized search index |
| YouBot | You.com | AI search engine |

## How to control AI crawler access

Your `robots.txt` file is the standard mechanism. Add `User-agent` directives for each crawler you want to allow or block:

- **Allow all AI crawlers:** Do nothing. The default is open access.
- **Block specific crawlers:** Add a `User-agent: GPTBot` group with `Disallow: /`.
- **Allow specific crawlers:** If you have a blanket `Disallow`, add explicit `Allow` rules for the bots you want to permit.
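Put together, a minimal robots.txt that blocks a single training crawler while leaving everything else open looks like this (a sketch; extend the blocked list to match your own policy):

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Everything else keeps full access
User-agent: *
Allow: /
```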

Tools like [agentmarkup](https://github.com/agentmarkup/agentmarkup) automate this at build time, patching your robots.txt without breaking existing rules and validating for conflicts. See the [AI crawlers guide](/docs/ai-crawlers/) for configuration.

## Do AI crawlers respect robots.txt?

Compliance is voluntary, not enforced. That said, the major companies have publicly committed to respecting robots.txt:

- **OpenAI:** Committed to respecting robots.txt for GPTBot since 2023. Published documentation with opt-out instructions.
- **Anthropic:** ClaudeBot respects robots.txt. Anthropic published a dedicated page for webmasters.
- **Google:** Google-Extended is fully controlled through robots.txt, separate from Googlebot.
- **Perplexity:** PerplexityBot respects robots.txt. Perplexity has faced criticism in the past but has since improved compliance.

Smaller or less-known crawlers may not comply. There is no technical enforcement mechanism for robots.txt. It is a social contract.

## Training data vs real-time search

An important distinction: some crawlers collect data for model training (a one-time or periodic process), while others power real-time AI search (your content appears in live answers).

- **Training crawlers:** GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent. Your content becomes part of the model's knowledge.
- **Search crawlers:** PerplexityBot, OAI-SearchBot, ChatGPT-User. Your content is fetched and cited in real-time answers.

You might want to block training crawlers (you do not want your content used to train models) while allowing search crawlers (you do want your content cited in AI answers). This selective approach is possible because each crawler uses a different user-agent string.
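A selective policy along those lines might look like this in robots.txt (a sketch, not a recommendation; adapt the bot list to your needs):

```
# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow real-time AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```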

## The bottom line

AI crawlers are a permanent part of the web. The question is not whether they visit your site but whether you control the terms. A clear robots.txt policy, configured intentionally rather than by accident, is the minimum. Combined with [llms.txt](/docs/llms-txt/) and [JSON-LD structured data](/docs/json-ld/), you can make your site both accessible and understandable to AI systems on your terms.

## Make your website machine-readable

agentmarkup is an open-source build-time toolkit for Vite, Astro, and Next.js. It generates llms.txt, injects JSON-LD structured data, creates optional markdown mirrors from final HTML when raw pages need a cleaner agent-facing fetch path, manages AI crawler robots.txt rules, patches optional Content-Signal and canonical mirror headers, and validates everything at build time, all with zero runtime cost.

```shell
pnpm add -D @agentmarkup/vite # or @agentmarkup/astro or @agentmarkup/next
```

Written by

[Sebastian Cochinescu](/authors/sebastian-cochinescu/) · Developer of agentmarkup

Builder of developer tools for machine-readable websites. Developer of agentmarkup. Founder of Anima Felix.

## More from the blog

### How to add llms.txt, JSON-LD, and AI crawler controls to Next.js

Use @agentmarkup/next to generate llms.txt, inject JSON-LD, manage AI crawler rules, and understand the dynamic SSR boundary in Next.js.

 March 23, 2026 · 8 min read

### When markdown mirrors help, and when they do not

A practical guide to when generated markdown mirrors add signal, when HTML is already enough, and how to avoid unnecessary downsides.

 March 20, 2026 · 7 min read

### Is your website ready for AI? Free LLM discoverability checker

Audit your website for llms.txt, JSON-LD, robots.txt, markdown mirrors, and sitemap. Free tool for e-commerce and brand websites.

 March 20, 2026 · 8 min read

### Build-time markdown mirrors for agent readability: Cloudflare comparison

Build-time markdown generation for AI readability, including when it helps and how it compares to Cloudflare runtime extraction.

 March 20, 2026 · 7 min read

### How to make your brand appear in AI conversations

Organization schema, llms.txt, and FAQ markup make your brand visible in ChatGPT, Claude, and Perplexity answers.

 March 20, 2026 · 7 min read

### Why LLM-optimized e-commerce websites sell more

Product JSON-LD, llms.txt, and AI crawler access make your store visible in AI product recommendations.

 March 20, 2026 · 8 min read

### JSON-LD structured data: the complete guide for web developers

Schema types, JSON-LD vs microdata, common mistakes, and build-time validation.

 March 20, 2026 · 10 min read

### What is GEO? Generative Engine Optimization explained for developers

What is real, what is hype, and what you can do today to make your site citeable by AI.

 March 20, 2026 · 7 min read

### Why llms.txt matters: making your website discoverable by AI

LLMs answer questions by synthesizing web content. llms.txt gives them a structured overview of your site.

 March 20, 2026 · 6 min read
