How to manage AI crawlers in your robots.txt
AI companies use web crawlers to collect training data and power AI-generated answers. Your robots.txt file controls which AI bots can access your site. agentmarkup generates or patches your robots.txt with AI-specific directives at build time.
Which AI crawlers exist?
Major AI companies identify their crawlers with specific user-agent strings. agentmarkup supports the following crawlers out of the box:
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data and browsing for ChatGPT |
| ClaudeBot | Anthropic | Training data for Claude |
| PerplexityBot | Perplexity | Real-time web search for AI answers |
| Google-Extended | Google | Training data for Gemini (separate from Google Search) |
| CCBot | Common Crawl | Open web dataset used by many AI models |
You can also add custom crawler names for any bot not in the built-in list.
Configuration
Set each crawler to 'allow' or 'disallow'. Only configure the crawlers you care about; crawlers you leave out of the config are not added to your robots.txt.
```js
agentmarkup({
  site: 'https://example.com',
  name: 'My Website',
  aiCrawlers: {
    GPTBot: 'allow',
    ClaudeBot: 'allow',
    PerplexityBot: 'allow',
    'Google-Extended': 'allow',
    CCBot: 'disallow',
  },
})
```
How it works
agentmarkup uses marker comments to manage its section of your robots.txt. If you already have a robots.txt, the plugin patches it without touching your existing rules. If you do not have one, it creates a new file.
```
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

# BEGIN agentmarkup AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Disallow: /
# END agentmarkup AI crawlers
```
The markers (# BEGIN agentmarkup AI crawlers / # END agentmarkup AI crawlers) allow the plugin to update its rules on every build without duplicating entries or breaking your custom rules.
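Conceptually, the patch is string surgery around those markers. Here is a minimal sketch in JavaScript of how such a marker-based update could work; patchRobotsTxt is an illustrative name, not the plugin's actual API, and the real implementation may differ:

```js
const BEGIN = '# BEGIN agentmarkup AI crawlers';
const END = '# END agentmarkup AI crawlers';

// Replace the managed section if both markers exist,
// otherwise append a new managed section to the file.
function patchRobotsTxt(existing, managedRules) {
  const section = `${BEGIN}\n${managedRules}\n${END}`;
  const start = existing.indexOf(BEGIN);
  const end = existing.indexOf(END);
  if (start !== -1 && end !== -1) {
    // Swap out only the marked region; rules outside it are untouched.
    return existing.slice(0, start) + section + existing.slice(end + END.length);
  }
  return existing.trimEnd() + '\n\n' + section + '\n';
}
```

Because the replacement targets only the marked region, running the build repeatedly leaves a single managed section rather than accumulating duplicates.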
Conflict detection
If your existing robots.txt has a User-agent: * group with Disallow: /, and you configure a crawler to be allowed, agentmarkup warns you about the conflict during build. Under the robots exclusion standard (RFC 9309), a compliant crawler follows only the most specific user-agent group that matches it, so the specific allow should take effect, but not every bot implements this precedence correctly, and the mixed signals are easy to misread.
This validation catches a common mistake: you intend to allow GPTBot, but your existing robots.txt blocks all bots. The warning prompts you to confirm that the wildcard disallow is really what you want.
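For example, a robots.txt like the following triggers the warning: the wildcard group blocks all bots while the managed section allows GPTBot.

```
User-agent: *
Disallow: /

# BEGIN agentmarkup AI crawlers
User-agent: GPTBot
Allow: /
# END agentmarkup AI crawlers
```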
Should you allow or block AI crawlers?
This is a business decision, not a technical one. Consider:
- Allow if you want your content to appear in AI-generated answers, search summaries, and chatbot responses
- Disallow if you do not want your content used for AI model training or AI-powered search results
- Selective access: Allow some crawlers (like PerplexityBot for search) while blocking others (like CCBot for training data), as in the sketch after this list
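A minimal sketch of that selective setup, using the same config shape as above (the site and name values are placeholders):

```js
agentmarkup({
  site: 'https://example.com',
  name: 'My Website',
  aiCrawlers: {
    PerplexityBot: 'allow', // keep real-time AI search answers
    GPTBot: 'disallow',     // opt out of OpenAI training crawls
    ClaudeBot: 'disallow',  // opt out of Anthropic training crawls
    CCBot: 'disallow',      // opt out of the Common Crawl dataset
  },
})
```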
Crawler access is one part of a machine-readable website, not a standalone fix: it works alongside llms.txt, JSON-LD structured data, and markdown mirrors.
Frequently asked questions
Does blocking an AI crawler actually work?
Most major AI companies (OpenAI, Anthropic, Google) have committed to respecting robots.txt directives for their crawlers. Compliance is voluntary but widely honored. Smaller or unknown crawlers may not comply.
What is the difference between GPTBot and ChatGPT-User?
GPTBot crawls pages for training data. ChatGPT-User is used when a ChatGPT user asks the model to browse a specific URL. They are separate user agents with separate purposes. agentmarkup supports both.
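Because the aiCrawlers config accepts any user-agent string (see the next question), you can set the two independently. A sketch, assuming you want to block training crawls but allow user-initiated browsing:

```js
agentmarkup({
  site: 'https://example.com',
  name: 'My Website',
  aiCrawlers: {
    GPTBot: 'disallow',      // block training-data crawling
    'ChatGPT-User': 'allow', // allow user-initiated browsing from ChatGPT
  },
})
```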
Can I add custom crawler names?
Yes. The aiCrawlers config accepts any string as a key, not just the built-in names. This lets you add rules for new or niche crawlers as they appear.
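For example, to disallow a crawler that is not in the built-in list (Bytespider, ByteDance's crawler, is used here purely as an illustration):

```js
agentmarkup({
  site: 'https://example.com',
  name: 'My Website',
  aiCrawlers: {
    GPTBot: 'allow',
    Bytespider: 'disallow', // custom key: any user-agent string works
  },
})
```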