What is Robots.txt for AI?
Robots.txt for AI refers to using the robots.txt file to specifically control which AI crawlers can access your website content for training and retrieval purposes.
Orbilo Team
Definition
Robots.txt for AI refers to the practice of using a website's robots.txt file to control access by AI-specific crawlers — such as GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and Google-Extended (Google's control token for Gemini training). While robots.txt has been used for decades to manage search engine crawlers, the emergence of AI crawlers has introduced a new layer of decisions about what content AI platforms can access for training and real-time retrieval.
Why robots.txt AI rules matter
Your robots.txt directly determines whether AI platforms can use your content. This impacts:
- AI training — Blocking training crawlers means your content won't be included in future model updates
- Real-time retrieval — Blocking retrieval crawlers means platforms like Perplexity can't cite your content in live responses
- Brand presence — A blanket block on all AI crawlers effectively removes your brand from AI-generated answers
- Content control — Selective rules let you choose which platforms and which content they can access
Common AI robots.txt configurations
Allow all AI crawlers (recommended for AEO):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
Block all AI crawlers:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Hybrid approach (allow retrieval, block training):

```
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```
Known AI crawler user-agents
| User-Agent | Operator | Type |
|------------|----------|------|
| GPTBot | OpenAI | Training |
| ChatGPT-User | OpenAI | Retrieval (browsing) |
| ClaudeBot | Anthropic | Training |
| PerplexityBot | Perplexity | Retrieval |
| Google-Extended | Google | Gemini training |
| Bytespider | ByteDance | Training |
| CCBot | Common Crawl | Open dataset |
| FacebookBot | Meta | Training |
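As a sketch, the table above can be turned into a small server-side check that flags requests from known AI crawlers. The tokens come from the table and are not exhaustive; the example user-agent string is illustrative.

```python
# Known AI crawler tokens (from the table above; not exhaustive).
AI_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "retrieval"),
    "ClaudeBot": ("Anthropic", "training"),
    "PerplexityBot": ("Perplexity", "retrieval"),
    "Google-Extended": ("Google", "training"),
    "Bytespider": ("ByteDance", "training"),
    "CCBot": ("Common Crawl", "dataset"),
    "FacebookBot": ("Meta", "training"),
}

def classify_ai_crawler(user_agent: str):
    """Return (operator, type) if the UA string contains a known token, else None."""
    ua = user_agent.lower()
    for token, info in AI_CRAWLERS.items():
        if token.lower() in ua:
            return info
    return None

# Illustrative UA string in the shape OpenAI documents for GPTBot.
ua = "Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot"
print(classify_ai_crawler(ua))  # ('OpenAI', 'training')
```

Reviewing your access logs with a check like this shows which AI crawlers are actually visiting, which is useful input for the robots.txt decisions above.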
Best practices
- Audit your current robots.txt — Check whether you are unintentionally blocking AI crawlers (or allowing ones you meant to block)
- Align with your AEO strategy — If you want AI visibility, ensure AI crawlers are allowed
- Use llms.txt alongside robots.txt — Even if you allow crawling, an llms.txt file gives AI systems a curated summary of your brand
- Review regularly — New AI crawlers appear frequently; update your rules as the landscape evolves
- Consider per-directory rules — Allow crawling of marketing pages while blocking premium content
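A per-directory setup like the last best practice might look like this; the paths are hypothetical placeholders.

```
# Let OpenAI's training crawler read the blog but not paid content.
# /blog/ and /premium/ are placeholder paths.
User-agent: GPTBot
Allow: /blog/
Disallow: /premium/
```

Under the Robots Exclusion Protocol (RFC 9309), the longest matching rule takes precedence, so anything under `/premium/` stays blocked for GPTBot while the rest of the site remains crawlable.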
Related terms
- AI Crawler — The bots that robots.txt rules control
- LLMs.txt — A complementary file that tells AI what your brand is (while robots.txt controls access)
- Training Data — The content corpus that AI crawlers help build
Tools
- LLMs.txt Generator — Create a brand context file that works alongside robots.txt
- LLMs-ctx Generator — Extended context for AI systems crawling your site