What is Robots.txt for AI?
Robots.txt for AI refers to using the robots.txt file to specifically control which AI crawlers can access your website content for training and retrieval purposes.
Orbilo Team
Definition
Robots.txt for AI refers to the practice of using a website's robots.txt file to control access by AI-specific crawlers — such as GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and Google-Extended (Google's control token for Gemini training). While robots.txt has been used for decades to manage search engine crawlers, the emergence of AI crawlers has introduced a new layer of decisions about what content AI platforms can access for training and real-time retrieval.
Why robots.txt AI rules matter
Your robots.txt directly determines whether AI platforms can use your content. This impacts:
- AI training — Blocking training crawlers means your content won't be included in future model updates
- Real-time retrieval — Blocking retrieval crawlers means platforms like Perplexity can't cite your content in live responses
- Brand presence — A blanket block on all AI crawlers effectively removes your brand from AI-generated answers
- Content control — Selective rules let you choose which platforms and which content they can access
Common AI robots.txt configurations
Allow all AI crawlers (recommended for AEO):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
Block all AI crawlers:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Hybrid approach (allow retrieval, block training):

```
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```
Known AI crawler user-agents
| User-Agent | Operator | Type |
|------------|----------|------|
| GPTBot | OpenAI | Training |
| ChatGPT-User | OpenAI | Retrieval (browsing) |
| ClaudeBot | Anthropic | Training |
| PerplexityBot | Perplexity | Retrieval |
| Google-Extended | Google | Gemini training |
| Bytespider | ByteDance | Training |
| CCBot | Common Crawl | Open dataset |
| FacebookBot | Meta | Training |
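As a sketch, the table above can be turned into a small server-side check that flags requests from known AI crawlers. The tokens come from the table and are not exhaustive; the example user-agent string is illustrative.

```python
# Known AI crawler tokens (from the table above; not exhaustive).
AI_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "retrieval"),
    "ClaudeBot": ("Anthropic", "training"),
    "PerplexityBot": ("Perplexity", "retrieval"),
    "Google-Extended": ("Google", "training"),
    "Bytespider": ("ByteDance", "training"),
    "CCBot": ("Common Crawl", "dataset"),
    "FacebookBot": ("Meta", "training"),
}

def classify_ai_crawler(user_agent: str):
    """Return (operator, type) if the UA string contains a known token, else None."""
    ua = user_agent.lower()
    for token, info in AI_CRAWLERS.items():
        if token.lower() in ua:
            return info
    return None

# Illustrative UA string in the shape OpenAI documents for GPTBot.
ua = "Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot"
print(classify_ai_crawler(ua))  # ('OpenAI', 'training')
```

Reviewing your access logs with a check like this shows which AI crawlers are actually visiting, which is useful input for the robots.txt decisions above.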
Best practices
- Audit your current robots.txt — Check whether you are unintentionally blocking AI crawlers (or allowing ones you meant to block)
- Align with your AEO strategy — If you want AI visibility, ensure AI crawlers are allowed
- Use llms.txt alongside robots.txt — Even if you allow crawling, an llms.txt file gives AI systems a curated summary of your brand
- Review regularly — New AI crawlers appear frequently; update your rules as the landscape evolves
- Consider per-directory rules — Allow crawling of marketing pages while blocking premium content
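A per-directory setup like the last best practice might look like this; the paths are hypothetical placeholders.

```
# Let OpenAI's training crawler read the blog but not paid content.
# /blog/ and /premium/ are placeholder paths.
User-agent: GPTBot
Allow: /blog/
Disallow: /premium/
```

Under the Robots Exclusion Protocol (RFC 9309), the longest matching rule takes precedence, so anything under `/premium/` stays blocked for GPTBot while the rest of the site remains crawlable.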
Related terms
- AI Crawler — The bots that robots.txt rules control
- LLMs.txt — A complementary file that tells AI what your brand is (while robots.txt controls access)
- Training Data — The content corpus that AI crawlers help build
Tools
- LLMs.txt Generator — Create a brand context file that works alongside robots.txt
- LLMs-ctx Generator — Extended context for AI systems crawling your site