AI Bots Directory

Know which AI bots reach your site

AI crawlers decide whether ChatGPT, Claude, Perplexity, and Gemini can read, cite, and recommend your content. This directory catalogues the bots that matter — who runs them, what they do, and whether they respect robots.txt.

33 bots from 22 operators · Last updated June 2026

33 bots tracked · 6 bot categories · 22 operators

The AI bot directory

GPTBot

AI Crawler

OpenAI

Crawls public web content to train OpenAI's foundation models. Content owners can block it in robots.txt to opt out of training.

Respects robots.txt GPTBot Docs →

OAI-SearchBot

AI Search

OpenAI

Indexes pages to surface and link them inside ChatGPT search results. Blocking it removes you from ChatGPT search, not training.

Respects robots.txt OAI-SearchBot Docs →

ChatGPT-User

AI Assistant

OpenAI

Fetches a specific page when a ChatGPT user (or a Custom GPT) requests it. Triggered by user actions, not bulk crawling.

Respects robots.txt ChatGPT-User Docs →

ClaudeBot

AI Crawler

Anthropic

Anthropic's primary crawler for collecting web data used to train Claude. Honours robots.txt and crawl-delay.

Respects robots.txt ClaudeBot Docs →

Claude-User

AI Assistant

Anthropic

Retrieves a page in real time when a Claude user asks a question that requires fetching live content.

Respects robots.txt Claude-User Docs →

Claude-SearchBot

AI Search

Anthropic

Indexes content so Claude can cite and link to it when answering with web search enabled.

Respects robots.txt Claude-SearchBot Docs →

anthropic-ai

AI Crawler

Anthropic

Legacy Anthropic user-agent still referenced in many robots.txt files. ClaudeBot is the current crawler.

Respects robots.txt anthropic-ai Docs →

Googlebot

Search Engine

Google

Google's core search crawler. Its index also powers AI Overviews and grounding for Gemini in Search.

Respects robots.txt Googlebot Docs →

Google-Extended

Opt-Out Token

Google

A robots.txt token, not a crawler. Disallowing it opts your site out of training Gemini models, including those offered through Vertex AI, while keeping Google Search indexing intact.

Opt-out token Google-Extended Docs →

GoogleOther

AI Crawler

Google

A generic crawler used by internal Google teams for research and development, including AI data collection.

Respects robots.txt GoogleOther Docs →

Bingbot

Search Engine

Microsoft

Microsoft's search crawler. The Bing index grounds Copilot and other Microsoft AI answer experiences.

Respects robots.txt bingbot Docs →

Applebot

Search Engine

Apple

Powers Siri and Spotlight Suggestions. Its crawl also feeds Apple Intelligence features.

Respects robots.txt Applebot Docs →

Applebot-Extended

Opt-Out Token

Apple

A robots.txt token. Disallowing it opts your content out of training Apple's generative models while keeping Siri/Spotlight indexing.

Opt-out token Applebot-Extended Docs →

PerplexityBot

AI Search

Perplexity

Indexes pages so Perplexity can surface and cite them in answers. Documented to honour robots.txt.

Respects robots.txt PerplexityBot Docs →

Perplexity-User

AI Assistant

Perplexity

Fetches a page when a user action requires it. Perplexity states these user-initiated fetches may not follow robots.txt.

Partial Perplexity-User Docs →

Meta-ExternalAgent

AI Crawler

Meta

Meta's crawler for collecting training data for Llama and other Meta AI products.

Respects robots.txt meta-externalagent Docs →

Meta-ExternalFetcher

AI Assistant

Meta

Fetches specific links to support Meta AI assistant features when a user invokes them.

Partial meta-externalfetcher Docs →

FacebookBot

AI Crawler

Meta

Crawls pages to improve language models that power Meta products such as speech recognition.

Respects robots.txt FacebookBot Docs →

Amazonbot

AI Assistant

Amazon

Crawls the web to answer questions through Alexa and to support Amazon's AI services.

Respects robots.txt Amazonbot Docs →

DeepSeekBot

AI Crawler

DeepSeek

Crawler associated with DeepSeek for gathering web data to support its AI models. Its user-agent and robots.txt behaviour are not yet formally documented.

Undocumented DeepSeekBot

Bytespider

AI Crawler

ByteDance

ByteDance's crawler used to gather training data for its AI models. Widely reported to ignore robots.txt and crawl aggressively.

Ignores robots.txt Bytespider

CCBot

AI Crawler

Common Crawl

Builds the open Common Crawl dataset that is a major source of training data for many large language models.

Respects robots.txt CCBot Docs →

cohere-training-data-crawler

AI Crawler

Cohere

Collects web data used to train Cohere's enterprise language models.

Respects robots.txt cohere-training-data-crawler

MistralAI-User

AI Assistant

Mistral AI

Retrieves pages on demand to support Le Chat and other Mistral assistant features.

Respects robots.txt MistralAI-User Docs →

DuckAssistBot

AI Assistant

DuckDuckGo

Supports DuckDuckGo's DuckAssist AI answers by fetching relevant content.

Respects robots.txt DuckAssistBot Docs →

YouBot

AI Search

You.com

Indexes content for the You.com AI search engine and assistant.

Respects robots.txt YouBot Docs →

AI2Bot

AI Crawler

Allen Institute for AI

Crawls open web content for research datasets used to train open language models such as OLMo.

Respects robots.txt AI2Bot Docs →

Diffbot

AI Crawler

Diffbot

Crawls and structures web pages into a knowledge graph that powers AI and data products.

Respects robots.txt Diffbot Docs →

PetalBot

Search Engine

Huawei

Crawler for Huawei's Petal Search and AI assistant features.

Respects robots.txt PetalBot Docs →

ImagesiftBot

AI Crawler

ImageSift (Hive)

Crawls images across the web to power visual search and AI training datasets.

Respects robots.txt ImagesiftBot Docs →

AhrefsBot

SEO Tool

Ahrefs

Crawls the web to build Ahrefs' backlink and SEO index. Often allowed for SEO and answer engine optimization (AEO) analysis.

Respects robots.txt AhrefsBot Docs →

SemrushBot

SEO Tool

Semrush

Powers Semrush's backlink, keyword, and site-audit datasets.

Respects robots.txt SemrushBot Docs →

DataForSeoBot

SEO Tool

DataForSEO

Collects SERP and web data resold to many SEO and AI-visibility tools.

Respects robots.txt DataForSeoBot Docs →

Bot details are compiled from each operator's published documentation and may change as operators update their crawlers.

Blocking the wrong bot can make you invisible to AI

There are three kinds of AI bot, and they call for three different decisions.

Training crawlers

GPTBot and ClaudeBot gather data to train models, and Google-Extended is the robots.txt token that controls the same use of Google's crawl. Block them if you don't want your content used for training, but know that on some platforms this also reduces your odds of being cited later.

Search indexers

OAI-SearchBot, PerplexityBot, and Claude-SearchBot decide whether an AI answer engine can cite and link to you. You almost always want these allowed — they're how you earn AI visibility.

Live fetchers

ChatGPT-User and Claude-User fetch a page the moment a user asks about it. Blocking them stops AI tools from reading your page on a user's behalf — usually a missed opportunity.
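One way to turn these three decisions into a robots.txt file is sketched below. The bot groupings come from the paragraphs above; the `POLICY` dictionary, the `build_robots_txt` helper, and the choice to block training crawlers are illustrative assumptions, not a recommendation for every site.

```python
# User-agent tokens grouped by the three kinds of AI bot described above.
# Google-Extended is a robots.txt token rather than a crawler, but it is
# addressed in robots.txt the same way as a crawler's user-agent.
BOT_GROUPS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "search": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"],
    "live_fetch": ["ChatGPT-User", "Claude-User"],
}

# Hypothetical policy: one decision per kind (True = allow, False = block).
POLICY = {"training": False, "search": True, "live_fetch": True}


def build_robots_txt(groups: dict[str, list[str]], policy: dict[str, bool]) -> str:
    """Emit one robots.txt group per bot, allowing or disallowing the whole site."""
    blocks = []
    for kind, bots in groups.items():
        rule = "Allow: /" if policy[kind] else "Disallow: /"
        for bot in bots:
            blocks.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(BOT_GROUPS, POLICY))
```

Keeping the decision per kind rather than per bot means a change of mind about training crawlers is a one-line edit to `POLICY`.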

The takeaway: the goal of AEO isn't to block AI bots — it's to make sure the right ones can read you, and that your content is structured to get cited when they do. Orbilo shows you exactly which AI crawlers hit your site and how often your brand surfaces in their answers.

Frequently asked questions

What are AI bots?

AI bots are automated agents that visit websites on behalf of AI companies. Some crawl content to train models (like GPTBot and ClaudeBot), some index pages so an AI answer engine can cite them (like OAI-SearchBot and PerplexityBot), and some fetch a page live when a user asks a question (like ChatGPT-User and Claude-User).

How do I block an AI bot?

Add a rule for its user-agent to your robots.txt. For example, a "User-agent: GPTBot" line followed by "Disallow: /" blocks OpenAI's training crawler from the whole site. Each bot card above lists the exact user-agent string to use.
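As a quick check, the rule described above can be tested with Python's standard-library `urllib.robotparser` before you deploy it. This is a minimal sketch: the `ROBOTS_TXT` contents, the `is_allowed` helper, and the `example.com` URL are placeholders, and the parser only tells you what a compliant bot should do, not what any given bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt: block OpenAI's training crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant bot with this user-agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/pricing"  # hypothetical URL
    for bot in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

With these placeholder rules only GPTBot is blocked; OAI-SearchBot and ChatGPT-User fall through to the wildcard group and stay allowed.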

Can blocking AI bots hurt my AI visibility?

It can. On several platforms the same crawler that gathers training data also feeds the AI search index. Block it and you may disappear from that platform's cited answers. We recommend allowing search indexers and live fetchers, and only blocking pure training crawlers if you have a specific reason.

Do AI bots respect robots.txt?

OpenAI, Anthropic, Google, and Perplexity all document that their main crawlers honour robots.txt. A few bots, notably ByteDance's Bytespider, have been reported to ignore it. User-initiated fetches sometimes bypass robots.txt because they act on a direct request rather than bulk crawling.

How is Orbilo different from checking server logs?

Server logs tell you a bot visited. Orbilo connects that crawl activity to outcomes (whether your brand is then mentioned and cited in ChatGPT, Claude, Perplexity, Gemini, and Grok answers) so you can see which crawls actually turn into AI visibility.
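For the server-log half of that picture, a minimal sketch of tallying AI bot visits from access-log lines follows. It assumes each log line contains the raw User-Agent string. The `count_bot_hits` helper, the short `AI_BOT_TOKENS` list, and the sample lines (including their version numbers) are made up for illustration, and user-agent strings can be spoofed, so treat the counts as approximate.

```python
from collections import Counter

# A few user-agent tokens from the directory above; extend as needed.
AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User",
                 "ClaudeBot", "PerplexityBot", "Bytespider")


def count_bot_hits(log_lines, tokens=AI_BOT_TOKENS):
    """Count log lines that mention a known bot token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to at most one bot
    return hits


# Made-up sample lines in a combined-log-like layout, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [01/Jun/2026:10:00:03 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:00:09 +0000] "GET /pricing HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.8 - - [01/Jun/2026:10:00:12 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for bot, n in count_bot_hits(SAMPLE_LOG).most_common():
        print(f"{bot}: {n}")
```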

See which AI bots are reading you — and what they say about your brand

Orbilo tracks AI crawler activity on your site and monitors how ChatGPT, Claude, Perplexity, Gemini, and Grok mention your brand — so you can turn crawls into citations.