The Complete Guide to AI Crawlers and User Agents (February 2026)

A practical reference for every webmaster, SEO, and developer who wants to know which AI crawlers are reading their site, what they want, and how to control them.

The short version

AI crawlers identify themselves through a User-Agent string in HTTP requests. When you spot GPTBot, ClaudeBot, PerplexityBot (or any of the newer 2026 entries below) in your server logs, a crawler working for an AI product is indexing, scraping, or quoting your page.

Keep your robots.txt current and you'll guide how language models interact with your work. Ignore it and every crawler falls back to the default, which is to fetch whatever it can reach.

This guide reflects February 2026 reality. Three things changed recently that older lists (including our own past notes) didn't yet capture:

Anthropic split their crawler into three earlier this month — ClaudeBot, Claude-User, Claude-SearchBot, plus a new Claude-Code. The old anthropic-ai and claude-web are deprecated but still appear in logs.
Google-Extended and Applebot-Extended are not user agents. They're robots.txt opt-out tokens. The actual HTTP request still comes through as Googlebot or Applebot. Many guides got this wrong — including ours, before this update.
Agentic browsers don't have user agents. ChatGPT Atlas, OpenAI Operator, Claude for Chrome — they all use standard Chrome strings. robots.txt cannot block them. This is the fastest-growing category of AI traffic in 2026, and it requires a completely different defense strategy.

Quick definitions

AI crawler: A bot that fetches public web pages so a language model can train on them, retrieve them, or quote them in real time.
User-Agent (UA) string: A short text identifier sent in the User-Agent header of each HTTP request. The crawler's name inside it (for example GPTBot) is what you reference on the User-agent: lines of robots.txt.
robots.txt: A plain-text file at the root of your site (example.com/robots.txt) that tells crawlers what they may fetch. It holds one block for each User-agent: you want to allow or block.
robots.txt token: A string in robots.txt that controls a vendor's behavior but isn't actually a UA. Google-Extended is the canonical example. The vendor still crawls with Googlebot, but checks whether you've opted out via the token.
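
To see how the short crawler name relates to the full header, here is a minimal Python sketch that pulls a known crawler name out of a User-Agent string. The names `KNOWN_TOKENS` and `find_crawler_token` are hypothetical, the list is only a subset of the catalog below, and the sample headers in the usage note are illustrative, so check real values against your own logs.

```python
import re
from typing import Optional

# Illustrative subset of the crawler names catalogued in this guide.
KNOWN_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
                "ClaudeBot", "Claude-User", "Claude-SearchBot",
                "PerplexityBot", "Perplexity-User"]

# Match a name only when it is not embedded in a longer word, so that
# "ClaudeBot" is not found inside something like "NotClaudeBotX".
_TOKEN_RE = re.compile(
    r"(?<![A-Za-z0-9-])("
    + "|".join(re.escape(token) for token in KNOWN_TOKENS)
    + r")(?![A-Za-z0-9-])",
    re.IGNORECASE,
)


def find_crawler_token(user_agent: str) -> Optional[str]:
    """Return the canonical crawler name found in a User-Agent header, if any."""
    match = _TOKEN_RE.search(user_agent)
    if match is None:
        return None
    # Map the matched text back to its canonical spelling.
    lowered = match.group(1).lower()
    return next(token for token in KNOWN_TOKENS if token.lower() == lowered)
```

Called with a header such as `Mozilla/5.0 (compatible; GPTBot/1.2)`, the function returns `GPTBot`; a plain Chrome header returns `None`.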

The four kinds of AI crawler

Before reading the vendor catalog below, it helps to know that AI crawlers fall into four functional categories — and the right robots.txt decision depends on the category, not the vendor. The Kind column in every table that follows maps each User-Agent to one of these.

1. Training — fetches your pages into a corpus that trains the next model. Gives nothing back. Block these if you don't want your content baked into the next generation of GPT, Claude, Gemini, or LLaMA. Examples: GPTBot, ClaudeBot, Meta-ExternalAgent, CCBot, cohere-ai.

2. Retrieval — indexes your pages so an AI product can cite them in real-time answers. Allow these and you appear in ChatGPT search, Perplexity, Claude search, Gemini, Apple Intelligence. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot, Applebot, YouBot.

3. User-fetch — fires when a real person asks an AI to look at a specific URL ("summarize this page," "click this citation"). Effectively a human-by-proxy; usually arrives with a referer and brings traffic with it. Examples: ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User.

4. Agentic — autonomous task-completers driving a real Chrome to research, shop, or transact across multiple sites on a user's behalf. Most have no distinguishing UA at all; robots.txt cannot block them and detection requires a different toolset (covered in The agentic browser problem below). Examples: ChatGPT Atlas, OpenAI Operator, Claude for Chrome, GoogleAgent-Mariner (the only one with a documented UA).

A fifth bucket, Multi-purpose, applies to a small number of crawlers whose vendor uses them for several of the above at once — FacebookBot, Amazonbot, Bytespider, Diffbot. These are flagged explicitly in the catalog.

The catalog below is organized by vendor for easy lookup, but each entry's Kind column maps it back to one of these five buckets — so you can scan for "all training crawlers" or "all retrieval crawlers" across vendors in a single pass.
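
If you keep the catalog in code, the "single pass" scan is a simple grouping. The sketch below is one way to do it in Python; `CRAWLER_KINDS` and `group_by_kind` are hypothetical names, and the entries are only the examples named in the category descriptions above, not the full catalog.

```python
from collections import defaultdict

# Kind assignments taken from the examples listed for each category above.
CRAWLER_KINDS = {
    "GPTBot": "Training",
    "ClaudeBot": "Training",
    "Meta-ExternalAgent": "Training",
    "CCBot": "Training",
    "cohere-ai": "Training",
    "OAI-SearchBot": "Retrieval",
    "Claude-SearchBot": "Retrieval",
    "PerplexityBot": "Retrieval",
    "Applebot": "Retrieval",
    "YouBot": "Retrieval",
    "ChatGPT-User": "User-fetch",
    "Claude-User": "User-fetch",
    "Perplexity-User": "User-fetch",
    "MistralAI-User": "User-fetch",
    "GoogleAgent-Mariner": "Agentic",
    "FacebookBot": "Multi-purpose",
    "Amazonbot": "Multi-purpose",
    "Bytespider": "Multi-purpose",
    "Diffbot": "Multi-purpose",
}


def group_by_kind(kinds):
    """Invert the token -> kind mapping into kind -> sorted list of tokens."""
    grouped = defaultdict(list)
    for token, kind in kinds.items():
        grouped[kind].append(token)
    return {kind: sorted(tokens) for kind, tokens in grouped.items()}
```

`group_by_kind(CRAWLER_KINDS)["Training"]` then gives you every training crawler across vendors, ready to turn into Disallow blocks.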

Why you should care

Server logs show that AI bots now account for a measurable and growing share of the traffic reading your content. Knowing which agents the major AI products use lets you encourage or discourage that traffic deliberately.

Allowing helpful crawlers gets you cited and linked in AI search products — Perplexity, ChatGPT search, Gemini, Claude.
Blocking abusive scrapers stops your content from being trained into models you don't want to support.
Knowing what's possible matters: some crawlers can't be blocked at all (agentic browsers), and others ignore robots.txt entirely. Strategy beats paranoia.

Most AI crawlers respect robots.txt by default. But the space changes monthly — vendor docs update, new vendors emerge, deprecated crawlers linger. A list older than three months is already stale.

Complete AI crawler list (February 2026)

This list reflects the state as of February 2026. Categories and behaviors are based on each vendor's official documentation, cross-referenced against the ai-robots-txt GitHub repository.

OpenAI

User-Agent	Kind	Purpose	robots.txt action	IP range published
`GPTBot`	Training	Trains OpenAI's models. Block this if you don't want your content in GPT-5 or beyond.	`Disallow: /` to opt out of training	openai.com/gptbot.json
`OAI-SearchBot`	Retrieval	Powers ChatGPT search and citations. Allow this so your pages appear in ChatGPT search results.	`Allow: /`	openai.com/searchbot.json
`ChatGPT-User`	User-fetch	Fetches pages on demand when a ChatGPT user clicks a citation or asks a Custom GPT to read a URL. Treated as user-initiated, so it may not respect robots.txt.	`Allow: /` (limited effect)	openai.com/chatgpt-user.json

Anthropic

Anthropic split their crawler architecture earlier this month, replacing the older anthropic-ai and claude-web crawlers with three specialized bots. Both old UAs are deprecated but still appear in logs from older clients.

User-Agent	Kind	Purpose	robots.txt action
`ClaudeBot`	Training	Trains Claude models. The new training crawler, replacing `anthropic-ai`.	`Disallow: /` to opt out of training
`Claude-User`	User-fetch	Fetches pages when a user asks Claude to read a URL during chat.	`Allow: /`
`Claude-SearchBot`	Retrieval	Powers Claude's search and citation features.	`Allow: /`
`Claude-Code`	Agentic	Terminal-based agentic operations from Claude Code (developer tool). New this month.	`Allow: /` for public docs
`anthropic-ai` (legacy)	Training	Old training crawler, deprecated this month. Still seen in logs from older clients.	Same as `ClaudeBot`
`claude-web` (legacy)	Multi-purpose	Old general crawler, deprecated this month.	Same as `ClaudeBot` and `Claude-User`

Perplexity

User-Agent	Kind	Purpose	robots.txt action
`PerplexityBot`	Retrieval	Indexes the web for Perplexity's answer engine.	`Allow: /`
`Perplexity-User`	User-fetch	Fetches pages when users click citations. Has been reported to ignore robots.txt in some cases — escalate to firewall rules if you need hard blocking.	`Allow: /` (with caveats)

Google

Google uses several specialized agents in addition to the main Googlebot. Note that Google-Extended is not a user agent; see "What's NOT a User-Agent: robots.txt opt-out tokens" below.

User-Agent	Kind	Purpose	robots.txt action
`Google-Agent`	Agentic	Generic agentic operations.	`Allow: /`
`Google-NotebookLM`	User-fetch	Fetches pages for NotebookLM's research feature.	`Allow: /`
`Google-Read-Aloud`	User-fetch	Audio rendering of pages.	`Allow: /`
`Google-CloudVertexBot`	Retrieval	Vertex AI enterprise web crawling.	`Allow: /`
`GoogleAgent-Mariner`	Agentic	Project Mariner, Google's agentic browser. Available to AI Ultra subscribers ($249.99/month) since 2026. Runs on cloud VMs.	`Allow: /`

Apple

User-Agent	Kind	Purpose	robots.txt action
`Applebot`	Retrieval	Indexes content for Siri, Spotlight, and Safari suggestions.	`Allow: /`

Meta

User-Agent	Kind	Purpose	robots.txt action
`Meta-ExternalAgent`	Training	Trains Meta's AI models (LLaMA family).	`Disallow: /` to opt out
`Meta-ExternalFetcher`	User-fetch	User-triggered fetches from Meta platforms.	`Allow: /`
`FacebookBot`	Multi-purpose	Generates link previews for Facebook and Instagram, plus AI features.	`Allow: /`

Other notable AI crawlers

User-Agent	Kind	Vendor	Purpose	robots.txt action
`CCBot`	Training	Common Crawl	Open dataset used by many AI projects.	`Allow: /` (or `Disallow` if opting out of upstream training)
`Bytespider`	Multi-purpose	ByteDance	Powers TikTok search and ByteDance AI features. Historically known to ignore robots.txt.	`Disallow: /` (escalate to firewall if abuse continues)
`Amazonbot`	Multi-purpose	Amazon	Alexa, Fire OS AI features, product recommendations.	`Allow: /`
`DuckAssistBot`	Retrieval	DuckDuckGo	Powers DuckAssist AI answers.	`Allow: /`
`MistralAI-User`	User-fetch	Mistral	Fetches citations for Le Chat.	`Allow: /`
`cohere-ai`	Training	Cohere	Training data for Cohere models.	`Disallow: /` to opt out
`Diffbot`	Multi-purpose	Diffbot	Structured data extraction for ML pipelines.	`Allow: /`
`AI2Bot`	Retrieval	Allen Institute	Academic crawler for Semantic Scholar and AI research.	`Allow: /`
`YouBot`	Retrieval	You.com	Powers You.com's AI search and browser assistant.	`Allow: /`

What's NOT a User-Agent: robots.txt opt-out tokens

This is the most-misunderstood corner of the AI crawler space. Google-Extended and Applebot-Extended are not user agents. Many guides — including ours, before February 2026 — listed them in the UA table. They're not.

Here's what they actually are:

Google-Extended — A robots.txt token that controls whether Googlebot's crawled content is used for Gemini training. The actual HTTP request still comes from Googlebot. Adding User-agent: Google-Extended / Disallow: / to your robots.txt tells Google "you can index me for search, but don't train Gemini on me."
Applebot-Extended — Same pattern. Controls whether Applebot's crawled content is used for Apple AI training. The HTTP request still comes from Applebot.

You will never see Google-Extended or Applebot-Extended in your server logs as a User-Agent. If a guide tells you to "block Google-Extended," what they mean is "add this token to robots.txt to opt out of Gemini training."

robots.txt example for opting out of training while keeping search:

# Allow Google search indexing
User-agent: Googlebot
Allow: /

# Opt out of Gemini training
User-agent: Google-Extended
Disallow: /

# Allow Apple search
User-agent: Applebot
Allow: /

# Opt out of Apple AI training
User-agent: Applebot-Extended
Disallow: /
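
A quick way to confirm that a file like the one above really opts out of training while leaving search alone is to parse it into per-token groups. The Python sketch below is a simplified reading of robots.txt, not a full implementation of the protocol: `parse_groups` and `is_opted_out` are hypothetical names, tokens are compared by exact case-insensitive match, and wildcard groups and Allow overrides are deliberately ignored.

```python
def parse_groups(robots_txt):
    """Map each lowercase User-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current_agents:
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_opted_out(groups, token):
    """True if the token's own group contains a blanket 'Disallow: /'."""
    return ("disallow", "/") in groups.get(token.lower(), [])
```

With the file above, `is_opted_out(parse_groups(text), "Google-Extended")` is true while the same check for `Googlebot` is false, which is exactly the split between training and search you are after.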

The agentic browser problem

This is the elephant in the room of 2026 AI traffic.

Agentic browsers don't have identifying user agents. ChatGPT Atlas uses a standard Chrome UA string. OpenAI Operator runs in a remote Chrome-like browser with no published UA. Claude for Chrome (the Anthropic browser extension for Max subscribers) uses real Chrome. xAI's Grok crawler reportedly switches to iPhone UAs to avoid blocks.

In your server logs, all of this looks like a regular Chrome user. robots.txt cannot block them. UA-based firewall rules don't apply.

Agent	UA visibility	Can robots.txt block?
ChatGPT Atlas (OpenAI browser)	Standard Chrome — indistinguishable	No
OpenAI Operator (ChatGPT agent mode)	No known UA, runs in remote browser	No
Claude for Chrome (Anthropic, Max subscription)	Real Chrome	No
Anthropic Computer Use (API, Claude 3.5+)	Varies (often headless Chrome)	No
xAI Grok	Documented UAs (`GrokBot/1.0`, `xAI-Grok/1.0`) rarely seen; iPhone UA reported instead	Unreliable
Google Project Mariner	`GoogleAgent-Mariner` (the exception)	Yes

This category is the fastest-growing AI traffic type in 2026. Users invoke these agents to research, shop, summarize, or transact on their behalf — and the agents arrive at your site looking like a normal browser visit. Google Analytics counts them as humans. Shopify counts them as Shopify-channel sessions if they convert. You have no idea what share of your "human" traffic is actually agentic.

If you need to detect or control this category, robots.txt is the wrong tool. You'll need:

IP-range allowlists/blocklists — OpenAI, Anthropic, and xAI publish IP ranges for some of their agents. Cloudflare bot management can match against these.
Behavioral fingerprinting — request timing patterns, JavaScript execution signatures, session correlation. This requires either a CDN with bot management features or a tracking pixel that captures these signals.
Account-level access controls — for content behind login, use rate limits and abuse detection at the application layer.
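
For the IP-range approach, the core check is small. The Python sketch below assumes a vendor publishes its ranges as JSON with a `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix` keys; that layout is an assumption, so inspect the real file before relying on it. `SAMPLE_RANGES_JSON`, `load_networks`, and `ip_in_ranges` are hypothetical names, and the addresses are reserved documentation ranges, not real crawler IPs.

```python
import ipaddress
import json

# Stand-in for a vendor's published range file (assumed layout, fake ranges).
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(ranges_json):
    """Parse the range file into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip, networks):
    """True if the address falls inside any of the published networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)
```

A request that claims a vendor's UA but arrives from an address outside that vendor's published ranges is a candidate for blocking; in production you would fetch and cache the real file rather than embed it.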

Protal's Phase 2 product, Protal Analytics, is specifically designed to detect and report on this category — agents that don't show up via UA. It's in development; if this problem is on your radar, join the early-access list.

Robots.txt examples for common goals

Goal: Allow AI search, block AI training

This is the most common 2026 setup. You want to be cited by ChatGPT, Perplexity, and Claude — but you don't want your content baked into the next model.

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: CCBot
Disallow: /

# Opt out of Gemini and Apple AI training (these are tokens, not UAs)
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow search and citation crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: YouBot
Allow: /

# Allow user-triggered fetches (real visitor on the other end)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: MistralAI-User
Allow: /

# Default for everything else
User-agent: *
Allow: /
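
Before deploying a policy like this, you can sanity-check it with Python's standard `urllib.robotparser`. The sketch below uses a shortened copy of the policy; `POLICY` and `allowed` are hypothetical names, and the URL is a placeholder. Pass the bare crawler name rather than a full header, and treat the result as an approximation, since each vendor's crawler runs its own parser.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the "allow search, block training" policy.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def allowed(policy, token, url="https://example.com/blog/"):
    """Ask the standard-library parser whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return parser.can_fetch(token, url)
```

Here `allowed(POLICY, "GPTBot")` is false and `allowed(POLICY, "OAI-SearchBot")` is true, so the training crawler is shut out while the search crawler gets through.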

Goal: Allow everything (let AI read freely)

For sites that want maximum AI distribution — public docs, marketing content, brand-building blogs:

User-agent: *
Allow: /

That's it. The default for any unlisted crawler is allow.

Goal: Block everything

For private content, staging environments, or sites that don't want any AI access:

User-agent: *
Disallow: /

Note: this won't stop bots that ignore robots.txt (Bytespider has a track record of ignoring it). Use firewall rules for hard blocking.

Best practices

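Specify every AI agent you care about explicitly. A wildcard User-agent: * catches most crawlers that have no block of their own, but some bots only act on a block that names them, and a crawler that does find its own block ignores the wildcard. Listing them by name is more reliable.
Always pair User-agent: with at least one Allow: or Disallow: line. A User-agent: line with no directive beneath it restricts nothing; if it sits directly above another User-agent: line, the two share the rules that follow.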
Use blank lines between blocks. Some parsers merge adjacent blocks if you don't.
Re-test after major LLM model releases. New model versions sometimes ship new crawler UAs or change behavior. The space moves monthly.
Don't rely on robots.txt for hard security. It's a politeness protocol, not access control. For real blocking, use firewall rules or auth.
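
The interaction between a named block and the wildcard is easy to get wrong, so it is worth testing. This minimal Python sketch uses the standard `urllib.robotparser` to show a named block taking precedence over `User-agent: *`; the policy text and URL are illustrative, and real crawlers may evaluate the file slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Block everyone by default, then carve out one named retrieval crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named block applies to OAI-SearchBot; everyone else gets the wildcard.
search_ok = parser.can_fetch("OAI-SearchBot", "https://example.com/pricing")
training_ok = parser.can_fetch("GPTBot", "https://example.com/pricing")
```

`search_ok` comes out true and `training_ok` false: the wildcard only governs crawlers that have no block of their own.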

How to check what's actually visiting you

The best way to know which AI crawlers reach your site is to read your server logs.

grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|claude-user|claude-searchbot|perplexitybot|bytespider|google-agent|amazonbot|googleagent-mariner" access.log

You'll see hits like:

203.0.113.10 - - [15/Feb/2026:08:14:22 -0500] "GET /blog/ HTTP/1.1" 200 15843 "-" "GPTBot/1.2"

What this tells you:

An expected bot is missing? Either you've blocked it (check robots.txt), it hasn't discovered your site yet, or it's not crawling your category.
An unexpected bot is hitting you hard? Check the legitimacy — some crawlers misrepresent themselves. Verify against published IP ranges where available.
Patterns of "Chrome" traffic with no JavaScript execution and no referer? Could be agentic browsers. UA alone won't tell you.
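
If you want counts rather than raw lines, the same check is a few lines of Python. The sketch below assumes your server writes the common "combined" log format, with the User-Agent as the last quoted field; `AI_TOKENS` mirrors the grep pattern above, and `LOG_LINE_RE` and `count_ai_hits` are hypothetical names.

```python
import re
from collections import Counter

# Same crawler names as the grep pattern above.
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "Claude-User", "Claude-SearchBot", "PerplexityBot", "Bytespider",
             "Google-Agent", "Amazonbot", "GoogleAgent-Mariner"]

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
LOG_LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"\s*$'
)


def count_ai_hits(lines):
    """Count requests per AI crawler name, matched case-insensitively."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                hits[token] += 1
                break
    return hits
```

Feed it an open file, for example `count_ai_hits(open("access.log"))`, and print `hits.most_common()` to see which crawlers visit most often.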

Want to know if your robots.txt is configured correctly?

We built Protal Audit specifically to check this — and dozens of other AI-readiness factors — in one pass.

Free. No signup required. Reports include your full robots.txt analysis, schema.org structured data validation, llms.txt presence, MCP discovery, and a battery of other rules across 9 categories. Every rule is documented in our public methodology.

FAQs

Do AI crawlers have to follow robots.txt?

No. robots.txt is a politeness protocol, not a legal mandate. Most reputable crawlers respect it. Some — Bytespider has the most documented track record — ignore it. For hard blocking, use firewall rules or password-protect the content.

What about IP-based blocking?

OpenAI, Anthropic, and a few others publish IP ranges for their crawlers. You can use these in Cloudflare firewall rules or your CDN's bot management features. This is more reliable than UA-based rules but more work to maintain. Note that IP ranges change — subscribe to vendor announcements or use a service that tracks them.

My logs show "Chrome" requests that don't look human. What are they?

Likely candidates: agentic browsers (ChatGPT Atlas, Operator, Claude for Chrome), headless browsers running on cloud IPs (Selenium / Playwright scrapers), or actual humans with privacy extensions. Distinguishing them requires JavaScript execution analysis, IP range correlation, or behavioral fingerprinting, and the UA alone reveals none of those signals.

Should I block GPTBot to opt out of training?

Depends on your business. If your content is your product (a publication, a SaaS docs site, a research org), opting out preserves your competitive moat. If your content is your marketing (a blog promoting a service), being trained into ChatGPT might increase your discoverability. There's no universal right answer.

Why do you say `Google-Extended` isn't a user agent?

Because it isn't. Google's documentation is explicit: Google-Extended is a robots.txt token that controls how Googlebot's crawled content is used for Gemini training. The HTTP request still comes through as Googlebot. You will never see Google-Extended in a User-Agent header. Many guides — including ours, prior to February 2026 — listed it in their UA tables. We've corrected that.

How often does this list change?

Roughly once a quarter, sometimes faster. Anthropic's crawler split this month is the latest major change. We update this guide whenever vendor documentation changes, significant new crawlers emerge, or community data (especially the ai-robots-txt repo) reflects shifts in the ecosystem.

Spotted an error or missing crawler? Reach us at hello@protal.ai — we monitor this list as part of our public methodology and update it regularly.

This guide is part of Protal's effort to provide neutral, accurate, technical references for the AI-era web. Protal builds tools that help websites prepare for and observe AI traffic. Learn more about Protal Audit and our forthcoming Protal Analytics.