How we score your site
Nine categories, no guesswork.
Protal's audit is a deterministic pipeline grouped into 9 weighted categories. Every rule has an ID, a written definition, and a reproducible check. This page is the complete reference.
Four things that don't change.
Evidence, not vibes
Every rule produces a concrete artifact — an HTTP response, a JSON-LD block, a header value — that engineers can re-fetch and verify themselves.
Agent-perspective
We measure what AI crawlers and agents actually encounter — robots.txt tokens, initial HTML before JS, well-known files, anti-bot challenges, action affordances. Not a generic browser experience, and not a perfect-rendering ideal.
Weighted by impact
A broken robots.txt costs more than a missing twitter:card. Category weights reflect how much each failure mode actually degrades retrieval quality.
Public and versioned
Rule definitions, weights, and changelog live on this page. No secret scoring — if we change a weight, you'll see it here first.
Category weights.
Each category contributes a fixed share of the 100-point overall score. Weights lean on signals with public evidence of driving AI retrieval and agent interaction: robots.txt enforcement, anti-bot mitigation, Schema.org, semantic HTML, server-side rendering, and MCP discovery for the agent economy. Smaller weights go to conventions where adoption is real but consumption by AI products is still maturing (llms.txt, skill.md).
Every rule, by category.
Each section opens with what the category covers and why it matters, then lists every rule inside. Each rule's full doc covers what we check, the verdict bands, and how to fix it.
Crawler Policy — Who you let in.
`robots.txt` is a plain text file at the root of every website (yoursite.com/robots.txt) that tells crawlers, both regular search crawlers and AI bots, what they're allowed to fetch. Every major AI vendor (OpenAI, Anthropic, Google, Perplexity, Meta) documents the names of its crawlers in its own developer docs and publicly commits to honoring this file. This category checks whether you've used `robots.txt` to make deliberate decisions per AI crawler instead of defaulting to “let everyone do whatever.” The three robots.txt-honoring kinds of crawler (training, retrieval, user-triggered) are explained on each rule below.
Note: there's a fourth kind of AI bot, agentic browsers like ChatGPT Atlas, OpenAI Operator, and Claude for Chrome, that uses standard Chrome user-agent strings and ignores `robots.txt`. Their interaction-readiness is checked separately in §03 Agent Readiness; their runtime behavior on your site is tracked by Protal Analytics.
Boundary: this category checks the policy you declare via robots.txt. Server-side enforcement behavior (anti-bot mitigation, WAF, IP-based rate limits) is checked in §03 Agent Readiness (R3.1).
- Audience
- AI crawlers — training, retrieval, and user-triggered fetchers
- What we look at
- `/robots.txt`, `User-agent:` blocks, `Sitemap:` directive
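To make "deliberate decisions per AI crawler" concrete, here is a minimal sketch that evaluates a robots.txt against several crawler tokens using Python's standard `urllib.robotparser`. The robots.txt content, the `crawler_access` helper, and the example.com URL are hypothetical illustrations, not Protal's implementation. GPTBot and ChatGPT-User are tokens OpenAI documents; ClaudeBot is Anthropic's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that makes a separate decision per crawler kind:
# block the training crawler, allow the user-triggered fetcher, and give
# everyone else a default rule.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /admin/
"""


def crawler_access(robots_txt, agents, url):
    """Return, per user-agent token, whether robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = crawler_access(
    ROBOTS_TXT,
    ["GPTBot", "ChatGPT-User", "ClaudeBot"],
    "https://example.com/pricing",
)
print(result)
```

ClaudeBot has no block of its own here, so it falls through to the `*` rules. Note that the standard-library parser applies the first matching rule in a block, so its verdicts can differ from a crawler that implements longest-match precedence.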
AI Semantic Files — Static files written for AI to read.
Two file conventions that AI agents read for context about your site. `/llms.txt` is a 2024 proposal: a markdown index of your most important pages with one-line descriptions, designed for an LLM to read in a single fetch. `/skill.md` describes what AI agents can do with your product: the actions, inputs, and constraints they should know about before taking action. These files matter most when live agents and user-triggered fetchers come to your site, such as ChatGPT-User answering a user's question about you or an agentic browser orienting itself before a task. A structured semantic file lets them understand your site in a single fetch instead of parsing full HTML and guessing from layout.
Publisher-side adoption is real and growing (Anthropic, OpenAI, Cloudflare, Vercel, Mintlify, Cursor, and Zed all publish them), but no mainstream AI platform has formally committed to consuming these files in retrieval. We score this category lightly to reflect that gap, and will raise the weight as consumption matures.
Boundary: this category covers static text files AI agents read for context. Programmatic interfaces for agents to call (MCP discovery, MCP endpoints) are in §03 Agent Readiness.
- Audience
- Language models doing first-pass indexing of your site
- What we look at
- `/llms.txt`, `/llms-full.txt`, `/skill.md`
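As an illustration of the "markdown index with one-line descriptions" shape, the sketch below parses a small `/llms.txt` into a title, a summary, and per-section links. The file content, the `parse_llms_txt` helper, and the regular expression are assumptions made for this example. They follow the layout described in the llms.txt proposal (an H1, a blockquote summary, then H2 sections of annotated links) and are not the checks Protal runs.

```python
import re

# Hypothetical /llms.txt body for an imaginary site.
LLMS_TXT = """\
# Example Co

> Example Co sells widgets and documents its API publicly.

## Docs

- [Quickstart](https://example.com/docs/quickstart): Install and make a first call
- [API reference](https://example.com/docs/api): Endpoints and parameters

## Optional

- [Changelog](https://example.com/changelog)
"""

# One list entry: "- [title](url)" with an optional ": note" suffix.
LINK = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Parse the title, summary, and per-section links from an llms.txt body."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK.match(line)
            if match:
                doc["sections"][current].append(
                    {
                        "title": match["title"],
                        "url": match["url"],
                        "note": match["note"] or "",
                    }
                )
    return doc


parsed = parse_llms_txt(LLMS_TXT)
print(parsed["title"], list(parsed["sections"]))
```

A file that yields a title, a summary, and at least one section with links is readable in a single fetch, which is the property the convention is designed for.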
Agent Readiness — Whether agents can actually complete actions on your site.
Agentic browsers like ChatGPT Atlas, OpenAI Operator, and Claude for Chrome can read and act on websites through a real browser interface, but they're sensitive to practical blockers: bot protection (e.g., Cloudflare Bot Fight Mode), JavaScript-only forms, non-semantic clickable elements (`<div onClick>` instead of `<button>`), and icon-only controls without accessible names. §01–§02 and §04–§09 mainly evaluate whether AI systems can discover and understand your site. This category evaluates whether agents can successfully navigate, submit, trigger, and complete actions on it.
This category also includes MCP discovery: the well-known file that exposes a structured tool/API path for agents, so they don't have to rely solely on page scraping. MCP belongs here because it's part of action-readiness, not just static AI-facing metadata.
In practice, this category targets a rapidly growing class of AI traffic that traditional analytics often measures poorly. Protal Audit asks whether a site is agent-ready in principle; Protal Analytics shows where agents complete tasks and where they get stuck in production.
This category overlaps deliberately with several others: §01 declares crawler policy while this category checks server-side enforcement of it; §06 handles structural HTML semantics (`<nav>`, `<main>`, `<article>`) while this category checks action-element semantics (`<button>`, `<a>`, `<form>`); §07 checks initial-HTML content visibility while this category checks form submission paths; §02 covers static AI-facing files while this category covers the programmatic agent interface (MCP).
- Audience
- Agentic browsers (ChatGPT Atlas, OpenAI Operator, Claude for Chrome) and agent-driven API consumers
- What we look at
- server response patterns, `<form>` attributes, `<button>`/`<a>` semantics, `aria-label`, `/.well-known/mcp.json`
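Two of the blockers named above, click handlers on non-semantic elements and icon-only controls without accessible names, can be detected from static HTML. The sketch below is one possible approach using Python's standard `html.parser`. The `ActionAudit` class, the `NATIVE_CONTROLS` set, and the sample markup are hypothetical. The check is deliberately simplified: it ignores `aria-labelledby`, `title`, and ARIA roles, so treat it as an illustration and not as the R3.3 or R3.4 rule logic.

```python
from html.parser import HTMLParser

# Tags assumed, for this sketch, to be natively interactive.
NATIVE_CONTROLS = {"a", "button", "input", "select", "textarea", "summary", "label"}


class ActionAudit(HTMLParser):
    """Collect click handlers on non-interactive tags, and <button>/<a>
    elements that end up with no text, no alt text, and no aria-label."""

    def __init__(self):
        super().__init__()
        self.findings = []
        self._open = None  # (tag, has_label, text_parts) for the control being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # html.parser lowercases attribute names
        if "onclick" in attr and tag not in NATIVE_CONTROLS:
            self.findings.append(f"non-semantic clickable <{tag}>")
        if tag in ("button", "a") and self._open is None:
            has_label = bool((attr.get("aria-label") or "").strip())
            self._open = (tag, has_label, [])
        elif self._open is not None and tag == "img" and (attr.get("alt") or "").strip():
            self._open[2].append(attr["alt"])  # alt text counts toward the name

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            _, has_label, parts = self._open
            if not has_label and not "".join(parts).strip():
                self.findings.append(f"<{tag}> without accessible name")
            self._open = None


SAMPLE = """
<div onClick="buy()">Buy now</div>
<button type="submit">Checkout</button>
<button><svg viewBox="0 0 8 8"><path d="M0 0h8v8z"/></svg></button>
<a href="/cart" aria-label="Cart"><svg></svg></a>
"""

audit = ActionAudit()
audit.feed(SAMPLE)
audit.close()
print(audit.findings)
```

The labeled link passes because `aria-label` gives it a name even though it contains only an icon; the unlabeled icon button is the case an agent cannot identify.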
Schema.org Structured Data — Telling crawlers what each page is.
Schema.org is a shared vocabulary for describing webpage content — “this is an Article by this author, published on this date” or “this is a Product with this price and these reviews.” You embed it as `<script type="application/ld+json">` blocks in your HTML, in a format called JSON-LD. Google has officially recommended JSON-LD since 2015 because the `<script>` block stays cleanly separated from visible HTML — easy for machines to extract, easy for authors to keep correct. Schema is the most direct mechanism for making your content machine-readable. Instead of an AI crawler guessing whether a number on your page is a price or a date, you hand it a labeled field (`offers.price: "9999"`). It reads the field, doesn't guess from prose. We weight this category in the higher tier because that mechanism is real and observable.
- Audience
- Major AI vendors (Google, OpenAI, Anthropic) + knowledge-graph builders
- What we look at
- `<script type="application/ld+json">`, `@context`, `@type`
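The "labeled field instead of guessing from prose" idea can be shown with a short extraction sketch. It collects every `application/ld+json` block from a page and reads `offers.price` directly. The `JsonLdExtractor` class and the sample page are hypothetical, and this is a minimal illustration, not the audit's own parser.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD blocks and record any that fail to parse."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.errors = []
        self._buf = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            raw = "".join(self._buf)
            self._buf = None
            try:
                self.blocks.append(json.loads(raw))
            except json.JSONDecodeError as exc:
                self.errors.append(str(exc))


# Hypothetical product page: the prose is ambiguous, the JSON-LD is not.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget",
 "offers": {"@type": "Offer", "price": "9999", "priceCurrency": "USD"}}
</script>
</head><body><p>Only 9999 left at 9999!</p></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(PAGE)
extractor.close()
product = extractor.blocks[0]
print(product["@type"], product["offers"]["price"])
```

In the body text the two numbers are indistinguishable; in the structured block the price is an explicitly named field.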
Metadata — The link-preview contract.
The `<head>` section at the top of every HTML page is the part visitors don't see on the page itself — browsers parse it as page metadata, not as content to display in the body. But every line inside `<head>` is something a machine reads. When ChatGPT or Claude cites your page, the headline they show comes from your `<title>` and the snippet under it comes from your `<meta name="description">`. When someone shares your link on Slack or LinkedIn, the preview card that pops up is built from your Open Graph (`og:*`) tags. None of this appears on the page UI; all of it controls how machines and sharing platforms see your page. Open Graph is one metadata signal we can directly see AI tools using. Paste your link into ChatGPT, Claude, or Perplexity, and the preview card that appears is built from your `og:*` tags. That's not an assumption — it's visible behavior. Get them wrong and your site shows up as a broken-looking citation.
- Audience
- AI agents rendering link previews when citing your site
- What we look at
- `<title>`, meta description, `og:*` tags, `twitter:*` tags, `<link rel="canonical">`
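A rough way to see what a link-preview builder has to work with is to pull the relevant `<head>` fields and list the empty ones. In the sketch below, the `HeadMetadata` class, the `missing_metadata` helper, and the `REQUIRED` list are all assumptions chosen for illustration; the page does not state which fields each rule requires.

```python
from html.parser import HTMLParser

# Hypothetical minimum set for a usable preview card (an assumption).
REQUIRED = ["title", "description", "og:title", "og:description", "og:image", "canonical"]


class HeadMetadata(HTMLParser):
    """Collect <title>, description, og:*/twitter:* tags, and the canonical link."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("property") or a.get("name")
            if key and (key == "description" or key.startswith(("og:", "twitter:"))):
                self.meta[key] = a.get("content", "")
        elif tag == "link" and a.get("rel", "").lower() == "canonical":
            self.meta["canonical"] = a.get("href", "")

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def missing_metadata(html):
    """Return the REQUIRED keys that are absent or empty in `html`."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    return [key for key in REQUIRED if not parser.meta.get(key, "").strip()]


SAMPLE_HEAD = """<head><title>Widget pricing</title>
<meta name="description" content="Plans and prices for Widget.">
<meta property="og:title" content="Widget pricing">
<meta property="og:description" content="Plans and prices for Widget.">
</head>"""

missing = missing_metadata(SAMPLE_HEAD)
print(missing)
```

A page like this sample would still produce a title and snippet, but the preview card would have no image and no canonical URL to attribute the citation to.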
HTML Semantics — Reading the page like a human.
Whether your HTML uses meaningful markup (`<h1>`, `<main>`, `<article>`, `<nav>`, alt text on images) or wraps every element in `<div>` tags. AI crawlers don't see your CSS; they read the raw HTML structure to figure out which part of the page is the main content, which is navigation, and which is the footer. A page built with semantic HTML is much easier for an AI to extract information from than one built entirely from `<div>` tags that carry no semantic meaning.
Boundary: this category checks structural semantic tags (`<nav>`, `<main>`, `<article>`, heading hierarchy, image `alt`). Action-element semantics (`<button>`, `<a>`, `<form>` semantic quality) and accessible names on interactive elements are checked in §03 Agent Readiness (R3.3 and R3.4).
- Audience
- Non-rendering fetchers + screen readers
- What we look at
- `<h1>`–`<h6>`, `<main>`, `<nav>`, `<article>`, `alt` attributes, link text
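The structural signals listed here can be read straight from the tag stream. The following sketch records heading levels, landmark elements, and images with no `alt` attribute. The `StructureAudit` class, the `heading_skips` helper, and the landmark set are hypothetical, and counting `alt=""` as present (a decorative image) is an assumption of this example.

```python
from html.parser import HTMLParser

LANDMARKS = ("main", "nav", "article", "header", "footer")


class StructureAudit(HTMLParser):
    """Record heading levels in order, landmark tags seen, and <img> without alt."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.landmarks = set()
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.headings.append(int(tag[1]))
        elif tag in LANDMARKS:
            self.landmarks.add(tag)
        elif tag == "img" and "alt" not in dict(attrs):
            self.images_missing_alt += 1


def heading_skips(levels):
    """Return (previous, current) pairs where the outline jumps down more than one level."""
    return [(a, b) for a, b in zip(levels, levels[1:]) if b - a > 1]


# Hypothetical div-only page fragment.
DIV_SOUP = (
    '<div class="top"><h1>Title</h1><div><h3>Sub</h3>'
    '<img src="a.png"><img src="b.png" alt=""></div></div>'
)

structure = StructureAudit()
structure.feed(DIV_SOUP)
structure.close()
print(structure.headings, structure.landmarks, structure.images_missing_alt)
```

For this fragment the parser finds no landmarks at all, one skipped heading level, and one image a non-rendering fetcher cannot describe.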
Rendering — What a non-JS fetcher actually sees.
Whether your content is visible in the initial HTML response, or only after JavaScript runs. Most AI crawlers (GPTBot, ClaudeBot, Bytespider, CCBot) don't run JavaScript. They fetch your page once and read whatever HTML comes back. The most common case where the HTML comes back empty is single-page apps (React / Vue / Svelte without server-side rendering), which deliver an empty `<div id="root"></div>` for the crawler to find. But the same problem can happen on partially server-rendered sites: Next.js components loaded with `ssr: false`, Astro islands marked `client:only`, server-rendered pages with critical widgets injected via JavaScript, or A/B testing tools that swap content client-side. This category checks what the AI actually sees in the initial HTML, regardless of the framework you used to get there.
Boundary: this category checks whether content is visible in the initial HTML response, before JavaScript runs. Whether forms can be submitted without complex JS is a separate, agent-focused check at §03 R3.2.
- Audience
- GPTBot, ClaudeBot, Bytespider, CCBot — all non-JS crawlers
- What we look at
- initial HTML response, pre-hydration DOM, SSR / SSG output
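One simple proxy for "what a non-JS fetcher sees" is the amount of readable text in the raw HTML once scripts, styles, and the `<head>` are excluded. The sketch below counts those words for a client-rendered shell and a server-rendered page. The `VisibleText` class, the `visible_word_count` helper, and both sample documents are hypothetical, and no pass/fail threshold is implied.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <head>, <script>, <style>, and <template>."""

    SKIP = {"head", "script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_word_count(html):
    """Count whitespace-separated words a non-JS fetcher could read in the body."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts).split())


# Hypothetical client-rendered shell: content only appears after JS runs.
CSR_SHELL = (
    "<html><head><title>App</title></head><body>"
    '<div id="root"></div><script src="/bundle.js"></script>'
    "</body></html>"
)

# Hypothetical server-rendered page: content is in the first response.
SSR_PAGE = (
    "<html><head><title>Pricing</title></head><body><main>"
    "<h1>Pricing</h1><p>Plans start at $10 per month.</p>"
    "</main></body></html>"
)

csr_words = visible_word_count(CSR_SHELL)
ssr_words = visible_word_count(SSR_PAGE)
print(csr_words, ssr_words)
```

The same function applies to partially rendered pages: a section that exists only after hydration contributes nothing to the count.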
Performance — How fast and efficient your server is.
How fast your server responds and how efficiently it sends content. AI crawlers don't have unlimited patience — every site they visit gets a time and bandwidth budget. If your server takes too long to start replying, the crawler moves on. If your responses are big and uncompressed, the crawler eats through its daily budget on you and stops earlier. If your caching is misconfigured, every visit re-fetches everything from scratch. This category covers five things: how fast the server starts sending data (Time to First Byte, or TTFB), whether responses are compressed (using gzip or brotli), what HTTP version the server uses (modern HTTP/2 or HTTP/3 vs older HTTP/1.1), whether caching is set up reasonably (Cache-Control headers), and how big the HTML file is. None of these alone hides your site from AI (that's the rendering category's job). But together they decide how often AI comes back and how much of your site it finishes reading.
- Audience
- All crawlers — they decide how often to come back based on how fast you respond
- What we look at
- TTFB, `Content-Encoding`, `Cache-Control`, HTTP/2 or HTTP/3
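Three of the five signals (compression, caching headers, and HTTP version) can be judged from response metadata alone. The sketch below turns a header dictionary and a protocol string into a list of findings. The `transfer_findings` function, its wording, and its pass conditions are assumptions for illustration; TTFB and HTML size are left out because the page does not state thresholds for them.

```python
def transfer_findings(headers, http_version):
    """Return human-readable findings for compression, caching, and HTTP version.

    `headers` is a mapping of response header names to values;
    `http_version` is a string such as "HTTP/1.1" or "HTTP/2".
    """
    h = {name.lower(): value for name, value in headers.items()}
    findings = []

    # The page names gzip and brotli ("br") as the accepted encodings.
    encodings = [tok.strip() for tok in h.get("content-encoding", "").lower().split(",")]
    if not any(tok in ("gzip", "br") for tok in encodings):
        findings.append("response is not compressed")

    # Only the absence of the header is flagged; judging individual
    # directives would need rules this sketch does not assume.
    if not h.get("cache-control", "").strip():
        findings.append("no Cache-Control header")

    if http_version not in ("HTTP/2", "HTTP/3"):
        findings.append(f"served over {http_version}")

    return findings


good = transfer_findings(
    {"Content-Encoding": "br", "Cache-Control": "max-age=300"}, "HTTP/2"
)
bad = transfer_findings({}, "HTTP/1.1")
print(good, bad)
```

Because each finding maps to one header or protocol value, the evidence is something an engineer can re-fetch and compare directly.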
Discoverability — Findable and trustable at the edge.
How crawlers find and trust your URLs. Four small pieces that compound:
1. Pick a single canonical hostname (`www.yoursite.com` or `yoursite.com`, not both). Without consolidation, AI sees two copies of your site and splits citations between them.
2. Force HTTPS and add HSTS. AI crawlers heavily distrust HTTP-only sites: Googlebot, GPTBot, and ClaudeBot all rank them lower or skip them entirely. HTTPS also prevents content tampering: if anything between your server and the crawler modifies the response (an ISP injecting ads, captive wifi rewriting pages, a hostile network on the path), the AI ends up quoting the tampered version. HSTS keeps crawlers locked into HTTPS so they can't be downgraded back to HTTP on the next request.
3. Publish a sitemap at `/sitemap.xml`. It lists every page on your site; without one, crawlers fall back to following links from your homepage and miss anything not linked there.
4. Add an RSS or Atom feed (content sites only). A real-time stream that AI aggregators like Perplexity and newsreaders subscribe to.
None of these is individually huge, but together they decide whether you're easy to find at all.
- Audience
- All crawlers + content discovery pipelines
- What we look at
- canonical hostname, TLS + HSTS headers, `/sitemap.xml`, RSS / Atom feed
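The canonical-hostname and HTTPS points can be cross-checked against the sitemap itself, since every `<loc>` should use one host and one scheme. The sketch below parses a sitemap with the standard library and reports entries that break either rule. The `sitemap_issues` helper and the sample sitemap are hypothetical; only the sitemaps.org XML namespace is taken from the published protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_issues(xml_text, canonical_host):
    """Return issues for <loc> entries that are not HTTPS or not on the canonical host."""
    root = ET.fromstring(xml_text)
    issues = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        parts = urlsplit(url)
        if parts.scheme != "https":
            issues.append(f"not HTTPS: {url}")
        if parts.hostname != canonical_host:
            issues.append(f"non-canonical host: {url}")
    return issues


# Hypothetical sitemap mixing hosts and schemes.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>http://www.example.com/blog</loc></url>
</urlset>"""

issues = sitemap_issues(SITEMAP, "www.example.com")
print(issues)
```

A sitemap that passes this check gives crawlers a single, consistent set of URLs to fetch and cite.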
Common questions.
How are weights chosen?
Empirically. We give the most weight to signals with the strongest verifiable evidence: major AI vendors (OpenAI, Anthropic, Google, Perplexity) publicly commit to honoring robots.txt, Open Graph metadata is concretely consumed by AI agents for link previews, and Schema.org is the most direct mechanism for making content machine-readable. Agent Readiness (§03) carries the same weight as Crawler Policy (§01) because agentic browsers are the fastest-growing class of AI traffic in 2026 and most agent failures are invisible to existing analytics. Smaller weights go to publishing conventions where adoption is real but consumption by AI products is still maturing (llms.txt, skill.md). MCP discovery carries a small weight rather than zero: discovery is still user-configured rather than file-discovered today, but the gap is closing fast.
Why is Category 2 (AI Semantic Files) only 5%?
Because llms.txt and skill.md are publishing conventions where adoption on the publisher side is real (Anthropic, OpenAI, Cloudflare, Vercel, Mintlify, Cursor, Zed all publish them), but mainstream AI platforms haven't formally committed to consuming these files in retrieval. We'll raise the weight as that changes. MCP rules live in §03 Agent Readiness rather than here, because MCP is fundamentally about giving agents a programmatic interface to act, not a static file to read.
Why might rule weights add up differently in my audit report?
The percentages shown on this page are baseline weights — what each rule contributes when ALL rules in a category apply to your site. In practice, some rules can't always be evaluated. For example, R2.2 (llms.txt format) only runs when R2.1 (llms.txt exists) passes. R2.3 (the llms-full.txt bonus) is skipped entirely when the file isn't there — it never fails. When a rule is skipped, it's removed from both the numerator AND denominator of your category score, and the remaining rules' effective weights renormalize automatically. This is by design: skipping is fairer than scoring zero for something that doesn't apply to your site.
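The renormalization described above can be sketched as a weighted average over only the rules that were evaluated. In this example the `category_score` function, the per-rule weights (3, 1, 1), and the 0-to-1 score scale are hypothetical; the page does not publish per-rule weights in this section, so the numbers only illustrate the arithmetic.

```python
def category_score(results):
    """Compute a category score from {rule_id: (baseline_weight, score)}.

    `score` is a float in [0, 1], or None when the rule was skipped.
    Skipped rules are dropped from both numerator and denominator, so the
    remaining weights renormalize automatically. Returns None if every
    rule was skipped.
    """
    applicable = [(w, s) for w, s in results.values() if s is not None]
    total_weight = sum(w for w, _ in applicable)
    if total_weight == 0:
        return None
    return sum(w * s for w, s in applicable) / total_weight


# Hypothetical audit: llms.txt exists (R2.1 passes), its format is partly
# valid (R2.2 scores 0.5), and there is no llms-full.txt (R2.3 skipped).
results = {
    "R2.1": (3.0, 1.0),
    "R2.2": (1.0, 0.5),
    "R2.3": (1.0, None),
}

renormalized = category_score(results)

# For comparison: what the score would be if the skipped rule counted as zero.
as_zero = category_score({**results, "R2.3": (1.0, 0.0)})

print(renormalized, as_zero)
```

With these numbers the renormalized score is 3.5 / 4, while treating the skipped bonus rule as a failure would give 3.5 / 5, which is the penalty the skip behavior is meant to avoid.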
How often do rules change?
AI moves fast. New crawlers ship every few months, retrieval behavior shifts, and what counts as 'AI-readable' is a moving target. That's why this is a versioned methodology rather than a fixed checklist. We monitor adoption signals and vendor documentation continuously, and update rules whenever the evidence changes. Point releases (v2.0 → v2.1) sharpen individual rules; major versions rebalance weights across categories. Every report shows the methodology version it was scored against, so you can always tell which rules and weights applied.
Can I appeal a rule result?
Every result ships with the raw evidence (headers, body, timing) so you can check the call against the data we used. If something still looks off, email us with the URL and the rule ID — we read every report carefully, and concrete cases are how the methodology actually improves.
Do you respect robots.txt?
Yes. ProtalBot honors Disallow directives for its own user-agent and only probes files a site operator would reasonably expect an auditor to fetch. See /bot for the full crawler reference.
Do you scan behind auth?
No. The audit targets publicly-fetchable surfaces — the same surface an AI crawler would see. Private areas are out of scope by design.
What about JavaScript-heavy sites?
The Rendering category (§07) explicitly tests what's visible in the initial HTML response, without JS execution. Pure CSR SPAs usually fail rules in this category — that's intentional. Most AI crawlers don't run JavaScript. Agentic browsers do (and the related action-readiness checks live in §03 Agent Readiness).
Why am I seeing the same result as last time?
Each URL is cached for 24 hours — the same URL returns the same result during that window. Well-known sites (github.com, stripe.com, openai.com, and a handful of others) cache for up to 7 days. AI-readiness signals don't shift minute-to-minute, so the cache keeps results stable while you work through fixes. If you've just made changes and want to verify them, give the cache time to expire or scan a different URL on the same site (e.g. a sub-page).