Meet ProtalBot. The crawler behind every audit.
ProtalBot is a well-behaved, identifiable web crawler. This page is the official reference — user-agent string, IP provenance, robots.txt directives, and how to verify, allow, rate-limit, or block it.
Verify it's really us
ProtalBot traffic originates from Vercel us-east-1 egress. A list of allowlistable CIDR ranges will be published at /.well-known/protalbot.json before public launch. Until then, email bot@protal.ai if you see traffic and want to confirm it is ours.
# starting in production
host <ip> | grep "vercel"
# ✓ vercel-infrastructure.com
One string we send.
ProtalBot identifies itself with a single, stable user-agent. Future fleets (continuous monitoring, replay-as) are planned but not operational — when they ship they'll get distinct UAs so you can scope policy independently.
Region pinned, ranges TBD.
All ProtalBot traffic egresses from a single region (us-east-1) for reproducibility: the same site scanned twice gets the same latency profile. The exact CIDR list will be finalized before public launch.
(published before launch)
The canonical list will land at https://protal.ai/.well-known/protalbot.json and will be auto-updated when egress changes.
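Once ranges are published, an allowlist check amounts to testing whether a source IP falls inside any listed CIDR block. The following is a minimal Python sketch under stated assumptions: the schema of protalbot.json is not yet known, so the function simply takes a list of CIDR strings, and `PLACEHOLDER_RANGES` uses reserved documentation ranges, not real ProtalBot addresses.

```python
import ipaddress

# Hypothetical placeholder ranges (RFC 5737 / RFC 3849 documentation blocks).
# These are NOT ProtalBot's ranges; the real list is not published yet.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if `ip` belongs to any CIDR block in `cidrs`."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        # Only compare addresses of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False
```

For example, `ip_in_ranges(request_ip, ranges)` could gate an allowlist rule, where `ranges` is loaded from the published file once its format is known.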
Three configurations, copy-paste.
Drop one of these into your /robots.txt. ProtalBot re-fetches robots at the start of every scan.
Welcome ProtalBot
Default for most sites. Lets Protal audit your site when a scan is requested by you or by someone you've shared the URL with.
# Allow Protal's auditor
User-agent: ProtalBot
Allow: /
Sitemap: https://example.com/sitemap.xml
Slow it down
Useful if your origin is sensitive. ProtalBot already self-throttles — this just makes it more conservative.
# Audit OK, but pace yourself
User-agent: ProtalBot
Allow: /
Crawl-delay: 5
Opt out
Protal honors this immediately — audits will report "blocked" and stop probing.
# No thanks
User-agent: ProtalBot
Disallow: /
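Before deploying one of these configurations, it can help to check how a standard parser reads it. The sketch below uses Python's built-in `urllib.robotparser` to evaluate the three variants for the `ProtalBot` user-agent token. It shows how Python's parser interprets the rules, which is an assumption about, not a guarantee of, how ProtalBot's own parser behaves. The `/pricing` URL is a hypothetical example path.

```python
from urllib.robotparser import RobotFileParser

WELCOME = """\
User-agent: ProtalBot
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

SLOW_DOWN = """\
User-agent: ProtalBot
Allow: /
Crawl-delay: 5
"""

OPT_OUT = """\
User-agent: ProtalBot
Disallow: /
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    url = "https://example.com/pricing"  # hypothetical page to test
    for name, text in [("welcome", WELCOME), ("slow", SLOW_DOWN), ("opt-out", OPT_OUT)]:
        rules = parse_robots(text)
        print(name, rules.can_fetch("ProtalBot", url), rules.crawl_delay("ProtalBot"))
```

Here `can_fetch` reports whether the URL is permitted for the given agent, and `crawl_delay` returns the configured delay in seconds, or `None` when no delay is set.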
How politely it fetches.
ProtalBot is designed to be barely noticeable in your logs. If it isn't, email us at bot@protal.ai; we treat runaway audits as bugs.
Per-host concurrency
At most one concurrent scan per target domain; ProtalBot never parallelizes a site against itself. Multiple users requesting the same site share a 24h cache.
Hourly cap
Maximum 10 scans per target host per hour, globally across all requesters. Hot targets (github.com, stripe.com) are served from an extended 7-day cache.
Cache lifetime
Rule results cached 24 hours unless the user explicitly re-runs. robots.txt is re-fetched at the start of every scan.
No JS execution
ProtalBot is mostly a pure fetcher and does not execute JavaScript on the static path. The one exception is Category 6 (Rendering), which invokes a headless Chromium to test what is missing for non-JS crawlers.
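The per-host concurrency limit and hourly cap described above can be illustrated with a small admission gate. This is a sketch, not Protal's implementation: the class name `ScanGate`, the sliding-window interpretation of "per hour", and the caller-supplied clock value are all assumptions made for illustration.

```python
from collections import defaultdict, deque

HOURLY_CAP = 10        # from the text: max scans per target host per hour
WINDOW_SECONDS = 3600  # assumption: "per hour" treated as a sliding window


class ScanGate:
    """Hypothetical gate: one active scan per host, capped starts per hour."""

    def __init__(self) -> None:
        self._active: set[str] = set()
        self._starts: defaultdict[str, deque] = defaultdict(deque)

    def try_start(self, host: str, now: float) -> bool:
        """Admit a scan for `host` at time `now` (seconds) if limits allow."""
        recent = self._starts[host]
        # Drop scan starts that have aged out of the one-hour window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        if host in self._active or len(recent) >= HOURLY_CAP:
            return False
        self._active.add(host)
        recent.append(now)
        return True

    def finish(self, host: str) -> None:
        """Mark the scan for `host` as complete."""
        self._active.discard(host)
```

A caller would invoke `try_start(host, time.monotonic())` before scanning and `finish(host)` afterwards. A rejected request could instead be served from the shared cache mentioned above.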
The usual questions.
Is ProtalBot used to train models?
No. Protal does not sell, license, or use fetched content to train language models. Responses are analyzed for audit rules only, then discarded within 24 hours.
What if I block ProtalBot?
We immediately stop probing your site. The next scan request reports a blocked status with the matching Disallow as evidence.
Does it hit /admin or /api?
Only if they're linked from pages an AI agent would crawl, and only if robots.txt permits. You can always carve them out with a scoped Disallow.
How do I contact you about traffic?
Email bot@protal.ai with the user-agent and a timestamp — we respond within five business days and can pause audits mid-flight.
Can I whitelist by IP?
Soon. The published IP-range list is being finalized for production; see the IP ranges section above. Reverse-DNS verification is the spoof-resistant path once that's up.
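A reverse-DNS check of the kind mentioned here is commonly done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's suffix, then resolve the hostname back and confirm it maps to the same IP. The Python sketch below assumes the `vercel-infrastructure.com` suffix from the `host` example on this page. Whether ProtalBot's egress IPs support the forward-confirmation step is an assumption, and a match only indicates Vercel-hosted traffic, not ProtalBot specifically.

```python
import socket

# Suffix taken from the `host` example on this page.
EXPECTED_SUFFIX = "vercel-infrastructure.com"


def hostname_matches(hostname: str, suffix: str = EXPECTED_SUFFIX) -> bool:
    """True if `hostname` is the suffix itself or a subdomain of it."""
    host = hostname.rstrip(".").lower()
    return host == suffix or host.endswith("." + suffix)


def verify_ip(ip: str) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> IP must round-trip."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname_matches(hostname):
        return False
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return False
    # Plain string comparison; IPv6 text forms may need normalizing.
    return any(info[4][0] == ip for info in infos)
```

Matching on a dot-delimited suffix rather than a substring avoids accepting look-alike hostnames that merely contain the expected domain.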
Does ProtalBot follow nofollow?
Yes. Link-level rel="nofollow" and meta robots directives are respected per the standard. We audit; we don't index a link graph.