AI Crawler Access
ai-crawler-access
What this datapoint measures
Whether the brand’s robots configuration permits known AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Googlebot, GoogleOther, ChatGPT-User, anthropic-ai, and similar AI-system crawler user-agents — to access the brand’s content.
The datapoint is straightforward but consequential. A brand whose robots.txt blocks AI crawlers, or whose server returns errors to AI crawler user-agents, is a brand that has structurally excluded itself from AI-mediated discovery. Some brands do this deliberately as a content-protection or business-model decision; most do it accidentally as a side effect of legacy robots configurations.
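A quick first check of the robots.txt layer can be scripted with a stock parser. A minimal sketch using Python's standard-library robots.txt parser; the domain is a placeholder, and this tests only robots.txt, not the server-, CDN-, or WAF-level blocking discussed below:

```python
# Minimal sketch: check whether a site's robots.txt permits known AI
# crawler user-agents. This covers only the robots.txt layer; server,
# CDN, and WAF blocking must be tested separately.
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "GoogleOther"]

rp = RobotFileParser("https://brand.example/robots.txt")  # placeholder domain
rp.read()  # fetch and parse the live robots.txt

for ua in AI_CRAWLERS:
    allowed = rp.can_fetch(ua, "https://brand.example/")
    print(f"{ua}: {'allowed' if allowed else 'blocked'}")
```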
What high looks like
- robots.txt allows access from major AI crawlers (no Disallow directives blocking GPTBot, ClaudeBot, PerplexityBot, etc.; see the sketch after this list)
- Server-level configurations do not block AI crawler user-agents at the firewall, CDN, or application level
- robots meta tags on pages do not include `noai` or `noimageai` directives unless intentional
- llms.txt (where present) does not block crawler access
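A robots.txt consistent with these criteria might look like the following sketch. The user-agent tokens are the commonly published ones; verify each vendor's current token before relying on this.

```
# Explicitly permit major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Everything else falls through to the default rule
User-agent: *
Allow: /
```

Listing AI crawlers explicitly matters because a crawler that matches a named User-agent group ignores the `User-agent: *` group entirely; an explicit group makes the policy unambiguous.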
What low looks like
- robots.txt blocks some AI crawlers but not others (inconsistent treatment)
- Some AI crawlers receive 200 responses while others receive 403 or 429 due to bot-detection middleware (see the probe sketch after this list)
- robots meta `noai` directives present on some pages but not others (inconsistent)
- AI crawler access works for some content but not others (e.g., main site accessible but blog subdomain blocked)
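The inconsistent-status case can be surfaced with a simple probe. A minimal sketch, assuming the requests library and illustrative user-agent strings (real crawler UA strings are longer and versioned):

```python
# Probe one URL with several AI crawler user-agents and compare the HTTP
# status each receives. Divergent results (e.g., 200 for GPTBot but 403
# for ClaudeBot) indicate the inconsistent treatment described above.
import requests

AI_CRAWLER_UAS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}

def probe(url: str) -> dict[str, int]:
    """Return the status code each AI crawler user-agent receives."""
    return {
        name: requests.get(url, headers={"User-Agent": ua}, timeout=10).status_code
        for name, ua in AI_CRAWLER_UAS.items()
    }

statuses = probe("https://brand.example/")  # placeholder URL
print(statuses)
if len(set(statuses.values())) > 1:
    print("Inconsistent treatment across AI crawlers")
```

A 200 from this probe is not conclusive: CDN bot management often verifies crawler IP ranges as well as user-agent strings, so a spoofed user-agent from an office IP may be treated differently than the real crawler.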
What at floor looks like
A brand at floor on ai-crawler-access has actively or accidentally blocked AI crawlers from substantial portions of the site. The brand has structurally excluded itself from AI-mediated discovery. AI systems either cannot index the content at all or have stale indexes from before the blocking occurred.
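The deliberate version of floor is usually visible directly in robots.txt as a blanket block, along the lines of this representative pattern (not drawn from any specific brand):

```
# Floor pattern: AI crawlers fully blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
```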
This is one of the most consequential failure modes at AS ≈ 0. It often goes undetected because traditional SEO tools do not flag AI-crawler blocking; the brand’s traditional search performance can remain intact while AVO performance is structurally zero.
The path off floor is to update robots.txt, server configurations, and bot-detection rules to permit AI crawlers. The work is typically quick from an engineering standpoint, but it requires coordination across whoever owns the robots.txt, the CDN, and any bot-detection systems.
What affects this datapoint
- robots.txt directives (User-agent and Disallow rules)
- Server-level user-agent filtering at the application or middleware layer (see the sketch after this list)
- CDN-level bot-detection rules
- WAF (Web Application Firewall) rules
- robots meta tags on pages
- llms.txt declarations
- Rate-limiting rules that may produce 429 responses to high-volume crawler traffic
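The application- and middleware-layer factors are often the least visible. A hypothetical sketch of how a legacy bot filter blocks AI crawlers by accident; the substring list and function are illustrative, not drawn from any specific framework:

```python
# Hypothetical legacy middleware filter: written to block scrapers, it
# also matches GPTBot, ClaudeBot, and PerplexityBot, producing the
# accidental 403s this datapoint detects.
BLOCKED_UA_SUBSTRINGS = ("bot", "crawler", "spider")  # overly broad

def is_blocked(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)

assert is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)")  # AI crawler caught as collateral
```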
OMG actions that influence this datapoint
| Action | Influence |
|---|---|
| O-4 Technical Infrastructure, Performance & International Foundation | Direct, primary. Crawler access is a core component of O-4 work. |
| O-7 Compliance & Trust Infrastructure | Indirect. O-7 work sometimes surfaces blocking decisions made for compliance reasons that may need reconsideration. |
Multilingual considerations
ai-crawler-access is language-neutral. The crawler reads bytes regardless of language. However, multilingual brands often have multiple subdomains or country-code top-level domains, and the crawler-access configuration must be consistent across all of them. A brand whose en.brand.com permits AI crawlers but ja.brand.com blocks them has degraded multilingual AVO performance specifically in Japanese.
For brands using internationalization via path (brand.com/ja/) rather than subdomain, the robots.txt is shared across languages — but server-level configurations may still differ.
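Cross-host consistency can be verified with the same kind of status probe, run once per language host. A minimal sketch, assuming the requests library and illustrative hostnames:

```python
# Check that each language subdomain returns the same status to a given
# AI crawler user-agent. Hostnames and UA string are placeholders.
import requests

HOSTS = ["https://en.brand.example/", "https://ja.brand.example/"]
UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"

statuses = {
    host: requests.get(host, headers={"User-Agent": UA}, timeout=10).status_code
    for host in HOSTS
}
print(statuses)
if len(set(statuses.values())) > 1:
    print("Divergent crawler treatment across language hosts")
```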
Common failure modes
- Legacy robots.txt blocking GPTBot or similar based on outdated content-protection advice
- CDN bot-detection (Cloudflare Bot Management or similar) blocking AI crawlers as part of broader bot-blocking rules
- Application-level rate limiting treating AI crawler traffic as abuse
- Server logs showing 403 or 429 responses to AI crawler user-agents while the brand stakeholder believes the site is accessible (see the log-scan sketch after this list)
- Brand stakeholder having explicitly opted out of AI training (a legitimate business decision) but not having considered whether they also want to opt out of AI retrieval and citation (related but distinct)
- Subdomain or country-domain configurations that diverge from the main domain
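The log-based failure modes can be confirmed directly from access logs. A sketch assuming the Apache/nginx combined log format; the regex and log path are placeholders to adjust for the server's actual format:

```python
# Count 403/429 responses per AI crawler user-agent in a combined-format
# access log. The log path is a placeholder.
import re
from collections import Counter

AI_UA_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "GoogleOther",
                "ChatGPT-User", "anthropic-ai")
# Combined format ends: ... "GET /path HTTP/1.1" 403 2326 "referer" "user-agent"
LOG_RE = re.compile(r'" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')

blocked = Counter()
with open("access.log") as f:  # placeholder path
    for line in f:
        m = LOG_RE.search(line)
        if not m:
            continue
        status, ua = m.group("status"), m.group("ua")
        if status in ("403", "429"):
            for token in AI_UA_TOKENS:
                if token in ua:
                    blocked[(token, status)] += 1

for (token, status), count in sorted(blocked.items()):
    print(f"{token}: {count} responses with status {status}")
```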
Diagnostic interpretation
ai-crawler-access at floor is a stop-everything finding. No other AVO work matters until AI crawlers can reach the content. The first action in any engagement that surfaces this finding is to remediate the access issue.
ai-crawler-access at low (some crawlers blocked, others permitted) indicates inconsistent treatment that should be unified. The remediation is straightforward but requires deliberate decisions about which crawlers to permit and which to block, with the brand stakeholder’s input on content-protection preferences.
ai-crawler-access at high but other V1.2 datapoints low indicates that crawlers can reach the site but find broken or underperforming infrastructure. The remedy is the rest of V1.2 work.