Audit your site the way AI crawlers see it
Most SEO tools fetch a page once, as a browser, and grade the HTML. @agentmarkup/audit fetches the same URL as GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended, diffs each response against a normal browser, and reports where AI systems get a different, often worse, view than your human visitors. It is the command-line companion to the website checker, built for local runs and CI.
Usage
# Audit any live URL
npx @agentmarkup/audit https://example.com
# JSON output for CI or league tables
npx @agentmarkup/audit https://example.com --json
# Bare domains are normalized to https://
npx @agentmarkup/audit example.com --timeout 15000
It is deterministic (pass / warn / error, no invented scores). The exit code is 1 when any error-level finding is present (a CI gate), 0 otherwise, and 2 on a usage error.
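The exit-code contract can be sketched as a small function. This is a minimal illustration, not the tool's source: the `Level` and `Outcome` types and the `exitCodeFor` name are assumptions introduced here.

```typescript
// Hypothetical sketch of the exit-code contract; all names are illustrative.
type Level = 'pass' | 'warn' | 'error';

type Outcome =
  | { kind: 'usage-error' }               // bad flags or a missing URL
  | { kind: 'report'; levels: Level[] };  // one level per finding

function exitCodeFor(outcome: Outcome): 0 | 1 | 2 {
  if (outcome.kind === 'usage-error') return 2;
  // Warnings never fail the run; only error-level findings do.
  return outcome.levels.includes('error') ? 1 : 0;
}
```

Under this reading, a run that produces only warnings still exits 0, which is what keeps ambiguous findings from failing a build.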
What it checks
| Area | What it does |
|---|---|
| Crawler access | Fetches as each AI crawler user-agent and diffs against a browser control. Flags challenges, differential blocks, rate limits, origin errors, and when an accessible crawler gets materially less content than a browser (JS-gated or cloaked pages). |
| JS dependence | Measures whether the raw, un-executed HTML actually contains content, or is an empty shell that only fills in after JavaScript runs. |
| robots.txt | Detects whether the crawlers you likely want are shadowed by a wildcard Disallow, and whether a canonical Content-Signal policy is present. |
| llms.txt | Fetches /llms.txt (guarding against HTML soft-404s), validates it, and checks whether the homepage links it for discovery. |
| JSON-LD | Extracts the JSON-LD blocks and flags only unparseable or type-less ones; parseable structured data, including @graph, passes. |
| Markdown mirror | Detects a fetchable markdown mirror or a text/markdown alternate link, the clean low-noise version agents prefer. |
| Sitemap | Checks for /sitemap.xml, a Sitemap: directive in robots.txt, or common non-standard sitemap paths. |
| Page metadata | Checks for a title, meta description, and canonical link that AI systems use to attribute the page. |
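To make the robots.txt row concrete, here is one way a wildcard-shadowing check could look. It is a simplified sketch, not the audit's implementation: `parseGroups` and `isShadowedByWildcard` are hypothetical names, `Allow` rules are ignored, and only a literal `Disallow: /` counts as a full block.

```typescript
// Hypothetical helpers; not part of @agentmarkup/audit.
type Group = { agents: string[]; disallow: string[] };

function parseGroups(robots: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of robots.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!current || !lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === 'allow' || field === 'disallow') {
      if (field === 'disallow' && current && value !== '') {
        current.disallow.push(value);
      }
      lastWasAgent = false; // a rule line closes the agent list
    }
  }
  return groups;
}

// A crawler is "shadowed" when it has no group of its own and the
// wildcard group disallows the whole site.
function isShadowedByWildcard(robots: string, crawler: string): boolean {
  const groups = parseGroups(robots);
  const name = crawler.toLowerCase();
  if (groups.some((g) => g.agents.includes(name))) return false;
  return groups.some((g) => g.agents.includes('*') && g.disallow.includes('/'));
}
```

With this sketch, a file containing only `User-agent: *` and `Disallow: /` shadows GPTBot, while adding a dedicated `User-agent: GPTBot` group does not.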
An honest note on "blocked" crawlers
The audit spoofs a crawler's user-agent from an ordinary IP. That is exactly what a browser extension or a curious developer can do, and it is not what the real, verified bot does. So a 403 for a spoofed GPTBot user-agent is genuinely ambiguous:
- it can be a user-agent WAF rule, which also blocks the real GPTBot (a real problem), or
- it can be IP allowlisting, where the verified GPTBot, coming from OpenAI's published IP ranges, is let through just fine (no problem at all).
From a spoofed request the tool cannot tell these apart, so it reports them as warnings with both explanations and the raw evidence, never as a bare "your site blocks AI" error. Error-level findings are reserved for things provable from the response itself: a robots.txt that literally disallows the crawler, an empty JavaScript shell, or invalid llms.txt / JSON-LD.
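The severity rule described above might be sketched as follows. The `Probe` and `Finding` shapes and the `classifyAccess` function are assumptions made for illustration; the real tool's data model and full set of checks are not shown in this document.

```typescript
// Hypothetical sketch of the warn-versus-error decision; names are illustrative.
type Level = 'pass' | 'warn' | 'error';

interface Probe {
  crawlerStatus: number;    // status returned to the spoofed crawler user-agent
  browserStatus: number;    // status returned to the browser control
  robotsDisallows: boolean; // robots.txt literally disallows this crawler
}

interface Finding {
  level: Level;
  explanations: string[];
}

function classifyAccess(p: Probe): Finding {
  // Provable from the response itself, so it is error-level.
  if (p.robotsDisallows) {
    return { level: 'error', explanations: ['robots.txt disallows this crawler'] };
  }
  // Ambiguous from a spoofed request, so it is a warning with both readings.
  if (p.crawlerStatus === 403 && p.browserStatus < 400) {
    return {
      level: 'warn',
      explanations: [
        'user-agent WAF rule: the real crawler is blocked too',
        'IP allowlisting: the verified crawler may be let through',
      ],
    };
  }
  // No differential block detected by this simplified check.
  return { level: 'pass', explanations: [] };
}
```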
Use it as a CI gate
Because the exit code is non-zero only on provable errors, the audit is safe to run in CI without false failures from the ambiguous cases:
# .github/workflows/ci.yml (excerpt)
- run: npx @agentmarkup/audit https://example.com
Programmatic use
The same audit is available as a library:
import { audit, renderText } from '@agentmarkup/audit'
const report = await audit('https://example.com', {
fetchedAt: new Date().toISOString(),
})
console.log(report.summary) // { pass, warn, error, checks, passed, worst }
process.stdout.write(renderText(report))
The exported analyzers (analyzeCrawlerAccess, analyzeRobots, analyzeJsDependence, analyzeMachineReadable) and the SSRF-safe safeFetch are available for building custom pipelines.
How it relates to the rest of agentmarkup
The build-time adapters and the CLI generate machine-readable output; @agentmarkup/audit verifies what a live site actually serves to AI crawlers. It pairs naturally with the llms.txt, JSON-LD, and AI crawler guides: use those to fix what the audit finds.
Frequently asked questions
Does a 403 for GPTBot mean my site blocks AI?
Not necessarily. The audit spoofs the user-agent from a generic IP, so a 403 can be a user-agent WAF rule (which does block the real bot) or IP allowlisting (where the verified bot, from the vendor's published IP ranges, is fine). The audit reports this as a warning with both explanations, not as a definitive block.
Is it safe to point at any URL?
Requests use an SSRF-safe fetch: localhost, private, loopback, link-local, CGNAT, and IPv6-bypass address forms are refused, redirects are followed manually and re-validated per hop, and responses are size- and time-bounded. The blocklist mirrors the hosted checker.
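As an illustration of the address blocklist, the sketch below refuses the IPv4 ranges named above. It is not the code behind safeFetch: `isRefusedIPv4` and its helpers are hypothetical, the range list is an assumption based on the standard reserved blocks, and IPv6 forms, DNS resolution, and per-hop redirect validation are left out.

```typescript
// Hypothetical IPv4-only sketch; not the safeFetch implementation.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let n = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    n = n * 256 + octet; // plain arithmetic avoids signed 32-bit bitwise pitfalls
  }
  return n;
}

// [network base, prefix length]
const BLOCKED_RANGES: Array<[string, number]> = [
  ['0.0.0.0', 8],      // "this network"
  ['10.0.0.0', 8],     // private
  ['100.64.0.0', 10],  // CGNAT
  ['127.0.0.0', 8],    // loopback
  ['169.254.0.0', 16], // link-local
  ['172.16.0.0', 12],  // private
  ['192.168.0.0', 16], // private
];

function inCidr(addr: number, base: string, bits: number): boolean {
  const baseInt = ipv4ToInt(base);
  if (baseInt === null) return false;
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(addr / size) === Math.floor(baseInt / size);
}

// Expects an already-resolved IPv4 literal; anything unparseable is refused.
function isRefusedIPv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // fail closed
  return BLOCKED_RANGES.some(([base, bits]) => inCidr(addr, base, bits));
}
```

A fetcher built this way would run the check on the resolved address of the initial URL and again on every redirect target, since a public host can redirect to an internal one.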
How is this different from the website checker?
They apply the same idea. The checker is the hosted, browser-based version for a quick lookup; @agentmarkup/audit is the command-line version for local runs, scripting, and CI, with a non-zero exit code on provable errors.