# How to audit your site the way AI crawlers see it - agentmarkup

> Use @agentmarkup/audit to fetch any live URL as GPTBot, ClaudeBot, PerplexityBot, and other AI crawlers, diff each response against a browser, and catch machine-readability issues in CI.

Source: https://agentmarkup.dev/docs/audit/

# Audit your site the way AI crawlers see it

Most SEO tools fetch a page once, as a browser, and grade the HTML. `@agentmarkup/audit` fetches the **same URL as GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended**, diffs each response against a normal browser, and reports where AI systems get a different, often worse, view than your human visitors. It is the command-line companion to the [website checker](/checker/), built for local runs and CI.

## Usage

```
# Audit any live URL
npx @agentmarkup/audit https://example.com

# JSON output for CI or league tables
npx @agentmarkup/audit https://example.com --json

# Bare domains are normalized to https://
npx @agentmarkup/audit example.com --timeout 15000
```

The audit is deterministic: every check resolves to pass, warn, or error, with no invented scores. The exit code is `1` when any error-level finding is present (which makes it usable as a CI gate), `2` on a usage error, and `0` otherwise.
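As a rough illustration of that exit-code contract, the sketch below maps a report summary to an exit code. `exitCodeFor` is a hypothetical helper, not part of the package, and it assumes that `summary.error` is a count of error-level findings.

```javascript
// Hypothetical helper that mirrors the documented exit-code contract.
// Assumption: summary.error holds the number of error-level findings.
function exitCodeFor(summary, usageError = false) {
  if (usageError) return 2;          // usage error
  return summary.error > 0 ? 1 : 0;  // warnings alone never fail the run
}

console.log(exitCodeFor({ pass: 9, warn: 2, error: 0 })); // 0
console.log(exitCodeFor({ pass: 7, warn: 1, error: 1 })); // 1
```

Keeping warnings out of the exit code is what lets ambiguous findings surface in the report without failing a build.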

## What it checks

| Area | What it does |
| --- | --- |
| Crawler access | Fetches as each AI crawler user-agent and diffs against a browser control. Flags challenges, differential blocks, rate limits, origin errors, and cases where an accessible crawler gets materially less content than a browser (JS-gated or cloaked pages). |
| JS dependence | Measures whether the raw, un-executed HTML actually contains content, or is an empty shell that only fills in after JavaScript runs. |
| robots.txt | Detects whether the crawlers you likely want are shadowed by a wildcard `Disallow`, and whether a canonical `Content-Signal` policy is present. |
| llms.txt | Fetches `/llms.txt` (guarding against HTML soft-404s), validates it, and checks whether the homepage links it for discovery. |
| JSON-LD | Extracts the JSON-LD blocks and flags only unparseable or type-less ones; parseable structured data, including `@graph`, passes. |
| Markdown mirror | Detects a fetchable markdown mirror or a `text/markdown` alternate link, the clean low-noise version agents prefer. |
| Sitemap | Checks for `/sitemap.xml`, a `Sitemap:` directive in robots.txt, or common non-standard sitemap paths. |
| Page metadata | Checks for a title, meta description, and canonical link that AI systems use to attribute the page. |
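To make the robots.txt check concrete, here is a minimal sketch of what "shadowed by a wildcard `Disallow`" can mean: the crawler has no group of its own, so it falls back to a `User-agent: *` group that disallows `/`. The function names are hypothetical and this is not the package's parser; it deliberately ignores `Allow` precedence and path wildcards.

```javascript
// Minimal robots.txt group parser (a sketch, not a complete implementation).
function parseGroups(robotsTxt) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field === "disallow" && current && value !== "") {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// True when the crawler has no group of its own and the "*" group disallows "/".
function isShadowedByWildcard(robotsTxt, crawler) {
  const groups = parseGroups(robotsTxt);
  const name = crawler.toLowerCase();
  if (groups.some((g) => g.agents.includes(name))) return false;
  return groups.some((g) => g.agents.includes("*") && g.disallow.includes("/"));
}

const robots = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n";
console.log(isShadowedByWildcard(robots, "ClaudeBot")); // true
console.log(isShadowedByWildcard(robots, "GPTBot"));    // false
```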

## An honest note on "blocked" crawlers

The audit spoofs a crawler's **user-agent** from an ordinary IP. That is exactly what a browser extension or a curious developer can do, and it is *not* what the real, verified bot does. So a `403` for a spoofed `GPTBot` user-agent is genuinely ambiguous:

- it can be a **user-agent WAF rule**, which also blocks the real GPTBot (a real problem), **or**
- it can be **IP allowlisting**, where the verified GPTBot, coming from OpenAI's published IP ranges, is let through just fine (no problem at all).

From a spoofed request the tool cannot tell these apart, so it reports them as **warnings with both explanations and the raw evidence**, never as a bare "your site blocks AI" error. Error-level findings are reserved for things provable from the response itself: a `robots.txt` that literally disallows the crawler, an empty JavaScript shell, or invalid `llms.txt` / JSON-LD.
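The reasoning above can be sketched as a small classifier. This is an illustration under stated assumptions, not the package's logic: `classifySpoofedStatus` and the shape of its return value are hypothetical, and it looks only at HTTP status codes from the browser control and the spoofed crawler request.

```javascript
// Hypothetical classifier for a spoofed-crawler response versus the browser control.
// It never returns "error": a spoofed request cannot prove a real block.
function classifySpoofedStatus(browserStatus, crawlerStatus) {
  const evidence = { browserStatus, crawlerStatus };
  if (crawlerStatus < 400) {
    return { level: "pass", explanations: [], evidence };
  }
  if (browserStatus >= 400) {
    // The browser control failed too, so the failure is not crawler-specific.
    return {
      level: "warn",
      explanations: ["browser control also failed; not specific to the crawler"],
      evidence,
    };
  }
  // Browser succeeded, crawler user-agent did not: ambiguous differential block.
  return {
    level: "warn",
    explanations: [
      "user-agent WAF rule (would also block the real bot)",
      "IP allowlisting (the verified bot may be let through)",
    ],
    evidence,
  };
}

console.log(classifySpoofedStatus(200, 403).level);               // "warn"
console.log(classifySpoofedStatus(200, 403).explanations.length); // 2
console.log(classifySpoofedStatus(200, 200).level);               // "pass"
```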

## Use it as a CI gate

Because the exit code is non-zero only on provable errors, the audit is safe to run in CI without false failures from the ambiguous cases:

```
# .github/workflows/ci.yml (excerpt)
- run: npx @agentmarkup/audit https://example.com
```

## Programmatic use

The same audit is available as a library:

```
import { audit, renderText } from '@agentmarkup/audit'

const report = await audit('https://example.com', {
 fetchedAt: new Date().toISOString(),
})

console.log(report.summary) // { pass, warn, error, checks, passed, worst }
process.stdout.write(renderText(report))
```

The exported analyzers (`analyzeCrawlerAccess`, `analyzeRobots`, `analyzeJsDependence`, `analyzeMachineReadable`) and the SSRF-safe `safeFetch` are available for building custom pipelines.

## How it relates to the rest of agentmarkup

The build-time adapters and the [CLI](https://www.npmjs.com/package/@agentmarkup/cli) *generate* machine-readable output; [@agentmarkup/audit](https://www.npmjs.com/package/@agentmarkup/audit) *verifies* what a live site actually serves to AI crawlers. It pairs naturally with the [llms.txt](/docs/llms-txt/), [JSON-LD](/docs/json-ld/), and [AI crawler](/docs/ai-crawlers/) guides: use those to fix what the audit finds.

## Frequently asked questions

### Does a 403 for GPTBot mean my site blocks AI?

Not necessarily. The audit spoofs the user-agent from a generic IP, so a 403 can be a user-agent WAF rule (which does block the real bot) or IP allowlisting (where the verified bot, from the vendor's published IP ranges, is fine). The audit reports this as a warning with both explanations, not as a definitive block.

### Is it safe to point at any URL?

Requests use an SSRF-safe fetch: localhost, private, loopback, link-local, CGNAT, and IPv6-bypass address forms are refused, redirects are followed manually and re-validated per hop, and responses are size- and time-bounded. The blocklist mirrors the hosted checker.
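For a sense of what such a blocklist involves, the sketch below checks a dotted-quad IPv4 literal against the loopback, private, link-local, and CGNAT ranges. It is not the package's `safeFetch`: `isBlockedIPv4` is a hypothetical name, IPv6 is out of scope, and a real fetcher would also have to check the address a hostname resolves to, on every redirect hop.

```javascript
// Sketch of an IPv4 blocklist check; handles dotted-quad IPv4 literals only.
const BLOCKED_V4 = [
  ["127.0.0.0", 8],    // loopback
  ["10.0.0.0", 8],     // private
  ["172.16.0.0", 12],  // private
  ["192.168.0.0", 16], // private
  ["169.254.0.0", 16], // link-local
  ["100.64.0.0", 10],  // CGNAT (shared address space)
];

// Convert "a.b.c.d" to an unsigned integer, or null if it is not a clean dotted quad.
function ipv4ToInt(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const octet = Number(p);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

function isBlockedIPv4(ip) {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // refuse anything that does not parse cleanly
  return BLOCKED_V4.some(([base, bits]) => {
    const start = ipv4ToInt(base);
    const size = 2 ** (32 - bits);
    return addr >= start && addr < start + size;
  });
}

console.log(isBlockedIPv4("192.168.1.10"));  // true
console.log(isBlockedIPv4("100.100.1.1"));   // true
console.log(isBlockedIPv4("93.184.216.34")); // false
```

Failing closed on anything that does not parse is a deliberate choice in this sketch: unusual address spellings are a common way to slip past naive filters.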

### How is this different from the website checker?

They implement the same idea. The [checker](/checker/) is the hosted, browser-based version for a quick lookup; `@agentmarkup/audit` is the command-line version for local runs, scripting, and CI, with a non-zero exit code on provable errors.
