# We ran 500 of America's biggest companies through an AI-crawler audit

> We fetched 500 corporate homepages the way ChatGPT, Claude, and Perplexity do. Most serve readable HTML, but 46% have no usable structured data, 86% have no llms.txt, and seven serve crawlers a blank page. Built for Google, not yet for AI agents.

Source: https://agentmarkup.dev/blog/ai-crawler-audit-500-companies/

By [Sebastian Cochinescu](/authors/sebastian-cochinescu/) · July 2, 2026 · 5 min read


America's biggest companies built their websites for Google's crawler. Almost none have built anything for the AI agents now reading the web on people's behalf. We fetched 500 corporate homepages the way ChatGPT, Claude, and Perplexity do. Most serve readable HTML, but the layer that makes a page reliably machine-legible is mostly missing, and seven serve a crawler nothing but a blank page.

- **Structured data:** 46% have no usable JSON-LD, so nothing machine-readable says what the page is.
- **Discovery:** 86% have no llms.txt, the emerging AI-discovery file.
- **Usage rules:** 99% set no AI-usage signal (Content-Signal, the new opt-in standard).
- **Broken:** seven serve crawlers a blank page, with content hidden behind JavaScript they don't run.

## The AI-era layer is almost empty

These are the signals that let an AI agent parse a page with confidence, point back to it, and respect how you want it used. Adoption falls off a cliff from one signal to the next.

Of the 370 companies we could read cleanly, this many have each signal:

- **Structured data:** 201 / 370 (54%)
- **llms.txt:** 50 / 370 (14%)
- **Content-Signal:** 3 / 370 (under 1%)

## It is not just what is missing

Some of what we found is not an empty field but a broken one: a defect provable from the response itself, the kind that fails a CI check.

- **robots.txt:** disallow an AI crawler (block GPTBot or a peer outright)
- **Structured data:** ship broken JSON-LD (markup an agent cannot parse)
- **Bait and switch:** show crawlers less than a browser (the bot gets a thinner page than you)
- **Near-empty:** serve thin HTML (barely enough for a crawler to use)

And not one earned a clean bill of health. Every site tripped at least one check, most often the missing llms.txt; 27 tripped a hard, build-breaking error.

27 hard errors · 343 with warnings · 0 fully clean
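The "disallow an AI crawler" defect in this section is one of the easier ones to check yourself. The sketch below uses Python's standard `urllib.robotparser` to list which of the crawler tokens named in this article a given robots.txt blocks for the homepage. It is an illustration, not the logic `@agentmarkup/audit` uses; the `blocked_ai_crawlers` helper and the sample robots.txt are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in this article's methodology.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"]


def blocked_ai_crawlers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI crawler tokens that this robots.txt disallows for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


# Hypothetical robots.txt: blocks GPTBot outright, allows everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_ai_crawlers(SAMPLE))  # ['GPTBot']
```

Note that `urllib.robotparser` is one interpretation of robots.txt matching; other parsers can disagree on edge cases, so treat a result like this as a first pass.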

## The pages are readable. The signals are missing.

Can an AI just read the raw HTML? For most of these sites, yes: 94% server-render real content and 87% publish a sitemap, so a crawler can reach the words on the page. That is table stakes, and these companies clear it. Crawlability was never the hard part.
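One rough way to approximate "server-renders real content" is to count the words a client that does not run JavaScript can read in the raw HTML. The sketch below does that with the standard library. The `min_words` threshold is an arbitrary assumption for illustration, not the cutoff the audit uses, and the two sample pages are made up.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <template>."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # how many skipped elements we are currently inside
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    """Count words readable without executing any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


def looks_like_empty_shell(html: str, min_words: int = 20) -> bool:
    """Flag pages with fewer than `min_words` readable words (assumed threshold)."""
    return visible_word_count(html) < min_words


# Hypothetical pages: a client-rendered shell and a server-rendered page.
SHELL = '<html><head><title>Acme</title><script>window.__APP__ = {};</script></head><body><div id="root"></div></body></html>'
RENDERED = "<html><body><h1>Acme makes anvils</h1><p>Shipping worldwide since 1949.</p></body></html>"

print(visible_word_count(SHELL), visible_word_count(RENDERED))  # 1 7
```

The script contents are ignored on purpose: they are bytes in the response, but not words a non-rendering crawler can use.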

The gap is everything that turns readable text into *reliable* machine input: [structured data](/docs/json-ld/) that declares what a page is, an [llms.txt](/docs/llms-txt/) that points an agent at the canonical summary, and Content-Signal headers that state how the content may be used. Those are mostly absent. And for seven companies even the baseline fails: the homepage is an empty JavaScript shell, so a crawler that does not run JS sees nothing at all.
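To make the structured-data part concrete, here is a minimal sketch that pulls every `<script type="application/ld+json">` block out of a page and checks whether it parses as JSON at all. It only separates parseable from broken markup; it does not validate against schema.org, and it is not necessarily how the audit defines "usable". The `audit_jsonld` helper and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self.in_jsonld = False
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.blocks[-1] += data


def audit_jsonld(html: str) -> dict[str, int]:
    """Count JSON-LD blocks that parse as JSON versus blocks that do not."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    result = {"valid": 0, "broken": 0}
    for raw in extractor.blocks:
        try:
            json.loads(raw)
            result["valid"] += 1
        except json.JSONDecodeError:
            result["broken"] += 1
    return result


# Hypothetical page: one well-formed block, one with a trailing comma.
PAGE = """
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization", "name": "Acme"}</script>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme",}</script>
"""

print(audit_jsonld(PAGE))  # {'valid': 1, 'broken': 1}
```

A page with no JSON-LD at all returns zero for both counts, which corresponds to the "missing" case rather than the "broken" one.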

None of this is exotic to fix. These are static files and a few tags, the AI-era equivalent of the sitemap every one of these companies already ships. The winners are the ones who add them first. A few already have: Target, NVIDIA, Adobe, American Express, and Dell all serve a valid `llms.txt` today.

## How we measured (and what we are not claiming)

- We audited 500 of the largest US public companies in July 2026 with [@agentmarkup/audit](https://www.npmjs.com/package/@agentmarkup/audit): one browser fetch as a baseline, then fetches as GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended, plus deterministic structured-data, llms.txt, robots, and sitemap checks.
- We ran from a single IP. 130 sites challenged even the browser fetch, which is an unknown state, not a failure, so we **discarded them entirely**. Every number above is over the **370** sites we could read cleanly.
- A crawler user-agent getting a different response is ambiguous (a firewall rule versus IP allowlisting), so we treat it as a **warning, never a "they block AI" accusation**. The headline numbers are the unambiguous, provable ones.
- Every finding is reproducible in one command against any site.
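The headline percentages follow directly from the counts reported in this post. As a quick sanity check, the arithmetic can be reproduced like this (only the variable and function names are introduced here; every number comes from the article):

```python
AUDITED = 500      # companies fetched
CHALLENGED = 130   # discarded: the browser fetch itself was challenged
READABLE = AUDITED - CHALLENGED  # 370 sites behind every number in the post

# Sites that HAVE each signal, as reported above.
HAS_SIGNAL = {"Structured data": 201, "llms.txt": 50, "Content-Signal": 3}


def percent(count: int, total: int = READABLE) -> float:
    """Share of the readable sites, as a percentage."""
    return 100 * count / total


for name, count in HAS_SIGNAL.items():
    have = percent(count)
    print(f"{name}: {count}/{READABLE} = {have:.1f}% have it, {100 - have:.1f}% do not")

# The error and warning buckets account for every readable site.
print(27 + 343 + 0 == READABLE)
```

Rounded, the "do not" shares are the 46%, 86%, and 99% quoted at the top of the post.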

## Check your own site in one command

You do not have to take our word for any of this. Point the same audit at your homepage:

```
npx @agentmarkup/audit https://yourdomain.com
```

Prefer a browser? Run the hosted [website checker](/checker/). When it finds gaps, the [llms.txt](/docs/llms-txt/), [JSON-LD](/docs/json-ld/), and [AI crawler](/docs/ai-crawlers/) guides show how to close them at build time, and the [audit guide](/docs/audit/) explains every check.

## Make your website machine-readable

agentmarkup is an open-source build-time toolkit for Vite, Astro, Next.js, and Nuxt (plus a framework-agnostic CLI) that generates llms.txt, injects JSON-LD structured data, creates optional markdown mirrors from final HTML when raw pages need a cleaner agent-facing fetch path, manages AI crawler robots.txt rules, patches optional Content-Signal and canonical mirror headers, and validates everything at build time. Zero runtime cost.

```
pnpm add -D @agentmarkup/vite # or @agentmarkup/astro, @agentmarkup/next, @agentmarkup/nuxt, @agentmarkup/cli
```

Written by [Sebastian Cochinescu](/authors/sebastian-cochinescu/), developer of agentmarkup and founder of Anima Felix. Builder of developer tools for machine-readable websites.

