By Sebastian Cochinescu · July 2, 2026 · 5 min read

We ran 500 of America's biggest companies through an AI-crawler audit

America's biggest companies built their websites for Google's crawler. Almost none have built anything for the AI agents now reading the web on people's behalf. We fetched 500 corporate homepages the way ChatGPT, Claude, and Perplexity do. Most serve readable HTML, but the layer that makes a page reliably machine-legible is mostly missing, and seven serve a crawler nothing but a blank page.

Structured data: 46% have no usable JSON-LD (nothing machine-readable saying what the page is)

Discovery: 86% have no llms.txt (the emerging AI-discovery file)

Usage rules: 99% set no AI-usage signal (Content-Signal, the new opt-in standard)

Broken: 7 serve crawlers a blank page (content hidden behind JavaScript they don't run)

The AI-era layer is almost empty

These are the signals that let an AI agent parse a page with confidence, point back to it, and respect how you want it used. Adoption falls off a cliff from one signal to the next.

Counts are out of the 370 companies we could read cleanly.

Structured data: 201 / 370 · 54%

llms.txt: 50 / 370 · 14%

Content-Signal: 3 / 370 · under 1%


It is not just what is missing

Some of what we found is not an empty field but a broken one: a defect provable from the response itself, the kind that fails a CI check.

robots.txt: disallow an AI crawler (block GPTBot or a peer outright)

Structured data: ship broken JSON-LD (markup an agent cannot parse)

Bait and switch: show crawlers less than a browser (the bot gets a thinner page than you)

Near-empty: serve thin HTML (barely enough for a crawler to use)

And not one earned a clean bill of health. Every site tripped at least one check, most often the missing llms.txt; 27 tripped a hard, build-breaking error.

27 hard errors · 343 with warnings · 0 fully clean
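To make the robots.txt finding concrete, here is a minimal sketch of how an outright block can be detected from the file alone. It uses Python's standard `urllib.robotparser`; the function name `blocked_ai_crawlers` and the bare user-agent tokens are assumptions for illustration, not the audit tool's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in the methodology section; real crawlers send longer
# user-agent strings, so these bare tokens are a simplification.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"]


def blocked_ai_crawlers(robots_txt, url="https://example.com/"):
    """Return the AI crawlers that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
print(blocked_ai_crawlers(sample))
```

Because the verdict depends only on the text of robots.txt, a check like this is deterministic, which is what makes it suitable for a CI gate.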

The pages are readable. The signals are missing.

Can an AI just read the raw HTML? For most of these sites, yes: 94% server-render real content and 87% publish a sitemap, so a crawler can reach the words on the page. That is table stakes, and these companies clear it. Crawlability was never the hard part.

The gap is everything that turns readable text into reliable machine input: structured data that declares what a page is, an llms.txt that points an agent at the canonical summary, and Content-Signal headers that state how the content may be used. Those are mostly absent. And for seven companies even the baseline fails: the homepage is an empty JavaScript shell, so a crawler that does not run JS sees nothing at all.
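As an illustration of the structured-data check, the sketch below pulls every JSON-LD block out of a page and separates the ones that parse from the ones that do not. It relies only on Python's standard library; the names `JsonLdCollector` and `audit_json_ld` are hypothetical, and the audit tool may define "usable" more strictly than "parses as JSON".

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of each <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def audit_json_ld(html):
    """Return (parsed_blocks, broken_count) for the JSON-LD found in html."""
    collector = JsonLdCollector()
    collector.feed(html)
    parsed, broken = [], 0
    for raw in collector.blocks:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            broken += 1  # markup is present, but an agent cannot parse it
    return parsed, broken
```

A page with no blocks at all returns `([], 0)`, which corresponds to the "no usable JSON-LD" case, while a non-zero `broken` count corresponds to the "ship broken JSON-LD" case.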

None of this is exotic to fix. These are static files and a few tags, the AI-era equivalent of the sitemap every one of these companies already ships. The winners are the ones who add them first. A few already have: Target, NVIDIA, Adobe, American Express, and Dell all serve a valid llms.txt today.
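What "valid" means for an llms.txt is not spelled out here, so the following is only a rough sketch under two assumptions: the file should open with a markdown H1 title, as the llms.txt proposal requires, and an HTML page served at that path (a common soft 404) should not count. The function name `check_llms_txt` is hypothetical.

```python
def check_llms_txt(text):
    """Return a list of problems found in the body of an llms.txt response."""
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    if not non_empty:
        return ["empty file"]

    problems = []
    first = non_empty[0]
    # Assumption: the first non-empty line must be a markdown H1 title.
    if not first.startswith("# "):
        problems.append("first line is not an H1 title")
    # Assumption: a 200 response containing HTML is a soft 404, not an llms.txt.
    if first.lower().startswith(("<!doctype", "<html")):
        problems.append("looks like an HTML page, not markdown")
    return problems


example = "# Acme\n\n> Acme sells anvils.\n\n## Docs\n\n- [Catalog](https://example.com/catalog.md): all products\n"
print(check_llms_txt(example))
```

An empty list means the file passed both checks; anything stricter, such as validating the linked URLs, is left out of this sketch.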

How we measured (and what we are not claiming)

We audited 500 of the largest US public companies in July 2026 with @agentmarkup/audit: one browser fetch as a baseline, then fetches as GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended, plus deterministic structured-data, llms.txt, robots, and sitemap checks.
We ran from a single IP. 130 sites challenged even the browser fetch, which is an unknown state, not a failure, so we discarded them entirely. Every number above is over the 370 sites we could read cleanly.
A crawler user-agent getting a different response is ambiguous (a firewall rule versus IP allowlisting), so we treat it as a warning, never a "they block AI" accusation. The headline numbers are the unambiguous, provable ones.
Every finding is reproducible in one command against any site.
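The browser-versus-crawler comparison described above can be approximated in a few lines. This is a hedged sketch, not the audit's real logic: `fetch_as`, `word_count`, and `compare_responses` are hypothetical names, and the thresholds (`thin_words=50`, `ratio=0.5`) are arbitrary values chosen for illustration.

```python
import urllib.request
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects words outside <script> and <style>, roughly what a non-JS crawler reads."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def word_count(html):
    parser = VisibleText()
    parser.feed(html)
    return len(parser.words)


def fetch_as(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent header and return the decoded body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def compare_responses(browser_html, crawler_html, thin_words=50, ratio=0.5):
    """Flag blank, thin, or reduced crawler responses relative to the browser baseline."""
    browser_words = word_count(browser_html)
    crawler_words = word_count(crawler_html)
    findings = []
    if crawler_words == 0:
        findings.append("blank")
    elif crawler_words < thin_words:
        findings.append("thin")
    if browser_words and crawler_words < browser_words * ratio:
        findings.append("crawler-sees-less")
    return findings
```

In use, one would call `fetch_as` once with a browser user-agent string and once per crawler, then pass the two bodies to `compare_responses`. As noted above, a "crawler-sees-less" result is ambiguous on its own and is better treated as a warning than as proof of blocking.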

Check your own site in one command

You do not have to take our word for any of this. Point the same audit at your homepage:

npx @agentmarkup/audit https://yourdomain.com

Prefer a browser? Run the hosted website checker. When it finds gaps, the llms.txt, JSON-LD, and AI crawler guides show how to close them at build time, and the audit guide explains every check.
