We ran 500 of America's biggest companies through an AI-crawler audit
America's biggest companies built their websites for Google's crawler. Almost none have built anything for the AI agents now reading the web on people's behalf. We fetched 500 corporate homepages the way ChatGPT, Claude, and Perplexity do. Most serve readable HTML, but the layer that makes a page reliably machine-legible is mostly missing, and seven serve a crawler nothing but a blank page.
The AI-era layer is almost empty
These are the signals that let an AI agent parse a page with confidence, point back to it, and respect how you want it used. Adoption falls off a cliff.
[Interactive chart: one square per company, 370 in all, inked in if the company has the signal.]
It is not just what is missing
Some of what we found is not an empty field but a broken one: a defect provable from the response itself, the kind that fails a CI check.
And not one earned a clean bill of health. Every site tripped at least one check, most often the missing llms.txt; 27 tripped a hard, build-breaking error.
The pages are readable. The signals are missing.
Can an AI just read the raw HTML? For most of these sites, yes: 94% server-render real content and 87% publish a sitemap, so a crawler can reach the words on the page. That is table stakes, and these companies clear it. Crawlability was never the hard part.
The gap is everything that turns readable text into reliable machine input: structured data that declares what a page is, an llms.txt that points an agent at the canonical summary, and Content-Signal headers that state how the content may be used. Those are mostly absent. And for seven companies even the baseline fails: the homepage is an empty JavaScript shell, so a crawler that does not run JS sees nothing at all.
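To make the "empty JavaScript shell" and "structured data" ideas concrete, here is a minimal sketch in Python of how a non-rendering crawler might look at a page. It is not the audit's implementation: the `SignalScanner` class, the `scan` function, and the 200-character `min_chars` threshold are all assumptions chosen for illustration.

```python
import json
from html.parser import HTMLParser


class SignalScanner(HTMLParser):
    """Collects visible text length and JSON-LD types from static HTML (no JS is run)."""

    # Tags whose contents a reader never sees as page text.
    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._stack = []        # currently open SKIP tags
        self._in_jsonld = False
        self._jsonld_buf = []
        self.visible_chars = 0
        self.jsonld_types = []
        self.jsonld_errors = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._stack.append(tag)
            script_type = (dict(attrs).get("type") or "").lower()
            if tag == "script" and script_type == "application/ld+json":
                self._in_jsonld = True
                self._jsonld_buf = []

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if tag == "script" and self._in_jsonld:
                self._in_jsonld = False
                self._record_jsonld("".join(self._jsonld_buf))

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)
        elif not self._stack:
            self.visible_chars += len(data.strip())

    def _record_jsonld(self, raw):
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            self.jsonld_errors += 1   # a broken field, not just an empty one
            return
        for item in doc if isinstance(doc, list) else [doc]:
            if isinstance(item, dict) and "@type" in item:
                self.jsonld_types.append(item["@type"])


def scan(html, min_chars=200):
    """min_chars is an arbitrary illustrative cutoff, not a value from the audit."""
    scanner = SignalScanner()
    scanner.feed(html)
    scanner.close()
    return {
        "empty_shell": scanner.visible_chars < min_chars,
        "jsonld_types": scanner.jsonld_types,
        "jsonld_errors": scanner.jsonld_errors,
    }
```

A page whose body is only a mount point such as `<div id="root"></div>` plus a script tag would come back with `empty_shell` set, while a server-rendered page with an `Organization` JSON-LD block would report that type.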
None of this is exotic to fix. These are static files and a few tags, the AI-era equivalent of the sitemap every one of these companies already ships. The winners are the ones who add them first. A few already have: Target, NVIDIA, Adobe, American Express, and Dell all serve a valid llms.txt today.
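As a rough illustration of how small these artifacts are, the Python sketch below generates a minimal llms.txt and an `Organization` JSON-LD tag. The layout follows the public llms.txt proposal as we understand it (a title heading, a quoted summary, then sections of links); the function names, the "Docs" section title, and the example.com values are hypothetical, and the audit's own validity rules may be stricter.

```python
import json


def build_llms_txt(name, summary, links):
    """Build a minimal llms.txt body.

    links is a list of (title, url, description) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", "", "## Docs", ""]
    for title, url, description in links:
        lines.append(f"- [{title}]({url}): {description}")
    return "\n".join(lines) + "\n"


def build_org_jsonld(name, url):
    """Build a schema.org Organization block as a script tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
    }
    # Escape "</" so the JSON can never close the script tag early.
    payload = json.dumps(data).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Corp",
        "Hypothetical company used for illustration.",
        [("About", "https://example.com/about", "Company overview")],
    ))
    print(build_org_jsonld("Example Corp", "https://example.com"))
```

Both outputs are plain strings, so they can be written to disk at build time alongside the sitemap.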
How we measured (and what we are not claiming)
- We audited 500 of the largest US public companies in July 2026 with @agentmarkup/audit: one browser fetch as a baseline, then fetches as GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended, plus deterministic structured-data, llms.txt, robots, and sitemap checks.
- We ran from a single IP. 130 sites challenged even the browser fetch, which is an unknown state, not a failure, so we discarded them entirely. Every number above is over the 370 sites we could read cleanly.
- A crawler user-agent getting a different response is ambiguous (a firewall rule versus IP allowlisting), so we treat it as a warning, never a "they block AI" accusation. The headline numbers are the unambiguous, provable ones.
- Every finding is reproducible in one command against any site.
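The rules above for challenged and differing responses can be sketched as a small classifier. This is an illustrative Python sketch, not the code in @agentmarkup/audit: the `Fetch` record, the set of status codes treated as a challenge, and the 50% body-size `tolerance` are assumptions.

```python
from dataclasses import dataclass

# Assumption: these statuses stand in for "the site challenged the request".
CHALLENGE_STATUSES = {403, 429, 503}


@dataclass
class Fetch:
    status: int    # HTTP status code of the response
    body_len: int  # length of the response body in bytes


def classify(baseline, crawler_fetches, tolerance=0.5):
    """Compare crawler fetches against the browser baseline.

    baseline is the browser Fetch; crawler_fetches maps a crawler name
    (for example "GPTBot") to its Fetch. tolerance is a hypothetical cutoff:
    a crawler body shorter than tolerance * baseline counts as different.
    """
    if baseline.status in CHALLENGE_STATUSES:
        # Even the browser was challenged: an unknown state, so the site is
        # excluded from the counts rather than scored as a failure.
        return {"state": "unknown", "warnings": []}

    warnings = []
    for name, fetch in crawler_fetches.items():
        differs = (
            fetch.status != baseline.status
            or fetch.body_len < baseline.body_len * tolerance
        )
        if differs:
            # Ambiguous cause (firewall rule or IP allowlisting): warn only.
            warnings.append(name)
    return {"state": "measured", "warnings": warnings}
```

For example, a baseline of `Fetch(200, 10000)` with a `ClaudeBot` fetch of `Fetch(403, 500)` yields a warning for ClaudeBot, while a baseline of `Fetch(403, 0)` puts the whole site in the unknown state.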
Check your own site in one command
You do not have to take our word for any of this. Point the same audit at your homepage:
npx @agentmarkup/audit https://yourdomain.com
Prefer a browser? Run the hosted website checker. When it finds gaps, the llms.txt, JSON-LD, and AI crawler guides show how to close them at build time, and the audit guide explains every check.