Build-time markdown mirrors for agent readability: how they compare to Cloudflare's approach
When an AI agent visits your website, it gets HTML. On some sites that is fine. On JS-heavy or layout-heavy pages, the content is buried in noise. Build-time markdown mirrors can give agents a cleaner fetch target without changing the canonical HTML page.
Not every site needs a markdown mirror
If your pages already ship substantial, well-structured HTML, the raw page may already be a good enough fetch target for agents. Markdown mirrors are most useful when the raw HTML is thin, heavily templated, or dominated by layout chrome.
That is the more honest framing for this feature: markdown mirrors are an optional machine-facing artifact for the pages that benefit from them, not a blanket rule that every site should publish a public .md companion for every page.
The problem: some HTML is a bad fetch target
Many agents can extract useful text from HTML, but the quality of the result still depends on what your raw response looks like. A typical web page can be heavy with navigation, cookie banners, analytics tags, scripts, and layout wrappers that have nothing to do with the main body content.
When the raw HTML is mostly shell and very little body content, fetch-based agents either miss the important text or have to guess too much. That is the case markdown mirrors try to fix.
What are markdown mirrors?
A markdown mirror is a .md file that contains the same content as your HTML page, but stripped of layout, navigation, and scripts. Just the content, in clean markdown format.
For example, /blog/my-post/index.html gets a companion file at /blog/my-post.md. An AI agent can fetch the markdown version directly instead of parsing the HTML.
Your pages also get a `<link rel="alternate" type="text/markdown">` tag in the HTML head, so crawlers can discover the markdown version automatically when you enable the feature.
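As a sketch of how an agent could use that discovery path, the helper below (hypothetical, not part of agentmarkup's API) first looks for the alternate link tag and falls back to the `/blog/my-post/` → `/blog/my-post.md` naming convention described above:

```typescript
// Hypothetical discovery helper, not part of agentmarkup.
// Prefers the <link rel="alternate" type="text/markdown"> tag;
// falls back to the /path/ -> /path.md convention.
// The regex assumes rel/type/href appear in that order -- a sketch, not robust parsing.
function discoverMarkdownUrl(html: string, pageUrl: string): string {
  const match = html.match(
    /<link[^>]*rel="alternate"[^>]*type="text\/markdown"[^>]*href="([^"]+)"/
  );
  if (match) return new URL(match[1], pageUrl).toString();
  // Convention fallback: strip the trailing slash and append .md
  const url = new URL(pageUrl);
  url.pathname = url.pathname.replace(/\/$/, "") + ".md";
  return url.toString();
}
```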
How agentmarkup generates markdown mirrors
Enable the feature in your config and it runs at build time on every HTML page in your output:
```ts
// vite.config.ts or astro.config.mjs
agentmarkup({
  site: 'https://example.com',
  name: 'My Site',
  markdownPages: {
    enabled: true,
  },
})
```

The converter:
- Extracts the page title, meta description, and canonical URL from the HTML head
- Finds the main content area (`<main>`, `<article>`, or `<body>`)
- Strips navigation, headers, footers, sidebars, scripts, styles, SVGs, and forms
- Converts headings, lists, links, bold, italic, code, and blockquotes to markdown syntax
- Preserves code blocks intact
- Normalizes whitespace and deduplicates the page title
- Injects a `<link rel="alternate">` tag into the HTML for discovery
The result is a clean markdown file that an agent can read without wading through layout chrome.
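The strip-then-convert idea can be illustrated with a deliberately naive sketch. This is not agentmarkup's implementation (which presumably walks the DOM); it is a regex-based toy that shows the shape of the pipeline: find the main content area, strip chrome elements, convert a few common tags, then drop the rest:

```typescript
// Naive illustration of the strip-then-convert pipeline.
// Regex-based for brevity; a real converter parses the DOM.
function htmlToMarkdownSketch(html: string): string {
  // Prefer <main>/<article> content, else fall back to the whole input.
  const main =
    html.match(/<(?:main|article)[^>]*>([\s\S]*?)<\/(?:main|article)>/)?.[1] ??
    html;
  return main
    // Strip layout chrome and non-content elements wholesale.
    .replace(/<(nav|header|footer|aside|script|style|svg|form)[\s\S]*?<\/\1>/g, "")
    // Convert the most common block and inline elements.
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/g, "# $1\n")
    .replace(/<h2[^>]*>([\s\S]*?)<\/h2>/g, "## $1\n")
    .replace(/<(?:strong|b)>([\s\S]*?)<\/(?:strong|b)>/g, "**$1**")
    .replace(/<p[^>]*>([\s\S]*?)<\/p>/g, "$1\n")
    // Drop any remaining tags and normalize whitespace.
    .replace(/<[^>]+>/g, "")
    .replace(/\n{2,}/g, "\n")
    .trim();
}
```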
Cloudflare's approach: runtime readability extraction
Cloudflare offers a readability extraction feature that strips HTML to readable content at request time. It is based on Mozilla's Readability library and runs on Cloudflare's edge network.
The key difference is runtime versus build time. Cloudflare processes pages on every request. You do not control the exact output. The extraction algorithm decides what is content and what is noise using heuristics.
Build-time vs runtime: why it matters
| | agentmarkup (build-time) | Cloudflare (runtime) |
|---|---|---|
| When it runs | Once, during build | Every request |
| Output control | You see the .md files in your build output | Opaque, algorithm decides |
| Consistency | Deterministic, same output every build | May vary with algorithm updates |
| Performance cost | Zero runtime cost | Added latency per request |
| Works with SPAs | Yes, uses noscript fallback or pre-rendered HTML | Depends on SSR availability |
| Discovery | Link tag in HTML head + static .md URL | Special URL parameter or header |
| Vendor lock-in | None, output is static files | Requires Cloudflare |
| Customization | Choose which pages, preserve existing .md files | All or nothing |
Why build-time can be a good fit for your own content
Cloudflare's runtime extraction makes sense for consuming other people's content, like a reader mode. For your own website, build-time generation can be a better fit because:
- You control the output. If the markdown is wrong, you can debug it. You see the actual .md files in your build directory.
- It works with client-rendered apps. agentmarkup checks for noscript fallback content in SPAs and uses it when the rendered body is thin. Runtime extractors often get empty content from JavaScript-rendered pages.
- No vendor dependency. The markdown files are static. Deploy them anywhere. They work on Cloudflare Pages, Netlify, Vercel, S3, or any static host.
- Integrated with the rest of the stack. Markdown mirrors work alongside llms.txt, JSON-LD, and robots.txt. One config, one build, everything consistent.
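The noscript fallback mentioned above can be sketched roughly as follows. The threshold and function name are assumptions for illustration, not agentmarkup's actual values:

```typescript
// Sketch of the SPA fallback idea: if the rendered body text is too thin,
// use the prerendered <noscript> content instead.
const MIN_BODY_CHARS = 200; // hypothetical cutoff, not agentmarkup's real value

function pickContentSource(bodyText: string, noscriptHtml: string | null): string {
  if (bodyText.trim().length >= MIN_BODY_CHARS || !noscriptHtml) {
    return bodyText;
  }
  // Fall back to the noscript markup, stripped of tags.
  return noscriptHtml.replace(/<[^>]+>/g, "").trim();
}
```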
How agentmarkup reduces the downside
Public markdown mirrors do create tradeoffs. The main risks are duplicate fetches, indexing ambiguity, and output drift if the markdown becomes a second source of truth.
agentmarkup tries to keep those risks contained by generating the mirrors from the final built HTML, preserving HTML as the canonical page, and pointing each .md file's canonical reference back to its HTML route. If your raw HTML is already substantial, you can also keep llms.txt pointing at HTML by setting `llmsTxt.preferMarkdownMirrors` to `false`.
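For example, building on the config shown earlier (the exact shape of the `llmsTxt` block is assumed from the option name):

```ts
// Keep llms.txt linking to HTML even with markdown mirrors enabled.
agentmarkup({
  site: 'https://example.com',
  name: 'My Site',
  markdownPages: { enabled: true },
  llmsTxt: {
    preferMarkdownMirrors: false, // llms.txt entries stay on the HTML pages
  },
})
```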
What the output looks like
For a blog post with a title, description, headings, and paragraphs, the generated markdown looks like:
```md
# Why llms.txt matters

> LLMs answer questions by synthesizing web content. llms.txt gives them a structured overview.

Source: https://example.com/blog/why-llms-txt-matters/

## The shift from search engines to AI answers

For two decades, the path to online visibility was clear: optimize for Google...

## What is llms.txt?

llms.txt is a proposed standard that gives LLMs a structured overview of your website...
```

Clean, readable, no HTML artifacts. An AI agent reading this file understands the page quickly.
Getting started
Add `markdownPages: { enabled: true }` to your agentmarkup config when your raw HTML needs a cleaner machine-facing fetch path. On the next build, every HTML page in your output gets a companion .md file. When markdown mirrors are enabled, same-site page entries in llms.txt also default to the generated markdown URLs so cold agents discover the cleaner fetch path first. Check the llms.txt guide for the opt-out if you want HTML-first links instead.
If your site already serves rich raw HTML, you do not need to treat markdown mirrors as mandatory. They are a tactical option, not the whole product.
```sh
pnpm add -D @agentmarkup/vite # or @agentmarkup/astro
```