ai · Jun 1, 2026 · 5 min read

How to Convert an Entire Website to Clean Markdown for RAG in 2026

A practical guide to crawling a site and extracting boilerplate-free Markdown and plain text for LLM training, RAG pipelines, embeddings and AI agents — one row per page.

Every RAG pipeline starts with the same unglamorous problem: getting clean text out of HTML. Raw web pages are full of navigation menus, cookie banners, footers, ad slots, and script tags, all of which pollute embeddings and waste context-window tokens when fed to an LLM. What you actually want is the main content of each page, stripped of boilerplate and converted to Markdown that preserves structure (headings, lists, links) while throwing away the chrome. This guide covers how a content-extraction crawl works, why Markdown beats raw HTML for AI, and the per-page economics of converting a whole site.

What’s worth extracting

For a knowledge base or training set, you want the substance of each page and just enough metadata to chunk and cite it:

Cleaned plain text — boilerplate-free body content, ready for embedding.
Converted Markdown — the same content with headings, lists, and links preserved, so structure survives into the LLM context.
Optional cleaned main-content HTML — when you need the markup, minus the chrome.
Page address: the URL the page was fetched from, essential for citations and dedup.
Title and meta description.
Primary heading and language.
Canonical link: the URL the page itself declares as canonical.
Word count — for chunking decisions and quality filtering.

The defining feature is boilerplate removal. The crawler isolates the primary article/main content and discards navigation, headers, footers, ads, and scripts — the difference between feeding your model a clean document and feeding it a page’s worth of menu links.
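As a rough illustration of what boilerplate removal involves, the sketch below drops any text that sits inside common chrome elements, using only Python's standard library. It is a simple tag blocklist assumed for illustration, not the crawler's actual extraction logic, and the names `SKIP_TAGS`, `MainTextExtractor`, and `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser

# Elements treated as chrome in this sketch (an assumption, not the crawler's real rule set).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any element listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # how many blocklisted elements we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside blocklisted elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Getting Started</h1><p>Set up Example in five minutes.</p></main>"
    "<footer>Example Inc.</footer><script>var x = 1;</script></body></html>"
)
print(extract_main_text(page))
```

A blocklist this blunt also discards a `<header>` nested inside an article, which is one way real content gets lost to over-aggressive stripping.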

Why Markdown beats raw HTML and raw text for AI

It’s tempting to just strip all tags and embed the plain text. Don’t. Markdown is the sweet spot for LLM input for concrete reasons:

Structure survives. A heading stays a heading (## Pricing), a list stays a list. The model uses that structure to understand the document. Flatten it to plain text and you lose the hierarchy that tells the model what’s a section and what’s a bullet.
It’s token-efficient. Markdown carries structure with far fewer tokens than HTML’s tag soup. You fit more real content per context window.
Links are preserved cleanly. Relative links and image URLs are rewritten to absolute, so a chunk pulled into a RAG answer still has working references.
It chunks well. Heading boundaries are natural chunk boundaries. Clean Markdown makes a chunker’s job easy.

The crawler emits both Markdown and plain text, so you pick per pipeline — Markdown for structure-aware chunking and LLM context, plain text for pure embedding where structure doesn’t matter.
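To make the chunking point concrete, here is a minimal sketch of splitting a page's Markdown on heading boundaries. The function name `chunk_by_headings` and the shape of the returned dictionaries are assumptions for illustration; it handles only `#`-style headings and skips lines inside fenced code blocks.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading-delimited section."""
    chunks = []
    current = {"heading": None, "level": 0, "lines": []}
    in_fence = False

    def flush(section):
        # Keep a section if it has a heading or any non-blank content.
        if section["heading"] is not None or any(l.strip() for l in section["lines"]):
            chunks.append({
                "heading": section["heading"],
                "level": section["level"],
                "text": "\n".join(section["lines"]).strip(),
            })

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A "#" line inside a code fence is a comment, not a heading.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    flush(current)
    return chunks
```

Each chunk keeps its heading and level, so a downstream step can prepend the heading to the chunk text or merge short sections before embedding.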

▶ Run the Website to Markdown & Text Crawler — crawls a whole site and returns boilerplate-free Markdown and plain text, one row per page, with absolute links and full metadata. Ready for RAG, embeddings and fine-tuning. No login, no browser.

How the crawl works

From a single seed URL the crawler discovers the whole site by following internal links across the domain. On each page it isolates the main content, removes the chrome, rewrites relative links and image URLs to absolute URLs, converts to Markdown, and emits a per-page record. It’s a high-concurrency HTTP crawler with no headless browser, which is what lets it turn a 1,000-page docs site into a clean corpus in one run.
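Two of those steps, rewriting links to absolute form and deciding which links count as internal, can be sketched with the standard library's `urllib.parse`. This is an illustrative approximation rather than the crawler's implementation: the helper names are hypothetical, the regex covers only inline Markdown links and images, and "internal" is assumed to mean an exact host match with the seed URL.

```python
import re
from urllib.parse import urljoin, urlparse

# Captures the opening of an inline link or image, then its target: [text](target) or ![alt](target)
LINK_TARGET_RE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)")


def absolutize_links(markdown: str, page_url: str) -> str:
    """Resolve every inline link or image target against the URL of the page it came from."""
    def resolve(match):
        return match.group(1) + urljoin(page_url, match.group(2))
    return LINK_TARGET_RE.sub(resolve, markdown)


def is_internal(link: str, seed_url: str) -> bool:
    """True when the link resolves to the same host as the seed URL."""
    return urlparse(urljoin(seed_url, link)).netloc == urlparse(seed_url).netloc


page_url = "https://docs.example.com/guide/intro"
md = "See [setup](../setup) and ![logo](/img/logo.png) or [ext](https://other.example.org/x)."
print(absolutize_links(md, page_url))
# See [setup](https://docs.example.com/setup) and ![logo](https://docs.example.com/img/logo.png) or [ext](https://other.example.org/x).
```

Links that already carry a scheme pass through `urljoin` unchanged, so external references survive as written.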

The no-browser design comes with a tradeoff: the crawler extracts only server-rendered content. Documentation sites, blogs, and help centers, the bread and butter of RAG knowledge bases, are almost always server-rendered, so this is the right tool for them. A fully client-rendered SPA would need a browser to render it first; know your target.

Output schema

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "meta_description": "Set up Example in five minutes.",
  "primary_heading": "Getting Started",
  "language": "en",
  "canonical": "https://docs.example.com/getting-started",
  "word_count": 640,
  "text": "Getting Started\n\nSet up Example in five minutes...",
  "markdown": "# Getting Started\n\nSet up Example in five minutes...",
  "crawled_at": "2026-06-01T09:00:00Z"
}

One row per page maps cleanly to one document in your vector store. The url becomes your citation source; word_count drives your chunking; markdown is what you split and embed.
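A minimal sketch of that mapping, assuming rows shaped like the schema above. The document layout (`id`, `content`, `metadata`), the helper names, and the 500-word threshold are hypothetical and not tied to any particular vector store.

```python
import hashlib


def to_document(row: dict) -> dict:
    """Turn one crawled row into a generic document record."""
    # Prefer the canonical URL for the ID so alternate addresses of a page collapse together.
    source = row.get("canonical") or row["url"]
    return {
        "id": hashlib.sha256(source.encode("utf-8")).hexdigest()[:16],
        "content": row["markdown"],  # the field to split and embed
        "metadata": {
            "source": row["url"],  # citation target
            "title": row.get("title", ""),
            "language": row.get("language", ""),
            "word_count": row.get("word_count", 0),
            "crawled_at": row.get("crawled_at", ""),
        },
    }


def needs_chunking(row: dict, max_words: int = 500) -> bool:
    """Use word_count to decide whether a page should be split before embedding."""
    return row.get("word_count", 0) > max_words


row = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started",
    "language": "en",
    "canonical": "https://docs.example.com/getting-started",
    "word_count": 640,
    "markdown": "# Getting Started\n\nSet up Example in five minutes...",
    "crawled_at": "2026-06-01T09:00:00Z",
}
doc = to_document(row)
```

Deriving the ID from the URL keeps it stable across re-crawls, so a re-sync can overwrite existing entries instead of adding new ones.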

Use cases

Build a RAG knowledge base from documentation, blogs, and help centers — the most common reason teams run this.
Create LLM fine-tuning datasets by collecting high-quality web text at scale.
Feed AI agents and chatbots structured, current site content so answers reflect the live docs.
Migrate or archive a site into portable Markdown for content portability.
Populate a vector database with clean text for semantic search and embeddings.

Cost math

Pricing is pay-per-event: a small per-run start fee, nothing per result, one row per page. Corpus building is high-volume, since a knowledge base runs from hundreds to thousands of pages.

1,500-page docs site, one clean Markdown row per page.
One run, results free.
Cost is the Actor start plus HTTP compute.

Then compare the downstream economics: the embedding and LLM costs of your pipeline are driven by token count. Clean, boilerplate-free Markdown means you’re not paying to embed navigation menus and cookie banners 1,500 times over. The free-per-result crawl is cheap; the real saving is every downstream token you don’t spend on garbage. Re-crawling to keep the knowledge base fresh is likewise free per result, so you can re-sync on a schedule.
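The downstream arithmetic can be sketched in a few lines. Only the 1,500-page figure comes from the example above; the per-page token counts are made-up placeholders, not measurements, and the function name is hypothetical. Replace them with counts from your own tokenizer on a sample of pages.

```python
def embedding_token_savings(pages: int, raw_tokens_per_page: int, clean_tokens_per_page: int) -> dict:
    """Compare total tokens embedded with and without boilerplate."""
    raw_total = pages * raw_tokens_per_page
    clean_total = pages * clean_tokens_per_page
    saved = raw_total - clean_total
    return {
        "raw_total": raw_total,
        "clean_total": clean_total,
        "saved": saved,
        "saved_fraction": saved / raw_total if raw_total else 0.0,
    }


# Placeholder inputs: the per-page token counts below are illustrative assumptions.
estimate = embedding_token_savings(pages=1500, raw_tokens_per_page=3000, clean_tokens_per_page=900)
print(estimate["saved"], estimate["saved_fraction"])
```

Multiplying the saved token count by your embedding provider's per-token price gives the saving per full re-embed of the corpus.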

Common pitfalls

Client-rendered content. JS-only pages won’t yield body text to an HTTP crawl. Most docs/blogs are fine; SPAs are not.
Over-aggressive boilerplate stripping. Occasionally a page’s real content lives in a sidebar the extractor treats as chrome. Spot-check a sample before trusting the whole corpus.
Duplicate / near-duplicate pages. Versioned docs and paginated archives create duplicates that bloat your vector store. Dedup on a content hash before embedding; a hash only catches exact copies, so near-duplicates need a similarity check on top.
Mixed languages. The language field helps, but a multilingual site can mix languages within the corpus. Filter if your pipeline is monolingual.
Chunking after the fact. The crawler gives you clean per-page Markdown; it doesn’t chunk for you. Chunk on heading boundaries downstream using the Markdown structure.
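Two of these pitfalls, duplicates and mixed languages, can be handled in one pass over the rows before embedding. The sketch below assumes rows with the `text` and `language` fields from the output schema; the helper names and the normalization (collapse whitespace, lowercase) are illustrative choices, not part of the crawler.

```python
import hashlib
import re
from typing import Optional


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_rows(rows: list, language: Optional[str] = None) -> list:
    """Keep the first row for each distinct fingerprint, optionally filtering by language."""
    seen = set()
    kept = []
    for row in rows:
        if language and row.get("language") != language:
            continue  # monolingual pipeline: skip other languages
        fingerprint = content_fingerprint(row.get("text", ""))
        if fingerprint in seen:
            continue  # same content already kept under another URL
        seen.add(fingerprint)
        kept.append(row)
    return kept


rows = [
    {"url": "https://docs.example.com/v1/install", "language": "en", "text": "Install the CLI."},
    {"url": "https://docs.example.com/v2/install", "language": "en", "text": "Install  the CLI.\n"},
    {"url": "https://docs.example.com/de/install", "language": "de", "text": "CLI installieren."},
]
unique = dedup_rows(rows, language="en")
print([r["url"] for r in unique])
# ['https://docs.example.com/v1/install']
```

Pages that differ by a version number or a date will still pass through this filter, which is why it works best as a first pass.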

Wrapping up

Clean input is the whole game in RAG, and clean input starts with boilerplate-free Markdown. Crawl the site once into a per-page corpus, dedup, chunk on headings, and embed. With free per-result pricing you can convert a whole documentation site and re-sync it on a schedule — and every menu and footer you don’t embed is a downstream token you don’t pay for.

▶ Open the Website to Markdown & Text Crawler on Apify — boilerplate-free Markdown and plain text, one row per page, ready for RAG and embeddings. Start with Apify’s free monthly credit.