seo-tools · May 22, 2026 · 5 min read

How to Extract JSON-LD Schema & Meta Tags at Scale in 2026

A technical-SEO guide to pulling JSON-LD/Schema.org, meta tags, OpenGraph and Twitter Cards from any URL list — for audits, schema validation, AEO research and AI datasets.

If you do technical SEO, you already know the structured-data problem: you can’t fix what you can’t see across thousands of pages. Google’s Rich Results Test checks one URL at a time. Your CMS swears every product page emits valid Product schema, but you’ve been burned before. And in 2026, with answer engines and AI overviews leaning hard on Schema.org markup, “do we have clean structured data sitewide” has gone from a nice-to-have to something that affects how your pages get surfaced.

This guide covers how to extract JSON-LD/Schema.org blocks, meta tags, OpenGraph and Twitter Cards from an arbitrary URL list at scale — and what to do with the output once you have it.

What gets extracted per URL

The extractor described in this guide is HTTP-only: it fetches the raw HTML and parses it, with no JavaScript execution. For each URL it emits one normalized record containing:

JSON-LD / Schema.org — every <script type="application/ld+json"> block on the page, safe-parsed (a malformed block on one page doesn’t kill the run), with the parsed Schema.org objects preserved.
Core meta — the HTML <title> and <meta name="description">.
Open Graph — all og:* properties (og:title, og:image, og:type, og:url, …) flattened into key-value pairs.
Twitter Cards — all twitter:* properties (twitter:card, twitter:image, …).
Timing — a scrape timestamp for drift tracking.

Because it normalizes OpenGraph and Twitter into flat key-value records and keeps the parsed JSON-LD as structured objects, the output is query-ready — you can filter for “pages where @type includes Product” or “pages missing og:image” without re-parsing HTML downstream.
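
To make the per-URL record concrete, here is a minimal sketch of the same kind of extraction using only Python’s standard library. It is not the extractor’s actual implementation: the names MetaExtractor and extract_record and the json_ld_errors field are assumptions made for illustration, and fetching the HTML is left out.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <title>, meta description, og:*, twitter:* and JSON-LD blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta_description = None
        self.open_graph = {}
        self.twitter = {}
        self.json_ld = []
        self.json_ld_errors = []  # hypothetical field: one message per unparseable block
        self._in_title = False
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "script":
            if (attrs.get("type") or "").strip().lower() == "application/ld+json":
                self._in_json_ld = True
                self._buffer = []
        elif tag == "meta":
            # OG tags usually use property=, Twitter tags usually use name=.
            key = attrs.get("property") or attrs.get("name") or ""
            content = attrs.get("content")
            if content is None:
                return
            if key.startswith("og:"):
                self.open_graph[key] = content
            elif key.startswith("twitter:"):
                self.twitter[key] = content
            elif key.lower() == "description":
                self.meta_description = content

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError as exc:
                # Safe-parse: record the failure and keep going.
                self.json_ld_errors.append(str(exc))
                return
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_record(url, html):
    """Build one normalized record from already-fetched HTML."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "meta_description": parser.meta_description,
        "json_ld": parser.json_ld,
        "json_ld_errors": parser.json_ld_errors,
        "open_graph": parser.open_graph,
        "twitter": parser.twitter,
        "scraped_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

A malformed block lands in json_ld_errors instead of raising, so one bad page does not stop a sweep and the parse failure itself stays visible in the output.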

Why HTTP-only is the right call here

A lot of scrapers reach for a headless browser by reflex. For structured-data extraction that’s usually wrong:

JSON-LD is in the source HTML. Schema.org markup is almost always server-rendered into a <script type="application/ld+json"> tag, so crawlers that don’t run JavaScript can still read it. You rarely need JS execution to see it.
HTTP-only is fast and cheap. No Chromium, no render wait. You can sweep a 50,000-URL sitemap with high concurrency in a fraction of the time and cost a browser would take.
Proxy-ready when you need it. Some sites rate-limit aggressive crawls; the extractor supports proxy rotation so a sitemap-wide sweep doesn’t get throttled.

The one caveat: a minority of SPAs inject JSON-LD client-side after hydration. Those pages will show empty structured data in an HTTP fetch. If you’re auditing a known JS-heavy site, spot-check a few URLs in a browser first to confirm the markup is server-rendered.

A realistic input/output shape

You feed it a URL list — a sitemap export, a crawl frontier, a list of competitor pages — and it returns one record per URL:

{
  "url": "https://example.com/products/trail-runner-x",
  "title": "Trail Runner X — Lightweight Trail Shoe | Example",
  "meta_description": "The Trail Runner X is a 240g trail shoe built for...",
  "json_ld": [
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Trail Runner X",
      "brand": { "@type": "Brand", "name": "Example" },
      "offers": {
        "@type": "Offer",
        "price": "149.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      },
      "aggregateRating": { "@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "212" }
    }
  ],
  "open_graph": {
    "og:title": "Trail Runner X",
    "og:type": "product",
    "og:image": "https://example.com/img/trx.jpg",
    "og:url": "https://example.com/products/trail-runner-x"
  },
  "twitter": {
    "twitter:card": "summary_large_image",
    "twitter:image": "https://example.com/img/trx.jpg"
  },
  "scraped_at": "2026-05-22T09:14:00Z"
}
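
Records in this shape can be queried directly. The sketch below, in Python, assumes the records have been loaded as a list of dicts with the field names shown above; the helper names schema_types and products_missing_og_image are hypothetical. It also assumes a JSON-LD block may wrap its entities in "@graph" and that "@type" may be a string or a list, both of which occur in real markup.

```python
def schema_types(record):
    """Return the set of @type values found in a record's JSON-LD blocks."""
    types = set()
    for block in record.get("json_ld", []):
        if not isinstance(block, dict):
            continue
        # Some sites wrap several entities in a single "@graph" array.
        graph = block.get("@graph")
        nodes = graph if isinstance(graph, list) else [block]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            value = node.get("@type")
            if isinstance(value, str):
                types.add(value)
            elif isinstance(value, list):
                types.update(v for v in value if isinstance(v, str))
    return types


def products_missing_og_image(records):
    """URLs of pages that declare Product schema but have no og:image."""
    return [
        r["url"]
        for r in records
        if "Product" in schema_types(r)
        and not r.get("open_graph", {}).get("og:image")
    ]
```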

▶ Run the JSON-LD Schema & Meta Tag Extractor — feed it a URL list, get back parsed Schema.org objects, meta description, OpenGraph and Twitter Cards as clean JSON. HTTP-only, high concurrency, proxy-ready for big sweeps.

What to do with the output

The records become useful the moment you run aggregate queries against them:

Schema coverage audits — group by @type and count. Instantly see how many product pages are missing Product schema, how many articles lack Article, which sections have zero structured data.
Validation / QA before publish — flag JSON-LD blocks missing required properties (a Product with no offers, a Recipe with no recipeIngredient). Catch it in CI, not in Search Console three weeks later.
Schema drift monitoring — re-run on a schedule and diff against last week’s snapshot. A deploy that silently dropped your BreadcrumbList markup shows up as a regression.
Social-preview QA — find pages with a missing or broken og:image, wrong twitter:card type, or absent og:description — the stuff that makes your links look broken when shared.
Competitor / AEO research — sweep top-ranking competitor pages and reverse-engineer their schema strategy. With AI overviews favoring well-marked-up entities, knowing which @types your competitors deploy is genuine intel.
Knowledge-graph / AI dataset building — extract Organization, Person, Product, Event entities at scale for enrichment pipelines or RAG corpora that prefer clean, schema-backed data over scraped prose.
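
The coverage audit and the drift check above both reduce to a few lines once the records are in memory. This is a rough Python sketch, assuming each snapshot is a list of records shaped like the sample output; the function names iter_types, type_coverage, pages_without_schema and schema_drift are hypothetical.

```python
from collections import Counter


def iter_types(record):
    """Yield every string @type in a record, looking inside "@graph" if present."""
    for block in record.get("json_ld", []):
        if not isinstance(block, dict):
            continue
        graph = block.get("@graph")
        for node in graph if isinstance(graph, list) else [block]:
            if not isinstance(node, dict):
                continue
            value = node.get("@type")
            for t in [value] if isinstance(value, str) else value or []:
                if isinstance(t, str):
                    yield t


def type_coverage(records):
    """Count pages per @type; a page counts once per type it carries."""
    counts = Counter()
    for record in records:
        counts.update(set(iter_types(record)))
    return counts


def pages_without_schema(records):
    """URLs whose record contains no parsed JSON-LD at all."""
    return [r["url"] for r in records if not r.get("json_ld")]


def schema_drift(previous, current):
    """Per-URL @types lost or gained between two snapshots of the same URL list."""
    before = {r["url"]: set(iter_types(r)) for r in previous}
    drift = {}
    for record in current:
        url = record["url"]
        if url not in before:
            continue  # new URL: nothing to compare against
        now = set(iter_types(record))
        lost, gained = before[url] - now, now - before[url]
        if lost or gained:
            drift[url] = {"lost": sorted(lost), "gained": sorted(gained)}
    return drift
```

A deploy that drops BreadcrumbList from a page shows up as that type in the page’s "lost" list, which is easy to turn into an alert or a failed scheduled job.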

Cost math

The extractor runs as an Apify actor and is priced per dataset item: you pay for the records you keep, and no proxy bandwidth is included unless you opt into proxying. Because it’s HTTP-only with high concurrency, throughput is high and per-URL compute is tiny.

For a concrete sense of scale: auditing a 10,000-URL sitemap is a single run that completes quickly and costs on the order of a few dollars at per-item pricing. A weekly schema-drift sweep of the same site costs the same amount each week, which is cheap for monitoring infrastructure. Compare that with pasting URLs into a single-URL validator by hand, which doesn’t scale past a few dozen pages and produces no machine-readable history.

Common pitfalls

Empty JSON-LD on SPA pages. As noted, client-side-injected markup won’t appear in an HTTP fetch. Confirm server-rendering before trusting a “0 schema blocks” result.
Multiple JSON-LD blocks per page. Pages frequently have several (e.g. Organization + BreadcrumbList + Product). The output is an array — don’t assume index 0 is the one you care about; filter by @type.
Malformed JSON-LD is common. A trailing comma or unescaped quote in one block is normal in the wild. The safe-parse means the run survives it, but treat “block failed to parse” as itself a finding worth reporting.
OG vs JSON-LD title mismatch. og:title, <title>, and the JSON-LD name often disagree. That’s a real SEO smell — surface all three rather than picking one.
Don’t confuse presence with correctness. A page having Product schema doesn’t mean it’s valid. Layer a required-properties check on top of the extraction.
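
Two of these pitfalls, presence versus correctness and title mismatch, can be checked mechanically. The Python sketch below assumes records shaped like the sample output. The REQUIRED table is illustrative only, built from the examples in this guide rather than from any official requirements list, and the function names are hypothetical.

```python
# Illustrative rules only; replace with the properties your own audit requires.
REQUIRED = {
    "Product": ["name", "offers"],
    "Recipe": ["name", "recipeIngredient"],
}


def iter_nodes(record):
    """Yield each top-level JSON-LD entity, unwrapping "@graph" when present."""
    for block in record.get("json_ld", []):
        if not isinstance(block, dict):
            continue
        graph = block.get("@graph")
        for node in graph if isinstance(graph, list) else [block]:
            if isinstance(node, dict):
                yield node


def missing_required(record, rules=REQUIRED):
    """Return (type, property) pairs for required properties that are absent or empty."""
    findings = []
    for node in iter_nodes(record):
        value = node.get("@type")
        types = [value] if isinstance(value, str) else value or []
        for t in types:
            if not isinstance(t, str):
                continue
            for prop in rules.get(t, []):
                if node.get(prop) in (None, "", [], {}):
                    findings.append((t, prop))
    return findings


def title_variants(record):
    """Surface <title>, og:title and JSON-LD names side by side for comparison."""
    return {
        "title": record.get("title"),
        "og:title": record.get("open_graph", {}).get("og:title"),
        "json_ld_names": [
            n["name"] for n in iter_nodes(record) if isinstance(n.get("name"), str)
        ],
    }
```

Run against a full sweep, an empty findings list for a page means it passed the rules in the table, not that the markup is valid in every respect.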

Wrapping up

Structured-data auditing is a problem of scale and repeatability, not of any single hard-to-fetch page: the hard part is doing it across thousands of URLs, on a schedule, with output you can query and diff. An HTTP-only extractor that normalizes JSON-LD, meta, OpenGraph and Twitter Cards into one clean record per URL fits that job well. It is cheap to run, fast across big sitemaps, and easy to wire into a CI or monitoring loop.

▶ Open the JSON-LD & Meta Tag Extractor on Apify — bulk schema + metadata extraction, per-item pricing, proxy-ready. Run a sitemap-wide audit on Apify’s free monthly credit.