logiover
social-media · May 23, 2026 · 5 min read

How to Scrape Substack Newsletters and Authors in 2026

Discover Substack publications by category and leaderboard, then pull every post, author and subscription tier via Substack's public JSON API — no auth, with a clean posts-and-pubs schema.

Substack has quietly become one of the largest open repositories of long-form writing and independent podcasting on the web. Every publication exposes a public JSON API — archives, post metadata, author info, subscription tiers — and there’s a category-leaderboard system you can walk to discover publications you don’t already know about. That combination (discovery + extraction, both keyless) is what makes Substack unusually valuable to scrape. This guide covers the two-phase model, the public endpoints, the posts-vs-publications schema split, and how to build a refreshed newsletter dataset at scale.

What’s worth extracting

Substack data naturally splits into two related record types.

Per publication:

  • Identity — publication name, subdomain (e.g. stratechery.substack.com), custom domain when set (e.g. stratechery.com), primary URL.
  • Branding — logo and cover images.
  • Author — display name and language.
  • Monetization — subscription tier descriptions and benefits (free / paid / founding), founding date.
  • Community signals — comment/community flags.
  • Podcast — podcast feed URL when the publication has audio.

Per post:

  • Content — title, subtitle, content type (newsletter / podcast / thread).
  • Audience tier — free, paid, or founding-only.
  • Timing — publication timestamp.
  • Engagement — reaction counts and restack counts.
  • Media — cover image; for podcasts, the audio URL and duration.
  • Addressing — canonical URL.
  • Length — lightweight word-count / length metadata when available.

The natural output is two joinable streams: a publications table and a posts table, related on subdomain.
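As a minimal sketch of that join, assuming both streams are lists of plain dictionaries that each carry a subdomain key (the function name and the sample field values are illustrative, not part of any Substack or Apify API):

```python
from typing import Dict, List


def join_posts_to_publications(posts: List[Dict], publications: List[Dict]) -> List[Dict]:
    """Attach publication-level fields to each post, keyed on subdomain."""
    pubs_by_subdomain = {pub["subdomain"]: pub for pub in publications}
    joined = []
    for post in posts:
        pub = pubs_by_subdomain.get(post["subdomain"])
        if pub is None:
            # The post's publication was never collected; skip it rather than guess.
            continue
        # Post fields are merged last so they win on any key collision.
        joined.append({**pub, **post})
    return joined


publications = [{"subdomain": "stratechery", "publication_name": "Stratechery"}]
posts = [{"subdomain": "stratechery", "post_title": "The AI Unbundling"}]
rows = join_posts_to_publications(posts, publications)
```

Keeping the two tables separate and joining on demand avoids repeating publication metadata on every post row in storage, while still allowing a flat view when a query needs one.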

The two-phase model: discover, then extract

What sets scraping Substack apart from an ordinary single-site scrape is discovery.

Phase 1 — discovery. Substack runs category leaderboards across 30+ topic slugs — technology, business, finance, crypto, news, culture, health, politics, science, design, podcast, and more. You can walk these leaderboards in three modes: top-ranked overall, top paid, or all listings per category. The scraper auto-paginates each category’s listing, then deduplicates publications across categories (a finance newsletter often appears under both finance and business). The output of this phase is a clean set of publications you didn’t have to know about in advance.
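The cross-category dedup step could look roughly like this. It is a sketch under assumptions: the input shape, a sequence of (category, publication) pairs, and the added categories field are hypothetical and are not taken from the leaderboard payload.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_publications(listings: Iterable[Tuple[str, Dict]]) -> List[Dict]:
    """Collapse leaderboard listings into one record per subdomain.

    Each listing is a (category, publication) pair. The merged record keeps
    every category the publication appeared under, in first-seen order.
    """
    merged: Dict[str, Dict] = {}
    for category, pub in listings:
        key = pub["subdomain"].lower()  # normalize case so the key is stable
        if key not in merged:
            merged[key] = {**pub, "subdomain": key, "categories": []}
        if category not in merged[key]["categories"]:
            merged[key]["categories"].append(category)
    return list(merged.values())


listings = [
    ("finance", {"subdomain": "examplefinance", "publication_name": "Example Finance"}),
    ("business", {"subdomain": "ExampleFinance", "publication_name": "Example Finance"}),
]
unique = dedupe_publications(listings)
```

Recording the categories instead of discarding the duplicates keeps the information that a publication ranks under more than one topic.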

Phase 2 — extraction. Once you have a publication’s subdomain (from discovery, or because you supplied it directly), you hit its public archive API, auto-paginate through every post, and emit normalized post records enriched with the publication-level metadata. For custom-domain publications, the scraper probes candidate hosts to resolve the real Substack backend behind a vanity domain like stratechery.com.
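One way to picture the host probing is below. The candidate list is an assumption made for illustration (the bare domain, a www. variant, and a guessed substack.com subdomain built from the first label); the source does not specify which hosts the scraper actually tries. The probe callable is a hypothetical hook that returns True when a host answers the archive API.

```python
from typing import Callable, List, Optional


def candidate_hosts(custom_domain: str) -> List[str]:
    """Return hosts to try for a custom-domain publication (assumed heuristic)."""
    domain = custom_domain.strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]
    guessed_subdomain = domain.split(".")[0]
    return [domain, f"www.{domain}", f"{guessed_subdomain}.substack.com"]


def resolve_api_host(custom_domain: str, probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate host the probe accepts, or None if none respond."""
    for host in candidate_hosts(custom_domain):
        if probe(host):
            return host
    return None


# Stand-in probe for demonstration; a real one would issue an HTTP request.
host = resolve_api_host("stratechery.com", lambda h: h.endswith(".substack.com"))
```

Passing the probe in as a function keeps the resolution logic testable without network access.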

You can run either phase alone: discovery-only to build a directory of publications, or extraction-only against a list of subdomains you already care about.

The public API

Substack exposes JSON without authentication. The archive endpoint per publication looks like:

https://SUBDOMAIN.substack.com/api/v1/archive
  ?sort=new
  &offset=0
  &limit=25

You page by incrementing offset until you stop getting posts. Publication-level metadata comes from a sibling endpoint on the same host. Because there’s no auth and the payloads are stable JSON, this is one of the lower-risk targets in the catalog — the work is in the discovery walk, the cross-category dedup, the custom-domain resolution, and the relational join between posts and publications.
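The offset loop can be sketched as follows. This assumes each page of the archive endpoint is a JSON array of post objects and that an empty array marks the end; the fetch parameter and the max_posts cap are illustrative additions, not part of the API.

```python
import json
from typing import Callable, Iterator, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def archive_url(subdomain: str, offset: int, limit: int = 25) -> str:
    """Build the archive URL for one page."""
    query = urlencode({"sort": "new", "offset": offset, "limit": limit})
    return f"https://{subdomain}.substack.com/api/v1/archive?{query}"


def fetch_json(url: str) -> List[dict]:
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_archive(
    subdomain: str,
    fetch: Callable[[str], List[dict]] = fetch_json,
    limit: int = 25,
    max_posts: Optional[int] = None,
) -> Iterator[dict]:
    """Yield posts page by page until a page comes back empty or the cap is hit."""
    offset = 0
    emitted = 0
    while True:
        page = fetch(archive_url(subdomain, offset, limit))
        if not page:
            return
        for post in page:
            yield post
            emitted += 1
            if max_posts is not None and emitted >= max_posts:
                return
        # Advance by what was actually returned, in case the server caps the page size.
        offset += len(page)
```

A call such as `iter_archive("stratechery", max_posts=100)` would stop after 100 posts, which is one way to cap retrieval depth on very long archives. In practice it is sensible to add a delay between requests and handle HTTP errors, both omitted here for brevity.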

Run the Substack Scraper — discover newsletters across 30+ categories and leaderboards, or point it at subdomains directly. Returns every post (title, audience tier, reactions, restacks) plus publication metadata (subdomain, custom domain, author, subscription tiers). Public API, no auth.

Schema design for downstream use

A denormalized per-post row that carries the key publication fields is the most query-friendly shape:

{
  "publication_name": "Stratechery",
  "subdomain": "stratechery",
  "custom_domain": "stratechery.com",
  "author_name": "Ben Thompson",
  "language": "en",
  "subscription_tiers": ["free", "paid"],
  "post_title": "The AI Unbundling",
  "post_subtitle": "Aggregation theory meets generative models",
  "content_type": "newsletter",
  "audience": "paid",
  "published_at": "2026-05-19T11:00:00Z",
  "reaction_count": 842,
  "restack_count": 119,
  "podcast_audio_url": null,
  "podcast_duration_sec": null,
  "canonical_url": "https://stratechery.com/2026/the-ai-unbundling/",
  "word_count": 2840,
  "scraped_at": "2026-05-23T09:00:00Z"
}

Choices worth making early:

  • Keep audience (free/paid/founding) as a column. Paywall tier is the single most useful filter for both content intelligence and sponsorship targeting.
  • Store both subdomain and custom_domain. The subdomain is the stable join key; the custom domain is what humans recognize and what you’d use for outreach.
  • Keep content_type. Newsletters, podcasts, and threads are different products — separating them lets you build a podcast-only or essay-only view.
  • Preserve reaction and restack counts with a scraped_at. Engagement is a moving target; you need the timestamp to compute growth.
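To illustrate why scraped_at matters, here is a small sketch that computes reaction growth between two snapshots of the same post. It assumes the field names from the example row above and ISO 8601 timestamps ending in Z; the function names and the earlier snapshot's values are made up for the example.

```python
from datetime import datetime
from typing import Dict


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-05-23T09:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def reactions_per_day(earlier: Dict, later: Dict) -> float:
    """Average daily change in reaction_count between two snapshots of one post."""
    elapsed = parse_timestamp(later["scraped_at"]) - parse_timestamp(earlier["scraped_at"])
    elapsed_days = elapsed.total_seconds() / 86400
    if elapsed_days <= 0:
        raise ValueError("later snapshot must be scraped after the earlier one")
    return (later["reaction_count"] - earlier["reaction_count"]) / elapsed_days


# Hypothetical snapshots taken one week apart.
first = {"reaction_count": 700, "scraped_at": "2026-05-16T09:00:00Z"}
second = {"reaction_count": 842, "scraped_at": "2026-05-23T09:00:00Z"}
growth = reactions_per_day(first, second)
```

Without the timestamp on each snapshot, the two counts could not be turned into a rate.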

Typical use cases

  • Content intelligence platforms — track newsletters across topical categories with scheduled refreshes; spot rising publications via leaderboard movement.
  • Newsletter sponsorship outreach — assemble author names, custom domains, and subscription-tier descriptions into prospect lists for ad/sponsorship sales.
  • RAG / LLM training data — harvest long-form public posts with author, date, and topic metadata as a high-quality fine-tuning or vector-index corpus.
  • Migration tooling — bulk-export an author’s full archive for a Substack-to-Beehiiv (or similar) migration.
  • Competitive monitoring — watch a rival’s publishing cadence, paywall strategy, and engagement trend.
  • Investor / VC research — discover founders and operators publishing in fintech, crypto, and SaaS at scale by walking those categories.
  • Build a newsletter directory — generate publication-only catalogs across many categories as a standalone product.

Cost math

The Substack Scraper actor is pay-per-event: each result costs $0.002, plus a small per-run start fee. Cost therefore scales with how much you actually pull. Discovering and cataloging a few thousand publications costs a few dollars (2,000 results × $0.002 = $4); a full-archive extraction of high-volume publishers (hundreds of posts each) across a category costs more, in proportion to total posts emitted. Because Substack needs no proxy and no browser, there is no hidden bandwidth bill, so the per-result price is essentially the whole cost. Budget by estimating total posts: publications × average archive depth.
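The budgeting rule can be written as a short calculation. This is a sketch only: the $0.002 per-result price comes from the text above, while the start fee is left as a parameter defaulting to zero because its exact value is not stated here.

```python
PRICE_PER_RESULT_USD = 0.002  # per-result price quoted above


def estimate_cost_usd(
    publications: int,
    avg_archive_depth: float,
    start_fee_usd: float = 0.0,  # assumption: actual start fee not specified
) -> float:
    """Estimate run cost as total posts times the per-result price, plus the start fee."""
    total_posts = publications * avg_archive_depth
    return total_posts * PRICE_PER_RESULT_USD + start_fee_usd


# Hypothetical run: 200 publications averaging 150 posts each.
estimate = estimate_cost_usd(200, 150)
```

For that hypothetical run, 200 × 150 = 30,000 posts, which comes to about $60 before the start fee.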

Common pitfalls

  • Custom domains hide the backend. A publication on example.com still runs on Substack; you need host-probing to find the real API host, or you’ll get nothing.
  • Dedup across categories. The same publication appears under multiple topic slugs — without cross-category dedup your directory double-counts.
  • Paid-post bodies are gated. Public metadata (title, subtitle, audience tier, engagement) is available for paid posts, but the full body text behind a paywall is not — don’t expect to harvest paywalled content.
  • Reaction counts drift. They’re live; a post’s reactions today differ from last week. Always stamp scraped_at.
  • Archive depth varies wildly. A two-month-old newsletter might have 8 posts; an established one can have 1,500. Cap retrieval depth if you’re cost-sensitive.
  • Language is per publication, not per post. Treat the publication’s language field as the corpus language tag.

Wrapping up

Substack’s combination of keyless discovery and keyless extraction is rare — you can find publications you’ve never heard of and pull their full public archives without a login. The hard parts are the leaderboard walk, cross-category dedup, custom-domain resolution, and keeping posts and publications cleanly joinable. A managed actor handles all four, so you can go straight to building a directory, a sponsorship list, or a training corpus.

Open the Substack Scraper on Apify — category and leaderboard discovery, full-archive extraction, posts joined to publications. Pay per result. Start with Apify’s free monthly credit.
