social · May 19, 2026 · 6 min read

Reddit Data Export Without the API in 2026

How to extract Reddit posts, comments, and user activity at scale without an OAuth app, rate-limited tokens, or a paid Reddit API tier.

Reddit’s official API moved to a paid tier in 2023 and has stayed expensive ever since. For teams that want a few thousand posts and comments for sentiment analysis, market research, or a new product feature, the official rate-limited API is overkill — and the cost adds up fast. The good news is that Reddit still exposes a public, unauthenticated .json endpoint on every page. This is how you use it cleanly, without an OAuth app, without rate-limited tokens, and without a paid contract.

The `.json` trick

Append .json to any Reddit URL and you get the same data the page renders, in clean JSON form:

https://www.reddit.com/r/programming.json
https://www.reddit.com/r/programming/comments/abc123.json
https://www.reddit.com/user/spez.json
https://www.reddit.com/search.json?q=apify&sort=new

No auth. No app registration. No tokens. The endpoint is older than the modern Reddit redesign and has stayed quietly available even as the OAuth tier got squeezed.
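A minimal sketch of the trick in Python, using only the standard library. The helper names `to_json_url` and `fetch_json` are hypothetical, and the User-Agent string is a placeholder; sending a descriptive User-Agent is an assumption here rather than something stated above.

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def to_json_url(url: str) -> str:
    """Return the .json variant of a Reddit page URL, keeping the query string."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    # Drop any #fragment; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def fetch_json(url: str, user_agent: str = "example-research-script/0.1"):
    """Fetch a Reddit page as parsed JSON (performs a real network request)."""
    req = Request(to_json_url(url), headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Calling `fetch_json("https://www.reddit.com/r/programming/")` would request the `r/programming.json` listing shown above.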

What you can pull

The public JSON exposes a rich set of fields per item:

For posts: title, author, score, subreddit, post text (selftext), URL, image/video URLs, comment count, awards, flairs, NSFW flag, timestamps, gilded count.
For comments: body, author, score, parent comment ID, depth, replies tree, timestamps.
For users: account age, karma breakdown (post vs comment), trophies, recent activity, verified email flag.
For search: full post objects matching a keyword, sorted by new, top, relevance, hot, or comments.

Combine these and you can build a complete picture of any subreddit, user, or keyword conversation — without ever touching the API.

Rate limits and how to live with them

The unauthenticated endpoint is rate-limited by IP, roughly:

Around 60 requests per minute per IP, in practice — Reddit doesn’t publish exact numbers and they tune the limit dynamically.
Spikes above this get a soft 429 (Too Many Requests) with a Retry-After header.
Sustained abuse from a single IP can get a 12–24 hour block.

What this means in practice: you can scrape Reddit data quickly, but you can’t blast it. Three rules of thumb:

Throttle to 30 requests/min on a single session. At that rate you should rarely, if ever, see a 429.
Respect Retry-After when you do hit one — back off, don’t retry immediately.
Use a rotating IP pool if you need more throughput than one IP can sustain.

You don’t need a proxy at all for small jobs (a few thousand items). You do need one if you’re pulling hundreds of thousands of items in a single run.
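The first two rules of thumb can be sketched as a small throttle plus a Retry-After parser. This is an illustration under assumptions: the `Throttle` class and `retry_after_seconds` function are hypothetical names, the 60-second fallback is arbitrary, and only the seconds form of Retry-After is parsed.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


def retry_after_seconds(headers, default=60.0):
    """Read Retry-After as a number of seconds; fall back to `default`.

    An HTTP-date value is not parsed here and also yields `default`.
    """
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default
```

In a fetch loop, call `throttle.wait()` before each request, and on a 429 sleep for `retry_after_seconds(error.headers)` before trying again. The clock and sleep functions are injectable so the pacing logic can be tested without real delays.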

Pagination patterns

Reddit’s .json endpoint paginates with a cursor system using the after and before params, plus a limit param that maxes out at 100 items per page:

https://www.reddit.com/r/programming.json?limit=100&after=t3_abc123

The after cursor comes from the data.after field of the previous response. When after is null, you’ve reached the end of the available history (Reddit keeps roughly the last 1,000 posts per subreddit accessible — older posts exist but are gated behind search, not listing).

For search:

https://www.reddit.com/search.json?q=apify&sort=new&limit=100&after=...

For user profile pages:

https://www.reddit.com/user/{username}/submitted.json?limit=100&after=...
https://www.reddit.com/user/{username}/comments.json?limit=100&after=...
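The cursor loop described above can be sketched as a generator. The name `iter_listing` and the `max_pages` safety cap are hypothetical, and `fetch` is any callable that takes a URL and returns parsed JSON, so the loop can be exercised without network access. The `data.children` and `data.after` fields reflect the listing shape as I understand it; check them against a live response.

```python
from urllib.parse import urlencode


def iter_listing(fetch, base_url, limit=100, max_pages=10, extra_params=None):
    """Yield item dicts from a Reddit listing, following the `after` cursor."""
    after = None
    for _ in range(max_pages):
        params = dict(extra_params or {})
        params["limit"] = limit
        if after:
            params["after"] = after
        listing = fetch(f"{base_url}?{urlencode(params)}")
        for child in listing["data"]["children"]:
            yield child["data"]
        after = listing["data"].get("after")
        if not after:  # null cursor: end of the available history
            break
```

For search, pass the query through `extra_params`, for example `{"q": "apify", "sort": "new"}`, with `https://www.reddit.com/search.json` as the base URL.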

▶ Try the Reddit Search Scraper — handles pagination, throttling, and optional top-comment fetching for you. Returns clean rows ready for ETL. No auth required.

Pulling comments efficiently

The trickiest part of a Reddit pipeline is comments. You can get the top-level post object cheaply, but each post’s comment tree is a separate request, and the tree can be deep.

Strategies:

Fetch only top-level comments — fastest, cheapest, captures most of the conversation. Use ?depth=1.
Fetch the full tree but cap depth at 5 — covers 95% of meaningful threads.
Skip “more” continuations — when a comment chain has 50+ replies, Reddit collapses some into a “load more” reference. Following these turns one request into many. For aggregate sentiment work, skip them.

For a typical sentiment-analysis job over a subreddit, fetching post + top 20 comments per post is the right cost-quality tradeoff. Going deeper rarely changes the signal.
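A sketch of the depth-capped strategy that also skips “more” continuations. It assumes the comment listing nests replies as a listing object under `replies` (an empty string when there are none) and marks comments with kind `t1`; the function name `flatten_comments` is hypothetical, and `max_depth` counts levels with top-level comments at depth 0.

```python
def flatten_comments(listing, max_depth=5, depth=0):
    """Flatten a comment listing into rows, depth-first, up to max_depth levels."""
    rows = []
    for child in listing["data"]["children"]:
        if child["kind"] != "t1":
            # "more" continuation stubs would each cost extra requests; skip them.
            continue
        c = child["data"]
        rows.append({
            "id": c["name"],
            "parent_id": c["parent_id"],
            "depth": depth,
            "author": c.get("author"),
            "body": c.get("body"),
            "score": c.get("score"),
        })
        replies = c.get("replies")
        if isinstance(replies, dict) and depth + 1 < max_depth:
            rows.extend(flatten_comments(replies, max_depth, depth + 1))
    return rows
```

Setting `max_depth=1` gives the top-level-only variant, and slicing the result (for example `rows[:20]`) caps the number of comments kept per post.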

What clean output looks like

Rows you’d want in your data warehouse:

{
  "id": "t3_abc123",
  "type": "post",
  "subreddit": "programming",
  "author": "u/example",
  "title": "Show: I built an Apify actor for X",
  "selftext": "...",
  "score": 184,
  "upvote_ratio": 0.92,
  "num_comments": 47,
  "url": "https://github.com/example/repo",
  "permalink": "https://reddit.com/r/programming/comments/abc123/show_i_built/",
  "flair": "Show",
  "awards": 1,
  "is_nsfw": false,
  "created_utc": 1721232000,
  "scraped_at": "2026-05-19T12:00:00Z"
}
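One way to produce a row like this from a raw listing item, as a sketch. The raw field names (`name`, `link_flair_text`, `total_awards_received`, `over_18`) are my understanding of Reddit’s payload and should be checked against a live response; `post_to_row` is a hypothetical helper.

```python
from datetime import datetime, timezone


def post_to_row(raw, scraped_at=None):
    """Map a raw Reddit post dict to the flat warehouse row shown above."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    return {
        "id": raw["name"],  # fullname, e.g. "t3_abc123"
        "type": "post",
        "subreddit": raw.get("subreddit"),
        "author": f"u/{raw.get('author')}",
        "title": raw.get("title"),
        "selftext": raw.get("selftext", ""),
        "score": raw.get("score"),
        "upvote_ratio": raw.get("upvote_ratio"),
        "num_comments": raw.get("num_comments"),
        "url": raw.get("url"),
        # The raw permalink is assumed to be a site-relative path.
        "permalink": "https://reddit.com" + raw.get("permalink", ""),
        "flair": raw.get("link_flair_text"),
        "awards": raw.get("total_awards_received", 0),
        "is_nsfw": bool(raw.get("over_18", False)),
        "created_utc": int(raw["created_utc"]),
        "scraped_at": scraped_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```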

For comments:

{
  "id": "t1_def456",
  "type": "comment",
  "parent_post_id": "t3_abc123",
  "parent_comment_id": "t1_xxx789",
  "depth": 1,
  "author": "u/example",
  "body": "...",
  "score": 22,
  "created_utc": 1721232300,
  "scraped_at": "2026-05-19T12:00:00Z"
}

Keep id, parent_post_id, and parent_comment_id so you can reconstruct trees in SQL with a single recursive CTE.
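A sketch of such a recursive CTE, run through Python’s built-in sqlite3 module so it is self-contained. The table layout and sample rows are made up for illustration, and it assumes top-level comments store NULL in parent_comment_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id TEXT PRIMARY KEY, parent_post_id TEXT, parent_comment_id TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        ("t1_a", "t3_abc123", None, "top-level"),
        ("t1_b", "t3_abc123", "t1_a", "reply to a"),
        ("t1_c", "t3_abc123", "t1_b", "reply to b"),
        ("t1_d", "t3_abc123", None, "another top-level"),
    ],
)

THREAD_SQL = """
WITH RECURSIVE thread(id, depth, path) AS (
    -- Anchor: top-level comments of the post.
    SELECT id, 0, id FROM comments
    WHERE parent_post_id = ? AND parent_comment_id IS NULL
    UNION ALL
    -- Step: attach each reply under its parent, extending the path.
    SELECT c.id, t.depth + 1, t.path || '/' || c.id
    FROM comments AS c JOIN thread AS t ON c.parent_comment_id = t.id
)
SELECT id, depth, path FROM thread ORDER BY path
"""


def thread_rows(connection, post_id):
    """Return (id, depth, path) tuples for one post, in thread order."""
    return connection.execute(THREAD_SQL, (post_id,)).fetchall()
```

Ordering by the accumulated path keeps each reply directly under its parent, which is usually what you want when rendering or exporting a thread.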

Use cases

What teams actually do with Reddit data:

Sentiment monitoring — alert when a brand, product, or competitor is being discussed at scale.
Trend extraction — find which keywords are rising fastest in a subreddit over the past week.
AI training data — Reddit’s text is a treasured corpus for LLM training and fine-tuning. The post + comment structure is uniquely valuable.
Customer feedback mining — pull every comment mentioning a specific product feature.
Influencer / community mapping — identify the highest-karma accounts in a subreddit, track who’s actually shaping conversation.

The common need is breadth × freshness: the value isn’t in any one comment, it’s in being able to refresh thousands of comments daily across dozens of relevant subreddits.

Build it yourself vs. use a managed scraper

Reddit is one of the easier sites to scrape from scratch: there is no aggressive bot detection, the JSON endpoint is publicly documented in old Reddit’s source, and there is no JavaScript challenge layer. So you might wonder why you’d use a managed actor instead of writing a 50-line script.

The honest answer: the 50-line script gets you the first 1,000 rows. Then you hit pagination quirks, throttle problems, weird data shapes for crossposts vs. self-posts vs. media posts, deleted-author handling, and the “more” comment continuations. By the time you’ve handled all of these, you’ve spent two days on what should have been a half-hour pull.

A managed actor handles the edge cases once and lets you focus on what you’re actually building with the data.

Common pitfalls

A few things that will trip up a Reddit pipeline:

Deleted vs. removed content — [deleted] means the user deleted it, [removed] means a mod removed it. The text is gone in both cases but the row still appears in listings.
Author churn — u/[deleted] accounts can’t be re-fetched. Capture the author at row creation, not later.
Crosspost objects have a different shape from native posts. Either normalize them or filter them out.
Search relevance drift — Reddit’s sort=relevance tuning changes silently. If you need deterministic order over time, use sort=new and apply your own filtering.
NSFW flag — affects what default scrapers can see without a logged-in session. The .json endpoint usually surfaces NSFW content but some communities are gated.
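Two of these pitfalls can be handled with small checks at ingestion time. This is a sketch: the helper names are hypothetical, and using the `crosspost_parent` field to detect crossposts is an assumption about the payload that should be verified on real data.

```python
def content_status(text):
    """Classify a post or comment body using the placeholder markers."""
    if text == "[deleted]":
        return "deleted_by_user"
    if text == "[removed]":
        return "removed_by_mod"
    return "present"


def is_deleted_author(author):
    """True when the author can no longer be resolved to an account."""
    return author in (None, "[deleted]")


def is_crosspost(raw_post):
    """True when the raw post dict appears to reference a parent post."""
    return bool(raw_post.get("crosspost_parent"))
```

Storing the status as its own column keeps deleted and removed rows countable in aggregates without treating the placeholder text as real content.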

Wrapping up

If you need a Reddit feed today, the .json endpoint is open and a managed scraper handles every edge case so you can start pulling data in five minutes. If you’re trying to build a research-grade corpus over months, you can roll your own — but expect to spend the first week debugging quirks the managed version already handles.

▶ Open the Reddit Search Scraper on Apify — search by keyword, subreddit, or user; pulls posts, comments, and user profiles. Pay per row.