social-media · May 22, 2026 · 5 min read

How to Scrape YouTube Comments for Sentiment Analysis in 2026

Export every comment and threaded reply from any YouTube video in bulk — author, text, likes, reply counts — for sentiment analysis, research and AI training.

Comment sections are where YouTube’s most honest signal lives — the feature requests, the complaints, the language your customers actually use. The problem is getting it out. The Data API hands you comments behind a quota wall and quietly truncates deep reply threads, and scraping the watch page yourself means wrestling with continuation tokens and a rendering layer designed to resist exactly that. This guide covers what’s extractable from a comment section in 2026, how YouTube paginates it internally, and how to pull tens of thousands of comments cleanly for analysis.

What’s worth extracting

Per comment — and per reply, which most tools drop — the useful fields are:

Content — the full comment text, unicode and emoji intact.
Author — display name, channel identifier, and flags for verified and creator (so you can separate the uploader’s own replies from the crowd).
Engagement — like count and reply count, normalized to integers.
Recency — relative publish time (“2 days ago”).
Thread structure — whether a record is a top-level comment or a reply, with parent linkage so you can reconstruct the conversation tree.
Source — the video ID/metadata the comment belongs to, so bulk runs across many videos stay attributable.

That thread structure is the part casual scrapers get wrong. A flat list of comments loses the conversation; parent linkage lets you rebuild who replied to whom — which matters enormously for sentiment and discourse work.
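
As a rough illustration of what parent linkage buys you, here is a minimal Python sketch that regroups a flat dump into threads. It borrows the `comment_id` and `parent_id` field names from the schema later in this guide and assumes replies attach directly to a top-level comment; `build_threads` is a hypothetical helper, not part of any scraper.

```python
from collections import defaultdict


def build_threads(records):
    """Group flat comment records into threads keyed by their top-level comment.

    Returns (threads, orphans): orphans are replies whose parent is not in the dump.
    """
    replies_by_parent = defaultdict(list)
    top_level = []
    for rec in records:
        if rec.get("parent_id") is None:
            top_level.append(rec)
        else:
            replies_by_parent[rec["parent_id"]].append(rec)

    threads = [
        {"comment": rec, "replies": replies_by_parent.pop(rec["comment_id"], [])}
        for rec in top_level
    ]
    # Whatever is left points at a parent that was never scraped.
    orphans = [reply for group in replies_by_parent.values() for reply in group]
    return threads, orphans
```

Returning orphans separately, rather than dropping them, makes it easy to spot runs where a top-level page went missing.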

How the data is exposed (InnerTube + continuation tokens)

YouTube doesn’t ship the comment section in the initial HTML. It loads lazily from the internal data layer as you scroll, handing back a batch plus a continuation token that fetches the next batch. Replies are themselves a nested continuation behind each comment’s “View replies” affordance.
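
The batch-plus-token pattern described above can be sketched as a small generator. This is an illustration under stated assumptions, not YouTube's actual response format: `fetch_page`, the page shape, and the `next_token` / `reply_token` keys are all hypothetical stand-ins for whatever the real payload contains.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page shape: {"comments": [...], "next_token": str or None}.
# A comment carries "reply_token" when it has replies to load (assumed key name).
FetchPage = Callable[[Optional[str]], dict]


def iter_comments(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every comment, walking top-level pages and nested reply pages."""
    token: Optional[str] = None  # None requests the first page
    while True:
        page = fetch_page(token)
        for comment in page["comments"]:
            yield comment
            # Replies are their own continuation chain behind each comment.
            reply_token = comment.get("reply_token")
            while reply_token:
                reply_page = fetch_page(reply_token)
                yield from reply_page["comments"]
                reply_token = reply_page.get("next_token")
        token = page.get("next_token")
        if not token:
            break
```

Injecting `fetch_page` keeps the traversal logic testable without touching the network.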

How a maintained scraper (here, an Apify actor) handles that in 2026:

No login, no API key. A fresh anonymous access key is fetched per run; there’s no OAuth flow and no Data API quota to exhaust.
Auto-pagination on continuation tokens. The actor follows top-level tokens to the end and resolves the nested reply tokens, so threads come back complete rather than just the first few replies.
Sort control. YouTube exposes “Top comments” and “Newest first”; the actor encodes both, so you can grab the highest-signal comments or the most recent ones.
Built for volume. Popular videos have tens of thousands of comments. The actor handles that scale with retry logic so a single dropped page doesn’t end the run.
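
To make the retry point concrete, here is one common way to wrap a page fetch in exponential backoff with jitter. It is a generic sketch, not the actor's implementation; `fetch_with_retry` and its parameters are hypothetical, and a real version would catch specific network errors rather than `Exception`.

```python
import random
import time


def fetch_with_retry(fetch, token, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(token), retrying failed calls with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fetch(token)
        except Exception:  # assumption: narrow this to your HTTP client's errors
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # 1x, 2x, 4x ... the base delay, plus jitter to avoid lockstep retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Because the failed token is retried rather than skipped, a dropped page delays the run instead of leaving a hole in the thread.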

The contrast with the Data API is stark: a commentThreads response includes only a subset of each thread’s replies, so complete threads need follow-up comments.list calls per parent, and every call draws on a daily quota. Reading the continuation layer the way the website does sits outside that quota and reaches the replies a commentThreads response leaves out.

▶ Run the YouTube Comments Scraper — every comment and threaded reply from any video, with author, likes, reply counts and verified/creator flags. No login, no API key, sortable by top or newest.

Schema design for downstream use

For sentiment and research pipelines, you want a flat record per comment that still knows its place in the thread:

{
  "video_id": "dQw4w9WgXcQ",
  "comment_id": "Ugxabc123",
  "parent_id": null,
  "is_reply": false,
  "author": "Jane D.",
  "author_channel_id": "UCxxxxxxxxxxxx",
  "is_verified": false,
  "is_creator": false,
  "text": "Honestly the new UI is way cleaner, but the search is broken now.",
  "likes": 412,
  "reply_count": 18,
  "published_text": "2 days ago",
  "scraped_at": "2026-05-22T10:00:00Z"
}
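
If you load these rows in Python, a typed record catches schema drift early. The sketch below mirrors the fields shown above; the `Comment` class and `parse_comment` helper are hypothetical, and the consistency check between `is_reply` and `parent_id` is an assumption about how the two fields relate.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Comment:
    video_id: str
    comment_id: str
    parent_id: Optional[str]
    is_reply: bool
    author: str
    author_channel_id: str
    is_verified: bool
    is_creator: bool
    text: str
    likes: int
    reply_count: int
    published_text: str
    scraped_at: datetime


def parse_comment(raw: dict) -> Comment:
    """Build a Comment from one JSON row; unknown or missing keys raise TypeError."""
    fields = dict(raw)
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalize it.
    fields["scraped_at"] = datetime.fromisoformat(
        fields["scraped_at"].replace("Z", "+00:00")
    )
    comment = Comment(**fields)
    # Assumed invariant: a record is a reply exactly when it has a parent.
    if comment.is_reply != (comment.parent_id is not None):
        raise ValueError(f"{comment.comment_id}: is_reply disagrees with parent_id")
    return comment
```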

Schema choices worth making:

Keep parent_id. It’s the join key that turns a flat dump back into threads. Replies carry their parent’s comment ID; top-level comments carry null.
Store is_creator and is_verified. When you’re measuring audience sentiment, the creator’s own pinned reply shouldn’t count as audience opinion — these flags let you filter it out.
Normalize likes to an integer up front. “1.2K” is useless for ranking; the actor parses it for you.
Don’t discard emoji. They’re real sentiment signal. Keep the text as-is and let your model handle them.
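
For rows that arrive from another source with display strings still in place, two of these choices fit in a few lines. This is a sketch under assumptions: `parse_count` and `audience_only` are hypothetical helpers, and the parser expects English-style abbreviations such as "1.2K" or "3.4M" with a dot as the decimal separator.

```python
import re

_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_COUNT_RE = re.compile(r"^\s*([\d.,]+)\s*([KMB]?)\s*$", re.IGNORECASE)


def parse_count(value) -> int:
    """Turn a display count like '1.2K' or '1,234' into an integer."""
    if isinstance(value, int):
        return value
    if value is None or not str(value).strip():
        return 0  # assumption: a blank count means zero
    match = _COUNT_RE.match(str(value))
    if not match:
        raise ValueError(f"unrecognized count: {value!r}")
    number, suffix = match.groups()
    # Commas are treated as thousands separators (English formatting assumed).
    return round(float(number.replace(",", "")) * _MULTIPLIERS[suffix.upper()])


def audience_only(records):
    """Drop the uploader's own comments so they don't count as audience opinion."""
    return [rec for rec in records if not rec.get("is_creator", False)]
```

Abbreviated counts are already rounded by the time you see them, so "1.2K" becomes 1200 as a best estimate rather than an exact figure.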

Typical use cases

What teams do with bulk comment data:

Sentiment analysis of product launches, trailers and reviews — measure how reception shifts comment by comment.
Brand and reputation monitoring across your own videos and competitors’.
Audience and market research — mine feature requests, complaints and the exact phrasing customers use.
Creator analytics — measure engagement quality, not just count: are replies substantive or spam?
Moderation and spam analysis — pull the full stream to train filters or audit what’s getting through.
Conversational datasets for LLM training — threaded, parent-linked comments are well-suited to dialogue modeling.
Academic research on discourse, virality and community dynamics.

The common thread: comments are unstructured until you structure them. Once each record is flat, typed and thread-aware, the analysis is the easy part.
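
As one example of how little code the analysis step needs once rows are structured, here is a sketch of a like-weighted sentiment average. Everything in it is an assumption for illustration: the log weighting is one design choice among many, and `toy_score` is a deliberately crude keyword scorer standing in for whatever sentiment model you actually use.

```python
import math
from typing import Callable, Iterable


def weighted_sentiment(records: Iterable[dict], score: Callable[[str], float]) -> float:
    """Average score(text) across records, weighting each by 1 + log(1 + likes)."""
    total = 0.0
    weight_sum = 0.0
    for rec in records:
        # Log damping keeps one viral comment from drowning out everything else.
        weight = 1.0 + math.log1p(rec.get("likes", 0))
        total += weight * score(rec["text"])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Toy stand-in scorer: replace with a real sentiment model.
POSITIVE = {"cleaner", "great", "love"}
NEGATIVE = {"broken", "worse", "hate"}


def toy_score(text: str) -> float:
    words = {word.strip(".,!?").lower() for word in text.split()}
    return float(len(words & POSITIVE) - len(words & NEGATIVE))
```

Passing the scorer in as a function means swapping the toy for a real model changes one argument, not the aggregation.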

Cost math for the managed approach

Pricing is pay-per-event: a tiny per-run start fee and no charge per comment returned. A video with 30,000 comments costs essentially the start fee plus the compute to paginate it — fractions of a cent. Sweep 50 videos a week for ongoing brand monitoring and you stay in low single-digit dollars per month, with cost driven by compute time rather than row count.

Versus rolling your own, you skip:

Data API quota math — commentThreads calls add up fast; here there’s no ceiling.
Reply-token plumbing — resolving nested reply continuations correctly is fiddly and breaks when YouTube tweaks the format.
The retry harness — large comment sections drop pages; you’d have to build the resume logic yourself.
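
To see where Data API quota actually goes, a back-of-envelope estimate helps. The numbers below are assumptions to verify against the current Data API documentation: one quota unit per list call, up to 100 items per page, and a default daily allocation of 10,000 units. `data_api_units` is a hypothetical helper.

```python
import math

DAILY_QUOTA = 10_000  # assumed default daily allocation; check your project's quota


def data_api_units(top_level, threads_needing_replies, avg_reply_pages=1,
                   page_size=100, unit_cost=1):
    """Estimate quota units to pull top-level threads plus complete replies.

    threads_needing_replies: threads whose replies need a separate fetch.
    """
    thread_calls = math.ceil(top_level / page_size)
    reply_calls = threads_needing_replies * avg_reply_pages
    return (thread_calls + reply_calls) * unit_cost
```

Under these assumptions, 30,000 top-level comments take about 300 calls on their own, but if 12,000 of those threads need a separate reply fetch the total is 12,300 units, more than a day's default allocation for a single video.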

Common pitfalls

Before you commit a comment pipeline to production:

“Top” sorting is YouTube’s ranking, not chronological. If you need a timeline, use newest-first and estimate each publish time by subtracting the relative offset (“2 days ago”) from scraped_at.
Comments can be disabled. Some videos turn comments off; expect zero-row results and don’t treat them as failures.
Relative timestamps are approximate. “2 days ago” is fine for recency bucketing; don’t build exact time-series off it.
Held/pending comments aren’t public. You get what an anonymous viewer sees — moderation-held comments won’t appear.
Reply depth varies. Most threads are shallow, but a few balloon; let pagination run rather than capping reply fetches too aggressively.
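
For recency bucketing, a relative timestamp can be turned into an approximate publish time by subtracting it from `scraped_at`. The sketch below assumes English strings of the form "2 days ago" and treats a month as 30 days and a year as 365; `approx_published_at` is a hypothetical helper, and its error grows with the unit, so a "1 year ago" result can be off by months.

```python
import re
from datetime import datetime, timedelta, timezone

# Rough unit lengths in seconds; months and years are deliberate approximations.
_UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3_600,
    "day": 86_400,
    "week": 7 * 86_400,
    "month": 30 * 86_400,
    "year": 365 * 86_400,
}
_REL_RE = re.compile(
    r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago", re.IGNORECASE
)


def approx_published_at(published_text: str, scraped_at: datetime):
    """Estimate a publish time from text like '2 days ago'; None if unparseable."""
    match = _REL_RE.search(published_text)
    if not match:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return scraped_at - timedelta(seconds=amount * _UNIT_SECONDS[unit])
```

Returning `None` for unrecognized text keeps unparseable rows visible instead of silently assigning them a wrong date.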

Wrapping up

If you need one video’s comments once, the page and patience get you there. If you need complete, thread-aware comment data across many videos — sorted, typed and ready for a sentiment model — let a maintained actor handle the continuation tokens and hand you flat rows.

▶ Open the YouTube Comments Scraper on Apify — bulk video lists, threaded replies resolved, exports to JSON, CSV or Excel. Start with Apify’s free monthly credit.