videos · Jun 1, 2026 · 6 min read

How to Scrape Letterboxd Films & Reviews in 2026

A guide to extracting Letterboxd film metadata, ratings, cast, genres and user reviews — across film pages, watchlists, lists and search — for NLP, recommendation and film research.

Letterboxd is where film fans actually write. Unlike the thumbs-up noise on the big streaming platforms, Letterboxd reviews are long-form, opinionated, and paired with a clean 5-star (half-star) rating — which makes the site one of the best public sources for film-sentiment NLP, recommendation training, and reception research. The catch: Letterboxd serves static HTML but defends it with rate-limiting that punishes datacenter IPs, so you need residential proxies and patience. The upside: because it’s static HTML, you don’t need a headless browser at all. This guide covers what to extract, the scrape modes, and how to pull it without getting blocked.

What’s worth extracting

There are two distinct payloads here — film metadata and review content — plus user watch-history. The scraper captures all of them from static HTML (no browser, no JS execution required):

Film metadata (per film page):

Title and release year, director, top cast, genres.
Runtime, original language, country.
Synopsis / tagline, poster image URL.
Average rating and total ratings count.

User reviews (paginated, per film or per user):

The review text, HTML-stripped to clean prose.
Reviewer identity (username).
The reviewer’s star rating for that film.
Review date, like count, and a spoiler flag.

Watch history (per user profile):

The films a user logged, with their personal rating for each.

That pairing — clean review prose next to a numeric star rating — is exactly what you want for supervised sentiment work: text features, numeric label, no manual annotation.
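As a rough sketch of what that pairing looks like in code, the snippet below turns review records into (text, rating) training pairs. It assumes the field names review_text and review_rating from the per-review schema shown later in this guide, and it drops records that have no text or no rating, since neither can serve as a labeled example.

```python
from typing import Iterable, List, Tuple


def to_labeled_examples(reviews: Iterable[dict]) -> List[Tuple[str, float]]:
    """Turn review records into (text, star_rating) pairs for supervised training."""
    examples = []
    for review in reviews:
        text = (review.get("review_text") or "").strip()
        rating = review.get("review_rating")
        # A record needs both text (the features) and a rating (the label).
        if not text or rating is None:
            continue
        examples.append((text, float(rating)))
    return examples


sample = [
    {"review_text": "A quiet film about the lives we don't live...", "review_rating": 4.5},
    {"review_text": "", "review_rating": 3.0},                         # no text
    {"review_text": "Logged without a score.", "review_rating": None},  # no label
]
pairs = to_labeled_examples(sample)
```

Only the first sample record survives, because it is the only one with both a feature and a label.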

Scrape modes and inputs

The scraper, which runs as an Apify actor, supports four modes, so you can target exactly the slice you need:

film detail     a film URL/slug   -> metadata + paginated reviews for that film
user history    a username        -> that user's logged films + personal ratings
curated list    a list URL        -> the films in a community/editorial list
keyword search  a search term     -> films matching a query

All modes paginate automatically for multi-page results, and you can cap how many reviews per film and how many films to pull — important, because a popular film has thousands of reviews and you rarely want all of them.
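To make the mode-plus-caps idea concrete, here is a minimal sketch of a validated run input. Every field name in it (mode, target, max_films, max_reviews_per_film), the mode labels and the default caps are hypothetical and chosen for illustration; the actor's real input schema may use different keys, so check its input documentation before relying on any of them.

```python
from dataclasses import dataclass, asdict

# Hypothetical labels for the four modes described above.
MODES = {"film", "user", "list", "search"}


@dataclass
class ScrapeInput:
    mode: str
    target: str                      # film URL/slug, username, list URL, or search term
    max_films: int = 100             # illustrative default, not the actor's
    max_reviews_per_film: int = 200  # illustrative default, not the actor's

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not self.target.strip():
            raise ValueError("target must not be empty")
        if self.max_films < 1 or self.max_reviews_per_film < 0:
            raise ValueError("max_films must be >= 1 and max_reviews_per_film >= 0")


# Film-detail mode, capped at 300 reviews for this one film.
run_input = asdict(ScrapeInput(mode="film", target="past-lives", max_reviews_per_film=300))
```

Validating the caps before a run starts is cheap insurance: an accidental uncapped crawl of a popular film is the easiest way to overspend.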

The anti-bot reality: static HTML, but residential-only

Here’s the key operational fact: Letterboxd is static HTML, so a headless browser is unnecessary — every field is in the server-rendered markup. But the site rate-limits hard, and datacenter IPs get throttled or blocked quickly. Two consequences:

Residential proxy is required. A datacenter proxy (or no proxy) will get you a few requests before the 429s and blocks start. Residential IPs blend in with normal traffic and survive a sustained crawl. The scraper is built to run through a residential pool.
Pacing matters more than concurrency. Because the limit is request-rate per IP, blasting 50 concurrent requests from one session gets you blocked faster than steady, paced requests across a rotating residential pool. The scraper handles the pacing and rotation; if you roll your own, this is the part you’ll spend days tuning.
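If you do roll your own, the pacing-over-concurrency idea can be sketched roughly as below: rotate round-robin through a pool and enforce a minimum gap, plus random jitter, between two uses of the same proxy. The class name, the proxy labels and the gap values are assumptions for illustration; the numbers are placeholders, not measured Letterboxd limits, and this is not how the scraper itself is implemented.

```python
import itertools
import random
import time


class PacedRotator:
    """Round-robin over proxies with a minimum gap between uses of the same proxy."""

    def __init__(self, proxies, min_gap_s=2.0, jitter_s=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        proxies = list(proxies)
        self._cycle = itertools.cycle(proxies)
        self._last_used = {proxy: None for proxy in proxies}
        self.min_gap_s = min_gap_s
        self.jitter_s = jitter_s
        self._clock = clock   # injectable so the pacing logic can be tested
        self._sleep = sleep

    def next_proxy(self):
        proxy = next(self._cycle)
        last = self._last_used[proxy]
        if last is not None:
            # Wait until this proxy's gap (plus a little randomness) has elapsed.
            ready_at = last + self.min_gap_s + random.uniform(0, self.jitter_s)
            wait = ready_at - self._clock()
            if wait > 0:
                self._sleep(wait)
        self._last_used[proxy] = self._clock()
        return proxy
```

In use, you would call next_proxy() before each request and hand the result to your HTTP client. A larger pool raises throughput without raising the per-IP request rate, which is the quantity the limiter appears to care about.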

The reason a plain curl “works” once and then dies is exactly this: the HTML is trivially parseable, but the rate-limiter is the wall. The value isn’t in rendering the page — it’s in the proxy + pacing layer that lets you collect thousands of reviews without tripping the limiter.
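One piece of that pacing layer is deciding how long to wait after a 429. The sketch below is one common approach, not a description of the scraper's internals: honor a Retry-After value when the response carries one in seconds, otherwise back off exponentially up to a cap. Whether Letterboxd sends Retry-After at all is an assumption you would need to verify, and the base and cap values are placeholders.

```python
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base_s: float = 2.0, cap_s: float = 120.0) -> float:
    """Seconds to wait before retrying after an HTTP 429.

    attempt is 0 for the first retry, 1 for the second, and so on.
    """
    if retry_after is not None:
        try:
            # Retry-After may be a whole number of seconds.
            return float(min(max(int(retry_after), 0), cap_s))
        except ValueError:
            pass  # The HTTP-date form is not handled in this sketch; fall through.
    # Exponential backoff: base, 2x base, 4x base, ... up to the cap.
    return float(min(base_s * (2 ** attempt), cap_s))
```

In a real crawl you would also add random jitter to the result, so that several sessions that were throttled together do not all retry at the same instant.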

▶ Run the Letterboxd Film & Review Scraper — film pages, user watchlists, curated lists and search; clean review text paired with star ratings; residential-proxy-backed, no browser. Configurable review and film caps.

Schema design for downstream use

A clean per-review record (the most common target for NLP):

{
  "film_title": "Past Lives",
  "film_year": 2023,
  "film_slug": "past-lives",
  "director": "Celine Song",
  "genres": ["Drama", "Romance"],
  "avg_rating": 4.2,
  "ratings_count": 184213,
  "reviewer": "cinephile_92",
  "review_rating": 4.5,
  "review_text": "A quiet film about the lives we don't live...",
  "review_date": "2026-04-11",
  "likes": 38,
  "spoiler": false,
  "scraped_at": "2026-06-01T10:00:00Z"
}

Schema choices worth making early:

Denormalize film fields onto each review row (or keep a clean join key like film_slug). For NLP you want each review self-contained with its film’s metadata; don’t force a join at train time.
Keep review_rating and avg_rating separate. The reviewer’s personal star is your training label; the film’s average is a feature. Conflating them ruins a sentiment model.
Respect the spoiler flag. It’s a real, user-set field. For a public-facing product you’ll want to hide spoiler reviews by default; for research you may want to study them separately.
Keep the cleaned text as delivered. The review text arrives already stripped of HTML; store it as-is rather than collapsing whitespace further, so that if you later need paragraph breaks for readability you don’t have to re-fetch.
Always store scraped_at. Ratings and review counts drift; a film’s average today isn’t its average next year.
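A minimal sketch of the first, second and last choices above, assuming you hold film metadata and reviews as separate dicts using the field names from the example record: it copies the film fields onto every review row, leaves review_rating and avg_rating as distinct keys, and stamps each row with a UTC scraped_at. The helper name and the FILM_FIELDS tuple are illustrative.

```python
from datetime import datetime, timezone
from typing import List

FILM_FIELDS = ("film_title", "film_year", "film_slug", "director",
               "genres", "avg_rating", "ratings_count")


def denormalize(film: dict, reviews: List[dict], scraped_at: datetime) -> List[dict]:
    """Copy film metadata onto each review row and timestamp the snapshot."""
    stamp = scraped_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    film_part = {key: film.get(key) for key in FILM_FIELDS}
    # Review fields are merged after film fields; avg_rating (a feature) and
    # review_rating (the label) remain separate keys on every row.
    return [{**film_part, **review, "scraped_at": stamp} for review in reviews]


film = {"film_title": "Past Lives", "film_year": 2023, "film_slug": "past-lives",
        "director": "Celine Song", "genres": ["Drama", "Romance"],
        "avg_rating": 4.2, "ratings_count": 184213}
reviews = [{"reviewer": "cinephile_92", "review_rating": 4.5,
            "review_text": "A quiet film about the lives we don't live...",
            "review_date": "2026-04-11", "likes": 38, "spoiler": False}]
rows = denormalize(film, reviews, datetime(2026, 6, 1, 10, 0, tzinfo=timezone.utc))
```

Each resulting row is self-contained, so a training job can read it without joining back to a film table.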

Typical use cases

Recommendation training data — community ratings plus film metadata as input to a recommender.
Sentiment / NLP — the review-text-plus-star-rating pairing is a ready-made labeled dataset for fine-tuning or evaluation.
Film research & journalism — rating distributions, genre trends, how directors are received over time.
List intelligence — scrape community and editorial lists for catalogue aggregation or editorial projects.
Audience profiling — user watch histories and personal ratings to model viewing patterns.

The common thread: the value is in volume of paired text + rating. A few hundred reviews is a toy; tens of thousands across many films is a dataset you can actually train on.

Cost math

Pricing is pay-per-event: a small run-start fee plus a per-film event. Reviews come along with the films you pull, so the cost scales with how many films you scrape and how deep you go on reviews per film (which you cap).

The real cost driver here, compared with API-based scrapers, is residential proxy bandwidth: static HTML pages are small, but a residential pool isn’t free. Still, a focused dataset (say, the top few hundred films with a few hundred reviews each) should land in the single-digit to low-double-digit dollar range for a substantial labeled corpus. Compare that with building your own residential-proxy rotation and pacing logic, which is days of work and an ongoing proxy contract before you collect a single clean row.
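To reason about your own budget, the arithmetic can be sketched as event fees plus proxy bandwidth. Everything numeric in the example call is a placeholder, not actual pricing or a measured page size: plug in the fees from the actor's pricing page, your proxy provider's per-GB rate, and the reviews-per-page and page-size figures you observe. The model also assumes one request per film page plus one per page of reviews, and that proxy bandwidth is billed separately from events.

```python
import math


def estimate_cost(films, reviews_per_film, reviews_per_page, avg_page_kb,
                  run_start_usd, per_film_usd, proxy_usd_per_gb):
    """Rough cost model: pay-per-event fees plus residential-proxy bandwidth."""
    # One request for the film page, plus one per page of reviews.
    requests_per_film = 1 + math.ceil(reviews_per_film / reviews_per_page)
    total_requests = films * requests_per_film
    bandwidth_gb = total_requests * avg_page_kb / 1_000_000  # KB -> GB (decimal)
    event_usd = run_start_usd + films * per_film_usd
    proxy_usd = bandwidth_gb * proxy_usd_per_gb
    return {
        "requests": total_requests,
        "bandwidth_gb": bandwidth_gb,
        "event_usd": event_usd,
        "proxy_usd": proxy_usd,
        "total_usd": event_usd + proxy_usd,
    }


# All values below are placeholders for illustration only.
estimate = estimate_cost(films=300, reviews_per_film=240, reviews_per_page=12,
                         avg_page_kb=100, run_start_usd=0.01, per_film_usd=0.01,
                         proxy_usd_per_gb=8.0)
```

The useful part is the shape of the formula: the reviews-per-film cap multiplies the request count, so it moves the proxy bill far more than the per-film event fee does.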

Common pitfalls

Skipping the residential proxy. This is the number-one failure. Datacenter IPs get blocked fast; budget for residential or expect a tiny, throttled dataset.
Over-concurrency. More parallel requests from one IP trigger the limiter sooner. Steady pacing across a rotating pool beats a burst.
Pulling all reviews on a blockbuster. A hugely popular film has thousands of reviews; uncapped, you’ll spend a lot for diminishing NLP value. Cap reviews per film.
Half-star ratings. Letterboxd uses a 0.5–5.0 scale in half-star steps. Treat the rating as a float, not an int, or you’ll lose half the resolution of your labels.
Spoiler reviews leaking into a product. Honor the flag — surfacing spoiler reviews unfiltered is a fast way to annoy users.
Assuming a stable average. Film averages and rating counts move; timestamp every snapshot if you care about temporal analysis.
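Two of these pitfalls, half-star ratings and spoiler leakage, are cheap to guard against in code. The sketch below uses illustrative helper names and the review_rating and spoiler fields from the example record: one function keeps ratings as floats and rejects values off the 0.5 to 5.0 half-star grid, the other hides spoiler-flagged rows unless you explicitly ask for them.

```python
from typing import Iterable, List, Optional


def normalize_rating(value) -> Optional[float]:
    """Return the rating as a float on the 0.5-5.0 half-star grid, or None if unrated."""
    if value is None:
        return None
    rating = float(value)  # int(4.5) would silently become 4, so never cast to int
    on_half_star_grid = (rating * 2) == int(rating * 2)
    if not (0.5 <= rating <= 5.0) or not on_half_star_grid:
        raise ValueError(f"not a valid half-star rating: {value!r}")
    return rating


def visible_reviews(rows: Iterable[dict], include_spoilers: bool = False) -> List[dict]:
    """Drop spoiler-flagged reviews unless the caller opts in."""
    return [row for row in rows if include_spoilers or not row.get("spoiler", False)]
```

Failing loudly on an off-grid rating is a deliberate choice here: a value like 4.3 usually means the film's average was written into the reviewer's rating column by mistake.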

Wrapping up

Letterboxd is the richest public source of long-form film reviews paired with clean numeric ratings — ideal for sentiment, recommendation, and reception research. The HTML is easy; the rate-limiter is the wall, and the answer is residential proxies plus disciplined pacing, not a heavyweight browser. If you want thousands of clean, labeled review records without building the proxy-and-pacing layer yourself, a maintained actor that runs residential-backed and caps your reviews and films is the fast path.

▶ Open the Letterboxd Scraper on Apify — films, reviews, ratings and watch histories across four scrape modes, residential-proxy-backed. Pay-per-event. Start on Apify’s free monthly credit.
