logiover
jobs · May 22, 2026 · 5 min read

How to Scrape Hacker News Who Is Hiring Jobs in 2026

Turn the monthly HN 'Who is Hiring?' thread into structured job data — parse company, role, salary, remote policy, tech stack and contact email from free-text comments.

Once a month, a single Hacker News thread becomes one of the best startup-hiring signals on the internet. The “Ask HN: Who is Hiring?” thread fills with hundreds of job postings written by the companies themselves — no recruiter middle-layer, no job-board SEO spam, just founders and engineering leads saying who they need. The problem is the format: it’s a flat list of free-text comments with no structure at all. This guide is about turning that wall of prose into a clean, queryable jobs dataset in 2026.

Why this thread is different from a job board

Most job-board scrapers extract structured fields the site already exposes. Here there are no fields to extract: every posting is a blob of text a human typed into a comment box. One posting starts with “Acme Corp | Senior Backend | Remote (EU) | $140k-180k”; the next is three paragraphs of narrative that buries the salary in the last sentence. There’s no schema. The work is parsing, not crawling.

That’s the whole value proposition: the scraper applies header parsing, regex scanning, and keyword detection to convert free text into the same structured record every time, so you can filter and sort across hundreds of postings the way you would on a real ATS.

How discovery works

Underneath, the scraper rides the Algolia Hacker News API, which requires no API key. The flow:

  1. Find the monthly “Who is Hiring?” thread automatically, supply a specific thread ID yourself, or run a full-text HN search.
  2. Fetch the top-level comments of that thread — each is one candidate posting.
  3. Strip HTML entities and tags from the comment body.
  4. Apply heuristics to discard non-postings (meta-comments, “great thread!” replies).
  5. Regex-parse each real posting into structured fields.
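
Steps 1 and 2 can be sketched in a few lines of standard-library Python. This is an illustration, not the scraper's actual code: the helper names are hypothetical, and it assumes the monthly threads are stories posted by the `whoishiring` account and that the Algolia item endpoint returns replies under a `children` key.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

ALGOLIA = "https://hn.algolia.com/api/v1"

def thread_search_url(month_label: str) -> str:
    """Search URL for stories by the `whoishiring` account (assumed thread author)."""
    params = {
        "query": f"Ask HN: Who is hiring? ({month_label})",
        "tags": "story,author_whoishiring",
        "hitsPerPage": 5,
    }
    return f"{ALGOLIA}/search_by_date?{urllib.parse.urlencode(params)}"

def item_url(thread_id: int) -> str:
    """URL that returns a thread together with its nested comments."""
    return f"{ALGOLIA}/items/{thread_id}"

def get_json(url: str) -> dict:
    """Plain GET; no API key, proxy, or browser involved."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def pick_thread_id(search_result: dict) -> Optional[int]:
    """First hit whose title is a hiring thread, skipping the sibling threads."""
    for hit in search_result.get("hits", []):
        title = (hit.get("title") or "").lower()
        if "who is hiring" in title:
            return int(hit["objectID"])
    return None

def top_level_comments(item: dict) -> list:
    """Direct children of the thread; deleted comments carry no text."""
    return [c for c in item.get("children", []) if c.get("text")]
```

A run would chain them as `top_level_comments(get_json(item_url(pick_thread_id(get_json(thread_search_url("May 2026"))))))`, with a check for `None` when no thread is found.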

No proxy, no API key, no headless browser, no AI model in the loop — just deterministic parsing. That matters for cost and reliability: there’s no LLM token bill and no nondeterministic output drift between runs.

Run the Hacker News Who Is Hiring Scraper — converts monthly HN hiring threads into structured rows: company, role, location, salary, remote policy, tech stack and contact email. No AI, no API key, no proxy.

What gets parsed out of each posting

From a single free-text comment, the parser pulls:

  • Company and role — from the first-line header, the near-universal Company | Role | Location | Comp convention.
  • Location and remote policy — classified into Remote / Hybrid / Onsite.
  • Salary — the raw compensation text when present (many postings omit it; that’s a data point too).
  • Visa sponsorship — whether the posting mentions it.
  • Tech stack — detected against a dictionary of 40+ languages, frameworks, databases, and cloud/DevOps/AI tools.
  • Apply URL and contact email — the first application link and email address found in the body.
  • Full posting text — the complete plain-text comment, so nothing is lost.
  • Source — posting timestamp and HN permalink.
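
To make the field list concrete, here is a minimal sketch of that kind of deterministic parsing. The regexes, the eight-entry keyword dictionary, and the function names are assumptions chosen for illustration; the scraper's real rules and its 40+ entry dictionary are not shown in this article.

```python
import re

# Tiny illustrative dictionary, not the scraper's real keyword list.
TECH_KEYWORDS = {
    "python": "Python", "rust": "Rust", "typescript": "TypeScript",
    "react": "React", "postgresql": "PostgreSQL", "postgres": "PostgreSQL",
    "kubernetes": "Kubernetes", "aws": "AWS",
}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r'https?://[^\s<>")]+')
SALARY_RE = re.compile(
    r"[$€£]\s?\d[\d,.]*\s?k?(?:\s?[-–]\s?[$€£]?\s?\d[\d,.]*\s?k?)?",
    re.IGNORECASE,
)

def first_match(pattern, text):
    """Raw text of the first match, trimmed of trailing punctuation."""
    m = pattern.search(text)
    return m.group(0).strip(" .,;") if m else None

def classify_remote(text):
    """Keyword classification; phrases like 'no remote' will fool it."""
    lowered = text.lower()
    if "hybrid" in lowered:
        return "Hybrid"
    if "remote" in lowered:
        return "Remote"
    if re.search(r"\bon-?site\b|\bin[- ]office\b", lowered):
        return "Onsite"
    return None

def detect_stack(text):
    """Whole-word, case-insensitive matches against the keyword dictionary."""
    found = []
    for keyword, label in TECH_KEYWORDS.items():
        pattern = rf"(?<![\w+#]){re.escape(keyword)}(?![\w+#])"
        if re.search(pattern, text, re.IGNORECASE) and label not in found:
            found.append(label)
    return found

def parse_posting(text):
    """Turn one cleaned plain-text comment into a structured row."""
    header = [p.strip() for p in text.split("\n", 1)[0].split("|")]
    header = [p for p in header if p]
    return {
        "company": header[0] if len(header) >= 1 else None,
        "role": header[1] if len(header) >= 2 else None,
        "location": header[2] if len(header) >= 3 else None,
        "remote_policy": classify_remote(text),
        "salary": first_match(SALARY_RE, text),
        "visa_sponsorship": bool(re.search(r"\bvisa\b", text, re.IGNORECASE)),
        "tech_stack": detect_stack(text),
        "apply_url": first_match(URL_RE, text),
        "contact_email": first_match(EMAIL_RE, text),
        "full_text": text,
    }
```

Note that the header fields are positional, so a posting that skips the pipe-delimited first line yields a wrong `company` and empty `role` and `location`, while the regex-scanned fields still work. That asymmetry is why keeping `full_text` matters.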

A clean per-posting schema

{
  "company": "Acme Corp",
  "role": "Senior Backend Engineer",
  "location": "Remote (EU timezones)",
  "remote_policy": "Remote",
  "salary": "$140k–180k + equity",
  "visa_sponsorship": false,
  "tech_stack": ["Python", "PostgreSQL", "Kubernetes", "AWS"],
  "apply_url": "https://acme.example/jobs/be-senior",
  "contact_email": "jobs@acme.example",
  "posted_at": "2026-05-01T16:22:00Z",
  "source_url": "https://news.ycombinator.com/item?id=39500001",
  "full_text": "Acme Corp | Senior Backend | Remote (EU) | $140k-180k ...",
  "scraped_at": "2026-05-22T10:00:00Z"
}

Schema notes:

  • Keep salary as raw text, not parsed numbers. Formats vary wildly (“competitive”, “$140-180k”, “€90k”, “DOE”). Normalize downstream where you can see the distribution.
  • tech_stack is detected, not declared. It reflects keyword matches in the prose, so treat it as high-recall but imperfect — good for trend aggregation, not a contractual skills list.
  • Empty salary is signal. The share of postings that hide comp is itself a useful market metric.
  • full_text is your safety net. When the parser misses a field, you can re-derive it later without re-scraping.
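
As one possible downstream normalization of the raw salary text, the sketch below pulls a currency symbol and an annual range out of common formats. The function name and the plausibility bounds are assumptions; hourly rates, currency conversion, and words like “DOE” are deliberately left unparsed.

```python
import re
from typing import Optional, Tuple

AMOUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k)?", re.IGNORECASE)

def salary_range(raw: Optional[str]) -> Optional[Tuple[Optional[str], int, int]]:
    """Return (currency_symbol, low, high) or None when nothing parses."""
    if not raw:
        return None
    found = AMOUNT_RE.findall(raw)
    any_k = any(k for _, k in found)
    values = []
    for number, k in found:
        value = float(number.replace(",", ""))
        # "$140-180k": the bare 140 inherits the "k" of its partner.
        if k or (any_k and value < 1000):
            value *= 1000
        # Assumed bounds for an annual figure; drops "0.5%" or "4-day week".
        if 10_000 <= value <= 2_000_000:
            values.append(int(value))
    if not values:
        return None
    low, high = min(values[:2]), max(values[:2])
    symbol = re.search(r"[$€£]", raw)
    return (symbol.group(0) if symbol else None, low, high)
```

Returning `None` instead of guessing keeps “competitive” and missing salaries distinguishable from parsed ranges when you chart the distribution.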

Use cases

  • Job search and alerts — filter postings by tech stack, remote-only, or salary floor; run it the day each new monthly thread drops and get only your matches.
  • Recruiter and HR market intelligence — track which competitors are hiring, for what roles, and how aggressively, month over month.
  • Salary benchmarking — build a historical dataset across many months to chart comp ranges by role and stack.
  • In-demand skills research — aggregate detected tech stacks to see which frameworks and tools startups are actually staffing for.
  • Job-alert bots and pipelines — feed structured rows into a spreadsheet, database, or Slack/Telegram bot.
  • Candidate sourcing — the companion “Who wants to be hired?” threads parse the same way for inbound talent.
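
For the job-alert case, filtering structured rows is a few lines. This sketch assumes rows shaped like the schema above; the `matches` helper and its parameters are hypothetical.

```python
def matches(row, wanted_stack=(), remote_only=True, require_salary=False):
    """True when a row passes every enabled filter."""
    stack = {t.lower() for t in row.get("tech_stack") or []}
    wanted = {t.lower() for t in wanted_stack}
    if wanted and not stack & wanted:
        return False  # none of the wanted technologies were detected
    if remote_only and row.get("remote_policy") != "Remote":
        return False
    if require_salary and not row.get("salary"):
        return False
    return True

def select(rows, **filters):
    """Keep only the rows that match; feed the result to a sheet or bot."""
    return [row for row in rows if matches(row, **filters)]
```

Because `tech_stack` is detected rather than declared, treat a miss as “probably not mentioned” and keep the filters loose for a first pass.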

The defining trait of this dataset is trust: postings come straight from hiring companies, so you avoid the aggregator noise, expired-listing decay, and duplicate cross-posting that plague general job boards.

Multi-month historical scraping

The single most valuable mode is the backfill. Each month’s thread is its own item; point the scraper at a range of months and it assembles a longitudinal hiring dataset. That’s where salary trends, remote-policy shifts, and the rise of specific AI tooling become visible — patterns invisible in any single thread.
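
Once several months of rows are collected, a longitudinal summary is a simple group-by. The sketch below assumes rows follow the schema above and uses the month of `posted_at` as a stand-in for the thread month, which is an approximation; grouping by thread ID would be more exact if you store it.

```python
from collections import Counter, defaultdict

def monthly_summary(rows):
    """Per-month posting count, salary-hidden share, remote share, top tech."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["posted_at"][:7]].append(row)  # "2026-05-01T..." -> "2026-05"

    summary = {}
    for month in sorted(by_month):
        group = by_month[month]
        total = len(group)
        missing = sum(1 for r in group if not r.get("salary"))
        remote = sum(1 for r in group if r.get("remote_policy") == "Remote")
        stack = Counter(t for r in group for t in r.get("tech_stack") or [])
        summary[month] = {
            "postings": total,
            "salary_missing_share": round(missing / total, 2),
            "remote_share": round(remote / total, 2),
            "top_tech": [name for name, _ in stack.most_common(3)],
        }
    return summary
```

Plotting `salary_missing_share` and `remote_share` across months is the kind of trend that no single thread can show.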

Cost math

Pricing is pay-per-event with a negligible per-run start fee and no per-result charge on this actor — you pay essentially only for compute, which is trivial because there’s no browser and no proxy. A full month’s thread is a few hundred postings parsed in seconds. Backfilling a year of threads is still cheap enough to treat as a one-off, and a monthly scheduled run to catch each new thread costs effectively nothing.

Against a DIY build, you’re not avoiding infrastructure; you’re avoiding the parser. Header parsing, remote-policy classification, salary extraction, and a maintained keyword dictionary of 40+ tools are the entire job, and they’re exactly the brittle parts that break when posting conventions drift.

Common pitfalls

  • Not every comment is a job. Threads contain meta-replies and follow-up questions; rely on the posting heuristics rather than treating every top-level comment as a listing.
  • Header conventions drift. Most postings use the Company | Role | Location | Comp header, but some don’t. Keep full_text so misparsed rows are recoverable.
  • Salary is frequently absent or vague. Don’t drop postings that lack it — model the absence.
  • HTML entities in bodies. Comments arrive with &#x27; and <p> tags; the scraper strips them, but if you parse raw yourself, decode first.
  • Three monthly threads, not one. “Who is Hiring?”, “Who wants to be hired?”, and “Freelancer? Seeking freelancer?” are separate monthly threads with different shapes, so don’t mix them in one dataset.
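
If you do parse raw comment bodies yourself, the first two pitfalls can be handled roughly as follows. The cleaning order and the `looks_like_posting` heuristic are assumptions for illustration, not the scraper's actual rules.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
BREAK_RE = re.compile(r"<p>|<br\s*/?>", re.IGNORECASE)

def clean_comment(raw: str) -> str:
    """Convert an HN comment body from HTML to plain text."""
    text = BREAK_RE.sub("\n", raw)  # paragraph breaks become newlines
    text = TAG_RE.sub("", text)     # drop remaining tags such as <a> and <i>
    # Decode entities last, so an escaped "&lt;" is not mistaken for a tag.
    return html.unescape(text).strip()

def looks_like_posting(text: str, min_chars: int = 80) -> bool:
    """Assumed heuristic: a pipe-delimited header, or a long comment with a contact."""
    first_line = text.split("\n", 1)[0]
    has_header = first_line.count("|") >= 2
    has_contact = bool(re.search(r"https?://|\S+@\S+\.\S+", text))
    return has_header or (len(text) >= min_chars and has_contact)
```

The threshold of 80 characters is arbitrary; tune it against a thread you have checked by hand, and keep rejected comments around so false negatives can be recovered.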

Wrapping up

The HN “Who is Hiring?” thread is a goldmine wrapped in unstructured text. If you only want this month’s matches for a personal search, skimming the thread by hand is fine. If you want filterable alerts, salary benchmarks, or a year-long startup-hiring trend dataset, use a scraper that already handles the parsing and hands you one clean row per posting.

Open the Who Is Hiring Scraper on Apify — structured monthly HN jobs with salary, remote policy, tech stack and contact email. Schedule it monthly. Pay-per-event, start on Apify’s free credit.
