business · May 19, 2026 · 6 min read

Y Combinator Startup Directory — A Full Export Guide

How to extract every Y Combinator-backed company from the public directory — 5,900+ startups since 2005 — for B2B sales, investor research, and competitive intelligence.

Y Combinator has funded more than 5,900 companies since 2005, and it lists every single one of them in a public directory at ycombinator.com/companies. This is one of the highest-signal datasets in the entire startup world — every entry was vetted by the most influential accelerator on the planet, and the filters (batch, industry, status, team size) let you slice the data with surprising precision. This guide walks through what’s in the directory, how to extract it cleanly, and what people actually do with the data once they have it.

What’s in the YC directory

The public directory exposes roughly 25 fields per company, including:

Identity — company name, slug (URL handle), one-line description.
Batch — the YC batch it was part of (e.g., “S22” for Summer 2022, “W24” for Winter 2024). Going back to S05.
Status — Active, Acquired, Public, or Inactive.
Industry — primary tag (e.g., “AI”, “Fintech”, “Healthcare”).
Sub-industry — finer-grained tag set, often 2–4 per company.
Location — primary office city/country, plus a remote flag.
Team size — current employee count, bucketed.
Year founded — calendar year of incorporation.
Website — current company URL.
Tags — free-text product tags (“B2B”, “SaaS”, “Open Source”, etc.).
Founders — names, sometimes LinkedIn or Twitter handles.

The directory is the canonical source — YC keeps it current. If a company gets acquired, its status flips. If it dies, its status becomes Inactive. New batches show up the day after Demo Day.

How the public directory works under the hood

The YC team built the directory on a paginated JSON API behind a React front-end. Two things matter:

The HTML page is rendered client-side, so a plain curl of ycombinator.com/companies returns mostly empty HTML. The data lives in API calls.
The API is unauthenticated but rate-limited at the IP level. With reasonable pacing (~1 request per second), you can sweep the full directory without trouble.

The internal endpoint paginates with ?page=N and returns 50–100 companies per page, so sweeping the full ~5,900-company catalog takes roughly 60–120 sequential requests.
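A paced page-by-page sweep could look roughly like the sketch below. It is a minimal illustration, not YC's actual API contract: fetch_page is a hypothetical callable you would write yourself to wrap the HTTP request, and sweep_directory, delay_seconds, and max_pages are names invented here. The only behavior taken from the text above is ?page=N pagination and pacing of about one request per second.

```python
import time
from typing import Callable, Iterator

# Hypothetical signature: takes a 1-based page number and returns that page's
# list of company dicts (an empty list once the pages run out).
FetchPage = Callable[[int], list[dict]]


def sweep_directory(
    fetch_page: FetchPage,
    delay_seconds: float = 1.0,
    max_pages: int = 500,
) -> Iterator[dict]:
    """Yield every company by walking pages until one comes back empty."""
    for page in range(1, max_pages + 1):
        companies = fetch_page(page)
        if not companies:
            return  # an empty page is treated as the end of the catalog
        yield from companies
        time.sleep(delay_seconds)  # pace requests to stay under the rate limit
```

Keeping the HTTP call behind fetch_page means the pacing and stop condition can be tested with a fake fetcher before you point it at the real endpoint. The max_pages cap is a safety stop in case a page never comes back empty.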

What clean output looks like

A row per company with the YC directory’s full field set:

{
  "yc_id": "abridge",
  "name": "Abridge",
  "batch": "S18",
  "status": "Active",
  "industry": "Healthcare",
  "sub_industries": ["AI", "B2B"],
  "tags": ["AI", "Healthcare", "B2B", "SaaS"],
  "year_founded": 2018,
  "one_liner": "AI-powered medical conversation tools.",
  "long_description": "...",
  "team_size": 250,
  "location": "Pittsburgh, PA, USA",
  "is_remote_friendly": true,
  "website": "https://abridge.com",
  "yc_url": "https://www.ycombinator.com/companies/abridge",
  "founders": [
    { "name": "Shivdev Rao", "title": "CEO", "linkedin": "https://linkedin.com/in/shivdevrao" }
  ],
  "scraped_at": "2026-05-19T12:00:00Z"
}

Schema choices to make early:

Always store the yc_id (slug) as your natural key. Company names change as companies pivot; the YC slug is stable.
Keep tags as a denormalized array, not a join table. You’ll filter by tag combinations constantly.
Store scraped_at so you can detect status flips. The interesting signal is “company moved from Active to Acquired since last scrape.”
Founders are an array of objects, not a comma-separated string. Treat them as first-class records you can query independently.
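To make the schema choices above concrete, here is a small sketch of what the slug key and the tag array buy you. The function names and the sample rows are made up for illustration; the rows only reuse the field names from the example record (yc_id, status, tags).

```python
def index_by_slug(rows):
    """Key a snapshot by yc_id, the stable slug."""
    return {row["yc_id"]: row for row in rows}


def status_flips(previous, current):
    """Return (yc_id, old_status, new_status) for companies whose status changed."""
    old = index_by_slug(previous)
    flips = []
    for row in current:
        before = old.get(row["yc_id"])
        # Companies missing from the older snapshot are new, not flipped.
        if before is not None and before["status"] != row["status"]:
            flips.append((row["yc_id"], before["status"], row["status"]))
    return flips


def with_all_tags(rows, required):
    """Keep rows whose tags array contains every tag in `required`."""
    wanted = set(required)
    return [row for row in rows if wanted <= set(row.get("tags", []))]


# Two made-up snapshots from consecutive scrapes.
previous = [
    {"yc_id": "alpha", "status": "Active", "tags": ["AI", "B2B"]},
    {"yc_id": "beta", "status": "Active", "tags": ["Fintech", "B2B", "SaaS"]},
]
current = [
    {"yc_id": "alpha", "status": "Acquired", "tags": ["AI", "B2B"]},
    {"yc_id": "beta", "status": "Active", "tags": ["Fintech", "B2B", "SaaS"]},
    {"yc_id": "gamma", "status": "Active", "tags": ["AI", "Healthcare"]},
]

print(status_flips(previous, current))  # [('alpha', 'Active', 'Acquired')]
print([r["yc_id"] for r in with_all_tags(current, ["AI", "B2B"])])  # ['alpha']
```

Note that scraped_at alone does not reveal a flip. You need at least two snapshots to compare, and the timestamp tells you which window the change happened in.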

▶ Run the Y Combinator Directory Scraper — full directory export, filters by batch, industry, status, location. Optional per-company founder enrichment. Pay per company returned.

Why people pull this data

The YC directory is one of the most-used B2B datasets on the open web. Common use cases:

B2B sales targeting: every YC company is a high-conviction prospect. Filter by batch (W23 companies are fundraising and hiring; S15 companies are at scale and buying enterprise tools), industry, team size.
Investor research: VCs track new batches the day after Demo Day. Knowing the founders, the pitch, and the team backgrounds within hours is a competitive advantage.
Talent sourcing: recruiters target acquired YC companies (acquihires) or recently funded YC companies (growth-stage hiring sprees).
Competitive intelligence: when your competitor is YC-backed, knowing their batch (and therefore their fundraising posture and growth stage) is a useful signal.
Market mapping: for any sub-vertical (e.g., “AI for healthcare”), the YC directory gives you a high-signal subset to study before broadening to the whole market.
Press / journalism: the directory is the master list of who to cover, who to reach out to, who’s making news.

The common thread: freshness matters. A directory snapshot from a year ago misses every batch and Demo Day since then, plus every status flip from companies that exited or died.

Pulling enriched data per company

The directory page lists each company’s basics, but each company also has a deep page at ycombinator.com/companies/<slug> that adds:

Longer description (2–3 paragraphs vs. the one-liner)
LinkedIn and Twitter handles for each founder
Specific job openings (when the company has jobs posted)
Press logos and notable mentions
Sometimes a podcast or video appearance

If you want full enrichment, scrape the listing pages first (cheap, fast, paginated) and then each per-company deep page (one request per company). For 5,900 companies at ~1 req/sec, the deep pages take about 100 minutes of pulling (5,900 seconds), well within a single Apify run.
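The arithmetic behind that estimate can be sketched as a small helper. The function and parameter names are invented here; the inputs are the figures quoted in this guide (about 5,900 companies, 50–100 per listing page, roughly one request per second).

```python
import math


def estimate_sweep(companies, page_size, requests_per_second=1.0, enrich=True):
    """Return (total_requests, minutes) for a listing sweep plus optional deep pages."""
    listing_requests = math.ceil(companies / page_size)
    detail_requests = companies if enrich else 0  # one deep page per company
    total = listing_requests + detail_requests
    minutes = total / requests_per_second / 60
    return total, minutes


total, minutes = estimate_sweep(5900, page_size=100)
print(total, round(minutes))  # 5959 99

listing_only, _ = estimate_sweep(5900, page_size=50, enrich=False)
print(listing_only)  # 118
```

The deep pages dominate the cost. The listing sweep is a couple of minutes either way, so enrichment is the part worth making optional or incremental.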

Build it yourself vs. use a managed scraper

YC’s directory is not aggressively defended. You can write your own scraper in a few hours and pull the full dataset on day one.

The reasons to use a managed actor instead:

Schema is already normalized — the YC site shifts column meanings every few quarters; the managed actor tracks those changes.
Per-company enrichment is built in — pulling founder LinkedIn handles requires a second-level scrape that’s annoying to add later.
Status-change detection — most managed scrapers can be set to dataset-mode (incremental updates), so you only get rows that changed since the last run. Useful for status-flip alerts.

For a one-time research dump, build it yourself. For an ongoing pipeline that feeds your sales tool or investor dashboard, use a managed actor.

Pitfalls

A few traps when building your own:

Batch naming: don’t assume the code format is fixed. The letter-plus-two-digit-year pattern starts at S05 (“Summer 2005”), and summer and winter batches (S and W) cover most of YC’s history, but newer batches use other letters, such as F24 for Fall 2024 and X25 for Spring 2025.
Status definitions changed mid-history: “Inactive” is a relatively recent addition to the directory; older companies that quietly died are marked Inactive but their year-of-death isn’t surfaced.
Multiple founders, one role: about 30% of YC companies have co-founders both listed as CEO. Your schema should not assume role uniqueness.
Location format is inconsistent: “San Francisco”, “San Francisco, CA”, and “San Francisco, CA, USA” all appear for the same city. Normalize at ingestion or pay for that at every query.
Acquired company data drifts: once a company is acquired, YC sometimes stops updating its deep page. Capture the data at scrape time and expect it to age out.
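Two of these traps, batch codes and location strings, lend themselves to small defensive helpers at ingestion. The sketch below is one possible approach under stated assumptions: the season map only covers the S and W letters spelled out in this guide and passes any other letter through unchanged, and location_key is a deliberately crude grouping key, not a real geocoder. All names are hypothetical.

```python
import re

# Season letters spelled out in this guide; any other letter is kept as-is
# rather than rejected, since the set of batch letters is not fixed.
KNOWN_SEASONS = {"S": "Summer", "W": "Winter"}
BATCH_PATTERN = re.compile(r"^([A-Z])(\d{2})$")


def parse_batch(code):
    """Split a batch code like "S22" into (season, year); tolerate unknowns."""
    match = BATCH_PATTERN.match(code.strip().upper())
    if match is None:
        return None  # unrecognized format: flag for review instead of guessing
    letter, two_digit_year = match.groups()
    season = KNOWN_SEASONS.get(letter, letter)
    # YC's first batch was 2005, so two-digit years are read as 20xx.
    return season, 2000 + int(two_digit_year)


def location_key(raw):
    """Crude grouping key: the lowercased city component before the first comma."""
    city = raw.split(",")[0]
    return " ".join(city.lower().split())


print(parse_batch("S05"))  # ('Summer', 2005)
print(parse_batch("X25"))  # ('X', 2025)
print(location_key("San Francisco, CA, USA"))  # san francisco
```

The city-only key collapses the three San Francisco variants into one value, but it would also merge different cities that share a name, so keep the raw string alongside it.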

Legal notes

YC’s directory is explicitly public — they want people to find their portfolio. The data is fine to ingest for internal use, sales prospecting, and research. The things to avoid:

Republishing the directory verbatim on a competing site (copyright concern).
Mass-emailing founders using contact data scraped from the directory (CAN-SPAM in the US; GDPR if any founders are in the EU).
Reselling the dataset wholesale as a standalone product (might run afoul of YC’s terms).

Using the data internally — to inform sales, to map markets, to build a private dashboard for your team — is squarely in bounds.

Wrapping up

The YC directory is one of the most useful public startup datasets in existence, but pulling it cleanly takes a bit of care because the data lives behind a client-rendered React app, not raw HTML. A small scraper handles it in an afternoon; a managed actor handles it in five minutes and keeps the schema current as YC tweaks its directory.

▶ Open the Y Combinator Directory Scraper on Apify — every YC company since 2005, filterable by batch / industry / status. Pay per company.