logiover
developer-tools · May 30, 2026 · 6 min read

How to Scrape GitHub Repository Data in 2026

Bulk-extract public GitHub repos by search query — stars, forks, language, topics, owner, license and activity dates — via the official Search API, normalized to flat rows.

GitHub is the de facto registry of open-source software, and its public metadata — stars, forks, languages, topics, activity dates — is a goldmine for OSINT, deal-flow scouting, dependency tracking, and developer-talent sourcing. The catch is that GitHub’s official Search API, while clean and well-documented, has hard ceilings: aggressive rate limits and a 1,000-result cap per query. Getting a complete dataset is less about access and more about working within those limits intelligently. This guide covers what the API exposes, how to extract at scale despite the ceilings, and how to model the output.

What’s worth extracting

The GitHub Search API returns repository-level metadata. Flattened, each repo row carries:

  • Identity — canonical owner/name, repository ID, description, and homepage URL.
  • Owner — login and account type (user vs. organization).
  • Language and topics — primary language plus the topic tag array.
  • Licensing and status — license (SPDX) and archived flag.
  • Activity — creation date, last metadata update, and last code push.
  • Signals — star count, fork count, watcher count, open-issue count.

That’s the core surface for ranking, trend-spotting, and target-list building. Owner and license arrive as nested objects from the API, and topics as an array; the GitHub Repository Scraper actor (the Apify actor this guide refers to throughout) flattens the nested objects into spreadsheet- and warehouse-friendly columns.
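The flattening step can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual code: `flatten_repo` is a hypothetical helper, the input keys are the field names the repository search response uses, and the output keys follow the flat row shown later in this guide.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def flatten_repo(item: Dict[str, Any], scraped_at: Optional[datetime] = None) -> Dict[str, Any]:
    """Flatten one repository item from a search response into a single-level row."""
    owner = item.get("owner") or {}
    license_info = item.get("license") or {}  # license is null when none is detected
    captured = (scraped_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "full_name": item["full_name"],
        "repo_id": item["id"],
        "owner_login": owner.get("login"),
        "owner_type": owner.get("type"),
        "description": item.get("description"),
        "homepage": item.get("homepage"),
        "language": item.get("language"),
        "topics": list(item.get("topics") or []),  # kept as an array on purpose
        "license": license_info.get("spdx_id"),
        "archived": item.get("archived", False),
        "stars": item.get("stargazers_count", 0),
        "forks": item.get("forks_count", 0),
        "watchers": item.get("watchers_count", 0),
        "open_issues": item.get("open_issues_count", 0),
        "created_at": item.get("created_at"),
        "updated_at": item.get("updated_at"),
        "pushed_at": item.get("pushed_at"),
        "scraped_at": captured.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Missing nested objects fall back to empty dicts, so a repo without a detected license yields `"license": None` instead of raising.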

The extraction reality: rate limits and the 1,000-result ceiling

There’s no anti-bot wall — GitHub wants you using the API. The real constraints are structural:

  • Rate limits. The Search API has its own, tighter rate limit than the core API: roughly 30 requests/minute when authenticated and 10 requests/minute when not. The actor respects rate-limit and retry headers, backing off when GitHub signals it, so a long sweep doesn’t get throttled into failure.
  • The 1,000-result cap. This is the big one. Any single search query returns at most 1,000 results, no matter how many repos actually match. A query like language:python matches millions but yields only the first 1,000.
  • The workaround: query windowing. To pull more than 1,000 of anything, you slice the search space into windows that each match no more than 1,000 repos (typically by created or pushed date ranges, or by star/fork bands) and union the results. The actor’s “unlimited mode” automates this windowing, so a query returns every matching repo rather than only the first 1,000.
  • Pagination to the ceiling. Within each window, the actor pages through results up to the cap (at most 100 results per page, so 10 pages per window), with sorting and ordering applied server-side.
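For the rate-limit bullet, one way to decide how long to pause is to read the response headers GitHub sends (`retry-after`, `x-ratelimit-remaining`, `x-ratelimit-reset`). The sketch below is an assumption about how such a backoff could look, not the actor's implementation; `seconds_to_wait` is a hypothetical name, and the 60-second fallback is a conservative default chosen for illustration.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(status: int, headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to pause before the next request (0.0 means proceed)."""
    now = time.time() if now is None else now
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive

    # An explicit Retry-After (in seconds) takes priority.
    if "retry-after" in h:
        return float(h["retry-after"])

    # Quota exhausted: wait until the reset time, given as epoch seconds.
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        return max(0.0, float(h["x-ratelimit-reset"]) - now)

    # Throttled without guidance: assumed fallback of one minute.
    if status in (403, 429):
        return 60.0

    return 0.0
```

Passing `now` explicitly keeps the function deterministic, which makes the backoff logic easy to unit-test without sleeping.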

Understanding the 1,000-cap is the single most important thing about scraping GitHub. Anyone who ignores it silently gets a truncated dataset and doesn’t know it.
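The truncation is detectable if you compare the `total_count` field of a search response against the cap. A small sketch, assuming the largest page size is 100; `page_plan` is a hypothetical helper name:

```python
import math
from typing import Tuple

SEARCH_RESULT_CAP = 1000  # hard cap per query, as described above
MAX_PER_PAGE = 100        # assumed largest per_page value


def page_plan(total_count: int, per_page: int = MAX_PER_PAGE) -> Tuple[int, bool]:
    """Return (pages_to_fetch, truncated) for a query matching total_count repos."""
    reachable = min(total_count, SEARCH_RESULT_CAP)
    pages = math.ceil(reachable / per_page)
    return pages, total_count > SEARCH_RESULT_CAP
```

A query whose plan comes back with `truncated` set to `True` is the signal to window it rather than page it.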

Run the GitHub Repository Scraper — search repos by language, topic, stars, forks and dates; flat rows with owner, license, topics and activity signals. Unlimited mode windows around the 1,000-result cap.

How querying works

The inputs mirror GitHub’s search qualifiers:

query:        free text + qualifiers, e.g.
              "topic:rag language:python stars:>500 pushed:>2026-01-01"
language:     primary language filter
topic:        topic tag filter
stars/forks:  range qualifiers (>, <, ranges)
created:      date range
pushed:       last-push date range
filters:      archived / org / user scoping
sort:         stars | forks | updated | best-match
order:        desc | asc
mode:         capped (one window) | unlimited (auto date-windowed)
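If you call the API yourself, these inputs collapse into one search string. The helper below is an illustrative sketch (`build_query` and its parameter names are assumptions, not the actor's input schema); it only concatenates qualifiers in GitHub's `name:value` syntax.

```python
from typing import Optional


def build_query(
    text: str = "",
    language: Optional[str] = None,
    topic: Optional[str] = None,
    stars: Optional[str] = None,    # e.g. ">500" or "50..200"
    created: Optional[str] = None,  # e.g. "2025-01-01..2025-06-30"
    pushed: Optional[str] = None,   # e.g. ">2026-01-01"
    archived: Optional[bool] = None,
) -> str:
    """Join free text and qualifiers into a single search string."""
    parts = [text.strip()] if text.strip() else []
    qualifiers = (
        ("language", language),
        ("topic", topic),
        ("stars", stars),
        ("created", created),
        ("pushed", pushed),
    )
    for name, value in qualifiers:
        if value:
            parts.append(f"{name}:{value}")
    if archived is not None:
        parts.append(f"archived:{'true' if archived else 'false'}")
    return " ".join(parts)
```

The resulting string is what goes into the `q` parameter of the repository search endpoint, alongside `sort`, `order`, and `per_page`.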

A practical pattern for a complete dataset: pick a broad qualifier set (say, language:rust stars:>50), enable unlimited mode, and let the actor slice by created date windows so each slice stays under 1,000 and the union covers everything. For trend monitoring, schedule a recurring sort:updated run scoped to a topic and diff against last run to catch newly trending projects.
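The date slicing can be done adaptively by bisecting a range until each piece fits under the cap. This is one possible approach, not necessarily how unlimited mode works internally. `count` is a hypothetical callable you would supply (for example, one that runs the query for a window and reads `total_count`), and the code assumes `A..B` date ranges are inclusive on both ends.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

Window = Tuple[date, date]  # inclusive on both ends


def split_windows(
    start: date,
    end: date,
    count: Callable[[date, date], int],
    cap: int = 1000,
) -> List[Window]:
    """Bisect [start, end] until every window matches at most `cap` repos."""
    if start == end or count(start, end) <= cap:
        # A single day over the cap cannot be split by date;
        # it would need a second dimension such as star bands.
        return [(start, end)]
    mid = start + timedelta(days=(end - start).days // 2)
    # The right half starts the day after mid, so no day is dropped or counted twice.
    left = split_windows(start, mid, count, cap)
    right = split_windows(mid + timedelta(days=1), end, count, cap)
    return left + right


def created_qualifier(window: Window) -> str:
    """Format a window as a created: range qualifier."""
    return f"created:{window[0].isoformat()}..{window[1].isoformat()}"
```

Bisecting adapts to uneven density: quiet years stay as one window while busy months get split further, which keeps the request count lower than fixed daily windows would.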

Schema design for downstream use

A flat, query-friendly repo row:

{
  "full_name": "vercel/next.js",
  "repo_id": 70107786,
  "owner_login": "vercel",
  "owner_type": "Organization",
  "description": "The React Framework",
  "homepage": "https://nextjs.org",
  "language": "JavaScript",
  "topics": ["react", "nextjs", "ssr", "framework"],
  "license": "MIT",
  "archived": false,
  "stars": 132410,
  "forks": 28350,
  "watchers": 132410,
  "open_issues": 2841,
  "created_at": "2016-10-05T23:31:14Z",
  "updated_at": "2026-05-29T08:12:00Z",
  "pushed_at": "2026-05-30T06:44:00Z",
  "scraped_at": "2026-05-30T09:00:00Z"
}

Schema choices that matter:

  • Use repo_id, not full_name, as the join key. Repos get renamed and transferred; the numeric ID is permanent.
  • Distinguish updated_at from pushed_at. updated_at changes on any metadata edit (a star, a setting); pushed_at is real code activity. For “is this project alive?” use pushed_at.
  • Keep topics as an array. Don’t comma-join it; you’ll want to query and facet on individual topics.
  • Store scraped_at. Star and fork counts move daily; growth-rate analysis needs the capture time.
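  • Note that watchers mirrors stars. In search results, GitHub returns a watcher count equal to the star count for historical reasons, so don’t treat them as independent signals. The true watcher figure is subscribers_count, which appears only on the single-repository endpoint.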
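To show why repo_id and scraped_at matter together, here is a sketch of a growth-rate calculation across two snapshots of the same repo. The function names are hypothetical and the rows are assumed to follow the flat schema above.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2026-05-30T09:00:00Z."""
    # Swap the trailing Z for an explicit offset so older Python versions accept it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def stars_per_day(earlier: dict, later: dict) -> float:
    """Average daily star gain between two snapshots of one repository."""
    if earlier["repo_id"] != later["repo_id"]:
        raise ValueError("snapshots belong to different repositories")
    elapsed = parse_ts(later["scraped_at"]) - parse_ts(earlier["scraped_at"])
    days = elapsed.total_seconds() / 86400
    if days <= 0:
        raise ValueError("later snapshot must be captured after the earlier one")
    return (later["stars"] - earlier["stars"]) / days
```

Matching on repo_id means a rename between the two runs does not break the comparison.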

Typical use cases

  • OSINT and VC deal-flow scouting — surface fast-growing projects by topic, star threshold, and recent push activity.
  • Dependency and supply-chain tracking — map who depends on a library by topic and language.
  • Security and vulnerability research — build target lists for CVE scanning, prioritized by popularity.
  • Technical talent sourcing — find active maintainers by language and project prominence.
  • Devtool competitive intelligence — find repos using a competitor’s library and measure growth over time.
  • Package popularity tracking — monitor star/fork growth for your own or watched projects.
  • Trending discovery and curation — power newsletters, lists, and dashboards.
  • Ecosystem mapping — cross-reference languages, owners, dates, and license types for research and journalism.

Cost math

This actor is priced per dataset item (price-per-result), so cost scales directly with how many repos you pull. A targeted query (one topic, a star floor, a few hundred matching repos) costs cents. A full ecosystem sweep using unlimited-mode date windowing can return tens of thousands of repos; budget accordingly and use qualifiers (stars, language, date ranges) to keep the set to what you’ll actually analyze. Because there’s no proxy and no browser, compute is minimal; the result count is the cost driver.

Rolling your own is genuinely feasible — the API is well documented — but you’d own the rate-limit backoff, the date-windowing logic that defeats the 1,000-cap, and the nested-object flattening. The windowing in particular is fiddly to get right (off-by-one date boundaries either drop or double-count repos). That logic is exactly what the managed actor solves once.
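If you do roll your own, a cheap safety net against boundary mistakes is to union window results by repo_id and count how many duplicates appear. A minimal sketch, with `union_by_repo_id` as a hypothetical helper operating on flat rows:

```python
from typing import Dict, Iterable, List, Tuple


def union_by_repo_id(windows: Iterable[Iterable[dict]]) -> Tuple[List[dict], int]:
    """Merge rows from several windows, keeping one row per repo_id."""
    seen: Dict[int, dict] = {}
    duplicates = 0
    for rows in windows:
        for row in rows:
            if row["repo_id"] in seen:
                duplicates += 1  # the same repo came back from more than one window
            seen[row["repo_id"]] = row  # keep the most recently seen copy
    return list(seen.values()), duplicates
```

A nonzero duplicate count points to overlapping windows. Dropped repos are harder to see, so it is also worth checking the merged row count against the `total_count` of the unwindowed query.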

Common pitfalls

  • The silent 1,000-result truncation. The number-one mistake. If your query matches more than 1,000 repos and you don’t window by date or star band, you get a partial dataset with no error. Always use unlimited/windowed mode for broad queries.
  • updated_at ≠ activity. Filtering “active projects” on updated_at includes repos that just got a star. Use pushed_at for real code activity.
  • Rename/transfer breakage. Keying on full_name breaks when repos move. Use repo_id.
  • Rate-limit storms. Hammering the Search API past 30 req/min gets you throttled. Respect the retry headers — the actor does this automatically.
  • Topic gaps. Not every repo sets topics; absence of a topic tag doesn’t mean the repo is unrelated. Combine topic with language and description signals.
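The updated_at pitfall is easy to avoid in code by keying the liveness check on pushed_at. The sketch below assumes flat rows as in the schema section; `is_active` is a hypothetical helper and the 90-day threshold is an arbitrary illustrative choice.

```python
from datetime import datetime, timedelta, timezone


def is_active(row: dict, now: datetime, max_age_days: int = 90) -> bool:
    """Treat a repo as active if it is not archived and was pushed to recently."""
    if row.get("archived"):
        return False
    pushed = row.get("pushed_at")
    if not pushed:
        return False  # no push recorded
    # Swap the trailing Z for an explicit offset so older Python versions accept it.
    pushed_at = datetime.fromisoformat(pushed.replace("Z", "+00:00"))
    return now - pushed_at <= timedelta(days=max_age_days)
```

`now` must be timezone-aware (for example `datetime.now(timezone.utc)`), since it is compared against an aware timestamp.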

Wrapping up

GitHub’s Search API is clean and open. The difficulty is structural: tight rate limits, plus a 1,000-result cap that quietly truncates any naive large pull. For a small, targeted query you can hit the API directly. For a complete, growth-tracked dataset across a language or topic, a managed actor that already automates date-windowing around the cap, respects rate limits, and flattens nested objects gets you a full, clean dataset without the silent-truncation trap.

Open the GitHub Repository Scraper on Apify — search by language, topic, stars and dates; unlimited windowed mode; flat warehouse-ready rows. Priced per result, free monthly credit to start.
