ai · Jun 3, 2026 · 6 min read

How to Scrape the Hugging Face Hub in 2026

Export every model, dataset, Space and daily paper from the Hugging Face Hub — filter by task, library, license and author, sort by downloads or trending, via the public API with no token.

The Hugging Face Hub is the center of gravity for open AI: more than a million models, 200,000+ datasets, 500,000+ Spaces, plus daily research papers and curated collections. If you’re seeding an AI tool directory, building RAG pipelines, doing VC or talent intelligence on the open-model ecosystem, or just monitoring releases in a niche, the Hub is the dataset. And unlike a defended commercial site, it ships a clean public API. The real challenge is breadth and normalization: enumerating well over a million heterogeneous entities into one flat, comparable schema. This guide covers how to do that at scale in 2026.

The Hub API reality

Hugging Face exposes a genuinely good public Hub API — no token required for public entities, cursor-based pagination, rich filtering. There’s no anti-bot wall, no headless browser, no proxy arms race. So the difficulty isn’t access; it’s three other things:

Scale. Cursor-walking a million models politely, without dropping pages or re-fetching, is a real engineering task.
Heterogeneity. Models, datasets, Spaces, papers and collections are five different entity shapes. Turning them into one comparable schema is the value.
Enrichment cost. The list endpoints are cheap and shallow; the rich per-item metadata (full tags, sibling files, model-card README) requires a second request per item. Knowing when to enrich is the cost discipline.
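
To make the "Scale" point concrete, here is a minimal sketch of a polite cursor walk. It assumes the list endpoints return a JSON array and advertise the next page in an HTTP Link header with rel="next"; check that against the current Hub API docs before relying on it. The names walk, next_url and http_fetch are illustrative, not part of any library.

```python
import json
import re
import time
import urllib.request
from typing import Callable, Dict, Iterator, Optional, Tuple

# Assumption: the next page is advertised as  <url>; rel="next"  in the Link header.
NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')

Fetch = Callable[[str], Tuple[list, Dict[str, str]]]


def http_fetch(url: str) -> Tuple[list, Dict[str, str]]:
    """Fetch one page and return (items, headers with lowercase keys)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        headers = {k.lower(): v for k, v in resp.headers.items()}
        return json.load(resp), headers


def next_url(headers: Dict[str, str]) -> Optional[str]:
    """Extract the rel="next" URL from a Link header, if there is one."""
    match = NEXT_LINK.search(headers.get("link", ""))
    return match.group(1) if match else None


def walk(start_url: str, fetch: Fetch = http_fetch,
         delay_s: float = 0.5, max_pages: Optional[int] = None) -> Iterator[dict]:
    """Yield every item, following next links and pausing between pages."""
    url: Optional[str] = start_url
    seen = set()  # guards against re-fetching a page if a cursor ever loops
    pages = 0
    while url and url not in seen:
        if max_pages is not None and pages >= max_pages:
            break
        seen.add(url)
        items, headers = fetch(url)
        yield from items
        pages += 1
        url = next_url(headers)
        if url and delay_s:
            time.sleep(delay_s)  # politeness delay between requests
```

Passing the fetch function in as a parameter keeps the walker testable offline and leaves room to add retries or checkpointing of the last cursor without touching the loop.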

What you can filter and sort on

The Hub’s filtering is what makes targeted extraction possible. You can slice by:

Task / pipeline — text-generation, embeddings, ASR, TTS, vision, and the rest of the pipeline taxonomy.
Library — transformers, diffusers, sentence-transformers, GGUF, MLX, ONNX. (This is how you isolate, say, every GGUF quant or every diffusers image model.)
Language, license, base-model lineage, author/organization.

And sort by downloads, likes, or trending — the three signals that separate the 50 models people actually use from the long tail.
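
As a rough illustration, the filters and sorts above can be turned into a list URL like this. The query parameter names (pipeline_tag, filter, author, sort, direction, limit), the license:<id> tag form and the trendingScore sort key are assumptions based on how the public Hub API has behaved; treat build_list_url as a hypothetical helper and verify the parameters against the API docs.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode

HUB_API = "https://huggingface.co/api"

# Assumption: these are the sort keys the list endpoints accept.
SORT_KEYS = {"downloads": "downloads", "likes": "likes", "trending": "trendingScore"}


def build_list_url(entity: str = "models",
                   task: Optional[str] = None,
                   library: Optional[str] = None,
                   license_id: Optional[str] = None,
                   author: Optional[str] = None,
                   sort: str = "downloads",
                   limit: int = 100) -> str:
    """Build a list URL for one filtered, sorted slice of the Hub."""
    params: List[Tuple[str, str]] = []
    if task:
        params.append(("pipeline_tag", task))
    if library:
        params.append(("filter", library))  # library tags are plain tags
    if license_id:
        params.append(("filter", f"license:{license_id}"))  # assumed tag form
    if author:
        params.append(("author", author))
    params.append(("sort", SORT_KEYS[sort]))
    params.append(("direction", "-1"))  # descending
    params.append(("limit", str(limit)))
    return f"{HUB_API}/{entity}?{urlencode(params)}"


# Example: the 50 trending GGUF text-generation models.
url = build_list_url(task="text-generation", library="gguf", sort="trending", limit=50)
```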

▶ Run the Hugging Face Hub Scraper — export models, datasets, Spaces, daily papers and collections. Filter by task, library, language, license and author; sort by downloads, likes or trending. Public API, no token.

What’s worth extracting

Across all five entity types, the scraper normalizes into one flat schema. Per item you can get:

Identity — Hub ID, direct Hub URL, entity type, author/org.
Taxonomy — task/pipeline tags, library tags, license, language tags, the full tag list.
Lineage — base-model and dataset lineage (what a model was fine-tuned from).
Engagement — download and like counts (the usage signal).
Time — created and last-modified timestamps (the release-monitoring signal).
Deep metadata (enrichment) — sibling file listings, model-card data, full README content.
Papers — for Daily Papers: authors, abstract, upvotes, date.

A clean per-item schema

{
  "type": "model",
  "id": "meta-llama/Llama-4-Scout-17B",
  "author": "meta-llama",
  "url": "https://huggingface.co/meta-llama/Llama-4-Scout-17B",
  "pipeline_tag": "text-generation",
  "libraries": ["transformers", "safetensors"],
  "license": "llama4",
  "languages": ["en", "es", "de"],
  "tags": ["text-generation", "conversational", "llama"],
  "base_model": "meta-llama/Llama-4-Scout-17B-base",
  "downloads": 1842300,
  "likes": 5120,
  "created_at": "2026-04-05T00:00:00Z",
  "last_modified": "2026-05-28T00:00:00Z",
  "siblings": ["config.json", "model-00001-of-00012.safetensors", "..."],
  "readme": "# Llama 4 Scout ...",
  "scraped_at": "2026-06-03T09:00:00Z"
}
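
A record in roughly this shape could be produced by a small normalizer like the sketch below. The raw field names (createdAt, lastModified, siblings[].rfilename) and the license: and base_model: tag prefixes are assumptions about the API payload, and normalize is a hypothetical helper covering only models, datasets and Spaces.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

URL_PREFIX = {
    "model": "https://huggingface.co/",
    "dataset": "https://huggingface.co/datasets/",
    "space": "https://huggingface.co/spaces/",
}


def tag_value(tags: List[str], prefix: str) -> Optional[str]:
    """Return the remainder of the first tag that starts with prefix."""
    for tag in tags:
        if tag.startswith(prefix):
            return tag[len(prefix):]
    return None


def normalize(raw: Dict[str, Any], entity_type: str,
              scraped_at: Optional[str] = None) -> Dict[str, Any]:
    """Map one raw list-endpoint record onto the flat schema."""
    tags = list(raw.get("tags") or [])
    repo_id = raw["id"]
    is_model = entity_type == "model"
    siblings = raw.get("siblings")  # only present on enriched records
    return {
        "type": entity_type,
        "id": repo_id,
        "author": raw.get("author") or (repo_id.split("/")[0] if "/" in repo_id else None),
        "url": URL_PREFIX[entity_type] + repo_id,
        "pipeline_tag": raw.get("pipeline_tag") if is_model else None,
        "license": tag_value(tags, "license:"),
        "tags": tags,
        "base_model": tag_value(tags, "base_model:") if is_model else None,
        "downloads": raw.get("downloads"),
        "likes": raw.get("likes"),
        "created_at": raw.get("createdAt"),
        "last_modified": raw.get("lastModified"),
        "siblings": [s["rfilename"] for s in siblings] if siblings is not None else None,
        "scraped_at": scraped_at or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Fields that only make sense for models come back as None for other entity types, and siblings stays None on a list-only run rather than turning into an empty list, so consumers can tell "not enriched" apart from "no files".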

Schema choices worth making early:

Keep type as the discriminator. One flat schema across models, datasets, Spaces, papers and collections only works if every consumer can branch on entity type.
Store downloads and likes with scraped_at. Both are snapshots; the trend (download velocity, like growth) is the interesting metric, and you can only compute it if you timestamp every pull.
Persist base_model lineage. Mapping the fine-tune tree is half of any open-model landscape analysis.
Treat siblings and readme as optional/heavy. They come from enrichment; don’t assume they’re populated on a list-only run, and don’t store full READMEs you won’t use.
Keep tags as the full array. The pipeline tag is the headline, but downstream classification often needs the long tail.
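
The scraped_at advice pays off once you have two pulls of the same item. A minimal sketch of the trend calculation, assuming records shaped like the schema above; per_day_growth is a hypothetical helper.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def per_day_growth(earlier: dict, later: dict, field: str = "likes") -> float:
    """Average change per day in one counter between two snapshots of an item."""
    seconds = (parse_ts(later["scraped_at"]) - parse_ts(earlier["scraped_at"])).total_seconds()
    if seconds <= 0:
        raise ValueError("later snapshot must be newer than the earlier one")
    return (later[field] - earlier[field]) / (seconds / 86400)


monday = {"id": "acme/tiny", "likes": 100, "downloads": 1000, "scraped_at": "2026-06-01T09:00:00Z"}
wednesday = {"id": "acme/tiny", "likes": 130, "downloads": 1500, "scraped_at": "2026-06-03T09:00:00Z"}
likes_per_day = per_day_growth(monday, wednesday, "likes")  # 15.0
```

One caveat: where downloads is a rolling-window figure, the difference between two snapshots measures how the window moved, not how many downloads happened in between. Likes are cumulative, so their growth rate reads more directly.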

Typical use cases

AI tool discovery / marketplace seeding — daily refresh and bulk export of models in a niche (a specific task, library, or license) to seed a directory.
RAG and fine-tuning pipelines — discover datasets by task category and language for training or retrieval.
VC and talent intelligence — aggregate and monitor models by author/organization and track engagement trends to spot who’s shipping.
AI release monitoring — detect newly created or recently modified models in a watchlist (filter by last_modified).
Hub indexing / semantic search — ingest README and metadata into a vector database for search over the Hub.
Competitive landscape mapping — sort by downloads/likes within a library to map a domain like image generation.
Daily paper digests — collect Daily Papers with upvotes, authors and abstracts.
Spaces discovery — enumerate hosted Gradio/Streamlit/Docker demos and capture runtime metadata.
Cross-entity leaderboards — the unified schema enables comparisons across models, datasets, Spaces, papers and collections in one query.

The throughline: the Hub moves fast, so the value is in a refreshed, filtered, normalized slice — not a one-time million-row dump nobody can query.
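
For the release-monitoring case, the filter on last_modified can be sketched in a few lines. This assumes records in the flat schema above with ISO 8601 timestamps; recently_changed and the one-day default window are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def recently_changed(records: Iterable[dict], watchlist_authors: Set[str],
                     since: Optional[datetime] = None) -> List[dict]:
    """Return watchlisted records modified at or after `since`, newest first."""
    if since is None:
        since = datetime.now(timezone.utc) - timedelta(days=1)  # default: last 24 hours
    hits = []
    for rec in records:
        if rec.get("author") not in watchlist_authors:
            continue
        stamp = rec.get("last_modified")
        if stamp and parse_ts(stamp) >= since:
            hits.append(rec)
    return sorted(hits, key=lambda r: parse_ts(r["last_modified"]), reverse=True)
```

Run against each scheduled pull, this returns the items to alert on; records with no timestamp are skipped rather than treated as new.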

Cost math

Pricing is pay-per-event with a small per-run start fee plus a few tenths of a cent per result. At $0.002 per item, a focused slice — say, every trending text-generation model, a few thousand rows — costs a few dollars. The dominant cost lever is enrichment: a list-only walk is one request per page and cheap; enabling per-item enrichment (sibling files, README, full metadata) adds a request per item and multiplies both time and cost. The discipline is to list broadly, then enrich only the subset you actually need (e.g. the top 200 by downloads), rather than enriching the entire long tail.
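
The arithmetic is easy to sanity-check before a run. The sketch below uses the $0.002 per-item price quoted above; the page size of 100 and the zero default start fee are placeholder assumptions, and because it isn't stated here whether enrichment changes the per-result price, only request counts are modeled for it.

```python
import math


def estimate_run(total_items: int, enrich_top_n: int = 0, page_size: int = 100,
                 price_per_item: float = 0.002, start_fee: float = 0.0) -> dict:
    """Estimate request counts and per-result cost for a list-then-enrich run."""
    list_requests = math.ceil(total_items / page_size)  # one request per page
    enrich_requests = min(enrich_top_n, total_items)     # one extra request per enriched item
    return {
        "list_requests": list_requests,
        "enrich_requests": enrich_requests,
        "total_requests": list_requests + enrich_requests,
        "result_cost_usd": round(start_fee + total_items * price_per_item, 2),
    }


focused = estimate_run(3_000, enrich_top_n=200)
# 30 list requests + 200 enrichment requests, about $6 in per-result fees

everything = estimate_run(1_000_000, enrich_top_n=1_000_000)
# 10,000 list requests + 1,000,000 enrichment requests, about $2,000
```

The request counts show why enrichment dominates: on the full walk it turns ten thousand requests into just over a million.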

Against a DIY build you avoid: robust cursor pagination across a million-plus entities, the cross-entity normalization into one flat schema, the enrichment request management, and the filter/sort plumbing across all five entity types.

Common pitfalls

Don’t enrich everything. The single biggest cost mistake is turning on full enrichment for a million-row walk. List first, enrich the relevant slice.
Downloads are a 30-day rolling number on many entities — not lifetime. Read the metric definition before benchmarking, and always store your scraped_at.
Trending is volatile. A “sort by trending” snapshot is valid for hours, not weeks; schedule it if you care about the trend.
READMEs are large and noisy. Model cards include badges, tables and HTML; sanitize before indexing into a vector store.
Gated and private repos. With the public API and no token, gated models return limited or no metadata, so expect gaps and don’t treat them as deletions.
One schema, five shapes. Fields like pipeline_tag or base_model are meaningful for models but null for datasets/papers — branch on type before assuming a field exists.
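
On the README pitfall, a rough first-pass cleaner might look like this. It is a regex-based sketch, not a Markdown parser: it assumes model cards open with a YAML front-matter block delimited by --- lines, and the tag-stripping pattern will also eat any literal text that sits between < and >. sanitize_readme is a hypothetical helper.

```python
import re

FRONT_MATTER = re.compile(r"\A---\s*\n.*?\n---\s*\n", re.DOTALL)
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # badges are Markdown images
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
HTML_TAG = re.compile(r"<[^>]+>")


def sanitize_readme(text: str) -> str:
    """Strip front matter, badges, link targets and HTML tags before indexing."""
    text = FRONT_MATTER.sub("", text)
    text = HTML_COMMENT.sub("", text)
    text = IMAGE.sub("", text)
    text = LINK.sub(r"\1", text)  # keep the link text, drop the URL
    text = HTML_TAG.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Tables are left alone on purpose; whether to flatten or drop them depends on what the vector index is for.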

Wrapping up

The Hugging Face Hub is an open, well-documented API, so the hard part isn’t getting in — it’s enumerating a fast-moving, million-entity ecosystem into one normalized schema without over-spending on enrichment. For a quick top-50 snapshot in one task you could hit the API by hand. For a refreshed, filtered, cross-entity feed powering a directory, a RAG pipeline, release monitoring, or AI-ecosystem intelligence, use a scraper that already handles the cursor pagination, normalization, and enrichment discipline.

▶ Open the Hugging Face Hub Scraper on Apify — models, datasets, Spaces, papers and collections in one flat schema, filterable by task, library, license and author. Pay-per-event, start on Apify’s free credit.