logiover
jobs · Jun 1, 2026 · 6 min read

How to Scrape WorkIndia Candidate Profiles in 2026

Extract candidate profiles from WorkIndia, India’s largest blue- and grey-collar hiring platform, by job title, city and industry. Get skills, experience, education and match scores at scale.

WorkIndia is India’s largest blue- and grey-collar hiring platform, with 300K+ active candidates across roles such as driver, delivery rider, technician, sales, security, housekeeping, teaching and entry-level IT. For staffing agencies and recruiters sourcing at volume in Indian metros, it’s a uniquely deep talent pool. The platform exposes a public candidate-search API, which means you can pull structured candidate profiles by role, city and industry without a headless browser. This guide covers how that API-based extraction works, what each candidate record contains, and the responsibilities that come with sourcing personal data.

What’s worth extracting

WorkIndia’s candidate search returns structured profiles (nested in the raw payload, flat once normalized). Per candidate you get:

  • Identity — candidate identifier and personal details, plus basic demographics.
  • Location — current city and area, used for proximity sorting.
  • Qualifications — education level and field.
  • Experience — total experience and tenure, previous roles and employers.
  • Skills & sectors — listed skills and the industry sectors the candidate fits.
  • Languages & assets — spoken languages and assets (e.g. owns a two-wheeler — material for delivery/driver roles).
  • Signals — activity and join timestamps, a relevance/match score, lead-priority indicators, and verification/contact-availability flags.

For pipeline building you center on role-fit (skills, sectors, experience) and location. The match score and lead-priority flags let you rank a raw search into a worklist. The verification/contact-availability signals tell you which profiles are actionable.
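
A minimal sketch of that ranking step, assuming rows already use the field names from the sample row later in this article (`match_score`, `lead_priority`, `contact_available`). The priority labels `"medium"` and `"low"` are assumptions, since only `"high"` appears in the sample; adjust the mapping to whatever values you actually observe.

```python
from typing import Any

# Assumed ordering of lead-priority labels; only "high" appears in the sample row.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def build_worklist(candidates: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep reachable candidates, then order by lead priority and match score."""
    reachable = [c for c in candidates if c.get("contact_available")]
    return sorted(
        reachable,
        key=lambda c: (
            # Unknown or missing labels sort after every known label.
            PRIORITY_RANK.get(c.get("lead_priority"), len(PRIORITY_RANK)),
            # Higher match score first; missing scores count as 0.
            -float(c.get("match_score") or 0.0),
        ),
    )
```

Filtering before sorting keeps unreachable profiles out of the worklist entirely, so a high score never pushes an uncontactable candidate to the top.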

The API is public — the work is search modeling and scale

WorkIndia exposes a candidate-search API the app itself calls, so no browser rendering is required. The friction is in driving it well:

  • Search dimensions — searches run across job title, city and industry category. There’s optional industry inference from the job title, so a search for “delivery boy” can map to the right sector automatically. Modeling your sourcing need into the right title/city/industry triple is the first design task.
  • Sorting — results can be ordered by recency, activity or location proximity. Which you pick changes who lands at the top of your worklist.
  • Pagination, parallelism and pacing — to pull a city’s pool you page deeply; you can tune parallelism and request delay for polite, scalable operation. Too aggressive and you risk throttling; too slow and a metro pull takes hours.
  • Flat normalization — the raw API payload is nested; turning it into one clean row per candidate (skills as an array, employers as a list) is on you.
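
The flattening step can be sketched as below. The nested input shape (`id`, `location`, `experience.jobs`, `skills[].name`) is purely hypothetical, since this article does not document the raw payload; inspect a real response and map its actual keys before relying on anything like this.

```python
from typing import Any


def flatten_candidate(raw: dict[str, Any]) -> dict[str, Any]:
    """Turn one nested payload (hypothetical shape) into one flat row."""
    location = raw.get("location") or {}
    experience = raw.get("experience") or {}
    jobs = experience.get("jobs") or []
    skills = raw.get("skills") or []
    return {
        "candidate_id": raw.get("id"),
        "city": location.get("city"),
        "area": location.get("area"),
        "total_experience_years": experience.get("total_years"),
        # Lists stay lists so downstream filters can test membership.
        "previous_roles": [j["title"] for j in jobs if j.get("title")],
        "employers": [j["employer"] for j in jobs if j.get("employer")],
        "skills": [s["name"] for s in skills if s.get("name")],
    }
```

Missing sections fall back to `None` or an empty list rather than raising, which keeps one malformed profile from aborting a long pull.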

A managed actor encapsulates the search modeling, the pagination/parallelism tuning and the flattening, so you specify role + city + industry and get clean candidate rows back.

Run the WorkIndia Candidate Scraper — searches WorkIndia’s candidate API by job title, city and industry, with recency/activity/proximity sorting and tunable pacing. Returns flat profiles: skills, experience, education, languages, match score and contact-availability.

How the search works

Conceptually a search is a title + city + industry query with sort and pagination:

job_title:  "Delivery Boy"
city:       "Bengaluru"
industry:   (inferred from title -> Logistics/Delivery)
sort_by:    "recency"        # or "activity" / "proximity"
page:       1                 # then 2, 3, ... with tuned delay
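
A rough sketch of the page-then-pause loop in Python. The endpoint itself is not shown here, so the HTTP call is injected as `fetch_page`, a hypothetical callable you would write against the real API; the parameter names simply mirror the conceptual query above.

```python
import time
from typing import Any, Callable, Iterator

Page = list[dict[str, Any]]


def iter_candidates(
    fetch_page: Callable[[dict[str, Any]], Page],  # hypothetical: performs one API request
    query: dict[str, Any],
    max_pages: int = 50,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict[str, Any]]:
    """Yield candidates page by page, pausing between requests."""
    for page in range(1, max_pages + 1):
        results = fetch_page({**query, "page": page})
        if not results:
            break  # an empty page is treated as the end of the result set
        yield from results
        sleep(delay_seconds)  # polite pacing before the next request
```

`max_pages` caps a runaway pull, and `delay_seconds` is the knob to raise if you start seeing throttled responses.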

To build a full pipeline you fan out across the role variants and cities you recruit for — “delivery boy”, “driver”, “field sales”, “security guard” across Bengaluru, Pune, Hyderabad — and merge the deduplicated results into one ranked worklist.
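
The fan-out is a Cartesian product of titles and cities. A small sketch, where the dictionary keys follow the conceptual query shown earlier and are not confirmed API parameter names:

```python
from itertools import product


def build_queries(
    titles: list[str], cities: list[str], sort_by: str = "recency"
) -> list[dict[str, str]]:
    """Build one search query per (job title, city) pair."""
    return [
        {"job_title": title, "city": city, "sort_by": sort_by}
        for title, city in product(titles, cities)
    ]


queries = build_queries(
    ["delivery boy", "driver", "field sales", "security guard"],
    ["Bengaluru", "Pune", "Hyderabad"],
)
print(len(queries))  # 4 titles x 3 cities = 12 searches
```

Each query can then be run independently, which makes it straightforward to cap parallelism at the query level.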

Build it yourself vs. use a managed scraper

  • Roll your own — once you’ve found the API call, one search is easy. The tail: industry inference, sort handling, deep pagination with polite pacing and parallelism tuning, flattening nested profiles, dedup across overlapping searches, and re-checking the endpoint when WorkIndia updates its app.
  • Managed actor — running in minutes, search dimensions and pacing handled, output flat and rankable.

For a single city/role probe, a script works. For a recurring multi-city, multi-role sourcing pipeline, the fan-out, pacing and flattening are the parts worth offloading.

Schema design for downstream use

A clean per-candidate row:

{
  "candidate_id": "wi-9087654",
  "city": "Bengaluru",
  "area": "Whitefield",
  "education": "12th Pass",
  "total_experience_years": 3,
  "previous_roles": ["Delivery Executive", "Warehouse Helper"],
  "skills": ["two-wheeler", "navigation", "customer handling"],
  "sectors": ["Logistics", "Delivery"],
  "languages": ["Hindi", "Kannada", "English"],
  "assets": ["two-wheeler", "smartphone"],
  "match_score": 0.82,
  "lead_priority": "high",
  "contact_available": true,
  "active_at": "2026-05-30T14:00:00Z",
  "joined_at": "2025-11-02",
  "scraped_at": "2026-06-01T09:00:00Z"
}

Schema choices worth making early:

  • Key on candidate_id and dedupe across overlapping searches — the same person matches multiple role variants.
  • Keep skills, sectors, languages and assets as arrays. For blue-collar roles, asset ownership (two-wheeler, smartphone) is often a hard filter; don’t bury it.
  • Persist match_score and lead_priority so downstream screening can rank without re-querying.
  • Carry contact_available — a profile you can’t reach is not a workable lead; let it gate your outreach list.
  • Store active_at — recency of activity is often a strong predictor of responsiveness in high-churn blue-collar pools.
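
One way to sketch the dedup step, assuming each row carries `candidate_id` and an ISO 8601 `active_at` as in the sample row above. Keeping the most recently active copy is a design choice made for this example, not something the platform prescribes.

```python
from datetime import datetime
from typing import Any


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def dedupe_candidates(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep one row per candidate_id, preferring the latest active_at."""
    best: dict[str, dict[str, Any]] = {}
    for row in rows:
        cid = row["candidate_id"]
        current = best.get(cid)
        if current is None or parse_ts(row["active_at"]) > parse_ts(current["active_at"]):
            best[cid] = row
    return list(best.values())
```

Because the dictionary is keyed on `candidate_id`, overlapping role searches collapse to a single row per person in first-seen order.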

Compliance and ethics: this is personal data

Unlike product prices, candidate profiles are PII, and that carries obligations:

  • India’s DPDP Act — the Digital Personal Data Protection regime governs processing of personal data of individuals in India. Have a lawful basis, minimize what you store, and honor erasure/withdrawal requests.
  • Purpose limitation — source for genuine recruitment. Don’t repurpose candidate data for unrelated marketing.
  • Respect platform terms and the candidate’s intent — these individuals listed themselves to be hired; contacting them for relevant roles is the intended use. Spamming irrelevant offers is not.
  • Secure storage — candidate records are sensitive. Store them with access controls and a retention limit, not in an open spreadsheet.

Treat the scraper as a sourcing tool for a legitimate recruitment funnel, not a bulk PII harvester.
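
Minimization and a retention limit can be enforced in code rather than left to policy documents. A sketch under stated assumptions: the 90-day window and the allow-list are illustrative values chosen for this example (they are not taken from the DPDP Act), and `date_of_birth` stands in for any field your funnel does not need.

```python
from datetime import datetime, timedelta, timezone
from typing import Any

RETENTION_DAYS = 90  # illustrative value; set this from your own retention policy
ALLOWED_FIELDS = {   # illustrative allow-list based on the sample row above
    "candidate_id", "city", "area", "education", "total_experience_years",
    "previous_roles", "skills", "sectors", "languages", "assets",
    "match_score", "lead_priority", "contact_available",
    "active_at", "joined_at", "scraped_at",
}


def minimize(row: dict[str, Any]) -> dict[str, Any]:
    """Drop every field that is not on the allow-list."""
    return {k: v for k, v in row.items() if k in ALLOWED_FIELDS}


def is_expired(row: dict[str, Any], now: datetime) -> bool:
    """True if the row was scraped longer ago than the retention window."""
    scraped = datetime.fromisoformat(row["scraped_at"].replace("Z", "+00:00"))
    return now - scraped > timedelta(days=RETENTION_DAYS)


def apply_retention(rows: list[dict[str, Any]], now: datetime) -> list[dict[str, Any]]:
    """Return minimized copies of the rows still inside the retention window."""
    return [minimize(r) for r in rows if not is_expired(r, now)]
```

Run something like `apply_retention(rows, datetime.now(timezone.utc))` on a schedule, and delete a candidate’s rows by `candidate_id` when an erasure request arrives.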

Typical use cases

  • Talent pipeline building — assemble role-and-city candidate pools (delivery, driver, technician, sales, security, housekeeping, teaching, entry IT).
  • Staffing agency sourcing at scale — fan out across metros to fill high-volume requisitions fast.
  • Workforce analytics — study experience, qualification and skill distribution in the Indian blue-collar market.
  • Supply/demand market research — measure candidate availability by role and city.
  • Hiring benchmarking — gauge talent supply for specific roles in target metros.
  • Hiring-system integration — feed structured profiles into your ATS for downstream screening (with compliance in place).

The value is depth and recency: a ranked, deduplicated, recently-active candidate pool for the exact roles and cities you recruit for.

Cost math for the managed approach

Because extraction is API-based, with no browser and no proxy, it is fast and cheap, and the cost is mostly compute. A multi-city sourcing pull typically lands in single-digit dollars per run. The expense you avoid is the build: search modeling, pagination and pacing tuning, flattening and dedup, plus re-fixing the integration when the app’s API shifts. For an agency billing per placement, the data cost is a rounding error against a single filled role.

Common pitfalls

  • Ignoring contact-availability — sourcing 5,000 profiles you can’t actually reach is busywork. Gate on contact_available.
  • Sorting by the wrong key — proximity sort can fill your list with nearby-but-inactive candidates, while recency or activity sort surfaces responsive ones. Match the sort to your goal.
  • No dedup across searches — overlapping role queries return the same person repeatedly; dedupe on candidate_id.
  • Over-aggressive pacing — too much parallelism trips throttling and truncates the pull. Tune the delay.
  • Treating PII casually — the biggest risk here isn’t technical, it’s compliance. Have a lawful basis, minimize, secure and honor deletion.

Wrapping up

WorkIndia opens a uniquely deep Indian blue-collar talent pool via a public candidate-search API, so the technical work is search modeling, polite deep pagination and flattening rather than anti-bot evasion. The harder discipline is treating candidate PII responsibly under DPDP. For a single probe, a script works. For a recurring multi-city sourcing pipeline, let a managed actor handle the fan-out and normalization while you keep the compliance tight.

Open the WorkIndia candidate scraper on Apify — search by job title, city and industry; ranked, deduplicated candidate profiles with skills, experience and contact-availability. Build your talent pipeline at scale.
