How to Scrape Emails, Phones & Social Links From Any Website in 2026
Extract emails, phone numbers and social profiles from company websites at scale — auto-crawling Contact and About pages — for B2B lead gen, CRM enrichment and outreach.
You have a list of domains. You need the emails, phones and social profiles behind them — for an outreach campaign, a CRM enrichment pass, or a competitive map. Visiting each site by hand, hunting for the Contact page, copy-pasting the address and dodging the obfuscated info [at] company [dot] com is the kind of work nobody should be doing in 2026. This guide covers how website contact extraction actually works — where the data hides, how to avoid garbage matches, and how deliverability concerns shape what’s worth keeping.
What’s worth extracting
Per domain, a contact crawl can surface a surprisingly complete picture:
- Emails: every valid address found across the homepage and contact/about/team pages, deduplicated.
- Phone numbers: from tel: links (highest confidence) and international-format text matches.
- Social profiles: LinkedIn, X/Twitter, Instagram, Facebook and YouTube URLs, detected by known domain patterns.
- Page metadata: page title and meta description, useful for qualifying the prospect.
- Attribution: the root domain, which page each contact came from (homepage vs. contact vs. about), and a scrape timestamp.
For B2B lead gen the email is the headline, but the LinkedIn URL is often the more durable asset — emails bounce, LinkedIn profiles persist. Phones matter most for local-business and high-ticket outreach.
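Both tel: links and social profiles live in anchor href attributes, so one pass over a page's links can collect them together. The following is a minimal sketch using only the Python standard library; `SOCIAL_HOSTS`, `HrefCollector` and `extract_link_contacts` are illustrative names, and the host list is an assumption rather than a complete registry.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

# Illustrative host list (an assumption, not an exhaustive registry).
SOCIAL_HOSTS = {
    "linkedin": ("linkedin.com",),
    "twitter": ("twitter.com", "x.com"),
    "instagram": ("instagram.com",),
    "facebook": ("facebook.com",),
    "youtube": ("youtube.com",),
}


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href.strip())


def host_of(href):
    """Lower-cased hostname of a link, or "" if it has none or is malformed."""
    try:
        return (urlparse(href).hostname or "").lower()
    except ValueError:
        return ""


def extract_link_contacts(html):
    """Return tel: numbers and the first profile URL seen per social network."""
    collector = HrefCollector()
    collector.feed(html)
    phones = []
    socials = {name: None for name in SOCIAL_HOSTS}
    for href in collector.hrefs:
        if href.lower().startswith("tel:"):
            number = unquote(href[4:]).strip()
            if number and number not in phones:
                phones.append(number)
            continue
        host = host_of(href)
        for name, hosts in SOCIAL_HOSTS.items():
            # Match the host itself or any subdomain of it (www., m., ...).
            is_match = any(host == h or host.endswith("." + h) for h in hosts)
            if socials[name] is None and is_match:
                socials[name] = href
    return {"phones": phones, "socials": socials}
```

Matching on the parsed hostname rather than a substring avoids lookalike domains. A production filter would probably also skip share-widget links (for example a Facebook "sharer" URL), which this sketch does not attempt.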
Where contact data hides — and why naive extraction fails
A regex over the homepage misses most of the data and grabs a lot of junk. The real problems:
- Contacts aren’t on the homepage: they live on /contact, /about, /team, /impressum (mandatory in DACH), /contact-us. You have to discover and crawl those pages, typically at depth 1 from the root.
- Client-rendered contacts: modern sites inject the email via JavaScript or hide it behind a React component. A pure HTML fetch returns an empty shell. You need a browser fallback for those domains.
- Obfuscation: name [at] domain [dot] com, entity-encoded addresses, image-rendered emails. Some are recoverable with normalization; image-only emails aren’t (and that’s by design on the site’s part).
- False positives: regex email matching happily grabs 2x@3.png from a CSS sprite, user@2x retina image refs, Sentry DSNs, and version strings. Without false-positive filtering your list is half garbage.
- Phone format chaos: +44 20..., (212) 555..., 0800-.... tel: links are reliable; free-text matches need international-format validation or you’ll capture order numbers and dates.
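The obfuscation and false-positive problems above can be illustrated with a short normalize-then-filter pass. This is a sketch under stated assumptions, not the filter any particular scraper uses: the regex is deliberately simple, and `ASSET_SUFFIXES` and `JUNK_DOMAINS` are hypothetical starter lists that would grow as new garbage patterns turn up.

```python
import html as html_lib
import re

# Simple pattern: requires an alphabetic TLD, so "user@2x" and "react@18.2.0"
# never match in the first place.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,24}"
)
# "[at]" / "(at)" and "[dot]" / "(dot)" with optional surrounding whitespace.
AT_RE = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)
DOT_RE = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)

# Hypothetical starter lists; extend them as new junk patterns appear.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")
JUNK_DOMAINS = ("sentry.io",)


def normalize(text):
    """Decode HTML entities and undo [at]/[dot] obfuscation."""
    text = html_lib.unescape(text)
    text = AT_RE.sub("@", text)
    return DOT_RE.sub(".", text)


def is_plausible(email):
    """Reject matches that are really asset filenames or DSN-style keys."""
    local, _, domain = email.partition("@")
    if email.endswith(ASSET_SUFFIXES):
        return False
    if any(domain == d or domain.endswith("." + d) for d in JUNK_DOMAINS):
        return False
    if re.fullmatch(r"[0-9a-f]{32}", local):  # looks like an API key, not a person
        return False
    return True


def extract_emails(raw_html):
    """Return deduplicated, lower-cased, filtered emails in order of appearance."""
    seen, result = set(), []
    for match in EMAIL_RE.findall(normalize(raw_html)):
        email = match.lower()
        if email not in seen and is_plausible(email):
            seen.add(email)
            result.append(email)
    return result
```

Normalizing before matching is what recovers the [at]/[dot] and entity-encoded forms; image-rendered addresses remain out of reach, as noted above.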
The reliable approach is: discover contact pages by URL pattern, crawl them with HTTP first and a JavaScript-enabled browser as a fallback, regex-scan with aggressive false-positive filtering, dedupe, and attribute each hit to its source page. A managed scraper (an "actor", in Apify’s terminology) does all of this, including the browser fallback with asset-blocking and fingerprinting, so you don’t rebuild it per project.
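One way to decide when the HTTP-first fetch needs the browser fallback is to measure how much visible text the raw HTML actually contains. The heuristic below is an assumption for illustration (the 200-character threshold is arbitrary), and `http_fetch` and `browser_fetch` are hypothetical callables you would supply, for example a plain HTTP client and a headless-browser wrapper.

```python
import re

SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text_length(html):
    """Rough count of characters a reader would see without running any JS."""
    without_code = SCRIPT_STYLE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", without_code)
    return len(" ".join(text.split()))


def looks_client_rendered(html, min_chars=200):
    """True when the page ships scripts but almost no visible text."""
    has_scripts = re.search(r"<script\b", html, re.IGNORECASE) is not None
    return has_scripts and visible_text_length(html) < min_chars


def fetch_page(url, http_fetch, browser_fetch):
    """HTTP first; fall back to the (more expensive) browser only for empty shells."""
    html = http_fetch(url)
    if looks_client_rendered(html):
        html = browser_fetch(url)
    return html
```

Keeping the two fetchers as injected callables means the cheap path stays the default and the browser is only paid for on the pages that need it.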
▶ Run the Website Contact Scraper — feed it domains, it auto-detects Contact/About pages (depth 1), extracts emails, phones and socials with false-positive filtering, and returns clean JSON per domain. HTTP-first with a Playwright fallback for JS-rendered sites.
How the crawl works
The flow per input domain:
1. Fetch homepage (HTTP).
2. Discover candidate contact pages by URL pattern:
/contact, /contact-us, /about, /team, /impressum, /kontakt ...
3. Crawl those pages (depth 1). If a page is JS-rendered and empty,
retry it with a headless browser (assets/analytics blocked).
4. Regex-scan all fetched HTML for emails + phones; detect socials
by domain pattern (linkedin.com/company, twitter.com, ...).
5. Filter false positives, deduplicate, attribute to source page.
6. Emit one record per domain.
The depth-1 discovery is the key design choice: deep enough to find the contact page, shallow enough not to crawl the entire site and rack up cost on a blog archive.
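Step 2 of the flow, discovering candidate pages from the homepage's own links, might look like the sketch below. The path pattern mirrors the names listed in the flow; `CONTACT_PATH_RE`, `LinkParser` and `discover_contact_pages` are illustrative names, and the `limit` cap is an assumed safeguard rather than a documented setting.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Matches a contact-like name as the last path segment, e.g. /contact, /about-us/.
CONTACT_PATH_RE = re.compile(
    r"/(contact(-us)?|about(-us)?|team|impressum|kontakt)(/|\.html?)?$",
    re.IGNORECASE,
)


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_contact_pages(homepage_url, homepage_html, limit=5):
    """Return same-host links from the homepage whose path looks like a contact page."""
    parser = LinkParser()
    parser.feed(homepage_html)
    root_host = urlparse(homepage_url).hostname
    found = []
    for href in parser.hrefs:
        absolute = urljoin(homepage_url, href)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https") or parts.hostname != root_host:
            continue  # skip mailto:, tel:, and off-site links
        if not CONTACT_PATH_RE.search(parts.path):
            continue
        clean = f"{parts.scheme}://{parts.netloc}{parts.path}"  # drop query/fragment
        if clean not in found:
            found.append(clean)
    return found[:limit]
```

Because candidates come only from links on the homepage, the crawl stays at depth 1 by construction. The strict host comparison treats www.example.com and example.com as different sites, which a fuller version would likely normalize.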
Build it yourself vs. use a managed scraper
- Roll your own — an afternoon to regex the homepage. Then the long tail: contact-page discovery, the browser fallback for JS sites, the false-positive filter (this alone is iterative — you keep finding new garbage patterns), phone validation, social detection, and per-page attribution. Plus proxy and concurrency management for any real volume.
- Managed actor — running in minutes, browser fallback and false-positive filtering already tuned, output clean per domain, priced per result so cost scales with leads not effort.
For ten domains, a script is fine. For a few thousand domains where list quality directly affects sender reputation, the false-positive filtering and browser fallback are exactly what you don’t want to half-build.
Schema design for downstream use
A clean per-domain record:
{
  "domain": "acme-robotics.com",
  "emails": ["sales@acme-robotics.com", "info@acme-robotics.com"],
  "phones": ["+1-415-555-0142"],
  "socials": {
    "linkedin": "https://www.linkedin.com/company/acme-robotics",
    "twitter": "https://twitter.com/acmerobotics",
    "instagram": null,
    "facebook": "https://facebook.com/acmerobotics",
    "youtube": null
  },
  "page_title": "Acme Robotics — Industrial Automation",
  "meta_description": "Acme Robotics builds...",
  "source_pages": {
    "sales@acme-robotics.com": "https://acme-robotics.com/contact"
  },
  "scraped_at": "2026-05-26T10:00:00Z"
}
Schema choices worth making early:
- Keep emails as an array, ranked. A generic info@ is a weaker lead than a role-specific sales@; preserve all and let downstream scoring pick.
- Store source_pages attribution. An email found on /team is higher-confidence than one scraped from a footer that might be a vendor’s address.
- Null out absent socials explicitly. Don’t omit the key; downstream enrichment logic is cleaner when the shape is stable.
- Never treat a found email as verified. Extraction proves the address was published, not that it’s deliverable (more below).
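The schema choices above can be sketched as a small record builder. The field names follow the example record; `GENERIC_LOCAL_PARTS`, `rank_emails` and `build_record` are hypothetical, and the ranking rule (generic inboxes last) is one simple interpretation of "ranked", not a prescribed scoring model.

```python
from datetime import datetime, timezone

SOCIAL_KEYS = ("linkedin", "twitter", "instagram", "facebook", "youtube")
# Hypothetical list of local parts treated as generic inboxes.
GENERIC_LOCAL_PARTS = {"info", "contact", "hello", "office", "mail"}


def rank_emails(emails):
    """Stable sort that moves generic inboxes after role-specific or named ones."""
    return sorted(
        emails, key=lambda e: e.split("@", 1)[0].lower() in GENERIC_LOCAL_PARTS
    )


def build_record(domain, hits, phones=(), socials=None,
                 page_title=None, meta_description=None):
    """Assemble one per-domain record.

    hits: iterable of (email, source_url) pairs in the order they were found.
    """
    source_pages = {}
    for email, source_url in hits:
        source_pages.setdefault(email, source_url)  # keep the first page seen
    found_socials = socials or {}
    return {
        "domain": domain,
        "emails": rank_emails(list(source_pages)),
        "phones": list(phones),
        # Every key is always present; absent networks are explicit None (JSON null).
        "socials": {key: found_socials.get(key) for key in SOCIAL_KEYS},
        "page_title": page_title,
        "meta_description": meta_description,
        "source_pages": source_pages,
        "scraped_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Serializing the result with `json.dumps` turns the `None` values into `null`, so every record has the same shape regardless of what was found.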
Deliverability: extraction is not verification
This is the part most lead-gen pipelines get wrong. A scraped email is a candidate, not a confirmed mailbox. Before you load it into a send:
- Verify separately: run candidates through an email-verification step (MX check, SMTP probe, catch-all detection). Blasting unverified scraped addresses tanks your sender reputation fast.
- Prefer role to generic where appropriate: info@ and contact@ are shared inboxes that may route to a black hole; a named or departmental address often performs better, but also respect that generic boxes are the intended public contact.
- Respect consent and local law: GDPR, CAN-SPAM and CASL govern cold outreach. Scraping publicly-listed business contacts is generally permissible; how you use them is where the legal line sits. Honor opt-outs and keep your basis-for-processing defensible.
Treat the scraper as the top of a funnel: extract broadly, verify ruthlessly, send narrowly.
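As a rough illustration of that funnel, the sketch below routes each extracted candidate by the result of a separate verification step. The `verify` callable and its status strings ("deliverable", "undeliverable", and anything else) are a hypothetical contract; the actual MX, SMTP and catch-all checks would live inside whatever verifier you plug in.

```python
def triage(candidates, verify):
    """Bucket scraped emails by verification result.

    verify: callable(email) -> status string (hypothetical contract).
    Only "deliverable" addresses reach the send list; "undeliverable" ones
    are dropped; every other status (catch-all, unknown, ...) goes to review.
    """
    buckets = {"send": [], "review": [], "drop": []}
    for email in candidates:
        status = verify(email)
        if status == "deliverable":
            buckets["send"].append(email)
        elif status == "undeliverable":
            buckets["drop"].append(email)
        else:
            buckets["review"].append(email)
    return buckets
```

Defaulting unrecognized statuses to review rather than send keeps the send list narrow even when the verifier is uncertain.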
Typical use cases
- B2B lead generation — turn a list of target-company domains into an outreach-ready contact list.
- Sales prospecting / CRM enrichment — append emails, phones and socials to accounts already in your CRM.
- Competitive research — map a competitor set’s contact channels and social footprint.
- Recruiting outreach — find direct contact details for hiring or partnership conversations.
- Agency prospecting at scale — gather prospect contact and social links across an industry vertical.
The value is throughput with quality control: thousands of domains in, a clean and attributed contact list out — which then feeds a verification step before any send.
Cost math for the managed approach
This scraper is priced per result (per dataset item), so cost tracks leads found, not effort spent. A 2,000-domain enrichment pass is a small, predictable bill. Against the alternative — building and maintaining the contact-page discovery, the browser fallback, the false-positive filter and the proxy rotation — the per-result model means you pay for outcomes. The bigger cost you’re avoiding is reputational: a poorly-filtered DIY scraper that floods your list with @2x.png garbage and dead catch-alls costs you deliverability, which is far more expensive than compute.
Common pitfalls
- Skipping the browser fallback: JS-rendered contact pages return empty to a pure HTTP fetch. You silently miss the very sites that invest in modern frontends.
- Trusting raw regex output: without false-positive filtering you’ll ship image filenames and DSNs as “emails.” Always filter.
- Crawling too deep: depth beyond contact/about pages explodes cost and pulls in irrelevant addresses (blog comment authors, vendor footers).
- Treating extraction as verification: the single biggest deliverability mistake. Always verify before sending.
- Ignoring impressum/kontakt: for DACH and other EU sites, the legally-mandated imprint page is where the real contact lives. Make sure discovery includes localized contact-page names.
Wrapping up
Website contact extraction looks like a one-line regex and turns out to be a discovery problem (finding the contact page), a rendering problem (JS sites), and a quality problem (false positives and deliverability). For a handful of domains, a script will do. For a real lead-gen pipeline where list quality protects your sender reputation, let a managed, per-result scraper handle discovery, the browser fallback and filtering, then verify before you send.
▶ Open the Website Contact Scraper on Apify — emails, phones and social profiles per domain, false-positive filtered, JSON output. Pay per result. Feed it your target domains and start enriching.