lead-generation · May 26, 2026 · 7 min read

How to Scrape Emails, Phones & Social Links From Any Website in 2026

Extract emails, phone numbers and social profiles from company websites at scale — auto-crawling Contact and About pages — for B2B lead gen, CRM enrichment and outreach.

You have a list of domains. You need the emails, phones and social profiles behind them — for an outreach campaign, a CRM enrichment pass, or a competitive map. Visiting each site by hand, hunting for the Contact page, copy-pasting the address and dodging the obfuscated info [at] company [dot] com is the kind of work nobody should be doing in 2026. This guide covers how website contact extraction actually works — where the data hides, how to avoid garbage matches, and how deliverability concerns shape what’s worth keeping.

What’s worth extracting

Per domain, a contact crawl can surface a surprisingly complete picture:

Emails — every valid address found across the homepage and contact/about/team pages, deduplicated.
Phone numbers — from tel: links (highest confidence) and international-format text matches.
Social profiles — LinkedIn, X/Twitter, Instagram, Facebook and YouTube URLs, detected by known domain patterns.
Page metadata — page title and meta description, useful for qualifying the prospect.
Attribution — the root domain, which page each contact came from (homepage vs. contact vs. about), and a scrape timestamp.
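
As a rough illustration of the phone and social items above, the sketch below reads anchor hrefs for tel: links and known social hostnames, and separately scans free text for numbers written with a leading plus sign. The names SOCIAL_HOSTS, LinkCollector, collect_links and find_international_numbers are made up for this example, and the 8-digit lower bound is an assumption (the 15-digit upper bound comes from E.164). Share links such as tweet-intent URLs would also match here and would need extra filtering in practice.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

# Assumed hostname-to-key mapping; subdomains other than www are ignored.
SOCIAL_HOSTS = {
    "linkedin.com": "linkedin",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "instagram.com": "instagram",
    "facebook.com": "facebook",
    "youtube.com": "youtube",
}

# A leading "+" followed by digits and common separators.
INTL_PHONE_RE = re.compile(r"\+\d[\d\s().-]{6,20}\d")


class LinkCollector(HTMLParser):
    """Collects tel: numbers and social profile URLs from <a href> values."""

    def __init__(self):
        super().__init__()
        self.phones = []
        self.socials = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("tel:"):
            number = unquote(href[4:]).strip()
            if number and number not in self.phones:
                self.phones.append(number)
            return
        host = (urlparse(href).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        key = SOCIAL_HOSTS.get(host)
        # Keep only the first URL seen per network.
        if key and key not in self.socials:
            self.socials[key] = href


def collect_links(html):
    """Return (phones, socials) found in anchor hrefs."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.phones, parser.socials


def find_international_numbers(text):
    """Return '+<digits>' strings for free-text matches with 8 to 15 digits."""
    numbers = []
    for match in INTL_PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", match)
        if 8 <= len(digits) <= 15 and "+" + digits not in numbers:
            numbers.append("+" + digits)
    return numbers
```

Requiring the leading plus sign is what keeps dates and order numbers out of the free-text results; numbers written in national format are only picked up when they appear in a tel: link.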

For B2B lead gen the email is the headline, but the LinkedIn URL is often the more durable asset — emails bounce, LinkedIn profiles persist. Phones matter most for local-business and high-ticket outreach.

Where contact data hides — and why naive extraction fails

A regex over the homepage misses most of the data and grabs a lot of junk. The real problems:

Contacts aren’t on the homepage — they live on /contact, /about, /team, /impressum (mandatory in DACH), /contact-us. You have to discover and crawl those pages, typically at depth 1 from the root.
Client-rendered contacts — modern sites inject the email via JavaScript or hide it behind a React component. A pure HTML fetch returns an empty shell. You need a browser fallback for those domains.
Obfuscation — name [at] domain [dot] com, entity-encoded addresses, image-rendered emails. Some are recoverable with normalization; image-only emails aren’t (and that’s by design on the site’s part).
False positives — regex email matching happily grabs 2x@3.png from a CSS sprite, user@2x retina image refs, Sentry DSNs, and version strings. Without false-positive filtering your list is half garbage.
Phone format chaos: numbers appear as +44 20..., (212) 555... or 0800-.... Links using the tel: scheme are reliable; free-text matches need international-format validation or you’ll capture order numbers and dates.
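
To make the obfuscation and false-positive points concrete, here is a minimal sketch that normalizes a couple of obfuscation spellings, decodes HTML entities, and then drops matches that look like asset filenames or Sentry DSNs. The pattern lists (ASSET_SUFFIXES, JUNK_DOMAIN_HINTS) and the function names are assumptions for illustration; a production filter would need a longer, regularly updated list.

```python
import html
import re

EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,24}"
)

# Assumed obfuscation spellings: [at], (at), {at} and the same for "dot".
AT_RE = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.IGNORECASE)
DOT_RE = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.IGNORECASE)

# Assumed junk markers: file extensions and domains that are never mailboxes.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")
JUNK_DOMAIN_HINTS = ("sentry.io", "example.com")


def normalize(text):
    """Decode HTML entities and rewrite '[at]' / '[dot]' spellings."""
    text = html.unescape(text)
    text = AT_RE.sub("@", text)
    return DOT_RE.sub(".", text)


def is_plausible(email):
    """Reject matches that end like a filename or sit on a junk domain."""
    domain = email.partition("@")[2]
    if email.endswith(ASSET_SUFFIXES):
        return False
    return not any(hint in domain for hint in JUNK_DOMAIN_HINTS)


def extract_emails(text):
    """Return lowercased, deduplicated, filtered addresses in first-seen order."""
    found = []
    for match in EMAIL_RE.findall(normalize(text)):
        email = match.lower()
        if is_plausible(email) and email not in found:
            found.append(email)
    return found
```

Image-rendered addresses are out of reach for this kind of text processing, as noted above. Version strings such as name@1.2.3 never match in the first place, because the pattern requires an alphabetic top-level domain.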

The reliable approach is: discover contact pages by URL pattern, crawl them with HTTP first and a JavaScript-enabled browser as a fallback, regex-scan with aggressive false-positive filtering, dedupe, and attribute each hit to its source page. A managed actor does all of this — including the browser fallback with asset-blocking and fingerprinting — so you don’t rebuild it per project.
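
The HTTP-first, browser-second idea can be sketched as follows. This is an assumption-laden illustration, not how any particular actor works: the 200-character threshold is arbitrary, and render_with_browser is a hypothetical callable you would supply yourself (for example, a wrapper around a headless browser).

```python
import re
from urllib.request import Request, urlopen

SCRIPT_STYLE_RE = re.compile(
    r"<(script|style|noscript)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text_length(html):
    """Approximate how much human-readable text the raw HTML contains."""
    without_code = SCRIPT_STYLE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", without_code)
    return len(" ".join(text.split()))


def looks_client_rendered(html, min_chars=200):
    """Heuristic: very little visible text suggests an empty JS shell."""
    return visible_text_length(html) < min_chars


def fetch_page(url, render_with_browser=None, timeout=15):
    """Fetch over plain HTTP; use the browser callable only for empty shells."""
    request = Request(url, headers={"User-Agent": "contact-crawler-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    if render_with_browser is not None and looks_client_rendered(html):
        return render_with_browser(url)
    return html
```

Keeping the browser behind a cheap heuristic means most domains never pay the cost of a full render.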

▶ Run the Website Contact Scraper — feed it domains, it auto-detects Contact/About pages (depth 1), extracts emails, phones and socials with false-positive filtering, and returns clean JSON per domain. HTTP-first with a Playwright fallback for JS-rendered sites.

How the crawl works

The flow per input domain:

1. Fetch homepage (HTTP).
2. Discover candidate contact pages by URL pattern:
     /contact, /contact-us, /about, /team, /impressum, /kontakt ...
3. Crawl those pages (depth 1). If a page is JS-rendered and empty,
   retry it with a headless browser (assets/analytics blocked).
4. Regex-scan all fetched HTML for emails + phones; detect socials
   by domain pattern (linkedin.com/company, twitter.com, ...).
5. Filter false positives, deduplicate, attribute to source page.
6. Emit one record per domain.

The depth-1 discovery is the key design choice: deep enough to find the contact page, shallow enough not to crawl the entire site and rack up cost on a blog archive.
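
Step 2 of the flow, depth-1 discovery, might look like the sketch below: collect the homepage links, keep same-host URLs whose path contains a contact-style hint, and cap the number of pages. CONTACT_HINTS, HrefParser, discover_contact_pages and the limit of 5 are assumptions for illustration. Substring matching is deliberately loose and will occasionally pick up unrelated paths.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Path fragments from the flow above; extend for other languages as needed.
CONTACT_HINTS = ("contact", "about", "team", "impressum", "kontakt")


class HrefParser(HTMLParser):
    """Collects raw href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_contact_pages(homepage_url, homepage_html, limit=5):
    """Return up to `limit` same-host URLs whose path hints at contact info."""
    parser = HrefParser()
    parser.feed(homepage_html)
    root_host = urlparse(homepage_url).hostname
    found = []
    for href in parser.hrefs:
        # Resolve relative links and drop fragments so duplicates collapse.
        absolute = urljoin(homepage_url, href).split("#", 1)[0]
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https") or parts.hostname != root_host:
            continue
        path = parts.path.lower()
        if any(hint in path for hint in CONTACT_HINTS) and absolute not in found:
            found.append(absolute)
    return found[:limit]
```

Only links present on the homepage are considered, which is what keeps the crawl at depth 1. The strict host comparison treats www and non-www variants as different hosts, so a real crawler would probably normalize those first.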

Build it yourself vs. use a managed scraper

Roll your own — an afternoon to regex the homepage. Then the long tail: contact-page discovery, the browser fallback for JS sites, the false-positive filter (this alone is iterative — you keep finding new garbage patterns), phone validation, social detection, and per-page attribution. Plus proxy and concurrency management for any real volume.
Managed actor — running in minutes, browser fallback and false-positive filtering already tuned, output clean per domain, priced per result so cost scales with leads not effort.

For ten domains, a script is fine. For a few thousand domains where list quality directly affects sender reputation, the false-positive filtering and browser fallback are exactly what you don’t want to half-build.

Schema design for downstream use

A clean per-domain record:

{
  "domain": "acme-robotics.com",
  "emails": ["sales@acme-robotics.com", "info@acme-robotics.com"],
  "phones": ["+1-415-555-0142"],
  "socials": {
    "linkedin": "https://www.linkedin.com/company/acme-robotics",
    "twitter": "https://twitter.com/acmerobotics",
    "instagram": null,
    "facebook": "https://facebook.com/acmerobotics",
    "youtube": null
  },
  "page_title": "Acme Robotics — Industrial Automation",
  "meta_description": "Acme Robotics builds...",
  "source_pages": {
    "sales@acme-robotics.com": "https://acme-robotics.com/contact"
  },
  "scraped_at": "2026-05-26T10:00:00Z"
}

Schema choices worth making early:

Keep emails as an array, ranked. A generic info@ is a weaker lead than a role-specific sales@; preserve all and let downstream scoring pick.
Store source_pages attribution. An email found on /team is higher-confidence than one scraped from a footer that might be a vendor’s address.
Null out absent socials explicitly. Don’t omit the key — downstream enrichment logic is cleaner when the shape is stable.
Never treat a found email as verified. Extraction proves the address was published, not that it’s deliverable (more below).
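
A small sketch of these schema choices, assuming a hypothetical build_record helper: emails are deduplicated and ranked with generic boxes last, every social key is always present (None when absent), and each email keeps the first page it was seen on. The GENERIC_LOCALS set is an assumed starting list, not a standard.

```python
from datetime import datetime, timezone

SOCIAL_KEYS = ("linkedin", "twitter", "instagram", "facebook", "youtube")
GENERIC_LOCALS = {"info", "contact", "hello", "office", "mail"}  # assumed list


def rank_emails(emails):
    """Put non-generic addresses first; the sort is stable within each group."""
    return sorted(emails, key=lambda e: e.split("@", 1)[0] in GENERIC_LOCALS)


def build_record(domain, hits, phones, socials,
                 page_title=None, meta_description=None):
    """Build one per-domain record.

    hits: iterable of (email, source_page_url) pairs in crawl order.
    """
    source_pages = {}
    for email, page in hits:
        # setdefault keeps the first page per address, which also dedupes.
        source_pages.setdefault(email.lower(), page)
    return {
        "domain": domain,
        "emails": rank_emails(list(source_pages)),
        "phones": list(dict.fromkeys(phones)),
        # Stable shape: every network key exists, None when not found.
        "socials": {key: socials.get(key) for key in SOCIAL_KEYS},
        "page_title": page_title,
        "meta_description": meta_description,
        "source_pages": source_pages,
        "scraped_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Nothing in the record claims deliverability; verification status would belong in a separate field filled in by a later step.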

Deliverability: extraction is not verification

This is the part most lead-gen pipelines get wrong. A scraped email is a candidate, not a confirmed mailbox. Before you load it into a send:

Verify separately — run candidates through an email-verification step (MX check, SMTP probe, catch-all detection). Blasting unverified scraped addresses tanks your sender reputation fast.
Prefer role to generic where appropriate: info@ and contact@ are shared inboxes that may route to a black hole, so a named or departmental address often performs better. Keep in mind, though, that generic boxes are the intended public contact.
Respect consent and local law: GDPR, CAN-SPAM and CASL govern cold outreach. Scraping publicly listed business contacts is generally permissible; how you use them is where the legal line sits. Honor opt-outs and keep your basis-for-processing defensible.

Treat the scraper as the top of a funnel: extract broadly, verify ruthlessly, send narrowly.
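
The funnel can be expressed as a small triage step. The sketch below assumes a hypothetical verify callable that wraps whatever verification service or checks you use and returns a status string such as "valid", "invalid", "catch_all" or "unknown"; those status names are illustrative and not taken from any specific provider.

```python
def triage(candidates, verify):
    """Split scraped addresses into a send list and a held-back map.

    candidates: iterable of scraped email strings (may contain duplicates).
    verify: hypothetical callable taking an email and returning a status string.
    """
    sendable = []
    held = {}
    # dict.fromkeys dedupes case-insensitively while preserving order.
    for email in dict.fromkeys(e.lower() for e in candidates):
        status = verify(email)
        if status == "valid":
            sendable.append(email)
        else:
            held[email] = status
    return sendable, held
```

Only addresses with an explicit "valid" status reach the send list; everything else, including unknowns, is held back with its status so it can be reviewed or re-checked later.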

Typical use cases

B2B lead generation — turn a list of target-company domains into an outreach-ready contact list.
Sales prospecting / CRM enrichment — append emails, phones and socials to accounts already in your CRM.
Competitive research — map a competitor set’s contact channels and social footprint.
Recruiting outreach — find direct contact details for hiring or partnership conversations.
Agency prospecting at scale — gather prospect contact and social links across an industry vertical.

The value is throughput with quality control: thousands of domains in, a clean and attributed contact list out — which then feeds a verification step before any send.

Cost math for the managed approach

This scraper is priced per result (per dataset item), so cost tracks leads found, not effort spent. With one record emitted per domain, a 2,000-domain enrichment pass produces at most 2,000 billable items, which keeps the bill predictable. Against the alternative of building and maintaining the contact-page discovery, the browser fallback, the false-positive filter and the proxy rotation, the per-result model means you pay for outcomes. The bigger cost you’re avoiding is reputational: a poorly filtered DIY scraper that floods your list with @2x.png garbage and dead mailboxes costs you deliverability, which is far more expensive than compute.

Common pitfalls

Skipping the browser fallback — JS-rendered contact pages return empty to a pure HTTP fetch. You silently miss the very sites that invest in modern frontends.
Trusting raw regex output — without false-positive filtering you’ll ship image filenames and DSNs as “emails.” Always filter.
Crawling too deep — depth beyond contact/about pages explodes cost and pulls in irrelevant addresses (blog comment authors, vendor footers).
Treating extraction as verification — the single biggest deliverability mistake. Always verify before sending.
Ignoring impressum/kontakt: for DACH and other European sites, the legally mandated imprint page is where the real contact lives. Make sure discovery includes localized contact-page names.

Wrapping up

Website contact extraction looks like a one-line regex and turns out to be a discovery problem (finding the contact page), a rendering problem (JS sites), and a quality problem (false positives and deliverability). For a handful of domains, a script will do. For a real lead-gen pipeline where list quality protects your sender reputation, let a managed, per-result scraper handle discovery, the browser fallback and filtering, then verify before you send.

▶ Open the Website Contact Scraper on Apify — emails, phones and social profiles per domain, false-positive filtered, JSON output. Pay per result. Feed it your target domains and start enriching.