Crunchbase Scraper — Company, Funding & Investor Data Extraction

Data Dictionary

Every field we extract from crunchbase.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Company Profiles objects from crunchbase.com. All fields typed and schema-versioned.

company_name, domain, description, founded_date, employee_count_min, employee_count_max, hq_location, industry_groups, total_funding_usd, operating_status, last_funding_type, linkedin_url

{
  "company_name": "DataFlirt",
  "domain": "dataflirt.com",
  "founded_date": "2021-04-12",
  "employee_count_min": 11,
  "employee_count_max": 50,
  "hq_location": "Bengaluru, Karnataka, India",
  "total_funding_usd": 4500000,
  "operating_status": "Active"
}

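As a rough illustration of what "typed" can mean for the sample company record above, the sketch below loads it into a Python dataclass. The class name `CompanyProfile` and the helper `parse_company` are hypothetical and are not part of any published client library; only the field names and sample values come from this page.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CompanyProfile:
    """Hypothetical typed view of a company profile record."""
    company_name: str
    domain: str
    founded_date: Optional[date]
    employee_count_min: Optional[int]
    employee_count_max: Optional[int]
    hq_location: Optional[str]
    total_funding_usd: Optional[int]
    operating_status: Optional[str]


def parse_company(raw: dict) -> CompanyProfile:
    """Convert a raw JSON object into a typed record.

    company_name and domain are treated as required (an assumption);
    every other field falls back to None when absent.
    """
    founded = raw.get("founded_date")
    return CompanyProfile(
        company_name=raw["company_name"],
        domain=raw["domain"].lower(),
        founded_date=date.fromisoformat(founded) if founded else None,
        employee_count_min=raw.get("employee_count_min"),
        employee_count_max=raw.get("employee_count_max"),
        hq_location=raw.get("hq_location"),
        total_funding_usd=raw.get("total_funding_usd"),
        operating_status=raw.get("operating_status"),
    )


sample = {
    "company_name": "DataFlirt",
    "domain": "dataflirt.com",
    "founded_date": "2021-04-12",
    "employee_count_min": 11,
    "employee_count_max": 50,
    "hq_location": "Bengaluru, Karnataka, India",
    "total_funding_usd": 4500000,
    "operating_status": "Active",
}
profile = parse_company(sample)
```

Parsing `founded_date` into a real `date` at load time means malformed values fail early instead of surfacing later in a warehouse query.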

Complete list of extractable fields for Funding Rounds objects from crunchbase.com. All fields typed and schema-versioned.

transaction_name, announced_date, money_raised, currency, round_type, lead_investor, participating_investors, pre_money_valuation, post_money_valuation, investor_count

{
  "transaction_name": "Series A - TechCorp",
  "announced_date": "2024-02-15",
  "money_raised": 12000000,
  "currency": "USD",
  "round_type": "Series A",
  "lead_investor": "Sequoia Capital",
  "investor_count": 4
}


Complete list of extractable fields for Investors objects from crunchbase.com. All fields typed and schema-versioned.

investor_name, investor_type, location, investment_stage, total_investments, lead_investments, exits, portfolio_companies, active_status, website_url

{
  "investor_name": "Accel",
  "investor_type": "Venture Capital",
  "location": "Palo Alto, California",
  "total_investments": 1842,
  "lead_investments": 612,
  "exits": 345,
  "active_status": true
}


Complete list of extractable fields for Acquisitions objects from crunchbase.com. All fields typed and schema-versioned.

acquiree_name, acquirer_name, announced_date, price, currency, acquisition_type, acquiree_revenue, acquiree_employees, status

{
  "acquiree_name": "StartupX",
  "acquirer_name": "MegaCorp",
  "announced_date": "2023-11-20",
  "price": 45000000,
  "currency": "USD",
  "acquisition_type": "Acquisition",
  "status": "Complete"
}


Complete list of extractable fields for Leadership objects from crunchbase.com. All fields typed and schema-versioned.

person_name, current_title, company_name, past_roles, board_seats, education, location, linkedin_url, twitter_url

{
  "person_name": "Jane Doe",
  "current_title": "Chief Technology Officer",
  "company_name": "DataFlirt",
  "past_roles": ["VP Engineering at TechCorp", "Senior Engineer at StartupX"],
  "location": "Bengaluru, India",
  "linkedin_url": "https://linkedin.com/in/janedoe"
}


Capabilities

Everything you need from Crunchbase — nothing you don't

Our Crunchbase scraper handles the entire directory structure: company profiles, funding histories, and investor graphs — with JavaScript rendering and Datadome bypass built in.

Company Overviews

Extract core firmographic data: founding date, employee counts, HQ location, operating status, and detailed industry taxonomies.

Funding Histories

Capture every funding round, including announced date, money raised, round type, pre-money valuation, and participating investors.

Investor Portfolios

Map investor profiles to their portfolio companies, tracking total investments, lead investments, exits, and preferred investment stages.

M&A Tracking

Monitor acquisition events, tracking acquirer, acquiree, transaction value, and acquisition type across the entire database.

Leadership & Board Data

Extract executive teams, founders, and board members, including their current titles, past roles, and professional social links.

Industry & Taxonomy Mapping

Normalise Crunchbase category groups and specific tags into structured arrays for precise market segmentation.

Web Traffic & Tech Stack

Extract built-in third-party signals surfaced on Crunchbase profiles, including monthly visit estimates and active technology tools.

IPO & Financials

Track IPO dates, stock symbols, initial stock prices, and valuation metrics for public companies listed in the directory.

Scheduled Updates

Configure continuous pipelines at daily or weekly cadences to capture new funding rounds and profile updates with change-detection diffing.
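Change-detection diffing of the kind mentioned for scheduled updates can be sketched with a per-record hash. This is a minimal illustration rather than the production logic: the function names are hypothetical, and keying records by `domain` is an assumption made for the example.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable SHA-256 over a record's canonical JSON form (sorted keys)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(records, previous_hashes, key="domain"):
    """Return the records that are new or modified, plus this run's hash map.

    previous_hashes maps each record key to the hash stored by the last run.
    """
    current_hashes = {}
    changed = []
    for rec in records:
        digest = record_hash(rec)
        current_hashes[rec[key]] = digest
        if previous_hashes.get(rec[key]) != digest:
            changed.append(rec)
    return changed, current_hashes


# First run: nothing stored yet, so every record counts as changed.
run_1 = [{"domain": "dataflirt.com", "operating_status": "Active"}]
changed_1, stored = changed_records(run_1, {})

# Second run with identical data: nothing is emitted.
changed_2, stored = changed_records(run_1, stored)
```

Sorting keys before hashing keeps the digest independent of field order, so a record only counts as changed when a value actually differs.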

// engagement pipeline

From domain list to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Provide target domains, investor names, or category filters. We design the extraction schema together.

Pipeline Build

Days 2–4

We configure Scrapy / Playwright crawlers, proxy rotation, session management, and Datadome bypass for crunchbase.com.

Validation & QA

Days 4–6

Schema validation, null-rate checks, and sample profile exports before full launch.

Delivery

ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
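The null-rate check named in the validation step could look something like the sketch below. The helper names and the 5% threshold are assumptions chosen for illustration; real thresholds would be agreed per field during scoping.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing, None, or an empty string."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if rec.get(field) in (None, "")) / total
        for field in fields
    }


def failing_fields(records, fields, max_null_rate=0.05):
    """Names of fields whose null rate exceeds the allowed maximum, sorted."""
    rates = null_rates(records, fields)
    return sorted(field for field, rate in rates.items() if rate > max_null_rate)


batch = [
    {"company_name": "DataFlirt", "domain": "dataflirt.com"},
    {"company_name": "TechCorp", "domain": ""},
    {"company_name": "StartupX", "domain": "startupx.example"},
    {"company_name": "MegaCorp"},
]
rates = null_rates(batch, ["company_name", "domain"])
problems = failing_fields(batch, ["company_name", "domain"])
```

Running a check like this on a sample export before launch shows which fields are sparsely populated on the source pages, so the schema can mark them optional up front.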

Under the hood

How our Crunchbase pipeline handles the hard parts

Crunchbase relies on Datadome and heavy client-side rendering. Here is how we maintain reliable extraction without IP bans.

// fingerprinting

Identity rotation

TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued (running)

// observability

Pipeline health

Uptime: 99.9%
p99 latency: 142 ms
Null rate: 0.3%

Datadome bypass

Residential proxies + TLS fingerprinting

Crunchbase uses Datadome to block automated traffic. We route requests through high-trust residential ISP proxies and spoof TLS fingerprints at the network layer, ensuring our crawlers appear as legitimate business users.

SPA State Extraction

Direct Next.js/React state parsing

Instead of relying purely on fragile DOM selectors, our pipeline intercepts the underlying JSON state injected by the frontend framework. This provides cleaner, strictly typed data directly from the source.
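A minimal sketch of state parsing, using only the Python standard library. It assumes the page embeds its state as JSON inside a script tag with the id `__NEXT_DATA__`, which is the usual Next.js convention; whether a given Crunchbase page does so, and what the JSON looks like, are assumptions here. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class StateScriptParser(HTMLParser):
    """Collects the text content of the <script> tag with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_state(html, script_id="__NEXT_DATA__"):
    """Return the embedded JSON state as a dict, or None if the tag is absent."""
    parser = StateScriptParser(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return json.loads(text) if text else None


# Invented page body; the nesting under "props" is illustrative only.
page_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"company": {"domain": "dataflirt.com"}}}'
    "</script></body></html>"
)
state = extract_state(page_html)
```

Reading the state blob avoids coupling the extractor to CSS class names, which tend to change more often than the underlying data model.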

GraphQL Interception

Capturing XHR payloads

Much of the data on Crunchbase loads asynchronously via GraphQL requests. We intercept these network payloads during Playwright execution to extract deep pagination results without rendering thousands of DOM nodes.
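Payload interception can be sketched as a Playwright `response` listener. The example below is written against the synchronous Playwright API (`page.on`, `response.url`, `response.request.resource_type`, `response.json()`), but it takes the page object as a parameter so the filtering logic can be exercised without a browser. The `/graphql` URL fragment and the function names are assumptions, not confirmed endpoint details.

```python
GRAPHQL_PATH_HINT = "/graphql"  # assumed URL fragment, not a confirmed endpoint


def is_api_response(url: str, resource_type: str) -> bool:
    """True for XHR/fetch responses whose URL looks like a GraphQL call."""
    return resource_type in ("xhr", "fetch") and GRAPHQL_PATH_HINT in url


def attach_collector(page, sink: list) -> None:
    """Register a 'response' listener that appends matching JSON payloads to sink."""

    def on_response(response):
        if not is_api_response(response.url, response.request.resource_type):
            return
        try:
            sink.append(response.json())
        except Exception:
            # Bodies that are not valid JSON are skipped rather than
            # allowed to abort the crawl.
            pass

    page.on("response", on_response)
```

In a real run, `attach_collector(page, payloads)` would be called before `page.goto(...)`, so that responses fired during the initial load are captured as well.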

Rate Limit Management

Distributed concurrency control

Aggressive scraping triggers temporary blocks. We distribute requests across a wide IP pool and implement randomised delays between page loads, adhering to safe concurrency limits while maintaining high overall throughput.
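Randomised delays combined with a concurrency cap can be sketched with `asyncio`. The limits below (4 concurrent requests, 1 to 3 second pauses) are placeholder values, and `fetch` stands in for whatever coroutine actually loads a page; both are assumptions for the example.

```python
import asyncio
import random


def jittered_delay(base: float, jitter: float, rng: random.Random) -> float:
    """Delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0.0, jitter)


async def crawl(urls, fetch, max_concurrency=4, base=1.0, jitter=2.0, seed=None):
    """Fetch urls with at most max_concurrency in flight and a random pause before each.

    Results are returned in the same order as urls.
    """
    rng = random.Random(seed)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            await asyncio.sleep(jittered_delay(base, jitter, rng))
            return await fetch(url)

    return await asyncio.gather(*(worker(url) for url in urls))
```

The semaphore bounds how many requests are in flight at once, while the jitter keeps request spacing from forming a regular, easily detected pattern.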

Entity Resolution

Handling duplicate profiles

We normalise company domains and Crunchbase UUIDs to ensure downstream systems receive deduplicated records, even when multiple profile variants exist for the same entity.
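One way to picture the deduplication step: normalise each domain, then drop any record whose UUID or normalised domain has already been seen. The `uuid` field name and both helper names are assumptions for this sketch, and real entity resolution would need more rules (redirected domains, subsidiaries) than are shown here.

```python
from urllib.parse import urlsplit


def normalise_domain(value: str) -> str:
    """Reduce a URL or bare hostname to a lowercase host without a leading 'www.'."""
    value = value.strip().lower()
    if "//" not in value:
        value = "//" + value  # lets urlsplit treat a bare host as a network location
    host = urlsplit(value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def dedupe(records):
    """Keep the first record for each UUID and each normalised domain."""
    seen_uuids, seen_domains, unique = set(), set(), []
    for rec in records:
        uuid = rec.get("uuid")
        domain = normalise_domain(rec.get("domain", ""))
        if (uuid and uuid in seen_uuids) or (domain and domain in seen_domains):
            continue
        if uuid:
            seen_uuids.add(uuid)
        if domain:
            seen_domains.add(domain)
        unique.append(rec)
    return unique


variants = [
    {"uuid": "a1", "domain": "https://www.DataFlirt.com/"},
    {"uuid": "b2", "domain": "dataflirt.com"},
    {"uuid": "a1", "domain": "other.example"},
    {"uuid": "c3", "domain": "techcorp.example"},
]
unique = dedupe(variants)
```

Here the second record is dropped because its domain matches the first after normalisation, and the third because it reuses the first record's UUID.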

Applications

Who uses Crunchbase data — and how

Teams across industries use crunchbase.com data to build competitive products and smarter operations.

VC Deal Sourcing

Venture capital firms monitor new funding rounds and startup formations to identify early-stage investment opportunities before competitors.

B2B Lead Generation

Sales teams build targeted account lists based on recent funding events, employee count growth, and specific industry verticals.

Market Mapping

Strategy teams analyse entire sectors, tracking capital flows and competitor density to identify underserved market segments.

Competitor Intelligence

Product teams track competitor funding, acquisition activity, and leadership changes to anticipate market movements.

M&A Target Identification

Corporate development teams screen for potential acquisition targets based on funding stage, valuation estimates, and investor syndicates.

Investment Trend Analysis

Analysts aggregate thousands of funding rounds to model macro trends in capital deployment across regions and technologies.

Technical Spec

Crunchbase scraper — technical capabilities

Everything supported by our crunchbase.com scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Datadome bypass

Automated solver integration and TLS fingerprint spoofing

Supported

GraphQL interception

Direct extraction of XHR payloads for structured data

Supported

Residential proxy rotation

ISP-grade residential IPs from US/UK/EU pools

Supported

Pagination handling

Deep traversal of investor portfolios and funding lists

Supported

Change detection (diffs)

Hash-based diff: only emit records with changed fields since last run

Supported

Webhook delivery

HTTP POST per record or batch for real-time CRM updates

Supported

Funding round extraction

Complete historical capture of all capital events

Supported

Investor portfolio mapping

Bidirectional linking between companies and investors

Supported

Pro-only contact emails

Extraction of personal email addresses hidden behind paywalls

Partial

Saved search exports

Execution of user-specific saved queries requiring authentication

Partial

Infrastructure

Infrastructure powering the Crunchbase pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

Scrapy + Playwright Stack

Scrapy handles crawl orchestration, deduplication, and retry logic. Playwright handles JavaScript rendering, cookie sessions, and interaction flows. The two are combined via the scrapy-playwright download handler.
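For readers unfamiliar with scrapy-playwright, a Scrapy project typically enables it through a few entries in `settings.py`. The handler path and reactor setting below follow the scrapy-playwright documentation; the concurrency, delay, and retry values are placeholder assumptions rather than the settings this pipeline actually uses.

```python
# settings.py (sketch)

# Route HTTP and HTTPS downloads through Playwright-capable handlers.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Placeholder throttling and retry values (assumptions).
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1.0
RETRY_TIMES = 3
```

Individual requests then opt in to browser rendering by setting `meta={"playwright": True}`; requests without that flag go through Scrapy's regular downloader.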

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies across US/UK/EU regions. Rotation happens per-request with sticky sessions where required. IP score monitoring prevents blacklisted pool contamination.

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested — schema versioned per run

CSV

Flat file with typed columns — Excel/Sheets compatible

XLS

Legacy spreadsheet format for business users

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery — compatible with any data lake

Webhook

HTTP POST per record for real-time downstream processing

API

REST endpoint to query your extracted datasets

BigQuery

Streamed directly into your dataset with schema auto-detect

Snowflake

Stage + COPY INTO workflow — incremental or full-replace

Postgres

Upsert into your existing schema with conflict resolution
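The upsert pattern behind the Postgres destination can be illustrated with `INSERT ... ON CONFLICT ... DO UPDATE`. To keep the sketch runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in (SQLite 3.24 or later accepts the same clause); the table layout and the choice of `domain` as the conflict key are assumptions, and the second funding figure is made up purely to show an update.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO companies (domain, company_name, total_funding_usd)
VALUES (:domain, :company_name, :total_funding_usd)
ON CONFLICT (domain) DO UPDATE SET
    company_name = excluded.company_name,
    total_funding_usd = excluded.total_funding_usd
"""


def upsert_companies(conn, records):
    """Insert new rows and overwrite existing rows that share a domain."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, records)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies ("
    "domain TEXT PRIMARY KEY, company_name TEXT, total_funding_usd INTEGER)"
)

# First delivery inserts the row; the second updates it in place.
upsert_companies(conn, [
    {"domain": "dataflirt.com", "company_name": "DataFlirt", "total_funding_usd": 4500000},
])
upsert_companies(conn, [
    {"domain": "dataflirt.com", "company_name": "DataFlirt", "total_funding_usd": 6000000},
])
rows = conn.execute("SELECT domain, total_funding_usd FROM companies").fetchall()
```

Because each delivery is idempotent per key, re-running a failed batch does not create duplicate rows.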


// faq

Common questions.

About crunchbase.com scraping, legality, and pipeline operations.


Is scraping Crunchbase legal?

Scraping publicly available information is generally permissible under applicable law. DataFlirt targets only public, non-authenticated company, funding, and investor data. We do not extract personal contact information behind paywalls or circumvent authentication systems. Clients should review Terms of Service and consult legal counsel for specific use cases.

How do you handle Datadome protection?

We use residential ISP proxies, full Playwright browser sessions with realistic TLS fingerprints, and request timing modelled on human behaviour. We monitor for block rate spikes in real time and trigger pool rotation automatically.

Can you extract data hidden behind the Crunchbase Pro paywall?

No. We only extract data that is publicly accessible on the platform without requiring a paid user session. This includes core firmographics, funding rounds, and investor profiles, but excludes proprietary contact data.

How fresh is the data?

Pipelines can be configured to run daily or weekly. On a daily cadence, we track recently funded companies and updated profiles so that your warehouse reflects market changes within about 24 hours of publication; on a weekly cadence, changes arrive with the next scheduled run.

Do you extract historical funding rounds?

Yes. When we map a company profile, we extract the entire historical ledger of funding events available on the page, not just the most recent round.

What is the minimum viable engagement?

Our smallest packages start at a defined list of 5,000 domains or specific category filters with weekly delivery. Contact us with your target parameters for a scoped quote.

How do you handle duplicate company profiles?

We use Crunchbase UUIDs and domain normalisation to ensure records are deduplicated before delivery, providing a clean relational dataset.

Can I request a sample dataset before committing?

Absolutely. We provide a sample run of up to 500 company profiles as part of the pre-engagement scoping process to validate schema fit and data quality.

Startup intelligence, at warehouse scale.


Tell us what to extract. We do the rest.

Data Extraction for Every Industry
