
Startup intelligence,
at warehouse scale.

We extract company profiles, funding histories, investor portfolios, and M&A data from Crunchbase. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Companies tracked
3.2M
Funding events
18.4K /week
Investor profiles
412K /run
Active pipelines
112
Uptime
99.94%
Data Dictionary

Every field we extract from crunchbase.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Company Profiles objects from crunchbase.com. All fields typed and schema-versioned.

Fields: company_name, domain, description, founded_date, employee_count_min, employee_count_max, hq_location, industry_groups, total_funding_usd, operating_status, last_funding_type, linkedin_url
company_profiles
● 200 OK
"company_name": "DataFlirt",
"domain": "dataflirt.com",
"founded_date": "2021-04-12",
"employee_count_min": 11,
"employee_count_max": 50,
"hq_location": "Bengaluru, Karnataka, India",
"total_funding_usd": 4500000,
"operating_status": "Active"

Complete list of extractable fields for Funding Rounds objects from crunchbase.com. All fields typed and schema-versioned.

Fields: transaction_name, announced_date, money_raised, currency, round_type, lead_investor, participating_investors, pre_money_valuation, post_money_valuation, investor_count
funding_rounds
● 200 OK
"transaction_name": "Series A - TechCorp",
"announced_date": "2024-02-15",
"money_raised": 12000000,
"currency": "USD",
"round_type": "Series A",
"lead_investor": "Sequoia Capital",
"investor_count": 4

Complete list of extractable fields for Investors objects from crunchbase.com. All fields typed and schema-versioned.

Fields: investor_name, investor_type, location, investment_stage, total_investments, lead_investments, exits, portfolio_companies, active_status, website_url
investors
● 200 OK
"investor_name": "Accel",
"investor_type": "Venture Capital",
"location": "Palo Alto, California",
"total_investments": 1842,
"lead_investments": 612,
"exits": 345,
"active_status": true

Complete list of extractable fields for Acquisitions objects from crunchbase.com. All fields typed and schema-versioned.

Fields: acquiree_name, acquirer_name, announced_date, price, currency, acquisition_type, acquiree_revenue, acquiree_employees, status
acquisitions
● 200 OK
"acquiree_name": "StartupX",
"acquirer_name": "MegaCorp",
"announced_date": "2023-11-20",
"price": 45000000,
"currency": "USD",
"acquisition_type": "Acquisition",
"status": "Complete"

Complete list of extractable fields for Leadership objects from crunchbase.com. All fields typed and schema-versioned.

Fields: person_name, current_title, company_name, past_roles, board_seats, education, location, linkedin_url, twitter_url
leadership
● 200 OK
"person_name": "Jane Doe",
"current_title": "Chief Technology Officer",
"company_name": "DataFlirt",
"past_roles": "['VP Engineering at TechCorp', 'Senior Engineer at StartupX']",
"location": "Bengaluru, India",
"linkedin_url": "https://linkedin.com/in/janedoe"

Capabilities

Everything you need from Crunchbase — nothing you don't

Our Crunchbase scraper handles the entire directory structure: company profiles, funding histories, and investor graphs — with JavaScript rendering and Datadome bypass built in.

Company Overviews

Extract core firmographic data: founding date, employee counts, HQ location, operating status, and detailed industry taxonomies.

Funding Histories

Capture every funding round, including announced date, money raised, round type, pre-money valuation, and participating investors.

Investor Portfolios

Map investor profiles to their portfolio companies, tracking total investments, lead investments, exits, and preferred investment stages.

M&A Tracking

Monitor acquisition events, tracking acquirer, acquiree, transaction value, and acquisition type across the entire database.

Leadership & Board Data

Extract executive teams, founders, and board members, including their current titles, past roles, and professional social links.

Industry & Taxonomy Mapping

Normalise Crunchbase category groups and specific tags into structured arrays for precise market segmentation.

Web Traffic & Tech Stack

Extract built-in third-party signals surfaced on Crunchbase profiles, including monthly visit estimates and active technology tools.

IPO & Financials

Track IPO dates, stock symbols, initial stock prices, and valuation metrics for public companies listed in the directory.

Scheduled Updates

Configure continuous pipelines at daily or weekly cadences to capture new funding rounds and profile updates with change-detection diffing.
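
As a rough illustration of change-detection diffing, the sketch below hashes each record and emits only the records whose hash differs from the previous run. The function names (`record_hash`, `changed_records`), the `key` parameter, the in-memory `previous` mapping, and the sample values are assumptions for illustration, not a description of the production pipeline.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable SHA-256 over a record's fields (key order does not matter)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(records, previous, key="domain"):
    """Return (changed, current).

    `changed` holds records that are new or modified since the last run.
    `current` maps each record key to its hash, to be stored for the next run.
    `previous` is the same mapping from the last run (hypothetical storage).
    """
    changed, current = [], {}
    for rec in records:
        digest = record_hash(rec)
        current[rec[key]] = digest
        if previous.get(rec[key]) != digest:
            changed.append(rec)
    return changed, current


# Sample values for illustration only.
run_1 = [{"domain": "dataflirt.com", "total_funding_usd": 4500000}]
run_2 = [
    {"domain": "dataflirt.com", "total_funding_usd": 4500000},
    {"domain": "example.com", "total_funding_usd": 12000000},
]

first, state = changed_records(run_1, previous={})
second, state = changed_records(run_2, previous=state)
```

On the first run every record is emitted; on the second run only the newly seen record is, because the unchanged record hashes to the same value as before.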

// engagement pipeline

From domain list to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide target domains, investor names, or category filters. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy / Playwright crawlers, proxy rotation, session management, and Datadome bypass for crunchbase.com.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and sample profile exports before full launch.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
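
The null-rate check in the Validation & QA step could look roughly like the sketch below. The helper names and the 5% threshold are illustrative assumptions; the real thresholds would be agreed per engagement.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing, None, or an empty string."""
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    rates = {}
    for f in fields:
        missing = sum(1 for r in records if r.get(f) in (None, ""))
        rates[f] = missing / total
    return rates


def failing_fields(records, fields, max_null_rate=0.05):
    """Fields whose null rate exceeds the threshold (0.05 is an assumed default)."""
    rates = null_rates(records, fields)
    return sorted(f for f, rate in rates.items() if rate > max_null_rate)


# Two sample records; the second is deliberately incomplete.
sample = [
    {"company_name": "DataFlirt", "domain": "dataflirt.com", "founded_date": "2021-04-12"},
    {"company_name": "TechCorp", "domain": None},
]
fields = ["company_name", "domain", "founded_date"]

rates = null_rates(sample, fields)
failing = failing_fields(sample, fields)
```

A check like this would typically run on the sample export, so that sparse fields are flagged before the full launch.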

Under the hood

How our Crunchbase pipeline handles the hard parts

Crunchbase relies on Datadome and heavy client-side rendering. Here is how we maintain reliable extraction without IP bans.

pipeline-monitor · crunchbase.com · live · active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Datadome bypass
Residential proxies + TLS fingerprinting

Crunchbase uses Datadome to block automated traffic. We route requests through high-trust residential ISP proxies and spoof TLS fingerprints at the network layer, ensuring our crawlers appear as legitimate business users.

SPA State Extraction
Direct Next.js/React state parsing

Instead of relying purely on fragile DOM selectors, our pipeline intercepts the underlying JSON state injected by the frontend framework. This provides cleaner, strictly typed data directly from the source.
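
A minimal sketch of state parsing, assuming the page embeds its state as JSON inside a `<script>` tag the way Next.js does with `__NEXT_DATA__`. The tag id and the payload shape on any given page are assumptions, and the sample HTML is invented for illustration.

```python
import json
from html.parser import HTMLParser


class StateScriptParser(HTMLParser):
    """Collects the text content of the <script> tag with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_state(html, script_id="__NEXT_DATA__"):
    """Return the parsed JSON state, or None if the tag is absent or empty."""
    parser = StateScriptParser(script_id)
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None


# Invented sample page; real payloads are far larger and shaped differently.
html = (
    '<html><body><div id="app"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"company_name": "DataFlirt", "employee_count_min": 11}}'
    "</script></body></html>"
)
state = extract_state(html)
```

Reading the embedded state this way yields typed values (numbers stay numbers) without depending on CSS class names that may change between deployments.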

GraphQL Interception
Capturing XHR payloads

Much of the data on Crunchbase loads asynchronously via GraphQL requests. We intercept these network payloads during Playwright execution to extract deep pagination results without rendering thousands of DOM nodes.
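
The sketch below shows the kind of filtering and flattening that could be applied to captured payloads. The `/graphql` path check and the `nodes` key (borrowed from the common GraphQL connection convention) are assumptions; the actual endpoint and payload shape are not documented here.

```python
import json
from urllib.parse import urlparse


def is_graphql_response(url, content_type):
    """Heuristic: a JSON response whose URL path ends in /graphql (assumed path)."""
    path = urlparse(url).path.rstrip("/")
    return path.endswith("/graphql") and "application/json" in content_type.lower()


def extract_nodes(body):
    """Collect every dict listed under a key named 'nodes' anywhere in the payload."""
    payload = json.loads(body)
    found = []

    def walk(value):
        if isinstance(value, dict):
            for key, child in value.items():
                if key == "nodes" and isinstance(child, list):
                    found.extend(n for n in child if isinstance(n, dict))
                walk(child)
        elif isinstance(value, list):
            for item in value:
                walk(item)

    walk(payload)
    return found


# Invented payload for illustration.
body = (
    '{"data": {"company": {"funding_rounds": {"nodes": '
    '[{"round_type": "Series A", "money_raised": 12000000}]}}}}'
)
rounds = extract_nodes(body)
```

In a Playwright session, functions like these could be called from a handler registered with `page.on("response", ...)`, passing in the response URL, its `content-type` header, and its body text.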

Rate Limit Management
Distributed concurrency control

Aggressive scraping triggers temporary blocks. We distribute requests across a wide IP pool and implement randomised delays between page loads, adhering to safe concurrency limits while maintaining high overall throughput.
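
One way to combine a concurrency cap with randomised delays is sketched below using `asyncio`. The function name, the default limits, and the `fake_fetch` stand-in are assumptions for illustration; real limits would be tuned against observed block rates.

```python
import asyncio
import random


async def fetch_politely(urls, fetch, max_concurrency=4, min_delay=1.0, max_delay=3.0):
    """Call `fetch(url)` for each URL with bounded concurrency.

    Each call waits a random interval first, so requests are not evenly spaced.
    Results are returned in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            await asyncio.sleep(random.uniform(min_delay, max_delay))
            return await fetch(url)

    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for a real page load; simply echoes the URL.
    return url


results = asyncio.run(
    fetch_politely(["a", "b", "c"], fake_fetch, max_concurrency=2, min_delay=0.0, max_delay=0.01)
)
```

The semaphore bounds how many page loads are in flight at once, while the jittered sleep breaks up the regular timing pattern that fixed delays produce.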

Entity Resolution
Handling duplicate profiles

We normalise company domains and Crunchbase UUIDs to ensure downstream systems receive deduplicated records, even when multiple profile variants exist for the same entity.
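
A minimal sketch of this deduplication, assuming each record carries a `uuid` and a `domain` field. The field names, the placeholder UUID values, and the keep-first policy are assumptions for illustration.

```python
from urllib.parse import urlparse


def normalise_domain(value):
    """Reduce a URL or hostname to a bare lowercase domain without 'www.'."""
    value = value.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat the input as a network location
    host = urlparse(value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def deduplicate(records):
    """Keep the first record seen for each UUID or normalised domain."""
    seen_uuids, seen_domains, unique = set(), set(), []
    for rec in records:
        uuid = rec.get("uuid")
        domain = normalise_domain(rec.get("domain") or "")
        if (uuid and uuid in seen_uuids) or (domain and domain in seen_domains):
            continue  # a variant of an entity that was already emitted
        if uuid:
            seen_uuids.add(uuid)
        if domain:
            seen_domains.add(domain)
        unique.append(rec)
    return unique


# Placeholder identifiers, not real Crunchbase UUIDs.
records = [
    {"uuid": "u-1", "domain": "https://www.dataflirt.com/"},
    {"uuid": "u-2", "domain": "dataflirt.com"},
    {"uuid": "u-1", "domain": "dataflirt.io"},
    {"uuid": "u-3", "domain": "example.com"},
]
unique = deduplicate(records)
```

Here the second record is dropped because its domain matches the first, and the third because its UUID does, leaving two records.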

Applications

Who uses Crunchbase data — and how

Teams across industries use crunchbase.com data to build competitive products and smarter operations.

01
VC Deal Sourcing

Venture capital firms monitor new funding rounds and startup formations to identify early-stage investment opportunities before competitors.

02
B2B Lead Generation

Sales teams build targeted account lists based on recent funding events, employee count growth, and specific industry verticals.

03
Market Mapping

Strategy teams analyse entire sectors, tracking capital flows and competitor density to identify underserved market segments.

04
Competitor Intelligence

Product teams track competitor funding, acquisition activity, and leadership changes to anticipate market movements.

05
M&A Target Identification

Corporate development teams screen for potential acquisition targets based on funding stage, valuation estimates, and investor syndicates.

06
Investment Trend Analysis

Analysts aggregate thousands of funding rounds to model macro trends in capital deployment across regions and technologies.

Why DataFlirt

"Crunchbase holds the definitive graph of private market capital — but extracting it requires bypassing enterprise-grade bot protection and unravelling complex GraphQL states."

Most teams fail at scraping Crunchbase because they hit Datadome walls or struggle with the heavily obfuscated React DOM. DataFlirt manages the proxy rotation, TLS fingerprinting, and API interception required to extract clean company data at scale. You get structured JSON; we handle the infrastructure.

Technical Spec

Crunchbase scraper — technical capabilities

Everything supported by our crunchbase.com scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Datadome bypass
Automated solver integration and TLS fingerprint spoofing
Supported
GraphQL interception
Direct extraction of XHR payloads for structured data
Supported
Residential proxy rotation
ISP-grade residential IPs from US/UK/EU pools
Supported
Pagination handling
Deep traversal of investor portfolios and funding lists
Supported
Change detection (diffs)
Hash-based diff: only emit records with changed fields since last run
Supported
Webhook delivery
HTTP POST per record or batch for real-time CRM updates
Supported
Funding round extraction
Complete historical capture of all capital events
Supported
Investor portfolio mapping
Bidirectional linking between companies and investors
Supported
Pro-only contact emails
Extraction of personal email addresses hidden behind paywalls
Partial
Saved search exports
Execution of user-specific saved queries requiring authentication
Partial
Infrastructure

Infrastructure powering the Crunchbase pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus
Scrapy + Playwright Stack

Scrapy handles crawl orchestration, deduplication, and retry logic. Playwright handles JavaScript rendering, cookie sessions, and interaction flows. Combined via scrapy-playwright middleware.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies across US/UK/EU regions. Rotation happens per-request with sticky sessions where required. IP score monitoring prevents blacklisted pool contamination.
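
Per-request rotation with sticky sessions can be sketched as follows. The `ProxyRotator` class, its methods, and the proxy addresses are hypothetical; a real pool would also track IP scores and evict flagged addresses.

```python
import itertools


class ProxyRotator:
    """Round-robin proxy selection with optional sticky sessions."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        self._cycle = itertools.cycle(proxies)
        self._sticky = {}

    def get(self, session_id=None):
        """Return the next proxy, or the pinned proxy for a known `session_id`."""
        if session_id is None:
            return next(self._cycle)
        if session_id not in self._sticky:
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]

    def release(self, session_id):
        """Forget a sticky session so its next request rotates again."""
        self._sticky.pop(session_id, None)


# Hypothetical pool addresses.
rotator = ProxyRotator(["proxy-us-1:8000", "proxy-uk-1:8000", "proxy-eu-1:8000"])

a = rotator.get()            # rotates
b = rotator.get()            # rotates
c = rotator.get("login-1")   # pinned for this session
d = rotator.get("login-1")   # same proxy as c
e = rotator.get()            # rotation continues
```

Requests without a session id rotate on every call, while requests that share a session id keep the same exit IP until the session is released.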

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested — schema versioned per run
CSV
Flat file with typed columns — Excel/Sheets compatible
XLS
Legacy spreadsheet format for business users
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery — compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
API
REST endpoint to query your extracted datasets
BigQuery
Streamed directly into your dataset with schema auto-detect
Snowflake
Stage + COPY INTO workflow — incremental or full-replace
Postgres
Upsert into your existing schema with conflict resolution
// faq

Common questions.

About crunchbase.com scraping, legality, and pipeline operations.

Ask us directly →
Is scraping Crunchbase legal?

Scraping publicly available information is generally permissible under applicable law. DataFlirt targets only public, non-authenticated company, funding, and investor data. We do not extract personal contact information behind paywalls or circumvent authentication systems. Clients should review Terms of Service and consult legal counsel for specific use cases.

How do you handle Datadome protection?

We use residential ISP proxies, full Playwright browser sessions with realistic TLS fingerprints, and request timing modelled on human behaviour. We monitor for block rate spikes in real time and trigger pool rotation automatically.

Can you extract data hidden behind the Crunchbase Pro paywall?

No. We only extract data that is publicly accessible on the platform without requiring a paid user session. This includes core firmographics, funding rounds, and investor profiles, but excludes proprietary contact data.

How fresh is the data?

Pipelines can be configured to run daily or weekly. On a daily cadence, we track recently funded companies and updated profiles so that your warehouse reflects market changes within 24 hours of publication.

Do you extract historical funding rounds?

Yes. When we map a company profile, we extract the entire historical ledger of funding events available on the page, not just the most recent round.

What is the minimum viable engagement?

Our smallest packages start at a defined list of 5,000 domains or specific category filters with weekly delivery. Contact us with your target parameters for a scoped quote.

How do you handle duplicate company profiles?

We use Crunchbase UUIDs and domain normalisation to ensure records are deduplicated before delivery, providing a clean relational dataset.

Can I request a sample dataset before committing?

Absolutely. We provide a sample run of up to 500 company profiles as part of the pre-engagement scoping process to validate schema fit and data quality.

$ dataflirt scope --new-project --source=crunchbase.com ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off export of a specific industry or a continuous feed of new funding rounds — we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h