
Corporate graph data,
delivered at scale.

We extract company profiles, officer networks, filing histories, and jurisdictional metadata from OpenCorporates. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery.

Companies extracted: 4.1M/day
Officer records: 18.3M/week
Filing updates: 645K/run
Active pipelines: 84
Uptime: 99.98%

Data Dictionary

Every field we extract from opencorporates.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Company Profiles objects from opencorporates.com. All fields typed and schema-versioned.

Fields: company_number, jurisdiction, name, company_type, incorporation_date, status, registered_address, industry_codes, previous_names, agent_name, agent_address, page_url

Sample record (company_profiles, 200 OK):

{
  "company_number": "08209948",
  "jurisdiction": "gb",
  "name": "DEEPMIND TECHNOLOGIES LIMITED",
  "status": "Active",
  "incorporation_date": "2012-09-11",
  "company_type": "Private limited Company",
  "registered_address": "5 New Street Square, London, EC4A 3TW"
}

Complete list of extractable fields for Officers & Directors objects from opencorporates.com. All fields typed and schema-versioned.

Fields: officer_name, officer_role, status, appointed_date, resigned_date, company_name, company_number, jurisdiction, address, occupation, date_of_birth

Sample record (officers_and_directors, 200 OK):

{
  "officer_name": "HASSABIS, Demis",
  "officer_role": "director",
  "status": "Active",
  "appointed_date": "2012-09-11",
  "company_name": "DEEPMIND TECHNOLOGIES LIMITED",
  "company_number": "08209948",
  "jurisdiction": "gb"
}

Complete list of extractable fields for Filing History objects from opencorporates.com. All fields typed and schema-versioned.

Fields: filing_id, date, title, description, document_url, filing_type, company_number, jurisdiction, source_registry

Sample record (filing_history, 200 OK):

{
  "date": "2023-10-04",
  "title": "Full accounts made up to 31 December 2022",
  "filing_type": "Accounts",
  "document_url": "https://find-and-update.company-information.service.gov.uk/company/08209948/filing-history",
  "company_number": "08209948",
  "jurisdiction": "gb",
  "source_registry": "Companies House"
}

Complete list of extractable fields for Corporate Groupings objects from opencorporates.com. All fields typed and schema-versioned.

Fields: group_id, parent_company, parent_jurisdiction, subsidiary_name, subsidiary_number, subsidiary_jurisdiction, control_type, control_percentage, source_date

Sample record (corporate_groupings, 200 OK):

{
  "parent_company": "ALPHABET INC.",
  "parent_jurisdiction": "us_de",
  "subsidiary_name": "DEEPMIND TECHNOLOGIES LIMITED",
  "subsidiary_number": "08209948",
  "subsidiary_jurisdiction": "gb",
  "control_type": "Shareholding",
  "source_date": "2023-12-31"
}

Complete list of extractable fields for Industry Codes objects from opencorporates.com. All fields typed and schema-versioned.

Fields: company_number, jurisdiction, code, code_scheme, description, is_primary, start_date, end_date

Sample record (industry_codes, 200 OK):

{
  "company_number": "08209948",
  "jurisdiction": "gb",
  "code": "72200",
  "code_scheme": "UK SIC",
  "description": "Research and experimental development on social sciences and humanities",
  "is_primary": true,
  "start_date": "2012-09-11"
}

Capabilities

Structured corporate intelligence without the overhead

Extracting registry data requires navigating 140+ jurisdictional formats. We standardise company profiles, officer networks, and filing histories into clean, queryable records.

Global Registry Coverage

Extract normalised data across 140+ jurisdictions from Delaware to the UK, standardising disparate registry formats.

Officer Network Mapping

Capture director and officer relationships, active and resigned statuses, and cross-company appointments for KYC checks.

Corporate Group Extraction

Map parent-subsidiary hierarchies and branch relationships to build complete corporate ownership graphs.

Filing History Retrieval

Scrape document metadata, filing dates, and PDF URLs for annual returns, accounts, and incorporation certificates.

Industry Code Classification

Extract SIC, NACE, and NAICS codes with primary/secondary flags to categorise corporate entities accurately.

Status & Lifecycle Tracking

Monitor company status changes (active, dissolved, in liquidation) with precise timestamps.

Address Normalisation

Extract registered office addresses, agent addresses, and mailing addresses across thousands of regional formats.

Identifier Cross-Referencing

Capture LEI (Legal Entity Identifier), trademark numbers, and local tax IDs linked to the primary company record.

Scheduled Change Detection

Run daily or weekly diffs to capture newly incorporated entities or recently appointed directors in target jurisdictions.
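
One common way to compute diffs between two scheduled runs is to hash each record and compare snapshots. The sketch below is illustrative only: the function names and the choice of (jurisdiction, company_number) as the snapshot key are assumptions, not a description of the production pipeline.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable SHA-256 over a record's canonical JSON form (keys sorted)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by (jurisdiction, company_number).

    The key choice is an assumption made for this example.
    """
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    changed = sorted(
        k for k in current
        if k in previous and record_hash(current[k]) != record_hash(previous[k])
    )
    return {"added": added, "removed": removed, "changed": changed}
```

Storing only the hash per key from the previous run, rather than the full record, would keep the comparison state small; the version above keeps full records for readability.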

// engagement pipeline

From target entities to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide target jurisdictions, company status filters, or specific corporate entities. We design the extraction schema.

Pipeline Build
Days 2–4

We configure Scrapy crawlers, proxy rotation, and rate-limiting protocols to navigate OpenCorporates' bot protection.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and cross-jurisdiction normalisation before full pipeline launch.

Delivery
Ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
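
As a rough illustration of the null-rate checks in the validation step, the sketch below computes the share of missing values per field and flags fields above a threshold. The function names and the threshold value are hypothetical; real thresholds would be agreed per field during scoping.

```python
def null_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records where each field is absent, None, or an empty string."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        rates[field] = missing / total
    return rates


def fields_over_threshold(rates: dict[str, float], threshold: float) -> list[str]:
    """Names of fields whose null rate exceeds the (hypothetical) threshold."""
    return sorted(field for field, rate in rates.items() if rate > threshold)
```

A batch could then be held back from delivery whenever `fields_over_threshold` returns a non-empty list.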

Under the hood

Navigating global registry aggregators

OpenCorporates standardises data but enforces strict rate limits and pagination caps. We deploy robust infrastructure to bypass these constraints.

pipeline-monitor · opencorporates.com · live (active)
Identity rotation: TLS fingerprint randomised; user-agent rotated; IP pool residential; challenges blocked 0
Page coverage: 48,291 pages queued (running)
Pipeline health: 99.9% uptime; 142ms p99 latency; 0.3% null rate; 2 alerts

Rate limiting & bot protection
Distributed proxy infrastructure

OpenCorporates employs strict rate limits and Cloudflare protection. We distribute requests across ISP-grade residential proxies with calibrated concurrency to maintain throughput without triggering blocks.
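
"Calibrated concurrency" usually comes down to a request budget per worker. A token bucket is one standard way to enforce such a budget; the class below is a minimal sketch, and its name, parameters, and rates are assumptions rather than the values used in production.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A worker would call `try_acquire()` before each request and back off briefly whenever it returns False.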

URL structure standardisation
Predictable jurisdiction routing

The site uses a strict jurisdiction/company_number URL pattern. We generate target URLs algorithmically from known registry ranges, eliminating the need for expensive site-wide discovery crawls.
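
Generating targets from a number range can be sketched in a few lines. The base path, the zero-padding width, and the function name below are assumptions for illustration; padding rules differ between registries and would need to be set per jurisdiction.

```python
from typing import Iterator

# Assumed base path for company pages; adjust if the site layout differs.
BASE_URL = "https://opencorporates.com/companies"


def company_urls(jurisdiction: str, start: int, stop: int, width: int = 8) -> Iterator[str]:
    """Yield one URL per company number in [start, stop), zero-padded to `width` digits."""
    for number in range(start, stop):
        yield f"{BASE_URL}/{jurisdiction}/{number:0{width}d}"
```

For example, `list(company_urls("gb", 8209948, 8209950))` yields the URLs for company numbers 08209948 and 08209949.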

Pagination depth limits
Search filter slicing

Search results are capped at 100 pages. We bypass this by slicing search spaces using date ranges and status filters, ensuring complete extraction of large jurisdictions.
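
Date-range slicing can be expressed as a recursive bisection: keep halving a range until each slice returns fewer results than the cap allows. In this sketch, `count_results` is a hypothetical callback that reports how many results a date range would return, and `max_results` stands in for whatever the page cap works out to.

```python
from datetime import date, timedelta
from typing import Callable


def slice_date_range(
    start: date,
    end: date,
    count_results: Callable[[date, date], int],
    max_results: int,
) -> list[tuple[date, date]]:
    """Split [start, end] into sub-ranges whose result counts fit under max_results."""
    # A single day cannot be split further by date; it would need another
    # filter (for example status or company type) if it still exceeds the cap.
    if start >= end or count_results(start, end) <= max_results:
        return [(start, end)]
    mid = start + timedelta(days=(end - start).days // 2)
    left = slice_date_range(start, mid, count_results, max_results)
    right = slice_date_range(mid + timedelta(days=1), end, count_results, max_results)
    return left + right
```

Each returned slice can then be crawled as an independent search query.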

Inconsistent registry schemas
Normalisation pipelines

Underlying data comes from 140+ different registries. We map disparate field names (e.g., 'Agent' vs 'Registered Agent') into a single unified schema for your data warehouse.
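
A field-alias table is one simple way to express this kind of mapping. The alias entries below are made-up examples built around the 'Agent' vs 'Registered Agent' case; a real mapping would be maintained per registry.

```python
# Hypothetical alias table: lower-cased source label -> unified field name.
FIELD_ALIASES = {
    "agent": "agent_name",
    "registered agent": "agent_name",
    "company no": "company_number",
    "company number": "company_number",
}


def normalise_record(raw: dict) -> tuple[dict, list[str]]:
    """Map source labels onto unified field names.

    Returns the normalised record plus the labels that had no mapping,
    so unknown fields are surfaced for review instead of silently dropped.
    """
    normalised = {}
    unmapped = []
    for label, value in raw.items():
        canonical = FIELD_ALIASES.get(label.strip().lower())
        if canonical is None:
            unmapped.append(label)
        else:
            normalised[canonical] = value
    return normalised, unmapped
```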

Change detection
Delta extraction logic

For continuous monitoring, we track the 'last updated' timestamps and run targeted crawls on recently modified records, reducing compute overhead and delivering clean diffs.
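
The timestamp-based selection can be sketched as a checkpoint filter. The `last_updated` field name and the ISO 8601 format with an explicit UTC offset are assumptions for this example.

```python
from datetime import datetime, timezone


def modified_since(records: list[dict], checkpoint: datetime) -> list[dict]:
    """Keep records whose 'last_updated' timestamp is later than the checkpoint."""
    return [
        record for record in records
        if datetime.fromisoformat(record["last_updated"]) > checkpoint
    ]


def next_checkpoint(records: list[dict], checkpoint: datetime) -> datetime:
    """Advance the checkpoint to the newest timestamp seen, never moving backwards."""
    seen = [datetime.fromisoformat(record["last_updated"]) for record in records]
    return max(seen + [checkpoint])
```

Only the records returned by `modified_since` need a targeted re-crawl, and the value from `next_checkpoint` is persisted for the following run.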

Applications

Who uses OpenCorporates data — and how

Teams across industries use opencorporates.com data to build competitive products and smarter operations.

01
KYC & AML Compliance

Fintechs and banks ingest officer and entity data to automate customer onboarding and verify corporate structures.

02
Master Data Management

B2B SaaS companies enrich their CRM records with accurate legal names, addresses, and industry codes.

03
Risk & Credit Scoring

Underwriters monitor filing histories and company statuses to detect early warning signs of insolvency.

04
Supply Chain Verification

Procurement teams audit supplier networks and identify hidden corporate linkages or sanctioned subsidiaries.

05
Investigative Journalism

Media organisations map cross-border corporate ownership to trace asset flows and beneficial ownership.

06
Lead Generation

Sales teams target newly incorporated companies in specific jurisdictions for early-stage B2B outreach.

Why DataFlirt

"OpenCorporates aggregates the world's company registries into a single graph. Accessing it at scale requires infrastructure that handles 140 different jurisdictional quirks."

Extracting corporate data across borders is inherently messy. While OpenCorporates centralises the data, navigating their rate limits, pagination caps, and schema variations requires dedicated infrastructure. DataFlirt manages the proxies, normalisation logic, and change detection so your compliance and data teams receive clean, structured records without building crawlers.

Technical Spec

OpenCorporates scraper — technical capabilities

Everything supported by our opencorporates.com scraper, from company profile extraction to change detection and webhook delivery.

Capability | Description | Status
Company profile extraction | Full extraction of status, dates, addresses, and identifiers | Supported
Officer networks | Director and agent details with appointment dates | Supported
Filing history metadata | Dates, titles, and document URLs for corporate filings | Supported
Corporate groupings | Parent-subsidiary relationships and branch linkages | Supported
Industry code mapping | SIC, NACE, and NAICS codes with descriptions | Supported
Search pagination bypass | Filter slicing to extract >10,000 results per query | Supported
Change detection (diffs) | Hash-based diffs for daily or weekly monitoring | Supported
Webhook delivery | HTTP POST per record for real-time KYC workflows | Supported
Bulk data dumps | Direct access to OpenCorporates' proprietary enterprise database dumps | Partial
API-only premium fields | Fields restricted exclusively to their paid Enterprise API tier | Partial

Infrastructure

Infrastructure powering the corporate data pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, Snowflake, BigQuery

Scrapy + Playwright Stack

Scrapy handles high-throughput orchestration and deduplication. Playwright is deployed selectively for complex JavaScript interactions or Cloudflare challenge bypasses.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies. Rotation happens per-request with sticky sessions where required, avoiding IP bans from aggressive rate limiters.

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state is stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested, schema versioned per run
CSV: Flat file with typed columns, Excel/Sheets compatible
Parquet: Columnar format for BigQuery, Snowflake, Athena
S3: Direct bucket delivery, compatible with any data lake
Webhook: HTTP POST per record for real-time downstream processing
API: RESTful endpoints to query extracted datasets on demand
BigQuery: Streamed directly into your dataset with schema auto-detect
PostgreSQL: Upsert into your existing schema with conflict resolution
XLS: Standard Excel format for business analysts and operations teams
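
To make the newline-delimited JSON option concrete, the sketch below serialises records one per line and stamps each with a schema version. The `schema_version` key and the function name are hypothetical and may not match the actual delivery format.

```python
import json


def to_ndjson(records: list[dict], schema_version: str) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    lines = []
    for record in records:
        # The 'schema_version' key is an assumed convention for this example.
        row = {"schema_version": schema_version, **record}
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines) + ("\n" if lines else "")
```

Because every line is a complete JSON object, a consumer can stream the file line by line without loading it whole.
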
// faq

Common questions.

About opencorporates.com scraping, legality, and pipeline operations.

Ask us directly →
Is scraping OpenCorporates legal?

Scraping publicly available, factual corporate registry data is generally permissible. OpenCorporates itself aggregates public records. We only extract non-authenticated, public data. Clients must ensure their downstream use complies with local regulations like GDPR.

How do you bypass search pagination limits?

OpenCorporates restricts search results to a limited number of pages. We overcome this by programmatically slicing search queries by date ranges, jurisdictions, and company types to ensure full coverage of the dataset.

Can you monitor specific companies for changes?

Yes. We can configure pipelines to poll specific company URLs or jurisdiction feeds at defined intervals, extracting new filings, status changes, or officer appointments as they occur.

How do you handle different registry formats?

While OpenCorporates standardises much of the data, edge cases exist. Our extraction schema normalises fields across all 140+ jurisdictions, ensuring a consistent output format for your warehouse.

Do you extract actual PDF filings?

We extract the metadata and the source URLs for the filings. If the source registry provides public PDFs without authentication, we can configure secondary pipelines to download and store those documents in your S3 bucket.

What is the minimum viable engagement?

Our smallest packages start at a defined list of entities (typically 10,000+) or specific jurisdictions with weekly delivery. We price based on volume, frequency, and schema complexity.

How fresh is the data?

Data freshness depends on your pipeline cadence and OpenCorporates' own sync schedule with the primary registries. We can configure daily or weekly runs to capture the latest available updates.

$ dataflirt scope --new-project --source=opencorporates.com ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full jurisdiction export or continuous monitoring for KYC compliance — we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h