We extract company profiles, officer networks, filing histories, and jurisdictional metadata from OpenCorporates, and deliver them as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Company Profiles objects from opencorporates.com. All fields typed and schema-versioned.
{ "company_number": "08209948", "jurisdiction": "gb", "name": "DEEPMIND TECHNOLOGIES LIMITED", "status": "Active", "incorporation_date": "2012-09-11", "company_type": "Private limited Company", "registered_address": "5 New Street Square, London, EC4A 3TW" }
| # | company_number | jurisdiction | name | company_type | incorporation_date | status |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Officers & Directors objects from opencorporates.com. All fields typed and schema-versioned.
{ "officer_name": "HASSABIS, Demis", "officer_role": "director", "status": "Active", "appointed_date": "2012-09-11", "company_name": "DEEPMIND TECHNOLOGIES LIMITED", "company_number": "08209948", "jurisdiction": "gb" }
| # | officer_name | officer_role | status | appointed_date | resigned_date | company_name |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Filing History objects from opencorporates.com. All fields typed and schema-versioned.
{ "date": "2023-10-04", "title": "Full accounts made up to 31 December 2022", "filing_type": "Accounts", "document_url": "https://find-and-update.company-information.service.gov.uk/company/08209948/filing-history", "company_number": "08209948", "jurisdiction": "gb", "source_registry": "Companies House" }
| # | filing_id | date | title | description | document_url | filing_type |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Corporate Groupings objects from opencorporates.com. All fields typed and schema-versioned.
{ "parent_company": "ALPHABET INC.", "parent_jurisdiction": "us_de", "subsidiary_name": "DEEPMIND TECHNOLOGIES LIMITED", "subsidiary_number": "08209948", "subsidiary_jurisdiction": "gb", "control_type": "Shareholding", "source_date": "2023-12-31" }
| # | group_id | parent_company | parent_jurisdiction | subsidiary_name | subsidiary_number | subsidiary_jurisdiction |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Industry Codes objects from opencorporates.com. All fields typed and schema-versioned.
{ "company_number": "08209948", "jurisdiction": "gb", "code": "72200", "code_scheme": "UK SIC", "description": "Research and experimental development on social sciences and humanities", "is_primary": true, "start_date": "2012-09-11" }
| # | company_number | jurisdiction | code | code_scheme | description | is_primary |
|---|---|---|---|---|---|---|
Extracting registry data requires navigating 140+ jurisdictional formats. We standardise company profiles, officer networks, and filing histories into clean, queryable records.
Extract normalised data across 140+ jurisdictions from Delaware to the UK, standardising disparate registry formats.
Capture director and officer relationships, active and resigned statuses, and cross-company appointments for KYC checks.
Map parent-subsidiary hierarchies and branch relationships to build complete corporate ownership graphs.
Scrape document metadata, filing dates, and PDF URLs for annual returns, accounts, and incorporation certificates.
Extract SIC, NACE, and NAICS codes with primary/secondary flags to categorise corporate entities accurately.
Monitor company status changes (active, dissolved, in liquidation) with precise timestamps.
Extract registered office addresses, agent addresses, and mailing addresses across thousands of regional formats.
Capture LEI (Legal Entity Identifier), trademark numbers, and local tax IDs linked to the primary company record.
Run daily or weekly diffs to capture newly incorporated entities or recently appointed directors in target jurisdictions.
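As a rough illustration of the diff step above, the sketch below compares two snapshots of company records and reports what was added, removed, or changed. It assumes each record carries the `jurisdiction` and `company_number` fields shown in the sample profile; the function names `index_by_key` and `diff_snapshots` are hypothetical and not part of any published tooling.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]
Key = Tuple[str, str]  # (jurisdiction, company_number)


def index_by_key(records: List[Record]) -> Dict[Key, Record]:
    """Index company records by (jurisdiction, company_number)."""
    return {(r["jurisdiction"], r["company_number"]): r for r in records}


def diff_snapshots(previous: List[Record], current: List[Record]) -> Dict[str, List[Record]]:
    """Return records added, removed, or changed between two runs."""
    old, new = index_by_key(previous), index_by_key(current)
    return {
        # Keys present only in the current run: newly seen entities.
        "added": [new[k] for k in sorted(new.keys() - old.keys())],
        # Keys present only in the previous run: entities no longer listed.
        "removed": [old[k] for k in sorted(old.keys() - new.keys())],
        # Keys in both runs whose field values differ, e.g. a status change.
        "changed": [new[k] for k in sorted(new.keys() & old.keys()) if new[k] != old[k]],
    }
```

Keying on the jurisdiction and company number pair keeps the comparison stable even when a company is renamed between runs.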
Brief in. Clean data out.
Provide target jurisdictions, company status filters, or specific corporate entities. We design the extraction schema.
We configure Scrapy crawlers, proxy rotation, and rate-limiting protocols to navigate OpenCorporates' bot protection.
Schema validation, null-rate checks, and cross-jurisdiction normalisation before full pipeline launch.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
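The validation step in the workflow above (schema validation and null-rate checks) could look something like the following minimal sketch. The required-field list and the helper names `validate_record` and `null_rates` are assumptions chosen for illustration; a real schema would be agreed during scoping.

```python
from datetime import date
from typing import Any, Dict, List

# Assumed required fields and their expected Python types (illustrative only).
REQUIRED_FIELDS = {
    "company_number": str,
    "jurisdiction": str,
    "name": str,
    "status": str,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema errors (empty if the record is valid)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing {field}")
        elif not isinstance(value, expected):
            errors.append(f"{field} should be {expected.__name__}")
    incorporated = record.get("incorporation_date")
    if incorporated is not None:
        try:
            date.fromisoformat(incorporated)  # expects YYYY-MM-DD
        except (TypeError, ValueError):
            errors.append("incorporation_date is not an ISO date")
    return errors


def null_rates(records: List[Dict[str, Any]], fields: List[str]) -> Dict[str, float]:
    """Share of records in which each field is missing, None, or an empty string."""
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in records if r.get(f) in (None, "")) / total
        for f in fields
    }
```

A pilot run might be held back whenever `null_rates` for a key field exceeds an agreed threshold; the threshold itself is a per-project decision.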
OpenCorporates standardises data but enforces strict rate limits and pagination caps. Our infrastructure is built around both constraints: throttled, distributed requests and partitioned search queries.
OpenCorporates employs strict rate limits and Cloudflare protection. We distribute requests across ISP-grade residential proxies with calibrated concurrency to maintain throughput without triggering blocks.
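To give a sense of what "calibrated concurrency" can mean in practice, here is a sketch of a Scrapy `settings.py` fragment. The setting names are standard Scrapy settings, but every value is an illustrative assumption rather than a production configuration, and proxy rotation is not shown.

```python
# settings.py (sketch): all values below are illustrative assumptions.

# Cap overall and per-domain parallelism so a single site is never flooded.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Base delay (seconds) between requests to the same domain.
DOWNLOAD_DELAY = 0.5

# Let Scrapy adapt the delay to observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Retry transient failures and "too many requests" responses a few times.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```

With AutoThrottle enabled, the crawler slows down automatically when response times rise, which is usually a safer starting point than a fixed request rate.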
The site uses a strict jurisdiction/company_number URL pattern. We generate target URLs algorithmically from known registry ranges, eliminating the need for expensive site-wide discovery crawls.
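A minimal sketch of URL generation from a registry number range follows. It assumes the `https://opencorporates.com/companies/<jurisdiction>/<company_number>` path layout and purely numeric, zero-padded company numbers; many registries use prefixes or other formats, so the padding width and the helper name `company_urls` are assumptions for illustration.

```python
from typing import Iterator

BASE_URL = "https://opencorporates.com/companies"


def company_urls(jurisdiction: str, start: int, stop: int, width: int = 8) -> Iterator[str]:
    """Yield company URLs for numeric registry numbers in [start, stop).

    Numbers are zero-padded to `width` digits (an assumption that fits
    some registries but not all).
    """
    for number in range(start, stop):
        yield f"{BASE_URL}/{jurisdiction}/{number:0{width}d}"


# Example: two consecutive candidate URLs in the "gb" jurisdiction.
for url in company_urls("gb", 8209948, 8209950):
    print(url)
```

Because the generator is lazy, a range covering millions of candidate numbers can be fed to a crawler queue without holding every URL in memory.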
Search results are capped at 100 pages. We bypass this by slicing search spaces using date ranges and status filters, ensuring complete extraction of large jurisdictions.
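The date-range slicing idea can be sketched as a recursive bisection: keep halving a date window until each slice returns fewer results than the cap allows. The `count_results` callable is a hypothetical stand-in for whatever query reports how many companies fall in a window, and `max_results` is an assumed per-query ceiling derived from the page cap.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

DateRange = Tuple[date, date]  # inclusive start and end


def slice_date_range(
    start: date,
    end: date,
    count_results: Callable[[date, date], int],
    max_results: int,
) -> List[DateRange]:
    """Split [start, end] into contiguous windows that each fit under max_results.

    A single-day window is returned as-is even if it exceeds the cap; such a
    day would need a further filter (for example company status) to split it.
    """
    if start == end or count_results(start, end) <= max_results:
        return [(start, end)]
    midpoint = start + (end - start) // 2
    left = slice_date_range(start, midpoint, count_results, max_results)
    right = slice_date_range(midpoint + timedelta(days=1), end, count_results, max_results)
    return left + right
```

Each resulting window can then be paginated independently, so busy incorporation periods are split finely while quiet periods stay as one query.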
Underlying data comes from 140+ different registries. We map disparate field names (e.g., 'Agent' vs 'Registered Agent') into a single unified schema for your data warehouse.
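As a simplified illustration of the field mapping described above, the sketch below folds registry-specific labels into one canonical name. Only the 'Agent' vs 'Registered Agent' pair comes from the text; the other aliases, the canonical names, and the helper `normalise_record` are assumptions.

```python
from typing import Any, Dict

# Lower-cased source label -> canonical field name (illustrative aliases).
FIELD_ALIASES = {
    "agent": "registered_agent",
    "registered agent": "registered_agent",
    "company no": "company_number",
    "company number": "company_number",
    "date of incorporation": "incorporation_date",
}


def normalise_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Rename source fields to canonical names; unknown labels become snake_case."""
    normalised = {}
    for label, value in raw.items():
        key = label.strip().lower()
        canonical = FIELD_ALIASES.get(key, key.replace(" ", "_"))
        # If two source labels map to the same canonical name, the later one wins.
        normalised[canonical] = value
    return normalised
```

In this sketch, `normalise_record({"Agent": "ACME AGENTS LTD"})` and `normalise_record({"Registered Agent": "ACME AGENTS LTD"})` both produce a `registered_agent` field, so downstream queries only need one column name.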
For continuous monitoring, we track the 'last updated' timestamps and run targeted crawls on recently modified records, reducing compute overhead and delivering clean diffs.
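One way to picture the timestamp-driven incremental crawl is a watermark: remember the latest 'last updated' value seen, and on the next run only process records newer than it. The `updated_at` field name, its ISO 8601 format, and both helper names are assumptions for this sketch.

```python
from datetime import datetime
from typing import Any, Dict, List


def select_modified(records: List[Dict[str, Any]], watermark: datetime) -> List[Dict[str, Any]]:
    """Keep only records whose updated_at is strictly newer than the watermark."""
    return [r for r in records if datetime.fromisoformat(r["updated_at"]) > watermark]


def next_watermark(records: List[Dict[str, Any]], watermark: datetime) -> datetime:
    """Advance the watermark to the newest updated_at seen (never moves backwards)."""
    seen = [datetime.fromisoformat(r["updated_at"]) for r in records]
    return max(seen + [watermark])
```

Persisting the watermark between runs (for example in the pipeline's state store) means each crawl only touches records changed since the previous one.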
Fintechs and banks ingest officer and entity data to automate customer onboarding and verify corporate structures.
B2B SaaS companies enrich their CRM records with accurate legal names, addresses, and industry codes.
Underwriters monitor filing histories and company statuses to detect early warning signs of insolvency.
Procurement teams audit supplier networks and identify hidden corporate linkages or sanctioned subsidiaries.
Media organisations map cross-border corporate ownership to trace asset flows and beneficial ownership.
Sales teams target newly incorporated companies in specific jurisdictions for early-stage B2B outreach.
"OpenCorporates aggregates the world's company registries into a single graph. Accessing it at scale requires infrastructure that handles 140 different jurisdictional quirks."
Extracting corporate data across borders is inherently messy. While OpenCorporates centralises the data, navigating their rate limits, pagination caps, and schema variations requires dedicated infrastructure. DataFlirt manages the proxies, normalisation logic, and change detection so your compliance and data teams receive clean, structured records without building crawlers.
Everything supported by our opencorporates.com scraper: rendered SPA elements, rate-limit handling, pagination slicing, and more, always limited to public, non-authenticated data.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Scrapy handles high-throughput orchestration and deduplication. Playwright is deployed selectively for complex JavaScript interactions or Cloudflare challenge bypasses.
We maintain pools of residential ISP proxies. Rotation happens per-request with sticky sessions where required, avoiding IP bans from aggressive rate limiters.
Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.
Data delivered to where your team already works — no new tooling required.
About opencorporates.com scraping, legality, and pipeline operations.
Scraping publicly available, factual corporate registry data is generally permissible. OpenCorporates itself aggregates public records. We only extract non-authenticated, public data. Clients must ensure their downstream use complies with local regulations like GDPR.
OpenCorporates restricts search results to a limited number of pages. We overcome this by programmatically slicing search queries by date ranges, jurisdictions, and company types to ensure full coverage of the dataset.
Yes. We can configure pipelines to poll specific company URLs or jurisdiction feeds at defined intervals, extracting new filings, status changes, or officer appointments as they occur.
While OpenCorporates standardises much of the data, edge cases exist. Our extraction schema normalises fields across all 140+ jurisdictions, ensuring a consistent output format for your warehouse.
We extract the metadata and the source URLs for the filings. If the source registry provides public PDFs without authentication, we can configure secondary pipelines to download and store those documents in your S3 bucket.
Our smallest packages cover a defined list of entities (typically 10,000+) or specific jurisdictions, with weekly delivery. We price based on volume, frequency, and schema complexity.
Data freshness depends on your pipeline cadence and OpenCorporates' own sync schedule with the primary registries. We can configure daily or weekly runs to capture the latest available updates.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full jurisdiction export or continuous monitoring for KYC compliance — we scope, build, and operate the pipeline. Tell us what you need.