We extract company registrations, director appointments, PSC registers, and filing histories from Companies House. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Company Profile objects from companieshouse.gov.uk. All fields typed and schema-versioned.
"company_name": "TESCO PLC", "company_number": "00445790", "company_status": "active", "company_type": "plc", "incorporation_date": "1947-11-27", "registered_address": "Tesco House, Shire Park, Kestrel Way, Welwyn Garden City, AL7 1GA", "sic_codes": ["47110"], "next_accounts_due": "2025-08-31"
| # | company_name | company_number | company_status | company_type | incorporation_date | dissolution_date |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Officers objects from companieshouse.gov.uk. All fields typed and schema-versioned.
"officer_name": "MURPHY, Kenneth", "officer_role": "director", "appointed_on": "2020-10-01", "resigned_on": null, "date_of_birth": "1967-04", "nationality": "Irish", "country_of_residence": "England", "occupation": "Chief Executive Officer"
| # | officer_name | officer_role | appointed_on | resigned_on | date_of_birth | nationality |
|---|---|---|---|---|---|---|
Complete list of extractable fields for PSC Register objects from companieshouse.gov.uk. All fields typed and schema-versioned.
"psc_name": "Mr John Doe", "psc_kind": "individual-person-with-significant-control", "notified_on": "2016-04-06", "ceased_on": null, "date_of_birth": "1975-08", "nationality": "British", "nature_of_control": ["ownership-of-shares-75-to-100-percent"], "address": "10 Downing Street, London, SW1A 2AA"
| # | psc_name | psc_kind | notified_on | ceased_on | date_of_birth | nationality |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Filing History objects from companieshouse.gov.uk. All fields typed and schema-versioned.
"transaction_id": "MzM5MzgxNTk2NGFkaXF6a2N4", "date": "2023-11-14", "category": "accounts", "description": "Full accounts made up to 25 February 2023", "paper_filed": false, "type": "AA", "pages": 184, "document_url": "https://find-and-update.company-information.service.gov.uk/company/00445790/filing-history/MzM5MzgxNTk2NGFkaXF6a2N4/document"
| # | transaction_id | date | category | description | paper_filed | type |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Charges objects from companieshouse.gov.uk. All fields typed and schema-versioned.
"charge_code": "004457900001", "created_on": "2015-06-12", "delivered_on": "2015-06-15", "status": "outstanding", "persons_entitled": ["NatWest Markets Plc"], "instrument_description": "Debenture dated 12 June 2015", "short_particulars": "Fixed and floating charges over the undertaking and all property and assets", "property_acquired": false
| # | charge_code | created_on | delivered_on | status | persons_entitled | instrument_description |
|---|---|---|---|---|---|---|
Our pipeline extracts the full entity graph from Companies House, navigating pagination limits, parsing unstructured PDF filings, and resolving cross-directorships across millions of records.
Extract core entity data including registered address, SIC codes, incorporation dates, and current operational status.
Monitor board changes, capturing appointment dates, resignations, roles, nationalities, and correspondence addresses.
Extract the Persons with Significant Control register to map ultimate beneficial owners and precise control thresholds.
Index every historical filing event, capturing document types, transaction IDs, page counts, and direct download URLs.
Parse embedded tables within scanned PDF accounts to extract structured balance sheet and P&L metrics.
Track secured lending by extracting charge codes, creation dates, entitled persons, and short particulars of assets.
Capture administration, liquidation, and strike-off events the moment they hit the public register.
Link individuals across multiple corporate entities using partial DOBs and name matching heuristics.
Maintain a hash index of 5M+ entities to push only delta updates, minimising downstream processing costs.
Brief in. Clean data out.
Provide a list of company numbers, SIC codes, or request a full register sync. We map the required data points.
We configure extraction logic, PDF parsing queues, and API rate-limit circumvention strategies.
We run schema validation, null-rate checks, and OCR accuracy tests on historical filing samples.
JSON, CSV, or Parquet files pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on schedule.
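To make the null-rate checks in the validation step more concrete, here is a minimal sketch of what such a check could look like. The function names, the choice of null markers (including the literal string "None"), and the per-field thresholds are all assumptions made for illustration; they do not describe the production checks.

```python
# Values treated as "missing". Including the string "None" is an assumption:
# it guards against nulls that were stringified somewhere upstream.
NULL_MARKERS = (None, "", "None")


def null_rates(records, fields):
    """Return the fraction of records in which each field is missing."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in NULL_MARKERS)
        rates[field] = missing / total if total else 0.0
    return rates


def failing_fields(rates, thresholds):
    """List fields whose null rate exceeds the allowed threshold.

    Fields without an explicit threshold are allowed no nulls at all.
    """
    return sorted(
        field for field, rate in rates.items() if rate > thresholds.get(field, 0.0)
    )
```

A field such as `resigned_on` is legitimately empty for serving officers, so it would typically get a generous threshold, while an identifier such as `company_number` would get none.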
Companies House provides an API, but rate limits, missing data, and unstructured PDFs make bulk extraction difficult. Here is how we engineer around these constraints.
Millions of UK companies file their accounts as unstructured PDFs or scanned images. We pipe these documents through AWS Textract and custom heuristics to extract structured balance sheet and P&L metrics, turning image blobs into queryable JSON.
The public Companies House API imposes strict rate limits that make full-register syncs impossible. We use distributed crawlers across residential IP pools to scrape the web frontend directly, achieving throughput far beyond API constraints.
Large PLCs often have thousands of historical filings spanning decades. Our crawlers handle deep pagination state efficiently, ensuring complete historical records without memory bloat or connection timeouts.
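One way to keep memory flat while walking a long filing history is to stream pages through a generator rather than collecting them in a list. The sketch below assumes an offset-based interface with a start index and page size, and a response carrying `items` and `total_count`; `fetch_page` is a hypothetical callable you would supply, and none of this is meant as a description of the actual crawler.

```python
from typing import Callable, Iterator


def iter_filings(
    fetch_page: Callable[[int, int], dict], page_size: int = 100
) -> Iterator[dict]:
    """Yield filing records one at a time, requesting pages only as needed.

    `fetch_page(start_index, page_size)` is a hypothetical function that
    returns a dict with an "items" list and, optionally, a "total_count".
    """
    start_index = 0
    while True:
        page = fetch_page(start_index, page_size)
        items = page.get("items", [])
        if not items:
            return  # an empty page means the history is exhausted
        yield from items
        start_index += len(items)
        total = page.get("total_count")
        if total is not None and start_index >= total:
            return
```

Because only one page is held at a time, a company with thousands of filings costs no more memory than one with a handful.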
Officer names are frequently misspelled or formatted inconsistently across different company appointments. We apply deterministic matching rules using name permutations and partial dates of birth to accurately map cross-directorships.
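As an illustration of a deterministic matching rule, the sketch below builds a key from the surname, the first forename, and the partial date of birth, so that "MURPHY, Kenneth" and "Mr Kenneth Andrew Murphy" land in the same bucket. The title list, the "last word is the surname" fallback, and the function names are assumptions for this example; real rules would need to cover more permutations.

```python
import re
import unicodedata
from collections import defaultdict

# Honorifics to ignore; an illustrative, incomplete list.
_TITLES = {"mr", "mrs", "ms", "miss", "dr", "sir"}


def _words(text):
    """Lowercase, strip accents and punctuation, drop titles."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"['\u2019-]", "", text.lower())  # O'Brien -> obrien
    words = re.sub(r"[^a-z\s]", " ", text).split()
    return [word for word in words if word not in _TITLES]


def match_key(officer_name, date_of_birth):
    """Build a (surname, first forename, partial DOB) key for an officer."""
    if "," in officer_name:
        # Register style: "SURNAME, Forenames"
        surname_part, forename_part = officer_name.split(",", 1)
        surname = "".join(_words(surname_part))
        forenames = _words(forename_part)
    else:
        # Assumption: in free-form names the last word is the surname.
        words = _words(officer_name)
        surname = words[-1] if words else ""
        forenames = words[:-1]
    first = forenames[0] if forenames else ""
    return (surname, first, date_of_birth)


def link_cross_directorships(appointments):
    """Group company numbers by match key, keeping keys seen at 2+ companies."""
    companies = defaultdict(set)
    for appointment in appointments:
        key = match_key(appointment["officer_name"], appointment["date_of_birth"])
        companies[key].add(appointment["company_number"])
    return {key: sorted(nums) for key, nums in companies.items() if len(nums) > 1}
```

Exact keys like this trade recall for precision: they will not catch misspellings, which would need an additional fuzzy step on top.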
Polling the entire UK corporate register daily is inefficient. We maintain state hashes for every entity and only emit records when a new filing, status change, or officer update is detected, providing a clean changelog.
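The state-hash idea can be sketched in a few lines: serialise each record in a canonical form, hash it, and emit the record only when the hash differs from the one stored last time. The function names and the in-memory `state` dict are assumptions for illustration; a real deployment would keep the hashes in a database.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Return a SHA-256 digest that does not depend on key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshot(state: dict, snapshot: dict) -> list:
    """Return records that are new or changed since the last run.

    `state` maps company_number -> last seen hash and is updated in place.
    `snapshot` maps company_number -> the freshly extracted record.
    """
    changes = []
    for company_number, record in snapshot.items():
        digest = record_hash(record)
        if state.get(company_number) != digest:
            changes.append({"company_number": company_number, "record": record})
            state[company_number] = digest
    return changes
```

Running `diff_snapshot` twice on an unchanged snapshot returns an empty list the second time, which is what keeps downstream loads small.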
Fintechs and banks automate customer onboarding by verifying corporate structures, directors, and ultimate beneficial owners against the public register.
Sales teams track new incorporations by SIC code and region to target newly funded or expanding businesses with relevant services.
Lenders ingest mortgage charges, insolvency events, and historical accounts to build automated credit scoring models for SME lending.
Consultancies aggregate financial metrics from extracted PDF accounts to benchmark industry performance and identify acquisition targets.
Procurement teams monitor supplier health by tracking late filings, director resignations, and new floating charges.
Enterprise data teams use Companies House as the golden source to cleanse and enrich stale CRM records with accurate registered addresses and legal names.
"UK Companies House holds the definitive graph of British corporate ownership, but turning millions of unstructured PDFs and fragmented records into queryable data requires serious infrastructure."
Most teams underestimate the compute required to parse scanned financial accounts or map cross-directorships at scale. DataFlirt absorbs the complexity of OCR, entity resolution, and continuous diffing so your engineers can focus on risk modelling and analysis.
Everything supported by our companieshouse.gov.uk scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Scrapy manages concurrency and deduplication for millions of entities, while Playwright handles complex dynamic interactions and cookie sessions when necessary.
We route unstructured PDF filings through a scalable AWS Textract and Tesseract cluster, applying custom heuristics to normalise tabular financial data.
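To give a flavour of what a normalisation heuristic for OCR'd financial tables might involve, here is a small sketch that converts a table cell such as "(1,234)" into a signed integer. The conventions it encodes (brackets mean negative, a lone dash means nil, decimals are rejected) are assumptions about typical UK accounts, and `parse_amount` is a hypothetical helper rather than part of any OCR library.

```python
import re
from typing import Optional

# Hyphen, en dash, em dash: all seen as "nil" or minus signs in OCR output.
_DASHES = {"-", "\u2013", "\u2014"}


def parse_amount(cell: str) -> Optional[int]:
    """Convert an OCR'd table cell to a signed integer, or None if unparseable."""
    text = cell.strip()
    if not text:
        return None
    if text in _DASHES:
        return 0  # assumption: a lone dash denotes a nil balance
    negative = False
    if text.startswith("(") and text.endswith(")"):
        negative = True  # accounting convention: brackets mean negative
        text = text[1:-1]
    elif text[0] in _DASHES:
        negative = True
        text = text[1:]
    text = re.sub(r"[£,\s]", "", text)  # drop currency sign, separators, spaces
    if not re.fullmatch(r"\d+", text):
        return None  # leave anything unexpected for manual review
    value = int(text)
    return -value if negative else value
```

Returning `None` instead of guessing keeps doubtful cells visible to downstream null-rate checks.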
Pipelines run on Kubernetes clusters. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.
Data delivered to where your team already works — no new tooling required.
About companieshouse.gov.uk scraping, legality, and pipeline operations.
Yes. Companies House data is a public register. Information is published under the Open Government Licence (OGL). DataFlirt extracts only publicly available corporate and officer data. We do not attempt to bypass security controls or extract protected residential addresses.
The official API imposes strict rate limits (600 requests per 5 minutes) which makes monitoring 5 million active companies impossible. Scraping the web interface allows us to achieve the throughput necessary for daily full-register diffs.
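The arithmetic behind that claim is easy to check. The sketch below assumes a single API key and just one request per company per pass, which is optimistic since officers, PSCs, filings, and charges are separate lookups; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(limit: int = 600, window_seconds: int = 300) -> int:
    """Maximum requests in 24 hours under a fixed-window rate limit."""
    return limit * SECONDS_PER_DAY // window_seconds


def days_for_full_pass(entities: int, requests_per_entity: int = 1) -> float:
    """Days needed to touch every entity once at the maximum request rate."""
    return entities * requests_per_entity / requests_per_day()


print(requests_per_day())                        # 172800
print(round(days_for_full_pass(5_000_000), 1))   # 28.9
```

At 600 requests per 5 minutes the ceiling is 172,800 requests per day, so a single pass over 5 million companies takes about 29 days even before any per-company detail is fetched.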
Yes. We use a combination of AWS Textract and custom parsing heuristics to extract structured balance sheet and P&L metrics from scanned PDF filings, converting them into queryable JSON fields.
We run continuous polling on target entity lists. For full-register monitoring, we push delta updates daily, ensuring your warehouse reflects the latest appointments, resignations, and filings within 24 hours of publication.
Yes. We can extract the complete filing history for any active or recently dissolved company, including legacy documents dating back to incorporation.
We track dissolution events and update company status accordingly. Note that long-dissolved companies drop off the public register: Companies House applies a 20-year retention window, so older records, such as those for companies dissolved before 2010, may no longer be available.
Yes. We extract partial dates of birth and full names to help you resolve entities and map complex corporate structures and beneficial ownership graphs.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a targeted list of 10,000 companies or a daily sync of the entire 5M+ UK corporate register, we build and operate the infrastructure. Tell us what you need.