
UK corporate data,
at warehouse scale.

We extract company registrations, director appointments, PSC registers, and filing histories from Companies House. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Companies tracked: 5.2M
Officer updates: 412K /day
PDFs parsed: 89K /run
Active pipelines: 82
Uptime: 99.98%

Data Dictionary

Every field we extract from companieshouse.gov.uk

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Company Profile objects from companieshouse.gov.uk. All fields typed and schema-versioned.

company_name, company_number, company_status, company_type, incorporation_date, dissolution_date, registered_address, sic_codes, next_accounts_due

company_profile
● 200 OK
"company_name": "TESCO PLC",
"company_number": "00445790",
"company_status": "active",
"company_type": "plc",
"incorporation_date": "1947-11-27",
"registered_address": "Tesco House, Shire Park, Kestrel Way, Welwyn Garden City, AL7 1GA",
"sic_codes": ["47110"],
"next_accounts_due": "2025-08-31"

Complete list of extractable fields for Officers objects from companieshouse.gov.uk. All fields typed and schema-versioned.

officer_name, officer_role, appointed_on, resigned_on, date_of_birth, nationality, country_of_residence, occupation, correspondence_address

officers
● 200 OK
"officer_name": "MURPHY, Kenneth",
"officer_role": "director",
"appointed_on": "2020-10-01",
"resigned_on": null,
"date_of_birth": "1967-04",
"nationality": "Irish",
"country_of_residence": "England",
"occupation": "Chief Executive Officer"

Complete list of extractable fields for PSC Register objects from companieshouse.gov.uk. All fields typed and schema-versioned.

psc_name, psc_kind, notified_on, ceased_on, date_of_birth, nationality, nature_of_control, address

psc_register
● 200 OK
"psc_name": "Mr John Doe",
"psc_kind": "individual-person-with-significant-control",
"notified_on": "2016-04-06",
"ceased_on": null,
"date_of_birth": "1975-08",
"nationality": "British",
"nature_of_control": ["ownership-of-shares-75-to-100-percent"],
"address": "10 Downing Street, London, SW1A 2AA"

Complete list of extractable fields for Filing History objects from companieshouse.gov.uk. All fields typed and schema-versioned.

transaction_id, date, category, description, paper_filed, type, pages, document_url, barcode

filing_history
● 200 OK
"transaction_id": "MzM5MzgxNTk2NGFkaXF6a2N4",
"date": "2023-11-14",
"category": "accounts",
"description": "Full accounts made up to 25 February 2023",
"paper_filed": false,
"type": "AA",
"pages": 184,
"document_url": "https://find-and-update.company-information.service.gov.uk/company/00445790/filing-history/MzM5MzgxNTk2NGFkaXF6a2N4/document"

Complete list of extractable fields for Charges objects from companieshouse.gov.uk. All fields typed and schema-versioned.

charge_code, created_on, delivered_on, status, persons_entitled, instrument_description, short_particulars, property_acquired

charges
● 200 OK
"charge_code": "004457900001",
"created_on": "2015-06-12",
"delivered_on": "2015-06-15",
"status": "outstanding",
"persons_entitled": ["NatWest Markets Plc"],
"instrument_description": "Debenture dated 12 June 2015",
"short_particulars": "Fixed and floating charges over the undertaking and all property and assets",
"property_acquired": false

Capabilities

Complete UK corporate intelligence

Our pipeline extracts the full entity graph from Companies House, navigating pagination limits, parsing unstructured PDF filings, and resolving cross-directorships across millions of records.

Company Demographics

Extract core entity data including registered address, SIC codes, incorporation dates, and current operational status.

Officer & Director Tracking

Monitor board changes, capturing appointment dates, resignations, roles, nationalities, and correspondence addresses.

PSC & Beneficial Ownership

Extract the Persons with Significant Control register to map ultimate beneficial owners and precise control thresholds.

Filing History Metadata

Index every historical filing event, capturing document types, transaction IDs, page counts, and direct download URLs.

Financial PDF OCR

Parse embedded tables within scanned PDF accounts to extract structured balance sheet and P&L metrics.

Mortgage & Charges

Track secured lending by extracting charge codes, creation dates, entitled persons, and short particulars of assets.

Insolvency Monitoring

Capture administration, liquidation, and strike-off events the moment they hit the public register.

Cross-Directorship Resolution

Link individuals across multiple corporate entities using partial DOBs and name matching heuristics.

Daily Diff Generation

Maintain a hash index of 5M+ entities to push only delta updates, minimising downstream processing costs.

// engagement pipeline

From company numbers to warehouse records

Brief in. Clean data out.

Define Scope
Day 0

Provide a list of company numbers, SIC codes, or request a full register sync. We map the required data points.

Pipeline Build
Days 2–4

We configure extraction logic, PDF parsing queues, and API rate-limit circumvention strategies.

Validation & QA
Days 4–6

We run schema validation, null-rate checks, and OCR accuracy tests on historical filing samples.

Delivery
ongoing

JSON, CSV, or Parquet files pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on schedule.
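
The null-rate check mentioned under Validation & QA could look roughly like the sketch below. It is an illustration only, not DataFlirt's implementation: the function name `null_rates`, the set of values treated as missing, the 5% threshold, and the second sample record are all assumptions made for the example.

```python
from typing import Any

# Values treated as missing. Including the string "None" is an assumption,
# covering exports that stringify empty values.
MISSING = (None, "", "None")


def null_rates(records: list[dict[str, Any]], fields: list[str]) -> dict[str, float]:
    """Return the fraction of records in which each field is missing."""
    if not records:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in MISSING)
        rates[field] = missing / len(records)
    return rates


# Illustrative records; the second one is made up for the example.
sample = [
    {"officer_name": "MURPHY, Kenneth", "resigned_on": "None"},
    {"officer_name": "DOE, John", "resigned_on": "2021-03-31"},
]
rates = null_rates(sample, ["officer_name", "resigned_on"])

# Flag fields above a hypothetical 5% threshold for manual review.
flagged = [field for field, rate in rates.items() if rate > 0.05]
```

A field such as `resigned_on` is legitimately empty for serving officers, so in practice a threshold would probably be set per field rather than globally.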

Under the hood

Handling the complexities of government data

Companies House provides an API, but rate limits, missing data, and unstructured PDFs make bulk extraction difficult. Here is how we engineer around these constraints.

pipeline-monitor · companieshouse.gov.uk · live (active)

// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0

// pagination
Page coverage
48,291 pages queued (running)

// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

PDF Parsing
Extracting structured data from scanned accounts

Millions of UK companies file their accounts as unstructured PDFs or scanned images. We pipe these documents through AWS Textract and custom heuristics to extract structured balance sheet and P&L metrics, turning image blobs into queryable JSON.
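
As an illustration of the kind of heuristic that might run after OCR, the sketch below turns one line of recognised text into a labelled list of amounts. It assumes the OCR engine has already returned plain text lines; the names `parse_line` and `to_int` and the regular expressions are hypothetical, and real accounts need far more rules than this.

```python
import re
from typing import Optional

# One amount such as 1,234 or (567); parentheses mark a negative figure.
AMOUNT = re.compile(r"\(?-?\d[\d,]*\)?")
# A text label followed by nothing but amounts up to the end of the line.
LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z &'/-]*?)\s+(?P<amounts>[\d(,)\s-]+)$")


def to_int(token: str) -> int:
    """Convert '1,234' to 1234 and '(567)' to -567 (accounting negative)."""
    negative = token.startswith("(") and token.endswith(")")
    value = int(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_line(line: str) -> Optional[tuple[str, list[int]]]:
    """Return (field_name, amounts) for a 'label amount amount' line, else None."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    amounts = [to_int(token) for token in AMOUNT.findall(match["amounts"])]
    field = match["label"].strip().lower().replace(" ", "_")
    return field, amounts


parsed = parse_line("Total assets 1,234 (567)")
skipped = parse_line("Notes to the accounts")
```

Lines without trailing figures, such as headings, return `None` and can be dropped or routed to a separate review step.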

Rate Limits
Bypassing API quotas via web scraping

The public Companies House API imposes strict rate limits that make full-register syncs impossible. We use distributed crawlers across residential IP pools to scrape the web frontend directly, achieving throughput far beyond API constraints.

Pagination
Deep extraction for heavy filers

Large PLCs often have thousands of historical filings spanning decades. Our crawlers handle deep pagination state efficiently, ensuring complete historical records without memory bloat or connection timeouts.
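
One way to keep memory flat during deep pagination is to stream results from a generator instead of collecting every page first. The sketch below assumes offset-based paging and a hypothetical `fetch_page(start_index, page_size)` callable that returns a dict with `items` and `total_count`; both the callable and those field names are illustrative.

```python
from typing import Callable, Iterator


def iter_filings(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield filings one at a time so only a single page is held in memory."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        items = page.get("items", [])
        yield from items
        start += len(items)
        total = page.get("total_count")
        # Stop on an empty page, or once the reported total has been reached.
        if not items or (total is not None and start >= total):
            break


def fake_fetch(start: int, size: int) -> dict:
    """Stand-in for a real page fetch, serving 250 made-up filings."""
    data = [{"transaction_id": f"tx-{i}"} for i in range(250)]
    return {"items": data[start:start + size], "total_count": len(data)}


filings = list(iter_filings(fake_fetch))
```

Because `fetch_page` is injected, the same loop can be exercised against a stub in tests and against a real client in production.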

Entity Resolution
Matching officers despite data inconsistencies

Officer names are frequently misspelled or formatted inconsistently across different company appointments. We apply deterministic matching rules using name permutations and partial dates of birth to accurately map cross-directorships.
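
A deterministic match key of this kind might be built as in the sketch below. It is a simplified illustration: `officer_key` and the title list are hypothetical, and the approach only absorbs differences in word order, case, accents, and punctuation. It does not correct misspellings, which would need additional rules.

```python
import re
import unicodedata

# Illustrative, non-exhaustive list of titles to ignore.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "sir"}


def officer_key(name: str, date_of_birth: str) -> tuple[str, str]:
    """Build a deterministic match key from a name and a partial DOB (YYYY-MM)."""
    # Strip accents and case so formatting differences vanish.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = [t for t in re.findall(r"[a-z]+", ascii_name.lower()) if t not in TITLES]
    # Sorting makes "MURPHY, Kenneth" and "Kenneth Murphy" produce the same key.
    return " ".join(sorted(tokens)), date_of_birth[:7]


key_a = officer_key("MURPHY, Kenneth", "1967-04")
key_b = officer_key("Mr Kenneth Murphy", "1967-04")
same_person = key_a == key_b
```

Appointments that share a key can then be grouped as candidate cross-directorships. Since a key holds only a name and a month of birth, common names can still collide.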

Change Detection
Efficient delta updates for 5M+ entities

Polling the entire UK corporate register daily is inefficient. We maintain state hashes for every entity and only emit records when a new filing, status change, or officer update is detected, providing a clean changelog.
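
The state-hash idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: SHA-256 over canonical JSON is one reasonable choice rather than a confirmed detail of the pipeline, and the in-memory `state` dict stands in for whatever persistent store is actually used.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record's canonical JSON form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff(records: dict[str, dict], state: dict[str, str]) -> list[dict]:
    """Return records whose hash differs from the stored state, updating the state."""
    changed = []
    for company_number, record in records.items():
        digest = record_hash(record)
        if state.get(company_number) != digest:
            state[company_number] = digest
            changed.append(record)
    return changed


state: dict[str, str] = {}
run = {"00445790": {"company_name": "TESCO PLC", "company_status": "active"}}
first = diff(run, state)   # entity not seen before, so it is emitted
second = diff(run, state)  # nothing changed, so nothing is emitted
```

Storing only a digest per entity keeps the index small compared with retaining full copies of every previous record.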

Applications

Who uses Companies House data

Teams across industries use companieshouse.gov.uk data to build competitive products and smarter operations.

01
KYC & AML Compliance

Fintechs and banks automate customer onboarding by verifying corporate structures, directors, and ultimate beneficial owners against the public register.

02
B2B Lead Generation

Sales teams track new incorporations by SIC code and region to target newly funded or expanding businesses with relevant services.

03
Credit Risk & Underwriting

Lenders ingest mortgage charges, insolvency events, and historical accounts to build automated credit scoring models for SME lending.

04
Market & Competitor Analysis

Consultancies aggregate financial metrics from extracted PDF accounts to benchmark industry performance and identify acquisition targets.

05
Supply Chain Due Diligence

Procurement teams monitor supplier health by tracking late filings, director resignations, and new floating charges.

06
Master Data Management

Enterprise data teams use Companies House as the golden source to cleanse and enrich stale CRM records with accurate registered addresses and legal names.

Why DataFlirt

"UK Companies House holds the definitive graph of British corporate ownership, but turning millions of unstructured PDFs and fragmented records into queryable data requires serious infrastructure."

Most teams underestimate the compute required to parse scanned financial accounts or map cross-directorships at scale. DataFlirt absorbs the complexity of OCR, entity resolution, and continuous diffing so your engineers can focus on risk modelling and analysis.

Technical Spec

Companies House scraper - technical capabilities

Everything supported by our companieshouse.gov.uk scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Web interface scraping (Supported): Bypasses strict API rate limits for high-throughput extraction
PDF OCR extraction (Supported): Converts scanned financial accounts into structured JSON metrics
Change detection (diffs) (Supported): Hash-based diff that only emits records with changed fields since the last run
Deep pagination (Supported): Captures complete filing histories exceeding 1,000 documents
Residential proxy rotation (Supported): ISP-grade residential IPs to prevent rate limiting and blocks
Webhook delivery (Supported): HTTP POST per record or batch for real-time KYC workflows
Pre-2010 dissolved entities (Partial): Records removed from the public register after 20 years
Protected officer addresses (Partial): Full residential addresses protected under section 243

Infrastructure

Infrastructure powering the Companies House pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, AWS Textract, Tesseract OCR

Scrapy + Playwright Stack

Scrapy manages concurrency and deduplication for millions of entities, while Playwright handles complex dynamic interactions and cookie sessions when necessary.

Distributed OCR Pipeline

We route unstructured PDF filings through a scalable AWS Textract and Tesseract cluster, applying custom heuristics to normalise tabular financial data.

Cloud-Native Orchestration

Pipelines run on Kubernetes clusters. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested, schema versioned per run
CSV: Flat file with typed columns, Excel/Sheets compatible
Parquet: Columnar format for BigQuery, Snowflake, Athena
S3: Direct bucket delivery, compatible with any data lake
BigQuery: Streamed directly into your dataset with schema auto-detect
Webhook: HTTP POST per record for real-time downstream processing
Postgres: Upsert into your existing schema with conflict resolution
Snowflake: Stage + COPY INTO workflow, incremental or full-replace

// faq

Common questions.

About companieshouse.gov.uk scraping, legality, and pipeline operations.

Is scraping Companies House legal?

Yes. Companies House data is a public register. Information is published under the Open Government Licence (OGL). DataFlirt extracts only publicly available corporate and officer data. We do not attempt to bypass security controls or extract protected residential addresses.

Why scrape the web interface instead of using the API?

The official API imposes strict rate limits (600 requests per 5 minutes) which makes monitoring 5 million active companies impossible. Scraping the web interface allows us to achieve the throughput necessary for daily full-register diffs.
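
To put that limit in perspective, the arithmetic below estimates how long one pass over the register would take through the API. It assumes a single API key, exactly one request per company, and a round figure of 5,000,000 companies; real monitoring needs several requests per entity, so this is a lower bound.

```python
REQUESTS_PER_WINDOW = 600   # API limit quoted above
WINDOW_SECONDS = 5 * 60     # 5 minutes
COMPANIES = 5_000_000       # "5 million active companies", rounded

# 288 five-minute windows fit into one day.
windows_per_day = 24 * 60 * 60 // WINDOW_SECONDS
requests_per_day = REQUESTS_PER_WINDOW * windows_per_day
days_for_one_pass = COMPANIES / requests_per_day

print(f"{requests_per_day:,} requests/day, {days_for_one_pass:.1f} days per pass")
# prints "172,800 requests/day, 28.9 days per pass"
```

At roughly 29 days per pass, a daily diff of the full register is out of reach for a single key under these assumptions.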

Can you extract data from PDF financial accounts?

Yes. We use a combination of AWS Textract and custom parsing heuristics to extract structured balance sheet and P&L metrics from scanned PDF filings, converting them into queryable JSON fields.

How fresh is the data?

We run continuous polling on target entity lists. For full-register monitoring, we push delta updates daily, ensuring your warehouse reflects the latest appointments, resignations, and filings within 24 hours of publication.

Do you capture historical filings?

Yes. We can extract the complete filing history for any active or recently dissolved company, including legacy documents dating back to incorporation.

How do you handle dissolved companies?

We track dissolution events and update company status accordingly. Note that Companies House removes records of companies dissolved for more than 20 years (pre-2010), which are no longer available on the public register.

Can you track cross-directorships?

Yes. We extract partial dates of birth and full names to help you resolve entities and map complex corporate structures and beneficial ownership graphs.

$ dataflirt scope --new-project --source=companieshouse.gov.uk ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a targeted list of 10,000 companies or a daily sync of the entire 5M+ UK corporate register, we build and operate the infrastructure. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h