
Global entity data,
normalised at scale.

We extract Level 1 entity records and Level 2 ownership hierarchies from the Global Legal Entity Identifier Foundation. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Postgres on your cadence.

LEI records extracted: 2.8M /run
Relationship mappings: 1.4M /run
Delta updates: 14,205 /24h
Active pipelines: 37
Uptime: 99.98%

Data Dictionary

Every field we extract from gleif.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Level 1 Entity Data objects from gleif.org. All fields typed and schema-versioned.

lei_code, legal_name, legal_jurisdiction, entity_status, legal_form_code, registration_date, last_update_date, managing_lou
level_1 entity data
● 200 OK
"lei_code": "5493006MHB84DD0ZWV18",
"legal_name": "DataFlirt Technologies Ltd",
"legal_jurisdiction": "GB",
"entity_status": "ACTIVE",
"legal_form_code": "8FTB",
"managing_lou": "EVK05KS7XY1DEII3R011"

Complete list of extractable fields for Level 2 Ownership Data objects from gleif.org. All fields typed and schema-versioned.

child_lei, parent_lei, relationship_type, relationship_status, start_date, end_date, accounting_standard, validation_sources
level_2 ownership data
● 200 OK
"child_lei": "5493006MHB84DD0ZWV18",
"parent_lei": "549300O897ZC5R7BMG32",
"relationship_type": "ULTIMATE_ACCOUNTING_CONSOLIDATING_PARENT",
"relationship_status": "ACTIVE",
"accounting_standard": "IFRS",
"validation_sources": "FULLY_CORROBORATED"

Complete list of extractable fields for Registration Details objects from gleif.org. All fields typed and schema-versioned.

lei_code, initial_registration_date, next_renewal_date, registration_status, managing_lou, validation_authority_id, validation_authority_entity_id, corroboration_level
registration_details
● 200 OK
"lei_code": "5493006MHB84DD0ZWV18",
"initial_registration_date": "2018-05-14T09:00:00Z",
"next_renewal_date": "2025-05-14T09:00:00Z",
"registration_status": "ISSUED",
"managing_lou": "EVK05KS7XY1DEII3R011",
"corroboration_level": "FULLY_CORROBORATED"

Complete list of extractable fields for Address Information objects from gleif.org. All fields typed and schema-versioned.

lei_code, legal_address_line1, legal_address_city, legal_address_country, legal_address_postal_code, hq_address_line1, hq_address_city, hq_address_country
address_information
● 200 OK
"lei_code": "5493006MHB84DD0ZWV18",
"legal_address_line1": "123 Tech Park",
"legal_address_city": "London",
"legal_address_country": "GB",
"legal_address_postal_code": "EC1A 1BB",
"hq_address_country": "GB"

Complete list of extractable fields for Event History objects from gleif.org. All fields typed and schema-versioned.

lei_code, event_type, event_date, event_status, previous_name, new_name, previous_address, new_address
event_history
● 200 OK
"lei_code": "5493006MHB84DD0ZWV18",
"event_type": "ENTITY_NAME_CHANGE",
"event_date": "2023-11-04T14:30:00Z",
"event_status": "COMPLETED",
"previous_name": "DataFlirt Inc",
"new_name": "DataFlirt Technologies Ltd"
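
The field lists above can be carried into code as typed records. The following is a minimal sketch built on the registration_details sample shown earlier; the `RegistrationDetails` class and the `parse_timestamp` helper are hypothetical names chosen for illustration, not the delivered schema.

```python
from dataclasses import dataclass
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2018-05-14T09:00:00Z'."""
    # Swap the 'Z' suffix for an explicit offset so older Python versions accept it too.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass(frozen=True)
class RegistrationDetails:
    """Hypothetical typed view of a registration_details record."""

    lei_code: str
    initial_registration_date: datetime
    next_renewal_date: datetime
    registration_status: str
    managing_lou: str
    corroboration_level: str

    @classmethod
    def from_record(cls, record: dict[str, str]) -> "RegistrationDetails":
        # A missing key raises KeyError, which surfaces schema drift early.
        return cls(
            lei_code=record["lei_code"],
            initial_registration_date=parse_timestamp(record["initial_registration_date"]),
            next_renewal_date=parse_timestamp(record["next_renewal_date"]),
            registration_status=record["registration_status"],
            managing_lou=record["managing_lou"],
            corroboration_level=record["corroboration_level"],
        )


sample = {
    "lei_code": "5493006MHB84DD0ZWV18",
    "initial_registration_date": "2018-05-14T09:00:00Z",
    "next_renewal_date": "2025-05-14T09:00:00Z",
    "registration_status": "ISSUED",
    "managing_lou": "EVK05KS7XY1DEII3R011",
    "corroboration_level": "FULLY_CORROBORATED",
}
details = RegistrationDetails.from_record(sample)
print(details.next_renewal_date.isoformat())
```

Parsing dates into timezone-aware values at the boundary means downstream comparisons, such as checking for overdue renewals, do not depend on string ordering.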

Capabilities

Extract authoritative corporate identity data

Our GLEIF scraper processes complex XML schemas and daily delta files, converting millions of nested LEI records and ownership hierarchies into flat, queryable tables.

Level 1 Entity Parsing

Extract core entity data, legal forms, jurisdiction codes, and registration statuses for over 2.5 million global entities.

Level 2 Hierarchy Mapping

Resolve ultimate and direct parent relationships. We join parent and child LEIs to reconstruct corporate family trees.
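
One way to picture family-tree reconstruction is as a walk up direct-parent links. The sketch below is illustrative only: the `DIRECT_ACCOUNTING_CONSOLIDATING_PARENT` value is assumed by analogy with the `ULTIMATE_ACCOUNTING_CONSOLIDATING_PARENT` value in the sample record, the function names are hypothetical, and the `LEI_A`-style identifiers are placeholders rather than real LEIs.

```python
def build_parent_index(relationships: list[dict[str, str]]) -> dict[str, str]:
    """Map each child LEI to its direct parent, using ACTIVE direct-parent records only."""
    return {
        rel["child_lei"]: rel["parent_lei"]
        for rel in relationships
        # Assumed type label; see the lead-in above.
        if rel["relationship_type"] == "DIRECT_ACCOUNTING_CONSOLIDATING_PARENT"
        and rel["relationship_status"] == "ACTIVE"
    }


def ultimate_parent(lei: str, parent_of: dict[str, str]) -> str:
    """Walk direct-parent links upward until an entity has no recorded parent."""
    seen = {lei}
    current = lei
    while current in parent_of:
        current = parent_of[current]
        if current in seen:
            # Guard against circular ownership data instead of looping forever.
            raise ValueError(f"cycle detected in ownership chain at {current}")
        seen.add(current)
    return current


relationships = [
    {"child_lei": "LEI_C", "parent_lei": "LEI_B",
     "relationship_type": "DIRECT_ACCOUNTING_CONSOLIDATING_PARENT",
     "relationship_status": "ACTIVE"},
    {"child_lei": "LEI_B", "parent_lei": "LEI_A",
     "relationship_type": "DIRECT_ACCOUNTING_CONSOLIDATING_PARENT",
     "relationship_status": "ACTIVE"},
    {"child_lei": "LEI_D", "parent_lei": "LEI_A",
     "relationship_type": "DIRECT_ACCOUNTING_CONSOLIDATING_PARENT",
     "relationship_status": "INACTIVE"},
]
parent_of = build_parent_index(relationships)
print(ultimate_parent("LEI_C", parent_of))
```

An entity with no active parent record resolves to itself, which keeps top-level entities and unlinked entities in the same code path.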

Delta File Processing

Parse daily published delta files for incremental updates, ensuring your database reflects the latest corporate actions.

XML Payload Normalisation

Convert deeply nested XML in the LEI Common Data File (CDF) format into flat relational tables or JSON objects.

Historical Record Tracking

Maintain changelogs for entity status changes, registration renewals, and corporate name updates over time.

LOU Data Extraction

Track managing Local Operating Units (LOUs) and validation authorities responsible for corroborating each LEI.

Address Standardisation

Split and normalise legal and headquarters address fields across different international formats and character sets.

Event History Mining

Capture corporate actions, mergers, acquisitions, and name changes published in the GLEIF event logs.

Scheduled Delivery

Push updates daily to sync with GLEIF publication cycles, ensuring zero drift between your system and the global index.

// engagement pipeline

From XML schemas to warehouse records

Brief in. Clean data out.

Define Scope
Day 0

Specify whether you need a full database sync or targeted extraction based on jurisdiction, legal form, or LOU.

Pipeline Build
Days 2–4

We configure parsers for GLEIF XML schemas, set up daily delta processing, and implement relationship mapping logic.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and relationship integrity tests ensure parent-child mappings are accurate.
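
To make two of these checks concrete, here is a small sketch of a null-rate measurement and a relationship integrity test. The function names and the sample rows are assumptions for illustration; the actual QA suite is not described in detail here.

```python
def null_rate(rows: list[dict], field: str) -> float:
    """Fraction of rows where `field` is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def orphan_relationships(relationships: list[dict], known_leis: set[str]) -> list[dict]:
    """Return relationship rows whose child or parent LEI has no Level 1 record."""
    return [
        rel
        for rel in relationships
        if rel["child_lei"] not in known_leis or rel["parent_lei"] not in known_leis
    ]


# Placeholder data; 'A', 'B', and 'Z' stand in for real LEI codes.
entities = [{"legal_name": "A"}, {"legal_name": ""}, {"legal_name": None}, {}]
links = [
    {"child_lei": "A", "parent_lei": "B"},
    {"child_lei": "A", "parent_lei": "Z"},
]
print(null_rate(entities, "legal_name"))
print(orphan_relationships(links, known_leis={"A", "B"}))
```

A pipeline would typically compare the null rate against a per-field threshold and fail the run, or raise an alert, when orphaned relationships appear.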

Delivery
ongoing

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Postgres database on a daily cadence.

Under the hood

How our GLEIF pipeline handles the hard parts

Processing GLEIF data requires parsing massive XML files and reconciling daily deltas. Here is how we maintain pipeline integrity.

pipeline-monitor · gleif.org · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued, running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Data structure
XML schema complexity

GLEIF uses deeply nested XML formats (CDF) for entity and relationship data. We flatten these hierarchical structures into relational tables, handling varying schema versions and optional fields automatically.

State management
Delta reconciliation

Applying daily deltas requires strict state management to prevent data corruption. We process additions, modifications, and deletions in sequence, ensuring your local copy mirrors the authoritative index.
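
A simplified picture of sequential delta application is shown below. It assumes each delta entry carries an `action` of `ADD`, `MODIFY`, or `DELETE`; those labels, the entry layout, and the `apply_delta` function are hypothetical and are not taken from the GLEIF file format.

```python
def apply_delta(state: dict[str, dict], delta: list[dict]) -> dict[str, dict]:
    """Return a new state with the delta operations applied in the order given."""
    new_state = dict(state)  # shallow copy so the previous snapshot stays intact
    for op in delta:
        lei, action = op["lei_code"], op["action"]
        if action == "ADD":
            if lei in new_state:
                raise ValueError(f"ADD for existing record {lei}")
            new_state[lei] = op["record"]
        elif action == "MODIFY":
            if lei not in new_state:
                raise KeyError(f"MODIFY for unknown record {lei}")
            # Build a fresh dict rather than mutating the old record in place.
            new_state[lei] = {**new_state[lei], **op["record"]}
        elif action == "DELETE":
            new_state.pop(lei, None)
        else:
            raise ValueError(f"unknown action {action!r}")
    return new_state


# Placeholder keys 'A' and 'B' stand in for real LEI codes.
state = {"A": {"legal_name": "Old Name", "entity_status": "ACTIVE"}}
delta = [
    {"action": "ADD", "lei_code": "B", "record": {"legal_name": "New Co"}},
    {"action": "MODIFY", "lei_code": "A", "record": {"legal_name": "New Name"}},
    {"action": "DELETE", "lei_code": "B"},
]
print(apply_delta(state, delta))
```

Raising on an unexpected `ADD` or `MODIFY` is a deliberate choice in this sketch: a delta that does not fit the current state usually means a day was skipped, and failing loudly is safer than applying it.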

Hierarchy resolution
Relationship mapping across files

Level 2 data splits parents and children across different files and records. We join these foreign keys in transit, validating relationship statuses and accounting standards before delivery.
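
As a rough illustration of that join, the sketch below resolves each relationship's child and parent LEI against a Level 1 lookup and keeps only active relationships. `join_relationships` is a hypothetical helper, and the parent's legal name is a placeholder because the sample data does not provide one.

```python
def join_relationships(
    relationships: list[dict[str, str]], entities: dict[str, dict[str, str]]
) -> list[dict[str, str]]:
    """Attach legal names to ACTIVE relationships whose LEIs both resolve."""
    joined = []
    for rel in relationships:
        if rel["relationship_status"] != "ACTIVE":
            continue
        child = entities.get(rel["child_lei"])
        parent = entities.get(rel["parent_lei"])
        if child is None or parent is None:
            # Unresolved foreign key; a real pipeline would likely log or quarantine it.
            continue
        joined.append(
            {
                "child_lei": rel["child_lei"],
                "child_name": child["legal_name"],
                "parent_lei": rel["parent_lei"],
                "parent_name": parent["legal_name"],
                "relationship_type": rel["relationship_type"],
            }
        )
    return joined


entities = {
    "5493006MHB84DD0ZWV18": {"legal_name": "DataFlirt Technologies Ltd"},
    "549300O897ZC5R7BMG32": {"legal_name": "Example Parent Holdings"},  # placeholder name
}
relationships = [
    {
        "child_lei": "5493006MHB84DD0ZWV18",
        "parent_lei": "549300O897ZC5R7BMG32",
        "relationship_type": "ULTIMATE_ACCOUNTING_CONSOLIDATING_PARENT",
        "relationship_status": "ACTIVE",
    }
]
print(join_relationships(relationships, entities))
```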

Infrastructure
Pagination and rate limits

While GLEIF provides bulk files, API endpoints for real-time lookups throttle heavy requests. We manage concurrency, implement backoff strategies, and use proxy rotation for high-volume API queries.
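
A common shape for the backoff part is exponential delay with jitter. The sketch below is a generic example, not the production retry logic: the function names, the default `base` and `cap` values, and the broad `except Exception` are all assumptions, and real code would normally retry only on specific throttling or network errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Exponential delays of base * 2**attempt, each capped at `cap` seconds."""
    return [min(cap, base * 2**attempt) for attempt in range(retries)]


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on failure with full-jitter exponential backoff."""
    for delay in backoff_delays(retries):
        try:
            return fn()
        except Exception:
            # Full jitter: wait a random fraction of the current delay.
            sleep(delay * rng())
    return fn()  # final attempt; any exception now propagates to the caller


# Demo with a stubbed sleep so nothing actually waits.
attempts = {"count": 0}


def flaky_lookup() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return "ok"


waits: list[float] = []
print(call_with_backoff(flaky_lookup, sleep=waits.append, rng=lambda: 1.0), waits)
```

Injecting `sleep` and `rng` keeps the retry logic testable without real delays or randomness.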

Data quality
Address normalisation

Address formats vary globally across millions of entities. We apply standardisation rules during extraction, separating street, city, region, and postal codes into consistent typed columns.

Applications

Who uses GLEIF data — and how

Teams across industries use gleif.org data to build competitive products and smarter operations.

01
KYC and AML Compliance

Automate counterparty identification and verification using authoritative LEI data to meet global regulatory standards.

02
Vendor Master Data

Cleanse, deduplicate, and append LEI codes to internal vendor databases to maintain accurate corporate records.

03
Corporate Risk Assessment

Map ultimate parent entities and subsidiary hierarchies to calculate aggregate exposure across complex corporate groups.

04
Regulatory Reporting

Fulfil MiFID II, EMIR, and Dodd-Frank reporting requirements with verified, up-to-date entity identifiers.

05
Financial Data Aggregation

Link disparate market data feeds, credit ratings, and financial statements using the LEI as the primary key.

06
Supply Chain Visibility

Trace corporate ownership across global supplier networks to identify concentration risks and geopolitical exposure.

Why DataFlirt

"GLEIF provides the most authoritative corporate identity graph in the world, but navigating the nested XML and relationship mapping requires heavy engineering."

Most teams underestimate the complexity of parsing Level 2 ownership data. Resolving direct and ultimate parent relationships across millions of entities requires strict state management and daily delta reconciliation. DataFlirt handles the extraction and mapping so your compliance systems receive flat, queryable records.

Technical Spec

GLEIF scraper — technical capabilities

Everything supported by our gleif.org scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Level 1 LEI extraction (Supported): Core entity data, legal forms, and jurisdiction codes
Level 2 ownership mapping (Supported): Direct and ultimate parent relationship resolution
Daily delta processing (Supported): Incremental updates based on GLEIF daily publications
XML to JSON flattening (Supported): Conversion of CDF XML schemas into flat relational formats
Historical event tracking (Supported): Capture of corporate actions and status changes over time
Webhook delivery per update (Supported): HTTP POST for real-time entity status changes
Non-public ownership exceptions (Partial): GLEIF does not publish structures where entities have opted out for legal reasons
Real-time LOU internal data (Partial): Requires direct LOU access; not available on public gleif.org endpoints

Infrastructure

Infrastructure powering the GLEIF pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

Scrapy + Playwright Stack

Scrapy handles API orchestration, file downloading, and retry logic. Playwright handles any JavaScript-rendered search interfaces or portal interactions required for supplementary data.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies to handle high-volume API requests, bypassing rate limits and IP blocks during intensive historical data backfills.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles the complex dependency graphs required for processing daily deltas and reconciling Level 2 relationship mappings.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested arrays for hierarchical data
CSV: Flat files with typed columns for Level 1 and Level 2 data
XLS: Excel compatible exports for compliance team reviews
Parquet: Columnar format optimized for BigQuery and Athena
AWS S3: Direct bucket delivery for data lake ingestion
Webhook: HTTP POST payloads for real-time system alerts
API: REST endpoints to query specific LEI records on demand
PostgreSQL: Direct upserts into your existing relational database schema
// faq

Common questions.

About gleif.org scraping, legality, and pipeline operations.

Is scraping GLEIF legal?

Yes. GLEIF publishes LEI data as open data under a Creative Commons CC0 1.0 public-domain dedication, with the stated aim of promoting transparency in global financial markets. We extract this public data in compliance with GLEIF's terms of use.

How do you handle daily updates?

We monitor GLEIF publication schedules and process the daily delta files every 24 hours. Our pipeline applies additions, modifications, and deletions in sequence so that your database stays synchronised with the global index.

Can you map parent-child relationships?

Yes. We extract Level 2 Relationship Record (RR) files and resolve the foreign keys against Level 1 entity data. This provides a complete corporate hierarchy, identifying both direct and ultimate accounting consolidating parents.

Do I need to parse XML?

No. We handle the parsing of XML files in the LEI Common Data File (CDF) format. We flatten the nested elements and deliver clean, typed formats such as JSON, CSV, or Parquet directly to your warehouse.

What fields are included in the extraction?

All public fields are included: LEI code, legal name, jurisdiction, entity status, legal form, registration dates, managing LOU, legal addresses, headquarters addresses, and Level 2 relationship types.

How fast is the initial sync?

A full sync of all 2.5 million+ LEI records and their associated relationship mappings typically completes within 12 hours. Subsequent daily delta updates are processed and delivered within minutes of publication.

$ dataflirt scope --new-project --source=gleif.org ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full historical sync of global LEI records or a continuous daily delta feed for compliance monitoring — we scope, build, and operate the pipeline.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h