
SEC EDGAR data,
parsed at scale.

We extract corporate filings, insider trading forms, institutional holdings, and XBRL financial statements from SEC EDGAR. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery on your cadence.

Filings parsed: 42,105 /day
XBRL facts extracted: 8.4M /24h
Form 4 records: 12,450 /run
Active pipelines: 112
Uptime: 99.98%

Data Dictionary

Every field we extract from sec.gov

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Company Metadata objects from sec.gov. All fields typed and schema-versioned.

cik, ticker, company_name, sic_code, sic_description, state_of_incorporation, fiscal_year_end, business_address, mailing_address, former_names
company_metadata
● 200 OK
"cik": "0000320193",
"ticker": "AAPL",
"company_name": "Apple Inc.",
"sic_code": "3571",
"sic_description": "Electronic Computers",
"state_of_incorporation": "CA",
"fiscal_year_end": "0930"

Complete list of extractable fields for 10-K / 10-Q Financials objects from sec.gov. All fields typed and schema-versioned.

accession_number, cik, form_type, filing_date, period_of_report, xbrl_instance_url, total_revenue, net_income, eps_basic, total_assets, total_liabilities, operating_cash_flow
10-k_/ 10-q financials
● 200 OK
"accession_number": "0000320193-23-000106",
"form_type": "10-K",
"filing_date": "2023-11-03",
"period_of_report": "2023-09-30",
"total_revenue": 383285000000,
"net_income": 96995000000,
"eps_basic": 6.16

Complete list of extractable fields for Form 4 (Insider Trading) objects from sec.gov. All fields typed and schema-versioned.

accession_number, issuer_cik, reporting_owner_cik, reporting_owner_name, is_director, is_officer, officer_title, transaction_date, transaction_code, shares_transacted, price_per_share, shares_owned_following
form_4 (insider trading)
● 200 OK
"accession_number": "0000320193-24-000012",
"reporting_owner_name": "COOK TIMOTHY D",
"is_officer": true,
"officer_title": "Chief Executive Officer",
"transaction_date": "2024-04-01",
"transaction_code": "S",
"shares_transacted": 196410,
"price_per_share": 169.33

Complete list of extractable fields for Form 13F (Holdings) objects from sec.gov. All fields typed and schema-versioned.

accession_number, manager_name, report_calendar_or_quarter, name_of_issuer, title_of_class, cusip, value_usd, shares_prn_amount, put_call, investment_discretion
form_13f (holdings)
● 200 OK
"manager_name": "BERKSHIRE HATHAWAY INC",
"report_calendar_or_quarter": "2023-12-31",
"name_of_issuer": "APPLE INC",
"title_of_class": "COM",
"cusip": "037833100",
"value_usd": 174347781000,
"shares_prn_amount": 905560000

Complete list of extractable fields for 8-K (Material Events) objects from sec.gov. All fields typed and schema-versioned.

accession_number, cik, company_name, filing_date, item_numbers, item_descriptions, html_url, text_url, exhibits_count, has_press_release
8-k_(material events)
● 200 OK
"accession_number": "0001193125-24-023456",
"cik": "0001652044",
"company_name": "Alphabet Inc.",
"filing_date": "2024-01-30",
"item_numbers": "['2.02', '9.01']",
"item_descriptions": "['Results of Operations and Financial Condition', 'Financial Statements and Exhibits']",
"has_press_release": true
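
The identifiers in the samples above follow fixed shapes: a CIK is zero-padded to ten digits, and an accession number reads NNNNNNNNNN-YY-NNNNNN. The following is a minimal sketch of validating both and deriving the public EDGAR archive folder for a filing. The helper names `normalise_cik` and `filing_index_url` are illustrative assumptions, not part of any DataFlirt API.

```python
import re
from typing import Union

# Accession numbers look like 0000320193-23-000106 (10 digits, 2 digits, 6 digits).
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")


def normalise_cik(cik: Union[str, int]) -> str:
    """Zero-pad a CIK to the 10-digit string form used in the sample records."""
    digits = str(cik).strip().lstrip("0") or "0"
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


def filing_index_url(cik: Union[str, int], accession_number: str) -> str:
    """Build the EDGAR archive folder URL for a single filing."""
    if not ACCESSION_RE.match(accession_number):
        raise ValueError(f"not a valid accession number: {accession_number!r}")
    cik_int = int(normalise_cik(cik))  # the archive path uses the unpadded CIK
    folder = accession_number.replace("-", "")  # and the accession without dashes
    return f"https://www.sec.gov/Archives/edgar/data/{cik_int}/{folder}/"


url = filing_index_url("0000320193", "0000320193-23-000106")
```

Keeping the CIK as a padded string in stored records, and converting to an integer only when building archive paths, avoids losing the leading zeros in CSV or spreadsheet exports.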

Capabilities

Everything you need from EDGAR — parsed and normalised

Our SEC pipeline handles every layer of the EDGAR database: strict rate limits, complex XBRL taxonomies, historical SGML formats, and real-time event polling.

EDGAR Rate Limit Compliance

Built-in throttling and User-Agent declaration keep every crawler within the SEC limit of 10 requests per second without missing filings.

XBRL & iXBRL Parsing

Extract and normalise financial facts from complex XBRL instance documents, handling custom taxonomies and dimensional data.

Real-Time 8-K Monitoring

Poll EDGAR RSS feeds at the maximum permitted frequency to detect material events, earnings releases, and executive changes, typically within 60 seconds of publication.

Insider Trading (Form 4) Extraction

Parse XML ownership documents to track executive buys, sells, and option exercises with exact share counts and prices.

Institutional Holdings (13F)

Aggregate quarterly 13F-HR filings to track hedge fund and institutional positions, mapped by CUSIP and issuer name.

Historical Filing Backfills

Access decades of historical filings. We parse legacy text and SGML formats from pre-2004 submissions.

CIK & Ticker Resolution

Map SEC Central Index Keys (CIK) to current and historical ticker symbols, accounting for mergers and name changes.

Exhibit & Attachment Downloading

Automatically extract and store material contracts, press releases, and EX-101 attachments associated with primary filings.

Comment Letter Analysis

Extract SEC UPLOAD and company CORRESP filings to monitor regulatory scrutiny and accounting inquiries.

Mutual Fund Prospectuses

Parse N-1A, 497, and N-PORT filings for fund objectives, fee structures, and portfolio holdings.
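
As an illustration of the Form 4 capability listed above, here is a hedged sketch of reading one ownership document with the Python standard library. The XML is a trimmed, hand-written sample whose values mirror the Form 4 record in the data dictionary; real documents carry many more elements, and the function name `parse_form4` is an assumption for this example only.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample in the shape of an EDGAR ownership document.
SAMPLE_FORM4 = """
<ownershipDocument>
  <issuer><issuerCik>0000320193</issuerCik></issuer>
  <reportingOwner>
    <reportingOwnerId><rptOwnerName>COOK TIMOTHY D</rptOwnerName></reportingOwnerId>
    <reportingOwnerRelationship>
      <isOfficer>1</isOfficer>
      <officerTitle>Chief Executive Officer</officerTitle>
    </reportingOwnerRelationship>
  </reportingOwner>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2024-04-01</value></transactionDate>
      <transactionCoding><transactionCode>S</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>196410</value></transactionShares>
        <transactionPricePerShare><value>169.33</value></transactionPricePerShare>
      </transactionAmounts>
    </nonDerivativeTransaction>
  </nonDerivativeTable>
</ownershipDocument>
""".strip()


def _num(node, path):
    """Return the number at `path`, or None when the element is absent or empty."""
    text = (node.findtext(path) or "").strip()
    return float(text) if text else None


def parse_form4(xml_text):
    """Flatten each non-derivative transaction into one dictionary."""
    root = ET.fromstring(xml_text)
    relationship = "reportingOwner/reportingOwnerRelationship"
    officer_flag = (root.findtext(f"{relationship}/isOfficer") or "").strip().lower()
    shared = {
        "issuer_cik": root.findtext("issuer/issuerCik"),
        "reporting_owner_name": root.findtext("reportingOwner/reportingOwnerId/rptOwnerName"),
        "is_officer": officer_flag in {"1", "true"},
        "officer_title": root.findtext(f"{relationship}/officerTitle"),
    }
    rows = []
    for tx in root.iterfind("nonDerivativeTable/nonDerivativeTransaction"):
        rows.append({
            **shared,
            "transaction_date": tx.findtext("transactionDate/value"),
            "transaction_code": tx.findtext("transactionCoding/transactionCode"),
            "shares_transacted": _num(tx, "transactionAmounts/transactionShares/value"),
            "price_per_share": _num(tx, "transactionAmounts/transactionPricePerShare/value"),
        })
    return rows


form4_rows = parse_form4(SAMPLE_FORM4)
```

Emitting one row per transaction, with the owner fields repeated, keeps the output flat enough for CSV or Parquet while remaining easy to re-nest as JSON.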

// engagement pipeline

From CIK list to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide CIK lists, ticker sets, or form types (e.g., 10-K, 8-K, Form 4). We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy crawlers, EDGAR rate-limit queues, and XBRL parsing logic for sec.gov.

Validation & QA
Days 4–6

Schema validation, null-rate checks, taxonomy resolution, and historical data sampling before full launch.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
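
The null-rate check mentioned in the Validation & QA step can be sketched as follows. This is an illustrative example, not DataFlirt's QA code: the function names, the sample records, and the thresholds are all assumptions.

```python
def null_rates(records, fields):
    """Share of records in which each field is missing, None, or an empty string."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in records if record.get(field) in (None, "")) / total
        for field in fields
    }


def failing_fields(records, thresholds):
    """Return the fields whose null rate exceeds the allowed threshold."""
    rates = null_rates(records, list(thresholds))
    return {field: rate for field, rate in rates.items() if rate > thresholds[field]}


# Hypothetical sample batch: one of four records lacks eps_basic.
batch = [
    {"accession_number": "a-1", "net_income": 10, "eps_basic": 1.0},
    {"accession_number": "a-2", "net_income": 20, "eps_basic": None},
    {"accession_number": "a-3", "net_income": 30, "eps_basic": 3.0},
    {"accession_number": "a-4", "net_income": 0, "eps_basic": 4.0},
]
failures = failing_fields(batch, {"net_income": 0.0, "eps_basic": 0.1})
```

Note that a legitimate zero (such as `net_income` of 0) is not counted as null; only absent, `None`, or empty-string values are.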

Under the hood

How our EDGAR pipeline handles the hard parts

SEC data is public, but extracting it cleanly is notoriously difficult. Here is how we manage the complexity of EDGAR.

pipeline-monitor · sec.gov · live · active

// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0

// pagination
Page coverage
48,291 pages queued (running)

// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

Rate limits
Strict 10 requests per second throttling

The SEC may temporarily block IP addresses that exceed 10 requests per second. Our distributed queuing system enforces a global rate limit across all crawler nodes and sends the required User-Agent header, so pipelines stay within the fair access policy and avoid blocks.

XBRL parsing
Navigating custom taxonomies

Companies frequently invent custom XBRL tags (extensions) when standard US-GAAP tags do not fit. Our parser maps custom extensions back to standard financial concepts, ensuring your revenue and net income columns remain populated.
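
One simple way to picture this mapping is a lookup from tag names to output columns, with standard tags taking precedence over extensions. The sketch below is an assumption-laden illustration: the `acme:` extension tags and the `EXTENSION_TO_COLUMN` table are hypothetical, and a production parser would more likely derive the mapping from each filing's taxonomy linkbases than from a hand-written dictionary.

```python
# Standard US-GAAP element names mapped to output columns.
US_GAAP_TO_COLUMN = {
    "Revenues": "total_revenue",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "total_revenue",
    "NetIncomeLoss": "net_income",
}

# Hypothetical company extensions, curated per filer.
EXTENSION_TO_COLUMN = {
    "acme:NetProductSales": "total_revenue",
}


def resolve_columns(facts):
    """Map {qualified_tag: value} to output columns; report tags left unmapped."""
    columns, unmapped = {}, []
    for qname, value in facts.items():
        prefix, _, local = qname.partition(":")
        if prefix == "us-gaap" and local in US_GAAP_TO_COLUMN:
            # A standard tag always wins over a previously seen extension.
            columns[US_GAAP_TO_COLUMN[local]] = value
        elif qname in EXTENSION_TO_COLUMN:
            # An extension only fills a column that is still empty.
            columns.setdefault(EXTENSION_TO_COLUMN[qname], value)
        else:
            unmapped.append(qname)
    return columns, unmapped


columns, unmapped = resolve_columns({
    "acme:NetProductSales": 1200,
    "us-gaap:NetIncomeLoss": 300,
    "acme:WidgetBacklog": 50,
})
```

Returning the unmapped tags alongside the columns makes it possible to review new extensions instead of silently dropping them.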

Legacy formats
Parsing pre-2004 SGML filings

Before XML became standard, EDGAR filings were submitted as massive, unstructured text files with loose SGML tags. We maintain custom regular expression pipelines to extract structured tables and metadata from these legacy documents.
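
To make the idea concrete, here is a small sketch of regex-based metadata extraction from the header block of a legacy text filing. The header labels follow the layout commonly seen in EDGAR full-text submissions; the sample values (company name, CIK, accession number) are placeholders, and `parse_sgml_header` is an illustrative name rather than the production parser.

```python
import re
from datetime import datetime

# Placeholder header in the shape of a 1990s EDGAR text submission.
SAMPLE_HEADER = """<SEC-HEADER>
ACCESSION NUMBER:               0000000000-96-000001
CONFORMED SUBMISSION TYPE:      10-K
CONFORMED PERIOD OF REPORT:     19950930
FILED AS OF DATE:               19960102

        COMPANY CONFORMED NAME:         EXAMPLE CORP
        CENTRAL INDEX KEY:              0000000000
</SEC-HEADER>"""

HEADER_FIELDS = {
    "ACCESSION NUMBER": "accession_number",
    "CONFORMED SUBMISSION TYPE": "form_type",
    "CONFORMED PERIOD OF REPORT": "period_of_report",
    "FILED AS OF DATE": "filing_date",
    "COMPANY CONFORMED NAME": "company_name",
    "CENTRAL INDEX KEY": "cik",
}
DATE_FIELDS = {"period_of_report", "filing_date"}

# "LABEL:   value" lines, with optional indentation and trailing whitespace.
LINE_RE = re.compile(r"^[ \t]*([A-Z][A-Z ]+):[ \t]+(\S.*?)[ \t]*$", re.MULTILINE)


def parse_sgml_header(text):
    """Extract known header labels into a typed metadata record."""
    record = {}
    for label, value in LINE_RE.findall(text):
        field = HEADER_FIELDS.get(label)
        if field is None or field in record:
            continue  # unknown label, or a repeat (keep the first occurrence)
        if field in DATE_FIELDS:
            value = datetime.strptime(value, "%Y%m%d").date().isoformat()
        record[field] = value
    return record


header = parse_sgml_header(SAMPLE_HEADER)
```

Dates are normalised to ISO format so that legacy records line up with the `filing_date` and `period_of_report` fields used for modern filings.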

CIK mapping
Point-in-time ticker resolution

Tickers change and companies merge, but CIKs remain static. We maintain a point-in-time mapping database so you can query historical filings by the ticker symbol that was active on the date of the filing.
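
A point-in-time lookup of this kind can be sketched with a sorted history per CIK and a binary search. The in-memory `TICKER_HISTORY` table and the function names below are assumptions for illustration; the single entry uses the FB to META ticker change as an example, and a real mapping database would also track end dates and delistings.

```python
from bisect import bisect_right
from datetime import date

# Illustrative history: (effective_from, ticker), sorted by date, keyed by CIK.
TICKER_HISTORY = {
    "0001326801": [(date(2012, 5, 18), "FB"), (date(2022, 6, 9), "META")],
}


def ticker_on(cik, day):
    """Return the ticker that was active for `cik` on `day`, or None."""
    history = TICKER_HISTORY.get(cik, [])
    starts = [start for start, _ in history]
    index = bisect_right(starts, day) - 1  # last entry starting on or before `day`
    return history[index][1] if index >= 0 else None


def cik_for(ticker, day):
    """Resolve a ticker to the CIK that used it on `day`, or None."""
    for cik in TICKER_HISTORY:
        if ticker_on(cik, day) == ticker:
            return cik
    return None
```

For example, `cik_for("FB", date(2021, 1, 4))` resolves to the CIK, while the same ticker queried for a 2023 date resolves to nothing, which is what a historical backtest needs in order to avoid look-ahead bias.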

Real-time polling
Sub-minute event detection

For event-driven trading, waiting for a daily bulk dump is useless. We poll EDGAR RSS feeds at the maximum allowed frequency and push 8-K and Form 4 data via Webhook within seconds of publication.
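
The core of such a poller is parsing each feed response and emitting only filings not seen before. The sketch below assumes an Atom feed whose entry `<id>` ends in `accession-number=...`; the feed snippet is hand-written for illustration, and the network fetch, scheduling, and webhook delivery are deliberately left out.

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}

# Hand-written sample entry; values mirror the 8-K record in the data dictionary.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>8-K - Alphabet Inc. (0001652044) (Filer)</title>
    <id>urn:tag:sec.gov,2008:accession-number=0001193125-24-023456</id>
    <category term="8-K"/>
  </entry>
</feed>"""


def new_filings(feed_xml, seen):
    """Return entries whose accession number is not yet in `seen` (mutated in place)."""
    fresh = []
    for entry in ET.fromstring(feed_xml).iterfind("a:entry", ATOM):
        entry_id = entry.findtext("a:id", default="", namespaces=ATOM)
        accession = entry_id.rsplit("=", 1)[-1].strip()
        if not accession or accession in seen:
            continue
        seen.add(accession)
        category = entry.find("a:category", ATOM)
        fresh.append({
            "accession_number": accession,
            "form_type": category.get("term") if category is not None else None,
            "title": entry.findtext("a:title", default="", namespaces=ATOM),
        })
    return fresh


seen_accessions = set()
first_poll = new_filings(SAMPLE_FEED, seen_accessions)
second_poll = new_filings(SAMPLE_FEED, seen_accessions)  # same feed, nothing new
```

Deduplicating on the accession number means the same feed can be polled repeatedly without re-sending a webhook for a filing that has already been delivered.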

Applications

Who uses SEC data — and how

Teams across industries use sec.gov data to build competitive products and smarter operations.

01
Quantitative Trading

Ingesting 10-K and 10-Q financial statements into factor models for fundamental equity strategies.

02
Event-Driven Arbitrage

Monitoring 8-K filings for M&A announcements, executive departures, or bankruptcy declarations.

03
Insider Tracking

Analysing Form 4 filings to track executive sentiment and cluster insider buying signals.

04
Institutional Flow Analysis

Aggregating 13F data to track hedge fund positions, sector rotation, and institutional crowding.

05
Credit Risk Modelling

Extracting debt covenants, off-balance-sheet arrangements, and liquidity metrics from footnotes.

06
Alternative Data Aggregation

Feeding raw SEC text (Management Discussion & Analysis) into NLP models for sentiment analysis.

Why DataFlirt

"EDGAR is the definitive source of truth for US public markets, but extracting structured data from decades of inconsistent filing formats requires serious engineering."

Most teams underestimate the complexity of SEC data. Between strict rate limits, evolving XBRL taxonomies, and messy historical SGML files, maintaining a reliable EDGAR parser is a full-time job. DataFlirt handles the extraction, parsing, and normalisation so your quants can focus on signal generation.

Technical Spec

SEC EDGAR scraper — technical capabilities

Everything supported by our sec.gov scraper, from EDGAR REST endpoints and real-time feeds to legacy SGML backfills.

EDGAR REST API integration
Native consumption of SEC JSON endpoints for submissions and company facts
Supported
XBRL/iXBRL taxonomy resolution
Parsing US-GAAP, IFRS, and custom company extensions
Supported
Real-time RSS feed polling
Optimised polling for immediate detection of new filings
Supported
Historical backfills to 1994
Extraction of legacy text and SGML format filings
Supported
CIK to Ticker mapping
Point-in-time resolution of company identifiers
Supported
Full text extraction (HTML/TXT)
Clean text extraction stripping HTML boilerplate for NLP use cases
Supported
Exhibit image extraction
Downloading embedded graphics and charts from filings
Supported
Confidential Treatment Requests (CTRs)
Redacted commercial contracts as filed; unredacted versions are non-public and not available on EDGAR
Partial
Whistle-blower identity reports
Strictly confidential SEC enforcement tip submissions; never published on sec.gov
Not supported
Non-public investigative files
Ongoing SEC enforcement division internal documents; never published on sec.gov
Not supported
Infrastructure

Infrastructure powering the SEC pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, lxml, BeautifulSoup
Distributed Rate Limiting

Our Redis-backed queuing system ensures that aggregate requests across all pipeline nodes never exceed the SEC limit of 10 requests per second, preventing IP bans.

XBRL Parsing Engine

Custom Python modules built on lxml parse complex XML namespaces, resolve taxonomy references, and flatten multi-dimensional financial facts into tabular formats.
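
A simplified picture of that flattening step: resolve each fact's `contextRef` to a reporting period and emit one row per fact. The sketch uses the standard-library `xml.etree.ElementTree` so it runs without dependencies (lxml exposes a largely compatible API); the instance document is a hand-written fragment, and dimensional contexts are not handled here.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"

# Hand-written fragment; the net income figure mirrors the 10-K sample record.
SAMPLE_INSTANCE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <context id="FY2023">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000320193</identifier></entity>
    <period><startDate>2022-09-25</startDate><endDate>2023-09-30</endDate></period>
  </context>
  <unit id="usd"><measure>iso4217:USD</measure></unit>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="usd" decimals="-6">96995000000</us-gaap:NetIncomeLoss>
</xbrl>"""


def flatten_facts(instance_xml):
    """Return one row per numeric fact, with its period resolved from the context."""
    root = ET.fromstring(instance_xml)

    # Map context id -> period end (duration) or instant (point in time).
    periods = {}
    for ctx in root.iter(f"{XBRLI}context"):
        period = ctx.find(f"{XBRLI}period")
        periods[ctx.get("id")] = (
            period.findtext(f"{XBRLI}endDate") or period.findtext(f"{XBRLI}instant")
        )

    rows = []
    for element in root:
        context_id = element.get("contextRef")
        if context_id is None:
            continue  # contexts and units carry no contextRef; only facts do
        namespace, _, local_name = element.tag[1:].partition("}")
        rows.append({
            "concept": local_name,
            "namespace": namespace,
            "period_end": periods.get(context_id),
            "unit": element.get("unitRef"),
            "value": float(element.text),
        })
    return rows


fact_rows = flatten_facts(SAMPLE_INSTANCE)
```

Carrying the namespace on every row makes it straightforward to tell standard US-GAAP facts from company extensions further down the pipeline.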

Cloud-Native Orchestration

Pipelines run on Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Nested structures ideal for Form 4 and 13F data
CSV
Flat file with typed columns for financial statements
XLS
Excel-compatible exports for analyst teams
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery for bulk historical archives
Webhook
HTTP POST for real-time 8-K event triggers
API
REST endpoints to query specific CIKs or forms
BigQuery
Streamed directly into your dataset
Postgres
Upsert into your existing schema
S3
Direct bucket delivery — compatible with any data lake
// faq

Common questions.

About sec.gov scraping, legality, and pipeline operations.

Ask us directly →
Is scraping sec.gov legal?

Yes. SEC EDGAR data is public information published by the US government. However, users must adhere to the SEC fair access policy, which sets a maximum of 10 requests per second and requires a declared User-Agent. DataFlirt pipelines are built to comply with this policy natively.

How do you handle EDGAR's 10 requests per second limit?

We use a centralised Redis token bucket to throttle outbound requests across our entire infrastructure. We also declare the required company name and email address in the User-Agent header as mandated by SEC guidelines.
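
The token bucket itself is a small algorithm. The sketch below is a single-process version with an injectable clock, offered as an illustration only: in the setup described above the bucket state would live in Redis and be updated atomically, which is not shown. The class name, parameters, and the example User-Agent value are assumptions.

```python
import time


class TokenBucket:
    """Allow at most `capacity + rate * t` requests in any window of t seconds."""

    def __init__(self, rate=8.0, capacity=2.0, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Placeholder identity in the "company name plus contact email" style.
HEADERS = {"User-Agent": "Example Company admin@example.com"}

# Deterministic demonstration with a fake clock.
fake_now = [0.0]
bucket = TokenBucket(rate=8.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]
fake_now[0] = 0.125  # 0.125 s * 8 tokens/s refills exactly one token
after_refill = [bucket.try_acquire() for _ in range(2)]
```

The defaults are chosen so that burst plus refill over one second (2 + 8) never exceeds 10 requests; a bucket with both rate and capacity set to 10 could briefly allow close to 20 requests inside a single one-second window.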

Can you parse iXBRL (Inline XBRL)?

Yes. We extract facts embedded directly within HTML documents using the iXBRL specification, resolving contexts, units, and taxonomy references to produce clean tabular data.
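
As a rough illustration, the sketch below reads `ix:nonFraction` elements from a well-formed XHTML fragment and applies the `scale` and `sign` attributes to recover full values. It assumes comma thousands separators and standard-library parsing; the fragment is hand-written (the first value mirrors the 10-K sample record, the second is illustrative), and real filings need format-aware number parsing and context resolution on top of this.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

IX = "{http://www.xbrl.org/2013/inlineXBRL}"

SAMPLE_IXBRL = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Total net sales
      <ix:nonFraction name="us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax"
          contextRef="FY2023" unitRef="usd" scale="6" decimals="-6">383,285</ix:nonFraction>
    </p>
    <p>Other income/(expense), net
      (<ix:nonFraction name="us-gaap:NonoperatingIncomeExpense"
          contextRef="FY2023" unitRef="usd" scale="6" decimals="-6" sign="-">565</ix:nonFraction>)
    </p>
  </body>
</html>"""


def extract_inline_facts(xhtml):
    """Return numeric facts with display scaling and sign applied."""
    facts = []
    for element in ET.fromstring(xhtml).iter(f"{IX}nonFraction"):
        displayed = "".join(element.itertext()).replace(",", "").strip()
        # scale="6" means the displayed number is in millions.
        value = Decimal(displayed).scaleb(int(element.get("scale", "0")))
        if element.get("sign") == "-":
            value = -value
        facts.append({
            "concept": element.get("name"),
            "context": element.get("contextRef"),
            "unit": element.get("unitRef"),
            "value": int(value),
        })
    return facts


inline_facts = extract_inline_facts(SAMPLE_IXBRL)
```

Using `Decimal` rather than floats keeps large scaled values exact before they are written out as integers.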

How fast can I get 8-K filings?

For real-time pipelines, we poll the EDGAR RSS feeds at the maximum permitted frequency. Webhook payloads containing the parsed 8-K metadata and text are typically delivered within 60 seconds of publication on sec.gov.

Do you map CIKs to tickers?

Yes. We maintain a database mapping Central Index Keys (CIK) to current ticker symbols, company names, and SIC codes. We can also handle point-in-time ticker resolution for historical backtests.

What about older filings from the 1990s?

Pre-2004 filings were submitted in text format with SGML headers rather than XML/HTML. We maintain legacy parsers to extract metadata and financial tables from these older documents.

Can you extract text for NLP analysis?

Yes. We can strip HTML boilerplate, remove tables, and isolate specific sections (e.g., Item 1A Risk Factors or Management Discussion & Analysis) to provide clean text corpora for machine learning models.
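
Section isolation can be approximated with heading detection on the stripped text. The sketch below is a simplified assumption of how that might work: it finds every "Item N." heading, and when the same item appears more than once (for example in a table of contents) it keeps the longest body. The sample text and the function name `extract_item` are illustrative only.

```python
import re

# Headings such as "Item 1A." or "ITEM 7:" at the start of a line.
ITEM_RE = re.compile(r"^[ \t]*item\s+(\d+[a-z]?)\s*[.:\-]", re.IGNORECASE | re.MULTILINE)

SAMPLE_TEXT = """Item 1A. Risk Factors
Item 1B. Unresolved Staff Comments

Item 1. Business
We make widgets.
Item 1A. Risk Factors
Demand for widgets may fall.
Supply chains may be disrupted.
Item 1B. Unresolved Staff Comments
None.
"""


def extract_item(text, item="1A"):
    """Return the longest block of text that follows a heading for `item`."""
    marks = [(m.group(1).upper(), m.start(), m.end()) for m in ITEM_RE.finditer(text)]
    best = ""
    for index, (label, _start, body_start) in enumerate(marks):
        if label != item.upper():
            continue
        # The section runs until the next item heading, or to the end of the text.
        body_end = marks[index + 1][1] if index + 1 < len(marks) else len(text)
        body = text[body_start:body_end].strip()
        if len(body) > len(best):
            best = body
    return best


risk_factors = extract_item(SAMPLE_TEXT, "1A")
```

Choosing the longest candidate is a heuristic that skips table-of-contents entries; filings with unusual heading styles would need additional rules.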

$ dataflirt scope --new-project --source=sec.gov ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a historical backfill of 10-K filings or a real-time feed of Form 4 insider trades — we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h