SEC EDGAR data, parsed at scale.

We extract corporate filings, insider trading forms, institutional holdings, and XBRL financial statements from SEC EDGAR. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery on your cadence.

Get data from sec.gov → See how it works

Filings parsed

42,105 /day

XBRL facts extracted

8.4M /24h

Form 4 records

12,450 /run

Active pipelines

112

Uptime

99.98%

◆ SEC EDGAR Filings ◆ 10-K & 10-Q Extraction ◆ 8-K Event Triggers ◆ Form 4 Insider Trading ◆ 13F Institutional Holdings ◆ XBRL & iXBRL Parsing ◆ CIK to Ticker Mapping ◆ Mutual Fund Prospectuses ◆ Comment Letters (UPLOAD) ◆ Enforcement Actions ◆ Managed Pipeline ◆ S3 / BigQuery Delivery ◆ Bengaluru HQ ◆ Enterprise SLA

Data Dictionary

Every field we extract from sec.gov

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Company Metadata objects from sec.gov. All fields typed and schema-versioned.

cik, ticker, company_name, sic_code, sic_description, state_of_incorporation, fiscal_year_end, business_address, mailing_address, former_names

{
  "cik": "0000320193",
  "ticker": "AAPL",
  "company_name": "Apple Inc.",
  "sic_code": "3571",
  "sic_description": "Electronic Computers",
  "state_of_incorporation": "CA",
  "fiscal_year_end": "0930"
}

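The zero-padded cik in the Company Metadata sample is also the form the SEC's public JSON endpoints expect. As a minimal sketch, assuming you want to call those endpoints yourself, the URLs can be built as below. The helper names are hypothetical and are not part of any DataFlirt API.

```python
def normalise_cik(cik) -> str:
    """Return a CIK as the 10-digit, zero-padded string used in EDGAR URLs."""
    digits = str(cik).strip().lstrip("0") or "0"
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


def submissions_url(cik) -> str:
    # Filing history and company metadata for one filer.
    return f"https://data.sec.gov/submissions/CIK{normalise_cik(cik)}.json"


def company_facts_url(cik) -> str:
    # All XBRL facts reported by one filer.
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{normalise_cik(cik)}.json"


apple_submissions = submissions_url(320193)
apple_facts = company_facts_url("0000320193")
```

Both calls accept either an integer or an already padded string, so CIK lists from different sources can be mixed without pre-cleaning.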

Complete list of extractable fields for 10-K / 10-Q Financials objects from sec.gov. All fields typed and schema-versioned.

accession_number, cik, form_type, filing_date, period_of_report, xbrl_instance_url, total_revenue, net_income, eps_basic, total_assets, total_liabilities, operating_cash_flow

{
  "accession_number": "0000320193-23-000106",
  "form_type": "10-K",
  "filing_date": "2023-11-03",
  "period_of_report": "2023-09-30",
  "total_revenue": 383285000000,
  "net_income": 96995000000,
  "eps_basic": 6.16
}


Complete list of extractable fields for Form 4 (Insider Trading) objects from sec.gov. All fields typed and schema-versioned.

accession_number, issuer_cik, reporting_owner_cik, reporting_owner_name, is_director, is_officer, officer_title, transaction_date, transaction_code, shares_transacted, price_per_share, shares_owned_following

{
  "accession_number": "0000320193-24-000012",
  "reporting_owner_name": "COOK TIMOTHY D",
  "is_officer": true,
  "officer_title": "Chief Executive Officer",
  "transaction_date": "2024-04-01",
  "transaction_code": "S",
  "shares_transacted": 196410,
  "price_per_share": 169.33
}

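To illustrate how a Form 4 record of this shape could be derived from an ownership XML document, here is a minimal sketch using only the standard library. The sample document is abridged and invented, and the element names reflect the ownership XML layout as we understand it, so check them against the SEC's current schema before relying on them.

```python
import xml.etree.ElementTree as ET

# Abridged, illustrative ownership document (not a real filing).
SAMPLE_FORM4 = """
<ownershipDocument>
  <reportingOwner>
    <reportingOwnerId>
      <rptOwnerCik>0000000002</rptOwnerCik>
      <rptOwnerName>DOE JANE</rptOwnerName>
    </reportingOwnerId>
  </reportingOwner>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2024-04-01</value></transactionDate>
      <transactionCoding><transactionCode>S</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>1000</value></transactionShares>
        <transactionPricePerShare><value>169.33</value></transactionPricePerShare>
      </transactionAmounts>
    </nonDerivativeTransaction>
  </nonDerivativeTable>
</ownershipDocument>
"""


def parse_form4(xml_text: str) -> list[dict]:
    """Flatten each non-derivative transaction into one record."""
    root = ET.fromstring(xml_text.strip())
    owner = root.findtext("reportingOwner/reportingOwnerId/rptOwnerName")
    records = []
    for txn in root.iter("nonDerivativeTransaction"):
        amounts = "transactionAmounts"
        records.append({
            "reporting_owner_name": owner,
            "transaction_date": txn.findtext("transactionDate/value"),
            "transaction_code": txn.findtext("transactionCoding/transactionCode"),
            "shares_transacted": float(txn.findtext(f"{amounts}/transactionShares/value")),
            "price_per_share": float(txn.findtext(f"{amounts}/transactionPricePerShare/value")),
        })
    return records


form4_records = parse_form4(SAMPLE_FORM4)
```

Derivative transactions, footnotes, and multiple reporting owners are left out of this sketch; a production parser would need to handle all three.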

Complete list of extractable fields for Form 13F (Holdings) objects from sec.gov. All fields typed and schema-versioned.

accession_number, manager_name, report_calendar_or_quarter, name_of_issuer, title_of_class, cusip, value_usd, shares_prn_amount, put_call, investment_discretion

{
  "manager_name": "BERKSHIRE HATHAWAY INC",
  "report_calendar_or_quarter": "2023-12-31",
  "name_of_issuer": "APPLE INC",
  "title_of_class": "COM",
  "cusip": "037833100",
  "value_usd": 174347781000,
  "shares_prn_amount": 905560000
}

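Once 13F rows are in this shape, rolling them up by CUSIP is straightforward. The sketch below uses hypothetical managers, CUSIPs, and values purely to show the aggregation; it is not a description of DataFlirt's internal code.

```python
from collections import defaultdict

# Hypothetical holdings rows using the field names from the data dictionary.
holdings = [
    {"manager_name": "FUND A", "cusip": "000000AA1", "value_usd": 1_000_000, "shares_prn_amount": 10_000},
    {"manager_name": "FUND B", "cusip": "000000AA1", "value_usd": 2_500_000, "shares_prn_amount": 25_000},
    {"manager_name": "FUND A", "cusip": "000000BB2", "value_usd": 400_000, "shares_prn_amount": 8_000},
]


def aggregate_by_cusip(rows):
    """Sum reported value and share counts per CUSIP and collect the holders."""
    totals = defaultdict(lambda: {"value_usd": 0, "shares_prn_amount": 0, "managers": set()})
    for row in rows:
        bucket = totals[row["cusip"]]
        bucket["value_usd"] += row["value_usd"]
        bucket["shares_prn_amount"] += row["shares_prn_amount"]
        bucket["managers"].add(row["manager_name"])
    return dict(totals)


positions = aggregate_by_cusip(holdings)
```

Counting distinct managers per CUSIP, as done here with a set, is one simple way to approximate institutional crowding.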

Complete list of extractable fields for 8-K (Material Events) objects from sec.gov. All fields typed and schema-versioned.

accession_number, cik, company_name, filing_date, item_numbers, item_descriptions, html_url, text_url, exhibits_count, has_press_release

{
  "accession_number": "0001193125-24-023456",
  "cik": "0001652044",
  "company_name": "Alphabet Inc.",
  "filing_date": "2024-01-30",
  "item_numbers": ["2.02", "9.01"],
  "item_descriptions": ["Results of Operations and Financial Condition", "Financial Statements and Exhibits"],
  "has_press_release": true
}


Capabilities

Everything you need from EDGAR — parsed and normalised

Our SEC pipeline handles every layer of the EDGAR database: strict rate limits, complex XBRL taxonomies, historical SGML formats, and real-time event polling.

EDGAR Rate Limit Compliance

Built-in throttling and User-Agent declaration keep every crawler within the SEC's limit of 10 requests per second without missing filings.

XBRL & iXBRL Parsing

Extract and normalise financial facts from complex XBRL instance documents, handling custom taxonomies and dimensional data.

Real-Time 8-K Monitoring

Poll EDGAR RSS feeds at the maximum permitted frequency to detect material events, earnings releases, and executive changes, typically within 60 seconds of publication.

Insider Trading (Form 4) Extraction

Parse XML ownership documents to track executive buys, sells, and option exercises with exact share counts and prices.

Institutional Holdings (13F)

Aggregate quarterly 13F-HR filings to track hedge fund and institutional positions, mapped by CUSIP and issuer name.

Historical Filing Backfills

Access decades of historical filings. We parse legacy text and SGML formats from pre-2004 submissions.

CIK & Ticker Resolution

Map SEC Central Index Keys (CIK) to current and historical ticker symbols, accounting for mergers and name changes.

Exhibit & Attachment Downloading

Automatically extract and store material contracts, press releases, and EX-101 attachments associated with primary filings.

Comment Letter Analysis

Extract SEC UPLOAD and company CORRESP filings to monitor regulatory scrutiny and accounting inquiries.

Mutual Fund Prospectuses

Parse N-1A, 497, and N-PORT filings for fund objectives, fee structures, and portfolio holdings.

// engagement pipeline

From CIK list to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Provide CIK lists, ticker sets, or form types (e.g., 10-K, 8-K, Form 4). We design the extraction schema together.

Pipeline Build

Days 2–4

We configure Scrapy crawlers, EDGAR rate-limit queues, and XBRL parsing logic for sec.gov.

Validation & QA

Days 4–6

Schema validation, null-rate checks, taxonomy resolution, and historical data sampling before full launch.
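As an illustration of what a null-rate check can look like, here is a minimal sketch. The function names, sample records, and the 5% threshold are assumptions made for the example, not DataFlirt's actual QA rules.

```python
def null_rates(records, fields):
    """Share of records in which each field is missing or None."""
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in records if record.get(field) is None) / len(records)
        for field in fields
    }


def failing_fields(records, fields, max_null_rate=0.05):
    """Fields whose null rate exceeds the (hypothetical) threshold."""
    rates = null_rates(records, fields)
    return [field for field in fields if rates[field] > max_null_rate]


# Hypothetical sample: net_income is None once and absent once.
sample = [
    {"total_revenue": 10, "net_income": 1},
    {"total_revenue": 12, "net_income": None},
    {"total_revenue": 9, "net_income": 2},
    {"total_revenue": 11},
]
rates = null_rates(sample, ["total_revenue", "net_income"])
failures = failing_fields(sample, ["total_revenue", "net_income"])
```

Treating an absent key the same as an explicit None keeps the check meaningful for both JSON and tabular outputs.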

Delivery

ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our EDGAR pipeline handles the hard parts

SEC data is public, but extracting it cleanly is notoriously difficult. Here is how we manage the complexity of EDGAR.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued · running

// observability

Pipeline health

99.9% uptime · 142ms p99 latency · 0.3% null rate

Rate limits

Strict 10 requests per second throttling

The SEC may temporarily block IP addresses that exceed 10 requests per second. Our distributed queuing system enforces a global rate limit across all crawler nodes and sends the User-Agent header the SEC requires, so requests stay within the fair access policy.
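The throttling idea can be sketched with a token bucket. This is a single-process illustration only: a distributed setup would hold the bucket state in a shared store rather than in memory, and the class and parameter names here are assumptions for the example.

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the demo below is deterministic
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Deterministic demo with a fake clock instead of real sleeping.
fake_now = [0.0]
bucket = TokenBucket(rate=10, capacity=10, clock=lambda: fake_now[0])
burst_allowed = sum(bucket.try_acquire() for _ in range(15))
fake_now[0] = 0.5  # half a second later, five tokens have been refilled
later_allowed = sum(bucket.try_acquire() for _ in range(15))
```

A caller that receives False would wait briefly and retry rather than send the request, which is what keeps the aggregate rate under the limit.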

XBRL parsing

Navigating custom taxonomies

Companies frequently invent custom XBRL tags (extensions) when standard US-GAAP tags do not fit. Our parser maps custom extensions back to standard financial concepts, ensuring your revenue and net income columns remain populated.
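A lookup table is the simplest way to picture this mapping. In the sketch below the `acme:` extension tags and both dictionaries are hypothetical; real mappings would be curated per filer and taxonomy version.

```python
# Hypothetical company extension tags mapped to standard concepts.
EXTENSION_TO_STANDARD = {
    "acme:NetSalesIncludingServices": "us-gaap:Revenues",
    "acme:ProfitAttributableToOwners": "us-gaap:NetIncomeLoss",
}

# Standard concepts mapped to output columns from the data dictionary.
STANDARD_TO_COLUMN = {
    "us-gaap:Revenues": "total_revenue",
    "us-gaap:NetIncomeLoss": "net_income",
}


def build_row(facts):
    """Turn (tag, value) pairs into one output row plus a list of unmapped tags."""
    row, unmapped = {}, []
    for tag, value in facts:
        standard = EXTENSION_TO_STANDARD.get(tag, tag)
        column = STANDARD_TO_COLUMN.get(standard)
        if column is None:
            unmapped.append(tag)   # surfaced for review instead of silently dropped
        else:
            row[column] = value
    return row, unmapped


row, unmapped = build_row([
    ("acme:NetSalesIncludingServices", 1200),
    ("us-gaap:NetIncomeLoss", 300),
    ("acme:SomethingUnusual", 5),
])
```

Returning the unmapped tags separately makes it easy to spot new extensions that need a mapping decision.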

Legacy formats

Parsing pre-2004 SGML filings

Before XML became standard, EDGAR filings were submitted as massive, unstructured text files with loose SGML tags. We maintain custom regular expression pipelines to extract structured tables and metadata from these legacy documents.
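As a rough illustration of the regular-expression approach, the sketch below pulls metadata out of a legacy-style submission header. The header text is invented for the example, and the label set is an assumption covering only a few common fields.

```python
import re

# Invented header in the style of a legacy EDGAR text submission.
SAMPLE_HEADER = """\
<SEC-HEADER>
ACCESSION NUMBER:           0000000000-98-000001
CONFORMED SUBMISSION TYPE:  10-K
CONFORMED PERIOD OF REPORT: 19971231
FILED AS OF DATE:           19980331

    COMPANY DATA:
        COMPANY CONFORMED NAME:   EXAMPLE CORP
        CENTRAL INDEX KEY:        0000000001
</SEC-HEADER>
"""

# "LABEL:   value" on a single line; labels with no value (section titles) are skipped.
HEADER_LINE = re.compile(r"^[ \t]*([A-Z][A-Z ]*[A-Z]):[ \t]+(\S.*?)[ \t]*$", re.MULTILINE)

FIELD_MAP = {
    "ACCESSION NUMBER": "accession_number",
    "CONFORMED SUBMISSION TYPE": "form_type",
    "CONFORMED PERIOD OF REPORT": "period_of_report",
    "FILED AS OF DATE": "filing_date",
    "COMPANY CONFORMED NAME": "company_name",
    "CENTRAL INDEX KEY": "cik",
}
DATE_FIELDS = {"period_of_report", "filing_date"}


def parse_header(text: str) -> dict:
    record = {}
    for label, value in HEADER_LINE.findall(text):
        field = FIELD_MAP.get(label)
        if field is None:
            continue
        if field in DATE_FIELDS:
            value = f"{value[:4]}-{value[4:6]}-{value[6:]}"  # YYYYMMDD -> ISO date
        record[field] = value
    return record


legacy_record = parse_header(SAMPLE_HEADER)
```

Financial tables in the body of such filings are much less regular than the header, so they typically need separate, more forgiving patterns.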

CIK mapping

Point-in-time ticker resolution

Tickers change, companies merge, and CIKs remain static. We maintain a point-in-time mapping database so you can query historical filings by the ticker symbol that was active on the date of the filing.
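Point-in-time resolution can be pictured as a sorted history per CIK with a binary search on the filing date. The CIK, tickers, and dates below are hypothetical, and the in-memory dictionary stands in for whatever database holds the real mapping.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history per CIK: (effective_from, ticker), sorted by date.
TICKER_HISTORY = {
    "0000000001": [
        (date(2005, 1, 3), "OLDX"),
        (date(2019, 6, 10), "NEWX"),
    ],
}


def ticker_on(cik: str, as_of: date):
    """Return the ticker in effect for `cik` on `as_of`, or None if none applied yet."""
    history = TICKER_HISTORY.get(cik, [])
    effective_dates = [start for start, _ in history]
    index = bisect_right(effective_dates, as_of) - 1
    return history[index][1] if index >= 0 else None


before_change = ticker_on("0000000001", date(2010, 5, 1))
on_change_day = ticker_on("0000000001", date(2019, 6, 10))
before_listing = ticker_on("0000000001", date(2000, 1, 1))
```

Keying the history by CIK rather than by ticker is what lets the lookup survive renames, since the CIK stays constant.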

Real-time polling

Sub-minute event detection

For event-driven trading, waiting for a daily bulk dump is useless. We poll EDGAR RSS feeds at the maximum allowed frequency and push 8-K and Form 4 data via webhook, typically within 60 seconds of publication.
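The core of any polling loop is remembering what has already been seen. The sketch below assumes an Atom-formatted feed and uses invented entries; the function name and the in-memory set are illustrative stand-ins, and no network request is made.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def new_entries(feed_xml: str, seen: set) -> list[dict]:
    """Return feed entries whose id has not been seen before, and remember them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id")
        if entry_id and entry_id not in seen:
            seen.add(entry_id)
            fresh.append({"id": entry_id, "title": entry.findtext(f"{ATOM}title")})
    return fresh


# Two invented snapshots of the same feed, one poll apart.
FEED_1 = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>acc-0001</id><title>8-K - EXAMPLE CORP</title></entry>
  <entry><id>acc-0002</id><title>4 - DOE JANE</title></entry>
</feed>"""
FEED_2 = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>acc-0003</id><title>8-K - SAMPLE INC</title></entry>
  <entry><id>acc-0001</id><title>8-K - EXAMPLE CORP</title></entry>
  <entry><id>acc-0002</id><title>4 - DOE JANE</title></entry>
</feed>"""

seen_ids = set()
first_poll = new_entries(FEED_1, seen_ids)
second_poll = new_entries(FEED_2, seen_ids)
```

Only the entries returned by each poll would be parsed and forwarded to a webhook, so a filing is delivered once even though it stays in the feed for several polls.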

Applications

Who uses SEC data — and how

Teams across industries use sec.gov data to build competitive products and smarter operations.

Quantitative Trading

Ingesting 10-K and 10-Q financial statements into factor models for fundamental equity strategies.

Event-Driven Arbitrage

Monitoring 8-K filings for M&A announcements, executive departures, or bankruptcy declarations.

Insider Tracking

Analysing Form 4 filings to track executive sentiment and cluster insider buying signals.

Institutional Flow Analysis

Aggregating 13F data to track hedge fund positions, sector rotation, and institutional crowding.

Credit Risk Modelling

Extracting debt covenants, off-balance-sheet arrangements, and liquidity metrics from footnotes.

Alternative Data Aggregation

Feeding raw SEC text (Management Discussion & Analysis) into NLP models for sentiment analysis.

Why DataFlirt

"EDGAR is the definitive source of truth for US public markets, but extracting structured data from decades of inconsistent filing formats requires serious engineering."

Most teams underestimate the complexity of SEC data. Between strict rate limits, evolving XBRL taxonomies, and messy historical SGML files, maintaining a reliable EDGAR parser is a full-time job. DataFlirt handles the extraction, parsing, and normalisation so your quants can focus on signal generation.

Technical Spec

SEC EDGAR scraper — technical capabilities

Everything supported by our sec.gov scraper, from the EDGAR REST API and XBRL parsing to legacy SGML backfills.

EDGAR REST API integration

Native consumption of SEC JSON endpoints for submissions and company facts

Supported

XBRL/iXBRL taxonomy resolution

Parsing US-GAAP, IFRS, and custom company extensions

Supported

Real-time RSS feed polling

Optimised polling for immediate detection of new filings

Supported

Historical backfills to 1994

Extraction of legacy text and SGML format filings

Supported

CIK to Ticker mapping

Point-in-time resolution of company identifiers

Supported

Full text extraction (HTML/TXT)

Clean text extraction stripping HTML boilerplate for NLP use cases

Supported

Exhibit image extraction

Downloading embedded graphics and charts from filings

Supported

Confidential Treatment Requests (CTRs)

Publicly filed redacted contracts only; unredacted versions are non-public and are not collected

Partial

Whistle-blower identity reports

Strictly confidential SEC enforcement tip submissions, which are not published on EDGAR

Not supported

Non-public investigative files

Ongoing SEC enforcement division internal documents, which are not published on EDGAR

Not supported

Infrastructure

Infrastructure powering the SEC pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, lxml, BeautifulSoup

Distributed Rate Limiting

Our Redis-backed queuing system ensures that aggregate requests across all pipeline nodes never exceed the SEC limit of 10 requests per second, preventing IP bans.

XBRL Parsing Engine

Custom Python modules built on lxml parse complex XML namespaces, resolve taxonomy references, and flatten multi-dimensional financial facts into tabular formats.
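To show what flattening looks like in the simplest case, here is a sketch that turns a tiny XBRL instance into rows. It uses the standard library's ElementTree rather than lxml so it runs without extra packages, the sample instance is invented, and dimensional contexts are deliberately left out.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"

# Invented, minimal instance document with numeric facts only.
SAMPLE_INSTANCE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <xbrli:context id="FY2023">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000000001</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2023-01-01</xbrli:startDate>
      <xbrli:endDate>2023-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <us-gaap:Revenues contextRef="FY2023" unitRef="USD" decimals="-3">1200000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="USD" decimals="-3">300000</us-gaap:NetIncomeLoss>
</xbrli:xbrl>"""


def flatten_facts(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text)
    # Resolve each context id to its period end (or instant) date.
    period_end = {}
    for ctx in root.iter(f"{XBRLI}context"):
        period = ctx.find(f"{XBRLI}period")
        period_end[ctx.get("id")] = (
            period.findtext(f"{XBRLI}endDate") or period.findtext(f"{XBRLI}instant")
        )
    rows = []
    for element in root:
        context_ref = element.get("contextRef")
        if context_ref is None:
            continue  # contexts, units and other non-fact elements
        namespace, _, concept = element.tag[1:].partition("}")
        rows.append({
            "namespace": namespace,
            "concept": concept,
            "period_end": period_end.get(context_ref),
            "unit": element.get("unitRef"),
            "value": float(element.text),  # this sketch assumes numeric facts
        })
    return rows


fact_rows = flatten_facts(SAMPLE_INSTANCE)
```

Keeping the namespace alongside the concept name is what later allows company extensions to be told apart from standard taxonomy tags.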

Cloud-Native Orchestration

Pipelines run on Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Nested structures ideal for Form 4 and 13F data

CSV

Flat file with typed columns for financial statements

XLS

Excel-compatible exports for analyst teams

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery for bulk historical archives

Webhook

HTTP POST for real-time 8-K event triggers

API

REST endpoints to query specific CIKs or forms

BigQuery

Streamed directly into your dataset

Postgres

Upsert into your existing schema


// faq

Common questions.

About sec.gov scraping, legality, and pipeline operations.

Ask us directly →

Is scraping sec.gov legal?

Yes. SEC EDGAR data is public domain information provided by the US government. However, users must strictly adhere to the SEC fair access policies, which mandate a maximum of 10 requests per second and require specific User-Agent declarations. DataFlirt pipelines are built to comply with these regulations natively.

How do you handle EDGAR's 10 requests per second limit?

We use a centralised Redis token bucket to throttle outbound requests across our entire infrastructure. We also declare the required company name and email address in the User-Agent header as mandated by SEC guidelines.
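For teams making their own requests, the User-Agent declaration is a one-line header. The sketch below only builds the request object and does not send it; the company name, email address, and helper name are placeholders.

```python
import urllib.request


def edgar_request(url: str, company: str, contact_email: str) -> urllib.request.Request:
    """Build a request that declares who is calling, as the SEC asks of automated clients."""
    if "@" not in contact_email:
        raise ValueError("contact_email must be a reachable email address")
    headers = {"User-Agent": f"{company} {contact_email}"}
    return urllib.request.Request(url, headers=headers)


# Placeholder identity; replace with your own organisation and contact.
request = edgar_request(
    "https://data.sec.gov/submissions/CIK0000320193.json",
    company="Example Research",
    contact_email="data-team@example.com",
)
declared_agent = request.get_header("User-agent")
```

Passing the request to urllib.request.urlopen would perform the call, which should itself sit behind the rate limiter.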

Can you parse iXBRL (Inline XBRL)?

Yes. We extract facts embedded directly within HTML documents using the iXBRL specification, resolving contexts, units, and taxonomy references to produce clean tabular data.
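As a minimal sketch of inline fact extraction, the example below reads `ix:nonFraction` elements from a small, invented XHTML fragment and applies the `scale` attribute. It assumes well-formed XHTML and plain comma-grouped numbers; other display formats would need their own handling.

```python
import xml.etree.ElementTree as ET

IX = "{http://www.xbrl.org/2013/inlineXBRL}"

# Invented fragment: the displayed number is in millions (scale="6").
SAMPLE_IXBRL = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Revenue was $<ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
        unitRef="USD" scale="6" decimals="-6">1,200</ix:nonFraction> million.</p>
  </body>
</html>"""


def extract_inline_facts(xhtml: str) -> list[dict]:
    root = ET.fromstring(xhtml)
    facts = []
    for element in root.iter(f"{IX}nonFraction"):
        text = "".join(element.itertext()).replace(",", "").strip()
        value = float(text) * 10 ** int(element.get("scale", "0"))
        if element.get("sign") == "-":
            value = -value  # negative facts carry the sign as an attribute
        facts.append({
            "concept": element.get("name"),
            "context": element.get("contextRef"),
            "unit": element.get("unitRef"),
            "value": value,
        })
    return facts


inline_facts = extract_inline_facts(SAMPLE_IXBRL)
```

The displayed text says 1,200 but the extracted value is 1.2 billion, which is why scale handling matters before facts reach a table.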

How fast can I get 8-K filings?

For real-time pipelines, we poll the EDGAR RSS feeds at the maximum permitted frequency. Webhook payloads containing the parsed 8-K metadata and text are typically delivered within 60 seconds of publication on sec.gov.

Do you map CIKs to tickers?

Yes. We maintain a database mapping Central Index Keys (CIK) to current ticker symbols, company names, and SIC codes. We can also handle point-in-time ticker resolution for historical backtests.

What about older filings from the 1990s?

Pre-2004 filings were submitted in text format with SGML headers rather than XML/HTML. We maintain legacy parsers to extract metadata and financial tables from these older documents.

Can you extract text for NLP analysis?

Yes. We can strip HTML boilerplate, remove tables, and isolate specific sections (e.g., Item 1A Risk Factors or Management Discussion & Analysis) to provide clean text corpora for machine learning models.
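Section isolation can be approximated by splitting on item headings. The sketch below works on already cleaned plain text with an invented sample; the heading pattern is an assumption and real filings vary enough that it would need tuning.

```python
import re

# Matches headings such as "Item 1." or "ITEM 1A." at the start of a line.
ITEM_HEADING = re.compile(r"^\s*Item\s+(\d+[A-Z]?)\.", re.IGNORECASE | re.MULTILINE)


def split_items(text: str) -> dict:
    """Map each item number to the text between its heading and the next heading."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        sections[match.group(1).upper()] = text[match.end():end].strip()
    return sections


# Invented, heavily shortened filing text.
SAMPLE_TEXT = """Item 1. Business
We make widgets.
Item 1A. Risk Factors
Demand for widgets may fall.
Item 1B. Unresolved Staff Comments
None.
"""

sections = split_items(SAMPLE_TEXT)
risk_factors = sections["1A"]
```

If a table of contents repeats the same headings, later occurrences overwrite earlier ones in this sketch, which usually leaves the full section body rather than the contents entry.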

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a historical backfill of 10-K filings or a real-time feed of Form 4 insider trades — we scope, build, and operate the pipeline. Tell us what you need.

Start a sec.gov pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

