We extract corporate filings, insider trading forms, institutional holdings, and XBRL financial statements from SEC EDGAR. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Company Metadata objects from sec.gov. All fields typed and schema-versioned.
{"cik": "0000320193", "ticker": "AAPL", "company_name": "Apple Inc.", "sic_code": "3571", "sic_description": "Electronic Computers", "state_of_incorporation": "CA", "fiscal_year_end": "0930"}
| # | cik | ticker | company_name | sic_code | sic_description | state_of_incorporation |
|---|---|---|---|---|---|---|
Complete list of extractable fields for 10-K / 10-Q Financials objects from sec.gov. All fields typed and schema-versioned.
{"accession_number": "0000320193-23-000106", "form_type": "10-K", "filing_date": "2023-11-03", "period_of_report": "2023-09-30", "total_revenue": 383285000000, "net_income": 96995000000, "eps_basic": 6.16}
| # | accession_number | cik | form_type | filing_date | period_of_report | xbrl_instance_url |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Form 4 (Insider Trading) objects from sec.gov. All fields typed and schema-versioned.
{"accession_number": "0000320193-24-000012", "reporting_owner_name": "COOK TIMOTHY D", "is_officer": true, "officer_title": "Chief Executive Officer", "transaction_date": "2024-04-01", "transaction_code": "S", "shares_transacted": 196410, "price_per_share": 169.33}
| # | accession_number | issuer_cik | reporting_owner_cik | reporting_owner_name | is_director | is_officer |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Form 13F (Holdings) objects from sec.gov. All fields typed and schema-versioned.
{"manager_name": "BERKSHIRE HATHAWAY INC", "report_calendar_or_quarter": "2023-12-31", "name_of_issuer": "APPLE INC", "title_of_class": "COM", "cusip": "037833100", "value_usd": 174347781000, "shares_prn_amount": 905560000}
| # | accession_number | manager_name | report_calendar_or_quarter | name_of_issuer | title_of_class | cusip |
|---|---|---|---|---|---|---|
Complete list of extractable fields for 8-K (Material Events) objects from sec.gov. All fields typed and schema-versioned.
{"accession_number": "0001193125-24-023456", "cik": "0001652044", "company_name": "Alphabet Inc.", "filing_date": "2024-01-30", "item_numbers": ["2.02", "9.01"], "item_descriptions": ["Results of Operations and Financial Condition", "Financial Statements and Exhibits"], "has_press_release": true}
| # | accession_number | cik | company_name | filing_date | item_numbers | item_descriptions |
|---|---|---|---|---|---|---|
Our SEC pipeline handles every layer of the EDGAR database: strict rate limits, complex XBRL taxonomies, historical SGML formats, and real-time event polling.
Built-in throttling and User-Agent declaration to respect the SEC's limit of 10 requests per second without missing filings.
Extract and normalise financial facts from complex XBRL instance documents, handling custom taxonomies and dimensional data.
Poll EDGAR RSS feeds at the maximum permitted frequency to detect material events, earnings releases, and executive changes shortly after publication.
Parse XML ownership documents to track executive buys, sells, and option exercises with exact share counts and prices.
Aggregate quarterly 13F-HR filings to track hedge fund and institutional positions, mapped by CUSIP and issuer name.
Access decades of historical filings. We parse legacy text and SGML formats from pre-2004 submissions.
Map SEC Central Index Keys (CIK) to current and historical ticker symbols, accounting for mergers and name changes.
Automatically extract and store material contracts, press releases, and EX-101 attachments associated with primary filings.
Extract SEC UPLOAD and company CORRESP filings to monitor regulatory scrutiny and accounting inquiries.
Parse N-1A, 497, and N-PORT filings for fund objectives, fee structures, and portfolio holdings.
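To make the Form 4 ownership parsing mentioned above concrete, here is a minimal Python sketch using only the standard library. It assumes the element paths of the SEC ownership XML layout (`ownershipDocument`, `nonDerivativeTransaction`, and so on); the sample document is hand-written to mirror the Form 4 sample record earlier on this page, and `parse_form4` is a hypothetical helper name, not a published API.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; a real filing carries many more elements.
SAMPLE_FORM4 = """<ownershipDocument>
  <issuer><issuerCik>0000320193</issuerCik></issuer>
  <reportingOwner>
    <reportingOwnerId><rptOwnerName>COOK TIMOTHY D</rptOwnerName></reportingOwnerId>
    <reportingOwnerRelationship>
      <isOfficer>1</isOfficer>
      <officerTitle>Chief Executive Officer</officerTitle>
    </reportingOwnerRelationship>
  </reportingOwner>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2024-04-01</value></transactionDate>
      <transactionCoding><transactionCode>S</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>196410</value></transactionShares>
        <transactionPricePerShare><value>169.33</value></transactionPricePerShare>
      </transactionAmounts>
    </nonDerivativeTransaction>
  </nonDerivativeTable>
</ownershipDocument>"""


def _text(node, path):
    """Return stripped text at `path`, or None when the element is absent or empty."""
    found = node.find(path)
    return found.text.strip() if found is not None and found.text else None


def parse_form4(xml_text):
    """Flatten each non-derivative transaction into one row (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    relationship = "reportingOwner/reportingOwnerRelationship/"
    owner = {
        "issuer_cik": _text(root, "issuer/issuerCik"),
        "reporting_owner_name": _text(root, "reportingOwner/reportingOwnerId/rptOwnerName"),
        # Filings encode booleans as either "1"/"0" or "true"/"false".
        "is_officer": _text(root, relationship + "isOfficer") in ("1", "true"),
        "officer_title": _text(root, relationship + "officerTitle"),
    }
    rows = []
    for txn in root.iterfind("nonDerivativeTable/nonDerivativeTransaction"):
        shares = _text(txn, "transactionAmounts/transactionShares/value")
        price = _text(txn, "transactionAmounts/transactionPricePerShare/value")
        rows.append({
            **owner,
            "transaction_date": _text(txn, "transactionDate/value"),
            "transaction_code": _text(txn, "transactionCoding/transactionCode"),
            "shares_transacted": float(shares) if shares else None,
            "price_per_share": float(price) if price else None,
        })
    return rows


rows = parse_form4(SAMPLE_FORM4)
```

Derivative transactions (option exercises) live in a separate `derivativeTable` and would need their own loop; this sketch leaves them out.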
Brief in. Clean data out.
Provide CIK lists, ticker sets, or form types (e.g., 10-K, 8-K, Form 4). We design the extraction schema together.
We configure Scrapy crawlers, EDGAR rate-limit queues, and XBRL parsing logic for sec.gov.
Schema validation, null-rate checks, taxonomy resolution, and historical data sampling before full launch.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
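As an illustration of the null-rate check in the QA step above, a sketch along these lines could flag fields that are empty too often in a sample batch. The function names and the 5% threshold are assumptions made for the example, not fixed parts of the pipeline.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing or None."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if rec.get(field) is None) / total
        for field in fields
    }


def fields_over_threshold(records, fields, max_null_rate=0.05):
    """Names of fields whose null rate exceeds the (assumed) threshold."""
    rates = null_rates(records, fields)
    return sorted(field for field, rate in rates.items() if rate > max_null_rate)


# Hypothetical sample batch: one of four records lacks net_income.
sample = [
    {"total_revenue": 100, "net_income": 10},
    {"total_revenue": 200, "net_income": None},
    {"total_revenue": 300, "net_income": 30},
    {"total_revenue": 400, "net_income": 40},
]
rates = null_rates(sample, ["total_revenue", "net_income"])
flagged = fields_over_threshold(sample, ["total_revenue", "net_income"])
```

A check like this is cheap to run on every delivery, so a sudden jump in a field's null rate can be caught before the data reaches downstream models.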
SEC data is public, but extracting it cleanly is notoriously difficult. Here is how we manage the complexity of EDGAR.
The SEC temporarily blocks IP addresses that exceed 10 requests per second. Our distributed queuing system enforces a global rate limit across all crawler nodes and sends the User-Agent header the SEC requires, so crawls are not interrupted by blocks.
Companies frequently invent custom XBRL tags (extensions) when standard US-GAAP tags do not fit. Our parser maps custom extensions back to standard financial concepts, ensuring your revenue and net income columns remain populated.
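One simple way to picture the extension mapping described above is a curated alias table that points each custom tag at the standard concept it stands in for. The sketch below assumes such a table already exists; the `acme:` tag, the column names, and `normalise_facts` are all hypothetical, and a production mapper would likely also consult the filing's linkbases.

```python
# Standard concepts mapped to output columns (illustrative subset).
STANDARD_COLUMNS = {
    "us-gaap:Revenues": "total_revenue",
    "us-gaap:NetIncomeLoss": "net_income",
}


def normalise_facts(facts, extension_aliases):
    """Map reported concepts to output columns, resolving known extensions.

    `facts` maps concept name -> value for one filing and period.
    `extension_aliases` maps a custom tag to the standard concept it replaces.
    """
    row = {}
    for concept, value in facts.items():
        standard = extension_aliases.get(concept, concept)
        column = STANDARD_COLUMNS.get(standard)
        if column is None:
            continue  # concept is not part of the output schema
        # A directly reported standard tag always wins over an aliased extension.
        if concept == standard or column not in row:
            row[column] = value
    return row


# Hypothetical filer that reports revenue under a custom tag.
aliases = {"acme:TotalNetRevenues": "us-gaap:Revenues"}
row = normalise_facts(
    {"acme:TotalNetRevenues": 1200, "us-gaap:NetIncomeLoss": 300},
    aliases,
)
```

Preferring the standard tag when both are present keeps the alias table from overriding a value the filer reported unambiguously.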
Before XML became standard, EDGAR filings were submitted as massive, unstructured text files with loose SGML tags. We maintain custom regular expression pipelines to extract structured tables and metadata from these legacy documents.
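For a sense of what regular-expression extraction from a legacy submission looks like, the sketch below pulls a few `KEY: value` pairs out of an SGML header block. The sample header is invented (placeholder company, CIK, and accession number), and the field map covers only a handful of keys; it is an illustration under those assumptions rather than our production parser.

```python
import re
from datetime import datetime

# Matches "KEY NAME:   value" on a single line; [ \t] keeps matches from spanning lines.
HEADER_LINE = re.compile(
    r"^[ \t]*([A-Z][A-Z0-9 /.\-]*):[ \t]+(\S[^\n]*?)[ \t]*$", re.MULTILINE
)

FIELD_MAP = {
    "ACCESSION NUMBER": "accession_number",
    "CONFORMED SUBMISSION TYPE": "form_type",
    "CONFORMED PERIOD OF REPORT": "period_of_report",
    "FILED AS OF DATE": "filing_date",
    "COMPANY CONFORMED NAME": "company_name",
    "CENTRAL INDEX KEY": "cik",
}
DATE_FIELDS = {"period_of_report", "filing_date"}

# Invented sample; real headers contain many more fields.
SAMPLE_HEADER = """<SEC-HEADER>0000000000-00-000001.hdr.sgml : 20000315
ACCESSION NUMBER:        0000000000-00-000001
CONFORMED SUBMISSION TYPE:    10-K
CONFORMED PERIOD OF REPORT:    19991231
FILED AS OF DATE:        20000315

FILER:

    COMPANY DATA:
        COMPANY CONFORMED NAME:            EXAMPLE CORP
        CENTRAL INDEX KEY:            0000000000
</SEC-HEADER>
"""


def parse_sgml_header(text):
    """Extract mapped header fields, converting YYYYMMDD dates to ISO format."""
    header = text.split("</SEC-HEADER>", 1)[0]
    out = {}
    for key, value in HEADER_LINE.findall(header):
        column = FIELD_MAP.get(key.strip())
        if column is None or column in out:
            continue  # unknown key, or a repeat (keep the first occurrence)
        if column in DATE_FIELDS:
            value = datetime.strptime(value, "%Y%m%d").date().isoformat()
        out[column] = value
    return out


meta = parse_sgml_header(SAMPLE_HEADER)
```

Section labels such as `FILER:` carry no value on the same line, so the pattern skips them and only captures true key-value pairs.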
Tickers change and companies merge, but CIKs remain static. We maintain a point-in-time mapping database so you can query historical filings by the ticker symbol that was active on the date of the filing.
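A point-in-time lookup of this kind can be sketched as a sorted list of effective dates per CIK, searched with `bisect`. The CIK, tickers, and dates below are placeholders, and the in-memory dictionary stands in for what would be a database table in practice.

```python
from bisect import bisect_right
from datetime import date

# Placeholder history: cik -> [(effective_from, ticker), ...] sorted by date.
TICKER_HISTORY = {
    "0000000001": [(date(2012, 5, 18), "OLD"), (date(2022, 6, 9), "NEW")],
}


def ticker_on(cik, on_date):
    """Ticker that was active for `cik` on `on_date`, or None if not yet listed."""
    history = TICKER_HISTORY.get(cik, [])
    starts = [start for start, _ in history]
    idx = bisect_right(starts, on_date) - 1
    return history[idx][1] if idx >= 0 else None


def cik_for(ticker, on_date):
    """Reverse lookup: which CIK traded under `ticker` on `on_date`."""
    for cik in TICKER_HISTORY:
        if ticker_on(cik, on_date) == ticker:
            return cik
    return None
```

For example, `cik_for("OLD", date(2015, 1, 1))` resolves to the placeholder CIK, while the same ticker queried on a date after the change returns `None`, which is the behaviour a backtest needs to avoid look-ahead bias.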
For event-driven trading, waiting for a daily bulk dump is useless. We poll EDGAR RSS feeds at the maximum allowed frequency and push 8-K and Form 4 data via webhook, typically within 60 seconds of publication.
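The core of a polling loop is deduplication: each poll should surface only filings not seen before. The sketch below assumes an Atom feed whose entry `<id>` ends in `accession-number=...`; the sample feed is hand-written (reusing the 8-K sample record from earlier on this page, with an invented timestamp), and fetching the feed and pushing the webhook are deliberately left out.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample feed for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>8-K - Alphabet Inc. (0001652044) (Filer)</title>
    <updated>2024-01-30T16:05:00-05:00</updated>
    <id>urn:tag:sec.gov,2008:accession-number=0001193125-24-023456</id>
  </entry>
</feed>"""


def new_entries(feed_xml, seen):
    """Return entries whose accession number is not in `seen`, and record them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iterfind("atom:entry", ATOM):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM)
        # Assumed id layout: "...accession-number=<accession>".
        accession = entry_id.rsplit("accession-number=", 1)[-1].strip()
        if not accession or accession in seen:
            continue
        seen.add(accession)
        fresh.append({
            "accession_number": accession,
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM),
        })
    return fresh


seen_accessions = set()
first_poll = new_entries(SAMPLE_FEED, seen_accessions)
second_poll = new_entries(SAMPLE_FEED, seen_accessions)  # same feed, nothing new
```

In a multi-node deployment the `seen` set would need to live in shared storage so that two pollers do not both emit the same filing.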
Ingesting 10-K and 10-Q financial statements into factor models for fundamental equity strategies.
Monitoring 8-K filings for M&A announcements, executive departures, or bankruptcy declarations.
Analysing Form 4 filings to track executive sentiment and cluster insider buying signals.
Aggregating 13F data to track hedge fund positions, sector rotation, and institutional crowding.
Extracting debt covenants, off-balance-sheet arrangements, and liquidity metrics from footnotes.
Feeding raw SEC text (such as Management's Discussion and Analysis) into NLP models for sentiment analysis.
"EDGAR is the definitive source of truth for US public markets, but extracting structured data from decades of inconsistent filing formats requires serious engineering."
Most teams underestimate the complexity of SEC data. Between strict rate limits, evolving XBRL taxonomies, and messy historical SGML files, maintaining a reliable EDGAR parser is a full-time job. DataFlirt handles the extraction, parsing, and normalisation so your quants can focus on signal generation.
Everything supported by our sec.gov scraper: rate-limit compliance, XBRL and iXBRL parsing, legacy SGML handling, and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Our Redis-backed queuing system ensures that aggregate requests across all pipeline nodes never exceed the SEC limit of 10 requests per second, preventing IP bans.
Custom Python modules built on lxml parse complex XML namespaces, resolve taxonomy references, and flatten multi-dimensional financial facts into tabular formats.
Pipelines run on Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.
Data delivered to where your team already works — no new tooling required.
About sec.gov scraping, legality, and pipeline operations.
Yes. SEC EDGAR data is public domain information provided by the US government. However, users must strictly adhere to the SEC fair access policies, which mandate a maximum of 10 requests per second and require specific User-Agent declarations. DataFlirt pipelines are built to comply with these regulations natively.
We use a centralised Redis token bucket to throttle outbound requests across our entire infrastructure. We also declare the required company name and email address in the User-Agent header as mandated by SEC guidelines.
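The token-bucket idea can be shown in a few lines. The sketch below is a single-process version with an injectable clock; the centralised Redis variant described above would keep the same state (token count and last-refill time) in a shared key instead. The class name, the default capacity of one token, and the `sec_headers` helper are assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow at most `rate` acquisitions per second, with bursts up to `capacity`."""

    def __init__(self, rate=10.0, capacity=1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller must wait."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def sec_headers(company, email):
    """Headers declaring who is making the request, as SEC guidance asks."""
    return {
        "User-Agent": f"{company} {email}",
        "Accept-Encoding": "gzip, deflate",
    }


headers = sec_headers("Example Co", "admin@example.com")  # placeholder identity
```

A capacity of one token spaces requests at least 100 ms apart. A larger capacity permits bursts, but a full bucket plus refill could then exceed 10 requests inside a single one-second window, so the smaller value is the safer default here.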
Yes. We extract facts embedded directly within HTML documents using the iXBRL specification, resolving contexts, units, and taxonomy references to produce clean tabular data.
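To illustrate what resolving an inline fact involves, the sketch below reads `ix:nonFraction` elements from a small XHTML fragment and applies the `scale` and `sign` attributes. The fragment is hypothetical, the parser uses the standard library rather than lxml, and it only handles comma-grouped numbers; real documents declare a `format` attribute with further transformation rules that this example ignores.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

IX = "{http://www.xbrl.org/2013/inlineXBRL}"

# Hypothetical fragment; values are displayed in millions (scale="6").
SAMPLE_IXBRL = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
<body>
<p>Revenue: <ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
   unitRef="usd" scale="6" decimals="-6">383,285</ix:nonFraction></p>
<p>Net loss: (<ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2023"
   unitRef="usd" scale="6" decimals="-6" sign="-">1,250</ix:nonFraction>)</p>
</body></html>"""


def extract_ix_facts(xhtml):
    """Return one dict per ix:nonFraction with the scaled, signed value."""
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter(IX + "nonFraction"):
        raw = "".join(el.itertext()).replace(",", "").strip()
        value = Decimal(raw) * (Decimal(10) ** int(el.get("scale", "0")))
        if el.get("sign") == "-":
            value = -value  # displayed magnitude is positive; sign lives in the attribute
        facts.append({
            "concept": el.get("name"),
            "context": el.get("contextRef"),
            "unit": el.get("unitRef"),
            "value": value,
        })
    return facts


facts = extract_ix_facts(SAMPLE_IXBRL)
```

The `contextRef` and `unitRef` values are only identifiers; turning them into a period and a currency means looking up the matching `xbrli:context` and `xbrli:unit` definitions elsewhere in the document.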
For real-time pipelines, we poll the EDGAR RSS feeds at the maximum permitted frequency. Webhook payloads containing the parsed 8-K metadata and text are typically delivered within 60 seconds of publication on sec.gov.
Yes. We maintain a database mapping Central Index Keys (CIK) to current ticker symbols, company names, and SIC codes. We can also handle point-in-time ticker resolution for historical backtests.
Pre-2004 filings were submitted in text format with SGML headers rather than XML/HTML. We maintain legacy parsers to extract metadata and financial tables from these older documents.
Yes. We can strip HTML boilerplate, remove tables, and isolate specific sections (e.g., Item 1A Risk Factors or Management's Discussion and Analysis) to provide clean text corpora for machine learning models.
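A rough sketch of section isolation, using only the Python standard library: strip tags and tables, then take the longest span between an "Item 1A" heading and the next "Item 1B" heading. The longest-span rule is a common heuristic for skipping the table of contents, not a guarantee, and the sample filing and function names are invented for the example.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"table", "script", "style"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside tables, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def extract_section(html, start=r"item\s+1a\.?", end=r"item\s+1b\.?"):
    """Longest text span from a `start` heading up to the next `end` heading."""
    text = html_to_text(html)
    end_re = re.compile(end, re.IGNORECASE)
    best = None
    for s in re.finditer(start, text, re.IGNORECASE):
        e = end_re.search(text, s.end())
        if e is None:
            continue
        candidate = text[s.start():e.start()].strip()
        if best is None or len(candidate) > len(best):
            best = candidate
    return best


# Invented miniature filing with a table of contents and one data table.
SAMPLE_10K = """<html><body>
<p>Table of contents: Item 1A. Risk Factors 12 Item 1B. Unresolved Staff Comments 20</p>
<h2>Item 1A. Risk Factors</h2>
<p>We face intense competition.</p>
<table><tr><td>9,999</td></tr></table>
<p>Supply chains may be disrupted.</p>
<h2>Item 1B. Unresolved Staff Comments</h2>
<p>None.</p>
</body></html>"""

risk_factors = extract_section(SAMPLE_10K)
```

Joining text nodes with spaces can split a word that is broken up by inline tags, so headings styled letter by letter would need extra handling.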
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a historical backfill of 10-K filings or a real-time feed of Form 4 insider trades — we scope, build, and operate the pipeline. Tell us what you need.