
Football statistics, at warehouse scale.

We extract match results, league tables, form guides, goal timing stats, and H2H records from Soccerstats. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Matches extracted: 18.2K/week
League tables updated: 412/day
Historical seasons: 24
Active pipelines: 41
Uptime: 99.94%
Data Dictionary

Every field we extract from soccerstats.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Match Results objects from soccerstats.com. All fields typed and schema-versioned.

Fields: match_id, date, league, home_team, away_team, full_time_home_goals, full_time_away_goals, half_time_home_goals, half_time_away_goals, stadium, attendance, referee

match_results · 200 OK
{
  "match_id": "eng_pr_2023_114",
  "date": "2023-10-21",
  "home_team": "Arsenal",
  "away_team": "Chelsea",
  "full_time_home_goals": 2,
  "full_time_away_goals": 2,
  "half_time_home_goals": 0,
  "half_time_away_goals": 1
}

Complete list of extractable fields for League Tables objects from soccerstats.com. All fields typed and schema-versioned.

Fields: league_id, season, rank, team_name, matches_played, wins, draws, losses, goals_for, goals_against, goal_difference, points, form_last_6

league_tables · 200 OK
{
  "rank": 1,
  "team_name": "Manchester City",
  "matches_played": 38,
  "wins": 28,
  "draws": 7,
  "losses": 3,
  "goal_difference": 62,
  "points": 91,
  "form_last_6": "WWWWWW"
}

Complete list of extractable fields for Goal Timing objects from soccerstats.com. All fields typed and schema-versioned.

Fields: team_name, league, total_goals_scored, goals_0_15, goals_16_30, goals_31_45, goals_46_60, goals_61_75, goals_76_90, late_goals_percentage

goal_timing · 200 OK
{
  "team_name": "Liverpool",
  "total_goals_scored": 84,
  "goals_0_15": 12,
  "goals_16_30": 14,
  "goals_76_90": 22,
  "late_goals_percentage": 26.2
}
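
As a rough illustration of how a derived field such as late_goals_percentage could be computed, the sketch below assumes it is the share of all goals scored in the 76-90 minute bucket. That definition is an assumption inferred from the sample record, not a documented formula, and the function name is hypothetical.

```python
def late_goals_percentage(goals_76_90: int, total_goals_scored: int) -> float:
    """Share of goals scored in the 76-90 bucket, as a percentage (1 d.p.).

    Assumed definition: 100 * goals_76_90 / total_goals_scored.
    """
    if total_goals_scored == 0:
        return 0.0  # avoid division by zero for teams with no goals
    return round(100 * goals_76_90 / total_goals_scored, 1)


# Values taken from the Liverpool sample record above (22 of 84 goals).
print(late_goals_percentage(22, 84))
```

With the sample values this prints 26.2, which agrees with the sample record.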

Complete list of extractable fields for Head-to-Head objects from soccerstats.com. All fields typed and schema-versioned.

Fields: team_a, team_b, total_matches, team_a_wins, draws, team_b_wins, team_a_goals, team_b_goals, last_meeting_date, last_meeting_result

head-to-head · 200 OK
{
  "team_a": "Real Madrid",
  "team_b": "Barcelona",
  "total_matches": 254,
  "team_a_wins": 103,
  "draws": 52,
  "team_b_wins": 99,
  "last_meeting_date": "2023-10-28",
  "last_meeting_result": "1-2"
}

Complete list of extractable fields for Over/Under Stats objects from soccerstats.com. All fields typed and schema-versioned.

Fields: team_name, matches_played, over_0_5_pct, over_1_5_pct, over_2_5_pct, over_3_5_pct, btts_pct, clean_sheet_pct, failed_to_score_pct

over/under_stats · 200 OK
{
  "team_name": "Bayern Munich",
  "matches_played": 34,
  "over_1_5_pct": 94.1,
  "over_2_5_pct": 82.4,
  "over_3_5_pct": 58.8,
  "btts_pct": 61.8,
  "clean_sheet_pct": 32.4
}

Capabilities

Deep football statistics parsed into clean schemas

Soccerstats contains a wealth of data trapped in legacy HTML table structures. We handle the complex DOM traversal, team name normalisation, and historical archiving.

League & Form Tables

Extract overall, home, and away league tables, alongside rolling 6-match and 8-match form guides for every team.

Match Results Archive

Capture full-time and half-time scores, match dates, and venue details across thousands of historical fixtures.

Goal Timing Analysis

Extract 15-minute interval breakdowns for goals scored and conceded, enabling deep in-play probability modelling.

Over/Under & BTTS Metrics

Pull percentage frequencies for Over 0.5, 1.5, 2.5, and 3.5 goals, plus Both Teams To Score (BTTS), clean-sheet, and failed-to-score rates.

Head-to-Head Records

Scrape historical matchups between specific teams, including aggregate goals, win distributions, and recent meeting results.

Home vs Away Splits

Isolate team performance metrics based on venue, capturing the statistical impact of home advantage.

Referee Statistics

Extract cards per game, fouls awarded, and penalty frequencies broken down by individual match officials.

Legacy HTML Parsing

Our parsers navigate deeply nested, classless table structures to extract reliable data without schema breakage.

Scheduled Updates

Configure daily or weekly pipelines to capture weekend fixture results and updated league standings automatically.

// engagement pipeline

From target leagues to warehouse records

Brief in. Clean data out.

Define Scope (day 0)

Provide the target leagues, seasons, and statistical categories. We design the relational extraction schema.

Pipeline Build (days 2–4)

We configure Scrapy crawlers with custom lxml parsers to navigate the nested table structures of soccerstats.com.

Validation & QA (days 4–6)

Schema validation, team name normalisation checks, and data type enforcement before full pipeline launch.

Delivery (ongoing)

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
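
The Validation & QA step above could include row-level checks along the following lines. This is a minimal sketch, not the production validator: the function name is hypothetical, and the consistency rules (three points per win, no points deductions) are assumptions that will not hold for every league or season.

```python
def validate_league_row(row: dict) -> list[str]:
    """Return a list of problems found in one league_tables record."""
    errors = []
    int_fields = ("rank", "matches_played", "wins", "draws", "losses", "points")
    for name in int_fields:
        value = row.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(f"{name}: expected int, got {value!r}")
    if errors:
        return errors  # arithmetic checks are meaningless on bad types

    if row["wins"] + row["draws"] + row["losses"] != row["matches_played"]:
        errors.append("wins + draws + losses != matches_played")
    # Assumption: 3 points per win, 1 per draw, no deductions.
    if 3 * row["wins"] + row["draws"] != row["points"]:
        errors.append("points != 3 * wins + draws")

    form = row.get("form_last_6", "")
    if not isinstance(form, str) or len(form) > 6 or set(form) - set("WDL"):
        errors.append("form_last_6: expected up to 6 characters from W/D/L")
    return errors


# Sample record from the data dictionary above.
sample = {
    "rank": 1, "team_name": "Manchester City", "matches_played": 38,
    "wins": 28, "draws": 7, "losses": 3, "points": 91,
    "form_last_6": "WWWWWW",
}
print(validate_league_row(sample))
```

The sample record passes both arithmetic checks (28 + 7 + 3 = 38 and 3 × 28 + 7 = 91), so the returned list is empty.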

Under the hood

Navigating legacy web structures at scale

Soccerstats is a data goldmine built on older web technologies. Here is how we extract clean data from complex DOMs.

pipeline-monitor · soccerstats.com · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
DOM Parsing
Navigating nested table hell

Soccerstats relies heavily on nested HTML tables without semantic class names or IDs. We use custom XPath and lxml parsers that rely on structural hierarchy rather than fragile CSS selectors, ensuring stable extraction.
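
To make the structural approach concrete, here is a small sketch using lxml. The markup is made up for illustration (an outer layout table wrapping a classless data table) and does not reproduce any real soccerstats.com page; the function name and column layout are likewise assumptions.

```python
from lxml import html

# Made-up markup: a layout table wrapping a data table, with no classes or ids.
SAMPLE_HTML = """
<html><body>
<table><tr><td>
  <table>
    <tr><td>Pos</td><td>Team</td><td>Pts</td></tr>
    <tr><td>1</td><td>Manchester City</td><td>91</td></tr>
    <tr><td>2</td><td>Example FC</td><td>80</td></tr>
  </table>
</td></tr></table>
</body></html>
"""


def extract_rows(page_source: str) -> list[dict]:
    """Pull (rank, team, points) rows out of the innermost table."""
    doc = html.fromstring(page_source)
    rows = []
    # Structural selector: tables that contain no further nested table.
    for table in doc.xpath("//table[not(.//table)]"):
        # Skip the header row by position instead of by class name.
        for tr in table.xpath(".//tr[position() > 1]"):
            cells = [td.text_content().strip() for td in tr.xpath("./td")]
            if len(cells) == 3:
                rank, team, points = cells
                rows.append(
                    {"rank": int(rank), "team_name": team, "points": int(points)}
                )
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Selecting by nesting depth and row position means the query keeps working as long as the table hierarchy is stable, even though there are no class names to hook onto.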

Data Normalisation
Consistent team and league naming

Team names often vary between pages (e.g., 'Man Utd' vs 'Manchester United'). Our pipeline includes a normalisation layer that maps all variations to a canonical UUID, ensuring clean joins in your database.
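
One way such a normalisation layer could work is an alias table plus deterministic UUIDs. The sketch below is an assumption about the shape of the mapping, not the actual implementation: the namespace, alias entries, and function name are all hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
TEAM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "teams.example.invalid")

# Alias table: lower-cased, whitespace-collapsed variant -> canonical name.
ALIASES = {
    "man utd": "Manchester United",
    "manchester united": "Manchester United",
}


def canonical_team_id(raw_name: str) -> uuid.UUID:
    """Map a scraped team string to a stable identifier."""
    key = " ".join(raw_name.lower().split())
    try:
        canonical = ALIASES[key]
    except KeyError:
        # Fail loudly so unmapped names are reviewed rather than guessed.
        raise KeyError(f"unmapped team name: {raw_name!r}") from None
    # uuid5 is deterministic: the same canonical name always gives the same id.
    return uuid.uuid5(TEAM_NAMESPACE, canonical)


print(canonical_team_id("Man Utd") == canonical_team_id("Manchester United"))
```

Because both variants resolve to the same canonical name, they receive the same UUID, so joins across pages and seasons line up.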

URL Routing
Handling legacy query parameters

Navigation relies on complex URL query parameters rather than RESTful paths. We map the entire parameter space for target leagues and seasons, ensuring complete coverage of historical archives without missing fixtures.
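
Enumerating a parameter space like this can be sketched with a Cartesian product. The base URL and the parameter names (league, season, view) below are placeholders chosen for illustration and are not the site's real query parameters.

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder endpoint; the real paths and parameter names will differ.
BASE_URL = "https://example.invalid/results.asp"


def build_urls(leagues, seasons, views):
    """Return one URL per (league, season, view) combination."""
    urls = []
    for league, season, view in product(leagues, seasons, views):
        query = urlencode({"league": league, "season": season, "view": view})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_urls(["england", "spain"], ["2022", "2023"], ["table", "results"])
print(len(urls))
print(urls[0])
```

Generating the full product up front makes coverage auditable: the queue size should equal leagues × seasons × views, and any missing combination is easy to spot.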

Rate Limiting
Respectful concurrency management

To prevent IP bans and server strain, we manage request concurrency and implement exponential backoff. We route requests through distributed IP pools to maintain throughput while respecting target infrastructure.
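
Exponential backoff can be sketched as follows. This is a generic illustration under stated assumptions, not the production retry logic: TransientError, fetch_with_backoff, and the default timings are hypothetical names and values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 503)."""


def fetch_with_backoff(fetch, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fetch(), retrying on TransientError with growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Delay doubles each attempt (base, 2*base, 4*base, ...) up to cap,
            # plus random jitter so parallel workers do not retry in lockstep.
            delay = min(cap, base * 2 ** attempt) + random.uniform(0, base)
            sleep(delay)
```

The sleep function is injectable so the retry schedule can be checked in tests without actually waiting.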

Change Detection
Incremental weekend updates

Instead of re-scraping entire historical seasons every week, we compute hashes of current season pages and only extract new match results and updated table rows, reducing pipeline runtime and downstream load.
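
A page-level version of this hash comparison might look like the sketch below. The function names and the in-memory dict used as a hash store are assumptions for illustration; a real pipeline would persist the digests and would likely hash only the relevant table content rather than the whole page.

```python
import hashlib


def page_fingerprint(body: str) -> str:
    """SHA-256 digest of the page content."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def changed_pages(pages: dict, seen: dict) -> list:
    """Return URLs whose content differs from the last stored digest.

    `pages` maps URL -> current HTML; `seen` maps URL -> previous digest
    and is updated in place.
    """
    changed = []
    for url, body in pages.items():
        digest = page_fingerprint(body)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed


seen = {}
pages = {"a": "<html>1</html>", "b": "<html>2</html>"}
print(changed_pages(pages, seen))  # first run: every page is new
print(changed_pages(pages, seen))  # second run: nothing changed
```

Only the URLs returned by changed_pages need to be re-parsed, which is what keeps weekly runs small.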

Applications

Who uses Soccerstats data

Teams across industries use soccerstats.com data to build competitive products and smarter operations.

01
Predictive Modelling

Quantitative syndicates use historical match results and goal timing data to train Poisson distribution models for match outcomes.

02
Odds Compilation

Sportsbooks ingest Over/Under and BTTS frequencies to validate their opening lines and identify pricing anomalies.

03
Fantasy Football Analytics

Platform providers use form guides and fixture difficulty metrics to power player recommendation engines.

04
Sports Media & Journalism

Publishers populate pre-match preview articles with automated H2H statistics and team form summaries.

05
Team Performance Analysis

Club analysts benchmark their team's late-goal concession rates against league averages to identify tactical weaknesses.

06
Algorithmic Trading

In-play traders use 15-minute goal interval statistics to model liquidity entry points on exchange platforms.
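
The Predictive Modelling use case above mentions Poisson models. As a hedged illustration of what that means in practice, the sketch below treats home and away goals as independent Poisson variables and sums scoreline probabilities into win, draw, and loss probabilities. The independence assumption, the function names, and the example scoring rates are all illustrative and are not drawn from any client model.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)


def outcome_probabilities(home_rate: float, away_rate: float, max_goals: int = 10):
    """Return (home_win, draw, away_win), truncating at max_goals per side."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            # Assumes home and away goal counts are independent.
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical expected goals per match for each side.
home, draw, away = outcome_probabilities(1.6, 1.1)
print(round(home + draw + away, 4))
```

The expected-goal rates would typically be estimated from historical match results, which is where the extracted data comes in.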

Why DataFlirt

"Soccerstats holds decades of structured football history, but its legacy HTML table structure makes automated extraction a nightmare for unspecialised crawlers."

Extracting data from Soccerstats requires parsing deeply nested legacy HTML tables, handling inconsistent team naming conventions across seasons, and managing rate limits. DataFlirt normalises this chaos into clean, relational datasets so your quants can focus on modelling rather than DOM traversal.

Technical Spec

Soccerstats scraper technical capabilities

Everything supported by our soccerstats.com scraper: legacy table parsing, historical archives, team name normalisation, change detection, and delivery.

Legacy HTML table parsing (Supported): Custom XPath extraction for deeply nested, classless tabular data
Historical season extraction (Supported): Access to archived league tables and match results spanning decades
Team name normalisation (Supported): Mapping inconsistent team strings to canonical identifiers
Change detection (diffs) (Supported): Only push new match results and updated league standings
Webhook delivery (Supported): HTTP POST upon completion of weekend fixture updates
Proxy rotation (Supported): Datacenter and residential pools to manage rate limits
Live in-play match events (Partial): Real-time clock, live score updates, and in-game event feeds
Player-level tracking data (Partial): Expected goals (xG), heatmaps, and individual player passing stats
Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, lxml, BeautifulSoup4
Scrapy + lxml Stack

Scrapy handles crawl orchestration and request scheduling, while lxml processes complex XPath queries against legacy HTML structures with high performance.

Proxy Infrastructure

We maintain pools of datacenter and residential IPs to distribute request load, reducing the chance of hitting rate limits while extracting large historical archives.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling for weekend fixture updates. All state and normalised team mappings are stored in PostgreSQL.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested arrays for hierarchical stats
CSV: Flat files suited to import into statistical software
XLS: Excel-compatible format for analyst review
Parquet: Columnar format for BigQuery, Snowflake, and Athena
AWS S3: Direct bucket delivery compatible with any data lake
Webhook: HTTP POST per batch for downstream processing triggers
API: REST endpoints to query historical match data on demand
PostgreSQL: Direct relational upserts into your existing database schema
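
For readers unfamiliar with the newline-delimited JSON option listed above, here is a minimal sketch of what that layout looks like: one JSON object per line. The helper name is hypothetical, and the second record is a made-up placeholder.

```python
import json


def to_ndjson(records) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


records = [
    {"team_name": "Liverpool", "goals_76_90": 22},
    {"team_name": "Example FC", "goals_76_90": 0},  # placeholder record
]
payload = to_ndjson(records)
print(payload, end="")

# Each line parses on its own, so consumers can stream the file row by row.
parsed = [json.loads(line) for line in payload.splitlines()]
```

Because every line is a complete JSON document, warehouse loaders can ingest the file without reading it into memory as a single array.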
// faq

Common questions.

About soccerstats.com scraping, legality, and pipeline operations.

Is scraping Soccerstats legal?

Scraping publicly available statistical data is generally permissible. DataFlirt extracts factual, non-copyrightable sports statistics. We do not extract personal data or bypass authentication. Clients should review target site Terms of Service and consult legal counsel for specific commercial use cases.

How do you handle the complex table structures?

We use custom lxml parsers and structural XPath queries rather than relying on CSS classes. Our engineering team maps the nested table hierarchy for each specific page type (league table, form guide, H2H) to ensure robust extraction.

Which leagues do you support?

We can extract data for any league available on the platform, including major European leagues (Premier League, La Liga, Serie A, Bundesliga, Ligue 1), lower divisions, and international tournaments.

How fresh is the data?

Soccerstats is typically updated shortly after matches conclude. We schedule our pipelines to run at defined intervals (e.g., daily or post-weekend) to capture the latest results and updated tables.

Do you provide live in-play data?

No. Soccerstats is best suited for pre-match analysis, historical research, and post-match statistics. We do not offer sub-second live match event scraping from this source.

Can you extract data from previous seasons?

Yes. We can crawl the historical archives to extract league tables, match results, and team statistics spanning multiple decades, depending on league availability on the site.

How do you handle different team names across seasons?

Our pipeline includes a normalisation layer. We maintain a mapping database that standardises team name variations into a single canonical identifier, ensuring your historical joins work correctly.

What is the minimum viable engagement?

Engagements typically start with a defined set of leagues and seasons for historical extraction, followed by a recurring weekly pipeline for current season updates. Contact us with your league list for a scoped quote.

$ dataflirt scope --new-project --source=soccerstats.com ready

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two weeks. Whether you need a full historical archive of 20 leagues or a weekly update of form guides and goal stats, we build and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h