We extract player statistics, match logs, expected goals (xG), and advanced scouting reports from Fbref. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Player Stats objects from fbref.com. All fields typed and schema-versioned.
{"player_id": "a1b2c3d4", "name": "Lionel Messi", "nationality": "ar ARG", "position": "FW", "age": 36, "goals": 20, "assists": 10, "xg": 18.5}
| # | player_id | name | nationality | position | age | matches_played |
|---|---|---|---|---|---|---|
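As a rough illustration of what "typed and schema-versioned" could look like for a Player Stats record, the sketch below models the sample fields as a Python dataclass. The class name `PlayerStats`, the `SCHEMA_VERSION` constant, and the `from_raw` helper are hypothetical names chosen for this example, not DataFlirt's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping

SCHEMA_VERSION = "1.0.0"  # hypothetical version tag, for illustration only


@dataclass(frozen=True)
class PlayerStats:
    """Typed view of the sample Player Stats fields."""

    player_id: str
    name: str
    nationality: str
    position: str
    age: int
    goals: int
    assists: int
    xg: float

    @classmethod
    def from_raw(cls, raw: Mapping[str, Any]) -> "PlayerStats":
        # Coerce each value to its declared type; a missing key raises KeyError.
        return cls(
            player_id=str(raw["player_id"]),
            name=str(raw["name"]),
            nationality=str(raw["nationality"]),
            position=str(raw["position"]),
            age=int(raw["age"]),
            goals=int(raw["goals"]),
            assists=int(raw["assists"]),
            xg=float(raw["xg"]),
        )
```

Coercing at the boundary means values scraped as strings (for example `"36"`) arrive in the warehouse as integers or floats rather than text.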
Complete list of extractable fields for Match Logs objects from fbref.com. All fields typed and schema-versioned.
{"match_id": "e5f6g7h8", "date": "2023-10-28", "competition": "La Liga", "home_team": "Barcelona", "away_team": "Real Madrid", "result": "1-2", "possession": 53, "xg_home": 1.2}
| # | match_id | date | competition | home_team | away_team | result |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Scouting Reports objects from fbref.com. All fields typed and schema-versioned.
{"player_id": "i9j0k1l2", "template": "Midfielders", "minutes_played": 2450, "goals_percentile": 85, "xg_percentile": 82, "progressive_passes": 95, "tackles": 40, "interceptions": 60}
| # | player_id | template | minutes_played | goals_percentile | xg_percentile | shot_creating_actions |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Team Stats objects from fbref.com. All fields typed and schema-versioned.
{"team_id": "m3n4o5p6", "season": "2023-2024", "competition": "Premier League", "rank": 1, "matches_played": 38, "points": 91, "goals_for": 96, "xg_for": 88.5}
| # | team_id | season | competition | rank | matches_played | wins |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Goalkeeping objects from fbref.com. All fields typed and schema-versioned.
{"player_id": "q7r8s9t0", "matches_played": 38, "shots_on_target_against": 120, "saves": 90, "save_percentage": 75.0, "clean_sheets": 15, "psxg": 35.2, "psxg_net": 5.2}
| # | player_id | matches_played | shots_on_target_against | saves | save_percentage | clean_sheets |
|---|---|---|---|---|---|---|
Our Fbref scraper handles complex multi-level tables, strict rate limits, and deep historical pagination to deliver clean, queryable football data.
Extract core metrics including goals, assists, playing time, and card accumulations across all domestic and international competitions.
Capture expected goals (xG), expected assisted goals (xAG), shot-creating actions, and detailed shooting efficiency metrics.
Extract pass completion rates, progressive passes, key passes, and possession statistics parsed from complex nested tables.
Track tackles, interceptions, blocks, clearances, and aerial duels won for comprehensive defensive profiling.
Extract post-shot expected goals (PSxG), save percentages, cross stopping, and sweeping actions for goalkeeper analysis.
Scrape detailed match-by-match logs for players and teams, including event timelines and formation data.
Extract data from past seasons across major leagues, maintaining consistent schemas despite historical formatting variations.
Data extraction spanning top European leagues, international tournaments, and lower divisions available on the platform.
Configure pipelines to run post-matchweek to capture updated statistics and standings automatically.
Brief in. Clean data out.
Specify leagues, seasons, and specific data tables (e.g., standard stats, passing, scouting reports) required.
We configure Scrapy crawlers, table parsing logic, and rate-limit management systems for Sports Reference infrastructure.
Schema validation, null-rate checks, and cross-referencing totals to ensure accurate table extraction.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
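The validation step above (schema validation, null-rate checks, cross-referencing totals) can be sketched roughly as follows. This is a minimal illustration, not our production checks: the function names and `REQUIRED_TYPES` mapping are hypothetical, and the cross-check assumes `save_percentage` equals saves divided by shots on target against, which matches the sample goalkeeping record (90 / 120 = 75.0) but may not hold for every row on the site.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical subset of the goalkeeping schema used for this example.
REQUIRED_TYPES = {
    "player_id": str,
    "shots_on_target_against": int,
    "saves": int,
    "save_percentage": float,
}


def null_rate(rows: Sequence[Dict[str, Any]], column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_goalkeeping_row(row: Dict[str, Any], tolerance: float = 0.1) -> List[str]:
    """Return a list of problems found in one row; empty means it passed."""
    # Schema check: every required field is present with the expected type.
    problems = [
        f"{field}: expected {expected.__name__}"
        for field, expected in REQUIRED_TYPES.items()
        if not isinstance(row.get(field), expected)
    ]
    if problems:
        return problems

    # Cross-reference: recompute the percentage from its component totals.
    faced = row["shots_on_target_against"]
    if faced > 0:
        derived = 100.0 * row["saves"] / faced
        if abs(derived - row["save_percentage"]) > tolerance:
            problems.append(
                f"save_percentage: stored {row['save_percentage']}, derived {derived:.1f}"
            )
    return problems
```

A batch whose `null_rate` for a normally populated column jumps between runs is a useful early signal that a table layout has changed upstream.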
Sports Reference sites present unique structural and infrastructural challenges. Here is how we build resilient extraction.
Fbref uses complex HTML tables with multiple header rows (e.g., categorising 'Passes' into 'Total', 'Short', 'Medium', 'Long'). Our parsers flatten these hierarchies into clean, single-level column names suitable for relational databases.
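A minimal sketch of this kind of header flattening is shown below. It assumes the group row has already been expanded so that every column carries its group label (an empty string where there is none); the function names and the `pct` convention for `%` are choices made for this example only.

```python
import re
from typing import Dict, List, Sequence


def _slug(text: str) -> str:
    """Lowercase a header label and join its words with underscores."""
    text = text.replace("%", " pct ").strip().lower()
    return re.sub(r"[^0-9a-z]+", "_", text).strip("_")


def flatten_headers(groups: Sequence[str], columns: Sequence[str]) -> List[str]:
    """Combine a group header row and a column header row into flat names."""
    if len(groups) != len(columns):
        raise ValueError("header rows must have the same length")
    flat: List[str] = []
    seen: Dict[str, int] = {}
    for group, column in zip(groups, columns):
        # Skip empty parts so ungrouped columns keep a plain name.
        name = "_".join(part for part in (_slug(group), _slug(column)) if part)
        # Suffix repeated names so every output column stays unique.
        count = seen.get(name, 0)
        seen[name] = count + 1
        flat.append(name if count == 0 else f"{name}_{count + 1}")
    return flat


groups = ["", "Total", "Total", "Total", "Short", "Short", "Short"]
columns = ["Player", "Cmp", "Att", "Cmp%", "Cmp", "Att", "Cmp%"]
print(flatten_headers(groups, columns))
```

With the sample rows above, the ambiguous `Cmp` and `Att` columns become `total_cmp`, `short_cmp`, and so on, which can be used directly as relational column names.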
Sports Reference implements aggressive rate limiting. We utilise distributed proxy pools and precise request throttling to maintain extraction volume without triggering IP bans or HTTP 429 responses.
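The request-throttling idea can be illustrated with a small sketch that enforces a minimum gap between requests and computes a capped exponential backoff after an HTTP 429. The `Throttle` class and `backoff_delay` function are hypothetical, and the interval and backoff values are placeholders rather than the site's published limits.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Capped exponential backoff (no jitter) for retries after an HTTP 429."""
    return min(cap, base * (2 ** attempt))
```

A crawler would call `throttle.wait()` before each request and sleep for `backoff_delay(attempt)` whenever a 429 comes back, preferring any `Retry-After` header the server supplies.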
We extract and preserve Fbref unique identifiers for players, teams, and matches, allowing you to build relational models and map entities across different datasets.
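As a sketch of how such identifiers might be pulled out of page links, the example below assumes entity URLs follow the shape `/en/players/<8 hex characters>/<Name-Slug>` (and similarly for `squads` and `matches`). That URL pattern and the `extract_entity` helper are assumptions made for illustration; the ID used is the placeholder from the sample record, not a real player ID.

```python
import re
from typing import Optional, Tuple

# Assumed URL shape: /en/<section>/<8 hex chars>/<slug>
_ENTITY_RE = re.compile(r"/en/(players|squads|matches)/([0-9a-f]{8})(?:/|$)")

# Map URL sections to the entity names used in the delivered schema.
_KINDS = {"players": "player", "squads": "team", "matches": "match"}


def extract_entity(url: str) -> Optional[Tuple[str, str]]:
    """Return (entity_kind, entity_id) for a recognised URL, else None."""
    match = _ENTITY_RE.search(url)
    if match is None:
        return None
    return _KINDS[match.group(1)], match.group(2)


print(extract_entity("https://fbref.com/en/players/a1b2c3d4/Lionel-Messi"))
```

Keying records on the extracted ID rather than on the display name avoids collisions between players who share a name and survives spelling changes.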
Historical seasons often lack advanced metrics like xG. Our pipelines handle missing columns gracefully, ensuring historical data fits into modern schemas without breaking downstream processes.
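One simple way to picture this is projecting every scraped row onto a fixed target column list, filling absent fields with nulls. The sketch below is illustrative only; `TARGET_COLUMNS` and `align_to_schema` are hypothetical names, and the column list is not the full delivered schema.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Hypothetical target schema for this example.
TARGET_COLUMNS = ["player_id", "season", "goals", "assists", "xg"]


def align_to_schema(
    rows: Iterable[Dict[str, Any]],
    columns: Sequence[str] = TARGET_COLUMNS,
) -> List[Dict[str, Any]]:
    """Project each row onto `columns`, filling absent fields with None."""
    return [{column: row.get(column) for column in columns} for row in rows]


older_season = {"player_id": "a1b2c3d4", "season": "2004-2005", "goals": 1, "assists": 0}
print(align_to_schema([older_season]))
```

Because every output row has the same keys in the same order, a season without xG loads into the same warehouse table as a recent one, with `xg` stored as null rather than as a missing column.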
For active seasons, we identify updated match logs and recalculate season totals, pushing only the necessary updates to your warehouse.
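A common way to detect which records need pushing is to compare a content fingerprint against the one stored from the previous run. The sketch below assumes records are JSON-serialisable dictionaries keyed by `match_id`; the helper names are hypothetical and this is not a description of our internal change-detection logic.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def fingerprint(record: Dict[str, Any]) -> str:
    """Stable SHA-256 hash of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_records(
    previous: Dict[str, str],
    current: Iterable[Dict[str, Any]],
    key: str = "match_id",
) -> List[Dict[str, Any]]:
    """Return records that are new or differ from the previous run.

    `previous` maps each record key to the fingerprint stored last time.
    """
    return [
        record for record in current
        if previous.get(record[key]) != fingerprint(record)
    ]
```

Only the records returned by `changed_records` need to be written to the warehouse, after which their fingerprints replace the stored ones.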
Professional clubs use Fbref scouting reports and percentile rankings to identify undervalued talent across global leagues.
Data scientists build predictive models for FPL and other fantasy games using underlying xG and xAG metrics rather than raw outputs.
Syndicates ingest historical match logs and team performance data to train predictive models and find edge in betting markets.
Sports journalists and broadcasters use advanced metrics to enrich match commentary and analytical articles.
Researchers analyse long-term trends in tactical evolution, player longevity, and league competitiveness using historical datasets.
Coaching staff evaluate team performance against expected metrics to identify tactical inefficiencies and areas for improvement.
"Fbref provides the most comprehensive publicly available football dataset, but extracting it from multi-level HTML tables requires specialised parsing architecture."
Parsing Sports Reference tables is notoriously difficult due to complex headers, embedded JavaScript variables, and strict rate limits. DataFlirt handles the extraction complexity, delivering flattened, typed, and warehouse-ready data so your analysts can focus on building models rather than writing parsing scripts.
Everything supported by our fbref.com scraper: multi-level table parsing, rate-limit management, deep historical pagination, and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Custom Python modules designed specifically to parse Sports Reference DOM structures, handling multi-row headers and dynamic column generation.
Intelligent request scheduling via Redis and Airflow to respect target site limits while maintaining extraction throughput across distributed proxy pools.
Pipelines run on AWS Lambda and Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.
Data delivered to where your team already works — no new tooling required.
About fbref.com scraping, legality, and pipeline operations.
Scraping publicly available factual data (such as sports statistics) is generally permissible. DataFlirt targets only public, non-authenticated statistical data. We do not extract personal data or circumvent authentication walls like Stathead. Clients should review Terms of Service and consult legal counsel for specific use cases.
Sports Reference sites enforce strict request limits. We manage this through distributed residential proxies, precise request delays, and concurrency controls to ensure reliable data extraction without triggering blocks.
Pipelines are typically scheduled weekly or daily following matchdays to capture updated statistics. Real-time extraction during matches is not supported as Fbref updates data post-match.
Yes. We extract all advanced metrics provided on the platform, including xG, xAG, PSxG, and shot-creating actions, preserving the granularity of the original tables.
We can extract data as far back as Fbref provides it. Note that advanced metrics like xG are only available for recent seasons; our schema handles these historical variations gracefully.
No. Stathead requires a paid subscription and authenticated access. We only extract publicly available data from the main Fbref domain.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need historical season data or continuous updates for predictive modelling — we scope, build, and operate the pipeline. Tell us what you need.