
Sports data,
at warehouse scale.

We extract box scores, player profiles, advanced metrics, and season aggregates from the Sports-Reference network. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Box scores extracted
14,291 /day
Player profiles
89,412 /run
Season aggregates
3,450 /24h
Active pipelines
41
Uptime
99.94%
Data Dictionary

Every field we extract from sports-reference.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Player Profiles objects from sports-reference.com. All fields typed and schema-versioned.

Fields: player_id, name, sport, position, height, weight, birth_date, draft_year, debut_date, final_game_date, hall_of_fame, career_earnings, college

Sample response (player_profiles, 200 OK):
{
  "player_id": "jordami01",
  "name": "Michael Jordan",
  "sport": "basketball",
  "position": "Shooting Guard",
  "height": "6-6",
  "weight": 198,
  "birth_date": "1963-02-17",
  "draft_year": 1984
}

Complete list of extractable fields for Box Scores objects from sports-reference.com. All fields typed and schema-versioned.

Fields: game_id, date, home_team, away_team, home_score, away_score, attendance, venue, referee, duration, weather, box_score_url

Sample response (box_scores, 200 OK):
{
  "game_id": "202310240DEN",
  "date": "2023-10-24",
  "home_team": "Denver Nuggets",
  "away_team": "Los Angeles Lakers",
  "home_score": 119,
  "away_score": 107,
  "attendance": 19842,
  "venue": "Ball Arena"
}
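As an illustration of what "typed" can mean for a box score record, here is a minimal Python sketch. The `BoxScore` class, the `parse_box_score` helper, and the chosen field types are assumptions made for this example, not the delivered schema; fields that older games may lack are modelled as optional.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoxScore:
    """Typed view of one box_scores record (field types are assumptions)."""
    game_id: str
    date: dt.date
    home_team: str
    away_team: str
    home_score: int
    away_score: int
    attendance: Optional[int] = None  # may be absent for older games
    venue: Optional[str] = None

    @property
    def home_margin(self) -> int:
        """Home score minus away score."""
        return self.home_score - self.away_score


def parse_box_score(raw: dict) -> BoxScore:
    """Convert a raw JSON object into a BoxScore, coercing types explicitly."""
    attendance = raw.get("attendance")
    return BoxScore(
        game_id=str(raw["game_id"]),
        date=dt.date.fromisoformat(raw["date"]),
        home_team=raw["home_team"],
        away_team=raw["away_team"],
        home_score=int(raw["home_score"]),
        away_score=int(raw["away_score"]),
        attendance=int(attendance) if attendance is not None else None,
        venue=raw.get("venue"),
    )
```

Parsing into a frozen dataclass fails fast on a missing required field, while optional fields fall back to `None` instead of raising.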

Complete list of extractable fields for Advanced Analytics objects from sports-reference.com. All fields typed and schema-versioned.

Fields: player_id, season, team, league, games_played, war, per, true_shooting_pct, usage_rate, win_shares, box_plus_minus, vorp

Sample response (advanced_analytics, 200 OK):
{
  "player_id": "jokicni01",
  "season": "2023-24",
  "team": "DEN",
  "war": 13.2,
  "per": 31.0,
  "true_shooting_pct": 0.65,
  "usage_rate": 29.3,
  "win_shares": 17.0
}

Complete list of extractable fields for Team Seasons objects from sports-reference.com. All fields typed and schema-versioned.

Fields: team_id, season, league, wins, losses, ties, points_for, points_against, coach, top_scorer, playoff_result

Sample response (team_seasons, 200 OK):
{
  "team_id": "BOS",
  "season": "2023-24",
  "wins": 64,
  "losses": 18,
  "points_for": 120.6,
  "points_against": 109.2,
  "playoff_result": "Won Finals"
}

Complete list of extractable fields for Play-by-Play objects from sports-reference.com. All fields typed and schema-versioned.

Fields: game_id, quarter, time_remaining, team, player, action_type, description, score_home, score_away, win_probability

Sample response (play_by_play, 200 OK):
{
  "game_id": "202310240DEN",
  "quarter": 4,
  "time_remaining": "02:14",
  "team": "DEN",
  "action_type": "Made Shot",
  "description": "Nikola Jokic makes 2-pt jump shot",
  "score_home": 115,
  "score_away": 103
}

Capabilities

Extract the definitive sports archive

Our Sports-Reference scraper handles the entire network: Baseball-Reference, FBref, Basketball-Reference, and Pro-Football-Reference. We normalise idiosyncratic table structures and bypass strict rate limits.

Cross-Network Extraction

Unified pipelines for Baseball, Basketball, Football, Hockey, College Sports, and Soccer (FBref). Normalised schemas across disparate DOM structures.

Historical Box Score Parsing

Extract line scores, individual player stats, and game metadata for every recorded game in league history, adapting to missing fields from older eras.

Advanced Metric Capture

Capture calculated metrics like WAR, PER, xG, and Win Shares directly from the source tables, updated daily as algorithms adjust.

Play-by-Play Event Logs

Sequential event extraction for game flow analysis, including time remaining, action types, and real-time win probability shifts.

Draft Class Histories

Extract complete draft boards by year, including pick number, team, player, college, and subsequent career value metrics.

Franchise Record Aggregation

Track team histories, coaching changes, stadium shifts, and season-by-season performance records across all major leagues.

Commented HTML Parsing

Sports-Reference hides secondary tables inside HTML comments to optimise load times. Our parsers extract and render this hidden DOM data automatically.

Rate Limit Circumvention

The network enforces a strict limit of 20 requests per minute per IP. We distribute load across residential IP pools to execute high-volume historical backfills without bans.

Scheduled Daily Syncs

Configure continuous pipelines to append new box scores and update season aggregates every morning during active seasons.

// engagement pipeline

From player list to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide sport categories, season ranges, or specific team IDs. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy crawlers, proxy rotation, and specific DOM parsers for the target Sports-Reference sub-site.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and cross-era data normalisation before full launch.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
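The null-rate check in the Validation & QA step can be pictured with a short Python sketch. The function names and the 25% threshold used in the usage note are hypothetical, chosen only to illustrate the idea; the actual QA thresholds are agreed per project.

```python
from typing import Iterable, Mapping


def null_rates(records: Iterable[Mapping], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of records where it is missing or None."""
    rows = list(records)
    if not rows:
        # No data means nothing to flag; report zero for every field.
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in rows if r.get(f) is None) / len(rows)
        for f in fields
    }


def fields_over_threshold(rates: Mapping[str, float], max_rate: float) -> list[str]:
    """List the fields whose null rate exceeds max_rate, sorted by name."""
    return sorted(f for f, rate in rates.items() if rate > max_rate)
```

For example, `fields_over_threshold(null_rates(rows, ["game_id", "attendance"]), 0.25)` would return the fields worth reviewing before launch. Older eras are expected to show higher null rates, so a per-era threshold may be more useful than a single global one.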

Under the hood

How our pipeline handles the hard parts

Sports-Reference employs aggressive rate limiting and relies on legacy HTML structures. Here is how we maintain reliable extraction.

pipeline-monitor · sports-reference.com · live (active)
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Rate limiting
Handling 429 Too Many Requests

Sports-Reference aggressively bans IPs that exceed 20 requests per minute. We distribute crawl tasks across thousands of residential IPs, ensuring request rates per IP remain well below threshold during massive historical backfills.
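One way to keep each IP under a per-minute budget is a sliding-window limiter keyed by proxy. The Python sketch below is illustrative only: the `PerProxyThrottle` class, the proxy labels, and the default budget of 10 requests per 60 seconds (chosen to sit below the 20 per minute figure above) are assumptions, not our production scheduler.

```python
import time
from collections import deque
from typing import Callable, Iterable, Optional


class PerProxyThrottle:
    """Sliding-window limiter: at most max_per_window requests per proxy per window."""

    def __init__(
        self,
        proxies: Iterable[str],
        max_per_window: int = 10,       # assumed budget, below the 20/min limit
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sent = {p: deque() for p in proxies}  # send timestamps per proxy
        self._max = max_per_window
        self._window = window_s
        self._clock = clock

    def acquire(self) -> Optional[str]:
        """Pick the least-used proxy with spare budget and record the request.

        Returns None when every proxy has used its budget for the current window.
        """
        now = self._clock()
        best = None
        for proxy, sent in self._sent.items():
            # Drop timestamps that have aged out of the window.
            while sent and now - sent[0] >= self._window:
                sent.popleft()
            if len(sent) < self._max and (
                best is None or len(sent) < len(self._sent[best])
            ):
                best = proxy
        if best is not None:
            self._sent[best].append(now)
        return best
```

A crawler would call `acquire()` before each request and back off when it returns `None`. Injecting the clock keeps the limiter testable without sleeping.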

DOM structure
Extracting tables from HTML comments

To optimise page load times, Sports-Reference injects many of its data tables as commented-out HTML strings. Selectors in standard parsers never reach this data, because the markup sits inside comment nodes. Our middleware extracts, unescapes, and parses these comments into standard DOM trees before extraction.
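The comment-extraction idea can be sketched with Python's standard library alone. This is a simplified illustration, not our middleware: the class names, the `rows_from_commented_tables` helper, and the element ids in the usage note are hypothetical, and a production parser would also need to handle nested tables and header rows.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect HTML comments that appear to contain a <table>."""

    def __init__(self) -> None:
        super().__init__()
        self.table_comments: list[str] = []

    def handle_comment(self, data: str) -> None:
        if "<table" in data:
            self.table_comments.append(data)


class TableRows(HTMLParser):
    """Flatten every <tr> into a list of stripped cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def rows_from_commented_tables(page_html: str) -> list[list[str]]:
    """Re-parse the body of each table-bearing comment and return its rows."""
    collector = CommentCollector()
    collector.feed(page_html)
    collector.close()
    rows: list[list[str]] = []
    for fragment in collector.table_comments:
        parser = TableRows()
        parser.feed(fragment)
        parser.close()
        rows.extend(parser.rows)
    return rows
```

Given a page containing `<div id="all_advanced"><!-- <table>...</table> --></div>`, the first pass picks up the comment body and the second pass parses it as ordinary markup. Character references inside cells are decoded by `HTMLParser` by default.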

Schema normalisation
Handling cross-era data gaps

A box score from 1962 lacks the fields of a box score from 2024. Our schemas gracefully handle missing data points from historical eras, enforcing null values rather than breaking the pipeline or misaligning columns.
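A minimal Python sketch of null-filling against a fixed column list follows. The `normalise` helper is hypothetical; the field tuple mirrors the box score fields listed in the data dictionary, and treating blank strings as null is an assumption made for this example.

```python
BOX_SCORE_FIELDS = (
    "game_id", "date", "home_team", "away_team", "home_score", "away_score",
    "attendance", "venue", "referee", "duration", "weather", "box_score_url",
)


def normalise(record: dict, fields: tuple = BOX_SCORE_FIELDS) -> dict:
    """Return a dict with exactly `fields` as keys, in schema order.

    Absent keys and blank strings become None, so every row has the same
    columns regardless of the era it came from.
    """
    out = {}
    for name in fields:
        value = record.get(name)
        if isinstance(value, str) and not value.strip():
            value = None  # assumption: blank cells are treated as missing
        out[name] = value
    return out
```

Because the output keys always follow the schema order, downstream CSV or Parquet writers see stable columns even when an older game supplies only a handful of fields.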

Daily syncs
Overnight box score generation

For active seasons, we schedule pipelines to run at 04:00 EST, capturing the previous night's box scores and updated season aggregates as soon as the site publishes them, delivering fresh data before morning analysis.

Monitoring
Schema drift detection

When Sports-Reference adds new advanced metrics or alters table structures, our observability stack detects the schema drift. We alert on missing fields and update selectors before you receive malformed data.
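At its simplest, drift detection is a set comparison between the columns a parser expects and the columns it actually observed. The sketch below assumes hypothetical names (`DriftReport`, `detect_drift`) and is not our observability stack; real monitoring would also track type changes and value distributions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DriftReport:
    """Columns that disappeared from, or newly appeared in, a source table."""
    missing: tuple
    unexpected: tuple

    @property
    def drifted(self) -> bool:
        return bool(self.missing or self.unexpected)


def detect_drift(expected: Iterable[str], observed: Iterable[str]) -> DriftReport:
    """Compare expected schema fields with the headers seen in this run."""
    expected_set, observed_set = set(expected), set(observed)
    return DriftReport(
        missing=tuple(sorted(expected_set - observed_set)),
        unexpected=tuple(sorted(observed_set - expected_set)),
    )
```

A non-empty `missing` tuple is the case worth alerting on before delivery, while `unexpected` columns usually signal a new metric that could be added to the schema.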

Applications

Who uses Sports-Reference data

Teams across industries use sports-reference.com data to build competitive products and smarter operations.

01
Sports Betting Models

Quantitative syndicates ingest historical play-by-play logs and advanced metrics to backtest predictive models and identify line inefficiencies.

02
Fantasy Sports Analytics

DFS platforms and high-volume players use daily updated box scores and usage rates to project player performance and optimise lineups.

03
Academic Research

Economists and statisticians analyse decades of salary, draft, and performance data to study labour markets and predictive validity.

04
Media & Broadcasting

Sports networks use structured historical data to generate automated pre-game notes, on-screen graphics, and historical comparisons.

05
Player Valuation Models

Front offices and agencies benchmark player performance against historical cohorts to inform contract negotiations and trade evaluations.

06
Historical Archiving

Data libraries maintain permanent, queryable records of franchise histories and league evolutions independent of third-party interfaces.

Why DataFlirt

"Sports-Reference holds the definitive historical record of global athletics, but their HTML structure is notoriously dense and idiosyncratic across different sports."

Extracting data from Baseball-Reference or FBref requires parsing commented-out DOM elements, normalising schemas across decades of rule changes, and strictly managing request rates to avoid outright IP bans. DataFlirt handles the proxy rotation and parsing logic so your data scientists can focus on modelling.

Technical Spec

Sports-Reference scraper - technical capabilities

Everything supported by our sports-reference.com scraper, from commented-out table parsing and rate-limit handling to scheduled delivery. Items that depend on paid or authenticated access are marked Partial.

Cross-sport schema normalisation
Unified output formats across Baseball, Basketball, Football, and Soccer
Supported
Commented-out HTML parsing
Extracts data tables hidden inside HTML comments for load optimisation
Supported
Play-by-play log extraction
Sequential event data with timestamps and score shifts
Supported
Daily active season updates
Scheduled runs to capture overnight box scores and updated aggregates
Supported
Historical franchise archives
Full backfills of team histories dating back to league inception
Supported
Proxy rotation for rate limits
Distributed crawling to bypass 20 req/min IP bans
Supported
Stathead custom queries
Custom database queries require a paid Stathead subscription and auth
Partial
Direct API endpoint access
Sports-Reference offers a paid API; we scrape the public HTML frontend
Partial
Webhook delivery
HTTP POST per record or batch for downstream pipeline ingestion
Supported
Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus
Scrapy + Playwright Stack

Scrapy handles crawl orchestration, deduplication, and retry logic. Custom middleware unescapes hidden HTML comments before passing the DOM to the parsing layer.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies to distribute request load. Rotation happens per-request to ensure no single IP trips the strict rate limiting thresholds.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling for overnight daily syncs, and SLA alerting ensures data is ready before morning analysis.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested - schema versioned per run
CSV
Flat file with typed columns - Excel/Sheets compatible
XLS
Legacy spreadsheet format for direct analyst consumption
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery - compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
API
RESTful endpoints to query your extracted datasets
PostgreSQL
Upsert into your existing schema with conflict resolution
Snowflake
Stage + COPY INTO workflow - incremental or full-replace
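To make the newline-delimited JSON and flat CSV formats concrete, here is a small standard-library Python sketch. The helper names `to_ndjson` and `to_csv` are hypothetical, and the upload step to S3 or a warehouse is deliberately left out.

```python
import csv
import io
import json
from typing import Iterable, Mapping, Sequence


def to_ndjson(records: Iterable[Mapping]) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def to_csv(records: Iterable[Mapping], fields: Sequence[str]) -> str:
    """Serialise records as CSV with a fixed header; None becomes an empty cell."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(fields), extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for r in records:
        # Build each row from the schema so missing keys stay aligned.
        writer.writerow({f: r.get(f) for f in fields})
    return buf.getvalue()
```

Fixing the header from the schema, instead of from whichever record arrives first, keeps columns aligned when some records lack fields.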
// faq

Common questions.

About sports-reference.com scraping, legality, and pipeline operations.

Is scraping Sports-Reference legal?

Scraping publicly available, factual historical data (like box scores and player stats) is generally permissible. DataFlirt targets only public, non-authenticated pages. We do not extract proprietary Stathead subscription data or circumvent authentication walls.

How do you handle their strict rate limits?

Sports-Reference enforces a strict limit of 20 requests per minute per IP. We distribute crawl tasks across a large pool of residential proxies, ensuring individual IPs remain well below the threshold during high-volume backfills.

Why are some tables missing in standard scrapers?

To optimise page load performance, Sports-Reference injects many secondary data tables as commented-out HTML strings. Standard HTML parsers treat comments as inert nodes, so selectors never reach those tables. Our custom middleware extracts and parses these comments into standard DOM elements.

Which sports do you support?

We support the entire network: Baseball-Reference, Basketball-Reference, Pro-Football-Reference, Hockey-Reference, FBref (Soccer), and College Sports sub-sites. All use unified data normalisation pipelines.

Can you normalise data across different eras?

Yes. A box score from 1950 lacks the advanced metrics of a game from 2024. Our schemas gracefully handle missing fields from historical eras, enforcing null values rather than breaking column alignment.

How fast are daily updates delivered?

For active seasons, pipelines run overnight following the conclusion of games. Deliveries typically complete by 06:00 EST, providing fresh box scores and season aggregates before morning analysis begins.

Do you extract Stathead data?

No. Stathead is a paid subscription service that provides custom database query tools behind an authentication wall. We extract data exclusively from the public frontend pages of the Sports-Reference network.

$ dataflirt scope --new-project --source=sports-reference.com ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full historical backfill of every MLB box score or a daily sync of NBA advanced metrics - we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h