We extract player biographies, advanced metrics, game logs, and play-by-play data from Hockey-Reference. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Player Profiles objects from hockey-reference.com. All fields typed and schema-versioned.
{"player_id": "mcdavco01", "name": "Connor McDavid", "position": "C", "shoots": "L", "height_inches": 73, "weight_lbs": 193, "birth_date": "1997-01-13", "draft_year": 2015}
| # | player_id | name | position | shoots | height_inches | weight_lbs |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Standard Stats objects from hockey-reference.com. All fields typed and schema-versioned.
{"player_id": "mcdavco01", "season": "2022-23", "team": "EDM", "games_played": 82, "goals": 64, "assists": 89, "points": 153, "plus_minus": 22}
| # | player_id | season | team | league | games_played | goals |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Advanced Metrics objects from hockey-reference.com. All fields typed and schema-versioned.
{"player_id": "mcdavco01", "season": "2022-23", "corsi_for_pct": 53.4, "fenwick_for_pct": 53.1, "pdo": 103.2, "offensive_zone_start_pct": 62.1, "point_shares": 16.5, "time_on_ice_per_game": "22:23"}
| # | player_id | season | team | corsi_for_pct | fenwick_for_pct | pdo |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Game Logs objects from hockey-reference.com. All fields typed and schema-versioned.
{"game_id": "202304130EDM", "date": "2023-04-13", "team": "EDM", "opponent": "SJS", "home_away": "Home", "goals": 1, "assists": 2, "time_on_ice": "18:45"}
| # | game_id | date | team | opponent | home_away | result |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Box Scores objects from hockey-reference.com. All fields typed and schema-versioned.
{"game_id": "202304130EDM", "away_team": "SJS", "home_team": "EDM", "away_goals": 2, "home_goals": 5, "attendance": 18347, "venue": "Rogers Place", "shots_on_goal": 35}
| # | game_id | date | away_team | home_team | away_goals | home_goals |
|---|---|---|---|---|---|---|
Hockey-Reference houses the most comprehensive public NHL database. We extract every table, parse commented-out HTML structures, and normalise multi-season data while adhering to strict rate limits.
Extract physical attributes, draft positions, birthplaces, and career milestones for every player in NHL history.
Capture goals, assists, points, penalty minutes, and plus-minus across regular season and playoff campaigns.
Parse Corsi, Fenwick, PDO, zone starts, and point shares from heavily nested statistical tables.
Extract save percentages, goals against averages, shutouts, and quality starts for goaltenders.
Collect per-game performance records, including time on ice and shift counts, for any player across any season.
Scrape team-level game summaries, period-by-period scoring, penalty summaries, and venue attendance.
Extract historical NHL entry draft results, including pick numbers, teams, and amateur leagues.
Capture Hart, Norris, Vezina, and Calder trophy voting histories and point distributions.
Schedule pipelines to fetch the previous night's box scores and updated player statistics every morning.
Brief in. Clean data out.
Provide player IDs, team slugs, or season years. We design the extraction schema together.
We configure Scrapy crawlers, proxy rotation, and rate-limit compliance logic for hockey-reference.com.
Schema validation, null-rate checks, and table-parsing verification before full launch.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
Sports Reference sites aggressively throttle automated access. Here is how we maintain pipeline stability.
Hockey-Reference enforces a strict limit of 20 requests per minute. We distribute requests across residential IP pools and enforce global concurrency locks to avoid 429 Too Many Requests errors while maintaining throughput.
Many advanced statistical tables on Hockey-Reference are hidden inside HTML comments to speed up page loads. Our parsers automatically detect, uncomment, and extract this data without requiring full browser rendering.
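As a rough illustration of the comment-parsing idea, the sketch below pulls table rows out of HTML comments using only the Python standard library. It is not our production parser: the sample markup, the `all_skaters_advanced` and `skaters_advanced` ids, and the `extract_commented_rows` helper are all hypothetical names chosen for the example.

```python
import re
from html.parser import HTMLParser

# Matches the body of every HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


class TableParser(HTMLParser):
    """Collects the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_commented_rows(html):
    """Return the rows of every table that sits inside an HTML comment."""
    rows = []
    for body in COMMENT_RE.findall(html):
        if "<table" not in body:
            continue  # ordinary comment, not a hidden table
        parser = TableParser()
        parser.feed(body)
        parser.close()
        rows.extend(parser.rows)
    return rows


# Hypothetical markup shaped like a comment-wrapped stats table.
SAMPLE_HTML = """
<div id="all_skaters_advanced">
<!--
<table id="skaters_advanced">
<tr><th>season</th><th>corsi_for_pct</th></tr>
<tr><td>2022-23</td><td>53.4</td></tr>
</table>
-->
</div>
"""

rows = extract_commented_rows(SAMPLE_HTML)
```

Because the comment body is re-parsed as ordinary HTML, no headless browser is involved; the values come back as strings and would still need typing downstream.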
Stats like time on ice or blocked shots were not recorded in early NHL seasons. Our schema handles nulls gracefully and normalises historical records against modern datasets.
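One simple way to picture the null handling: every record is projected onto the same field list, and anything the source never recorded comes through as an explicit null. The field list, the `normalise_record` helper, and the placeholder record below are assumptions made for this sketch, not our delivered schema.

```python
# Hypothetical subset of a modern stats schema.
MODERN_FIELDS = (
    "player_id",
    "season",
    "goals",
    "assists",
    "time_on_ice",
    "blocked_shots",
)


def normalise_record(raw):
    """Return a record with every schema field present.

    Fields that are absent or blank in the source become None, so early
    seasons and modern seasons share one column layout.
    """
    record = {}
    for field in MODERN_FIELDS:
        value = raw.get(field)
        if isinstance(value, str) and value.strip() == "":
            value = None  # blank table cell, treated as not recorded
        record[field] = value
    return record


# Placeholder early-season row: no time on ice or blocked shots recorded.
early = normalise_record(
    {"player_id": "example01", "season": "1925-26", "goals": 10, "assists": ""}
)
```

Keeping the nulls explicit, rather than substituting zeros, lets downstream queries tell "not recorded" apart from "recorded as zero".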
For ongoing seasons, we maintain a state file of completed games. Pipelines only scrape new box scores and update cumulative player stats, drastically reducing request volume.
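The incremental approach can be sketched with a small JSON state file. The file layout and the helper names `load_completed`, `select_new_games`, and `mark_completed` are illustrative assumptions, and `EXAMPLE_GAME_2` is a placeholder id rather than a real game.

```python
import json
from pathlib import Path


def load_completed(state_path):
    """Read the set of already-scraped game ids; empty on the first run."""
    path = Path(state_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def select_new_games(game_ids, completed):
    """Keep only the games that have not been scraped yet, in input order."""
    return [game_id for game_id in game_ids if game_id not in completed]


def mark_completed(state_path, completed, game_ids):
    """Persist the newly scraped ids and return the updated set."""
    updated = completed | set(game_ids)
    Path(state_path).write_text(json.dumps(sorted(updated)), encoding="utf-8")
    return updated
```

A nightly run would then call `load_completed`, fetch only what `select_new_games` returns, and call `mark_completed` once those box scores have been delivered, so a failed run is simply retried the next time.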
We monitor for missing tables, unexpected schema changes, and IP bans. If a stat column shifts, our alerting stack catches it before bad data reaches your warehouse.
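A minimal sketch of this kind of check, assuming each batch arrives as a list of dictionaries: it flags rows whose columns differ from the expected set and columns whose null rate crosses a threshold. The `check_batch` name and the 0.2 default threshold are hypothetical; real thresholds would be agreed per field.

```python
def check_batch(rows, expected_columns, max_null_rate=0.2):
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    expected = set(expected_columns)

    # Schema drift: a renamed or shifted column shows up as missing plus unexpected.
    for index, row in enumerate(rows):
        missing = sorted(expected - set(row))
        unexpected = sorted(set(row) - expected)
        if missing or unexpected:
            problems.append(
                f"row {index}: missing {missing}, unexpected {unexpected}"
            )

    # Null rate: share of rows where the column is absent or None.
    for column in expected_columns:
        if not rows:
            break
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(
                f"column {column!r}: null rate {rate:.2f} exceeds {max_null_rate:.2f}"
            )
    return problems


batch = [
    {"player_id": "mcdavco01", "goals": 64},
    {"player_id": "mcdavco01", "goal": 64},  # simulated renamed column
]
problems = check_batch(batch, ["player_id", "goals"])
```

In a pipeline, a non-empty result would stop the load step and raise an alert instead of writing to the warehouse.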
Platform providers ingest daily game logs and advanced metrics to power player projections and pricing algorithms.
Data scientists train machine learning models on historical Corsi and Fenwick data to predict game outcomes.
Syndicates track team performance trends, goalie metrics, and schedule fatigue to find edges in the betting market.
Sports networks populate on-screen graphics and pre-game research packets with historical franchise records.
Economists and statisticians analyse draft history and career longevity for sports economics papers.
Minor league and international teams benchmark player performance using NHL historical comparables.
"Hockey-Reference holds the definitive historical record of the NHL, but pulling multi-season advanced metrics requires navigating strict rate limits and complex table structures."
Sports Reference properties aggressively throttle automated access. Extracting multi-season game logs or play-by-play data at scale requires distributed request timing, IP rotation, and parsing massive, heavily nested HTML tables. DataFlirt handles the infrastructure so your data science team can focus on building better models.
Everything supported by our hockey-reference.com scraper: comment-embedded tables, rate-limit compliance, incremental updates, and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Scrapy handles high-concurrency extraction. Custom middleware detects HTML comments containing payload data and parses them into standard DOM elements before extraction.
Redis-backed token buckets enforce global rate limits across distributed worker nodes, ensuring we never exceed Sports Reference's strict request thresholds.
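To show the token-bucket idea itself, here is a single-process sketch in plain Python. It is an assumption-laden stand-in: a shared deployment would keep the bucket state in Redis rather than in memory, and the `TokenBucket` class, its parameters, and the 15-per-minute and 5-token values are hypothetical. They are chosen so that the burst plus one minute of refill (5 + 15) cannot exceed 20 requests in any 60-second window.

```python
import time


class TokenBucket:
    """In-memory token bucket; each request spends one token."""

    def __init__(self, rate_per_minute=15, capacity=5, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(capacity)     # maximum burst size
        self.tokens = float(capacity)       # start with a full bucket
        self.clock = clock                  # injectable for testing
        self.updated = clock()

    def _refill(self):
        """Add tokens for the time elapsed since the last update, up to capacity."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Spend one token if available; return whether the request may proceed."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token becomes available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

A worker would call `try_acquire()` before each request and sleep for `wait_time()` when it returns `False`. Keeping the burst capacity small is a deliberate choice: a bucket sized at the full per-minute limit could release a burst and then keep refilling, briefly exceeding the limit it is meant to enforce.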
Pipelines run on Kubernetes. Airflow triggers nightly jobs immediately after West Coast games conclude, ensuring fresh data for morning analytics.
Data delivered to where your team already works — no new tooling required.
About hockey-reference.com scraping, legality, and pipeline operations.
Scraping publicly available statistical data is generally permissible. DataFlirt extracts only public, non-authenticated sports statistics. We do not bypass authentication for Stathead premium features. Clients should review the site's Terms of Service and consult legal counsel.
We use distributed request timing, Redis-backed concurrency locks, and residential IP rotation to keep our extraction below the site's threshold of 20 requests per minute per IP.
Yes. Hockey-Reference loads many advanced tables as commented HTML to improve render speeds. Our parsers automatically strip the comment tags and extract the underlying table structures.
For active seasons, we configure pipelines to run daily, typically scheduled early morning to capture all box scores and updated statistics from the previous night.
Yes. We can extract data going back to the league's inception. Our schema handles missing fields gracefully, as older seasons lack data such as time on ice or blocked shots.
Yes. We provide a sample run covering a specific team season or a list of 100 players during the scoping phase so you can validate the schema and data quality.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need historical franchise records or daily game log updates, we build and operate the pipeline. Tell us what you need.