
Educational data, structured for scale.

We extract course taxonomies, video metadata, transcripts, and exercise structures from Khan Academy. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Courses mapped: 14,291 per run
Video transcripts: 84.2K per month
Exercise items: 312K per run
Active pipelines: 42
Uptime: 99.98%

Data Dictionary

Every field we extract from khanacademy.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Courses & Units objects from khanacademy.org. All fields typed and schema-versioned.

course_id, title, subject, domain, description, unit_count, lesson_count, total_duration_mins, standards_alignment, url
courses_&_units
● 200 OK
{
  "course_id": "cc-eighth-grade-math",
  "title": "8th grade (Eureka Math/EngageNY)",
  "subject": "Math",
  "domain": "Math",
  "unit_count": 7,
  "lesson_count": 124,
  "standards_alignment": "['CCSS.MATH.CONTENT.8.EE.A.1', 'CCSS.MATH.CONTENT.8.EE.A.2']",
  "url": "https://www.khanacademy.org/math/cc-eighth-grade-math"
}
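In the sample record, `standards_alignment` is a string holding a Python-style list, not a JSON array. The sketch below shows one way a consumer might normalise such a field. It assumes the string is always a list literal of strings, and the helper name `parse_list_field` is hypothetical.

```python
import ast


def parse_list_field(raw: str) -> list:
    """Convert a stringified list such as "['a', 'b']" into a real list.

    Assumption: the field always holds a Python-style list literal of strings.
    ast.literal_eval only evaluates literals, so it never executes code.
    """
    value = ast.literal_eval(raw)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"expected a list of strings, got: {raw!r}")
    return value


record = {
    "course_id": "cc-eighth-grade-math",
    "standards_alignment": "['CCSS.MATH.CONTENT.8.EE.A.1', 'CCSS.MATH.CONTENT.8.EE.A.2']",
}
standards = parse_list_field(record["standards_alignment"])
print(standards)
```

The same helper would apply to the `tags` and `standards_mapped` fields on exercise records if they arrive in the same stringified form.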

Complete list of extractable fields for Video Lessons objects from khanacademy.org. All fields typed and schema-versioned.

video_id, youtube_id, title, description, duration_seconds, author, transcript_available, unit_id, course_id, publish_date
video_lessons
● 200 OK
{
  "video_id": "v-intro-to-derivatives",
  "youtube_id": "rAof9Ld5sOg",
  "title": "Introduction to derivatives",
  "duration_seconds": 542,
  "author": "Sal Khan",
  "transcript_available": true,
  "unit_id": "derivatives-1",
  "course_id": "ap-calculus-ab"
}

Complete list of extractable fields for Transcripts objects from khanacademy.org. All fields typed and schema-versioned.

video_id, language, segment_start_sec, segment_end_sec, text, speaker, auto_generated, word_count
transcripts
● 200 OK
{
  "video_id": "v-intro-to-derivatives",
  "language": "en",
  "segment_start_sec": 14.5,
  "segment_end_sec": 18.2,
  "text": "Let us look at the slope of the tangent line.",
  "auto_generated": false,
  "word_count": 10
}

Complete list of extractable fields for Exercises objects from khanacademy.org. All fields typed and schema-versioned.

exercise_id, title, topic, question_count, difficulty_level, tags, standards_mapped, related_video_id, url
exercises
● 200 OK
{
  "exercise_id": "e-power-rule",
  "title": "Power rule practice",
  "topic": "Derivatives",
  "question_count": 4,
  "difficulty_level": "intermediate",
  "tags": "['calculus', 'power-rule', 'differentiation']",
  "standards_mapped": "['AP.CALC.2.1']",
  "url": "https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-1-new/ab-2-5/e/power-rule"
}

Complete list of extractable fields for Articles objects from khanacademy.org. All fields typed and schema-versioned.

article_id, title, author, content_html, word_count, reading_time_mins, unit_id, course_id, last_updated
articles
● 200 OK
{
  "article_id": "a-limits-intro",
  "title": "Limits introduction",
  "author": "Khan Academy Mathematics",
  "word_count": 845,
  "reading_time_mins": 5,
  "unit_id": "limits-1",
  "course_id": "ap-calculus-ab",
  "last_updated": "2024-01-15T10:00:00Z"
}

Capabilities

Extract the entire educational graph

Khan Academy structures content in a deep hierarchy. We traverse domains, subjects, courses, units, and lessons to extract raw video metadata, transcripts, and exercise structures.

Deep Taxonomy Mapping

Extract the exact hierarchical relationship from top-level domains down to individual lesson nodes. Maintain relational integrity across the entire curriculum.

Video & Transcript Extraction

Capture YouTube IDs, duration, titles, and full closed-caption transcripts with timestamped segments for NLP processing.

Article Content Parsing

Extract instructional text, inline equations (LaTeX/KaTeX), and embedded image URLs from article nodes.

Standards Alignment

Map courses and exercises to Common Core State Standards (CCSS), Next Generation Science Standards (NGSS), and AP curriculum tags.

Exercise Metadata

Extract practice problem sets, question counts, difficulty markers, and related prerequisite lesson mappings.

Multi-Language Support

Extract localized curriculum data from es.khanacademy.org, pt.khanacademy.org, and other supported regional subdomains.

GraphQL Interception

Bypass HTML parsing by intercepting Relay GraphQL queries directly, ensuring clean, structured JSON extraction with zero DOM reliance.

Incremental Updates

Track changes to course structures, new lesson additions, and updated transcripts with hash-based diffing. Only ingest what changes.

Instructor & Contributor Data

Extract author attribution, voiceover credits, and content contributor metadata attached to specific video lessons and articles.

// engagement pipeline

From curriculum target to structured dataset

Brief in. Clean data out.

Define Scope
Day 0

Specify target domains, subjects, or specific course URLs. We map the required data points and delivery schema.

Pipeline Build
Days 2–4

We configure GraphQL interceptors, Playwright renderers, and proxy rotation to traverse the course hierarchy.

Validation & QA
Days 4–6

Schema validation, null-rate checks on transcripts, and relational integrity testing across unit-lesson mappings.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
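
The Validation & QA step above can be pictured with a small sketch. This is an illustration only, not the production checks: the helper names `null_rate` and `orphaned` are hypothetical, and the second sample record (`v-hypothetical`) is invented to trigger both checks.

```python
def null_rate(records: list, field: str) -> float:
    """Fraction of records where `field` is missing, None, or an empty string."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def orphaned(children: list, parents: list, key: str) -> list:
    """Child records whose foreign key `key` matches no parent record."""
    known = {p[key] for p in parents}
    return [c for c in children if c.get(key) not in known]


units = [{"unit_id": "derivatives-1"}]
videos = [
    {"video_id": "v-intro-to-derivatives", "unit_id": "derivatives-1"},
    {"video_id": "v-hypothetical", "unit_id": "missing-unit"},  # invented orphan
]
transcripts = [
    {"video_id": "v-intro-to-derivatives", "text": "Let us look at the slope of the tangent line."},
    {"video_id": "v-hypothetical", "text": None},  # invented null transcript
]

print(null_rate(transcripts, "text"))
print(orphaned(videos, units, "unit_id"))
```

A run would typically fail the batch when the null rate crosses an agreed threshold or when any orphaned record appears; the threshold itself is a per-project choice.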

Under the hood

Navigating Khan Academy infrastructure

Modern educational platforms use complex single-page architectures. Here is how we extract clean data from React and GraphQL backends.

pipeline-monitor · khanacademy.org · live · active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

GraphQL parsing
Direct API interception over DOM scraping

Khan Academy uses React and Relay. Scraping the DOM is brittle. Our pipelines intercept the underlying GraphQL queries, extracting clean JSON payloads before they hit the rendering engine.
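
A minimal sketch of response interception with Playwright's sync API follows. It is an illustration under stated assumptions, not the production interceptor: the `/graphql` path marker, the function names, and the idea of filtering on content type are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumption: GraphQL calls are recognisable by this fragment in the URL path.
GRAPHQL_PATH_MARKER = "/graphql"


def is_graphql_json(url: str, content_type: str) -> bool:
    """True when a response looks like a JSON reply from a GraphQL endpoint."""
    in_path = GRAPHQL_PATH_MARKER in urlparse(url).path
    return in_path and "application/json" in content_type.lower()


def capture_graphql(page_url: str) -> list:
    """Load a page and collect the JSON bodies of matching network responses."""
    # Imported lazily so the filter above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    payloads = []

    def on_response(response) -> None:
        content_type = response.headers.get("content-type", "")
        if is_graphql_json(response.url, content_type):
            try:
                payloads.append(response.json())
            except Exception:
                pass  # body unavailable or not valid JSON; skip it

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)  # fires for every network response
        page.goto(page_url, wait_until="networkidle")
        browser.close()
    return payloads
```

Keeping the URL filter as a pure function makes it easy to unit-test separately from the browser session.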

Hierarchy traversal
Maintaining relational integrity

Educational content is useless without context. We maintain foreign keys linking every transcript, exercise, and article back to its parent lesson, unit, course, and subject domain.
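
One way to keep those foreign keys is to flatten the nested tree into rows that each carry a `parent_id`. The sketch below assumes a simple `id` / `title` / `children` node shape and a five-level hierarchy; the node shape, the `flatten` helper, and the sample IDs other than `ap-calculus-ab` and `derivatives-1` are hypothetical.

```python
from typing import Iterator, Optional

# Assumed level names, top to bottom, taken from the hierarchy described above.
LEVELS = ["domain", "subject", "course", "unit", "lesson"]


def flatten(node: dict, depth: int = 0, parent_id: Optional[str] = None) -> Iterator[dict]:
    """Walk a nested tree depth-first, yielding one flat row per node.

    Each row carries `parent_id`, so the tree can be rebuilt with joins.
    """
    yield {
        "id": node["id"],
        "level": LEVELS[depth],
        "title": node["title"],
        "parent_id": parent_id,
    }
    for child in node.get("children", []):
        yield from flatten(child, depth + 1, node["id"])


tree = {
    "id": "math", "title": "Math", "children": [{
        "id": "calculus", "title": "Calculus", "children": [{
            "id": "ap-calculus-ab", "title": "AP Calculus AB", "children": [{
                "id": "derivatives-1", "title": "Derivatives", "children": [{
                    "id": "lesson-hypothetical", "title": "Intro lesson",
                }],
            }],
        }],
    }],
}

rows = list(flatten(tree))
for row in rows:
    print(row["level"], row["id"], "<-", row["parent_id"])
```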

YouTube integration
Extracting embedded video data

Video content relies on embedded YouTube iframes. We extract the underlying video IDs, query YouTube metadata endpoints, and pull full transcript files directly into your dataset.
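
Pulling the video ID out of an embed URL needs only the standard library. The sketch below is illustrative: the function name and the set of URL shapes it handles (embed, watch, and short links) are assumptions, and other URL forms would return `None`.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_id_from_url(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    parts = [p for p in parsed.path.split("/") if p]

    if host == "youtu.be":
        return parts[0] if parts else None
    if host in {"youtube.com", "youtube-nocookie.com"}:
        if len(parts) >= 2 and parts[0] == "embed":
            return parts[1]
        if parts == ["watch"]:
            return parse_qs(parsed.query).get("v", [None])[0]
    return None


print(youtube_id_from_url("https://www.youtube.com/embed/rAof9Ld5sOg?rel=0"))
```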

Math rendering
Preserving KaTeX and LaTeX equations

Math and science articles use complex rendering libraries. We extract the raw KaTeX/LaTeX source strings rather than the rendered markup, ensuring the data remains machine-readable.
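
When only rendered HTML is available, the TeX source can often still be recovered. The sketch below assumes the page was rendered by KaTeX with its MathML output enabled, which embeds the source in an `<annotation encoding="application/x-tex">` element; the class name `TexSourceExtractor` and the sample markup are hypothetical. If the TeX strings are already present in the API payloads, this step is unnecessary.

```python
from html.parser import HTMLParser


class TexSourceExtractor(HTMLParser):
    """Collect the text of <annotation encoding="application/x-tex"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.sources = []      # recovered TeX strings, in document order
        self._buffer = None    # text chunks of the annotation being read

    def handle_starttag(self, tag, attrs):
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical, heavily trimmed KaTeX-style markup.
sample_html = (
    '<span class="katex"><span class="katex-mathml"><math><semantics>'
    "<mrow><mi>x</mi></mrow>"
    r'<annotation encoding="application/x-tex">\frac{d}{dx}x^2 = 2x</annotation>'
    "</semantics></math></span></span>"
)

extractor = TexSourceExtractor()
extractor.feed(sample_html)
extractor.close()
print(extractor.sources)
```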

Change detection
Curriculum version control

Courses update frequently. We hash node content and emit diffs, allowing you to track curriculum evolution without re-ingesting the entire catalogue.
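
Hash-based diffing can be sketched in a few lines. This is an illustration, not the production differ: it assumes each node is a JSON-serialisable dict keyed by a stable ID, and the function names are hypothetical. Serialising with sorted keys keeps the hash stable when only key order changes.

```python
import hashlib
import json


def node_hash(node: dict) -> str:
    """SHA-256 of a canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict, current_nodes: dict) -> dict:
    """Compare stored hashes ({id: hash}) with freshly crawled nodes ({id: node})."""
    current = {node_id: node_hash(node) for node_id, node in current_nodes.items()}
    shared = current.keys() & previous.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(k for k in shared if current[k] != previous[k]),
    }


last_run = {
    "derivatives-1": {"title": "Derivatives", "lesson_count": 12},
    "limits-1": {"title": "Limits", "lesson_count": 9},
}
this_run = {
    "derivatives-1": {"lesson_count": 12, "title": "Derivatives"},  # same content, new key order
    "limits-1": {"title": "Limits", "lesson_count": 10},            # changed
    "integrals-1": {"title": "Integrals", "lesson_count": 8},       # new (hypothetical ID)
}

previous_hashes = {node_id: node_hash(node) for node_id, node in last_run.items()}
changes = diff_snapshots(previous_hashes, this_run)
print(changes)
```

Only the IDs listed under `added` and `changed` would need to be re-ingested downstream.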

Applications

Who uses Khan Academy data

Teams across industries use khanacademy.org data to build competitive products and smarter operations.

01
AI Tutor Training Data

Machine learning teams ingest structured transcripts and articles to fine-tune LLMs on high-quality, pedagogically sound educational content.

02
Curriculum Mapping

EdTech platforms map their own content against Khan Academy's taxonomy to identify curriculum gaps and align with Common Core standards.

03
Multi-language Resource Aggregation

Global education initiatives extract localized content to build supplementary learning portals for non-English-speaking regions.

04
Academic Research

Researchers analyze pedagogical structures, video durations, and exercise sequencing to study optimal digital learning pathways.

05
Alternative Credentialing

Skills platforms map openly available course content to specific job competencies to build custom learning tracks.

06
EdTech Content Gap Analysis

Publishers analyze the density of practice questions per topic to identify underserved subject areas for new product development.

Why DataFlirt

"Khan Academy houses the internet's most comprehensive structured curriculum, but navigating its GraphQL endpoints requires a purpose-built extraction layer."

Extracting educational taxonomies at scale means handling deeply nested node structures, React-driven dynamic rendering, and embedded video states. DataFlirt manages the entire pipeline, mapping domains to units to lessons into relational tables so your engineering team can focus on product development rather than DOM parsing.

Technical Spec

Khan Academy scraper — technical capabilities

Everything supported by our khanacademy.org scraper: rendered SPA elements, rate-limit evasion, and beyond.

Capability | Description | Status
GraphQL interception | Direct extraction of Relay GraphQL payloads for clean JSON data | Supported
YouTube transcript extraction | Full closed-caption text with timestamp arrays | Supported
Taxonomy tree mapping | Preserves relational links from Domain to Lesson node | Supported
Multi-language support | Target specific localized subdomains (es., pt., fr.) | Supported
Common Core extraction | Capture standards tags attached to courses and exercises | Supported
Article HTML extraction | Raw HTML and KaTeX equation strings | Supported
Change detection (diffs) | Only emit records for new or updated curriculum nodes | Supported
User progress data | Individual student mastery points and completion states | Partial
Teacher dashboard analytics | Classroom performance metrics and assignment tracking | Partial

Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

Scrapy + Playwright Stack

Scrapy handles crawl orchestration and deduplication. Playwright handles JavaScript rendering, GraphQL interception, and complex interaction flows.

GraphQL Interception Layer

We bypass brittle HTML parsing by monitoring network traffic and capturing raw JSON responses from internal API endpoints directly.

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested — schema versioned per run
CSV
Flat file with typed columns — Excel/Sheets compatible
XLS
Excel format for non-technical analyst teams
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery — compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
API
Queryable REST endpoints for on-demand data retrieval
PostgreSQL
Upsert into your existing schema with conflict resolution
BigQuery
Streamed directly into your dataset with schema auto-detect
Snowflake
Stage + COPY INTO workflow — incremental or full-replace
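
As a rough picture of the newline-delimited JSON option, the sketch below writes one JSON object per line and stamps each with a version. The `schema_version` key, its value, and the `write_ndjson` helper are assumptions made for the example, not the delivered schema.

```python
import io
import json


def write_ndjson(records, stream, schema_version: str) -> int:
    """Write one JSON object per line to `stream`; return the row count."""
    count = 0
    for record in records:
        row = {"schema_version": schema_version, **record}
        stream.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


buffer = io.StringIO()  # stands in for a file or an upload stream
written = write_ndjson(
    [
        {"video_id": "v-intro-to-derivatives", "duration_seconds": 542},
        {"article_id": "a-limits-intro", "word_count": 845},
    ],
    buffer,
    schema_version="1.0.0",
)
lines = buffer.getvalue().splitlines()
print(written, lines[0])
```

Because every line is an independent JSON document, a consumer can stream the file without loading it whole.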
// faq

Common questions.

About khanacademy.org scraping, legality, and pipeline operations.

Is scraping Khan Academy legal?

Scraping publicly available educational content is generally permissible under fair use for research, analysis, and non-commercial aggregation. DataFlirt targets only public, non-authenticated course, video, and exercise metadata. We do not extract personal student data or circumvent authentication walls. Clients must ensure their downstream use complies with Khan Academy's terms of service and copyright law.

How do you handle the React single-page application?

We use Playwright to execute JavaScript and intercept the underlying Relay GraphQL network requests. This allows us to extract clean, structured JSON data directly from the API layer rather than relying on brittle CSS selectors.

Can you extract video transcripts?

Yes. We extract the embedded YouTube video IDs and pull the full closed-caption transcripts, including timestamped segments and speaker attribution where available.

Do you capture Common Core standards?

Yes. When courses, units, or exercises are tagged with Common Core State Standards (CCSS) or Next Generation Science Standards (NGSS), we extract and map those tags in the final dataset.

Can I get non-English content?

Yes. We can target specific localized subdomains (e.g., es.khanacademy.org) to extract curriculum data in Spanish, Portuguese, French, and other supported languages.

How fresh is the data?

Curriculum structures change relatively slowly. We typically configure these pipelines for weekly or monthly full-catalogue refreshes, with hash-based diffing to highlight new or modified courses.

$ dataflirt scope --new-project --source=khanacademy.org ready

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full dump of the math curriculum or continuous updates on new science units, we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h