SYSTEM all green · source edx.org · queue 18,492 courses · p99 latency 218ms · dataflirt.com · scraper/edx-org
RUN · 37 active pipelines · edx.org live

edX course data,
at warehouse scale.

We extract course listings, institution catalogues, MicroMasters details, pricing, and syllabus structures from edX. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Courses extracted
4,281 /day
Institution catalogues
164 /run
Programme updates
892 /24h
Active pipelines
37
Uptime
99.98%
Data Dictionary

Every field we extract from edx.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Course Listings objects from edx.org. All fields typed and schema-versioned.

course_id, title, institution, subject, level, language, duration_weeks, effort_hours, price_audit, price_verified, enrollment_count, start_date, short_description, full_description, url
course_listings
● 200 OK
{
  "course_id": "course-v1:MITx+6.00.1x+2T2024",
  "title": "Introduction to Computer Science and Programming Using Python",
  "institution": "MITx",
  "subject": "Computer Science",
  "level": "Introductory",
  "language": "English",
  "price_verified": 149.0,
  "enrollment_count": 1245000
}

Complete list of extractable fields for Programmes & MicroMasters objects from edx.org. All fields typed and schema-versioned.

programme_id, title, type, institution, course_count, total_duration, total_price, skills_gained, job_outlook, instructors, average_salary_projection, url
programmes_& micromasters
● 200 OK
{
  "programme_id": "micromasters-mitx-supply-chain-management",
  "title": "Supply Chain Management",
  "type": "MicroMasters",
  "institution": "MITx",
  "course_count": 6,
  "total_price": 1350.0,
  "skills_gained": "['Supply Chain Design', 'Inventory Management']"
}

Complete list of extractable fields for Institutions objects from edx.org. All fields typed and schema-versioned.

institution_id, name, location, description, course_count, programme_count, website_url, logo_url, active_instructors, founded_year
institutions
● 200 OK
{
  "institution_id": "mitx",
  "name": "Massachusetts Institute of Technology",
  "location": "Cambridge, MA",
  "course_count": 142,
  "programme_count": 18,
  "website_url": "https://web.mit.edu"
}

Complete list of extractable fields for Instructors objects from edx.org. All fields typed and schema-versioned.

instructor_id, name, title, institution, bio, image_url, course_count, social_links, academic_background, publications
instructors
● 200 OK
{
  "instructor_id": "john-guttag",
  "name": "John Guttag",
  "title": "Dugald C. Jackson Professor of Computer Science and Electrical Engineering",
  "institution": "MIT",
  "course_count": 4,
  "bio": "John Guttag is a professor at MIT..."
}

Complete list of extractable fields for Syllabus & Modules objects from edx.org. All fields typed and schema-versioned.

course_id, module_number, module_title, expected_duration, content_type, learning_outcomes, assessment_type, prerequisite_modules, video_count, reading_count
syllabus_& modules
● 200 OK
{
  "course_id": "course-v1:MITx+6.00.1x+2T2024",
  "module_number": 1,
  "module_title": "Python Basics",
  "expected_duration": "4 hours",
  "content_type": "['Video', 'Reading', 'Exercise']",
  "learning_outcomes": "['Understand variables', 'Write basic loops']"
}
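
In the sample records above, list-valued fields such as skills_gained, content_type, and learning_outcomes appear as stringified lists. A minimal sketch of decoding them on the consumer side, assuming the values are Python-style list literals as shown; the helper name parse_list_fields and the LIST_FIELDS tuple are illustrative, not part of the delivered schema:

```python
import ast

# Fields that appear as stringified lists in the sample records (assumed set).
LIST_FIELDS = ("skills_gained", "content_type", "learning_outcomes")


def parse_list_fields(record: dict) -> dict:
    """Return a copy of the record with stringified list fields decoded."""
    out = dict(record)
    for name in LIST_FIELDS:
        value = out.get(name)
        if not isinstance(value, str):
            continue
        try:
            # literal_eval only evaluates literals, so it is safe on untrusted text.
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            continue  # leave values that are not valid literals untouched
        if isinstance(parsed, list):
            out[name] = parsed
    return out


module = {
    "course_id": "course-v1:MITx+6.00.1x+2T2024",
    "module_number": 1,
    "content_type": "['Video', 'Reading', 'Exercise']",
    "learning_outcomes": "['Understand variables', 'Write basic loops']",
}
decoded = parse_list_fields(module)
print(decoded["content_type"])
```

Values that fail to parse are passed through unchanged, so a malformed field shows up in downstream validation instead of being silently dropped.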

Capabilities

Complete edX catalogue extraction

Our edX scraper handles the complex nested structures of online learning platforms: course hierarchies, programme bundles, dynamic pricing, and university metadata.

Course Metadata

Extract titles, descriptions, enrollment counts, duration, effort estimates, and prerequisites for every course.

Programme Bundling

Map individual courses to parent structures like MicroMasters, Professional Certificates, and XSeries.

Pricing & Tracks

Capture the cost of the verified certificate track versus the free audit tier, including regional pricing variations.

Syllabus Extraction

Pull detailed module lists, weekly learning objectives, and assessment structures directly from the course outline.

Instructor Profiles

Scrape instructor names, academic titles, biographies, and institutional affiliations across all courses.

Institution Catalogues

Aggregate data at the university level to track total course output and programme offerings per institution.

Enrollment Metrics

Track student enrollment numbers over time to identify trending subjects and high-demand skills.

Skill Tagging

Extract the specific competencies and skills listed for each course to build job-market correlation models.

Multi-Language Support

Capture course availability and transcript languages across global offerings.

Bootcamps & Exec Ed

Include premium offerings like university bootcamps and executive education programmes in the dataset.

// engagement pipeline

From course URL to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide target institutions, subjects, or programme types. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy and Playwright crawlers to navigate edX's single-page architecture and pagination.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and nested-data resolution checks before full launch.
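
A null-rate check of the kind described here can be sketched in a few lines. This is an illustration only: the function names and the 0.2 threshold are assumptions chosen for the example, not the thresholds used in the production pipeline.

```python
def null_rates(records, fields):
    """Fraction of records where each field is missing, None, or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for rec in records if rec.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


def failing_fields(records, fields, threshold):
    """Names of fields whose null rate exceeds the threshold, sorted."""
    rates = null_rates(records, fields)
    return sorted(name for name, rate in rates.items() if rate > threshold)


# Hypothetical pilot batch; the values are placeholders, not real edX data.
sample = [
    {"course_id": "c1", "title": "A", "price_verified": 149.0},
    {"course_id": "c2", "title": "B", "price_verified": None},
    {"course_id": "c3", "title": "C", "price_verified": 99.0},
    {"course_id": "c4", "title": "", "price_verified": 49.0},
]
FIELDS = ["course_id", "title", "price_verified"]
rates = null_rates(sample, FIELDS)
blocked = failing_fields(sample, FIELDS, threshold=0.2)
print(rates, blocked)
```

A gate like this would typically run on the pilot batch, with any field in blocked holding back the full launch until the extraction rule is fixed.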

Delivery
ongoing

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

Bypassing edX extraction hurdles

edX relies on modern JavaScript frameworks and aggressive rate limiting. Here is how our infrastructure maintains constant throughput.

pipeline-monitor · edx.org · live · active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued · running
// observability
Pipeline health
99.9% uptime
142ms p99 latency
0.3% null rate
2 alerts
SPA handling
Executing JavaScript for dynamic content

edX uses modern frontend frameworks that load course details dynamically. We use Playwright to execute JavaScript, wait for API calls to resolve, and extract the fully rendered DOM.

Rate limiting
Distributed request timing

Aggressive scraping triggers CDN blocks. We distribute requests across residential proxies and normalise request headers to mimic standard browser behaviour.

Nested data resolution
Mapping courses to programmes

A single course can belong to several programmes. Our pipeline maintains relational integrity, outputting normalised tables that link courses, programmes, and institutions by ID.
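
As a rough sketch of this normalisation, the snippet below splits nested course records into flat tables plus a course-to-programme link table, then checks that no link points at a missing row. The input shape (a nested "programmes" list on each course) and all IDs are hypothetical placeholders:

```python
def normalise(course_records):
    """Split nested course records into flat tables joined by ID."""
    courses, programmes, institutions = {}, {}, {}
    links = set()  # (course_id, programme_id) pairs
    for rec in course_records:
        inst = rec["institution"]
        institutions[inst] = {"institution_id": inst}
        courses[rec["course_id"]] = {
            "course_id": rec["course_id"],
            "title": rec["title"],
            "institution_id": inst,
        }
        for prog in rec.get("programmes", []):
            programmes[prog["programme_id"]] = prog
            links.add((rec["course_id"], prog["programme_id"]))
    return {
        "courses": list(courses.values()),
        "programmes": list(programmes.values()),
        "institutions": list(institutions.values()),
        "course_programmes": sorted(links),
    }


def orphan_links(tables):
    """Links that reference a course or programme missing from its table."""
    course_ids = {c["course_id"] for c in tables["courses"]}
    programme_ids = {p["programme_id"] for p in tables["programmes"]}
    return [
        link for link in tables["course_programmes"]
        if link[0] not in course_ids or link[1] not in programme_ids
    ]


# Hypothetical input: course-a sits in two programmes, prog-x holds two courses.
nested = [
    {"course_id": "course-a", "title": "A", "institution": "inst-1",
     "programmes": [{"programme_id": "prog-x", "title": "X"},
                    {"programme_id": "prog-y", "title": "Y"}]},
    {"course_id": "course-b", "title": "B", "institution": "inst-1",
     "programmes": [{"programme_id": "prog-x", "title": "X"}]},
]
tables = normalise(nested)
print(tables["course_programmes"], orphan_links(tables))
```

Keeping the many-to-many relationship in its own link table means each programme is stored once, however many courses reference it.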

Geolocation pricing
Region-specific cost extraction

edX alters pricing based on the user's IP address. We route requests through specific regional proxies to capture accurate local pricing data.

Schema drift
Resilient DOM selectors

Platform updates frequently break naive scrapers. We use fallback selector chains and monitor payload structures to ensure uninterrupted data delivery.
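
The fallback-chain idea can be illustrated without a browser. The sketch below applies it to key paths in a JSON payload in place of DOM selectors, to keep the example dependency-free; the paths in TITLE_CHAIN and both payload layouts are invented for the illustration:

```python
def resolve(payload, path):
    """Walk a tuple of keys through nested dicts; None if any key is absent."""
    node = payload
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def first_match(payload, chain):
    """Try each path in order; return (value, index of the path that matched)."""
    for index, path in enumerate(chain):
        value = resolve(payload, path)
        if value not in (None, ""):
            return value, index
    return None, None


# Hypothetical chain: preferred location first, older or alternate ones after.
TITLE_CHAIN = [
    ("course", "title"),
    ("data", "course", "name"),
    ("meta", "og:title"),
]

old_layout = {"course": {"title": "Python Basics"}}
new_layout = {"data": {"course": {"name": "Python Basics"}}}

print(first_match(old_layout, TITLE_CHAIN))
print(first_match(new_layout, TITLE_CHAIN))
```

Returning the index of the matching path is a deliberate choice: a rising share of matches at index 1 or later is a cheap signal that the page structure has drifted and the primary selector needs updating.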

Applications

Who uses edX data

Teams across industries use edx.org data to build competitive products and smarter operations.

01
EdTech Market Intelligence

Platform operators analyse edX course structures, pricing, and enrollment volumes to identify gaps in their own offerings.

02
Competitor Benchmarking

Universities track peer institution catalogues to benchmark their online presence and programme portfolio.

03
Corporate L&D Planning

Enterprise training teams aggregate course data to build internal learning paths aligned with specific skill requirements.

04
Academic Research

Researchers analyse syllabus trends and instructor networks to map the evolution of emerging academic disciplines.

05
SEO & Content Strategy

Content marketers extract course descriptions and learning outcomes to optimise educational lead generation portals.

06
Labour Market Analysis

Analysts correlate edX skill tags and enrollment metrics with job market demand to forecast future talent supply.

Why DataFlirt

"edX aggregates the world's top universities into a single platform. Extracting this data reveals the exact skills and curricula driving the modern knowledge economy."

Mapping edX requires traversing complex programme-to-course hierarchies, executing JavaScript to render dynamic pricing tracks, and bypassing strict CDN rate limits. DataFlirt handles the proxy rotation and DOM parsing so your team can focus on curriculum analysis and market intelligence.

Technical Spec

edX scraper - technical capabilities

Everything supported by our edx.org scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

JavaScript rendering · Full Playwright sessions for dynamic course pages · Supported
CAPTCHA bypass · Automated 2Captcha + CapSolver integration · Supported
Geotargeted pricing · Extract prices using region-specific residential IPs · Supported
Syllabus extraction · Nested module and lesson data capture · Supported
Instructor mapping · Relational linking between courses and academics · Supported
Learner reviews · Pagination through all available course reviews · Supported
Change detection (diffs) · Only emit records with changed fields since last run · Supported
Webhook delivery · HTTP POST per record or batch · Supported
Student forum posts · Requires authenticated student enrollment to access discussion boards · Partial
Video lecture files · Gated behind individual user enrollment and DRM protection · Partial
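
The change-detection row above ("only emit records with changed fields since last run") could work roughly as follows. This is a sketch under assumptions: records are keyed by course_id, both runs fit in memory, and the sample prices are placeholders, not observed edX values.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record, independent of key order."""
    encoded = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def diff_run(previous, current, key="course_id"):
    """Describe records that are new or changed since the previous run."""
    prev_by_id = {rec[key]: rec for rec in previous}
    changes = []
    for rec in current:
        old = prev_by_id.get(rec[key])
        if old is None:
            changes.append({"key": rec[key], "status": "new", "fields": sorted(rec)})
        elif fingerprint(old) != fingerprint(rec):
            fields = sorted(
                name for name in set(old) | set(rec)
                if old.get(name) != rec.get(name)
            )
            changes.append({"key": rec[key], "status": "changed", "fields": fields})
    return changes


previous = [
    {"course_id": "c1", "title": "A", "price_verified": 149.0},
    {"course_id": "c2", "title": "B", "price_verified": 99.0},
]
current = [
    {"course_id": "c1", "title": "A", "price_verified": 159.0},
    {"course_id": "c2", "title": "B", "price_verified": 99.0},
    {"course_id": "c3", "title": "C", "price_verified": 49.0},
]
changes = diff_run(previous, current)
print(changes)
```

Comparing fingerprints first keeps the common case (nothing changed) cheap; the field-by-field comparison only runs for records whose hash differs.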
Infrastructure

Infrastructure powering the edX pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus · Snowflake · BigQuery
Scrapy + Playwright Stack

Scrapy handles orchestration and deduplication. Playwright manages JavaScript execution for edX's dynamic frontend.

Residential Proxy Infrastructure

Pools of residential ISP proxies prevent rate limiting and enable accurate geotargeted price extraction.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling and dependency management for complex nested scrapes.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested arrays
CSV
Flat file with typed columns
XLS
Excel-compatible format for business teams
Parquet
Columnar format for data warehouses
AWS S3
Direct bucket delivery, compatible with any data lake
Webhook
HTTP POST per record
API
REST endpoint for on-demand querying
BigQuery
Streamed directly into your dataset
PostgreSQL
Direct database upserts
// faq

Common questions.

About edx.org scraping, legality, and pipeline operations.

Ask us directly →
Can you extract data for all edX partner universities?

Yes. The pipeline captures data across all partner institutions listed on the platform, mapping courses accurately to their respective universities.

How do you handle courses that belong to multiple programmes?

Our schema uses relational IDs. A single course record will contain an array of programme IDs it belongs to, preventing data duplication while maintaining structural accuracy.

Is the verified track pricing accurate globally?

We use residential proxies to simulate requests from specific countries, ensuring the pricing data reflects local currency and purchasing power parity adjustments applied by edX.

Can you extract learner reviews and ratings?

Yes. We paginate through the review sections on course landing pages to extract text, ratings, and timestamps.

Do you scrape the actual course video content?

No. Video content and internal course materials are gated behind user enrollment and authentication. We only extract publicly available metadata, syllabi, and catalogue information.

How frequently can the catalogue be updated?

We can configure pipelines to run daily, weekly, or monthly depending on your requirements for tracking new course launches and price changes.

Does the extraction include Executive Education and Bootcamps?

Yes. We capture the specific metadata associated with premium offerings, including cohort start dates and application requirements.

What format is the syllabus data delivered in?

Syllabus data is typically delivered as a nested JSON array within the course record, detailing modules, expected duration, and learning outcomes per section.
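
For teams that want the nested syllabus as flat rows (for the CSV output, say), a flattening step might look like the sketch below. The "syllabus" key name and the row layout are assumptions for illustration; the module values reuse the sample record from the data dictionary.

```python
import csv
import io


def flatten_syllabus(course):
    """One flat row per module, with the parent course_id repeated on each row."""
    rows = []
    for module in course.get("syllabus", []):
        rows.append({
            "course_id": course["course_id"],
            "module_number": module["module_number"],
            "module_title": module["module_title"],
            "expected_duration": module.get("expected_duration", ""),
            # Lists are joined so that each row stays a single CSV line.
            "learning_outcomes": "; ".join(module.get("learning_outcomes", [])),
        })
    return rows


def to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


course = {
    "course_id": "course-v1:MITx+6.00.1x+2T2024",
    "syllabus": [
        {
            "module_number": 1,
            "module_title": "Python Basics",
            "expected_duration": "4 hours",
            "learning_outcomes": ["Understand variables", "Write basic loops"],
        },
    ],
}
rows = flatten_syllabus(course)
csv_text = to_csv(rows)
print(csv_text)
```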

$ dataflirt scope --new-project --source=edx.org ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or continuous monitoring of university course offerings, we build and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h