
RUN · 37 active pipelines · edx.org live

edX course data, at warehouse scale.

We extract course listings, institution catalogues, MicroMasters details, pricing, and syllabus structures from edX. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Get data from edx.org → See how it works

Courses extracted

4,281 /day

Institution catalogues

164 /run

Programme updates

892 /24h

Active pipelines

37

Uptime

99.98%

◆ edX Course Catalogues ◆ MicroMasters Programmes ◆ Professional Certificates ◆ University Profiles ◆ Instructor Metadata ◆ Course Pricing & Tracks ◆ Syllabus Structures ◆ Learner Reviews ◆ Executive Education ◆ Bootcamps Data ◆ Skill Tags ◆ Managed Pipeline ◆ S3 / BigQuery Delivery ◆ Enterprise SLA

Data Dictionary

Every field we extract from edx.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Course Listings objects from edx.org. All fields typed and schema-versioned.

course_id, title, institution, subject, level, language, duration_weeks, effort_hours, price_audit, price_verified, enrollment_count, start_date, short_description, full_description, url

{
  "course_id": "course-v1:MITx+6.00.1x+2T2024",
  "title": "Introduction to Computer Science and Programming Using Python",
  "institution": "MITx",
  "subject": "Computer Science",
  "level": "Introductory",
  "language": "English",
  "price_verified": 149.0,
  "enrollment_count": 1245000
}

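As a rough illustration of what "typed" could mean on the consuming side, the sketch below parses the sample course record into a Python dataclass. The class name `CourseListing`, the helper `parse_course`, and the decision to treat most fields as optional are assumptions made for this example, not DataFlirt's published schema.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CourseListing:
    # Field names are taken from the sample record; marking fields
    # Optional is an assumption about which values may be absent.
    course_id: str
    title: str
    institution: str
    subject: Optional[str] = None
    level: Optional[str] = None
    language: Optional[str] = None
    price_verified: Optional[float] = None
    enrollment_count: Optional[int] = None


def parse_course(raw: str) -> CourseListing:
    """Parse one JSON object, keeping only fields the dataclass declares."""
    data = json.loads(raw)
    known = {f.name for f in fields(CourseListing)}
    return CourseListing(**{k: v for k, v in data.items() if k in known})


sample = """{
  "course_id": "course-v1:MITx+6.00.1x+2T2024",
  "title": "Introduction to Computer Science and Programming Using Python",
  "institution": "MITx",
  "subject": "Computer Science",
  "level": "Introductory",
  "language": "English",
  "price_verified": 149.0,
  "enrollment_count": 1245000
}"""

course = parse_course(sample)
```

Ignoring unknown keys means a consumer written this way would keep working if new fields were added to the feed.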

Complete list of extractable fields for Programmes & MicroMasters objects from edx.org. All fields typed and schema-versioned.

programme_id, title, type, institution, course_count, total_duration, total_price, skills_gained, job_outlook, instructors, average_salary_projection, url

{
  "programme_id": "micromasters-mitx-supply-chain-management",
  "title": "Supply Chain Management",
  "type": "MicroMasters",
  "institution": "MITx",
  "course_count": 6,
  "total_price": 1350.0,
  "skills_gained": ["Supply Chain Design", "Inventory Management"]
}


Complete list of extractable fields for Institutions objects from edx.org. All fields typed and schema-versioned.

institution_id, name, location, description, course_count, programme_count, website_url, logo_url, active_instructors, founded_year

{
  "institution_id": "mitx",
  "name": "Massachusetts Institute of Technology",
  "location": "Cambridge, MA",
  "course_count": 142,
  "programme_count": 18,
  "website_url": "https://web.mit.edu"
}


Complete list of extractable fields for Instructors objects from edx.org. All fields typed and schema-versioned.

instructor_id, name, title, institution, bio, image_url, course_count, social_links, academic_background, publications

{
  "instructor_id": "john-guttag",
  "name": "John Guttag",
  "title": "Dugald C. Jackson Professor of Computer Science and Electrical Engineering",
  "institution": "MIT",
  "course_count": 4,
  "bio": "John Guttag is a professor at MIT..."
}


Complete list of extractable fields for Syllabus & Modules objects from edx.org. All fields typed and schema-versioned.

course_id, module_number, module_title, expected_duration, content_type, learning_outcomes, assessment_type, prerequisite_modules, video_count, reading_count

{
  "course_id": "course-v1:MITx+6.00.1x+2T2024",
  "module_number": 1,
  "module_title": "Python Basics",
  "expected_duration": "4 hours",
  "content_type": ["Video", "Reading", "Exercise"],
  "learning_outcomes": ["Understand variables", "Write basic loops"]
}


Capabilities

Complete edX catalogue extraction

Our edX scraper handles the complex nested structures of online learning platforms: course hierarchies, programme bundles, dynamic pricing, and university metadata.

Course Metadata

Extract titles, descriptions, enrollment counts, duration, effort estimates, and prerequisites for every course.

Programme Bundling

Map individual courses to parent structures like MicroMasters, Professional Certificates, and XSeries.

Pricing & Tracks

Capture the cost of the verified certificate track versus the free audit tier, including regional pricing variations.

Syllabus Extraction

Pull detailed module lists, weekly learning objectives, and assessment structures directly from the course outline.

Instructor Profiles

Scrape instructor names, academic titles, biographies, and institutional affiliations across all courses.

Institution Catalogues

Aggregate data at the university level to track total course output and programme offerings per institution.

Enrollment Metrics

Track student enrollment numbers over time to identify trending subjects and high-demand skills.

Skill Tagging

Extract the specific competencies and skills listed for each course to build job-market correlation models.

Multi-Language Support

Capture course availability and transcript languages across global offerings.

Bootcamps & Exec Ed

Include premium offerings like university bootcamps and executive education programmes in the dataset.

// engagement pipeline

From course URL to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Provide target institutions, subjects, or programme types. We design the extraction schema together.

Pipeline Build

Days 2–4

We configure Scrapy and Playwright crawlers to navigate edX's single-page architecture and pagination.

Validation & QA

Days 4–6

Schema validation, null-rate checks, and nested-data resolution checks before full launch.
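To make the validation step concrete, here is a minimal sketch of a null-rate check and a type check over a batch of records. The `SCHEMA` mapping, the function names, and the sample batch are hypothetical and only illustrate the idea; a production check would likely cover every field and compare rates against agreed thresholds.

```python
from typing import Any

# Hypothetical expected types for a handful of course fields.
SCHEMA: dict[str, type] = {
    "course_id": str,
    "title": str,
    "price_verified": float,
    "enrollment_count": int,
}


def null_rates(records: list[dict[str, Any]]) -> dict[str, float]:
    """Fraction of records in which each schema field is missing or None."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in SCHEMA}
    return {
        name: sum(1 for r in records if r.get(name) is None) / total
        for name in SCHEMA
    }


def type_errors(record: dict[str, Any]) -> list[str]:
    """Names of fields whose non-null value has an unexpected type."""
    return [
        name
        for name, expected in SCHEMA.items()
        if record.get(name) is not None and not isinstance(record[name], expected)
    ]


batch = [
    {"course_id": "a", "title": "A", "price_verified": 149.0, "enrollment_count": 10},
    {"course_id": "b", "title": "B", "price_verified": None, "enrollment_count": "12"},
]
rates = null_rates(batch)      # per-field share of null values
bad = type_errors(batch[1])    # fields with the wrong type in one record
```

In this toy batch, half of the `price_verified` values are null and the second record carries `enrollment_count` as a string, which is the kind of issue a pre-launch check is meant to surface.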

Delivery

ongoing

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

Bypassing edX extraction hurdles

edX relies on modern JavaScript frameworks and aggressive rate limiting. Here is how our infrastructure maintains constant throughput.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued (running)

// observability

Pipeline health

Uptime: 99.9%

p99 latency: 142ms

Null rate: 0.3%


SPA handling

Executing JavaScript for dynamic content

edX uses modern frontend frameworks that load course details dynamically. We use Playwright to execute JavaScript, wait for API calls to resolve, and extract the fully rendered DOM.

Rate limiting

Distributed request timing

Aggressive scraping triggers CDN blocks. We distribute requests across residential proxies and normalise request headers to mimic standard browser behaviour.
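One way to picture distributed request timing is a schedule that rotates through a proxy pool and spaces requests with random jitter. The sketch below is only an assumption about how such a plan could be built; the function name, the delay values, and the proxy labels are placeholders, and it does not send any requests.

```python
import itertools
import random


def schedule_requests(urls, proxies, base_delay=2.0, jitter=0.5, seed=None):
    """Assign each URL a proxy (round-robin) and a jittered start offset.

    Returns a list of (offset_seconds, proxy, url) tuples. Consecutive
    offsets differ by base_delay plus or minus up to `jitter` seconds,
    so requests are not evenly spaced.
    """
    rng = random.Random(seed)
    proxy_cycle = itertools.cycle(proxies)
    plan = []
    offset = 0.0
    for url in urls:
        plan.append((round(offset, 3), next(proxy_cycle), url))
        offset += base_delay + rng.uniform(-jitter, jitter)
    return plan


plan = schedule_requests(
    urls=[f"https://example.com/page/{i}" for i in range(4)],
    proxies=["proxy-a", "proxy-b"],  # placeholder proxy labels
    seed=7,
)
```

Passing a `seed` makes the plan reproducible for testing; leaving it as `None` gives a different spacing on every run.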

Nested data resolution

Mapping courses to programmes

A single course can belong to several programmes. Our pipeline preserves these relationships by outputting normalised tables that link courses, programmes, and institutions through shared IDs.
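The normalised output described here can be sketched as three structures: a course table, a programme table, and a link table of ID pairs. The nested input shape (a `programmes` list inside each course) and the IDs below are invented for illustration and are not taken from a real edX record.

```python
def normalise(nested_courses):
    """Split nested course records into courses, programmes, and a link table."""
    courses, programmes, links = {}, {}, set()
    for rec in nested_courses:
        courses[rec["course_id"]] = {
            "course_id": rec["course_id"],
            "title": rec["title"],
            "institution": rec["institution"],
        }
        for prog in rec.get("programmes", []):
            # A programme seen under several courses is stored once.
            programmes[prog["programme_id"]] = prog
            links.add((rec["course_id"], prog["programme_id"]))
    return list(courses.values()), list(programmes.values()), sorted(links)


# Placeholder data: course-1 sits in two programmes, course-2 in one.
nested = [
    {
        "course_id": "course-1",
        "title": "Course One",
        "institution": "ExampleX",
        "programmes": [
            {"programme_id": "prog-a", "title": "Programme A"},
            {"programme_id": "prog-b", "title": "Programme B"},
        ],
    },
    {
        "course_id": "course-2",
        "title": "Course Two",
        "institution": "ExampleX",
        "programmes": [{"programme_id": "prog-a", "title": "Programme A"}],
    },
]

course_rows, programme_rows, link_rows = normalise(nested)
```

Keeping the many-to-many relationship in its own link table avoids repeating programme details on every course row.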

Geolocation pricing

Region-specific cost extraction

edX alters pricing based on the user's IP address. We route requests through specific regional proxies to capture accurate local pricing data.

Schema drift

Resilient DOM selectors

Platform updates frequently break naive scrapers. We use fallback selector chains and monitor payload structures to ensure uninterrupted data delivery.
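A fallback selector chain can be pictured as an ordered list of extractors where the first one that yields a value wins. The sketch below uses regular expressions over an HTML string purely to stay dependency-free; a real crawler would more likely use CSS or XPath selectors, and the markup patterns shown are hypothetical rather than edX's actual page structure.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor returning the first capture group, or None."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


def first_match(html: str, chain: list[Extractor]) -> tuple[Optional[str], int]:
    """Try extractors in order; return (value, index used) or (None, -1)."""
    for index, extractor in enumerate(chain):
        value = extractor(html)
        if value:
            return value, index
    return None, -1


# Hypothetical markup variants for a course title, preferred one first.
TITLE_CHAIN = [
    regex_extractor(r'<h1[^>]*data-testid="course-title"[^>]*>(.*?)</h1>'),
    regex_extractor(r'<h1[^>]*class="[^"]*course-title[^"]*"[^>]*>(.*?)</h1>'),
    regex_extractor(r"<title>(.*?)</title>"),
]

old_layout = (
    "<html><title>Intro to Python</title>"
    '<h1 class="hero course-title">Intro to Python</h1></html>'
)
value, used = first_match(old_layout, TITLE_CHAIN)
```

Returning the index of the extractor that matched makes it possible to monitor how often the primary selector fails, which can serve as an early signal of layout changes.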

Applications

Who uses edX data

Teams across industries use edx.org data to build competitive products and smarter operations.

EdTech Market Intelligence

Platform operators analyse edX course structures, pricing, and enrollment volumes to identify gaps in their own offerings.

Competitor Benchmarking

Universities track peer institution catalogues to benchmark their online presence and programme portfolio.

Corporate L&D Planning

Enterprise training teams aggregate course data to build internal learning paths aligned with specific skill requirements.

Academic Research

Researchers analyse syllabus trends and instructor networks to map the evolution of emerging academic disciplines.

SEO & Content Strategy

Content marketers extract course descriptions and learning outcomes to optimise educational lead generation portals.

Labour Market Analysis

Analysts correlate edX skill tags and enrollment metrics with job market demand to forecast future talent supply.

Why DataFlirt

"edX aggregates the world's top universities into a single platform. Extracting this data reveals the exact skills and curricula driving the modern knowledge economy."

Mapping edX requires traversing complex programme-to-course hierarchies, executing JavaScript to render dynamic pricing tracks, and bypassing strict CDN rate limits. DataFlirt handles the proxy rotation and DOM parsing so your team can focus on curriculum analysis and market intelligence.

Technical Spec

edX scraper - technical capabilities

Everything supported by our edx.org scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

JavaScript rendering

Full Playwright sessions for dynamic course pages

Supported

CAPTCHA bypass

Automated 2Captcha + CapSolver integration

Supported

Geotargeted pricing

Extract prices using region-specific residential IPs

Supported

Syllabus extraction

Nested module and lesson data capture

Supported

Instructor mapping

Relational linking between courses and academics

Supported

Learner reviews

Pagination through all available course reviews

Supported

Change detection (diffs)

Only emit records with changed fields since last run

Supported

Webhook delivery

HTTP POST per record or batch

Supported

Student forum posts

Requires authenticated student enrollment to access discussion boards

Partial

Video lecture files

Gated behind individual user enrollment and DRM protection

Partial
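The change-detection capability listed above can be approximated by fingerprinting each record and comparing against the previous run. This is a minimal sketch under that assumption; the function names and the choice of SHA-256 over canonical JSON are illustrative, not a description of the production implementation.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a record; key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(previous: dict[str, str], current: list[dict], key: str = "course_id"):
    """Return records that are new or whose fingerprint differs from last run.

    `previous` maps record key -> fingerprint from the prior run. The
    second return value is the updated map to persist for the next run.
    """
    emitted, updated = [], {}
    for record in current:
        digest = fingerprint(record)
        updated[record[key]] = digest
        if previous.get(record[key]) != digest:
            emitted.append(record)
    return emitted, updated


first_run = [
    {"course_id": "c1", "price_verified": 149.0},
    {"course_id": "c2", "price_verified": 99.0},
]
emitted_1, state = changed_records({}, first_run)

second_run = [
    {"course_id": "c1", "price_verified": 149.0},
    {"course_id": "c2", "price_verified": 79.0},  # price changed
]
emitted_2, state = changed_records(state, second_run)
```

On the first run every record is emitted; on the second, only the record whose price changed comes through.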

Infrastructure

Infrastructure powering the edX pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, Snowflake, BigQuery

Scrapy + Playwright Stack

Scrapy handles orchestration and deduplication. Playwright manages JavaScript execution for edX's dynamic frontend.

Residential Proxy Infrastructure

Pools of residential ISP proxies reduce the risk of rate limiting and enable accurate geotargeted price extraction.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling and dependency management for complex nested scrapes.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested arrays

CSV

Flat file with typed columns

XLS

Excel compatible format for business teams

Parquet

Columnar format for data warehouses

AWS S3

Direct bucket delivery, compatible with any data lake

Webhook

HTTP POST per record

API

REST endpoint for on-demand querying

BigQuery

Streamed directly into your dataset

PostgreSQL

Direct database upserts
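To show how the same records might look in two of the formats listed above, the sketch below serialises a small batch as newline-delimited JSON and as a flat CSV using only the standard library. The helper names and the sample records are made up for the example.

```python
import csv
import io
import json


def to_ndjson(records: list[dict]) -> str:
    """One JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records: list[dict], columns: list[str]) -> str:
    """Flat CSV with a fixed column order; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {"course_id": "c1", "title": "A", "price_verified": 149.0},
    {"course_id": "c2", "title": "B"},  # no verified price captured
]
ndjson_text = to_ndjson(records)
csv_text = to_csv(records, ["course_id", "title", "price_verified"])
```

Newline-delimited JSON keeps nested fields intact, while the CSV variant needs an explicit column list and flattens anything that is not a scalar.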


// faq

Common questions.

About edx.org scraping, legality, and pipeline operations.

Ask us directly →

Can you extract data for all edX partner universities?

Yes. The pipeline captures data across all partner institutions listed on the platform, mapping courses accurately to their respective universities.

How do you handle courses that belong to multiple programmes?

Our schema uses relational IDs. A single course record will contain an array of programme IDs it belongs to, preventing data duplication while maintaining structural accuracy.

Is the verified track pricing accurate globally?

We use residential proxies to simulate requests from specific countries, ensuring the pricing data reflects local currency and purchasing power parity adjustments applied by edX.

Can you extract learner reviews and ratings?

Yes. We paginate through the review sections on course landing pages to extract text, ratings, and timestamps.

Do you scrape the actual course video content?

No. Video content and internal course materials are gated behind user enrollment and authentication. We only extract publicly available metadata, syllabi, and catalogue information.

How frequently can the catalogue be updated?

We can configure pipelines to run daily, weekly, or monthly depending on your requirements for tracking new course launches and price changes.

Does the extraction include Executive Education and Bootcamps?

Yes. We capture the specific metadata associated with premium offerings, including cohort start dates and application requirements.

What format is the syllabus data delivered in?

Syllabus data is typically delivered as a nested JSON array within the course record, detailing modules, expected duration, and learning outcomes per section.
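For teams that prefer flat tables, a nested syllabus array can be unrolled into one row per module. The sketch below assumes the array lives under a key named `syllabus` and that `learning_outcomes` arrives as a list; both are assumptions for illustration, and the module values are the ones from the data dictionary sample.

```python
def flatten_syllabus(course: dict) -> list[dict]:
    """Turn a course record with a nested `syllabus` array into one row per module."""
    rows = []
    for module in course.get("syllabus", []):
        rows.append({
            "course_id": course["course_id"],
            "module_number": module["module_number"],
            "module_title": module["module_title"],
            "expected_duration": module.get("expected_duration"),
            # Join list values so the row fits a flat CSV or SQL column.
            "learning_outcomes": "; ".join(module.get("learning_outcomes", [])),
        })
    return rows


course = {
    "course_id": "course-v1:MITx+6.00.1x+2T2024",
    "syllabus": [  # hypothetical key name for the nested array
        {
            "module_number": 1,
            "module_title": "Python Basics",
            "expected_duration": "4 hours",
            "learning_outcomes": ["Understand variables", "Write basic loops"],
        }
    ],
}

rows = flatten_syllabus(course)
```

Each output row repeats the `course_id`, so the flattened table can be joined back to the course listing.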

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or continuous monitoring of university course offerings, we build and operate the pipeline. Tell us what you need.

Start an edx.org pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

