JD.com Scraper - Product, Pricing & Review Data Extraction

Data Dictionary

Every field we extract from jd.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Product Listings objects from jd.com. All fields typed and schema-versioned.

sku_id, title, brand, category, sub_category, price, jd_plus_price, in_stock, self_operated, jd_delivery, rating, review_count, specs, image_urls, variation_count, parent_sku, weight, page_url

"sku_id": "100012043978",
"title": "Apple iPhone 14 Pro Max 256GB",
"brand": "Apple",
"price": 8999.0,
"jd_plus_price": 8899.0,
"self_operated": true,
"jd_delivery": true,
"rating": 98.5,
"review_count": 2000000,
"in_stock": true

Complete list of extractable fields for Pricing & Offers objects from jd.com. All fields typed and schema-versioned.

sku_id, current_price, original_price, discount_pct, flash_sale, flash_sale_end, jd_plus_price, coupon_details, bulk_discount, buybox_price, price_timestamp, currency

"sku_id": "100012043978",
"current_price": 8999.0,
"original_price": 9899.0,
"discount_pct": 9,
"flash_sale": false,
"jd_plus_price": 8899.0,
"coupon_details": "Minus 200 over 4000",
"price_timestamp": "2026-05-12T09:14:00Z"

Complete list of extractable fields for Reviews & Ratings objects from jd.com. All fields typed and schema-versioned.

review_id, sku_id, user_name, user_level, plus_member, star_rating, content, creation_time, helpful_votes, reply_count, image_urls, video_url, variant_reviewed

"review_id": "1847294719",
"sku_id": "100012043978",
"star_rating": 5,
"plus_member": true,
"content": "Excellent battery life and camera.",
"helpful_votes": 42,
"creation_time": "2026-04-18 14:22:10",
"user_level": "Diamond"

Complete list of extractable fields for Seller Data objects from jd.com. All fields typed and schema-versioned.

shop_id, shop_name, shop_url, self_operated, rating_product, rating_service, rating_logistics, follower_count, company_name, business_license, joined_date, active_sku_count

"shop_id": "1000000127",
"shop_name": "Apple JD Self-operated Store",
"self_operated": true,
"rating_product": 9.9,
"rating_service": 9.9,
"rating_logistics": 9.9,
"follower_count": 45000000,
"company_name": "JD.com"

Complete list of extractable fields for Search Results objects from jd.com. All fields typed and schema-versioned.

keyword, position, sku_id, title, price, comment_count, shop_name, self_operated, ad_flag, jd_logistics, thumbnail_url, scraped_at

"keyword": "smartphone",
"position": 1,
"sku_id": "100012043978",
"ad_flag": false,
"self_operated": true,
"price": 8999.0,
"comment_count": "200W+",
"scraped_at": "2026-05-12T09:14:33Z"

Capabilities

Everything you need from JD.com - nothing you don't

Our JD scraper navigates the Chinese e-commerce ecosystem: handling heavy JavaScript, slide CAPTCHAs, dynamic pricing widgets, and geo-fenced content to extract structured electronics data.

Full Product Extraction

Title, specifications, dimensions, weight, images, variations, and metadata fields - scraped at SKU level with parent-child variant mapping.

Real-Time Price Tracking

Capture current price, original price, flash sale windows, coupon details, and bulk discount tiers - timestamped per crawl.
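
The price fields in the Pricing & Offers sample record can be cross-checked by hand. The sketch below assumes that discount_pct is the percentage reduction from original_price to current_price, rounded to the nearest integer; the page does not state the rounding rule, and the function name is hypothetical.

```python
def discount_pct(current_price: float, original_price: float) -> int:
    """Percentage reduction from original to current price, rounded to an integer."""
    if original_price <= 0:
        raise ValueError("original_price must be positive")
    return round((original_price - current_price) / original_price * 100)


# Values taken from the sample Pricing & Offers record.
# (9899 - 8999) / 9899 is roughly 9.09%, which rounds to 9.
print(discount_pct(8999.0, 9899.0))
```

Under this assumption the result matches the discount_pct of 9 shown in the sample.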

JD Plus Pricing

Extract member-exclusive pricing and discounts, providing a complete view of the pricing hierarchy.

Review & Rating Mining

Full review text, star ratings, helpful vote counts, plus member attribution, and media URLs - paginated across all review pages.

Seller Intelligence

Shop name, self-operated flags, follower counts, and tripartite ratings (product, service, logistics) for every listing.

Logistics & Fulfillment

Identify JD Delivery eligibility, warehouse origin, and expected delivery windows across geographical zones.

SERP Scraping

Track organic versus sponsored position for any keyword, with self-operated and JD Logistics badge capture.

Cross-Border E-commerce

Support for Joybuy and JD Worldwide listings, tracking import taxes and international shipping metadata.

Scheduled + Streaming Modes

Run one-off bulk exports or configure continuous pipelines at hourly, daily, or real-time cadences with change-detection diffing.

// engagement pipeline

From SKU list to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Provide SKU lists, category URLs, keyword sets, or shop IDs. We design the extraction schema together.

Pipeline Build

Days 2–4

We configure Scrapy / Playwright crawlers, proxy rotation, session management, and CAPTCHA handling for jd.com.

Validation & QA

Days 4–6

Schema validation, null-rate checks, price-outlier detection, and sample reviews before full launch.

Delivery

ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
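
Two of the checks named in the Validation & QA step, null-rate checks and price-outlier detection, could look roughly like the sketch below. It is an illustration under stated assumptions: the outlier rule is a modified z-score based on the median absolute deviation with the conventional 3.5 threshold, which the page does not specify, and both function names are hypothetical.

```python
from statistics import median


def null_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is missing or None."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)


def price_outliers(prices: list[float], threshold: float = 3.5) -> list[float]:
    """Return prices whose modified z-score (median/MAD based) exceeds `threshold`."""
    if not prices:
        return []
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # all values (nearly) identical; nothing to flag
    return [p for p in prices if 0.6745 * abs(p - med) / mad > threshold]


sample = [{"price": 8999.0}, {"price": None}, {}, {"price": 8899.0}]
print(null_rate(sample, "price"))
print(price_outliers([8999.0, 8899.0, 9099.0, 8950.0, 89.99]))
```

A median-based rule is used here because a single mis-parsed price (for example a dropped digit) would distort a mean and standard deviation far more than a median.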

Under the hood

How our JD.com pipeline handles the hard parts

JD.com deploys aggressive anti-scraping measures, including complex slide CAPTCHAs and behavioural tracking. Here is how we maintain extraction stability.

// fingerprinting

Identity rotation

TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued, running

// observability

Pipeline health

Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%

Anti-bot layer

Residential proxy rotation + fingerprint spoofing

JD.com bot detection operates on TLS fingerprints, browser headers, mouse-movement heuristics, and IP reputation. Our crawlers use residential ISP proxies from mainland China and Hong Kong with realistic browser fingerprints.

JavaScript rendering

Full Playwright execution for dynamic content

JD product prices and stock levels are rendered client-side from asynchronous API calls. We run full Playwright browser sessions with JavaScript execution to capture data that plain HTTP clients miss entirely.

Slide CAPTCHA bypass

Automated solver integration

JD frequently interrupts sessions with complex slide CAPTCHAs. We integrate CapSolver and 2Captcha to process these challenges automatically, maintaining high throughput without manual intervention.

Schema stability

Resilient selectors with fallback chains

JD changes its DOM structure frequently. Our selector strategy uses multiple fallback chains per field - CSS selectors, XPath, and text-pattern matching - so a layout change does not break your data pipeline.
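
The fallback idea can be sketched as an ordered chain of extractors, where the first one that returns a value wins. This is an illustration, not the production selector set: regular expressions stand in for the CSS and XPath selectors so the example runs on the standard library alone, and the markup patterns and names are hypothetical.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns the first capture group, or None."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


def first_match(html: str, chain: list[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extract in chain:
        value = extract(html)
        if value:
            return value
    return None


# Hypothetical patterns, ordered from most to least specific.
PRICE_CHAIN = [
    regex_extractor(r'<span class="price">\s*([\d.]+)\s*</span>'),
    regex_extractor(r'data-price="([\d.]+)"'),
    regex_extractor(r"[¥￥]\s*([\d.]+)"),
]

old_layout = '<span class="price">8999.00</span>'
new_layout = '<div class="p-price" data-price="8999.00"></div>'
print(first_match(old_layout, PRICE_CHAIN), first_match(new_layout, PRICE_CHAIN))
```

Both layouts yield the same price string, so a markup change that breaks the first pattern falls through to the next one instead of producing a null.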

Change detection

Only re-scrape what has changed

For large SKU catalogues, we maintain a hash index of last-seen values per field. Subsequent runs only push diffs - reducing compute cost, storage bloat, and downstream processing load.
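
A minimal sketch of a per-field hash index, assuming SHA-256 over a stable JSON serialisation of each value and an in-memory dict as the index; a real deployment would presumably persist the index, and the function names are hypothetical.

```python
import hashlib
import json


def field_hashes(record: dict) -> dict[str, str]:
    """Hash each field value using a stable JSON serialisation."""
    return {
        key: hashlib.sha256(
            json.dumps(value, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for key, value in record.items()
    }


def diff_record(record: dict, index: dict, key_field: str = "sku_id") -> dict:
    """Return only the fields whose hash changed, then update the index."""
    key = record[key_field]
    new_hashes = field_hashes(record)
    old_hashes = index.get(key, {})
    changed = {f: record[f] for f, h in new_hashes.items() if old_hashes.get(f) != h}
    index[key] = new_hashes
    return changed


index: dict = {}
first = diff_record({"sku_id": "100012043978", "price": 8999.0, "in_stock": True}, index)
second = diff_record({"sku_id": "100012043978", "price": 8899.0, "in_stock": True}, index)
print(first)
print(second)
```

The first run emits every field because nothing is indexed yet; the second emits only the changed price. Fields that disappear from a record are not reported by this sketch.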

Applications

Who uses JD.com data - and how

Teams across industries use jd.com data to build competitive products and smarter operations.

Price Intelligence & Repricing

Electronics brands and third-party sellers monitor pricing, flash sale windows, and coupon stacking to reprice and protect margin.

Brand & MAP Monitoring

Brands audit third-party sellers for MAP violations, counterfeit listings, and unauthorised resellers - protecting brand equity at scale.

Market Research & Category Analysis

Analysts track review velocity, new entrant launches, and category saturation trends to identify whitespace and investment opportunities.

AI Training Data

ML teams use JD datasets to train recommendation engines, NLP classifiers, and sentiment models on Chinese-language text.

Demand Forecasting

Supply chain teams correlate review velocity and stock depth indicators with sales velocity to improve procurement models.

Investor Due Diligence

PE firms and analysts track category leaders, seller growth curves, and review-to-rating ratios to evaluate marketplace companies.

Technical Spec

JD.com scraper - technical capabilities

Everything supported by our jd.com scraper: rendered SPA elements, slide CAPTCHA handling, proxy rotation, and partial coverage of login-gated data.

JavaScript rendering

Full Playwright sessions - required for price widgets, stock availability, and dynamic content

Supported

Slide CAPTCHA bypass

Automated 2Captcha + CapSolver integration for JD-specific slide mechanics

Supported

Residential proxy rotation

ISP-grade residential IPs from CN / HK pools - rotated per request

Supported

Variant/variation mapping

Parent to child SKU relationships with all colour and storage combinations

Supported

JD Plus pricing extraction

Capture of member-exclusive tier pricing alongside standard pricing

Supported

Review pagination

Full review corpus including image and video attachments

Supported

Seller storefront scraping

All active listings per seller, sorted by any criterion

Supported

Change detection (diffs)

Hash-based diff: only emit records with changed fields since last run

Supported

User purchase history

Gated data requiring SMS verification and active user sessions

Partial

Real-time shopping cart validation

Gated functionality requiring authenticated user state

Partial

Infrastructure

Infrastructure powering the JD.com pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

ScrapyPlaywrightPython 3.12RedisPostgreSQLApache AirflowAWS LambdaS3CloudWatch2CaptchaCapSolverResidential ProxiesDockerKubernetesGrafanaPrometheusFastAPI

Scrapy + Playwright Stack

Scrapy handles crawl orchestration, deduplication, and retry logic. Playwright handles JavaScript rendering, cookie sessions, and interaction flows. Combined via scrapy-playwright middleware.
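
For readers unfamiliar with the scrapy-playwright integration, the wiring is mostly a matter of Scrapy settings. The fragment below is a sketch of a typical settings module, not this pipeline's actual configuration; the retry count is an illustrative value.

```python
# settings.py (sketch): route HTTP(S) downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Browser engine used for rendering.
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Scrapy's built-in retry middleware; 3 is an illustrative value.
RETRY_TIMES = 3

# In a spider, a request opts into browser rendering through its meta dict:
#     scrapy.Request(url, meta={"playwright": True})
```

Requests without the playwright meta key still go through Scrapy's normal downloader, so browser sessions are only spent on pages that need rendering.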

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies across CN/HK regions. Rotation happens per-request with sticky sessions where required. IP score monitoring prevents blacklisted pool contamination.
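
Per-request rotation with optional sticky sessions can be sketched as below. The class, its method names, and the placeholder proxy addresses are hypothetical; simple round-robin selection is an assumption, and IP score monitoring is left out.

```python
import itertools
from typing import Optional


class ProxyPool:
    """Round-robin proxy selection with optional sticky sessions."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self._cycle = itertools.cycle(proxies)
        self._sticky: dict[str, str] = {}

    def get(self, session_id: Optional[str] = None) -> str:
        """Return the next proxy, or the one pinned to `session_id` if given."""
        if session_id is None:
            return next(self._cycle)
        if session_id not in self._sticky:
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]

    def release(self, session_id: str) -> None:
        """Unpin a session so its next request rotates again."""
        self._sticky.pop(session_id, None)


pool = ProxyPool(["http://proxy-a.example:8000", "http://proxy-b.example:8000"])
print(pool.get())            # rotates on every call
print(pool.get("checkout"))  # pinned for the lifetime of the session
print(pool.get("checkout"))  # same proxy as the previous line
```

Sticky sessions matter when a multi-step flow (for example cookies set on a first page load) must keep arriving from the same IP.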

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested - schema versioned per run

CSV

Flat file with typed columns - Excel/Sheets compatible

XLS

Standard Excel format for business analysts

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery - compatible with any data lake

Webhook

HTTP POST per record for real-time downstream processing

API

REST endpoints for on-demand data retrieval

BigQuery

Streamed directly into your dataset with schema auto-detect

PostgreSQL

Upsert into your existing schema with conflict resolution

Snowflake

Stage + COPY INTO workflow - incremental or full-replace

// faq

Common questions.

About jd.com scraping, legality, and pipeline operations.

Is scraping JD.com legal?

Scraping publicly available information from JD.com is generally permissible under applicable law. Our pipelines focus on public, non-authenticated product, pricing, and review data. We do not extract personal data or circumvent SMS authentication walls. Clients should review JD's ToS and consult legal counsel for specific use cases.

How do you handle JD's slide CAPTCHAs?

We use CapSolver and 2Captcha integrations trained specifically on JD's slide and puzzle mechanics. When a challenge is presented, the Playwright session pauses while the solver calculates the trajectory and executes the slide with human-like mouse movements.

Can you extract JD Plus member pricing?

Yes. We configure specific crawler sessions to capture the JD Plus tier pricing alongside the standard retail price, allowing you to map the full discount structure.

How fresh is the data?

Real-time streaming pipelines achieve sub-60-minute latency for price and availability signals on a defined SKU set. Full catalogue refreshes at a daily cadence complete within a 6-12 hour window, depending on catalogue size.

Do you extract user-uploaded review images?

Yes. We capture the source URLs for all user-uploaded images and videos attached to reviews, which is critical for product quality monitoring and sentiment analysis.

What is the minimum viable engagement?

Our smallest packages start at a defined SKU list (typically 1,000-50,000 SKUs) with weekly delivery. For larger catalogues or custom schema requirements, we price based on volume and delivery frequency.

JD.com data,
at warehouse scale.

Tell us what
to extract.
We do the rest.

Data Extraction for Every Industry
