
JD.com data, at warehouse scale.

We extract product listings, dynamic pricing, JD Plus rates, seller intelligence, and multi-media reviews from JD.com. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Products extracted: 1.2M /day
Price updates: 4.7M /24h
Review records: 340K /run
Active pipelines: 118
Uptime: 99.94%

Data Dictionary

Every field we extract from jd.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Product Listings objects from jd.com. All fields typed and schema-versioned.

Fields: sku_id, title, brand, category, sub_category, price, jd_plus_price, in_stock, self_operated, jd_delivery, rating, review_count, specs, image_urls, variation_count, parent_sku, weight, page_url
Sample record: product_listings (200 OK)
{
  "sku_id": "100012043978",
  "title": "Apple iPhone 14 Pro Max 256GB",
  "brand": "Apple",
  "price": 8999.0,
  "jd_plus_price": 8899.0,
  "self_operated": true,
  "jd_delivery": true,
  "rating": 98.5,
  "review_count": 2000000,
  "in_stock": true
}
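As a rough illustration of what "typed and schema-versioned" can mean on the consuming side, the sketch below loads the sample record into a typed container and rejects values of the wrong type. The class name `ProductListing`, the `SCHEMA_VERSION` tag, and the choice of a dataclass are assumptions made for this example, not part of the delivered schema.

```python
from dataclasses import dataclass, fields

SCHEMA_VERSION = "example-1"  # hypothetical version tag, for illustration only


@dataclass(frozen=True)
class ProductListing:
    """Subset of the product_listings fields shown in the sample record."""
    sku_id: str
    title: str
    brand: str
    price: float
    jd_plus_price: float
    self_operated: bool
    jd_delivery: bool
    rating: float
    review_count: int
    in_stock: bool

    def __post_init__(self):
        # Fail fast if any field arrives with an unexpected type.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
                )


sample = {
    "sku_id": "100012043978",
    "title": "Apple iPhone 14 Pro Max 256GB",
    "brand": "Apple",
    "price": 8999.0,
    "jd_plus_price": 8899.0,
    "self_operated": True,
    "jd_delivery": True,
    "rating": 98.5,
    "review_count": 2000000,
    "in_stock": True,
}
listing = ProductListing(**sample)
```

Keeping `sku_id` as a string avoids losing leading zeros or overflowing integer columns downstream.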

Complete list of extractable fields for Pricing & Offers objects from jd.com. All fields typed and schema-versioned.

Fields: sku_id, current_price, original_price, discount_pct, flash_sale, flash_sale_end, jd_plus_price, coupon_details, bulk_discount, buybox_price, price_timestamp, currency
Sample record: pricing_and_offers (200 OK)
{
  "sku_id": "100012043978",
  "current_price": 8999.0,
  "original_price": 9899.0,
  "discount_pct": 9,
  "flash_sale": false,
  "jd_plus_price": 8899.0,
  "coupon_details": "Minus 200 over 4000",
  "price_timestamp": "2026-05-12T09:14:00Z"
}
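The `discount_pct` value in the sample record is consistent with the percentage drop from `original_price` to `current_price`, rounded to a whole number. A minimal sketch of that arithmetic follows; the rounding rule and the function name are assumptions, since the page does not state how the field is derived.

```python
def discount_pct(current_price: float, original_price: float) -> int:
    """Percentage discount relative to the original price, rounded to an integer."""
    if original_price <= 0:
        raise ValueError("original_price must be positive")
    return round((original_price - current_price) / original_price * 100)


# Values from the sample record: (9899 - 8999) / 9899 is about 9.09 percent.
sample_discount = discount_pct(8999.0, 9899.0)
```

A check like this is a cheap consistency test to run on incoming pricing records before they reach a repricing model.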

Complete list of extractable fields for Reviews & Ratings objects from jd.com. All fields typed and schema-versioned.

Fields: review_id, sku_id, user_name, user_level, plus_member, star_rating, content, creation_time, helpful_votes, reply_count, image_urls, video_url, variant_reviewed
Sample record: reviews_and_ratings (200 OK)
{
  "review_id": "1847294719",
  "sku_id": "100012043978",
  "star_rating": 5,
  "plus_member": true,
  "content": "Excellent battery life and camera.",
  "helpful_votes": 42,
  "creation_time": "2026-04-18 14:22:10",
  "user_level": "Diamond"
}

Complete list of extractable fields for Seller Data objects from jd.com. All fields typed and schema-versioned.

Fields: shop_id, shop_name, shop_url, self_operated, rating_product, rating_service, rating_logistics, follower_count, company_name, business_license, joined_date, active_sku_count
Sample record: seller_data (200 OK)
{
  "shop_id": "1000000127",
  "shop_name": "Apple JD Self-operated Store",
  "self_operated": true,
  "rating_product": 9.9,
  "rating_service": 9.9,
  "rating_logistics": 9.9,
  "follower_count": 45000000,
  "company_name": "JD.com"
}

Complete list of extractable fields for Search Results objects from jd.com. All fields typed and schema-versioned.

Fields: keyword, position, sku_id, title, price, comment_count, shop_name, self_operated, ad_flag, jd_logistics, thumbnail_url, scraped_at
Sample record: search_results (200 OK)
{
  "keyword": "smartphone",
  "position": 1,
  "sku_id": "100012043978",
  "ad_flag": false,
  "self_operated": true,
  "price": 8999.0,
  "comment_count": "200W+",
  "scraped_at": "2026-05-12T09:14:33Z"
}
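In the search-results sample, `comment_count` is delivered as the display string "200W+" rather than a number. Assuming "W" abbreviates 万 (ten thousand) and the trailing "+" marks a lower bound, which agrees with the `review_count` of 2000000 shown for the same SKU, a consumer could normalise it as sketched below. The function name and the accepted formats are assumptions for illustration.

```python
import re

# Multipliers for the unit suffixes this sketch understands.
_UNITS = {"": 1, "W": 10_000, "万": 10_000}
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(W|万)?\+?$")


def parse_comment_count(raw: str) -> int:
    """Convert a display count such as '200W+' into an integer lower bound."""
    match = _PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised comment count: {raw!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit or ""])


normalised = parse_comment_count("200W+")
```

Because the source string is a rounded lower bound, the parsed value is best stored next to the raw string rather than in place of it.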

Capabilities

Everything you need from JD.com - nothing you don't

Our JD scraper navigates the Chinese e-commerce ecosystem: handling heavy JavaScript, slide CAPTCHAs, dynamic pricing widgets, and geo-fenced content to extract structured electronics data.

Full Product Extraction

Title, specifications, dimensions, weight, images, variations, and metadata fields - scraped at SKU level with parent-child variant mapping.

Real-Time Price Tracking

Capture current price, original price, flash sale windows, coupon details, and bulk discount tiers - timestamped per crawl.

JD Plus Pricing

Extract member-exclusive pricing and discounts, providing a complete view of the pricing hierarchy.

Review & Rating Mining

Full review text, star ratings, helpful vote counts, plus member attribution, and media URLs - paginated across all review pages.

Seller Intelligence

Shop name, self-operated flags, follower counts, and tripartite ratings (product, service, logistics) for every listing.

Logistics & Fulfillment

Identify JD Delivery eligibility, warehouse origin, and expected delivery windows across geographical zones.

SERP Scraping

Track organic versus sponsored position for any keyword, with self-operated and JD Logistics badge capture.

Cross-Border E-commerce

Support for Joybuy and JD Worldwide listings, tracking import taxes and international shipping metadata.

Scheduled + Streaming Modes

Run one-off bulk exports or configure continuous pipelines at hourly, daily, or real-time cadences with change-detection diffing.

// engagement pipeline

From SKU list to warehouse record

Brief in. Clean data out.

Define Scope (day 0)

Provide SKU lists, category URLs, keyword sets, or shop IDs. We design the extraction schema together.

Pipeline Build (days 2–4)

We configure Scrapy / Playwright crawlers, proxy rotation, session management, and CAPTCHA handling for jd.com.

Validation & QA (days 4–6)

Schema validation, null-rate checks, price-outlier detection, and sample reviews before full launch.
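Two of the checks named above, null-rate checks and price-outlier detection, can be sketched in a few lines. This is an illustrative version only: the modified z-score based on the median absolute deviation and the 3.5 threshold are assumptions chosen for the example, not a description of the production QA rules.

```python
from statistics import median


def null_rate(records, field):
    """Share of records where `field` is missing or None."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)


def price_outliers(prices, threshold=3.5):
    """Prices whose modified z-score (median/MAD based) exceeds the threshold."""
    centre = median(prices)
    mad = median(abs(price - centre) for price in prices)
    if mad == 0:
        return []  # no spread at all, so nothing can be flagged
    return [p for p in prices if 0.6745 * abs(p - centre) / mad > threshold]


records = [
    {"sku_id": "a", "price": 8999.0},
    {"sku_id": "b", "price": 8899.0},
    {"sku_id": "c", "price": None},
    {"sku_id": "d", "price": 89.99},  # likely a parsing slip, not a real price
]
rate = null_rate(records, "price")
flagged = price_outliers([8999.0, 8899.0, 9099.0, 8950.0, 89.99])
```

A median-based score is used here because a single bad price would distort a mean and standard deviation enough to hide itself.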

Delivery (ongoing)

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our JD.com pipeline handles the hard parts

JD.com deploys aggressive anti-scraping measures, including complex slide CAPTCHAs and behavioural tracking. Here is how we maintain extraction stability.

pipeline-monitor · jd.com · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued, running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

Anti-bot layer
Residential proxy rotation + fingerprint spoofing

JD.com bot detection operates on TLS fingerprints, browser headers, mouse-movement heuristics, and IP reputation. Our crawlers use residential ISP proxies from mainland China and Hong Kong with realistic browser fingerprints.

JavaScript rendering
Full Playwright execution for dynamic content

JD product prices and stock levels are loaded client-side through asynchronous API calls. We run full Playwright browser sessions with JavaScript execution to capture data that plain HTTP clients miss entirely.

Slide CAPTCHA bypass
Automated solver integration

JD frequently interrupts sessions with complex slide CAPTCHAs. We integrate CapSolver and 2Captcha to process these challenges automatically, maintaining high throughput without manual intervention.

Schema stability
Resilient selectors with fallback chains

JD changes its DOM structure frequently. Our selector strategy uses multiple fallback chains per field - CSS selectors, XPath, and text-pattern matching - so a layout change does not break your data pipeline.
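The fallback-chain idea can be sketched as an ordered list of extractors where the first non-empty result wins. To stay dependency-free, this example uses regular expressions as stand-ins for the CSS and XPath steps; the markup patterns, `PRICE_CHAIN`, and the helper names are hypothetical and do not reflect JD's actual page structure.

```python
import re
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns the first capture group, if any."""
    compiled = re.compile(pattern, re.S)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


def first_match(html: str, chain: Iterable[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty value."""
    for extract in chain:
        try:
            value = extract(html)
        except Exception:
            continue  # a broken extractor must not take the field down
        if value:
            return value
    return None


# Hypothetical chain for a price field: primary layout, attribute, text pattern.
PRICE_CHAIN = [
    regex_extractor(r'<span class="price">\s*([\d.]+)\s*</span>'),
    regex_extractor(r'data-price="([\d.]+)"'),
    regex_extractor(r'[¥￥]\s*([\d.]+)'),
]

price = first_match('<div data-price="8999.00"></div>', PRICE_CHAIN)
```

Logging which position in the chain produced the value is a useful early warning: a field that starts resolving from its second or third fallback usually signals a layout change.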

Change detection
Only deliver what has changed

For large SKU catalogues, we maintain a hash index of last-seen values per field. Subsequent runs only push diffs - reducing compute cost, storage bloat, and downstream processing load.
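A minimal sketch of a per-field hash index follows, assuming SHA-256 over a canonical JSON encoding of each value and an in-memory dict as the index. The function names and the storage choice are illustrative; a real pipeline would persist the index between runs.

```python
import hashlib
import json


def field_hashes(record: dict) -> dict:
    """Hash every field value using a stable JSON encoding."""
    return {
        name: hashlib.sha256(
            json.dumps(value, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for name, value in record.items()
    }


def diff_record(index: dict, key: str, record: dict) -> dict:
    """Return only the fields whose hash changed since the last run, then update the index."""
    new_hashes = field_hashes(record)
    old_hashes = index.get(key, {})
    changed = {
        name: record[name]
        for name, digest in new_hashes.items()
        if old_hashes.get(name) != digest
    }
    index[key] = new_hashes
    return changed


index = {}
first_run = diff_record(index, "100012043978", {"current_price": 8999.0, "flash_sale": False})
second_run = diff_record(index, "100012043978", {"current_price": 8999.0, "flash_sale": False})
third_run = diff_record(index, "100012043978", {"current_price": 8899.0, "flash_sale": False})
```

This version treats a first sighting as "everything changed" and does not report fields that disappear from a record; both behaviours would need an explicit decision in practice.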

Applications

Who uses JD.com data - and how

Teams across industries use jd.com data to build competitive products and smarter operations.

01. Price Intelligence & Repricing

Electronics brands and third-party sellers monitor pricing, flash sale windows, and coupon stacking to reprice and protect margin.

02. Brand & MAP Monitoring

Brands audit third-party sellers for MAP violations, counterfeit listings, and unauthorised resellers - protecting brand equity at scale.

03. Market Research & Category Analysis

Analysts track review velocity, new entrant launches, and category saturation trends to identify whitespace and investment opportunities.

04. AI Training Data

ML teams use JD datasets to train recommendation engines, NLP classifiers, and sentiment models on Chinese language text.

05. Demand Forecasting

Supply chain teams correlate review velocity and stock depth indicators with sales velocity to improve procurement models.

06. Investor Due Diligence

PE firms and analysts track category leaders, seller growth curves, and review-to-rating ratios to evaluate marketplace companies.

Why DataFlirt

"JD.com holds the definitive pricing baseline for electronics in Asia - but extracting it requires bypassing some of the most aggressive anti-bot systems deployed today."

Most teams underestimate the investment required: reliable JD.com scraping requires mainland China residential proxies, full JavaScript rendering, slide CAPTCHA handling, and daily selector maintenance. DataFlirt absorbs that complexity so your engineers can focus on the analysis - not the infrastructure.

Technical Spec

JD.com scraper - technical capabilities

Everything supported by our jd.com scraper: rendered SPA elements, slide CAPTCHAs, rate-limit evasion, and beyond. Data gated behind authenticated user sessions is only partially supported.

JavaScript rendering (Supported): Full Playwright sessions - required for price widgets, stock availability, and dynamic content
Slide CAPTCHA bypass (Supported): Automated 2Captcha + CapSolver integration for JD-specific slide mechanics
Residential proxy rotation (Supported): ISP-grade residential IPs from CN / HK pools - rotated per request
Variant/variation mapping (Supported): Parent-to-child SKU relationships with all colour and storage combinations
JD Plus pricing extraction (Supported): Capture of member-exclusive tier pricing alongside standard pricing
Review pagination (Supported): Full review corpus including image and video attachments
Seller storefront scraping (Supported): All active listings per seller, sorted by any criterion
Change detection (diffs) (Supported): Hash-based diff: only emit records with changed fields since last run
User purchase history (Partial): Gated data requiring SMS verification and active user sessions
Real-time shopping cart validation (Partial): Gated functionality requiring authenticated user state

Infrastructure

Infrastructure powering the JD.com pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, FastAPI

Scrapy + Playwright Stack

Scrapy handles crawl orchestration, deduplication, and retry logic. Playwright handles JavaScript rendering, cookie sessions, and interaction flows. Combined via scrapy-playwright middleware.
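For readers unfamiliar with the scrapy-playwright integration, the fragment below shows the project settings that library documents for routing requests through Playwright. The browser type and retry count are illustrative values, not the configuration used in this pipeline.

```python
# settings.py (sketch): route HTTP(S) downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"  # illustrative choice
RETRY_TIMES = 3  # illustrative; Scrapy's retry middleware reads this setting

# A spider then opts a single request into browser rendering with:
#   scrapy.Request(url, meta={"playwright": True})
```

Requests without the `playwright` meta key still go through Scrapy's normal downloader, so only pages that need rendering pay the browser cost.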

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies across CN/HK regions. Rotation happens per-request with sticky sessions where required. IP score monitoring prevents blacklisted pool contamination.
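Per-request rotation with optional sticky sessions can be pictured as a round-robin pool that pins a proxy to a session identifier on request. The sketch below is a simplified illustration: the `ProxyPool` class, the placeholder proxy URLs, and the absence of IP score monitoring are all assumptions of the example.

```python
from itertools import cycle
from typing import Dict, Iterable, Optional


class ProxyPool:
    """Round-robin proxy selection with optional session pinning."""

    def __init__(self, proxies: Iterable[str]):
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(self._proxies)
        self._sticky: Dict[str, str] = {}

    def get(self, session_id: Optional[str] = None) -> str:
        """Return the next proxy, or the pinned one for a sticky session."""
        if session_id is None:
            return next(self._cycle)
        if session_id not in self._sticky:
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]

    def release(self, session_id: str) -> None:
        """Drop a sticky assignment once the session is finished."""
        self._sticky.pop(session_id, None)


# Placeholder endpoints; real pool addresses are not part of this page.
pool = ProxyPool([
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
])
```

Sticky sessions matter for multi-step flows such as paginated reviews, where changing IP mid-session tends to invalidate cookies.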

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested - schema versioned per run
CSV: Flat file with typed columns - Excel/Sheets compatible
XLS: Standard Excel format for business analysts
Parquet: Columnar format for BigQuery, Snowflake, Athena
AWS S3: Direct bucket delivery - compatible with any data lake
Webhook: HTTP POST per record for real-time downstream processing
API: REST endpoints for on-demand data retrieval
BigQuery: Streamed directly into your dataset with schema auto-detect
PostgreSQL: Upsert into your existing schema with conflict resolution
Snowflake: Stage + COPY INTO workflow - incremental or full-replace
// faq

Common questions.

About jd.com scraping, legality, and pipeline operations.

Is scraping JD.com legal?

Scraping publicly available information from JD.com is generally permissible under applicable law. We focus on public, non-authenticated product, pricing, and review data, and we do not extract personal data or circumvent SMS authentication walls. Clients should review JD's ToS and consult legal counsel for specific use cases.

How do you handle JD's slide CAPTCHAs?

We use CapSolver and 2Captcha integrations trained specifically on JD's slide and puzzle mechanics. When a challenge is presented, the Playwright session pauses, the solver calculates the trajectory, and the session then completes the slide with human-like mouse movements.

Can you extract JD Plus member pricing?

Yes. We configure specific crawler sessions to capture the JD Plus tier pricing alongside the standard retail price, allowing you to map the full discount structure.

How fresh is the data?

Real-time streaming pipelines achieve sub-60-minute latency for price and availability signals on a defined SKU set. Full catalogue refreshes at daily cadence complete within a 6-12 hour window depending on size.

Do you extract user-uploaded review images?

Yes. We capture the source URLs for all user-uploaded images and videos attached to reviews, which is critical for product quality monitoring and sentiment analysis.

What is the minimum viable engagement?

Our smallest packages start at a defined SKU list (typically 1,000-50,000 SKUs) with weekly delivery. For larger catalogues or custom schema requirements, we price based on volume and delivery frequency.

$ dataflirt scope --new-project --source=jd.com ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off electronics catalogue dump or a continuous price-monitoring feed across 1M SKUs - we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h