
Scrape eCom stores to build a custom competitor research dataset

Updated 13 Jun 2026
Author: Nishant

Founder of DataFlirt.com. Logging web scraping shhhecrets to help data engineering and business analytics/growth teams extract and operationalise web data at scale.

TL;DR: Quick summary
  • Spreadsheet-and-browser competitor checks stop working past a handful of SKUs; a scraped competitor price monitoring dataset turns the same question into a query.
  • Schema design and product matching, not extraction, are where competitor data projects succeed or fail, so define fields and matching rules before the first crawl.
  • Source difficulty varies wildly, from Shopify stores with open JSON endpoints to marketplaces with aggressive anti-bot stacks, and it shifts without notice as stores adopt CDN-level blocking.
  • DataFlirt scopes the schema, runs the crawls, matches products with confidence scores, and delivers warehouse-ready competitor data as a file, feed, or API.

Why competitor price monitoring beats checking sites by hand

Your competitor repriced 300 SKUs last night. You found out from a sales dip two weeks later. That lag is the whole case for competitor price monitoring built on scraped data.

Manual checking caps out fast. An analyst can eyeball maybe 30 competitor listings a day, and the numbers go stale before the spreadsheet saves. Pricing teams need the opposite: every tracked SKU, on every competitor store, on a schedule, landing in one table they can query.

That table is a custom competitor data asset, and you build it when you scrape ecommerce stores on a schedule instead of by hand. The rest of this guide covers what goes in it, where the data comes from, what breaks, and the honest build-vs-buy math. DataFlirt builds competitor price monitoring datasets for a living, and the hard-won detail below comes from that practice.

Decide what goes in the dataset before you scrape anything

Most failed competitor data projects die at the schema stage, not the scraping stage. Teams crawl first, then discover the fields don’t line up across stores. Fix the schema first.

A workable competitor research dataset needs four field groups:

Group | Fields | Why it matters
Identity | SKU, GTIN/EAN, brand, title, variant | Matching across stores
Commercial | price, list price, promo flag, availability | The pricing decisions
Trust | rating, review count, seller name | Buy-box and credibility
Provenance | source, URL, scraped_at | Auditability, time series

The fields teams forget

List price matters as much as sale price, because discount depth is the signal in promotion analysis. Seller name matters on marketplaces, where a third-party undercutting you is a different problem than the platform itself doing it. And scraped_at timestamps turn snapshots into a time series, which is where ecommerce competitor analysis actually pays off.

DataFlirt scopes this schema with you before any crawl runs. You define the fields, format, and delivery cadence; DataFlirt validates every delivered row against that schema, so a missing GTIN or malformed price gets caught before it reaches your warehouse.
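As a rough illustration of that row-level check, here is a minimal Python sketch. It assumes the field groups above map onto one flat record; the `CompetitorRow` class, its field names, and the `validate_row` helper are hypothetical and are not DataFlirt's actual schema or tooling. The GTIN check uses the standard GS1 check-digit rule.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class CompetitorRow:
    # Identity
    sku: str
    gtin: Optional[str]
    brand: str
    title: str
    # Commercial (prices kept as scraped strings and validated below)
    price: str
    list_price: Optional[str]
    availability: str
    # Provenance
    source: str
    url: str
    scraped_at: str  # ISO 8601 timestamp


def gtin_checksum_ok(gtin: str) -> bool:
    """GS1 check digit: weights 3, 1, 3, ... starting from the rightmost data digit."""
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def validate_row(row: CompetitorRow) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    if row.gtin is None:
        problems.append("missing gtin")
    elif not gtin_checksum_ok(row.gtin):
        problems.append("bad gtin")
    try:
        if Decimal(row.price) <= 0:
            problems.append("non-positive price")
    except InvalidOperation:
        problems.append("malformed price")
    try:
        datetime.fromisoformat(row.scraped_at)
    except ValueError:
        problems.append("malformed scraped_at")
    return problems
```

A row with no GTIN and a price scraped as `12,99` would come back with two problems, which is the point: the failure is visible at delivery time rather than in a dashboard.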

Pick your sources: marketplaces, brand stores, and D2C sites

Source selection drives both the value and the difficulty of the project. Most teams scrape ecommerce stores across three tiers, and the mix decides what your ecommerce competitor analysis can actually answer.

Marketplaces

Marketplaces show you the competitive set in one place: who sells the product, at what price, with what rating. An Amazon scraper is the backbone of most US programs, with an eBay scraper covering the resale floor. India-focused teams track a Flipkart scraper alongside Myntra and Nykaa for fashion and beauty. Regional programs add Mercado Libre for Latin America or Allegro for Poland.

Big-box retail

Retailer sites reveal assortment and stock strategy, not just price. A Target scraper or Best Buy scraper shows category depth and in-store availability; a Wayfair scraper does the same for home goods.

D2C and fast-fashion

Direct competitors on their own storefronts are often the easiest crawls and the most strategically interesting, since their pricing isn’t mediated by a marketplace. Fast-movers like Shein, Temu, and Zalando reprice and rotate assortment fast enough that weekly snapshots miss the story.

Relevance beats coverage. Ten stores your buyers actually cross-shop beat fifty random ones, and DataFlirt will say so in scoping; its ecommerce scraping service prices per source, so trimming the list saves you money.

How the data comes out: JSON-LD, hidden endpoints, and rendered pages

Extraction difficulty splits into three paths when you scrape ecommerce stores, and picking the cheapest one per store is where engineering cost gets controlled. DataFlirt audits every target site for the cheapest workable path before quoting, which is why two stores of the same size can carry very different prices.

Structured data is already there

Most product pages embed Schema.org Product markup for Google Shopping. JSON-LD extraction pulls name, price, currency, availability, and GTIN from a script tag without parsing fragile HTML. When a store renders JSON-LD server-side, a plain HTTP fetch beats a browser every time.

Set up an environment with pinned dependencies:

python -m venv venv
source venv/bin/activate
pip install httpx==0.28.1 parsel==1.9.1

Then pulling the Product object from a page is short:

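import json
import httpx
from parsel import Selector

resp = httpx.get(
    "https://example-store.com/products/some-product",
    headers={"User-Agent": "Mozilla/5.0"},
    follow_redirects=True,
    timeout=10.0,
)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
sel = Selector(text=resp.text)

for raw in sel.css('script[type="application/ld+json"]::text').getall():
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        continue  # skip malformed JSON-LD blocks
    # A script tag may hold a single object or a list of objects
    for item in data if isinstance(data, list) else [data]:
        if not isinstance(item, dict) or item.get("@type") != "Product":
            continue
        offer = item.get("offers") or {}
        if isinstance(offer, list):  # several offers: take the first
            offer = offer[0] if offer else {}
        print(item.get("name"), offer.get("price"), offer.get("availability"))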

The script fetches the page, finds JSON-LD blocks, and prints the product name, price, and stock status. Production versions need retry logic, proxy support, and handling for stores that nest Product inside an @graph array.
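The @graph case mentioned above can be handled with a small recursive walk. This is a sketch rather than a complete JSON-LD processor: the helper names `iter_jsonld_products` and `first_offer` are made up for illustration, and the sample payload is invented.

```python
import json
from typing import Any, Iterator


def iter_jsonld_products(node: Any) -> Iterator[dict]:
    """Yield every JSON-LD object typed as Product, including ones inside @graph."""
    if isinstance(node, list):
        for item in node:
            yield from iter_jsonld_products(item)
    elif isinstance(node, dict):
        types = node.get("@type")
        # "@type" may be a single string or a list of strings
        if types == "Product" or (isinstance(types, list) and "Product" in types):
            yield node
        # Some stores wrap all entities in a top-level "@graph" array
        yield from iter_jsonld_products(node.get("@graph", []))


def first_offer(product: dict) -> dict:
    """Return the first offer as a dict; "offers" may be a dict, a list, or absent."""
    offers = product.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    return offers if isinstance(offers, dict) else {}


# Invented sample payload in the @graph shape
raw = """
{"@context": "https://schema.org",
 "@graph": [
   {"@type": "WebPage", "name": "Product page"},
   {"@type": ["Product"], "name": "ProBlend",
    "offers": [{"price": "19.99", "availability": "https://schema.org/InStock"}]}
 ]}
"""
for product in iter_jsonld_products(json.loads(raw)):
    offer = first_offer(product)
    print(product.get("name"), offer.get("price"), offer.get("availability"))
```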

Hidden endpoints

Platform-built stores often expose catalog APIs. Shopify storefronts auto-generate a /products.json endpoint, paginated up to 250 products per request, and per-product .json URLs that expose the full public product record, sometimes including fields merchants assumed were private. Some stores shield or disable these, but where they work, one request replaces an entire category crawl.
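A minimal sketch of walking that endpoint, under a few assumptions worth checking per store: that `limit` and `page` query parameters are honored, and that the payload carries the commonly seen `products`, `variants`, `vendor`, and `compare_at_price` keys. The function names are illustrative, and the fetcher is injectable so the paging logic can be exercised without touching a live store.

```python
import json
import urllib.request
from typing import Callable, Iterator


def fetch_json(url: str) -> dict:
    """Default fetcher: plain HTTPS GET with a short timeout."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def iter_shopify_variants(
    base_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    limit: int = 250,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Walk /products.json page by page and yield one flat row per variant."""
    root = base_url.rstrip("/")
    for page in range(1, max_pages + 1):
        payload = fetch(f"{root}/products.json?limit={limit}&page={page}")
        products = payload.get("products", [])
        if not products:
            break  # an empty page means the catalog is exhausted
        for product in products:
            for variant in product.get("variants", []):
                yield {
                    "title": product.get("title"),
                    "brand": product.get("vendor"),
                    "handle": product.get("handle"),
                    "sku": variant.get("sku"),
                    "price": variant.get("price"),
                    "list_price": variant.get("compare_at_price"),
                    "available": variant.get("available"),
                }
```

Calling `list(iter_shopify_variants("https://example-store.com"))` would pull the whole catalog where the endpoint is open; the `max_pages` cap is a guard against a store that never returns an empty page.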

Rendered pages

JavaScript-heavy stores and most marketplaces need a headless browser. This is the expensive path: slower, heavier, and the one anti-bot systems watch hardest. DataFlirt runs all three paths on open-source tooling: Scrapy and httpx for the cheap fetches, and Playwright with stealth patches where rendering is unavoidable. Clients get auditable pipelines rather than a black box, and never pay browser costs for a JSON-LD job.

Product matching: the hard part nobody budgets for

Extraction gets the attention. Matching decides whether the competitor data is usable for competitor price monitoring at all. Your “Acme ProBlend 600W Blender, Black” is “ProBlend600 Blender by Acme (Black, 600 Watt)” on one store and a bundle with free cups on another. Until those rows link to one product ID, price comparison is meaningless.

The matching ladder

GTIN and EAN codes match cleanly when listings include them, so capture them everywhere you can. Below that, fuzzy title matching on normalized brand-plus-model strings catches most of the rest, and perceptual image hashing breaks ties. Variants are the trap: matching a 500ml listing against your 1L SKU quietly poisons every downstream price chart.
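The first two rungs of that ladder, plus a variant guard, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not DataFlirt's matcher: the token-overlap (Jaccard) score stands in for whatever fuzzy method a real pipeline uses, the stopword and alias tables are tiny placeholders, only volume variants are guarded, and image hashing is left out entirely.

```python
import re
from typing import Optional

STOPWORDS = {"by", "the", "with", "and"}
ALIASES = {"watt": "w", "watts": "w", "litre": "l", "liter": "l"}


def tokens(title: str) -> set[str]:
    """Lowercase, split letters from digits, map unit aliases, drop filler words."""
    raw = re.findall(r"[a-z]+|\d+", title.lower())
    return {ALIASES.get(t, t) for t in raw} - STOPWORDS


def volume_ml(title: str) -> Optional[float]:
    """Extract a volume such as '500ml' or '1L' in millilitres, if present."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(ml|l)\b", title.lower())
    if not m:
        return None
    value = float(m.group(1))
    return value * 1000 if m.group(2) == "l" else value


def match_confidence(ours: dict, theirs: dict) -> float:
    """Score a candidate pair in [0, 1]; 0.0 means the pair must not be linked."""
    # Rung 1: when both sides carry a GTIN, it decides the match outright
    ga, gb = ours.get("gtin"), theirs.get("gtin")
    if ga and gb:
        return 1.0 if ga == gb else 0.0
    # Variant guard: listings with different sizes are different products
    va, vb = volume_ml(ours["title"]), volume_ml(theirs["title"])
    if va is not None and vb is not None and va != vb:
        return 0.0
    # Rung 2: Jaccard overlap of normalized title tokens
    ta, tb = tokens(ours["title"]), tokens(theirs["title"])
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

In practice the score would feed a threshold: pairs above it link automatically, pairs in a middle band go to human review, and the rest are dropped.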

Bundles and multipacks

A competitor selling a 3-pack at a “lower unit price” is a different commercial event than a straight price cut. The schema needs a pack-size field and unit-price normalization, or your competitor price monitoring will fire false underpricing alerts weekly.
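Unit-price normalization can be as small as the sketch below. The regex patterns cover only a few common English title conventions and are assumptions for illustration; real catalogs need a longer, per-locale list, and ideally a structured pack-size field where the source offers one.

```python
import re
from decimal import Decimal

# Illustrative patterns only; extend per source and language
PACK_PATTERNS = [
    r"(\d+)\s*-?\s*pack\b",    # "3-pack", "3 pack", "3pack"
    r"\bpack\s+of\s+(\d+)\b",  # "pack of 3"
    r"\bx\s*(\d+)\b",          # "x3", "x 3"
]


def pack_size(title: str) -> int:
    """Best-effort pack size from a listing title; defaults to 1."""
    lowered = title.lower()
    for pattern in PACK_PATTERNS:
        match = re.search(pattern, lowered)
        if match:
            return max(int(match.group(1)), 1)
    return 1


def unit_price(price: str, title: str) -> Decimal:
    """Price per single unit, rounded to two decimal places."""
    return (Decimal(price) / pack_size(title)).quantize(Decimal("0.01"))
```

Comparing `unit_price` values rather than listed prices keeps a 3-pack from registering as a price cut on the single item.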

DataFlirt treats matching as its own QA stage, with confidence scores per match and human review on low-confidence pairs. That scoring layer is the difference between a dataset your pricing team trusts and one they quietly stop opening. It is also why price matching projects are scoped around match accuracy, not page counts.

What breaks: anti-bot walls, schema drift, and stale listings

Plan for these three failure modes up front, because all three will happen.

Anti-bot escalation

Marketplaces defend hard: rate limiting per IP, browser fingerprinting, and CAPTCHA walls on suspicious traffic. Sustained marketplace crawls need a rotating proxy layer with residential exits; a small D2C store usually needs none of that, and paying for residential proxies to crawl it is over-engineering. The ground also shifts: Cloudflare now blocks AI crawlers by default for new domains on its network, and stores adopting CDN-level bot management can harden overnight. DataFlirt maintains per-site crawl strategies and adapts them when defenses change, which is the maintenance burden you are really outsourcing.

Schema drift

Stores redesign, and selectors die silently. Worse than a crash is a crawl that keeps running while writing nulls into the price column. Schema drift detection catches this: field-level validation that flags when extraction rates drop on any column. DataFlirt’s QA layer alerts on drift before delivery, so broken competitor data never reaches your dashboards.
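One simple way to implement that field-level check is to compare per-column fill rates against a baseline from a known-good crawl. The sketch below assumes rows are plain dicts; the function names and the 10-point drop threshold are illustrative choices, not a description of DataFlirt's QA layer.

```python
from typing import Iterable


def fill_rates(rows: Iterable[dict], columns: list[str]) -> dict[str, float]:
    """Share of rows with a non-empty value, per column."""
    rows = list(rows)
    if not rows:
        return {c: 0.0 for c in columns}
    return {
        c: sum(1 for r in rows if r.get(c) not in (None, "")) / len(rows)
        for c in columns
    }


def drifted_columns(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.10,
) -> list[str]:
    """Columns whose fill rate fell by more than max_drop versus the baseline."""
    return sorted(
        col for col, base in baseline.items()
        if base - current.get(col, 0.0) > max_drop
    )
```

Running this per source after every crawl, and holding delivery when it returns anything, catches the "crawl succeeded, price column is null" failure described above.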

Stale and zombie listings

Delisted products linger in sitemap crawls, and out-of-stock items keep old prices. Without availability checks and deduplication logic, your “market average price” includes products nobody can buy. Tag availability on every crawl and age out listings that stop appearing.
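A sketch of that filter, assuming each listing carries an `in_stock` flag and a `last_seen` timestamp updated on every crawl where it appears. The field names and the seven-day window are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


def live_listings(listings: list[dict], now: datetime, max_age_days: int = 7) -> list[dict]:
    """Keep listings that are in stock and were seen within the age window."""
    cutoff = now - timedelta(days=max_age_days)
    return [l for l in listings if l["in_stock"] and l["last_seen"] >= cutoff]


def market_average(listings: list[dict], now: datetime, max_age_days: int = 7) -> Optional[float]:
    """Average price over buyable, recently seen listings; None if there are none."""
    live = live_listings(listings, now, max_age_days)
    return mean(l["price"] for l in live) if live else None
```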

Prices that hide from crawlers

The price on the product page is not always the price shoppers pay. Stores show member-only prices behind login, reveal discounts only in the cart, vary prices by region and currency, and serve different numbers to app users. A naive crawl records the sticker price and misses the real one. Decide per source which price your ecommerce competitor analysis needs, then capture it deliberately: geo-targeted crawls for regional pricing, cart-stage checks where the discount lives there. DataFlirt handles locale and currency normalization as part of delivery, so a EUR cart price and an INR list price land in comparable columns instead of poisoning the index.
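Currency normalization itself is mechanical once the price type and currency are recorded per row. A minimal sketch follows; the rates in it are placeholders invented for illustration, not real exchange rates, and the `price_type` tag and function name are hypothetical.

```python
from decimal import Decimal

# Placeholder rates (USD per 1 unit of currency), for illustration only.
# A real pipeline would load dated rates from an FX source the team trusts.
EXAMPLE_RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "INR": Decimal("0.012"),
}


def normalize_price(row: dict, rates: dict[str, Decimal]) -> dict:
    """Add a price_usd column while keeping the original amount, currency, and price type."""
    currency = row["currency"]
    if currency not in rates:
        raise ValueError(f"no rate for currency {currency!r}")
    price_usd = (Decimal(row["amount"]) * rates[currency]).quantize(Decimal("0.01"))
    return {**row, "price_usd": price_usd}


cart_row = {"amount": "100.00", "currency": "EUR", "price_type": "cart"}
list_row = {"amount": "1000", "currency": "INR", "price_type": "list"}
normalized = [normalize_price(r, EXAMPLE_RATES_TO_USD) for r in (cart_row, list_row)]
```

Keeping the original amount and currency alongside the converted value means a later rate correction never requires a re-crawl.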

Is it legal to scrape ecommerce stores?

It is the question every serious buyer of competitor price monitoring asks, so here is the straight orientation.

Prices and availability are public, factual data, and facts are not copyrightable. US courts have repeatedly declined to let platforms turn public data into private property via terms of service; in 2024 a federal judge dismissed contract claims against a data-collection company over scraping public content, warning against information monopolies. The hiQ v. LinkedIn line of cases points the same direction on public-data access.

Where caution is real

Logged-in scraping changes the analysis, since an authenticated session means accepted terms. Reviews carry usernames and opinions, which makes them personal data under GDPR, CCPA, and India’s DPDP Act; aggregate where you can and set retention limits. And robots.txt is not law, but respecting crawl-delay signals and keeping a low request footprint is both good practice and good risk posture.
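Honoring robots.txt signals takes little code with Python's standard library. The sketch below parses a robots.txt body that has already been fetched; the `crawl_policy` helper, the `example-bot` user agent, and the one-second default delay are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def crawl_policy(
    robots_txt: str,
    user_agent: str,
    url: str,
    default_delay: float = 1.0,
) -> tuple[bool, float]:
    """Return (allowed, delay_seconds) for one URL under a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return allowed, (float(delay) if delay is not None else default_delay)


robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /cart\n"
allowed, delay = crawl_policy(robots, "example-bot", "https://example-store.com/products/x")
```

A crawler would sleep for `delay` seconds between requests to that host and skip any URL where `allowed` is false.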

None of this is legal advice. Jurisdictions and use cases differ, so put your specific design in front of qualified counsel. DataFlirt scrapes publicly available data, keeps a low footprint on target servers, and documents provenance on every delivery, which keeps your audit trail clean when counsel asks.

Crawl cadence: how fresh does competitor price monitoring need to be?

Freshness costs money, so match it to the decision speed of the category.

Cadence | Fits | Typical use
Intraday | Price-war SKUs, electronics | Repricing triggers
Daily | Most retail assortments | Competitor price monitoring
Weekly | Catalog and assortment research | Range and gap analysis
One-off | Market entry, due diligence | Snapshot studies

Tier your SKUs rather than crawling everything at the fastest cadence. The 200 products that drive 80% of revenue earn intraday checks; the long tail holds up fine on weekly runs. DataFlirt runs mixed cadences per source and per SKU tier inside one engagement (weekly, fortnightly, monthly, or hourly where it earns its cost), and will tell you plainly when intraday is wasted spend for your category.
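The tiering rule can be expressed as a short function: rank SKUs by revenue and give the fast cadence to the head that accumulates the target share. This is a two-tier sketch with made-up names and a default 80% cutoff; a real program would likely add a daily middle tier.

```python
def assign_cadence(revenue_by_sku: dict[str, float], head_share: float = 0.80) -> dict[str, str]:
    """Give 'intraday' to the top SKUs that together reach head_share of revenue."""
    total = sum(revenue_by_sku.values())
    cadence: dict[str, str] = {}
    running = 0.0
    ranked = sorted(revenue_by_sku.items(), key=lambda kv: kv[1], reverse=True)
    for sku, revenue in ranked:
        # A SKU is in the head while the revenue ranked above it is still short of the target
        in_head = total > 0 and running < head_share * total
        cadence[sku] = "intraday" if in_head else "weekly"
        running += revenue
    return cadence
```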

From raw rows to analytics: storing and serving the dataset

A competitor dataset earns its keep inside your analytics stack, not in a downloads folder.

For most teams, the right shape is an append-only fact table keyed on product, source, and timestamp, which makes price history, promo frequency, and stock-out analysis simple window queries. Choosing where it lives is a DBMS decision driven by your existing stack; competitor data should land next to your sales data, because the joins are the point. DataFlirt hands over data you can query, not raw HTML to clean, and for warehouse-first teams DataFlirt lands deliveries analytics-ready in BigQuery, Snowflake, or Postgres.
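To make the "simple window queries" claim concrete, here is a sketch using SQLite from Python as a stand-in for the warehouse. The table name, columns, and sample rows are invented for illustration, and the `LAG` window function needs SQLite 3.25 or newer (bundled with recent Python builds). A production table would store prices as exact decimals rather than `REAL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE price_facts (
        product_id TEXT NOT NULL,
        source     TEXT NOT NULL,
        scraped_at TEXT NOT NULL,   -- ISO 8601, so string order equals time order
        price      REAL NOT NULL,
        in_stock   INTEGER NOT NULL,
        PRIMARY KEY (product_id, source, scraped_at)
    )
""")
# Append-only: every crawl inserts new rows, nothing is updated in place
conn.executemany(
    "INSERT INTO price_facts VALUES (?, ?, ?, ?, ?)",
    [
        ("P1", "store-a", "2026-06-10", 20.0, 1),
        ("P1", "store-a", "2026-06-11", 20.0, 1),
        ("P1", "store-a", "2026-06-12", 17.5, 1),
        ("P1", "store-b", "2026-06-11", 21.0, 1),
        ("P1", "store-b", "2026-06-12", 21.0, 0),
    ],
)

# Price changes: compare each row with the previous snapshot of the same product and source
changes = conn.execute("""
    SELECT product_id, source, scraped_at, price, prev_price
    FROM (
        SELECT *, LAG(price) OVER (
            PARTITION BY product_id, source ORDER BY scraped_at
        ) AS prev_price
        FROM price_facts
    )
    WHERE prev_price IS NOT NULL AND price <> prev_price
    ORDER BY scraped_at
""").fetchall()
```

With the sample rows, `changes` holds the single store-a reprice on 2026-06-12; the same partition-and-order pattern covers promo frequency and stock-out runs.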

DataFlirt delivers in whatever shape that stack needs: CSV for analysts, JSON Lines for pipelines, or direct writes to your database or warehouse, with the option of a live API endpoint when pricing tools need to query current competitor data on demand. Feeds arrive analytics-ready, deduplicated and schema-validated, which minimizes the ETL work between delivery and dashboard. Review streams can join the same warehouse through DataFlirt’s review scraping service when sentiment belongs in the competitive picture, and B2B sellers extend the same model to wholesale platforms via the B2B marketplace service.

What the dataset answers: five ecommerce competitor analysis plays

A clean competitor dataset turns recurring strategy questions into queries. Five plays cover most of the ROI.

Price index per category

Your average price versus the market’s, per category, per week. One query once the fact table exists, and the standard first deliverable DataFlirt clients put in front of leadership. It answers “are we expensive” with a number instead of a feeling.
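As a sketch of the calculation, assuming rows are already restricted to matched products and one time window, and that each row carries a hypothetical `is_ours` flag: an index of 100 means price parity with the market, above 100 means you are more expensive.

```python
from collections import defaultdict
from statistics import mean


def price_index(rows: list[dict]) -> dict[str, float]:
    """Per category: 100 * (our average price) / (competitors' average price)."""
    ours: dict[str, list[float]] = defaultdict(list)
    market: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        bucket = ours if row["is_ours"] else market
        bucket[row["category"]].append(row["price"])
    # Categories with no competitor rows are skipped rather than guessed
    return {
        category: round(100 * mean(prices) / mean(market[category]), 1)
        for category, prices in ours.items()
        if market.get(category)
    }
```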

Assortment gap analysis

Products competitors range that you don’t, surfaced by joining matched catalogs. Catalogue managers use this to scrape ecommerce stores for whitespace before a category review, and it is where the GTIN discipline from the schema section pays off.

Promotion calendar reconstruction

Promo flags plus timestamps reveal each competitor’s discount rhythm: how deep, how often, around which events. Plan your own calendar against their pattern instead of reacting to it.
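A rough sketch of that reconstruction: compute discount depth from price and list price, then keep the deepest discount per source per ISO week. The row shape and function names are assumptions for illustration; a real calendar would also track how many SKUs were on promotion.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


def discount_depth(price: float, list_price: Optional[float]) -> float:
    """Fraction off list price; 0.0 when there is no list price or no discount."""
    if not list_price or list_price <= 0 or price >= list_price:
        return 0.0
    return round(1 - price / list_price, 4)


def promo_calendar(rows: list[dict]) -> dict[tuple, float]:
    """Deepest discount seen per (source, ISO year, ISO week)."""
    calendar: dict[tuple, float] = defaultdict(float)
    for row in rows:
        # scraped_at is assumed to start with an ISO date, e.g. "2026-06-13T08:00:00"
        year, week, _ = date.fromisoformat(row["scraped_at"][:10]).isocalendar()
        key = (row["source"], year, week)
        calendar[key] = max(calendar[key], discount_depth(row["price"], row["list_price"]))
    return dict(calendar)
```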

Stock-out exploitation

When a competitor’s bestseller goes out of stock, that is a window for ad spend and price firmness on your matching SKU. Availability tracking at daily cadence catches these windows; DataFlirt clients wire the feed into alerting so merchandisers hear about it the same day.
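Detecting those windows comes down to spotting the transition from in stock to out of stock between consecutive snapshots. A minimal sketch, with an assumed row shape and a hypothetical function name:

```python
def stockout_events(snapshots: list[dict]) -> list[dict]:
    """Emit one event each time a listing flips from in stock to out of stock."""
    last_in_stock: dict[tuple, bool] = {}
    events = []
    for snap in sorted(snapshots, key=lambda s: s["scraped_at"]):
        key = (snap["product_id"], snap["source"])
        # Only a known in-stock state followed by out-of-stock counts as an event;
        # a listing first seen out of stock is not a new window
        if last_in_stock.get(key) is True and not snap["in_stock"]:
            events.append({
                "product_id": snap["product_id"],
                "source": snap["source"],
                "went_out_at": snap["scraped_at"],
            })
        last_in_stock[key] = bool(snap["in_stock"])
    return events
```

Feeding the returned events into whatever alerting channel merchandisers already watch is what turns availability tracking into same-day action.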

Review-gap targeting

Joining competitor ratings against yours flags products winning on price but losing on trust. Pair the price feed with customer review data and the fix becomes specific: which SKU, which complaint, which competitor benefits.

Each play needs the same underlying asset, which is the argument for building one competitor data pipeline instead of five point solutions. DataFlirt delivers that single asset and lets your analysts run every play on top of it.

Build it or buy the feed

  • Price the in-house route honestly. A production competitor price monitoring pipeline needs per-site extraction logic, proxy spend, matching infrastructure, drift monitoring, storage, and an engineer on call for the week Amazon changes its layout. At serious scraping scale, that is a standing engineering commitment: conservatively a part-time role plus infrastructure, before anyone analyzes a single price.

  • SaaS price trackers sit in the middle. They work well for small, marketplace-only programs, but most lock you into their schema, their source list, and a per-SKU subscription that climbs with your catalog. The moment you need a custom field, an unsupported store, or your own warehouse as the destination, you are back to scraping ecommerce stores yourself or hiring it out.

  • Building can still win if you have idle scraping expertise and a small, stable source list. For everyone else, DataFlirt typically costs less than the engineering role alone, quotes per project with no minimum spend, and owns the maintenance as part of the service rather than an upsell. A six-month internal build becomes a first data drop within days, and the engagement scales from a one-off snapshot to a multi-source daily feed without re-platforming, because DataFlirt’s pipeline architecture runs a 50-SKU pilot and a 5-million-SKU rollout on the same stack. When build-vs-buy math gets real, that total cost of ownership is usually where DataFlirt wins it.

Get a competitor dataset sample this week

The fastest way to test everything above is a sample against your own SKUs. Send DataFlirt your product list and the competitor stores you care about; scoping comes back within 48 hours, and most projects see a sample dataset the same week. You check match accuracy and field quality against your own catalog before committing to anything, and the same engagement scales from a snapshot study to continuous competitor price monitoring when the data proves out.

Talk to DataFlirt about your competitor price monitoring project. Bring the SKU list and the store list; leave with a schema, a cadence plan, and a quote.

Frequently asked questions

How do I build a competitor research dataset from ecommerce stores?

Define the schema first, then the sources. A workable competitor dataset tracks product identity (SKU, GTIN, brand, title), commercial fields (price, list price, promo flag, availability), trust signals (rating, review count), and provenance (source, URL, timestamp). DataFlirt scopes this schema with you before any crawl runs, so every store feeds the same table.

Is it legal to scrape ecommerce stores?

Public, factual product data like prices and availability sits on strong legal footing in the US, where courts have repeatedly declined to let platforms convert public data into private property through terms of service. Reviews add personal data, which brings GDPR, CCPA, and DPDP obligations. Get your specific design reviewed by qualified legal counsel before production.

What is the hardest part of competitor price monitoring?

Product matching. Linking your SKU to the right listing on each competitor store is harder than extraction. GTIN and EAN codes match cleanly when present, but many listings omit them, forcing fuzzy title matching and image comparison, with variant traps like size and color. DataFlirt treats matching as its own QA stage with confidence scores per match.

How often should I scrape competitor prices?

Match cadence to how fast the category moves. Price-war categories like electronics justify intraday checks on key SKUs, most retail assortments are fine on daily crawls, and catalog or assortment research holds up on weekly runs. DataFlirt sets different cadences per source and per SKU tier inside one feed, so you pay for freshness only where it earns money.

How hard is it to scrape ecommerce stores like Amazon or Shopify sites?

Marketplaces like Amazon defend aggressively with fingerprinting and rate limits, while many standalone stores are far easier; Shopify storefronts even expose a products.json endpoint. Difficulty also shifts without notice as stores adopt CDN-level bot blocking. DataFlirt maintains per-site crawl strategies so feeds keep running when a store hardens its defenses.

How does DataFlirt deliver a competitor dataset?

DataFlirt delivers the dataset as a one-off extraction, a scheduled feed, or a live API, in CSV, JSON Lines, or written directly to your warehouse. Scoping comes back within 48 hours and most projects see a sample dataset the same week, so you validate fields and match quality before committing to a feed.
