Scraping products for Shopify migration? Here is the Shopify-ready schema you need

· Updated 12 Jun 2026
Author
Nishant

Founder of DataFlirt.com. Logging web scraping shhhecrets to help data engineering and business analytics/growth teams extract and operationalise web data at scale.

TL;DR: Quick summary
  • One-time extractions suit point-in-time research; periodic feeds suit ongoing monitoring.
  • Cost depends on SKU count, JS rendering, image extraction, and anti-bot complexity.
  • Always validate with a sample extraction before committing to the full run.
  • Legal risk is lower for publicly available product data than for personal or login-gated data.
  • DataFlirt scopes and delivers in 48 hours with a free 100-row sample.

Your original developer is completely unresponsive. The bespoke ecommerce platform they built has no export button. You are staring at a massive catalog of products trapped inside a system you no longer control. Can you get your products back without re-entering them manually? Yes, you can. You do not need backend database access to execute a flawless catalog migration. DataFlirt extracts your catalog directly from the public frontend.

Key takeaways

  • Scraping bypasses missing backend access by extracting data directly from your public HTML code.
  • Shopify imports reject any file that deviates from their rigid, handle-based CSV architecture.
  • Image URLs must be publicly accessible; Shopify servers fetch them during the actual import phase.
  • Variant rows demand strict sequential grouping; splitting a variant family across two uploads corrupts the entire catalog.

Why no export means scraping is your only realistic option

Custom content management systems and older, heavily modified installations rarely offer native export functionality to modern systems. Web scraping solves this by pulling catalog data directly from the public-facing pages, bypassing the inaccessible database completely.

The danger of the stranded catalog

Only an estimated 14% of ecommerce businesses are satisfied with their legacy ecommerce platforms, which leaves the vast majority of merchants operating on infrastructure they would rather replace. Gaining direct database access usually requires the original developer. If that developer goes out of business or simply stops responding, you are locked out of your own backend.

Without a clean database dump, merchants panic. They assume they must rebuild the entire catalog by hand. Manual data entry is a catastrophic alternative for any serious business.

Re-entering 500 products takes weeks of tedious labor. It introduces immense human error into your core inventory system. Formatting mistakes will inevitably ruin your catalog structure. You will lose precious SEO metadata in the process.

Reading the live site as a rescue path

Scraping ignores the backend entirely. An automated script reads the live HTML of your site exactly as a customer sees it. It extracts titles, prices, descriptions, and images. This extraction acts as the ultimate rescue path for stranded merchants.

You do not need a working API. You do not need server credentials. The public website holds every single piece of data required to rebuild your store elsewhere. Web scraping translates that visual information into a highly structured database format.

DataFlirt utilizes advanced crawling techniques to systematically index every product page. A DataFlirt extraction maps your entire site architecture efficiently. We secure the critical product data you thought was lost forever. You can read more about the underlying mechanics in our guide on how web scraping works.

The Shopify product import CSV: every column you need

A successful Shopify upload demands a highly specific CSV structure with precisely named column headers. Miss a single required column, or misspell a header, and Shopify will reject the entire file instantly.

The rigid column architecture

With 77% of businesses feeling a sense of urgency to migrate platforms within the next year, understanding target schemas is non-negotiable. Shopify dictates exactly how product data must be organized. The platform processes millions of rows daily. It expects absolute conformity to its ingestion logic.

DataFlirt engineers format extracted data to match this exact template flawlessly. The table below outlines the core architecture of a Shopify-ready import file.

Column Header | Purpose | Requirement Status
Handle | The unique URL slug for the item. | Mandatory
Title | The core product name. | Mandatory for new items
Body (HTML) | The product description including formatting. | Optional
Vendor | The brand or manufacturer name. | Optional
Product Category | The standard taxonomy classification. | Optional
Tags | Comma-separated descriptors. | Optional
Published | True or false visibility state. | Optional
Option1 Name | The variant category like size or color. | Required for variants
Option1 Value | The specific variant value like large or red. | Required for variants
Variant SKU | The stock keeping unit code. | Optional
Variant Price | The numerical selling price. | Optional
Image Src | The absolute URL of the product photo. | Optional
Status | Active, draft, or archived state. | Optional
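
A header check like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not Shopify's own validation: the column lists are just the subset from the table, and the function name `check_headers` is made up for this example. Shopify's full template contains more columns, so the `KNOWN` list would need extending before real use.

```python
import csv
import io

# Column names from the table above. Splitting them into "always required"
# and "required for variants" is this sketch's reading of that table.
REQUIRED = ["Handle", "Title"]
VARIANT_REQUIRED = ["Option1 Name", "Option1 Value"]
KNOWN = REQUIRED + VARIANT_REQUIRED + [
    "Body (HTML)", "Vendor", "Product Category", "Tags", "Published",
    "Variant SKU", "Variant Price", "Image Src", "Status",
]

def check_headers(csv_text, has_variants=True):
    """Return a list of human-readable problems found in the header row."""
    headers = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    needed = REQUIRED + (VARIANT_REQUIRED if has_variants else [])
    for name in needed:
        if name not in headers:
            problems.append(f"missing required column: {name}")
    for name in headers:
        # Catches misspellings such as "Titel", but also any real Shopify
        # column that is simply absent from the KNOWN subset.
        if name not in KNOWN:
            problems.append(f"unrecognised column: {name!r}")
    return problems
```

Running the check before every upload is cheap, and it turns a rejected import into a readable list of header problems.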

Handling complex variant expansions

Variants aggressively complicate your file structure. They expand a single product entity into multiple vertical spreadsheet rows. If a jacket has four sizes and three colors, that jacket requires twelve distinct rows. Every single row must share the exact same Handle.

The Option1 Name and Option1 Value columns dictate your variant types. If you are uploading products with variants, you must include these option columns accurately. DataFlirt strictly validates these values during the initial extraction phase.
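
The jacket example can be made concrete with a short Python sketch. It assumes a second option is expressed through `Option2 Name` and `Option2 Value` columns following the same pattern as `Option1`; the function name `expand_variants` and the dict-per-row layout are illustrative choices. Whether option names need repeating on every row is worth checking against an export from your own Shopify store; this sketch repeats them.

```python
from itertools import product

def expand_variants(handle, title, options):
    """Build one CSV row (as a dict) per option combination.

    `options` maps an option name to its values,
    e.g. {"Size": ["S", "M"], "Color": ["Red", "Blue"]}.
    """
    names = list(options)
    rows = []
    for i, combo in enumerate(product(*(options[n] for n in names))):
        # Every row shares the Handle; only the first (parent) row
        # carries the Title in this sketch.
        row = {"Handle": handle, "Title": title if i == 0 else ""}
        for pos, (name, value) in enumerate(zip(names, combo), start=1):
            row[f"Option{pos} Name"] = name
            row[f"Option{pos} Value"] = value
        rows.append(row)
    return rows

rows = expand_variants(
    "trail-jacket",
    "Trail Jacket",
    {"Size": ["S", "M", "L", "XL"], "Color": ["Red", "Blue", "Black"]},
)
print(len(rows))  # 12 rows: four sizes times three colors
```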

If you alter these option values later during an update, Shopify will delete your existing variant IDs. It will create entirely new ones. This silently breaks third-party app dependencies, recurring subscriptions, and customer wishlists.

Image URL validation rules

The Image Src column must contain a publicly accessible URL starting with http:// or https://. Shopify actively fetches these images from the provided links during the import phase. The importer cannot read local desktop file paths.

Private file-sharing links will fail completely. Standard Dropbox preview links are useless here. Supplier URLs requiring a login prompt will block the Shopify server. DataFlirt verifies every single image link for public accessibility before finalizing your dataset.
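
A rough way to pre-screen image links is sketched below using only the Python standard library. It is not how Shopify itself fetches images. The function names are illustrative, and the fetch check assumes the host answers `HEAD` requests. Some servers reject `HEAD`, so a real pipeline might fall back to `GET`.

```python
import urllib.request
from urllib.parse import urlparse

def looks_public(url):
    """Cheap static check: an absolute http(s) URL with a host name."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def is_fetchable(url, timeout=10):
    """Network check: True if an anonymous request returns a 2xx image.

    Login walls, 403 responses and redirect loops all surface as
    exceptions (or non-image content types) and count as failures.
    """
    if not looks_public(url):
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return 200 <= response.status < 300 and content_type.startswith("image/")
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

The static check catches local file paths and relative links without touching the network. The fetch check is the slower second pass for everything that survives it.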

Shopify enforces a strict 15 MB file size limit for product CSV uploads. Massive catalogs exceeding this limit must be strategically broken down into multiple smaller CSV files before uploading.

Shopify holds a 28.8% market share among the top 1 million high-traffic ecommerce websites globally. Their infrastructure demands efficiency to maintain stability. DataFlirt automatically segments massive extractions into safe 14 MB chunks. We ensure no variant family is ever split improperly across two separate files.
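
The chunking idea can be sketched as follows. This is an illustrative Python version, not DataFlirt's actual pipeline: it assumes rows are dicts keyed by column name, that rows sharing a Handle are already adjacent, and that 14 MB means 14 * 1024 * 1024 bytes of UTF-8 encoded CSV.

```python
import csv
import io
from itertools import groupby

def rows_to_bytes(rows, fieldnames, header=False):
    """Serialise rows to CSV and return the UTF-8 encoded bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

def chunk_by_handle(rows, fieldnames, max_bytes=14 * 1024 * 1024):
    """Split rows into chunks under max_bytes, keeping each Handle block whole.

    A single product whose rows alone exceed max_bytes still gets its own
    (oversized) chunk rather than being split.
    """
    header_size = len(rows_to_bytes([], fieldnames, header=True))
    chunks, current, size = [], [], header_size
    for _, group in groupby(rows, key=lambda r: r["Handle"]):
        block = list(group)
        block_size = len(rows_to_bytes(block, fieldnames))
        if current and size + block_size > max_bytes:
            chunks.append(current)
            current, size = [], header_size
        current.extend(block)
        size += block_size
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written out with its own header row. Because the split only ever happens between Handle blocks, a variant family never straddles two files.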

How scraped data maps to Shopify columns

Mapping extracted web data to Shopify’s strict CSV schema requires a deliberate, programmatic transformation phase. Raw text scraped from a webpage rarely matches Shopify database requirements out of the box.

Transforming raw titles and descriptions

The extracted product name maps directly to the Title column. The description requires significantly more nuance. You map description text to the Body (HTML) column. You must preserve structural HTML tags like paragraphs and unordered lists.

You must simultaneously strip inline CSS styles from the source code. Bringing over old styling will create massive visual conflicts with your new Shopify theme. DataFlirt applies specific html-parsing rules to clean your descriptions thoroughly. We ensure the text looks native in your new storefront.
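
One way to keep the structural tags while dropping inline styling is to re-emit the HTML through Python's built-in parser. The sketch below is an assumption about how such cleaning could work, not DataFlirt's actual rules. The class name and the set of dropped attributes are illustrative. It does not remove `<style>` or `<script>` blocks, which would need separate handling.

```python
from html import escape
from html.parser import HTMLParser

# Attributes dropped in this sketch; extend or shrink as your theme requires.
DROP_ATTRS = {"style", "class", "id"}

class StyleStripper(HTMLParser):
    """Re-emit HTML with presentational attributes removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        kept = [(k, v) for k, v in attrs if k not in DROP_ATTRS]
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Entities were decoded by the parser, so re-escape the text.
        self.out.append(escape(data, quote=False))

def strip_inline_styles(html_text):
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

print(strip_inline_styles('<p style="color:red">Warm &amp; dry</p>'))
# <p>Warm &amp; dry</p>
```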

Cleaning pricing parameters

Extracted prices almost always contain currency symbols and formatting commas. These characters must be stripped completely before mapping to the Variant Price column. Shopify expects raw numerical values to process transactions correctly.

When scraping a massive retailer like Walmart or Target for market data, formatting varies wildly. Replatforming requires the exact same string cleaning discipline. DataFlirt scripts isolate the numeric values reliably, decimals included. We remove all localized currency markers automatically.
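
A price-cleaning step might look like the Python sketch below. It assumes commas are thousands separators and the dot is the decimal mark, so sites that display prices like `1.299,00 €` would need a different rule. The function name `clean_price` and the two-decimal output are choices made for this example.

```python
import re
from decimal import Decimal, InvalidOperation

def clean_price(raw):
    """Turn a display price like '$1,299.00' into '1299.00'.

    Returns None when no number can be recovered from the text.
    """
    # Grab the first run of digits, allowing commas inside it and one
    # optional decimal part; currency symbols and codes are ignored.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None
    try:
        value = Decimal(match.group(0).replace(",", ""))
    except InvalidOperation:
        return None
    return f"{value:.2f}"

print(clean_price("$1,299.00"))       # 1299.00
print(clean_price("USD 49"))          # 49.00
print(clean_price("Call for price"))  # None
```

Using `Decimal` rather than `float` avoids binary rounding surprises when the cleaned value is written to the Variant Price column.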

Structuring images and SKUs

Images map sequentially to the Image Src column. You must ensure absolute URLs are used. Relative image paths extracted directly from your source code will fail immediately upon import.
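
Resolving relative paths is a one-liner with the Python standard library. The sketch assumes you kept the URL of the page each image was scraped from. `old-shop.example` is a placeholder domain.

```python
from urllib.parse import urljoin

def absolutize(page_url, src):
    """Resolve an <img src> value against the page it was scraped from."""
    return urljoin(page_url, src.strip())

page = "https://old-shop.example/products/jacket"
print(absolutize(page, "/media/jacket-1.jpg"))
# https://old-shop.example/media/jacket-1.jpg
print(absolutize(page, "//cdn.old-shop.example/jacket-2.jpg"))
# https://cdn.old-shop.example/jacket-2.jpg
```

URLs that are already absolute pass through unchanged, so the function is safe to apply to every extracted `src`.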

If a product features multiple images, each additional image requires a totally new row in the CSV. These extra rows simply share the parent Handle.

SKUs are frequently missing from public pages on sites like Best Buy or Home Depot. If your old site hides the SKU from the public DOM, you can safely leave the Variant SKU column blank. DataFlirt structures the delivery file so the import proceeds without it. Shopify still assigns its own internal variant IDs, but it does not invent SKU codes for you, so the field stays empty until you fill it in. Defining your own SKUs is always preferable for inventory management, but the importer handles blank entries gracefully.

Extracting data from dynamic catalogs

Many legacy platforms load their catalog data asynchronously. This means a plain HTTP request returns a page shell with no product data in it. You need a full browser environment to render the products properly before extraction begins.

DataFlirt deploys advanced headless-browser technology to render these complex pages fully. We parse the DOM after all elements have loaded. If your stranded site relies heavily on asynchronous loading, our dynamic website scraping service captures the catalog flawlessly.

The three things that most commonly break a Shopify import

Handle collisions, improper variant row ordering, and inaccessible image URLs account for nearly all failed Shopify CSV uploads. These three formatting errors will halt your migration instantly.

Fixing handle collisions

Shopify relies entirely on the Handle column to identify unique products. If two completely different items share the exact same handle, Shopify merges them together disastrously. A men’s black shirt and a women’s black shirt might both generate the handle black-shirt.

The system will treat them as bizarre variants of a single product. You must ensure absolute uniqueness in your handle generation logic. DataFlirt algorithms append unique internal hashes or categorical prefixes to handles when we detect overlapping product names. This guarantees total separation.

Consider a catalog manager migrating 4,000 auto parts from a bespoke platform. Two different manufacturers use the exact same name for a specific spark plug. A basic extraction script generates the same handle for both items; the import engine merges them blindly, permanently destroying the inventory count for both manufacturers.
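
A collision-safe handle generator could look like the Python sketch below. It follows the prefixing idea described above but is only an illustration: the `title` and `vendor` field names are hypothetical, the slug rule is ASCII-only, and a numeric suffix stands in for the hash a production pipeline might use.

```python
import re
from collections import Counter

def slugify(text):
    """Lowercase, replace runs of non-alphanumerics with '-', trim dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def assign_handles(products):
    """Return one unique handle per product, in input order.

    Titles that collide get the vendor as a prefix; anything that still
    collides gets a numeric suffix.
    """
    base = [slugify(p["title"]) for p in products]
    counts = Counter(base)
    used = set()
    handles = []
    for product, slug in zip(products, base):
        handle = slug
        if counts[slug] > 1:
            handle = slugify(f"{product['vendor']} {product['title']}")
        candidate, n = handle, 1
        while candidate in used:
            n += 1
            candidate = f"{handle}-{n}"
        used.add(candidate)
        handles.append(candidate)
    return handles

parts = [
    {"title": "Iridium Spark Plug", "vendor": "Acme"},
    {"title": "Iridium Spark Plug", "vendor": "Bolt"},
    {"title": "Oil Filter", "vendor": "Acme"},
]
print(assign_handles(parts))
# ['acme-iridium-spark-plug', 'bolt-iridium-spark-plug', 'oil-filter']
```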

Mastering variant row ordering

Shopify expects strict, sequential grouping for variant items. All variant rows for a single product entity must be grouped continuously together within the spreadsheet. The parent row comes first. The specific variants follow immediately after in subsequent rows.

All rows in this continuous block must utilize the exact same Handle. If you split a product’s variants randomly across your file, the import mechanism will fail. DataFlirt enforces precise row sorting prior to delivery. We sequence the entire dataset by handle alphabetically to ensure perfect ingest logic.
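
Both the check and the fix are short in Python. The sketch below assumes rows are dicts with a `Handle` key. The function names are illustrative. The sort relies on Python's `sorted` being stable, so rows within one product keep their original order and a parent row that came first stays first.

```python
def find_split_handles(rows):
    """Return handles whose rows are not contiguous, in first-seen order."""
    closed, split, previous = set(), [], None
    for row in rows:
        handle = row["Handle"]
        if handle != previous:
            # A handle reappearing after its block ended means it was split.
            if handle in closed and handle not in split:
                split.append(handle)
            if previous is not None:
                closed.add(previous)
            previous = handle
    return split

def group_by_handle(rows):
    """Stable sort by handle: groups rows without reordering within a product."""
    return sorted(rows, key=lambda r: r["Handle"])

rows = [
    {"Handle": "boot", "Option1 Value": "8"},
    {"Handle": "cap", "Option1 Value": "One size"},
    {"Handle": "boot", "Option1 Value": "9"},
]
print(find_split_handles(rows))                   # ['boot']
print(find_split_handles(group_by_handle(rows)))  # []
```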

Rescuing unreachable image URLs

Shopify actively fetches your product images during the actual import process. If an image URL returns a 403 Forbidden error, the image simply will not load. If the URL enters a redirect loop, the fetch fails. In both cases the product itself can still import, just without any visual assets attached.

You must test every single image URL before uploading the CSV. Many legacy platforms employ basic anti-hotlinking protections. These protections view the Shopify server as a hostile bot and block the automated request. DataFlirt anticipates this exact scenario. We validate all asset endpoints rigorously during our ecommerce data extraction process.

WooCommerce import format: how it differs

WooCommerce relies on parent-child database relationships instead of Shopify’s handle-based grouping logic. This fundamental shift requires a completely different approach to schema mapping and data transformation.

Contrasting the structural requirements

Many stranded merchants pivot between various modern platforms during a migration. Shopify is the 4th largest platform worldwide holding a 10.32% overall market share. WooCommerce represents another massive segment of the market. If you are migrating your catalog to WordPress, your extracted CSV must change shape drastically.

WooCommerce uses a highly specific ID mapping system. It relies heavily on the Type and Parent columns to build product families. The table below illustrates the primary differences in formatting between the two platforms.

Feature Category | Shopify CSV Schema | WooCommerce CSV Schema
Grouping identifier | Strictly the Handle | The Parent ID or parent SKU
Variant architecture | Multiple rows sharing one handle | Rows with Type set explicitly to variation
Asset handling | One image URL per single row | Pipe-separated image URLs contained in one column
Unique ID requirements | Handle is strictly mandatory | Blank IDs are tolerated during initial creation

Adapting to the WooCommerce syntax

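WooCommerce allows all image URLs for a single product to exist within a single cell, separated by a delimiter. The pipe character shown in the table is what some import plugins expect; the built-in WooCommerce product importer documents a comma-separated list, so confirm which importer you are targeting. Shopify demands a completely new spreadsheet row for every additional image on a product.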

WooCommerce also tolerates blank ID fields during the initial creation phase. It automatically assigns sequential database IDs upon a successful import. Shopify will aggressively reject any row missing a valid handle value.
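
Going from Shopify's one-image-per-row layout to a single image cell per product can be sketched in Python as below. The delimiter is passed in rather than hard-coded because it depends on the importer you target. The function name and the dict-per-row input are assumptions of this example.

```python
def collapse_images(rows, delimiter):
    """Merge per-row 'Image Src' values into one cell per handle.

    Returns {handle: "url1<delimiter>url2..."}, preserving first-seen
    order and skipping blanks and duplicates.
    """
    images = {}
    for row in rows:
        src = row.get("Image Src", "").strip()
        if not src:
            continue
        urls = images.setdefault(row["Handle"], [])
        if src not in urls:
            urls.append(src)
    return {handle: delimiter.join(urls) for handle, urls in images.items()}

shopify_rows = [
    {"Handle": "jacket", "Image Src": "https://old-shop.example/j1.jpg"},
    {"Handle": "jacket", "Image Src": "https://old-shop.example/j2.jpg"},
    {"Handle": "jacket", "Image Src": ""},
]
print(collapse_images(shopify_rows, "|"))
# {'jacket': 'https://old-shop.example/j1.jpg|https://old-shop.example/j2.jpg'}
```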

DataFlirt adjusts the delivery schema based entirely on your target destination. Whether you need a Shopify layout or a WordPress structure, DataFlirt provides detailed audits of the output format. You can learn more about scoping these format choices in our guide on in-house vs hosted scraping.

What DataFlirt delivers for a migration extraction

DataFlirt delivers a finalized, platform-matched CSV tailored exactly to your target system. You receive highly structured data ready for a one-click import, eliminating manual formatting entirely.

Engineering a flawless dataset

Formatting a raw web extraction into a strict Shopify schema takes immense technical effort. Cleaning messy HTML descriptions requires careful HTML parsing and precise pattern matching. Standardizing pricing syntax across thousands of pages requires robust logic.

Verifying image availability requires automated testing at scale. DataFlirt handles this entire transformation layer seamlessly. We provide a platform-matched CSV specifically structured for Shopify or WooCommerce. Our pipelines validate all extracted image URLs systematically.

If your old platform hosts images unreliably, we extract the raw image files directly. We can then host them temporarily on a secure content delivery network. This guarantees Shopify can fetch them perfectly during your migration window.

Delivering the final migration file

We structure your variant rows flawlessly. Every handle is unique. Every option column aligns perfectly.

A freelancer on a gig platform might successfully scrape a flat list of 200 products from a basic site. When you have thousands of complex variants, missing SKUs, and a strict file size limit, the technical burden explodes. Extracting massive sites like eBay, Amazon, Lowe’s, or Alibaba demands robust enterprise architecture. Migrating your own catalog requires the same exact rigor.

That is the specific range where DataFlirt’s dedicated data-extraction protocols and quality assurance checks start paying for themselves. Roughly 90% of recent ecommerce migrators experienced revenue improvements after successfully switching platforms. Reaching those gains requires moving your data safely and completely.

DataFlirt guarantees your catalog migrates securely. We provide a fully formatted sample row for your review before processing the full catalog. We do the heavy technical lifting so you can focus on designing your new storefront.

FAQ

What if the source site uses JavaScript rendering?

DataFlirt handles JS rendering natively as it is one of the most common migration requirements. JS-rendered product pages require a headless browser to execute the code before extraction, which our infrastructure manages automatically without any configuration on your end.

Can I migrate to BigCommerce or WooCommerce the same way?

Yes. The field mapping differs by platform but the scraping method is the same. DataFlirt delivers in the target platform import format, adjusting the column headers and variant grouping logic to match whichever system you choose to adopt.

My products have 50+ variants each. Will the CSV get complex?

High-variant products produce long CSVs but Shopify handles them cleanly as long as Handle is consistent and Option columns are correctly sequenced. Our extraction pipelines enforce this sequential grouping automatically to ensure smooth ingestion.

If you would rather not scope this entire migration logic yourself, DataFlirt’s ecommerce scraping service handles the extraction, QA, and final delivery. We seamlessly translate your stranded legacy website into a perfectly formatted, import-ready Shopify file. Reach out to our team today for a free scoping call.
