Product Data Extraction: Why It's Harder Than It Looks

Product pages are among the most standardised content formats on the public web. Every product page has a name. A price. A description. Usually an image. Often a stock status and a set of variants. The schema is so consistent that Schema.org codified it formally in 2011, and major retailers have been implementing it ever since.

And yet product data extraction is genuinely hard. Inconsistency across sites. JavaScript-rendered prices. Dynamic stock states. Variant configurations that require user interaction to display. The contradiction between how structured the underlying data is and how difficult it is to reliably extract deserves examination.

Everyone Has a Different Way to Write a Price Tag

Product data is standardised at the conceptual level (name, price, description), but that does not translate into standardisation at the technical level. Every ecommerce platform represents these fields differently. Shopify stores use one set of CSS classes. Magento stores use another. Custom-built retail sites use whatever the engineering team decided. Marketplace listings on Amazon, eBay, and similar platforms have their own structures that change when the platform updates its front end.

Web data extraction software designed for product data handles this variation in different ways. Some use pattern recognition to identify price elements by their visual context rather than their markup. Some rely on structured data, JSON-LD or microdata, where it exists. Some build and maintain per-site rules that require updating when the target site changes its HTML structure.

Each approach involves tradeoffs. Structured data is reliable when present, but it is often incomplete: fields go missing, or the implementation departs from the spec. Pattern recognition generalises well but can misidentify sale prices as regular prices. Per-site rules are accurate but brittle, because they break when the target site changes its markup. Robust product data extraction systems usually combine approaches.
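One way to picture "combining approaches" is a priority list of strategies, where later strategies only fill fields that earlier ones left empty. The sketch below is a minimal illustration under stated assumptions: the two extractor functions are hypothetical stand-ins with hard-coded results, and the three required fields are chosen for the example rather than taken from any particular tool.

```python
from typing import Callable, Dict, List, Optional

Fields = Dict[str, Optional[str]]
Extractor = Callable[[str], Fields]

# Assumption for this sketch: these are the fields we insist on filling.
REQUIRED = ("name", "price", "currency")


def combine(html: str, extractors: List[Extractor]) -> Fields:
    """Run extractors in priority order; later ones fill gaps only."""
    merged: Fields = {field: None for field in REQUIRED}
    for extract in extractors:
        for field, value in extract(html).items():
            # Never overwrite a value found by a higher-priority strategy.
            if field in merged and merged[field] is None and value:
                merged[field] = value
        if all(merged.values()):
            break  # every required field is filled; skip the rest
    return merged


# Hypothetical strategy: structured data that is present but incomplete.
def from_structured_data(html: str) -> Fields:
    return {"name": "Trail Shoe", "price": None, "currency": "EUR"}


# Hypothetical strategy: a per-site rule that locates the price element.
def from_site_rules(html: str) -> Fields:
    return {"name": "Trail Shoe - Blue", "price": "89.99", "currency": None}


result = combine("<html>...</html>", [from_structured_data, from_site_rules])
print(result)
```

Here the structured-data name wins because it comes first in the list, while the price is taken from the per-site rule because the structured data did not supply one. Ordering the list is the design decision: it encodes which source is trusted most for each site.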

The Standard That Helped (Partially)

The Schema.org Product type gives extraction tools a standardised way to find product data on sites that implement it. When a retailer includes a JSON-LD block with a Product schema, the name, price, currency, availability, and often the SKU are in a predictable, machine-readable location that does not change when the site's visual design is updated.
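As a rough illustration of what "predictable, machine-readable location" means in practice, the following sketch pulls a Product out of a JSON-LD block using only the Python standard library. The sample HTML, the `find_product` helper, and the choice to read only the first offer are assumptions made for the example; real pages vary in how they nest offers and graphs.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_product(node):
    types = node.get("@type")
    if isinstance(types, str):
        types = [types]
    return "Product" in (types or [])


def find_product(html):
    """Return basic fields of the first Product found, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block; try the next one
        # A block may hold one object, a list of objects, or an @graph.
        nodes = []
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict):
                nodes.append(item)
                graph = item.get("@graph")
                if isinstance(graph, list):
                    nodes.extend(g for g in graph if isinstance(g, dict))
        for node in nodes:
            if not _is_product(node):
                continue
            offers = node.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if not isinstance(offers, dict):
                offers = {}
            return {
                "name": node.get("name"),
                "sku": node.get("sku"),
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
                "availability": offers.get("availability"),
            }
    return None


# Invented sample page for the example.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Trail Shoe", "sku": "TS-001",
 "offers": {"@type": "Offer", "price": "89.99", "priceCurrency": "EUR",
            "availability": "https://schema.org/InStock"}}
</script>
</head><body><h1 class="whatever-the-theme-uses">Trail Shoe</h1></body></html>
"""

print(find_product(SAMPLE_HTML))
```

Nothing in the function refers to CSS classes or page layout, which is why a visual redesign leaves it unaffected. When a page has no usable block, the function returns None and the caller has to try something else.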

Adoption is meaningful but uneven. Large retailers are more likely to implement schema correctly. Smaller independent stores, particularly those on custom platforms, often have incomplete implementations or none at all. Aggregators handling ecommerce data scraping at scale deal with this regularly. The structured data path works well for a portion of sources; for the rest, the extractor has to fall back to HTML extraction.

The Variant That Doesn't Exist Until You Select It

Most product data extraction tools handle simple product pages well: one price, one stock state, one configuration. Variants are harder.

A shoe with five sizes and three colours is technically fifteen products sharing a page. The prices may differ by variant. The stock status almost certainly differs. Often the variant data is not in the page HTML at all. It loads from an API call when the user selects an option, meaning it is absent for an extractor that only processes the initial HTML response.

Extracting complete variant data typically requires either rendering JavaScript and simulating user interactions, or identifying and calling the underlying API that populates the variant information. Both approaches work. Both add complexity. Browser-based tools that run inside a real browser have an advantage here. They see the page the same way a user does, with JavaScript already executed and variant data loaded once an option has been selected.
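To make the API route concrete, here is a minimal sketch of what happens after the variant data has been obtained: one JSON payload is flattened into one row per variant. The payload shape, the field names, and the `build_sample_payload` helper are all invented for the example; real endpoints differ by platform, and fetching the payload is deliberately left out.

```python
import json
from itertools import product
from typing import Dict, List

SIZES = ["40", "41", "42", "43", "44"]
COLOURS = ["black", "blue", "red"]


def build_sample_payload() -> str:
    """Builds a hypothetical variant payload: 5 sizes x 3 colours."""
    variants = [
        {
            "options": {"size": size, "colour": colour},
            # Assumed for the example: red costs more, one combination is sold out.
            "price": "94.99" if colour == "red" else "89.99",
            "available": not (size == "44" and colour == "blue"),
        }
        for size, colour in product(SIZES, COLOURS)
    ]
    return json.dumps({"name": "Trail Shoe", "currency": "EUR", "variants": variants})


def flatten_variants(raw: str) -> List[Dict]:
    """Turns one product payload into one flat row per variant."""
    data = json.loads(raw)
    rows = []
    for variant in data.get("variants", []):
        row = {"name": data.get("name"), "currency": data.get("currency")}
        row.update(variant.get("options", {}))
        row["price"] = variant.get("price")
        row["available"] = variant.get("available")
        rows.append(row)
    return rows


rows = flatten_variants(build_sample_payload())
print(len(rows), "variant rows")
print([r for r in rows if not r["available"]])
```

Each row carries its own price and stock flag, which is the granularity that variant-level analysis needs. An extractor that only read the initial HTML would typically see one of these rows, or none.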

Three Places the Data Ends Up

Competitor price analysis is the most common destination for extracted product data. Retailers want to know how their prices compare to competitors' prices at the variant level: not just whether a product is cheaper overall, but whether the specific size or configuration they sell most frequently is being undercut.
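A variant-level comparison can be sketched as a lookup keyed on the variant rather than the product. The prices, the (size, colour) key, and the `find_undercuts` helper below are illustrative assumptions, not data from any real retailer.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

VariantKey = Tuple[str, str]  # (size, colour), an assumed key for this sketch
Prices = Dict[VariantKey, Decimal]


def find_undercuts(ours: Prices, theirs: Prices) -> List[Tuple[VariantKey, Decimal, Decimal, Decimal]]:
    """Returns variants where the competitor is cheaper, largest gap first."""
    undercuts = []
    for key, our_price in ours.items():
        their_price = theirs.get(key)
        # Variants the competitor does not list cannot be compared.
        if their_price is not None and their_price < our_price:
            undercuts.append((key, our_price, their_price, our_price - their_price))
    return sorted(undercuts, key=lambda row: row[3], reverse=True)


# Invented example prices.
our_prices: Prices = {
    ("42", "blue"): Decimal("89.99"),
    ("43", "blue"): Decimal("89.99"),
    ("42", "red"): Decimal("94.99"),
}
competitor_prices: Prices = {
    ("42", "blue"): Decimal("84.99"),
    ("43", "blue"): Decimal("92.00"),
    ("44", "red"): Decimal("80.00"),
}

for key, ours, theirs, gap in find_undercuts(our_prices, competitor_prices):
    print(key, "ours:", ours, "theirs:", theirs, "gap:", gap)
```

In this made-up data the competitor is more expensive on one shared variant and cheaper on another, so a product-level average would hide the undercut on the size 42 blue. Decimal is used instead of float to keep price arithmetic exact.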

Catalogue data collection is the second major use case. Distributors, marketplaces, and comparison sites collect product names, descriptions, specifications, and images from manufacturer pages to populate their own listings. This is largely automated data collection work: scheduled, ongoing, usually involving hundreds of source URLs.

Market research is the third. Analysts studying how a product category is evolving (which features manufacturers are emphasising, how price positioning has shifted, which variants are being discontinued) extract product data as input to analysis rather than as data to republish.

For the research use case, SiteScoop handles product pages directly. Navigate to a product listing or category page. The tool detects the repeating product structure: name, price, and other available fields. Export to a spreadsheet. The variant challenge is addressed by reading the page as rendered, which means the data visible to the user is the data the tool sees.