REFERENCE

Web Scraping & Data Extraction Glossary

45 terms explained — from CSS selectors to WebAssembly, structured data to MAP compliance.

A

API (Application Programming Interface)

A structured interface through which software systems exchange data. When a website offers an API, developers can request data in a predictable format — usually JSON — without parsing HTML. Most public-facing websites do not offer APIs for their product or pricing data, which is why web scraping exists: it extracts the same data directly from the page as it renders in a browser.

Automated data collection

The process of gathering data from websites or other sources on a recurring schedule without manual intervention. In the ecommerce and market research context, this typically means a script or tool that visits specified URLs, extracts defined fields, and stores or exports the results — running daily, hourly, or continuously depending on how quickly the underlying data changes.
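
A minimal sketch of the pattern in Python, assuming the requests and beautifulsoup4 packages; the URL, the .price selector, and the hourly interval are placeholders:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

URLS = ["https://example.com/products/1"]  # placeholder target pages

def collect_once(path):
    """Visit each URL, extract one field, append a timestamped row to a CSV."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in URLS:
            html = requests.get(url, timeout=30).text
            soup = BeautifulSoup(html, "html.parser")
            price = soup.select_one(".price")  # hypothetical selector
            writer.writerow([time.strftime("%Y-%m-%d %H:%M"), url,
                             price.get_text(strip=True) if price else ""])

while True:
    collect_once("prices.csv")
    time.sleep(3600)  # hourly; production jobs usually use cron or a scheduler
```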

B

Background script

In a Chrome extension, the background script runs in a persistent or event-driven worker context separate from any browser tab. It handles tasks that need to persist across tab navigations — network requests, message passing between components, state management. In SiteScoop's architecture, the background worker handles cross-origin fetch requests that the popup cannot make directly due to browser security restrictions.

BeautifulSoup

A Python library for parsing HTML and XML documents. One of the most widely used tools for web scraping in code-based workflows. BeautifulSoup parses the raw HTML of a page and provides methods for finding and extracting elements by tag, class, attribute, or text content. It handles malformed HTML gracefully, which matters because most real-world web pages contain markup errors that strict parsers reject. Requires Python knowledge to use.
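
A short example of typical usage; the markup and class names are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="product"><h2 class="product-title">Widget</h2>
  <span class="price">$19.99</span></div>
<div class="product"><h2 class="product-title">Gadget</h2>
  <span class="price">$24.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.find_all("div", class_="product"):
    title = product.find("h2", class_="product-title").get_text(strip=True)
    price = product.find("span", class_="price").get_text(strip=True)
    print(title, price)  # Widget $19.99, then Gadget $24.99
```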

Browser extension

A software module that extends browser functionality, typically distributed through stores like the Chrome Web Store. Extensions can interact with web page content, modify the browser interface, and perform background tasks. For web scraping, browser extensions have a significant advantage over standalone scripts: they run inside the browser with the same credentials and session state as the user, so they can access pages that require login or that load content dynamically via JavaScript.

C

Chrome extension

An extension built for Google Chrome and Chromium-based browsers (Edge, Brave, Arc) using the WebExtensions API. Chrome extensions consist of a manifest file, background scripts, content scripts, and optionally a popup interface. They are distributed through the Chrome Web Store and can be installed in one click. Chrome has over 3 billion active users, making it the dominant platform for browser-based tools. See also: browser extension, content script, manifest.

Clipboard

The operating system buffer used to transfer data between applications via copy and paste. In the context of data extraction, copying results to the clipboard in a tab-separated format allows users to paste directly into Excel or Google Sheets without saving an intermediate file. The pasted data lands in cells automatically because spreadsheet applications interpret tab characters as column separators. This is often the fastest export path for one-off extractions.
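
A sketch of the mechanism in Python — the rows are invented, and the third-party pyperclip package is one common way to reach the clipboard from a script:

```python
rows = [
    ["Product", "Price"],  # header row
    ["Widget", "19.99"],
    ["Gadget", "24.99"],
]

# One line per row, tab characters between fields: pasting this into
# Excel or Google Sheets puts each field in its own cell.
tsv = "\n".join("\t".join(fields) for fields in rows)

import pyperclip  # pip install pyperclip
pyperclip.copy(tsv)
```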

Content script

In a Chrome extension, a content script is JavaScript that runs in the context of a web page — with access to the page's DOM — but in an isolated environment separate from the page's own JavaScript. Content scripts can read and modify page content, listen for user interactions like element clicks, and communicate with the extension's background script. In web scraping extensions, the content script handles tasks like element highlighting, user selection of patterns, and reading the page's HTML for processing.

CSS selector

A pattern syntax used to identify HTML elements based on their tag, class, ID, attribute, or position in the document. Originally designed for applying styles in CSS, selectors are also used extensively in web scraping to target the specific elements containing data. For example, .product-title selects all elements with the class "product-title," and table tr td:first-child selects the first cell in every table row. CSS selectors are simpler than XPath but less powerful for complex navigation through document structure.
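
The selectors from the definition above, applied in Python with BeautifulSoup's select(); the markup is invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<h2 class="product-title">Widget</h2>
<table>
  <tr><td>Widget</td><td>$19.99</td></tr>
  <tr><td>Gadget</td><td>$24.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

titles = soup.select(".product-title")                # by class
first_cells = soup.select("table tr td:first-child")  # first cell per row
print([el.get_text() for el in titles])       # ['Widget']
print([el.get_text() for el in first_cells])  # ['Widget', 'Gadget']
```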

CSV (Comma-Separated Values)

A plain text file format where each row represents a record and fields within each row are separated by commas. CSV is the most portable data format for moving tabular data between applications — every spreadsheet application, database, and data analysis tool can import it. One complication: Excel on Windows expects a UTF-8 BOM (byte order mark) at the start of the file to correctly interpret special characters and accented letters. Without it, characters outside basic ASCII can display incorrectly. See also: UTF-8 BOM, XLSX.
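
A minimal sketch of writing Excel-friendly CSV with Python's standard library — the "utf-8-sig" codec prepends the BOM discussed above:

```python
import csv

rows = [["Product", "Price"], ["Café Widget", "€19.99"]]

# "utf-8-sig" writes the UTF-8 BOM so Excel on Windows reads
# accented characters and currency symbols correctly.
with open("products.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)
```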

D

Data extraction

The process of retrieving structured data from a source — in the web context, pulling specific fields from web pages into a usable format like a spreadsheet or database. Data extraction differs from web scraping mainly in emphasis: scraping refers to the collection mechanism, extraction refers to the selection and structuring of the data. In practice the terms are used interchangeably. See also: web scraping, ETL.

Data harvesting

A broader term for the systematic collection of data from online sources. Often used in contexts where the scale is large — harvesting implies gathering in bulk across many sources or over extended periods. In marketing and SEO, the term sometimes carries a negative connotation (as in harvesting email addresses without consent), so context matters. For legitimate competitive research and market analysis, the practice is simply a form of automated data collection.

Data normalisation

The process of transforming extracted data into a consistent format. Raw scraped data often contains inconsistencies: prices formatted as "$1,299.00" on one site and "1299" on another, dates written as "Apr 24, 2026" versus "24/04/2026," product names with inconsistent capitalisation or extraneous whitespace. Normalisation resolves these inconsistencies so the data can be compared, merged, or analysed reliably. A basic step in most data pipelines after extraction.
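
A sketch of two common normalisation steps in Python, handling exactly the variants named above; real pipelines cover many more formats:

```python
import re
from datetime import datetime

def normalise_price(raw: str) -> float:
    """'$1,299.00' or '1299' -> 1299.0"""
    return float(re.sub(r"[^\d.]", "", raw))

def normalise_date(raw: str) -> str:
    """'Apr 24, 2026' or '24/04/2026' -> '2026-04-24'"""
    for fmt in ("%b %d, %Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw}")

print(normalise_price("$1,299.00"))    # 1299.0
print(normalise_date("Apr 24, 2026"))  # 2026-04-24
```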

Data pipeline

A sequence of automated steps that move and transform data from a source to a destination. In a web scraping context, a pipeline typically includes collection (fetching pages), extraction (pulling out the relevant fields), transformation (cleaning and normalising the data), and loading (writing it to a database, spreadsheet, or reporting system). The term comes from the ETL pattern used in data engineering. Most businesses doing competitive monitoring at scale run some version of a data pipeline, even if they don't call it that.

Deduplication

The process of identifying and removing duplicate records from a dataset. Scraped data commonly contains duplicates because the same product or listing appears on multiple pages (search results, category pages, individual product pages), or because a recurring collection job captures the same items across multiple runs. Deduplication is usually done by matching on a unique identifier — a product ID, URL, or combination of fields — and keeping only one instance of each.
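
A minimal sketch in Python, keyed on URL — the first occurrence of each identifier wins:

```python
def deduplicate(records, key="url"):
    """Keep only the first record seen for each unique key value."""
    seen = set()
    unique = []
    for record in records:
        identifier = record[key]
        if identifier not in seen:
            seen.add(identifier)
            unique.append(record)
    return unique

listings = [
    {"url": "https://example.com/p/1", "price": "19.99"},
    {"url": "https://example.com/p/2", "price": "24.99"},
    {"url": "https://example.com/p/1", "price": "19.99"},  # duplicate
]
print(deduplicate(listings))  # two records remain
```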

DOM (Document Object Model)

The tree-structured representation of an HTML page that the browser builds after parsing the raw HTML. The DOM is what JavaScript and browser extensions interact with — it's a live, queryable object model of every element on the page, including elements added dynamically after the initial page load. Web scraping tools that run inside a browser (like extensions) can access the fully-rendered DOM, including JavaScript-generated content. Tools that only fetch raw HTML miss anything added after the initial load. See also: dynamic content, JavaScript rendering.

Dynamic content

Web page content that is loaded or modified by JavaScript after the initial HTML is delivered by the server. Product prices, inventory levels, and personalised recommendations are frequently loaded dynamically — the raw HTML contains a placeholder, and JavaScript fills it in after the page loads. Simple HTTP-based scrapers that fetch raw HTML miss this content entirely. Browser-based scraping tools, which wait for the JavaScript to execute before extracting data, handle dynamic content naturally. See also: JavaScript rendering, headless browser.

E

Endpoint

A specific URL that an API exposes for a particular type of data or action. For example, a product API might have an endpoint at /api/v1/products/{id} that returns data for a single product. When websites load data dynamically, they often fetch it from internal API endpoints — which are sometimes discoverable in the browser's network tab and can be called directly, bypassing the need to parse HTML at all. This approach is faster and more reliable than HTML scraping when it's available, though it requires technical knowledge to identify and use.
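
A sketch of calling such an endpoint directly with Python's requests — the URL and the name/price fields are hypothetical, the kind of thing found by watching the network tab:

```python
import requests

# Hypothetical internal endpoint spotted in the browser's network tab.
url = "https://example.com/api/v1/products/12345"

resp = requests.get(url, timeout=30)
resp.raise_for_status()
product = resp.json()  # already structured JSON, no HTML parsing needed

print(product.get("name"), product.get("price"))  # field names assumed
```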

ETL (Extract, Transform, Load)

A standard pattern in data engineering for moving data from sources into a usable destination. Extract: collect the raw data. Transform: clean, normalise, and reshape it into the required format. Load: write it to a database, data warehouse, or reporting tool. Web scraping is the Extract step when the source is a website. ETL pipelines are typically automated and run on a schedule, making them suitable for ongoing competitive monitoring where the data changes frequently.
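
The three steps as a minimal Python sketch using requests and BeautifulSoup; the URL and CSS selectors are placeholders:

```python
import csv

import requests
from bs4 import BeautifulSoup

def extract(url):
    """Extract: fetch the raw page."""
    return requests.get(url, timeout=30).text

def transform(html):
    """Transform: parse and normalise the fields we need."""
    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select(".product"):  # hypothetical selectors below
        name = card.select_one(".name").get_text(strip=True)
        price = card.select_one(".price").get_text(strip=True)
        yield name, price.lstrip("$")

def load(rows, path):
    """Load: write the cleaned rows to a CSV destination."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows([("name", "price"), *rows])

load(transform(extract("https://example.com/products")), "products.csv")
```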

H

Headless browser

A web browser that runs without a visible user interface. Headless browsers execute JavaScript, handle cookies and sessions, and render pages fully — just like a regular browser — but do so programmatically without displaying anything on screen. Puppeteer (Chrome), Playwright (cross-browser), and PhantomJS (now deprecated) are common headless browser tools used in automated scraping. They can handle dynamic content and complex interactions but are resource-intensive and slower than simple HTTP requests. See also: Selenium.
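
A minimal example with Playwright's Python bindings (assumes pip install playwright followed by playwright install chromium; the URL is a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no visible window
    page = browser.new_page()
    page.goto("https://example.com")            # placeholder URL
    page.wait_for_load_state("networkidle")     # let JavaScript finish
    html = page.content()                       # fully rendered DOM
    browser.close()

print(len(html))
```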

HTML parsing

The process of analysing raw HTML text to extract its structure and content. HTML parsers convert the flat text of an HTML document into a navigable tree (the DOM) that can be queried with CSS selectors or XPath. Most web scraping libraries include an HTML parser — BeautifulSoup, lxml, and Cheerio are common examples. Parsers vary in how strictly they enforce HTML rules; real-world web pages frequently contain malformed markup, so practical scrapers use tolerant parsers that handle errors gracefully.

I

Infinite scroll

A page design pattern where additional content loads automatically as the user scrolls to the bottom, rather than being divided across multiple pages with numbered pagination. Common on social media feeds, image galleries, and some product listing pages. Infinite scroll complicates data collection because the full dataset is never present in the initial page load — content only appears in response to scroll events. Browser-based scrapers can trigger scroll events programmatically; simple HTTP scrapers typically cannot access content beyond the initial load.
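
A sketch of triggering scroll events with Playwright in Python — the URL is a placeholder, and the stop condition (page height no longer growing) is deliberately simplistic:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/listings")  # placeholder URL

    previous_height = 0
    while True:
        # Scroll to the bottom to trigger the next batch of content.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)            # wait for items to load
        height = page.evaluate("document.body.scrollHeight")
        if height == previous_height:          # nothing new appeared
            break
        previous_height = height

    html = page.content()                      # now includes all loaded items
    browser.close()
```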

J

JavaScript rendering

The process by which a browser executes JavaScript code to generate or modify page content after the initial HTML is loaded. Many modern websites deliver most of their meaningful content through JavaScript rather than in the raw HTML response — a practice sometimes called client-side rendering or single-page application (SPA) architecture. Scraping tools must execute this JavaScript to see the data the user would see. Tools running inside a real browser (extensions, headless browsers) do this automatically; tools that only fetch raw HTML do not. See also: dynamic content, headless browser.

JSON (JavaScript Object Notation)

A lightweight, human-readable data format using key-value pairs and arrays. JSON is the dominant format for data exchange between web services and APIs. In the scraping context, JSON is significant for two reasons: it's the format of many internal API responses that can be intercepted from a browser's network tab, and it's used in JSON-LD blocks embedded in web pages to encode structured product and business data. As an export format, JSON is preferred when the extracted data will be consumed by another system rather than by a human in a spreadsheet. See also: JSON-LD, CSV.

JSON-LD (JSON for Linking Data)

A format for embedding structured data in web pages using a <script type="application/ld+json"> block in the HTML. Websites use JSON-LD to communicate structured information to search engines in a machine-readable format — product names, prices, availability, reviews, business hours. Because JSON-LD is already structured and precisely defined, it's often the most reliable source for product data extraction when it's present. Not all pages include it; coverage is highest on ecommerce product pages and business listings. The format itself is a W3C standard; the vocabulary it carries is usually Schema.org. See also: Schema.org, structured data.
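
A sketch of reading JSON-LD blocks in Python — the URL is a placeholder, and real pages sometimes wrap the data in a list or @graph, which this simplified version skips:

```python
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product/1", timeout=30).text  # placeholder
soup = BeautifulSoup(html, "html.parser")

for block in soup.find_all("script", type="application/ld+json"):
    data = json.loads(block.string or "{}")
    if data.get("@type") == "Product":  # Schema.org Product entity
        offer = data.get("offers", {})
        print(data.get("name"), offer.get("price"), offer.get("availability"))
```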

M

Manifest (Chrome extension)

The manifest.json file that defines a Chrome extension's metadata, permissions, and component structure. The manifest specifies which websites the extension can access, what browser APIs it can use, which scripts run in the background, and which scripts inject into web pages. Chrome Web Store reviewers examine the manifest carefully — extensions requesting broad permissions (like access to all websites) face additional scrutiny. Manifest V3, the current version, imposes tighter restrictions on extension capabilities compared to the older V2 standard, particularly around background scripts and network request modification.
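
A minimal Manifest V3 file showing the pieces described above; the name and file paths are placeholders:

```json
{
  "manifest_version": 3,
  "name": "Example Scraper",
  "version": "1.0",
  "permissions": ["activeTab", "clipboardWrite"],
  "background": { "service_worker": "background.js" },
  "content_scripts": [
    { "matches": ["https://*/*"], "js": ["content.js"] }
  ],
  "action": { "default_popup": "popup.html" }
}
```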

MAP compliance (Minimum Advertised Price)

MAP is the lowest price at which a retailer is permitted to advertise a product, as set by the manufacturer or brand. MAP agreements are common in consumer electronics, sporting goods, and other categories where brands want to protect perceived product value. Brands monitor MAP compliance by tracking advertised prices across retailers — a task increasingly done through automated ecommerce data scraping rather than manual checking. MAP violations typically show up in public product listings before anyone reports them, making continuous monitoring significantly more effective than periodic audits.

N

No-code scraping

Web scraping tools and approaches that do not require writing code. The no-code scraping category includes browser extensions with visual point-and-click interfaces, desktop applications with drag-and-drop configuration, and web-based tools where users specify what to extract by clicking on page elements. The category has grown significantly as the underlying technical complexity has been abstracted away — tasks that previously required Python scripts and server infrastructure can now be completed by non-technical users in a browser. The primary trade-off compared to code-based scraping is flexibility: no-code tools handle common patterns well but may struggle with highly unusual page structures. See also: browser extension.

P

Pattern detection

In the context of web scraping tools, automatic pattern detection is the process of analysing a web page's HTML structure to identify groups of elements that share the same repeating layout — product cards in a grid, rows in a table, listings in a directory. The tool infers a selector or template that describes the pattern, then applies it to extract data from all matching elements. Pattern detection varies in sophistication: some tools find only clean, identical structures, while others handle variations in element counts or optional fields within a repeating pattern.
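
A toy version of the idea in Python: group a container's children by tag and class, and treat the largest group as the repeating pattern. Real detection is considerably more robust:

```python
from collections import Counter

from bs4 import BeautifulSoup

html = """
<div id="grid">
  <div class="card">A</div><div class="card">B</div>
  <div class="card">C</div><div class="banner">ad</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="grid")

# Signature = (tag, classes); the most common signature among the
# container's direct children is the best guess at the repeating pattern.
signatures = Counter(
    (child.name, tuple(child.get("class", [])))
    for child in container.find_all(recursive=False)
)
(tag, classes), count = signatures.most_common(1)[0]
print(tag, classes, count)  # div ('card',) 3
```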

Price monitoring

The systematic tracking of product prices across websites over time. Businesses use price monitoring to understand competitor positioning, detect promotional activity, enforce MAP compliance, and inform their own pricing decisions. At the enterprise level, price monitoring involves continuous automated collection across thousands of SKUs. At the individual seller level, it may mean checking a dozen competitor listings weekly. The data is only useful if collected consistently — a single price snapshot tells you where prices are now; repeated collection tells you how they move. See also: MAP compliance, automated data collection.

Proxy

A server that forwards network requests on behalf of the requester, masking the originating IP address. In web scraping, proxies are used to distribute requests across multiple IP addresses to avoid rate limiting or blocks from websites that restrict access based on request frequency from a single source. Proxy services range from shared pools of residential IP addresses to dedicated datacenter servers. Browser-based scraping that runs as a logged-in user with a real browser fingerprint typically requires proxies less often than headless or HTTP-based scraping, because the traffic pattern resembles normal browsing.
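
Routing a request through a proxy with Python's requests; the address uses a reserved documentation IP as a placeholder:

```python
import requests

proxies = {
    "http": "http://203.0.113.10:8080",   # placeholder proxy address
    "https": "http://203.0.113.10:8080",
}

# The target site sees the proxy's IP, not the requester's.
resp = requests.get("https://example.com", proxies=proxies, timeout=30)
print(resp.status_code)
```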

Public data

Information that is openly accessible on the internet without authentication, login, or special permission. Product prices, business listings, public social media profiles, and government datasets are examples of public data. The legal status of scraping public data varies by jurisdiction and context, but the general principle established in cases including hiQ Labs v. LinkedIn (US) is that accessing publicly available data does not violate the Computer Fraud and Abuse Act. Scraping data behind login walls, or using scraped data in ways that violate a site's terms of service, raises different considerations. See also: robots.txt.

R

Rate limiting

The practice of restricting how many requests a client can make to a server within a given time window. Websites implement rate limiting to prevent automated tools from overloading their servers or harvesting large volumes of data quickly. Scrapers that exceed rate limits typically receive HTTP 429 ("Too Many Requests") responses or get their IP address temporarily blocked. Responsible scraping respects a site's rate limits by introducing delays between requests. Browser-based scraping tools that operate at human-like speeds naturally stay within typical rate limits without additional configuration.
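
A sketch of respectful pacing in Python: a fixed delay between requests, plus a backoff that honours Retry-After when the server answers 429:

```python
import time

import requests

def polite_get(url, delay=2.0, retries=3):
    """Fetch a URL, pausing between requests and backing off on HTTP 429."""
    for attempt in range(retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:
            time.sleep(delay)  # baseline gap between requests
            return resp
        # Honour Retry-After if the server sent one, else back off exponentially.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt * delay))
        time.sleep(wait)
    resp.raise_for_status()
```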

Regex (Regular Expression)

A pattern syntax for matching and extracting text based on character sequences and rules. In web scraping, regex is used to extract specific values from text — pulling a price from a string like "Was £45.99, now £29.99," or extracting a product code from a longer description. Regex is powerful for text pattern matching but fragile when applied to HTML structure directly — small changes in the page markup can break a regex pattern. It's generally used in combination with HTML parsers rather than as a replacement for them.
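
The example from the definition, in Python:

```python
import re

text = "Was £45.99, now £29.99"

# A pound sign followed by digits, with optional two-decimal pence.
prices = re.findall(r"£(\d+(?:\.\d{2})?)", text)
print(prices)  # ['45.99', '29.99']

sale_price = prices[-1]  # the 'now' price is the last match in this string
```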

robots.txt

A text file at the root of a website (e.g. https://example.com/robots.txt) that communicates crawling preferences to automated bots. Using the Robots Exclusion Protocol, site owners specify which paths automated agents may or may not access, and at what crawl rate. Well-behaved web crawlers — including search engine bots — respect these directives. robots.txt is advisory rather than technically enforced; scraping tools are not required by any law to follow it, though ignoring it may factor into a site's legal position in disputes. See also: public data, rate limiting.
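
Python's standard library ships a parser for the Robots Exclusion Protocol; the site and user agent below are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()                                      # fetch and parse the file

# Check whether a given user agent may fetch a given path.
print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))
```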

S

Schema.org

A collaborative vocabulary for structured data markup on the web, developed by Google, Microsoft, Yahoo, and Yandex. Schema.org defines standard ways to describe entities — products, businesses, events, recipes, reviews — in machine-readable formats embedded in web pages. Product pages that implement Schema.org markup include fields like name, price, availability, and review data in a predictable structure, making them significantly easier to extract reliably than pages without it. Schema.org data is typically embedded as JSON-LD. See also: structured data, JSON-LD.

Selenium

A browser automation framework originally built for automated web application testing. Selenium controls a real browser (Chrome, Firefox, Edge) programmatically, allowing scripts to navigate pages, fill forms, click buttons, and extract content from the fully-rendered DOM. It handles dynamic content and JavaScript-heavy sites effectively. In web scraping, Selenium is used when the target page requires interactions (scrolling, clicking to load more results, handling login flows) that simpler tools cannot perform. It's slower and more resource-intensive than HTTP-based scraping. See also: headless browser.
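
A minimal example in Python (assumes pip install selenium; recent versions download the browser driver automatically). The URL and class name are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                 # opens a real Chrome window
driver.get("https://example.com/products")  # placeholder URL

# Elements are read from the fully rendered DOM, after JavaScript runs.
for el in driver.find_elements(By.CSS_SELECTOR, ".product-title"):
    print(el.text)

driver.quit()
```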

Structured data

Data organised in a predictable, defined format with consistent fields and types — as opposed to unstructured data like raw text or images. In the web context, structured data refers specifically to machine-readable markup embedded in web pages using formats like JSON-LD, Microdata, or RDFa, following the Schema.org vocabulary. Search engines use this data to understand page content and generate rich results. Scrapers can use it as a reliable extraction source when it's present, because it's already structured and doesn't require HTML parsing heuristics. See also: Schema.org, JSON-LD.

U

User agent

A string sent in HTTP request headers that identifies the client making the request — browser type, version, and operating system. Web servers use user agent strings to serve appropriate content and to identify and block automated scrapers. Scrapers sometimes set their user agent to mimic a real browser to avoid detection. Browser-based scraping tools that run inside an actual browser send the browser's real user agent automatically, making them indistinguishable from normal browsing traffic at the HTTP level.
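
Setting the header explicitly with Python's requests — the string mimics a desktop Chrome browser, with version numbers that will drift over time:

```python
import requests

headers = {
    # A typical desktop Chrome user agent string.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36"
}
resp = requests.get("https://example.com", headers=headers, timeout=30)
```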

UTF-8 BOM (Byte Order Mark)

A sequence of three bytes (EF BB BF) at the beginning of a UTF-8 text file that signals to applications how the file's text encoding should be interpreted. Excel on Windows requires the UTF-8 BOM to correctly display characters outside basic ASCII — accented letters, currency symbols, non-Latin scripts — when opening a CSV file directly. Without the BOM, Excel defaults to a system encoding that often corrupts these characters. Scraping tools that export CSV for Excel use should include the BOM. This is a known quirk; applications other than Excel typically handle UTF-8 files correctly with or without it.
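
In Python, the "utf-8-sig" codec adds the BOM on write and strips it on read; the check below shows the three bytes directly:

```python
with open("data.csv", "w", encoding="utf-8-sig") as f:
    f.write("Product,Price\nCafé Widget,€19.99\n")

with open("data.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf' — the UTF-8 BOM Excel looks for
```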

W

Web crawling

The automated process of systematically following links across websites to discover and index pages. Search engine bots are the most well-known crawlers — they start from known URLs, follow every link they find, and index the content of each page discovered. Web crawling differs from web scraping in purpose and scope: crawling is about discovery and traversal across many pages, scraping is about extracting specific data from specific pages. In practice, large-scale data collection often combines both: crawl to find the pages, scrape to extract the data from each.
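
A toy breadth-first crawler in Python illustrating the crawl-then-scrape split — the domain is a placeholder, and a real crawler would also respect robots.txt and rate limits:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start = "https://example.com/"  # placeholder site
domain = urlparse(start).netloc
queue, seen = deque([start]), {start}

while queue and len(seen) <= 50:  # small page budget for the sketch
    url = queue.popleft()
    html = requests.get(url, timeout=30).text
    # Crawling: discover new same-domain links to visit.
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == domain and link not in seen:
            seen.add(link)
            queue.append(link)
    # Scraping would happen here: extract specific fields from `html`.
```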

Web scraping

The automated extraction of data from websites. A scraper fetches a web page, parses its content, identifies the elements containing the desired data, and outputs that data in a structured format. Web scraping can be done at any scale — from a one-off extraction of a single product listing into a spreadsheet, to continuous collection of millions of data points per day across thousands of sites. The techniques range from simple HTTP requests with HTML parsing to full browser automation. The term is sometimes used interchangeably with data extraction and web crawling, though each has a distinct technical meaning.

WebAssembly (WASM)

A binary instruction format that runs in web browsers at near-native speed. WebAssembly allows code written in languages like Rust, C, and C++ to be compiled and executed in the browser — without requiring a server. In the context of browser extensions and web tools, this enables computationally intensive processing (parsing, pattern matching, data transformation) to run entirely client-side. SiteScoop is built in Rust and compiled to WebAssembly: all data extraction processing happens inside the browser tab with no data sent to any server. See also: browser extension.

X

XLSX

The file format used by Microsoft Excel since 2007. XLSX files are ZIP archives containing XML data and can store formatting, multiple sheets, formulas, and other spreadsheet features that CSV cannot. For data extraction purposes, XLSX is useful when the output needs to include formatting, column widths, or multiple sheets — or when recipients expect an .xlsx file rather than a CSV. The trade-off: XLSX is a binary format that requires a library to generate programmatically, whereas CSV is plain text that any tool can produce. See also: CSV, UTF-8 BOM.
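
A minimal sketch using openpyxl, one common Python library for generating XLSX:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Products"
ws.append(["Product", "Price"])        # header row
ws.append(["Widget", 19.99])
ws.column_dimensions["A"].width = 24   # formatting CSV cannot carry
wb.save("products.xlsx")
```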

XPath

A query language for navigating XML and HTML document trees. XPath expressions can locate elements based on their position, attributes, text content, or relationship to other elements — with more expressive power than CSS selectors for complex traversals. For example, XPath can select "the second <td> in any row where the first <td> contains the text 'Price'" — a query that CSS selectors cannot express. XPath is commonly used in scraping frameworks like Scrapy and in Selenium. It is more verbose than CSS selectors but more capable for sophisticated data location tasks.
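
The exact query from the definition, run with lxml in Python:

```python
from lxml import html

doc = html.fromstring("""
<table>
  <tr><td>Price</td><td>$19.99</td></tr>
  <tr><td>SKU</td><td>W-100</td></tr>
</table>
""")

# Second <td> in any row whose first <td> contains the text 'Price'.
cells = doc.xpath("//tr[td[1][contains(text(), 'Price')]]/td[2]/text()")
print(cells)  # ['$19.99']
```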

SiteScoop extracts structured data from any product listing or directory in one click. No code, no uploads — everything runs in your browser.

Add to Chrome — it's free