Search for "data harvesting" and you get surveillance capitalism. Cambridge Analytica. Facebook. The kind of coverage where the word "harvest" is doing heavy moral lifting — people as crop, not customers.

Then there's the procurement manager in Bristol who Googles the same term because she's trying to describe what she does every Monday morning: pulling supplier prices from twelve different distributor websites into a spreadsheet, manually, one tab at a time. She does not feel like a surveillance operation. She feels like someone with a very boring spreadsheet problem.

Both of these things are called data harvesting. They share almost nothing else.

The Version That Made the News

The version most people know about starts with a number: 87 million. That's how many Facebook profiles Cambridge Analytica was revealed, when the scandal broke in 2018, to have obtained through a personality quiz app, before using the data for political targeting. The profiles weren't stolen in the traditional sense — the data was technically accessible through Facebook's API at the time. The controversy was about what happened next: profiling, targeting, and influence at a scale the users never consented to.

That case defined the public understanding of "data harvesting" in a way that's been hard to shift. The term now carries connotations of extraction without consent, of personal data being commodified, of users being the product rather than the customer. It's not an unfair reading of what happened. But it's also not the only thing the term describes.

The Version That's Just Tuesday

The version that doesn't make headlines is significantly more common. A market researcher needs pricing data from forty competitor product pages. A procurement team tracks supplier quotes across six distributor sites. An ecommerce seller checks competitor listings before setting weekend prices.

This is web scraping applied to publicly visible information — the same pages anyone with a browser can read, accessed systematically rather than one at a time. The data is public. The pages are designed to be seen. The only difference between reading a competitor's price list and extracting it programmatically is speed and scale.
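That speed-and-scale difference is easy to see in code. A minimal sketch, using only Python's standard library and assuming a hypothetical public product page whose prices sit in elements marked `class="price"` (the HTML snippet below stands in for a real fetched page):

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collects the text of any tag carrying class="price"."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text node belongs to a price element.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

# In practice this HTML would come from an HTTP request to a public
# page; a static snippet stands in for it here.
page = """
<ul>
  <li><span class="name">Widget A</span> <span class="price">£4.99</span></li>
  <li><span class="name">Widget B</span> <span class="price">£7.50</span></li>
</ul>
"""

extractor = PriceExtractor()
extractor.feed(page)
print(extractor.prices)  # the same figures a human would read off the page
```

The extracted values are exactly what a browser renders to any visitor; the script just reads them in milliseconds instead of minutes, and forty pages instead of one.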

The global market for web scraping tools that support this kind of work is valued at roughly $1 billion and growing. The users are mostly not technical. They are analysts, researchers, and procurement staff who need data from websites and have found that "copy and paste" is not a workflow that scales.

Where the Confusion Costs You

The problem with sharing a name is that the associations travel. A market researcher evaluating data extraction tools for her team encounters privacy headlines every time she tries to research the category. A procurement manager who wants to explain the value of price monitoring to his manager first has to disentangle the term from its surveillance connotations.

And in the tool market itself, the confusion is compounded by the fact that some tools genuinely do blur the line. Browser extensions that extract data from public pages while also sending that data to the developer's servers. Services that promise competitive intelligence while collecting information about what you're searching for. The tool that solves your spreadsheet problem while also solving someone else's data problem — using your activity to do it.

The architecture matters here more than the marketing does. A tool that processes everything locally — where the extraction logic runs in your browser and the results go directly to your clipboard or your downloads folder — is structurally different from one that routes your data through a server. Not a difference of policy, but of structure. SiteScoop runs in WebAssembly inside your browser tab. There is no server to send data to, because there is no server.

The Same Word, Two Different Accountability Questions

When a company harvests your personal data, the accountability question is: did they have consent, and what did they do with it? The Cambridge Analytica answer to both was unflattering.

When a business harvests data from public websites, the accountability question is different: is the data public, is the use legitimate, and is the tool doing anything with your activity that you haven't agreed to? The first two are usually straightforward. The third is where it gets interesting, and where the tool's architecture — not its privacy policy — tends to give the clearest answer.

The Bristol procurement manager and the political data firm are not doing the same thing. They happen to be using the same word.