The Issues With Buying Datasets For AI

By Valerias Bangert
Published in DataSeries, Oct 13, 2021 (3 min read)
Photo by EKATERINA BOLOVTSOVA from Pexels

Labeled data is fueling the latest and greatest innovations in AI.

As data-heavy deep neural networks become mainstream, particularly in areas like self-driving cars and recommender systems, labeled data is becoming a key component of AI projects. After all, the ability to get accurate training data is crucial for building accurate machine learning systems.

One need only look at recent headlines to see how important labeled data is going to be: Tesla recently hired 1,000 data labelers to improve its self-driving systems, while Facebook is hiring 20,000 data labelers to label posts as fake or real and improve its classification systems.

Many startups are making big money selling pre-labeled datasets that they’ve created for these systems. These datasets include everything from CCTV scenes to dashcam footage with lane information recorded.

I spoke with Adhip Ray, the founder of a startup consultancy, about the privacy and copyright issues that come with pre-labeled data. The takeaway from our conversation is that pre-labeled data carries a major caveat: all too often, it runs afoul of privacy and copyright rules. Those regulations protect our privacy and safety, and they need to be followed.

Privacy Is Important

Pre-labeled datasets are often very expensive, but they may contain rich contextual information, which is why they’re so widely used. However, many of these datasets have not had personally identifiable information (PII) removed or de-identified before sale or use. Here’s where things get a bit tricky. As Adhip noted, “privacy matters sometimes crop up, in a weird and often unexpected variety of ways.”

Indeed, a myriad of laws and regulations govern privacy.

For example, the GDPR (General Data Protection Regulation) restricts sharing personal data with third parties unless there is a lawful basis for doing so, and only data that has been truly anonymized falls outside its scope. This means that a dataset such as a set of GPS coordinates of points of interest may not be compliant if matching those coordinates to a person’s identity is still possible.
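To make the GPS example concrete, here is a minimal Python sketch of one check a data owner might run before releasing location data. It rounds coordinates into coarse grid cells and flags cells visited by fewer than k distinct people, since sparse cells are the ones most easily matched back to an individual. The function name, the `(person_id, latitude, longitude)` input shape, and the default thresholds are all assumptions made for illustration. Passing a check like this is not a legal test of anonymization.

```python
from collections import defaultdict


def risky_cells(points, decimals=2, k=5):
    """Return grid cells visited by fewer than k distinct people.

    points: iterable of (person_id, latitude, longitude) tuples
            (a hypothetical input shape, not a standard format).
    decimals: rounding precision; 2 decimals is roughly a 1 km cell
              in latitude.
    k: minimum number of distinct people a cell needs to count as
       reasonably crowded (an illustrative threshold).
    """
    people_per_cell = defaultdict(set)
    for person_id, lat, lon in points:
        # Coarsen the location by rounding to a grid cell.
        cell = (round(lat, decimals), round(lon, decimals))
        people_per_cell[cell].add(person_id)

    # Keep only the cells with too few distinct visitors.
    return {
        cell: len(people)
        for cell, people in people_per_cell.items()
        if len(people) < k
    }
```

Any cell the function returns is a candidate for further coarsening or removal before the data leaves the building.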

More commonly, CCTV or dashcam footage may include people’s faces, license plates, and other personal information. So what does this mean for the AI community? In many cases, it means that the datasets being used have not been thoroughly reviewed and audited for privacy concerns, so you should proceed carefully.

Businesses that fail to comply with these regulations face steep fines. For example, in July 2021 Luxembourg’s data protection authority fined Amazon €746 million (roughly $888 million) for GDPR violations.

Concerns With Data Protected by Copyright

Another problem is that the data being used may be protected by copyright law. If there is any doubt about whether a dataset you are working with was collected and licensed lawfully, you should seek permission from the rights holder before using it in your product or service.

Using publicly available images or videos from sites like Flickr or YouTube could lead to problems because datasets built from them often lack adequate context; for instance, often all you see is a single image with no indication of where it came from and therefore who owns the copyright.
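As a rough illustration of auditing a purchased dataset for that missing context, the Python sketch below flags records that lack basic provenance metadata. The field names `source_url`, `license`, and `author` are hypothetical; a real dataset will use its own schema, and having these fields filled in does not by itself prove that the content is cleared for use.

```python
# Hypothetical provenance fields; adapt these to the dataset's actual schema.
REQUIRED_FIELDS = ("source_url", "license", "author")


def missing_provenance(records):
    """Return (index, missing_fields) pairs for records lacking provenance.

    records: list of dicts, one per image or video in the dataset.
    A field counts as missing if it is absent, None, or blank.
    """
    flagged = []
    for index, record in enumerate(records):
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not str(record.get(field) or "").strip()
        ]
        if missing:
            flagged.append((index, missing))
    return flagged
```

Records that come back flagged are the ones to trace to their source, or drop, before training on them.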

Further, the so-called sui generis database right in Europe means that the mere fact of information being aggregated in a database may make it legally protected, separately from copyright, when the maker has invested substantially in obtaining, verifying, or presenting the contents. This right was established by the EU’s Database Directive (Directive 96/9/EC).

The Alternative

For all these reasons (privacy concerns, copyright concerns, and the lack of context surrounding some pre-labeled data), many businesses have been turning to labeling data themselves. That said, menial labeling work is not a good use of a data scientist’s time, so these businesses often rely on data labeling platforms like Toloka.ai.

By labeling your data instead of buying it, you can ensure that it is properly labeled and complies with local privacy and copyright regulations. Then, you can just focus on building your AI system without having to worry about whether or not some other company has been sloppy with its handling of PII.

Ultimately, data is the fuel for modern AI projects, but all too often, pre-labeled datasets don’t adhere to privacy and copyright laws.


Valerias Bangert is an award-winning content specialist with experience bringing dozens of companies to #1 in Google rankings.