Garbage In, Garbage Out: The Importance of Good Data

Mary Wolff
6 min read · Jun 4, 2019


Data is generated from every single digital interaction. And thanks to computer vision and the millions of cameras deployed worldwide, we can generate data from nearly every physical interaction as well. These devices can already make inferences about who you are in the world; soon enough, those cameras may be able to infer what you’re thinking. Monica Rogati, a former data scientist at LinkedIn, created a simple illustration of how data feeds into AI. The bottom four of its six stages are simply the collection and refinement of data, highlighting the pivotal role data plays in the development of any algorithm. Ultimately, the data is crucial. As Peter Norvig, Google’s Director of Research, told the HBR regarding Google’s leadership in the field, “We don’t have better algorithms…we just have more data.” And it’s not just about the amount of data; the quality is just as important. Garbage in, garbage out.

Challenges Specific to Facial Recognition

Data, its source, and its diversity have a profound impact on facial recognition algorithms. Failing to include all demographic categories in the training data leads to biased outcomes. Covering parameters like skin tone, gender, and age is therefore crucial when collecting datasets and training models. But the problem isn’t only a lack of diversity among engineers; sufficiently diverse labeled data often simply does not exist. It has therefore been challenging even for large firms to build solutions that are free from bias and perform uniformly across all social, ethnic, and other categories. As an example, an MIT study found that the facial recognition systems of Microsoft, IBM, and Amazon performed considerably worse at identifying dark-skinned women than other demographic groups.

Why Labeled Data is Important

Labeled data is data that has been tagged or categorized so that it can be readily fed into a machine learning (ML) model. For example, in facial recognition, one way to train a model is to feed it data accurately labeled with gender, age, and ethnicity. A labeled dataset will have images of people of all genders, with a predefined label indicating whether a particular image shows a male, a female, or another gender. Labeled datasets are required for supervised learning and for semi-supervised learning (where only some of the data is labeled).
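To make the distinction concrete, here is a minimal sketch of labeled versus unlabeled records as they might be prepared for supervised training. The file names and label fields are hypothetical placeholders, not any real dataset’s schema:

```python
# Labeled records: each image is tagged with the attributes a supervised
# facial recognition model could be trained to predict.
labeled = [
    {"image": "face_001.jpg", "gender": "female", "age": 34, "ethnicity": "Black"},
    {"image": "face_002.jpg", "gender": "male",   "age": 52, "ethnicity": "East Asian"},
]

# Unlabeled data: images only, unusable for supervised training as-is.
unlabeled = ["face_003.jpg", "face_004.jpg"]

def training_pairs(records, target="gender"):
    """Turn labeled records into (input, target) pairs for a supervised model."""
    return [(r["image"], r[target]) for r in records]

pairs = training_pairs(labeled)
# pairs == [("face_001.jpg", "female"), ("face_002.jpg", "male")]
```

In a semi-supervised setting, a model trained on `labeled` would then be used to propose labels for `unlabeled`, which is one way firms stretch a small hand-labeled core across a much larger raw collection.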

Fixing the issue of inadequate data is a priority for any research team, and for the entire industry. Teams have approached it in different ways, from diverse datasets produced in controlled environments by university researchers to randomly scraping images from the Internet. The latter is the more common approach, as it costs little and theoretically ensures an endless supply. However, it has many limitations.

For one, the data collected by scraping isn’t always reliable. A common way to fetch celebrity images, for example, is to type their names into Google search and apply facial recognition algorithms to the results. This is limited because more advanced algorithms need more than just a name and an image. Many other important attributes are missing, effectively making this data unlabeled and therefore useless for training and testing any meaningful facial recognition algorithm.

Secondly, even for better-curated and labeled datasets, there have been many privacy concerns about data being used without the consent of the individuals involved. A dataset recently released by IBM for research has come under criticism because it contained over a million photographs obtained from Flickr users, none of whom were asked for their active consent. The company insists, however, that it did so to make facial recognition algorithms more accurate and fairer across different social groups.

But having a lot of data isn’t always the answer either. At Kairos, for example, having hundreds of thousands of labeled images of different faces only helps as a baseline. Having multiple images of one individual in different environments, from different angles, and with different facial expressions is far superior.

Analysis of Data Sources

Kairos’ analysis of the top open-source datasets suggests that most contain a limited amount of data, with images numbering only a few thousand on average. These datasets are mostly compiled by researchers or individual companies for research purposes, not for training commercial algorithms. While these datasets attempt to include people from diverse backgrounds and ethnicities, they fall short of the diversity required to train a viable model. However, some newer-generation datasets, from Facebook for example, are easing these concerns and becoming increasingly reliable.

Looking at how bigger firms obtain their datasets reveals a more complex structure. The most advanced companies in the field, namely Google and Facebook, use the data available to them from their own users and label it through semi-supervised learning or by crowdsourcing part of the work.

However, the growing creation and enforcement of privacy laws threaten the sustainability of this practice, and it will become increasingly difficult to use data the way it is used now. Under Europe’s General Data Protection Regulation, for example, photos of individuals are considered “sensitive personal information” if they are used to confirm the individual’s identity. In the US, some states are following suit, such as with the Illinois Biometric Information Privacy Act, which prohibits capturing, storing, and sharing biometric information without the written consent of the individual. That definition covers iris scans as well as facial geometry.

It is evident that laws, privacy concerns, and the quality of existing datasets will force the industry to find an alternative way to collect, or possibly generate, larger quantities of more diverse data.

Can Generative Adversarial Networks Be a Solution?

Generative Adversarial Networks (GANs) are a type of generative neural network that can efficiently learn from an existing dataset and produce similar new data points (e.g. images of faces, in the case of face recognition). Previously, methods like artificially adding Gaussian noise were used to augment datasets, but those methods have limitations that GANs can overcome.

A GAN consists of two separate neural networks: one generates a distribution, and the other attempts to determine whether the generated distribution is real or fake. In a way, they compete with each other for better accuracy at their respective tasks, hence the name “adversarial.” The generator takes random numbers (noise) as input and transforms them through the network; the generated distribution is compared with the original dataset, and the error is backpropagated to improve the network’s performance. Examples of digits and faces generated by a simple network:
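The two-network setup can be sketched in a few lines of NumPy. This is a toy illustration on 1-D data, not a trainable implementation: the layer sizes and random weights are placeholders, and the backpropagation step that would update both networks is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: samples from a 1-D Gaussian the generator should imitate.
real = rng.normal(loc=4.0, scale=1.25, size=(64, 1))

# One-hidden-layer generator and discriminator with random (untrained) weights.
G_w1, G_w2 = rng.normal(size=(1, 16)), rng.normal(size=(16, 1))
D_w1, D_w2 = rng.normal(size=(1, 16)), rng.normal(size=(16, 1))

def generator(z):
    """Transform random noise into a fake sample."""
    return np.tanh(z @ G_w1) @ G_w2

def discriminator(x):
    """Score a sample with the probability that it came from the real data."""
    h = np.tanh(x @ D_w1)
    return 1.0 / (1.0 + np.exp(-(h @ D_w2)))   # sigmoid -> value in (0, 1)

z = rng.normal(size=(64, 1))                   # the random-noise input
fake = generator(z)

# The adversarial objectives: the discriminator tries to score real samples
# high and fake ones low; the generator tries to make fakes score high.
d_loss = -np.mean(np.log(discriminator(real)) + np.log(1 - discriminator(fake)))
g_loss = -np.mean(np.log(discriminator(fake)))
```

In a real GAN, `d_loss` and `g_loss` would be backpropagated in alternating steps to update the discriminator’s and generator’s weights, which is the competition the paragraph above describes.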

Source: Generative Adversarial Nets (University of Montreal)

The results of more advanced GANs, which pair convolutional and deconvolutional networks, are far more promising than those in the original paper. It is now possible to generate faces with specific features (e.g. darker-skinned women) to make algorithms more accurate and reduce the risk of biased identification across a diverse population.
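Generating faces with specific features is typically done by conditioning the generator on a label. A minimal sketch of the idea, with hypothetical attribute names and placeholder random weights standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical attributes a conditional generator could be asked to produce.
ATTRIBUTES = ["light-skinned", "dark-skinned"]

def one_hot(label):
    """Encode the requested attribute as a one-hot vector."""
    v = np.zeros(len(ATTRIBUTES))
    v[ATTRIBUTES.index(label)] = 1.0
    return v

# The generator's input is noise concatenated with the label, so the
# requested attribute steers the output. W is an untrained placeholder.
NOISE_DIM, OUT_DIM = 8, 4
W = rng.normal(size=(NOISE_DIM + len(ATTRIBUTES), OUT_DIM))

def conditional_generate(label):
    z = rng.normal(size=NOISE_DIM)              # random noise
    return np.concatenate([z, one_hot(label)]) @ W  # tiny stand-in "image"

sample = conditional_generate("dark-skinned")
```

Because the label is part of every training input, a trained conditional generator learns to associate it with the corresponding visual features, which is what lets a team deliberately fill demographic gaps in a dataset.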

With growing privacy concerns, the prospect of stricter legislation against using scraped data for machine learning, and continued advances in GANs, it is likely that generated data will replace conventional data-gathering and bring more fairness and transparency in the coming years.


Mary Wolff

Sr. Director AI @ Sears/Kmart. Former COO @ Kairos, a venture-backed facial recognition firm. Lawyer. Avid reader. Woman in tech.