8. Common training data errors / AI Product Management


In the last post, we talked about building a dataset with the business problem in mind. In this post, we will discuss the common errors that can creep into your training datasets and how to avoid them. These common issues in training data can significantly impact the outcomes of our AI models and products.

Here are five common data errors and what we can do to avoid them.

1. Mislabeled Data

Annotating and labelling data correctly is a prerequisite for a well-functioning model. Accurately labelling data when parsing entities from text can be challenging because it may be unclear how specific entities should be classified. Similarly, if we are annotating an image, video, or audio clip, there are several ways in which we can get the labelling wrong.

[Image: example of annotated text. Source: Telus International]

For example, in the annotation above, people and band names are in orange, countries in lighter orange, cities in yellow, and album titles in red. If the annotator is not given the right instructions, it can be difficult for someone without much knowledge of music or the artist to work out that “Crisp” is a band name and “Healing is Difficult” is indeed an album title.

Therefore, it is crucial to exercise caution when assigning labels to training data. We must ensure that the labels returned during data collection and annotation, whether done manually or through a large platform provider, are correctly classified as expected.

How do I avoid such errors?

To reduce the risk of making mistakes, we can provide clear instructions to annotators (humans or platforms) to avoid mislabeled data scenarios. It also helps to have quality controls in place so that labelling mistakes don't recur often and are not detrimental to the accuracy of the model and business outcomes.
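
If you want to quantify how consistent your annotators are, one common quality control is to have two annotators label the same sample of items and measure their agreement. Here is a minimal Python sketch using scikit-learn's cohen_kappa_score; the labels are made up for illustration:

```python
# Quality-control sketch: measure inter-annotator agreement with Cohen's kappa.
# Assumes two annotators labelled the same sample of items; the label values
# here are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["band", "album", "city", "band", "country"]
annotator_b = ["band", "band", "city", "band", "country"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Rough rule of thumb: kappa below ~0.6 suggests the instructions are
# ambiguous and the disagreeing items should be reviewed.
if kappa < 0.6:
    disagreements = [
        (i, a, b) for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b
    ]
    print("Items to review:", disagreements)
```

Low agreement is usually a sign that the labelling instructions, not the annotators, need fixing.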

2. Unbalanced training data

Suppose you are building a model to classify different indoor plants, and your training data contains far more examples of some classes than others. This is unbalanced data. In such cases, the model will tend to learn more about the over-represented classes, such as Spider Plant or Peace Lily, and may exhibit bias towards those classes when deployed in real-world scenarios.

How do I avoid such errors?

To prevent this, we need to either collect more training data for the lacking classes or reduce the amount of training data for those that have too much. While having more training data often results in a better model, it is equally essential, if not more so, to balance the data to produce an unbiased model.
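
As a starting point, it helps to simply count the labels and see how skewed they are. The sketch below illustrates the idea with naive random oversampling; the plant labels and file names are hypothetical, and in practice you might prefer augmentation or class weights over plain duplication:

```python
# Sketch: inspect class balance and oversample under-represented classes.
# The samples are illustrative stand-ins for your (example, label) pairs.
from collections import Counter
import random

samples = [
    ("img_001", "Spider Plant"), ("img_002", "Spider Plant"),
    ("img_003", "Spider Plant"), ("img_004", "Peace Lily"),
    ("img_005", "Peace Lily"), ("img_006", "Snake Plant"),
]

counts = Counter(label for _, label in samples)
print(counts)  # Counter({'Spider Plant': 3, 'Peace Lily': 2, 'Snake Plant': 1})

# Naive random oversampling: duplicate minority-class examples until every
# class matches the largest one. (Real pipelines would augment, not just copy.)
target = max(counts.values())
balanced = list(samples)
for label, n in counts.items():
    pool = [s for s in samples if s[1] == label]
    balanced += random.choices(pool, k=target - n)

print(Counter(label for _, label in balanced))
```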

3. Bias in Training Data

Bias in training data is a topic that keeps cropping up.

Bias can be introduced at two stages:

  1. By having unbalanced training data (as mentioned above), where the model is trained predominantly on one class of data
  2. During the labelling process itself. Bias can creep in when a large, heterogeneous team of annotators applies the guidelines inconsistently, or when labelling requires specific context that the annotators lack.

How do I avoid data labelling biases?

To ensure that mistakes are caught before they influence your model, implement quality checks throughout your data labelling process. We can also leverage AI to double-check annotators’ judgments before they are submitted (known as smart labelling).
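
As an illustration of the smart-labelling idea, here is a sketch in which a model trained on already-verified data flags new human labels that it confidently disagrees with, so a reviewer can re-check them before they enter the training set. The function and variable names are hypothetical, and any classifier with predict_proba would do:

```python
# "Smart labelling" sketch: flag human labels that a trusted model
# confidently disagrees with, for a second review pass.
from sklearn.linear_model import LogisticRegression

def flag_suspect_labels(model, X_new, human_labels, threshold=0.9):
    """Return indices where the model confidently disagrees with the annotator."""
    probs = model.predict_proba(X_new)
    suspects = []
    for i, label in enumerate(human_labels):
        pred_idx = probs[i].argmax()
        pred_label = model.classes_[pred_idx]
        # Disagreement alone isn't enough; require high model confidence.
        if pred_label != label and probs[i][pred_idx] >= threshold:
            suspects.append(i)
    return suspects

# Usage (assuming X_verified/y_verified are trusted, vectorized examples):
# model = LogisticRegression().fit(X_verified, y_verified)
# for i in flag_suspect_labels(model, X_new, new_labels):
#     print("Review item", i)
```

Note that this only surfaces candidates for review; blindly overwriting human labels with model predictions would re-introduce the model's own bias.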

4. Training data differs from real data

Another scenario is when the training examples differ from real-world examples, even though they are nominally the same kind of data. For example, if you are building an AI model to detect anomalies in the sounds of heavy mining machinery, you must provide the model with examples of all the types of noises it needs to track in real time. This is necessary in order to detect any machine faults.

Consider two audio recordings of the same malfunction: one simulated and collected in a recording studio, the other collected on-site, where it picks up a great deal of background noise, just as in real-life situations. If we were to train the model only on studio audio, it may fail on noisy recordings collected from mobile devices in the field. This is an example of why we need to know what the real-world data will look like so that we can collect appropriate data.
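
If collecting more on-site audio is expensive, one practical way to close the gap is to augment the clean studio recordings with recorded background noise. Below is a minimal numpy sketch that mixes noise into a clean signal at a chosen signal-to-noise ratio; the synthetic tone and random noise are stand-ins for real recordings:

```python
# Sketch: make clean studio recordings look more like on-site audio by mixing
# in background noise at a target signal-to-noise ratio (SNR). Assumes the
# recordings are already loaded as float numpy arrays at the same sample rate.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean`, scaled so the result has roughly `snr_db` dB SNR."""
    noise = noise[: len(clean)]                    # trim to matching length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12      # avoid divide-by-zero
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example with synthetic stand-in signals:
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
noise = rng.normal(0, 1, 16000)                             # stand-in site noise
noisy_training_example = mix_at_snr(clean, noise, snr_db=5.0)
```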

A quick note on data recency, i.e. how current the model's training data is:

Data recency matters because all models degrade over time as the world evolves and moves forward. For example, after the onset of the pandemic, recognizing human faces became increasingly challenging with the widespread use of face masks and PPE.

How do I avoid such errors?

Always make sure that your training datasets are:

  1. highly representative of the real world, i.e. collected at the point of origin in real-world scenarios, and
  2. recent. The timestamp on data collection and annotation matters; keep the data current so the model doesn't degrade over time or because of external cultural, economic, or social changes (a quick freshness check like the sketch below can help).
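
If your records carry collection timestamps, a freshness check is easy to automate. Here is a small pandas sketch, assuming a hypothetical collected_at field:

```python
# Sketch: flag how much of the dataset is stale, assuming each record
# carries a collection timestamp. Column names here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "sample_id": [1, 2, 3],
    "collected_at": pd.to_datetime(["2021-03-01", "2023-05-20", "2023-06-30"]),
})

cutoff = pd.Timestamp.now() - pd.Timedelta(days=365)
stale_fraction = (df["collected_at"] < cutoff).mean()
print(f"{stale_fraction:.0%} of samples are over a year old")

# If a large share of the data predates a known shift (e.g. the pandemic),
# consider re-collecting or re-weighting before retraining.
```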

5. Insufficient Data

Finally, let’s consider the amount of training data used to train our models. This can be a bit of a grey area, as the amount of data required varies widely based on several factors, including the complexity and type of data, the real-world applications of the data, and the model and its architecture itself. It’s important to consider what the real-world data will look like when determining how much training data to use.

Suppose you are building a self-driving car that has to detect any animals in its path. If it will drive in outback Australia, detecting a kangaroo becomes very relevant, so we need to include enough examples of kangaroos in the training set. If your dataset does not have enough pictures or videos of kangaroos on the road, your model may not be able to detect them, and in turn your self-driving car may have trouble manoeuvring around them (unless you have a safe default built into the model to handle unknowns).

How do I avoid such errors?

While there is no clear-cut rule on how much data is needed, we generally start with a few hundred examples of each target class and then scale up the amount of training data until we reach the desired accuracy.
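
One way to tell whether more data would still pay off is to plot a learning curve: train on growing subsets and watch the validation score. The sketch below uses scikit-learn's learning_curve on a stand-in dataset; if accuracy is still climbing at the largest training size, more data is likely to help:

```python
# Sketch: a learning curve to judge whether collecting more data would help.
# The digits dataset is a stand-in for your own data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of the training split
    cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} examples -> validation accuracy {score:.3f}")
```

A curve that has flattened out suggests your effort is better spent on data quality or model changes than on sheer volume.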

Summary:

In machine learning, the training data is crucial for building robust AI. We rely on the data to provide the machine with the insights it needs to process real-world data.

By considering the types of data that the model will encounter in the real world, we can tailor our training data to encompass as many cases as possible. Although it may not be feasible to cover every case, we can monitor the model’s performance over time and update it as necessary to address any weaknesses.

Thanks for reading! In the next post, we will discuss training and evaluating a model.
