Impact of dataset errors on model accuracy

Maiya Rozhnova
Deelvin Machine Learning
5 min read · Jul 27, 2021

Working with data is essential for ML researchers because the resulting model depends on the choice of data and the training strategy. Consequently, each researcher faces the following questions:

  • How much data is needed to obtain high accuracy?
  • Can a network learn from erroneous data?
  • If so, what is the maximum error rate that can be present in the data for the accuracy of the network to remain high?

In this article, we will try to find answers to these questions by training a network on datasets with varying numbers of samples and varying numbers of label errors.

We used the MNIST dataset of handwritten digits. MNIST contains 60,000 images in the train sample and 10,000 images in the test sample. Figure 1 shows examples of images from the dataset; these are single-channel images of 28×28 pixels.

Fig. 1 MNIST dataset samples

The distribution by class in the train sample is as follows:

‘0’ — 5923, ‘1’ — 6742, ‘2’ — 5958, ‘3’ — 6131, ‘4’ — 5842,

‘5’ — 5421, ‘6’ — 5918, ‘7’ — 6265, ‘8’ — 5851, ‘9’ — 5949
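
For readers who want to reproduce these counts, here is a minimal sketch (our own, not the original experiment code) that loads MNIST with torchvision and prints the class distribution:

import torch
from torchvision import datasets, transforms

# Load the standard MNIST train and test splits (60,000 and 10,000 images)
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
test_set = datasets.MNIST(root="./data", train=False, download=True,
                          transform=transforms.ToTensor())

# train_set.targets holds the labels 0..9; bincount gives the per-class counts
print(torch.bincount(train_set.targets))  # tensor([5923, 6742, 5958, 6131, ...])
print(len(train_set), len(test_set))      # 60000 10000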

Network description:

We created a convolutional neural network with two convolutional layers and one fully connected layer using PyTorch (torch.nn.Module). We used the ReLU activation function and max pooling.

import torch.nn as nn

conv1 = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2))
conv2 = nn.Sequential(nn.Conv2d(16, 32, 5, 1, 2), nn.ReLU(), nn.MaxPool2d(2))
out = nn.Linear(32 * 7 * 7, 10)
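
The fully connected layer takes 32 * 7 * 7 inputs because each 5×5 convolution with padding 2 preserves the 28×28 resolution, while each max-pooling step halves it: 28 → 14 → 7. A possible forward pass over these layers looks as follows (our sketch; inside a torch.nn.Module the layers would be attributes such as self.conv1):

def forward(x):                   # x: (batch, 1, 28, 28) MNIST images
    x = conv1(x)                  # -> (batch, 16, 14, 14)
    x = conv2(x)                  # -> (batch, 32, 7, 7)
    x = x.view(x.size(0), -1)     # flatten to (batch, 32 * 7 * 7)
    return out(x)                 # -> (batch, 10) class logits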

To find out how much data is required for high network accuracy, we conducted the following experiment. N elements of each class were randomly selected from the full training set (60,000 samples), so the size of each mini-dataset is N × 10 classes. The network was trained on each mini-dataset for 10 epochs, and accuracy was measured on the test sample (Fig. 2, 3). Each point on the graphs corresponds to the test accuracy obtained after training on one of these mini-datasets.
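
One possible way to build such a balanced mini-dataset is a torch.utils.data.Subset over randomly chosen indices (a sketch under our assumptions; the article does not show its selection code). Here train_set is the torchvision MNIST training set from the loading sketch above:

import torch
from torch.utils.data import Subset, DataLoader

def make_mini_dataset(train_set, n_per_class, seed=0):
    """Randomly pick n_per_class samples from each of the 10 classes."""
    g = torch.Generator().manual_seed(seed)
    indices = []
    for c in range(10):
        # indices of all samples labeled c, in random order
        class_idx = (train_set.targets == c).nonzero(as_tuple=True)[0]
        perm = class_idx[torch.randperm(len(class_idx), generator=g)]
        indices.extend(perm[:n_per_class].tolist())
    return Subset(train_set, indices)

mini = make_mini_dataset(train_set, n_per_class=100)      # 100 * 10 = 1,000 samples
mini_loader = DataLoader(mini, batch_size=64, shuffle=True)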

Fig. 2 Dependence of accuracy on the test sample (10,000) on the number of elements in the class (from 1 to 3500 elements in the class)

The prediction accuracy on the test sample equals 99% when the network is trained on the full MNIST dataset. When training the network on mini-datasets with more than 100 elements in each class (i.e. more than 1,000 elements in total), we get a test accuracy above 90% (Fig. 2, 3). The larger the training sample, the more accurate the model's predictions. If the dataset contains more than 250 elements of each class, the accuracy reaches 95%. Using only a third of the original dataset, without making the network any more complex, we nearly reach the accuracy obtained on the full dataset: 2,000 elements in each class give an accuracy of 98%. In Fig. 2 the horizontal step equals 100 elements (i.e. +10 elements per class at each step); in Fig. 3 the step is 10 (i.e. +1 element per class at each step).

Fig. 3 Dependence of accuracy on the test sample (10,000) on the number of elements in the class (from 1 to 200 elements in the class)

When working with large amounts of data, errors may appear in a dataset due to incorrect automatic processing or due to human error during manual labeling. Intuitively, when erroneous data is present in the dataset, the prediction accuracy of the network is expected to fall. Let’s check how much the accuracy drops when the network described above is trained on datasets that contain label errors.

We corrupted the MNIST dataset in the following way. In each class, a correctly labeled element was selected and assigned a random wrong label (one of the remaining 9 classes), so every class contained the same number of errors. For example, if a picture shows a “1”, we change its label to “5”. Then we ‘spoil’ 10 more elements of the dataset (one per class), and so on, gradually increasing the number of errors in the dataset.
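
A sketch of this corruption step (our reconstruction; the article does not list the exact code) might flip the chosen number of labels per class in one pass rather than one element at a time, which produces the same end state:

import torch

def corrupt_labels(train_set, errors_per_class, seed=0):
    """Reassign errors_per_class labels in every class to a random wrong class."""
    g = torch.Generator().manual_seed(seed)
    targets = train_set.targets.clone()
    for c in range(10):
        class_idx = (train_set.targets == c).nonzero(as_tuple=True)[0]
        perm = class_idx[torch.randperm(len(class_idx), generator=g)]
        for i in perm[:errors_per_class]:
            wrong = torch.randint(0, 9, (1,), generator=g).item()
            targets[i] = wrong if wrong < c else wrong + 1   # any class except the true one
    train_set.targets = targets   # torchvision MNIST reads labels from .targets
    return train_set

corrupted = corrupt_labels(train_set, errors_per_class=1000)  # note: modifies train_set in place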

The training results are shown in Fig. 4. Each point on the graph corresponds to one training run on a dataset with errors; the horizontal axis indicates the number of errors in each class of the dataset, and the vertical axis shows the accuracy on the train sample and the best accuracy on the test sample after 10 and 20 epochs.
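
For reference, a minimal training and evaluation loop of the kind used for these runs could look like this (a sketch; the optimizer, learning rate, and batch sizes are our assumptions and are not reported in the article):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def evaluate(model, loader, device="cpu"):
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
    return correct / len(loader.dataset)

def train(model, train_loader, test_loader, epochs=20, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    best_test_acc = 0.0
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        # track the best test accuracy seen over all epochs
        best_test_acc = max(best_test_acc, evaluate(model, test_loader, device))
    return evaluate(model, train_loader, device), best_test_acc

# Usage (model being the CNN described above, wrapped in a torch.nn.Module):
# train_acc, best_test_acc = train(model,
#                                  DataLoader(corrupted, batch_size=64, shuffle=True),
#                                  DataLoader(test_set, batch_size=256))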

Fig. 4 Dependence of accuracy on the test sample (10,000) on the number of erroneous elements in each class (5400 maximum), training — 20 epochs

We found that on this dataset the network can be resistant to errors. With 1,000 errors in each class (15–18% of each class), the accuracy on the test sample remains above 92% (Fig. 4). As the number of incorrectly labeled elements increases, the accuracy on the test and train samples gradually decreases. Even with 2,000 errors in each class (i.e. a third of the dataset mislabeled), we still get high accuracy on the test sample — 90% (Fig. 4, green curve).

As shown above in Fig. 2 and 3, a network trained on a dataset of only 1,000 images already reached a test accuracy of 90%. This may explain the result: even with 1,000 or 2,000 errors in each class, the network still had enough correctly labeled elements to learn from.

It is worth noting that as the number of errors in the dataset increases, the accuracy on the train sample decreases faster than the accuracy on the test sample. This means that the network was learning from the corrupted data rather than simply memorizing the training set. At the same time, more complex networks with a large number of parameters are able to memorize a training set with random (incorrect) labels, reaching almost 100% accuracy during training but low accuracy on the test set. Such a case is described, for example, here.

Thus, we demonstrated that a convolutional neural network is capable of learning from datasets that contain errors, although its prediction accuracy decreases.
