How to make the most of data surplus

Tom Ash
Speechmatics
5 min read · Mar 14, 2019

This is a three-part blog series addressing the reality of having a surplus of data in an increasingly data-driven world. In this series, I’ll offer possible solutions to an excess of data: tips on how to train on a surplus of data, advice on how to reduce your data down to the most important examples, and practical suggestions on how to use an excess of data in one domain to improve performance in a different one. Can’t wait to read the entire series? Download my e-book here!

In Part 1 of my blog series, I discussed what you can do with a surplus of data and some of the drawbacks of having too much of it. I established that more data does not always lead to improved performance when training AI models and proposed some approaches for training on all that data. Part 2 will address how you can reduce your excess data down to just the examples that matter most.

Part 2

Filter out rubbish

The truth is that a lot of your data is probably bad data. That is not your fault. You did your best. But sometimes your HTML parser barfs on a particularly malformed website. Or an angry user abuses your web feedback form. These things happen, but regardless of the cause, you don’t want to train on that particular piece of data. If you could filter out all that rubbish, you would not only reduce the computational resources you need to build models, you would also end up with better models, as they would not be misled by training on bad examples.

This is becoming a big enough task that this year’s Conference on Machine Translation (WMT) ran a corpus filtering competition for machine translation data. The task was to take 1 billion words of supposedly parallel English-German data, roughly scraped from the web, and automatically filter it down to smaller corpus sizes. I entered that competition as part of a team from Speechmatics and came away with some tips.

First off, start simple. You don’t necessarily need to break out TensorFlow and start training deep convolutional Siamese networks to discriminate your good data from your bad immediately. Our first step was to devise a handful of simple rules that quickly eliminated the worst examples. The nature of these rules will depend on your particular use case, so use your common sense and some old-fashioned eyeballing of the data.

For our machine translation corpus, for example, we could pretty quickly see that some sentences were not correct translations of one another simply because their lengths were very different, so we eliminated any sentence pairs for which one side was much longer than the other. Other rules were similar: we used edit distance to check for non-translations, a simple regex to check that digits matched, and an off-the-shelf language identifier to eliminate badly labelled data. This quickly reduced our corpus to less than a fifth of its original size.
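To make this concrete, here is a minimal sketch of what rule-based filtering of that kind might look like in Python. The thresholds, the langdetect language identifier, and the toy sentence pairs are all illustrative assumptions, not the exact rules we used.

```python
import re
from difflib import SequenceMatcher

from langdetect import detect  # one off-the-shelf language identifier


def keep_pair(src, tgt, max_len_ratio=2.0, max_copy_ratio=0.8):
    """Return True if an English-German sentence pair passes the simple rules.

    Thresholds are illustrative and should be tuned on your own data.
    """
    src_words, tgt_words = src.split(), tgt.split()
    if not src_words or not tgt_words:
        return False

    # Rule 1: lengths should be roughly comparable for genuine translations
    ratio = len(src_words) / len(tgt_words)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        return False

    # Rule 2: near-identical sides are usually copies, not translations
    if SequenceMatcher(None, src, tgt).ratio() > max_copy_ratio:
        return False

    # Rule 3: digits on one side should appear on the other
    if sorted(re.findall(r"\d+", src)) != sorted(re.findall(r"\d+", tgt)):
        return False

    # Rule 4: language ID should match the labelled languages
    try:
        if detect(src) != "en" or detect(tgt) != "de":
            return False
    except Exception:  # langdetect can fail on very short or odd strings
        return False

    return True


corpus = [
    ("The cat sat on the mat.", "Die Katze saß auf der Matte."),
    ("Click here to subscribe!", "Click here to subscribe!"),  # a copy, not a translation
]
filtered = [pair for pair in corpus if keep_pair(*pair)]
```

Cheap checks like these run in a single pass over the corpus, which is exactly what you want before spending compute on anything cleverer.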

Once you have reduced your corpus using simple rules, there may be more to be gained by more intelligent methods. One of the more straightforward methods is to use your final task as a way to filter your training data. By this I mean train a model to solve whatever problem you are attempting to solve and then use that model on your training data to identify good and bad data examples.

This is probably best explained by example. Let’s assume we are building an ant photograph classifier. We have lots of photographs of different ants, each labelled with the species of the ant in the picture. However, we notice that some of the photographs have been mislabelled, and some are not even of ants at all! Once we have used simple rules to eliminate the worst offenders, we are left with a smaller dataset that we still believe contains some bad data. So, we build an ant photograph classifier on the mixed data we have. We may choose to make this a smaller-than-normal model, or build it on a randomly selected subset of the data, to save time and resources. We then apply that classifier to our source data, assigning probabilistic labels to each photograph. There are two types of data we may want to remove.

The first is data for which the classifier disagrees very strongly with the label. In this case, the label is probably incorrect (perhaps the photograph is of a termite rather than an ant?) and so that data should be removed. Be careful, though, not to remove correctly labelled examples that simply happen to be unusual (for example, an army ant photographed from below when most of the images are taken from above). These unusual examples are extremely valuable to keep; some form of tradeoff often needs to take place here, which you will need to tune for your own use case.

The second is data for which the classifier agrees very strongly with the label. Training on such an example probably adds little to the overall model, and removing it allows your model to generalise better by emphasising the diversity of your dataset. Keeping only the examples that the classifier already does well on will not give the best overall performance in real-world tests.
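As a rough sketch of that confidence-based filtering, the snippet below assumes a scikit-learn-style pilot classifier (exposing predict_proba and classes_) already trained on the noisy data, NumPy arrays for X and y, and purely illustrative thresholds.

```python
import numpy as np


def filter_by_confidence(model, X, y,
                         low_thresh=0.05,    # below this, the label is probably wrong
                         high_thresh=0.99,   # above this, the example adds little
                         keep_high_frac=0.2):
    """Keep examples the pilot classifier is neither wildly wrong about
    nor already completely sure of. X and y are assumed to be NumPy arrays."""
    probs = model.predict_proba(X)                   # shape: (n_samples, n_classes)
    label_idx = np.searchsorted(model.classes_, y)   # column of each example's own label
    p_label = probs[np.arange(len(y)), label_idx]    # confidence in the given label

    likely_mislabelled = p_label < low_thresh        # e.g. a termite labelled as an ant
    too_easy = p_label > high_thresh                 # already well covered by the model

    # Keep a random fraction of the "too easy" examples so easy classes stay represented
    rng = np.random.default_rng(0)
    keep_some_easy = too_easy & (rng.random(len(y)) < keep_high_frac)

    keep = ~likely_mislabelled & (~too_easy | keep_some_easy)
    return X[keep], np.asarray(y)[keep]
```

Keeping a small random fraction of the very easy examples, rather than dropping them all, is one way of handling the tradeoff described above; how aggressive to be is again something to tune for your own use case.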

So far, the discussion has considered pre-processing the data, but the technique of removing low- and high-confidence examples is also starting to be used on the fly during training. Recent papers have shown great benefit in adapting which data you train on as you go along. Between epochs, you could check which training examples the model is struggling on and over-emphasise those in the next epoch, whilst down-weighting those the model is already very confident about.
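A minimal sketch of that between-epoch re-weighting is below. It assumes a Keras-style classifier (fit accepting sample_weight, predict returning per-class probabilities) and integer class labels; the weighting scheme itself is just one illustrative choice.

```python
import numpy as np


def train_with_reweighting(model, X, y, n_epochs=5):
    """Re-weight training examples between epochs based on model confidence.

    Assumes a Keras-style classifier whose fit() accepts sample_weight and
    whose predict() returns per-class probabilities; y holds integer labels.
    """
    weights = np.ones(len(y))
    for epoch in range(n_epochs):
        model.fit(X, y, sample_weight=weights, epochs=1, verbose=0)

        # Confidence the model now assigns to each example's correct label
        probs = model.predict(X, verbose=0)
        p_label = probs[np.arange(len(y)), y]

        # Over-emphasise examples the model struggles on, down-weight the easy ones
        weights = 1.0 - p_label
        weights = weights / (weights.mean() + 1e-8)  # keep the average weight near 1
    return model
```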

Curriculum learning is a recent advance that takes this a step further and orders the training examples you present to your model, with simple ones coming earlier in the training regime and more difficult ones later. This can be compared to how we learn at school or university: starting from simpler, more general examples and gradually moving to more nuanced and difficult tasks as we learn more. The early examples lead to generalised abstractions, which are then useful in understanding and modelling the more complex or unusual examples later in the curriculum.
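One way to sketch a simple curriculum: score each example with some cheap difficulty measure (hypothetically, the pilot model’s loss or just input length), sort easiest first, and train on progressively larger prefixes. The linear growth schedule below is purely illustrative.

```python
import numpy as np


def curriculum_stages(X, y, difficulty, n_stages=4):
    """Yield easy-to-hard training subsets of increasing size.

    `difficulty` is any per-example score (e.g. pilot-model loss or sentence
    length); X and y are assumed to be NumPy arrays.
    """
    order = np.argsort(difficulty)                 # easiest examples first
    X_sorted, y_sorted = X[order], np.asarray(y)[order]
    for stage in range(1, n_stages + 1):
        cutoff = int(len(order) * stage / n_stages)
        yield X_sorted[:cutoff], y_sorted[:cutoff]


# Hypothetical usage: one or more epochs per stage, hardest examples arriving last
# for X_stage, y_stage in curriculum_stages(X, y, difficulty):
#     model.fit(X_stage, y_stage, epochs=1, verbose=0)
```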

End of Part 2/3.

Download my e-book to get early access to Part 3 of my blog series. Alternatively, Part 3 will be published on 21 March.


Tom Ash
Speechmatics

With a PhD in Clinical Neurosciences from Cambridge University, Dr Ash is a leading machine learning specialist working on speech technologies at Speechmatics.