How to make the most of data surplus

Tom Ash
Speechmatics
5 min read · Mar 21, 2019

A three-part blog series addressing the reality of having a surplus of data in an increasingly data-driven world. In this series, I’ll offer possible solutions to an excess of data, from tips on how to train on a surplus of data and advice on how to reduce your data down to the most important examples, to practical suggestions on how to use an excess of data in one domain to improve performance in a different domain. Want to read the entire series? Download my e-book here!

In Part 2 of my blog series, I discussed what you can do with a surplus of data, specifically, how you can reduce your excess data down to just the examples that are most important, i.e. filtering out the rubbish. I established that, through no fault of your own, your dataset probably contains a lot of rubbish, which must be filtered out so that you can train your models more effectively. I proposed a number of methods for filtering data based on my own experiences, providing examples where appropriate. Part 3 will address how you can use an excess of data in one domain to improve performance in a different domain.

Part 3

Domain adaptation

It may be that your large amounts of data are just not the type of data you think you need. Don’t despair! There are ways you may be able to use it anyway. In my primary field of speech recognition, this is a common occurrence — I may have lots of recordings of people speaking on broadcast news, for example, but few examples of people talking on a telephone, which may be the area I really want to target. In this case, you need to leverage that larger ‘out-of-domain’ corpus by taking the smaller ‘in-domain’ corpus as a reference.

In the simple case, this may just mean paying attention to any metadata you have. If you have data prelabelled in different categories, you may be able to simply select data that is close enough to your target domain that way. However, this throws away a lot of the use that the larger pool of out-of-domain data might offer.
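If your data already carries labels like this, the selection step can be as simple as a filter over a manifest file. Here is a minimal sketch, assuming a hypothetical CSV manifest with a category column (the file names, columns and category labels are illustrative only, not from any real dataset):

```python
import pandas as pd

# Hypothetical manifest of transcribed audio with a pre-labelled 'category' column.
manifest = pd.read_csv("corpus_manifest.csv")  # e.g. columns: utterance_id, category, text

# Keep only the categories judged close enough to the telephone target domain.
target_like = {"telephone", "call_centre", "voicemail"}
selected = manifest[manifest["category"].isin(target_like)]

selected.to_csv("filtered_manifest.csv", index=False)
```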

The next step, which we at Speechmatics regularly use in our language modelling, is domain filtering. This means taking your small in-domain corpus and filtering your larger out-of-domain corpus against it to find the subset of the data that is most similar. For language modelling we do this by entropy filtering. Entropy filtering works by building small language models on both your in-domain and out-of-domain corpora. You then measure how well each model scores every sentence in the out-of-domain corpus. Sentences for which the difference between these two scores is below a certain threshold are kept and the others discarded. This leads to a smaller corpus which actually gives better results than using the entire out-of-domain corpus, as well as a smaller training footprint.
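To make the idea concrete, here is a toy sketch of this kind of filtering using simple unigram models built with collections.Counter. A real system would use proper n-gram language models (KenLM or similar) and tune the threshold on held-out data; this is purely illustrative:

```python
import math
from collections import Counter

def unigram_logprobs(sentences):
    """Build a toy unigram language model (add-one smoothed log-probabilities)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 to leave mass for unseen words
    return lambda w: math.log((counts.get(w, 0) + 1) / (total + vocab))

def cross_entropy(logprob, sentence):
    """Per-word negative log-probability of a sentence under a model."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / max(len(words), 1)

def entropy_filter(in_domain, out_of_domain, threshold=0.0):
    """Keep out-of-domain sentences that look more like the in-domain corpus."""
    lm_in = unigram_logprobs(in_domain)
    lm_out = unigram_logprobs(out_of_domain)
    # Keep a sentence when the in-domain model explains it nearly as well as
    # (or better than) the out-of-domain model, i.e. the difference is small.
    return [s for s in out_of_domain
            if cross_entropy(lm_in, s) - cross_entropy(lm_out, s) < threshold]
```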

An even further step is to map your out-of-domain data into the same space as your in-domain data. Correlation Alignment (CORAL) is a ‘frustratingly easy’ first technique you could use. It’s called ‘frustratingly easy’ because it can be implemented in as few as four lines of MATLAB code and yet outperforms many more complex methods! The core algorithm can be understood as first whitening your out-of-domain data, then re-colouring it with the covariance of the in-domain data. It has been shown to be surprisingly effective in fields such as object recognition from images.
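Here is a minimal NumPy sketch of that whiten-then-recolour step, under my own choice of function names, feature layout and regularisation (it is not the paper’s MATLAB code verbatim):

```python
import numpy as np

def _spd_power(mat, power):
    """Fractional power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** power) @ vecs.T

def coral(source, target, eps=1.0):
    """CORAL-style alignment: whiten the source (out-of-domain) features,
    then re-colour them with the covariance of the target (in-domain) features.

    source, target: (n_samples, n_features) arrays; eps regularises the covariances.
    """
    c_s = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    c_t = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    return source @ _spd_power(c_s, -0.5) @ _spd_power(c_t, 0.5)
```

A classifier trained on the re-coloured source features can then be applied directly to the in-domain data.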

Other methods have also been shown to be effective, such as subspace alignment, where both in-domain and out-of-domain data are mapped to a shared subspace and that mapping is iteratively refined until the distance between the two datasets is minimised. Performance is typically improved by using ‘anchor’ examples, which measure how well the two datasets have mapped onto one another. Facebook’s MUSE uses similar techniques to map vector representations of words in different languages onto one another, with performance further improved by providing an initial seed dictionary of translations.
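As a rough illustration, here is a basic, single-pass subspace alignment sketch using scikit-learn PCA; the iterative refinement and anchor examples described above are left out:

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_align(source, target, n_components=50):
    """Basic subspace alignment: project each dataset onto its own PCA subspace,
    then learn a linear map that aligns the source basis with the target basis.
    Returns aligned source features and projected target features.
    """
    xs = PCA(n_components).fit(source).components_.T  # (n_features, d) source basis
    xt = PCA(n_components).fit(target).components_.T  # (n_features, d) target basis
    m = xs.T @ xt                                     # alignment between the two bases
    return source @ xs @ m, target @ xt
```

A model trained on the aligned source features can then be evaluated on the projected in-domain features, since both now live in the target subspace.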

Finally, if you have already trained a model for a different task using your large dataset, you can use transfer learning. This means training a model for a particular task and/or domain and then using that model as the basis for your target task or domain. In the simplest case, this means training a model on your out-of-domain dataset then retraining the weights on your in-domain dataset, essentially using the larger dataset to produce sensible initial values for your final model.
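Here is a minimal PyTorch sketch of that simplest case, with random placeholder tensors standing in for the two datasets and an arbitrary small network:

```python
import torch
import torch.nn as nn

# Two-stage sketch: pre-train on the large out-of-domain set, then continue
# training ("fine-tune") the same weights on the small in-domain set.
model = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 5))
loss_fn = nn.CrossEntropyLoss()

def train(features, labels, lr, epochs):
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        loss_fn(model(features), labels).backward()
        optimiser.step()

# Placeholder random data; swap in your real feature matrices and labels.
out_x, out_y = torch.randn(5000, 40), torch.randint(0, 5, (5000,))  # big, out-of-domain
in_x, in_y = torch.randn(200, 40), torch.randint(0, 5, (200,))      # small, in-domain

train(out_x, out_y, lr=1e-3, epochs=20)  # stage 1: learn sensible initial weights
train(in_x, in_y, lr=1e-4, epochs=20)    # stage 2: adapt gently to the target domain
```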

A really interesting example of transfer learning I found recently took this even further. The task was classifying audio clips into categories, with little data available for training. Rather than trying to leverage other audio data, they took the interesting step of going really far out-of-domain, into image recognition. They took spectrographic representations of their audio signals, turned them into images with a green tint, then used a pre-trained and very accurate image classifier as the basis for the audio classifier! The system trained to classify images was able to break the spectrogram images into usable sub-features well enough to massively improve performance over just using the limited dataset. This really shows that if you have sufficient data for any task, it may be worthwhile leveraging it for whatever other limited-data tasks you have.
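For a rough idea of the recipe, here is a hedged PyTorch/librosa sketch: turn each clip’s mel spectrogram into a three-channel ‘image’ and bolt a new classification head onto a frozen ImageNet model. The sample rate, image size and number of classes are illustrative assumptions, and a simple channel copy stands in for the green tint described above:

```python
import librosa
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

def spectrogram_image(path):
    """Turn an audio clip into a 3-channel 'image' tensor that a pre-trained
    image classifier can consume (a stand-in for the tinted images described above)."""
    audio, sr = librosa.load(path, sr=16000)                       # assumed sample rate
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=audio, sr=sr))
    mel = (mel - mel.min()) / (mel.max() - mel.min() + 1e-8)       # scale to [0, 1]
    img = torch.tensor(mel, dtype=torch.float32)[None, None]       # (1, 1, mels, frames)
    img = F.interpolate(img, size=(224, 224), mode="bilinear")     # ImageNet input size
    return img.repeat(1, 3, 1, 1)                                  # copy into fake RGB channels

# Re-use an ImageNet classifier as a frozen feature extractor for audio classes.
backbone = models.resnet18(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False                               # keep the image features fixed
backbone.fc = nn.Linear(backbone.fc.in_features, 8)       # only the new audio head is trained
```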

Conclusions

We all know data is hugely valuable to those of us training artificial intelligence systems of any kind. Too much data can feel like a burden, whether in training time, hardware requirements or simply output model size. However, I hope that over this three-part series I have shared some tips on how to deal with that data. Whether that means reducing the data to a highly efficient subset or using it to bootstrap models in radically different domains and tasks, I hope you can learn to love your big data again.

End of Part 3/3.

Download my e-book to get full access to the rest of my blog series.


Tom Ash
Speechmatics

With a PhD from Cambridge University in Clinical Neurosciences, Dr Ash is a leading machine learning specialist working on speech technologies at Speechmatics.