How to make the most of data surplus

Tom Ash
Speechmatics
Mar 7, 2019

A three-part blog series addressing the reality of having a surplus of data in an increasingly data-driven world. In this series, I’ll offer possible solutions to our excess of data: tips on how to train on a surplus of data, advice on how to reduce your data down to the most important examples, and practical suggestions on how to use an excess of data in one domain to improve performance in a different domain. Can’t wait to read the entire series? Download my e-book here!

Part 1

What to do with all that data?

For most of the history of machine learning, data has been a precious commodity. By necessity, the field has had to spend time developing techniques that made optimal use of small amounts of data. Of late, however, large amounts of data are becoming increasingly available. On a global scale, there are commentators talking about data following Moore’s law, with the amount of raw data doubling roughly every two years.

This is great, right? With the explosion in the use of deep learning, which is even more data hungry than more traditional machine learning methods, more data will help us learn better, more nuanced models! Well yes, but only up to a point.

The CPUs cannae take it, Captain!

The immediate problem is that of computational resource. The demands of modern deep learning systems have already meant that traditional CPUs are insufficient for the task of training models. There are also signs that Moore’s law for computational resource is slowing down, with physical limitations starting to bite and costs rising (see Rock’s law) even as compute gains are realised.

The computational limits have been circumvented to some degree by mass parallelism of parts of model training using GPUs: graphics cards originally developed to produce cutting-edge video game visuals, now repurposed as the core engine driving the latest AI revolution. There are now even custom-built tensor processing units (TPUs) that Google has made available on its cloud platform, with academic papers reporting results from models trained on hundreds of these devices.

However, even these hardware accelerators have their limits. Amdahl’s law tells us that speeding up one individual part of your compute pipeline starts to show diminishing returns fairly rapidly, so even if more speculative technologies such as quantum computing come along, we will still have a bottleneck somewhere (disk I/O, or data transfer from the CPU to the hardware accelerator, are frequent candidates in our experience). At some point, the sheer mass of data available to an AI engineer will overwhelm their available compute resources. I won’t predict exactly when that will happen (predictions on technological topics tend to come back to haunt their creators), but I believe it will come.
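To make the arithmetic concrete, here is a quick back-of-the-envelope sketch of Amdahl’s law in Python. The 90/10 split between accelerator-friendly work and serial work is purely illustrative, not a measurement:

```python
def amdahl_speedup(parallel_fraction: float, accelerator_speedup: float) -> float:
    """Overall speedup when only part of a pipeline is accelerated (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_speedup)

# Suppose 90% of training time is accelerator-friendly matrix maths and 10% is
# disk I/O and CPU-to-accelerator transfer (an illustrative split, not a
# measurement). The serial 10% caps the overall gain at 10x, no matter what.
for s in (10, 100, 1_000_000):
    print(f"{s:>9,}x faster accelerator -> {amdahl_speedup(0.9, s):.2f}x overall")
```

Even a million-fold accelerator speedup cannot push the overall gain past 10x in this scenario, because the serial fraction is untouched.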

So far, I have just talked about training. At inference time, more problems crop up. To make better use of more data, we typically increase our model sizes, giving them more parameters with which to learn the data. That in turn leads to a bigger computational footprint when those models are in use. At a time when AI is trying to break into lower-specification, low-powered devices such as mobile phones and the Internet of Things, larger models that require more power are not welcome.
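As a rough illustration of why this matters for deployment, consider the memory needed just to hold a model’s weights. This sketch assumes 4 bytes per parameter (float32) and ignores activations and runtime overhead entirely:

```python
def model_memory_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Rough memory needed to hold a model's weights alone (float32 by default)."""
    return num_parameters * bytes_per_parameter / 1024**2

# Illustrative sizes only: a 10M-parameter model fits comfortably on a phone;
# a 1B-parameter model needs close to 4 GB before it has processed anything.
for n in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} parameters -> {model_memory_mb(n):,.0f} MB")
```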

Does all that data even help?

The previous discussion carries the implicit assumption that more data leads to better models and artificial intelligence systems. This has led to the presumption that throwing ever more data at a problem will always improve results. However, that is not necessarily the case. If your data is of poor quality, adding more of it may actually harm your performance, as your model will learn irrelevant or even incorrect associations. Alternatively, if your data is of high quality but all very similar, then adding more of it will not help: your model will have learned one particular pattern very well, but slightly atypical examples at run time will confuse it.

This is a key point, so we’ll state it again:

More data does not always lead to improved performance.

To continue advancing the field we are going to have to start thinking about solutions to our excess of data. I will cover three possibilities. The first is how to train on all of that data (Single Pass Training), the second is how to reduce your data down to just the examples that are most important (Filtering out rubbish — part 2) and the third is how to use an excess of data in one domain to improve performance in a different domain (Domain Adaptation — part 3).

Single Pass Training

The simplest solution to having too much data to train on is to break the paradigm of training to convergence, in which progress is measured in terms of the number of epochs completed. An epoch is a full pass over all your available data, updating your parameters as you go. Typically, you run many epochs to converge your model towards an optimal solution, convergence being the point at which an epoch no longer improves your model on some validation dataset compared to the previous epoch.
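As a reference point, here is a minimal sketch of that classic loop. The `train_one_epoch` and `evaluate` callables are hypothetical placeholders for whatever your framework actually provides:

```python
def train_to_convergence(train_one_epoch, evaluate, max_epochs=100):
    """Classic epoch-based training, stopping once an epoch stops helping.

    `train_one_epoch` and `evaluate` are hypothetical stand-ins for your
    framework's training pass and validation scoring.
    """
    best_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()       # one full pass over ALL available training data
        val_loss = evaluate()   # loss on a held-out validation set
        if val_loss >= best_loss:
            # This epoch did not improve on the previous one: converged.
            print(f"Converged after {epoch} epochs (best val loss {best_loss:.4f})")
            return
        best_loss = val_loss
```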

If you have very large amounts of data, this level of iteration to convergence over your training set may not be possible within your hardware and time constraints. This is becoming the norm in machine translation, for example, where users talk in terms of the number of hours or days a model has been trained for, rather than the number of epochs. To make this work practically, you need regular checkpointing, rather than waiting until the end of an epoch as was historically standard practice. This means outputting interim models at regular points (perhaps every few hours, or after a set number of pieces of data have been processed), then benchmarking those models against a validation set to track progress towards convergence and to decide whether any hyperparameters need to be adjusted dynamically, such as decaying the learning rate.
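Here is a sketch of what that looks like in practice, with every framework-specific detail stubbed out: `batches` is any iterator of training batches (it need never ‘end’), and `train_step`, `evaluate` and `save_model` are hypothetical hooks into your toolkit of choice:

```python
def train_with_checkpoints(batches, train_step, evaluate, save_model,
                           steps_per_checkpoint=10_000, lr=0.01, lr_decay=0.5):
    """Checkpoint by step count rather than by epoch, for very large datasets."""
    best_loss = float("inf")
    for step, batch in enumerate(batches, start=1):
        train_step(batch, lr)
        if step % steps_per_checkpoint == 0:
            save_model(f"checkpoint_step_{step}")  # interim model, mid-"epoch"
            val_loss = evaluate()                  # benchmark on the validation set
            if val_loss < best_loss:
                best_loss = val_loss
            else:
                lr *= lr_decay                     # e.g. decay learning rate on a plateau
            print(f"step {step:,}: val loss {val_loss:.4f}, lr {lr:.5f}")
```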

These changes are not yet available ‘off the shelf’ in most machine learning toolkits. In the main, the paradigm of training multiple epochs over your data still rules. You may need to either hack the code to make it possible or adjust how you deal with your data: perhaps chunking your data into pieces and feeding them in one at a time, ‘tricking’ your model into treating each chunk as a full epoch.
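If you go the data-chunking route, the trick can be as simple as the sketch below. The `model` object and `load_data` function are hypothetical stand-ins for a toolkit whose API insists on whole epochs; each chunk is simply presented as if it were one:

```python
def chunks(items, size):
    """Split a list (e.g. of training file paths) into fixed-size chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def train_in_chunks(model, load_data, training_files, chunk_size=50_000):
    for n, chunk in enumerate(chunks(training_files, chunk_size), start=1):
        model.fit(load_data(chunk), epochs=1)   # the toolkit sees a "full epoch"
        model.save(f"model_after_chunk_{n}")    # regular checkpointing for free
```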

In the extreme, this can be pushed to single pass training, where your models only ever see any individual piece of data once. This can work if your data is all of equal quality and relevance to your use case, and there are no particular constraints on the order in which you train over it.

However, it is rarely the case that all of your data is of equal quality and relevance for your use case, which brings us to the next couple of tricks.

End of Part 1/3.

Download my e-book to get early access to Part 2 and Part 3 of my blog series. Alternatively, Part 2 will be published on 14 March, and Part 3 on 21 March.


With a PhD from Cambridge University in Clinical Neurosciences, Dr Ash is a leading machine learning specialist working on speech technologies at Speechmatics.