How long will it take?

Assessing and predicting the execution time of deep learning models

Stephen McGough
Digital Catapult
4 min read · Jul 14, 2020


Deep learning is rapidly becoming the go-to tool for many artificial intelligence (AI) problems. This is mainly due to its ability to outperform other AI approaches and, in many cases, to outperform humans on the same task. Despite major inroads into the accuracy of such deep learning approaches, little work has been done on determining how long a deep learning network will take to reach a particular level of accuracy. This can have a significant impact on whether we choose to use deep learning at all, as time here translates directly into cost, either in terms of hardware that needs to be bought or cloud time that is required for training.

The simple answer to how long it will take is to say that deep learning is built on linear algebra, and that mathematicians and computer scientists have known for many years how many operations each linear algebra operation requires. Just sum these up, divide by the speed of your hardware, and you have your answer. However, this fails to take into account such things as how many epochs (full passes through the data) are required to reach a desired level of accuracy, or the subtleties of the hardware that the deep learning is performed on.
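To make this concrete, here is a minimal sketch of that naive counting approach for a small fully connected network. The layer sizes, batch size and assumed GPU throughput are purely illustrative, and the factor-of-three rule of thumb for the backward pass is a common approximation rather than anything taken from our work.

```python
# Minimal sketch: the "just count the operations" estimate for a simple
# fully connected network. Layer sizes, batch size and the assumed FLOP
# rate are illustrative only.

layer_sizes = [784, 512, 256, 10]   # hypothetical network architecture
batch_size = 64
gpu_flops_per_second = 10e12        # assumed sustained throughput (10 TFLOP/s)

# A dense layer multiplying a (batch x n_in) matrix by an (n_in x n_out)
# weight matrix costs roughly 2 * batch * n_in * n_out FLOPs.
forward_flops = sum(
    2 * batch_size * n_in * n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)

# A common rule of thumb: the backward pass costs roughly twice the forward pass.
flops_per_step = 3 * forward_flops

print(f"Estimated time per training step: "
      f"{flops_per_step / gpu_flops_per_second:.2e} seconds")
```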

In addressing the first issue of the number of epochs, two approaches are commonly adopted. In the first, a user pre-selects a fixed number of epochs (often chosen based on prior experience) and trains their deep learning network for that many epochs. However, this can lead to a significant waste of computational time (and money), as the deep learning network may have converged on its best accuracy long before this. Alternatively, the network may not have reached convergence at all, in which case it will need to be trained further (by performing more epochs). The other commonly adopted approach is to keep performing epochs until a desired outcome has been achieved, for example until the accuracy on a held-out evaluation set passes a given threshold. This has the disadvantage that we do not know a priori how many epochs will be needed, or even whether the desired outcome is achievable at all, which can leave the training running indefinitely.
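The second strategy looks roughly like the loop below. This is only a sketch: evaluate_after_epoch() is a toy stand-in that simulates one epoch of training followed by evaluation on a held-out set, not real training code, and the target accuracy and epoch cap are arbitrary.

```python
import random

# Sketch of the "keep training until a target accuracy" strategy.
# evaluate_after_epoch() simulates one epoch of training followed by
# evaluation on a held-out set; swap in your own training loop.

def evaluate_after_epoch(epoch: int) -> float:
    # Toy stand-in: accuracy creeps up with diminishing returns plus noise.
    return 1.0 - 0.5 / (epoch + 1) + random.uniform(-0.01, 0.01)

target_accuracy = 0.95   # desired accuracy on the held-out evaluation set
max_epochs = 1000        # safety cap so training cannot run indefinitely

for epoch in range(max_epochs):
    accuracy = evaluate_after_epoch(epoch)
    if accuracy >= target_accuracy:
        print(f"Reached {accuracy:.3f} after {epoch + 1} epochs")
        break
else:
    # We hit the cap without reaching the target: the desired outcome
    # may simply not be achievable for this model and data.
    print(f"Stopped after {max_epochs} epochs without reaching the target")
```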

The second issue, not taking into account the subtleties of the hardware used, is an area that researchers at the Digital Catapult and Newcastle University have been working on. The subtleties come from things such as the optimisations provided within the Graphics Processing Unit (GPU) cards that are often used in deep learning. These optimisations allow for faster computation under certain circumstances, but they make it harder to accurately predict the execution time of an epoch. Instead of trying to model a deep learning network as a set of linear algebra operations, we treat the network as a system for which we wish to predict an execution time. By doing this we can use deep learning itself as a mechanism for predicting the execution time of another deep learning network.

As this can get a little confusing to talk about, we’re going to refer to the original deep learning network that we want to predict the execution time for as the ‘system’. In fact, we can use this approach for any system, whether it is a deep learning network or not. Our system has a number of characteristics: the type of GPU card it is running on, the memory of that GPU card, the type of deep learning we are performing, and the volume of data we wish to pass through the network at each training step (the batch size).

By running our system for many different input characteristics (e.g. GPU cards, deep learning networks, parameters), we can create our training data: input (the set of characteristics), output (the measured execution time). This training data can then be used to train a new deep learning network to predict the execution time. Figure 1 illustrates the difference between our approach (left) and a linear approach (right). In these graphs we compare the actual times of a ‘system’ with the times our approach predicts. In the ideal case the two values would be the same and all points would lie along the red line. As can be seen, the points in the linear case fall further from the line.
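In code, the pipeline looks something like the sketch below. The benchmark records, GPU names and timings here are made up for illustration, and a small scikit-learn multi-layer perceptron stands in for the deep learning predictor; this is not the model or data used in the paper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy benchmark records: one row per measured run of the 'system'.
# GPU names, memory sizes and timings are invented for illustration.
runs = pd.DataFrame({
    "gpu":        ["V100", "V100", "P100", "P100", "K80", "K80"],
    "gpu_mem_gb": [16, 16, 12, 12, 11, 11],
    "layer_size": [512, 1024, 512, 1024, 512, 1024],
    "batch_size": [32, 64, 32, 64, 32, 64],
    "seconds_per_epoch": [12.0, 19.5, 18.3, 30.1, 41.2, 70.8],
})

features = runs.drop(columns="seconds_per_epoch")
target = runs["seconds_per_epoch"]

# One-hot encode the categorical GPU name, scale the numeric features,
# then fit a small neural network as the execution-time predictor.
preprocess = ColumnTransformer([
    ("gpu", OneHotEncoder(), ["gpu"]),
    ("numeric", StandardScaler(), ["gpu_mem_gb", "layer_size", "batch_size"]),
])
predictor = make_pipeline(
    preprocess,
    MLPRegressor(hidden_layer_sizes=(32, 32), solver="lbfgs",
                 max_iter=5000, random_state=0),
)
predictor.fit(features, target)

# Predict the epoch time of a configuration we have not measured.
new_run = pd.DataFrame([{"gpu": "V100", "gpu_mem_gb": 16,
                         "layer_size": 768, "batch_size": 64}])
print(predictor.predict(new_run))
```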

Our approach can also help with predicting execution time for unseen deep learning networks and unseen hardware. It can also be adapted to predict inference time (the time taken to make a single prediction). This is of particular interest where the deep learning model is going to be deployed on a device with limited power, such as a battery-powered embedded device.
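Gathering the measurements on the inference side comes down to timing individual forward passes carefully. The snippet below shows one common way to do that, with a matrix multiplication standing in for a real model’s forward pass; the sizes and repeat counts are illustrative.

```python
import time
import numpy as np

# Sketch of how per-prediction (inference) time can be measured to build
# training data for the predictor. The matrix multiply stands in for a
# real model's forward pass.

weights = np.random.rand(1024, 1024).astype(np.float32)
single_input = np.random.rand(1, 1024).astype(np.float32)

# Warm-up runs so one-off setup costs do not distort the measurement.
for _ in range(10):
    _ = single_input @ weights

repeats = 1000
start = time.perf_counter()
for _ in range(repeats):
    _ = single_input @ weights
elapsed = time.perf_counter() - start

print(f"Mean inference time: {elapsed / repeats * 1e6:.1f} microseconds")
```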

Figure 1: Comparing our Deep Learning approach (left) with a Linear regression approach (right)

If you’re interested in this work, there is more information in the paper: Daniel Justus, John Brennan, Stephen Bonner and Andrew Stephen McGough, “Predicting the Computational Cost of Deep Learning Models”, IEEE International Conference on Big Data (Big Data 2018). https://arxiv.org/pdf/1811.11880, https://ieeexplore.ieee.org/abstract/document/8622396
