So, You Want to Reduce the Cost of Deep Learning Experiments?

Mamunur Rahaman Mamun
Chowagiken Tech Blog
Jul 29, 2020

In this era of deep learning and AI, training a model with millions of parameters is not a "cheap" task, both metaphorically and literally. And if you don't own your own hardware, training a model on cloud-based resources can cost you a fortune.


There are plenty of services that provide compute instances with CUDA-supported GPUs, but they are all almost equally expensive. And if your code is not well optimized, the bill only gets worse.

Many deep learning enthusiasts don't really focus on optimizing their code and instead allocate excessive resources for a tiny improvement in overall performance. With a little optimization, they could achieve the same results and save a lot of time and money.

So here I'm going to point out some best practices for reducing the cost of training deep learning models.

Profiling the Code

Code profiling is not that popular among deep learning practitioners, but it can give you much better insight into what's happening under the hood. Profiling your code can mean many things, but for now the most important part is knowing which functions or methods take most of your time.

Every deep learning framework is highly optimized for matrix operations, but a training pipeline involves much more than matrix math. For example, a convolution operation usually takes far more time than a simple matrix transformation. Or perhaps you've written a new technique that loops over your parameters multiple times, or you're applying heavy content-aware data preprocessing on the fly.

When you profile your code, you'll find all these bottlenecks easily. Some of them may be unavoidable, but you'll almost certainly find a few that are easy to fix.
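As a minimal sketch (the function names here are made up for illustration), Python's built-in cProfile module is enough to rank the functions in a training step by time spent:

```python
import cProfile
import io
import pstats

def slow_preprocess(data):
    # Deliberately slow, loop-based "preprocessing" step
    return [x * 0.5 + 1.0 for x in data for _ in range(10)]

def train_step(data):
    batch = slow_preprocess(data)
    return sum(batch) / len(batch)  # stand-in for the actual forward pass

profiler = cProfile.Profile()
profiler.enable()
train_step(list(range(10000)))
profiler.disable()

# Print the functions sorted by cumulative time spent in them
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report immediately shows that `slow_preprocess` dominates the step, which is exactly the kind of avoidable bottleneck profiling is meant to surface.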

Writing a Good Data Loader

When you're training your model, your attention should be on how well the model is learning; that's your top priority. Your data loader should deliver data whenever the model needs it and never keep the model waiting. Investing a little time in writing a good data loader will eventually pay off in many ways.

You can assign multiple workers to process your batches faster. Data prefetching is also a good technique for reducing idle time during training.

If your training method allows, you can also save your data in HDF5/Parquet/Feather containers and read it back during training. In HDF5, you can organize data into groups to fetch it even faster.
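The core idea behind worker-based prefetching can be sketched with nothing but the standard library: a background thread fills a bounded queue while the training loop consumes from it. (Framework loaders such as PyTorch's `DataLoader` with `num_workers` do this for you, using processes instead of threads; `load_batch` below is a made-up stand-in.)

```python
import queue
import threading
import time

def load_batch(i):
    # Stand-in for expensive I/O or preprocessing
    time.sleep(0.01)
    return [i] * 4

def prefetcher(num_batches, buffer_size=2):
    """Yield batches while a background thread loads the next ones."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def worker():
        for i in range(num_batches):
            q.put(load_batch(i))
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

batches = list(prefetcher(5))
print(len(batches))  # 5 batches, loaded concurrently with consumption
```

The bounded queue is the important design choice: it overlaps loading with training without letting the loader run arbitrarily far ahead and exhaust memory.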

Monitoring System Usage

System monitoring will give you a better understanding of how well you’re doing in utilizing your resources.

Fig: Reduced GPU duty cycle

For example, in the figure above, the GPU can't run at full capacity because the CPU can't deliver data in time. In this case, adding another GPU, or replacing it with a GPU that has more memory and power, will not solve the problem.

Now, since CPUs are much cheaper than GPUs, after adding a more powerful CPU the result should look like this:

Fig: Significantly improved GPU duty cycle

The plots show a significant improvement in GPU utilization along with a decrease in the CPU's duty cycle. Although it's just an example, it demonstrates how we can make better trade-offs to reduce wasted resources.

Mixed Precision Training

Mixed-precision training¹ is now a well-established technique for training deep learning models faster and with lower computational cost. We can take advantage of it when doing a PoC, or whenever we're prepared to trade a little accuracy for speed.

Reduced precision improves training time significantly and uses less memory. On top of that, if we deploy such models on on-board computers, inference time also drops considerably.
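A quick back-of-the-envelope check with only the standard library shows where the memory savings come from: half precision stores each value in 2 bytes instead of 4. (The actual training side is automated by the frameworks, e.g. PyTorch's automatic mixed precision²; the parameter count below is illustrative.)

```python
import struct

# Size of one value in single ("f") vs half ("e") precision
fp32_bytes = struct.calcsize("f")  # 4 bytes
fp16_bytes = struct.calcsize("e")  # 2 bytes

# Memory for 10 million values, e.g. a model's activations (illustrative)
n = 10_000_000
print(f"fp32: {n * fp32_bytes / 1e6:.0f} MB, fp16: {n * fp16_bytes / 1e6:.0f} MB")

# Half precision has much less range and precision, which is why
# mixed (rather than pure) precision keeps sensitive parts in fp32:
x = struct.unpack("e", struct.pack("e", 0.1))[0]
print(abs(x - 0.1))  # small rounding error introduced by fp16
```

The rounding error in the last line is exactly the accuracy/speed trade-off mentioned above; techniques like loss scaling exist to keep it from hurting training.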

Allocating GPUs Later

In the early phase of development, you might not need a GPU at all. Transferring code from CPU mode to GPU mode takes only seconds. So don't waste your money on an unutilized GPU.

Do all your data cleaning, preprocessing, and sanity checks in CPU mode, and finally, when you're confident in your code, allocate a GPU according to your budget and requirements.

Logging as Much Information as Possible

Try to reduce the number of meaningless experiments. Beginners make this mistake most frequently: they don't log any information about their training runs and eventually end up repeating the same experiments multiple times.

At the very least, log the hyperparameters and any major changes in the training strategy to avoid this.
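Even a tiny, dependency-free logger goes a long way. This sketch (the file name and fields are made up for illustration) appends one JSON record per run, so past experiments are always greppable:

```python
import json
import time
from pathlib import Path

def log_run(log_file, hyperparams, notes=""):
    """Append one experiment record as a JSON line."""
    record = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "hyperparams": hyperparams,
        "notes": notes,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")

log_path = Path("experiments.jsonl")
log_run(log_path, {"lr": 3e-4, "batch_size": 16, "epochs": 50},
        notes="switched to cosine LR schedule")

# Read the history back to check what has already been tried
runs = [json.loads(line) for line in log_path.read_text().splitlines()]
print(runs[-1]["hyperparams"])
```

Purpose-built experiment trackers do far more, but even this much is enough to stop you from rerunning the same configuration twice.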

Calculating Accumulated Gradient

Accumulating gradients can be a lifesaver if you run out of GPU memory while trying a large batch size. For example, if you can only fit a batch of 4 samples but want an effective batch of 16, you can make 4 forward passes, accumulate the gradient of each pass, and then perform only one parameter update for those 4 micro-batches.

Sadly, not all deep learning frameworks expose gradient accumulation directly, so you might need to write a few extra lines to implement it.
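The arithmetic behind it can be checked in plain Python with a one-parameter linear model and MSE loss (the names and numbers are made up for illustration): averaging the gradients of 4 micro-batches matches the gradient of the full batch, so one update after 4 passes behaves like a big-batch update.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.3
xs = [float(i) for i in range(16)]
ys = [2.0 * x for x in xs]  # ground truth: w = 2

# Gradient over the full batch of 16
full_grad = mse_grad(w, xs, ys)

# Accumulate over 4 micro-batches of 4, scaling each by 1/4
accum_grad = 0.0
for i in range(0, 16, 4):
    accum_grad += mse_grad(w, xs[i:i + 4], ys[i:i + 4]) / 4

print(abs(full_grad - accum_grad))  # ~0: the two are equivalent
```

The 1/4 scaling is the step people most often forget: without it, the accumulated gradient is 4 times too large, which silently changes the effective learning rate.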

Using PyBind to Speed Things Up

Don't let loyalty to a programming language cost you more than it should. Python gives you a lot of options and flexibility, but in terms of raw speed it's significantly slower than C/C++ and many other programming languages.

So learn to write some of your hot paths in C++ to speed up the overall training process. pybind11 is a great tool for combining C++11 with Python. Give it a try!

Using Hyperparameter Tuning Tools

Invest some time in hyperparameter tuning tools⁴. Believe me, it’ll save a lot of time later.

Improper initialization or default hyperparameter settings can cost you an extra ten to hundreds of epochs to reach your desired goal. So, rather than wasting that time in every experiment, spend it once at the beginning.
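Dedicated tools do this far better, but the underlying idea fits in a few lines of plain Python. Here random search samples learning rates log-uniformly and keeps the one with the lowest validation loss; `validation_loss` is a made-up toy stand-in for "train briefly and evaluate":

```python
import math
import random

def validation_loss(lr):
    # Toy stand-in for "train briefly, return validation loss";
    # pretend the sweet spot is around lr = 1e-3
    return (math.log10(lr) - math.log10(1e-3)) ** 2 + 0.1

random.seed(0)
best_lr, best_loss = None, float("inf")
for _ in range(20):
    # Sample log-uniformly between 1e-5 and 1e-1
    lr = 10 ** random.uniform(-5, -1)
    loss = validation_loss(lr)
    if loss < best_loss:
        best_lr, best_loss = lr, loss

print(f"best lr = {best_lr:.2e}, loss = {best_loss:.3f}")
```

Sampling on a log scale matters here: learning rates that differ by orders of magnitude would be badly covered by a uniform grid over the raw values.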

Using a Recognized Framework for Automating Things

There are some awesome libraries and frameworks that automate many of the things discussed above. For example, if you're using PyTorch, libraries like PyTorch Ignite and PyTorch Lightning handle many of these tedious tasks for you.

PyTorch Lightning³ has gained a lot of popularity lately by combining many useful features that give you a solid foundation for running experiments in a more organized way. Its useful hooks and third-party library support have made it one of the best ML libraries around.

Optuna, on the other hand, is a dedicated hyperparameter tuning framework that can run sophisticated search algorithms even for computationally expensive models. Integrating it with popular deep learning frameworks is easy and intuitive.

See the Big Picture

Finally, the cliché!
Visualize your progress with standard metrics and scores. It'll give you a wider perspective on your model's behavior. You probably won't want to keep training a model that overfits from the very beginning.

I hope this post will help you to perform your Deep Learning experiments more efficiently. If you have any questions or suggestions regarding this post, please feel free to leave a comment. Have a nice day!

References

  1. Mixed Precision Training (NVIDIA)
  2. PyTorch Automatic Mixed Precision
  3. PyTorch Lightning Examples
  4. Optuna + PyTorch Lightning
  5. Custom C++ and CUDA Extensions
