Optimizing Cloud Costs for Deep Learning Trainings

Cloud Computing is becoming increasingly popular day-by-day (Image Source: Pinterest)

Selecting a cloud instance configuration that suits your workload while staying within your budget is a challenge data scientists face when they set out to run experiments. And how do you choose which cloud provider works best for your business?

The major cloud providers — AWS, Azure, and GCP — look similar on the surface, offering:

  • Huge compute power
  • Distributed resources
  • Large storage spaces

along with facilities like autoscaling and elasticity that reduce the burden on developers. But when you dig deeper, you will find stark differences in their pricing that can send your company’s bills soaring. Deploying AI applications in the real world requires extensive training and testing to reach acceptable accuracy and deliver trustworthy results. Therefore, before choosing a cloud infrastructure, you need a clear understanding of the dataset, framework, and workload you will use. Compute will almost always be your biggest cost driver, and minor changes in per-hour pricing or switching to alternate instances can save hundreds of dollars.

We at NetBook identified this problem and set out to create a simple solution that can save you hours of research and documentation scraping before selecting an architecture for a data science task. We compiled open-source benchmark data for multiple GPU/CPU instances across different frameworks and datasets, combined it with the per-hour pricing of each cloud provider, and bundled it all into a Python package that suggests cheaper and/or faster cloud instances for the task at hand.
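The core idea can be sketched in a few lines. The snippet below is a minimal illustration, not the package’s actual API: the instance names are real GPU instance types, but the training times and hourly prices are made-up placeholder numbers, and `rank_instances` is a hypothetical helper.

```python
# Sketch of the core idea: combine per-instance training-time benchmarks
# with hourly prices and rank instances by total cost (or time) of a run.
# All numbers below are illustrative, not real benchmarks or quotes.

BENCHMARKS = {  # hours to train a reference model to target accuracy
    "p3.2xlarge (V100)": 10.0,
    "p4d.24xlarge (A100)": 2.5,
    "g4dn.xlarge (T4)": 30.0,
}

PRICES = {  # on-demand $/hour in one region
    "p3.2xlarge (V100)": 3.06,
    "p4d.24xlarge (A100)": 32.77,
    "g4dn.xlarge (T4)": 0.526,
}

def rank_instances(benchmarks, prices, by="cost"):
    """Return (instance, hours, total_cost) tuples, cheapest or fastest first."""
    rows = [(name, hours, hours * prices[name])
            for name, hours in benchmarks.items()]
    key = (lambda r: r[2]) if by == "cost" else (lambda r: r[1])
    return sorted(rows, key=key)

for name, hours, cost in rank_instances(BENCHMARKS, PRICES, by="cost"):
    print(f"{name}: {hours:.1f} h, ${cost:.2f} total")
```

With these toy numbers the slowest GPU (T4) is still the cheapest end-to-end, while the A100 is the fastest — exactly the time-versus-cost trade-off the package surfaces.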

NVIDIA publishes training performance data for A100, A40, T4, V100 (and many more) GPU instances, with benchmarking experiments conducted on numerous networks and datasets across major frameworks like PyTorch, MXNet, and TensorFlow. We combined this extensive data with hourly pricing information for four regions:

  • US-East-1
  • US-East-2
  • US-West-2
  • AP-South-1

While collecting the pricing data, we observed a huge disparity in cloud costs across regions and across cloud providers. This affirmed our hypothesis that choosing the right instance is of primary importance for your business. So we built a multi-card instance suggestor: for instance, one that can swap A100s for cheaper K80s, trading your time for cost. We also give you the option to choose the region with the lowest costs based on your experiment specifications.
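Region selection itself reduces to an argmin over a price table. A tiny sketch, with hypothetical per-region prices for a single instance type:

```python
# Illustrative $/hour for the same GPU instance across regions
# (placeholder numbers, not real quotes).
REGION_PRICES = {
    "us-east-1": 3.06,
    "us-east-2": 3.10,
    "us-west-2": 3.12,
    "ap-south-1": 3.67,
}

def cheapest_region(prices):
    """Return the region with the lowest hourly price."""
    return min(prices, key=prices.get)

print(cheapest_region(REGION_PRICES))  # the lowest-priced region
```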

When a data scientist takes on a job, they are usually not yet aware of the infrastructure best suited to their experiment. Therefore, another version of our Instance Suggestor takes into account your preferred framework and dataset and suggests GPU instances you can use to run your deep learning experiment. It also lets you choose a fast or a cheap alternative from the suggested instances.
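That version can be sketched as a filter over benchmark records followed by a preference-driven pick. The records, field layout, and `suggest` function below are hypothetical stand-ins for the package’s internals:

```python
# Filter benchmark rows by framework and dataset, then return the
# fastest or cheapest match. Hours and prices are placeholder numbers.
RECORDS = [
    # (instance, framework, dataset, hours, $/hour)
    ("V100", "PyTorch", "ImageNet", 10.0, 3.06),
    ("A100", "PyTorch", "ImageNet", 2.5, 32.77),
    ("T4",   "PyTorch", "ImageNet", 30.0, 0.526),
    ("A100", "TensorFlow", "ImageNet", 2.7, 32.77),
]

def suggest(records, framework, dataset, prefer="cheap"):
    """Suggest an instance for the given framework/dataset combination."""
    matches = [r for r in records if r[1] == framework and r[2] == dataset]
    if not matches:
        return None
    if prefer == "fast":
        return min(matches, key=lambda r: r[3])        # fewest hours
    return min(matches, key=lambda r: r[3] * r[4])     # lowest total cost

print(suggest(RECORDS, "PyTorch", "ImageNet", prefer="fast")[0])
print(suggest(RECORDS, "PyTorch", "ImageNet", prefer="cheap")[0])
```

With these toy numbers, "fast" picks the A100 and "cheap" picks the T4, mirroring the fast-versus-cheap choice the suggestor offers.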

The result is an Instance Suggestor that takes your specifications (region, dataset, framework, and cloud provider) and informs you about alternate instances, on the same provider and/or a different one, that are cheaper or faster, depending on your timeline and budget.

For now, the product suggests instances based on single-node training costs and training times. In the future, we plan to extend it with:

  • Multi-node training convergence statistics
  • More datasets and regions for holistic instance suggestions
  • Running the experiment for a few epochs and suggesting based on the estimated time to convergence
  • Incorporating instance utilization levels across experiments to refine the suggestions
  • Dynamic updating of pricing data

The entire approach has been published as an open-source package on:

GitHub:

PyPI:

We invite collaborations and suggestions for improving the current state of the module. Feel free to raise an issue on GitHub or connect with me on LinkedIn. You can also keep track of updates on our product here: NetBook




Our engineering team handles all the infrastructure so that data scientists can focus on model building. We will be writing here about how we are building a world-class engineering solution.

Shruti Sharma
