Deep-Learning in the Cloud vs. On-Premises Machines

Mark Palatucci
10 min read · Nov 12, 2019


(And how to get the best of both worlds)

Should you just use the cloud for all your deep-learning training? Or should you consider buying or building your own hardware?

That’s a common question I hear. What I often see is that most businesses go straight to the cloud without ever considering purchasing their own equipment, while most academic departments stick with their own machines.

Going to the cloud was also my first inclination at my previous company as I helped build out our training infrastructure. We had a strong cloud team, a great relationship with AWS, and a lot of experience there. But after a few years my thinking changed, and I’m now a big advocate of adding on-premises machines to your team’s workflow and augmenting with the cloud for certain things. And there are ways to be clever with containers so you don’t really care where things are running.

AWS, Azure, and Google Cloud Platform (GCP) have some incredibly powerful tools for training and deploying machine learning/deep-learning models. I know the AWS stack pretty well, from the raw DL AMIs through higher-level tools like Sagemaker. In my previous company, I helped build out our deep-learning infrastructure all on AWS, and can talk through the strengths and weaknesses of these tools, as well as the on-the-ground reality of how our researchers and developers used them and what the pain points were.

TL;DR: After many years I’ve come to the conclusion that the ideal workflow for much deep-learning work is some dedicated on-prem machines, combined with containerized workflows using Horovod and NGC for super-scalable cloud bursting and training.

The cloud is great for getting going quickly: with the deep-learning AMIs you can spin up an instance in a matter of minutes (in theory) with all the latest and greatest software packages, drivers, etc. pre-installed and tested. If you want super-quick, you can just launch a Sagemaker instance with a GPU-backed Jupyter notebook (or use Google’s Colaboratory). Once you start doing serious work, you’ll likely want your own dedicated cloud instance. You’ll gain the ability to persist whatever code, data, and packages you want more easily, and make it possible to train on tens (if not hundreds) of GPUs in parallel when it comes time to scale. The cloud will also make it fairly easy to create repeatable, testable, production workflows for training and model deployment. But you’ll hit a bunch of annoying issues too that will make you want to buy (or learn how to build) your own on-prem machines. (See my other post: How to Build a Silent, Multi-GPU Water-Cooled Deep-Learning Rig for under $10k)
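
(For reference, here’s roughly what “spin up your own dedicated instance” looks like in code. This is a minimal boto3 sketch, not a production setup; the AMI ID, key pair, and security group below are placeholders you’d replace with the current Deep Learning AMI for your region and your own resources.)

```python
# Minimal sketch: launch a GPU instance from a Deep Learning AMI with boto3.
# ImageId, KeyName, and SecurityGroupIds are placeholders -- look up the current
# DL AMI ID for your region and substitute your own key pair / security group.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: Deep Learning AMI (Ubuntu)
    InstanceType="p3.2xlarge",         # single-V100 box to start
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",              # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder: must allow SSH
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp2"},  # room for datasets
    }],
)
print("Launched:", instances[0].id)
```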

TL;DR: The cloud is like cake, with hair on it.

Cloud Cons:

Availability — Try spinning up a single-GPU box in AWS, and it will probably just work, regardless of region or availability zone. Try spinning up a multi-GPU box (either 4 GPUs with the p3.8xlarge or 8 GPUs with the p3.16xlarge) and it’s a crapshoot. Want to train on 64 GPUs across 8 p3.16xlarge systems? It’s a roll of the dice whether you can get them (as of 2019). This was probably the thing I expected least about the cloud. Trying to get cheaper spot instances will make your life even harder. Today maybe the instances are available in the us-east-1 region; tomorrow that doesn’t work but us-west-2 does. Oh well, your data is all living in EFS in us-east-1, and you can’t mount that from us-west-2. Maybe Amazon will improve the situation, but just realize the reality might be different than what’s advertised.
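
(If you’re curious what you end up scripting, here’s a small boto3 sketch that checks recent p3.16xlarge spot prices in a couple of regions before you commit your data to one of them. Price history won’t guarantee capacity, but it’s a useful first signal; the region list and instance type are just examples.)

```python
# Sketch: check recent spot prices for p3.16xlarge across a couple of regions
# before deciding where to place your data. Prices are a signal, not a promise
# of available capacity.
import boto3

for region in ["us-east-1", "us-west-2"]:
    ec2 = boto3.client("ec2", region_name=region)
    history = ec2.describe_spot_price_history(
        InstanceTypes=["p3.16xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=5,
    )
    for record in history["SpotPriceHistory"]:
        print(region, record["AvailabilityZone"], record["SpotPrice"])
```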

To make matters worse, I’ve noticed that constantly spinning instances up and down creates a distraction for developers and researchers. Maybe you’re in the middle of some great experiment but have to run to a meeting now. Well, time to spin it all down: “don’t want to waste money, and god forbid I forget to shut it down later.” And no matter how hard you try and how many notifications and automations you put in place, everyone will screw up eventually and leave an instance running over the weekend. We’ve all done it. And it only takes one mistake on a large instance to completely change your cost equation.
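
(A typical guard-rail looks something like the sketch below: a scheduled script that flags GPU instances running longer than some threshold. The instance families and the 12-hour cutoff are just examples, and as noted above, even this won’t save you every time.)

```python
# Sketch: flag GPU instances that have been running longer than a threshold.
# In practice you'd run this on a schedule (cron, Lambda, etc.) and push the
# output to Slack or email instead of printing it.
from datetime import datetime, timedelta, timezone
import boto3

MAX_HOURS = 12
ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        if not inst["InstanceType"].startswith(("p2", "p3", "g4")):
            continue  # only care about GPU instance families
        age = datetime.now(timezone.utc) - inst["LaunchTime"]
        if age > timedelta(hours=MAX_HOURS):
            print(f"{inst['InstanceId']} ({inst['InstanceType']}) "
                  f"has been running for {age} -- did someone forget it?")
```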

Baseline, the reality of getting your instances and your data when you want them is harder than you might expect. Moving data across regions adds headaches and time. Spinning instances up and down hurts researcher productivity. Forgetting to shut down an expensive instance is an inevitability.

Data/IO — Moving large chunks of data into and out of the cloud can take days if you have hundreds of gigabytes to move, on top of the region issues mentioned above. You also have to be strategic: should you use EBS gp2 or provisioned-IOPS volumes, S3, EFS, Lustre, or spin up your own BeeGFS cluster? The baseline is that many things in AWS are bandwidth limited, and you have to be clever to get reasonable bandwidth for your team without creating a huge, ongoing financial liability. Unfortunately, that means your ML research folks also need to understand the details of AWS data management and billing.

Cost — If you’re a well-funded startup, have some AWS/Brex startup credits, or someone else is paying the cloud bill, you might not be too concerned with cost. But for everyone else, the costs can quickly get into the tens of thousands of dollars, even for a small team of fewer than 10 people. Train a single model for four weeks on a 4-GPU machine in the cloud (~$12/hr) and you’ve spent roughly the cost of buying or building a similar machine that you could own and use for years. The payback time can be very short, and that’s ignoring data storage in EBS, EFS, and S3. Another subtle thing is how organizations budget: in some places it’s a lot easier to justify a fixed, one-time capital cost than a potentially lower but uncertain ongoing bill that’s very hard to predict in advance. Consider an academic department where a graduate student’s advisor gets a grant for $10k. It’s easy to buy a machine with that once, but much harder to bill a monthly, multi-year charge against that grant in perpetuity. You’ll never get a runaway bill with your own on-prem machine.
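
(The back-of-the-envelope math behind that claim, using the rough 2019 numbers from this post: ~$12/hr for a 4-GPU cloud instance and roughly $10k for a comparable build.)

```python
# Rough 2019 numbers: ~$12/hr for a 4-GPU cloud instance, ~$10k for a similar build.
cloud_rate = 12.0               # $/hr
hours = 4 * 7 * 24              # four weeks of continuous training
cloud_cost = cloud_rate * hours

on_prem_cost = 10_000           # one-time cost of a comparable 4-GPU machine

print(f"Four weeks of cloud training: ${cloud_cost:,.0f}")   # ~$8,064
print(f"Break-even: {on_prem_cost / (cloud_rate * 24):.0f} "
      f"days of continuous training")                        # ~35 days
```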

When people transition out or graduate, their cloud instances, data, etc. become ongoing financial liabilities that generate monthly costs, which can be hard to track down and eliminate (whose responsibility does that become?). Over time, your monthly bill will go up, and it will be hard to know why.

DevOps support — If you already have a bunch of cloud workflows, your infrastructure or DevOps team probably has a way of doing things, which involves security, deployment, support/maintenance, logging, etc. The idea that you’ll just create a cloud account, spin up a box, and start training might be fine in a small startup, but in any reasonably sized company with cloud engineers you’re going to create dependencies that will slow everything down and could cause big blockers. Need your DL AMI rev’d to the latest version so you get that new TensorFlow feature? Well, file a ticket, and maybe someone will get to it after your next product launch. (We’ll solve this problem later with containers.)

Baseline, if you don’t mind the challenges above, and you can get some free credits through the AWS startup program or Brex, then the cloud is a perfectly good (and silent) option. You’ll need it eventually to make things really production ready, testable, and scalable. The ability for a single individual to spin up 128 GPUs with super-fast interconnects using a few lines of infrastructure code is nothing short of a miracle, assuming the machines are available. You can effectively rent a multi-million-dollar supercomputer for a couple hundred bucks an hour with nothing up front. But for routine training and exploration, it’s not necessarily the best option if you don’t have an infrastructure team to support you (i.e., you’re not working at Google, Facebook, etc.). For smaller orgs, you may start to feel bottlenecked by workflow, high costs, and the risk of a permanent monthly liability, and that’s when you realize you might need to consider some on-prem machines.

On-Premises (i.e., On-Prem) Machines

If you’ve read this far, then maybe you’ve been convinced that the cloud is not the answer to all of your needs. As I mentioned above, I’ve become convinced that the ideal workflow is a mix of on-prem and cloud machines, for reasons of cost, flexibility, and productivity.

Now if noise is not a constraint, then it probably makes the most sense to just go out and buy a machine from one of the many systems integrators out there. I attended the Nvidia GTC conference last March, and I was blown away by how many companies are still able to survive doing this. Two that come to mind are Lambda Labs in San Francisco and Puget Systems in Seattle. I don’t have any affiliation with either of these companies, but I’ve spent a lot of time on their websites looking at prices and find them very competitive. Also, the HPC blog posts and benchmarks from Donald Kinghorn at Puget have been super valuable for setting up your own workstation.

If you live in the San Francisco Bay Area, another good option is Central Computers, a small local chain with a handful of stores around the Bay. This type of store wouldn’t exist anywhere outside of Silicon Valley, but it’s an invaluable resource where I buy many of my parts. You can configure any system you want from any parts you want, and they’ll put it together for you for $100. They also have an express return warranty program: for a little extra money you can get a 3-year plan where you can just swap a failed part for another one in store, or wait a few days if they don’t have it in stock. I did this for a motherboard once and had a new replacement in 3 days without having to deal with crappy manufacturer support.

As for the cost of building vs. buying: there is very little cost benefit to building yourself. For whatever config you’re looking at, the potential savings are likely $1,000 or less. For a new builder, it will take 1–2 days to put everything together and run a bunch of benchmarks and/or burn-in tests (assuming it all goes well). Now maybe you want to learn something or do it for fun, and those are perfectly good reasons. Or maybe you want to buy some used parts and build it yourself because you’re a student and every dollar counts. But for many businesses, the value proposition of building yourself is not that high.

As for handling all your software installation and support: this isn’t as hard as it used to be. Companies like Lambda Labs talk about their ‘Lambda Stack’, and AWS has its Deep Learning AMIs with all the drivers, packages, etc. pre-installed and tested. A few years ago there was a lot of value in these stacks; getting version alignment of the Linux kernel, Nvidia driver, CUDA, cuDNN, cuBLAS, TensorFlow, model code, etc. could be a real nightmare.

Luckily, this has all gotten dramatically easier because of the Nvidia GPU Cloud (NGC). NGC is a registry of Docker containers with complete software stacks for popular machine-learning workflows pre-built and ready to go. You only need the OS, the Nvidia driver, and Docker. For example, on my silent multi-GPU build, I was able to install Ubuntu 18.04.3 LTS, use the built-in Nvidia third-party driver (430), and quickly install the latest Docker following Donald Kinghorn’s great instructions. In a couple of hours, I had a complete software stack that allowed me to train some reasonably complicated models in parallel using Horovod on a multi-GPU system. I probably won’t have to think about upgrading the OS/kernel, Nvidia driver, etc. for the better part of a year. But I can change the rest of the stack with a single line using Docker.
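
(Here’s a minimal sketch of that workflow using the docker-py SDK, just to keep everything in Python; the equivalent is a one-line docker run on the command line. The image tag, volume paths, and train.py script are illustrative, and pulling from nvcr.io assumes you’ve already done a docker login with your NGC API key.)

```python
# Sketch: pull an NGC TensorFlow image and run a training script inside it.
# Assumes nvidia-docker2 is installed (hence runtime="nvidia") and that you've
# logged in to nvcr.io. Tag, paths, and train.py are placeholders.
import docker

client = docker.from_env()

image = client.images.pull("nvcr.io/nvidia/tensorflow", tag="19.10-py3")

logs = client.containers.run(
    image,
    command="python /workspace/train.py",
    runtime="nvidia",                   # expose the GPUs to the container
    shm_size="8g",                      # DL frameworks want a large /dev/shm
    volumes={"/home/me/project": {"bind": "/workspace", "mode": "rw"}},
    remove=True,                        # blocks until the job finishes, then cleans up
)
print(logs.decode())
```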

Baseline, Nvidia GPU Cloud containers have completely changed the game on managing deep-learning software stacks.

There are a couple of other benefits to NGC. One is that it requires far less maintenance of the underlying OS, whether that’s your own on-prem machine or a DL AMI running in AWS. As long as the driver is reasonably recent, you can quickly use newer containers to get the latest features of packages like TensorFlow and PyTorch. These containers rev every month, but they often do not require any other OS or driver updates, so the maintenance cycle is much easier than before. It also makes replicating results much easier, as you can bundle up the majority of the stack and move it with your model and data.

Another key benefit of using NGC containers is that you no longer care whether your job is running locally under your desk, on a single GPU in AWS, or on a cluster of 64 GPUs spread across eight p3.16xlarge instances with 8 GPUs each. Combined with a tool like Horovod, you can write your code once and run it anywhere, regardless of how many GPUs you have available. I talked to one engineer on AWS Sagemaker who claimed you can run NGC containers there as well and avoid managing any machine instances. I haven’t seen that work personally, but that’s certainly the direction these things are going.

With NGC, you can write once and run anywhere, on any number of GPUs.
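
(Concretely, that looks something like the sketch below: a minimal Horovod + Keras example, not the exact code I run. The toy MNIST model is just a stand-in; the hvd.* calls are the part that matters.)

```python
# Minimal Horovod + Keras sketch of "write once, run on any number of GPUs".
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU based on its local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all GPUs each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

model.fit(
    x_train, y_train, batch_size=64, epochs=1,
    # Broadcast initial weights from rank 0 so all workers start in sync.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```

The same script runs unchanged with horovodrun -np 1 on the machine under your desk or with -np 64 across eight nodes in the cloud; only the launch command changes, not the training code.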

This means a single developer can now have training infrastructure similar to what is available only at the largest, most sophisticated AI companies like Google, AWS, Facebook, etc. It doesn’t mean they can afford it, but the benefit is that you can start on a single GPU under your desk, and if you use NGC and Horovod, you’ll be able to scale dramatically as soon as you have some dollars available, with little or no software changes. You only trade dollars for training speed, without having to make other changes.

The net result is that the case for managing your own on-premises machine is now better than ever. You can augment and move to the cloud with little or no changes using Docker and NGC containers, for near-infinite scalability.

And there you have it: you really can have your cloud cake and eat it too.
