Machine Learning Development in the Cloud — Part 1

Published in CodeX · 4 min read · Aug 23, 2021

It’s remarkable how much machine learning development setups vary among data scientists across contexts and organizations.

In some places, data science code is run almost entirely on laptops, and maybe some of the code is stashed on GitHub if you’re lucky.

Data is copied and transferred around, because there’s no effective cloud storage solution in place. Everyone’s development environment is a little different, with versions and libraries unique to their workstation.
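To make that drift concrete, here is a minimal sketch of how you might compare package versions between two workstations. The function names and the two sample snapshots are hypothetical, made up for illustration; only `importlib.metadata` is a real standard library module (Python 3.8+).

```python
from importlib import metadata


def installed_versions():
    """Map each installed distribution name to its version string."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing or broken metadata
            versions[name.lower()] = dist.version
    return versions


def environment_drift(mine, theirs):
    """Return {package: (my_version, their_version)} wherever the two differ.

    A package missing from one side is reported with None for that side.
    """
    drift = {}
    for name in sorted(set(mine) | set(theirs)):
        a, b = mine.get(name), theirs.get(name)
        if a != b:
            drift[name] = (a, b)
    return drift


# Hypothetical snapshots taken on two different laptops
laptop_a = {"pandas": "1.3.2", "scikit-learn": "0.24.2", "numpy": "1.21.2"}
laptop_b = {"pandas": "1.2.5", "scikit-learn": "0.24.2", "xgboost": "1.4.2"}

for package, (a, b) in environment_drift(laptop_a, laptop_b).items():
    print(f"{package}: laptop A has {a}, laptop B has {b}")
```

In practice you would call `installed_versions()` on each machine and compare the results; a shared cloud image makes this kind of check mostly unnecessary, which is part of what this series will get into.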

In other places, like my workplace, everything or nearly everything is in the cloud.

Our data is in S3, our code is in GitHub, our images and environments are in Docker Hub, and our IDE is our own product, Saturn Cloud. There are still times when I use a local IDE (VS Code usually), but these are more often cases where I’m doing website design or writing docs, not developing for machine learning projects in particular.

Reflecting on this spectrum of working approaches, I’m interested in exploring how and why we choose different tooling for these different critical functions.

In the enterprise setting, three major stakeholder groups have influence on the choice of tooling in these areas, cloud or otherwise:

End users — data scientists, developers, data engineers, and so on
Management — people who are responsible for the budgets and ensuring the big picture job gets done
Compliance — people who predict, manage, and reduce risk for the entire business

These groups all have different priorities, and each can, in some form, veto the use of a tool. At the same time, no tool is going to be a perfect fit for everyone’s needs and wants, so a compromise must be reached.

Broad considerations for “cloud” versus “local”

A Note on Meanings
Cloud is not the same as a remote service hosted in a datacenter owned by your organization. (This might be known as “on-prem”.) In this case, you are not contracting with a cloud provider who is a wholly distinct business, but you’ve got a part of your own business providing services to internal users.
Differences may include who takes on liability and risk, how downtime is handled, and how functionality is prioritized. If you run the data center, your users are always the number one priority; for AWS or GCP, this isn’t likely to be the case.
In most of this series, I will not be referring to on-prem, even though some of the considerations for the end user may be overlapping.

Budgets

Businesses naturally have an interest in the cost of the tools they adopt. Whether cloud or local choices are cheaper (in the short or the long term) depends heavily on the character of the organization and the type of business it is engaged in. I’ll discuss how this comes into play for the areas of interest, particularly data storage.

Security and Compliance

In all the different areas to be discussed, cloud and local approaches have very different risk profiles. Neither is perfect or risk free, but the things to be concerned about are different. Combine the different risk profiles with the company’s priorities, and the result has a large impact on the choice of tooling.

Usability

Any tool has to demonstrate that it meets the needs of its users in order to achieve adoption, no matter how much it might shine on risk reduction or price. I’ll discuss how this manifests in each of the areas we’re reviewing. Usability also ties directly to the security and compliance concerns already mentioned: if a tool is implemented but is unusable for employees, it won’t be used. Instead, an unmonitored tool with an unknown risk profile gets used, and the organization is left in greater danger than it officially realizes.

Usability also has an angle for the management and compliance stakeholders — if the end user finds a product easy to work with, but managing and administering it is unbearable, this still counts as a usability failure.

Beyond these general categories, or across them, are some other factors that I’ll discuss. Increasing user productivity, saving time, and handling crises better, for example, are all aspects of the choice of development tools.

What’s Next

In the series to come, I’ll be taking a close look at these five areas of the data science or machine learning development workflow, telling you what the cloud approach looks like versus the local approach, explaining the pros and cons of each, and describing a few tools you might want to try.

IDE: Development environments/tools
Code: the actual scripts and code you write
Data: the raw material for your machine learning
Environments/Images: the libraries, packages, and settings you need
Jobs and Automation: scheduled tasks that you don’t need to manage by hand

In the meantime, find me at www.stephaniekirmer.com and learn about my company and our platform for fast, easy data science and machine learning at www.saturncloud.io.

Thanks to my great friends Albert Xue, Bernard Beckerman, and Tony Bouril for their invaluable editing and feedback on this series!