Why should data professionals understand infrastructure?

Getting started with machine learning infrastructure.

Johannes Giorgis
Acing AI
4 min read · Apr 19, 2021


Engineering infrastructure is an uncool yet hugely important topic. Twenty years ago it was the domain of the IT department. Today, with the wide adoption of the cloud, it is at everyone’s fingertips.

So why should you, the data professional, care about this? Isn’t it enough to care about cleaning data, exploring data, modeling data and dealing with biases? Why add another entire field to the endless list of topics you need to keep up with?

In this article, we will look at the advantages that a knowledge of infrastructure, specifically cloud infrastructure, can give you in both your professional and personal work.

Photo by Michael Roach on Unsplash

What is infrastructure?

First, infrastructure in this context means cloud-based infrastructure that can be used to build and deploy systems. Think Heroku, Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Alibaba Cloud, DigitalOcean, Linode and so on. These all provide varying degrees of cloud-based services, from the most basic (compute and storage) all the way up to managed blockchain and ground stations (for satellites).

What a time to be in technology!

Just as programming languages, frameworks and tools continue to evolve every day, so does the infrastructure on which we can run all our amazing applications!

You built a simple Flask Machine Learning app and want to deploy it? Heroku will let you do it for free :)

You want to spin up a server to play around with and not worry about unexpected bills? DigitalOcean and Linode are the way to go!

You want to run more complex data enrichment pipelines and visualize them? AWS, GCP and Azure have you covered!

So you, the data professional and builder, have plenty of options. We will cover how to get started for free toward the end of this article.

Why Infrastructure for Data Professionals?

All this sounds great! But what does it have to do with data science?

Data science has many varied and at times conflicting definitions. For me, it is the art of using a combination of tools, with data at the heart of them, to create insights and solve problems.

Most data science solutions are part of either a web application or an API. These web-based solutions need to be hosted somewhere. You can build the best object recognition algorithm in the world, but if it is sitting on your computer, it is like a book you bought that still sits unread on the bookcase. No one is getting any value out of either.

This doesn’t have to be super complicated either. As we saw in the previous section, there are several types of cloud providers covering a range of use cases — from the super simple to the extremely complicated.

You can start off by using Heroku to deploy your app. This article walks through how to deploy a Flask application to Heroku for free using the Heroku CLI.
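To make this concrete, here is a minimal sketch of the kind of Flask app you might deploy. It is not taken from the linked walkthrough: the `/predict` route, the `features` field and the `predict` function (a stand-in that just averages its inputs) are all hypothetical, and you would swap in your own trained model.

```python
import os

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Parse the JSON body; silent=True yields None instead of raising on bad input
    payload = request.get_json(silent=True)
    features = payload.get("features") if isinstance(payload, dict) else None

    # Reject anything that is not a non-empty list of numbers
    if (
        not isinstance(features, list)
        or not features
        or not all(isinstance(x, (int, float)) for x in features)
    ):
        return jsonify(error="expected a non-empty 'features' list of numbers"), 400

    return jsonify(prediction=predict(features))


if __name__ == "__main__":
    # Hosting platforms such as Heroku pass the port to bind to via the PORT
    # environment variable; fall back to 5000 for local runs
    port = int(os.environ.get("PORT", 5000))
    app.run(host="0.0.0.0", port=port)
```

Assuming the file is saved as `app.py`, a one-line `Procfile` such as `web: gunicorn app:app` is a common way to tell Heroku how to start it, with `flask` and `gunicorn` listed in `requirements.txt`.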

Simpler solutions continue to be created. Streamlit, which has grown in popularity for enabling folks to build data apps quickly, has introduced Streamlit Sharing, a platform to help you deploy, manage and share your Streamlit apps.

These avenues are great ways to build side projects and flex your technical skills.

On the professional side, learning more about whichever cloud provider your company uses will give you more options. You can build better applications when you are aware of the capabilities and limitations of the specific services your company uses. Or you could dive in and tackle more infrastructure-focused work.

Some companies, such as Stitch Fix, aim to hire generalists: “full-stack data scientists” who can work through a project from conception to production.

How to get started?

There are plenty of ways to get hands-on experience with cloud-based infrastructure. Many of the cloud providers that we covered earlier allow you to get started for free on their platforms:

  • Heroku can be used to host a project perpetually for free with certain caveats
  • AWS has a mix of always free, 12 months free and short-term free trial offers
  • GCP provides $300 in free credits and always free services with monthly limits
  • Azure provides 12 months of free services with $200 credit and 25+ always free services
  • Alibaba Cloud has a free trial that provides 50+ products for free along with 20+ always free products

It has never been easier to spin up whatever you want to experiment with and get hands-on experience :)

Conclusion

The field of data science continues to mature and evolve. A decade ago, Data Scientist was the title that captured it all. It has now splintered into a growing list of titles: Data Engineers, Data Analysts, Machine Learning Engineers, Research Scientists and so on. Each of these roles relies to some degree on data infrastructure, and needs to understand it, to succeed.
