Why your Data and ML platform should be in the cloud

Stefan Graf
Published in CodeX
4 min read · Jun 17, 2024
Photo by Luke Chesser on Unsplash

Lately I keep seeing the question pop up whether the cloud really keeps its promise of making our IT initiatives and projects more productive, efficient and cost-effective. Most of the discussions I have witnessed revolved around web pages and “classic” applications. The argument there is that most businesses don’t need these complex (some might say overengineered, to some extent) solutions, because they will never have to scale like Netflix or Amazon. They can simply rely on proven solutions that work for a fraction of the cost.

While I’m no specialist in this field, I want to take this discussion to the world of data, analytics, ML and AI and give you some insight into why I believe the cloud is the right place for your data initiatives.

Organisational

If you are not lucky enough to work in a data-driven, data-native company, you have probably already experienced that data analytics and ML are often not considered a core part of the business. Even if your organisation emphasises the importance of these topics, there is rarely the budget or the willingness to put in the effort of building a state-of-the-art, self-hosted data platform in-house.

Sure, there are exceptions, as always. The FAANG companies run most of their data-related workloads on self-developed systems, but these companies are almost always outliers, whichever aspect you look at.

Even though cloud data platforms are by no means simple to build, they take a lot of headaches away from you compared to building everything on your own. If you want to build and host state-of-the-art solutions like lakehouses, data integration solutions, ML platforms and modern BI solutions yourself, you not only need experts in using these solutions, you also need a lot of experts in building, maintaining, administering and hosting them. This adds a lot of complexity to your initiatives and will eventually slow you down.

The nature of data related workloads

One of the unique selling points of cloud services has always been the following story:

Imagine you are hosting an online shop on your own machines. There can be spikes in usage where your servers cannot cope with the load and customers are unable to access your website. This will eventually hurt your business. You could spend more money on servers to increase your capacity, but then most of the time they would use only a small percentage of that capacity and sit idle without any benefit. This is where the cloud comes into play: you can leverage nearly unlimited scaling without paying for resources when you are not using them.

This kind of workload, with spikes in compute demand alternating with times when not much is happening at all, is even more prevalent in data platforms. Most data loads run only nightly or weekly; streaming or near-real-time processing is still the exception. Analyses are often done in response to ad hoc requests. ML and AI development happens at different scales, depending on the use case and the data involved. Gen AI popped up out of nowhere, and now you would need a lot of GPUs in your data center just to test whether your company could benefit from this new technology.
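To make the idle-capacity argument a bit more tangible, here is a minimal back-of-the-envelope sketch in Python. Every number in it (node counts, hours, prices, the assumption that a cloud node hour costs three times as much as an owned one) is made up purely for illustration, and the function names are my own; it is not a real pricing model.

```python
# All numbers below are hypothetical and only illustrate the shape of the argument.
HOURS_PER_DAY = 24


def utilisation(hourly_demand, capacity):
    """Share of the provisioned capacity that is actually used over the period."""
    used = sum(min(demand, capacity) for demand in hourly_demand)
    return used / (capacity * len(hourly_demand))


def fixed_cost(hourly_demand, price_per_node_hour):
    """Own hardware sized for the peak: every node is paid for around the clock."""
    peak = max(hourly_demand)
    return peak * len(hourly_demand) * price_per_node_hour


def on_demand_cost(hourly_demand, price_per_node_hour):
    """Pay-per-use: only the node hours actually consumed are billed."""
    return sum(hourly_demand) * price_per_node_hour


# Hypothetical day: the nightly load needs 20 nodes for 2 hours,
# ad hoc analyses need 2 nodes during 8 office hours, nothing else runs.
demand = [0] * HOURS_PER_DAY
for hour in (1, 2):
    demand[hour] = 20
for hour in range(9, 17):
    demand[hour] = 2

peak_utilisation = utilisation(demand, capacity=max(demand))
own = fixed_cost(demand, price_per_node_hour=1.0)
# Assumption: a cloud node hour is three times as expensive as an owned one.
cloud = on_demand_cost(demand, price_per_node_hour=3.0)

print(f"Utilisation of a peak-sized cluster: {peak_utilisation:.0%}")
print(f"Daily cost, own hardware sized for peak: {own}")
print(f"Daily cost, pay-per-use: {cloud}")
```

With this invented profile, the peak-sized cluster sits idle most of the day, so pay-per-use can come out cheaper even at a much higher price per node hour. A flatter demand curve would shift the result towards owned hardware, which is exactly why the shape of the workload matters.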

What I want to point out here is that it is almost impossible to predict what computing requirements your data initiatives will have. This field is evolving so fast that it is super hard to follow, even if you are not hosting your stuff yourself.

Cloud exclusive solutions

Like it or hate it, it doesn’t really matter. The fact is that for a lot of modern products in the data world, you don’t even have the option to host them outside of the cloud. Databricks and Snowflake, which are fighting for the throne of the best data and AI platform, are both available only in the cloud. Microsoft Fabric is a SaaS all-in-one data platform. OpenAI, Gemini and other leading Gen AI solutions are also deeply integrated into their respective cloud ecosystems.

The Caveat

While I’m convinced that the cloud is the right place for most of your data platform solutions, there are also other things to consider. Sometimes it is simply not possible to bring everything into the cloud, be it for regulatory or technical reasons. In cases like this, approaches such as hybrid cloud can still bring you a lot of benefits.


Stefan Graf
Writer for CodeX

Data Engineer Consultant @Microsoft, Data and Cloud Enthusiast