Cloud Native vs Cloud Agnostic in Data Engineering: Looking for the balance

Nikolai N
Published in EurowingsDigital · Sep 5, 2024 · 10 min read

The deployment of data platforms and analytics solutions in the cloud is now a widespread phenomenon. It seems like it has always been (and will always be?) this way. However, there is a constant (though not always obvious) tension between the Cloud Native and Cloud Agnostic approaches. Cloud Native embraces the use of services specific to a particular cloud provider, while Cloud Agnostic focuses on building applications that can run on various cloud platforms without modification.

It is important to understand the differences between these approaches and make an informed decision when choosing one for your project. This choice can significantly impact the architecture and scalability of your data platform, as well as the cost of development and maintenance.

However, it is important not to fall into extremes by committing absolutely to just one of the approaches. Sometimes the optimal solution is a combination of both, striking the best balance between flexibility and efficiency.

In this article, we would like to share the approach used by one of our data teams.

Reflections

Any cloud provider tries to “hook” its clients on solutions that are specific to its own cloud. There’s no need to look far for examples: AWS Glue, AWS Kinesis, Azure Data Factory, Google Data Fusion, Databricks Unity Catalog (which, while this article was gathering dust in drafts, was made open source) and Databricks Jobs, etc. These cloud-native solutions undoubtedly have their advantages, the most significant being how quickly they get a working solution “on the ground” (I won’t list the others; esteemed readers can easily find them in the relevant marketing materials), but they also come with significant downsides that aren’t mentioned in the promotional materials. I will highlight only the most important (and obvious) ones, in my opinion:

  • Lack of flexibility and limited functionality. In the beautiful demo examples at conferences, in vendor training seminars, and in videos from bloggers on YouTube, everything looks “great.” At the initial stage of using these tools, many of the limitations go unnoticed, but as the project develops, the constraints become increasingly apparent. For example, try to generate an array of dates within a given interval based on dateFrom and dateTo using the standard tools of an Azure Data Factory pipeline (difficulty: easy), or, even better, generate an array of the calendar week numbers that these dates fall into (difficulty: hard). The “pleasure” is guaranteed! :-) (For contrast, a plain-Scala version of both tasks is sketched right after this list.)
  • Vendor lock-in. If you are heavily using many features of a specific cloud service, it becomes very labor-intensive to migrate not only from one provider to another but also simply from one service to another within the same cloud. (Or perhaps you have a case similar to a client the author of this article once worked with: a large company decided to migrate from on-premise to the cloud, only to find that none of the available cloud providers were ready to offer the required capacity with the necessary SLA. The company was forced to build a multi-cloud solution.)
  • There can often be issues with organizing the release management you need. Different tools have different, and sometimes quite non-obvious, limitations, but this is especially relevant for tools that involve active “mouse programming through the UI” (hello, Azure Data Factory, which without a lot of hassle can only be deployed in its entirety rather than as individual data pipelines!). Moreover, setting up CI/CD for it can be quite challenging if your requirements go beyond the examples provided in the documentation; it’s enough to make you want to pull your hair out!
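For contrast, here is a minimal sketch (with illustrative dates, not production code) of how both tasks from the first bullet look in plain Scala. The point is not this particular snippet but how little ceremony the same logic needs outside a UI-driven pipeline:

```scala
import java.time.LocalDate
import java.time.temporal.{ChronoUnit, IsoFields}

object DateIntervalDemo extends App {
  val dateFrom = LocalDate.parse("2024-01-29") // illustrative values
  val dateTo   = LocalDate.parse("2024-02-11")

  // 1) all dates in the interval ("easy" in ADF)
  val days: Seq[LocalDate] =
    (0L to ChronoUnit.DAYS.between(dateFrom, dateTo)).map(dateFrom.plusDays)

  // 2) the distinct ISO calendar week numbers those dates fall into ("hard" in ADF)
  val weeks: Seq[Int] = days.map(_.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)).distinct

  println(days)  // 14 dates
  println(weeks) // List(5, 6)
}
```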

In my humble opinion, a good Data Architect should understand all of this and therefore strive to find a reasonable compromise, guided primarily by common sense, the maturity of technologies, their capabilities and limitations, and, of course, the goals and strategy of the company (both from a business development perspective and a technological advancement perspective). It is important not to underestimate the significance of the company’s goals and strategy, which is often overlooked by “100%-technical” people. As is often the case, the final choice will be determined by a compromise between short-term and long-term gains — essentially a specific instance of the well-known psychological dilemma between instant gratification and delayed gratification. We, as techies, will naturally tend to lean towards “long, expensive, and awesome”. However, the business, whose needs we primarily serve, will always expect us to deliver “quickly, cheaply, and effectively.” In this context, the experience and qualifications of the architect making the decision play a critical role — if, of course, such a role is even defined! :-)

Based on the years of experience of our Data Engineers, our team has identified the following key points that, in my opinion, can serve as the foundation for a reasonable compromise:

  • Avoiding vendor/cloud lock-in where it can be avoided with minimal effort or slight changes to processes.
  • Avoiding, as far as possible, the implementation of any logic with a complexity above “easy” in low-code/no-code tools, as well as in services that involve “mouse programming through the UI.” This is primarily due to the difficulty of porting logic implemented this way, as well as the complexities of release management, debugging, and tracking changes.
  • Implementing business logic using tools/technologies/frameworks that are vendor-agnostic and/or cloud provider-independent. At the very least, the tool should be available from multiple providers.
  • The ease of maintaining the solution (both the overall solution and the specific code) takes precedence over non-critical gains in performance.

Practical implementation

Everything mentioned above is “abstract thoughts in a vacuum,” akin to being “for all that is good and against all that is bad.” But what about in practice? In practice, we had to settle on a combination of tools and architectural patterns that the team would use. To tackle this step by step, let’s first identify the main components of any data platform:

  • Data Storage
  • Data Processing
  • Orchestration

* Here, it would be appropriate to also mention business analysis and visualization tools, as well as DevOps considerations. But let’s leave this for the next article :-)

Data Storage

This question is probably the simplest. For many years, the standard approach has been to store data in universal formats such as Parquet, Delta, Iceberg, JSON, or CSV in object storage solutions like S3, Azure Data Lake Storage, etc. This approach is already nearly cloud-agnostic in itself.

In our case, we store data in Azure Data Lake Storage, which is integrated “out of the box” with many Azure services and is mounted to the file system of the Databricks cluster. This is very convenient: there’s no need to write extra lines of code or deal with access keys every time. Interestingly, this approach is rarely seen in tutorials. Currently, with an eye toward the new group data platform, we are considering the possibility of adopting Unity Catalog. If such a decision is made, it will be relatively easy to switch to Unity Catalog Volumes.

This way, we can minimize our reliance on Databricks-specific functionality, and if anyone ever wants to use something other than Databricks, such as HDInsight or Synapse, for various reasons, there will be no issues accessing the data. Even in the unlikely event of migrating to another cloud, we will also be able to transfer our data quite easily.
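As a minimal illustration of why this setup is nearly cloud-agnostic (the table name and paths below are placeholders, not our real layout): the reading code only cares about a root path, and that root can equally be a Databricks mount point, a direct abfss:// URI, or a local folder with test data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StorageAgnosticRead {
  // The same code works regardless of where the Data Lake "root" actually lives.
  def readBookings(spark: SparkSession, lakeRoot: String): DataFrame =
    spark.read.format("delta").load(s"$lakeRoot/bookings")
}

// All of these roots would work without touching the code above:
//   /mnt/datalake                                  (ADLS mounted on the Databricks cluster)
//   abfss://lake@someaccount.dfs.core.windows.net  (direct access, placeholder account name)
//   /home/me/testdata                              (local debugging)
```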

Data Processing

This question can be broken down into several aspects.

  • Data processing framework. Here, it’s obvious — Spark. No further comments needed :-)
  • Compute. In our case, this is the Databricks cluster. It may not be the most cost-effective solution, but it is very stable and alleviates the headaches associated with configuring and optimizing cluster settings. Another advantage of Databricks is its availability on all three major cloud platforms, where it can be used with minor differences in cost, functionality, and data center locations.
  • Programming language. There are three main universal languages for working with Spark: Scala, Python, and Java (I could also mention the approach where these languages serve merely as “wrappers” for executing standard SQL queries, but we do not consider it for developing production data pipelines). At Eurowings Digital we use Scala as our main programming language for data pipelines. Although PySpark is evolving rapidly, it still lags behind Scala in both capabilities and performance. However, I must admit that the lines are becoming increasingly blurred, especially with the release of Spark Connect (we are not looking in that direction for now, but the situation may change in the future) and because Databricks is striving to make Python the primary language for working with Spark.

One of the advantages of using Scala (as well as Java) is that it is a compiled language, which simplifies the development of environment-agnostic Spark applications. For me, this means that the same code should at a minimum:

  • run both on a local machine and on a cluster in any cloud (how can I not recall the recent worldwide IT outage caused by Microsoft & CrowdStrike :-) as far as development goes, I barely noticed it: although many of our cloud services went down, it didn’t hinder the ongoing development and debugging of applications on my local machine; I can also keep working in places without internet access, on a train, in an airplane, etc., which matters because we are a remote-first company :-) ). A minimal sketch of what this looks like in code follows this list.
  • run through any scheduler (Oozie, Airflow, AWS Glue, Azure Data Factory, etc.).
  • read and write data stored in any cloud storage, in HDFS, or on a local machine.
  • and ideally do all of this without using Docker.
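Here is a minimal sketch of the first point, assuming the application obtains its SparkSession through one small helper (the environment-variable check is just one possible way to detect a Databricks cluster, not the only option):

```scala
import org.apache.spark.sql.SparkSession

object SparkEnv {
  // On a cluster the master and most settings are supplied by the runtime,
  // so getOrCreate() simply picks up the existing session and configuration.
  // On a local machine nothing is preconfigured, so we fall back to local[*].
  def session(appName: String): SparkSession = {
    val builder = SparkSession.builder().appName(appName)
    val configured =
      if (sys.env.contains("DATABRICKS_RUNTIME_VERSION")) builder // set on Databricks clusters
      else builder.master("local[*]")
    configured.getOrCreate()
  }
}
```

The rest of the application then never touches environment-specific session settings, which is what makes the same jar runnable on a laptop, on the cluster, or from any scheduler.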

These requirements, by the way, clearly indicate that the use of notebooks in production should be avoided, partly because of how hard it is to meet all of them from a notebook. Yet vendors, in an attempt to further “hook” clients on their solutions, now promote notebooks as almost a best practice for developing data pipelines. It’s a pity.

We have managed to achieve all of this quite simply through the following, fundamentally straightforward approaches:

  • Proper parameterization and thoughtful configuration options for Spark applications. In the context of this article, this primarily involves working with “secrets,” data paths, and command-line arguments. For example, all our Spark applications accept at least a date range and the path to the “root” of the Data Lake as command-line arguments. This path can be the mount point of Azure Data Lake Storage on the Databricks File System, a “sandbox” for debugging, or a local path to a folder with test data (a minimal argument-parsing sketch follows this list).
  • “Internally,” we write the code in such a way that it can initially run both on a local machine and on a cluster. To achieve this, we try to avoid using any vendor/cloud provider-specific libraries. Here are some examples:
    - For working with the file system, we use org.apache.hadoop.fs.FileSystem instead of java.io or dbutils.fs.
    - For working with secrets, we use special wrapper functions that retrieve secrets from the Key Vault when running on a cluster and fall back to a local file with harmless “default” values when running on a local machine (one possible shape of such a wrapper is sketched after this list).
    - Another less-than-ideal case is when a service exists as a single instance in the cloud and only in production; we are forced to create “stubs” in the wrappers.
  • For deployment, we use only fat jars. The versions of all the base libraries that are installed on the cluster by default are pinned and marked as provided in build.sbt (typically Scala, Spark, Delta Lake, etc.; in our case we can check the documentation for the Databricks Runtime), while all other libraries are bundled into the fat jar. This approach allows us to be confident that version conflicts will not arise 99% of the time, without complicating matters with Docker (practice has so far confirmed this). Additionally, it enables different applications running on the same cluster to use different versions of the same library, which, for example, currently simplifies our migration to a new version of Databricks Runtime (a build.sbt fragment is sketched below).
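A minimal sketch of the argument handling mentioned above; the argument order, names, and validation are illustrative, not a description of our actual applications:

```scala
import java.time.LocalDate

// Illustrative only: real jobs may accept more arguments and validate them differently.
final case class JobConfig(dateFrom: LocalDate, dateTo: LocalDate, lakeRoot: String)

object JobConfig {
  // Expected order: <dateFrom> <dateTo> <lakeRoot>, for example:
  //   2024-09-01 2024-09-05 /mnt/datalake          (cluster run against the mounted lake)
  //   2024-09-01 2024-09-05 ./src/test/resources   (local run against test data)
  def fromArgs(args: Array[String]): JobConfig = {
    require(args.length >= 3, "Usage: <dateFrom> <dateTo> <lakeRoot>")
    JobConfig(LocalDate.parse(args(0)), LocalDate.parse(args(1)), args(2))
  }
}
```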
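And one possible shape of the secrets wrapper; the cluster-detection check, the file name, and the vault call are assumptions made for the sketch, not our actual implementation:

```scala
import scala.io.Source

object Secrets {
  // One possible way to detect a Databricks cluster; any reliable signal would do.
  private def runningOnCluster: Boolean =
    sys.env.contains("DATABRICKS_RUNTIME_VERSION")

  def get(key: String): String =
    if (runningOnCluster)
      fetchFromVault(key) // the vendor-specific call stays isolated behind this one function
    else {
      // Local debugging: read harmless defaults from a simple key=value file.
      val local = Source.fromFile("local-secrets.properties").getLines()
        .map(_.split("=", 2)).collect { case Array(k, v) => k.trim -> v.trim }.toMap
      local.getOrElse(key, sys.error(s"No local default for secret '$key'"))
    }

  // Stub: on a Databricks cluster this could delegate to the secrets API backed by Key Vault.
  private def fetchFromVault(key: String): String = ???
}
```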
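Finally, a build.sbt fragment showing the provided trick; the versions are placeholders and must be matched to the concrete Databricks Runtime release notes:

```scala
// build.sbt (fragment, illustrative versions)
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Already installed on the cluster: mark as Provided so they stay out of the fat jar.
  "org.apache.spark" %% "spark-sql"   % "3.5.0" % Provided,
  "io.delta"         %% "delta-spark" % "3.1.0" % Provided,

  // Application-specific libraries: bundled into the fat jar, so different jobs
  // on the same cluster can ship different versions of the same dependency.
  "com.typesafe" % "config" % "1.4.3"
)

// The fat jar itself is produced with the sbt-assembly plugin via `sbt assembly`.
```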

Orchestration

Here, “by default,” preference should be given to open-source solutions, preferably those that can be deployed in the cloud as a service. The undisputed leader in this area is Apache Airflow. However, orchestration is precisely the case where the use of cloud-native solutions can be justified. This is primarily because orchestration involves interacting with a large number of other services, which means a significant amount of work related to their integration and addressing access control issues. Let’s face it, providers have typically already resolved all these issues, so it makes sense not to reinvent the wheel.

In our specific case, we use Azure Data Factory (ADF), which combines both orchestration and ETL/ELT functionalities and, unfortunately, is “mouse-programmed.” Our approach is as follows:

  • All business logic is encapsulated in Spark applications and ADF simply triggers them on the Databricks cluster. This way, the most intellectually demanding and labor-intensive part of our data pipelines remains cloud-agnostic.
  • We use ADF-activities in our pipelines only for the simplest actions that are not directly related to data transformation, such as “moving files from sFTP to Cloud Storage,” “sending a message to Slack about the failure/completion of a process,” “checking for the presence of a file on FTP,” “selecting to run JAR-1 or JAR-2 based on a parameter,” and so on.
  • We use the orchestration functionality without any limitations (most of our pipelines are scheduled, and we currently do not have event-driven triggers).

An example of a pipeline in ADF: the business logic is encapsulated in a Spark application triggered by a Jar activity, while all other steps are auxiliary tasks, such as copying files from sFTP to the Data Lake and sending notifications to Slack.

This approach allows us, on one hand, to avoid “reinventing the wheel” and quickly set up routine data transportation processes (without transformations) between FTP, Cloud Storage, SQL DB, etc., establish a launch schedule, configure error notifications, and so on. This minimizes the engineer’s effort on routine and uninteresting parts of the data pipelines while freeing up time for the development of more “intellectually demanding” transformations that are implemented in Spark applications.

Summary

As a brief summary of the above, it can be concluded that when choosing between Cloud Native and Cloud Agnostic in Data Engineering, it is important to consider the specifics of the project and the needs of the business. It is advisable to aim for a hybrid approach that combines the advantages of both strategies, ensuring flexibility and scalability of the infrastructure. In my opinion, a healthy compromise is to implement business logic in an environment-agnostic style, while using cloud-native tools for all “auxiliary” and less “intellectually demanding” tasks.
