My key takeaways after building a data engineering platform

Niels Claeys · Published in datamindedbe · Feb 15, 2024 · 6 min read

Approximately two years ago, I shifted my career from consultancy to a product team, where I work full-time on Conveyor. This shift required me to change my mindset from short-term project thinking to long-term planning and prioritizing user experience. In this blog post, I want to focus on three insights gained after two years of building a data engineering platform.

Adding code is easier than deleting code

A product consists of functionality that aims to solve a specific problem for a customer. This is why many engineers, as well as engineering managers, consider only writing code to be value-added work. However, I strongly believe that your product can only stay relevant in the long term if you are also able to delete code.

There are many incentives for a development team to delete code:

  • Getting rid of unused functionality or functionality that is no longer supported in your product.
  • Deleting legacy functionality that has been re-implemented in another way and has thus become obsolete.
  • Reworking code for architectural/technical reasons without impacting the functionality of end users.

Deleting code is hard because it often impacts your users in some way. As a product development team, you want to minimize the impact on end users in order to keep them happy and thus using your product. People in general are reluctant to change and prefer to stick with what they are used to. For this reason, additional effort is required to convince customers to use the new feature.

As an example, in Conveyor we implemented support for Jupyter notebooks as well as remote development environments (IDEs). The IDEs are a generalization of notebooks: they can do everything notebooks can and more. Our main difficulty now is migrating existing notebook users to the new IDE feature so that we can remove notebooks at some point in the future. This will include adding scripts/commands to help users move their work from one place to another, as well as creating a template to set up Jupyter notebooks in IDEs.

Why do I insist on deleting code, you might wonder? The answer is that code is a liability: it requires constant maintenance and time from your development team. As long as the code exists, you will need to fix bugs as well as update it (e.g. refactoring the code, updating an external library, …). By removing the code, you free up time for the development team to work on the relevant parts of your code base. Keep this in mind the next time you hesitate to go the extra mile to finally get rid of a piece of code.

Poor design decisions will bite you

In theory, this is a straightforward concept, but it can be challenging to put into practice because it is nearly impossible to have all the information upfront. I will discuss two concrete cases where, in hindsight, I would have done things differently:

Everyone loves dark mode except for me

We added dark mode functionality to both the Conveyor UI as well as the Airflow UI, as we embed Airflow environments in our product. The reason for adding dark mode to both is to maintain consistency throughout the entire UI. Although the initial effort for adding dark mode to Airflow was manageable, keeping it working across Airflow upgrades has proven to be more challenging than anticipated.

Airflow is currently reworking its UI in React, which involves redesigning its UI components and styling. Consequently, we also need to adapt our dark mode implementation with nearly every upgrade to ensure it continues to function properly. Over the past year, we easily spent 5 to 10 times the initial implementation effort on these updates. Each time new changes are required, I wonder whether it wouldn't be better to get rid of the Airflow dark mode altogether. The downside is that this would impact our end users...

Minimize the public API

Similar to most products, Conveyor also provides its functionality through an API. Clients can interact with the API using our UI, CLI, and Terraform provider. It is important to clearly differentiate between external and internal endpoints of your API. A lightweight and useful start is to clearly document which parts of your API are public. This helps clients understand which endpoints they can rely on and it forces you to think about what you want to expose.
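As a purely hypothetical sketch of this idea (Conveyor's actual API stack is not shown here, and the route names and prefixes below are invented), one lightweight approach is to mount public and internal endpoints on separate routers and only generate documentation for the public one:

```python
# Hypothetical sketch: separating public and internal endpoints with FastAPI.
# The endpoint names and prefixes are invented for illustration; this is not
# necessarily how Conveyor's API is built.
from fastapi import APIRouter, FastAPI

app = FastAPI()

# Public endpoints: documented, versioned, and safe for clients to rely on.
public = APIRouter(prefix="/api/v1", tags=["public"])

# Internal endpoints: hidden from the generated OpenAPI docs, free to change.
internal = APIRouter(prefix="/internal", include_in_schema=False)

@public.get("/projects")
def list_projects():
    """Stable, documented endpoint that external clients may depend on."""
    return []

@internal.get("/token-introspection")
def introspect_token():
    """Implementation detail; may change or disappear without notice."""
    return {"active": True}

app.include_router(public)
app.include_router(internal)
```

Everything outside the public router then remains an implementation detail that can change between releases without breaking clients.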

Unfortunately, we did not document which parts of our API are public, and as a result, one of our customers depends on our “legacy” implementation of OAuth 2.0 tokens. These tokens are used to decide whether a user has permission to perform a certain action. We never intended for these tokens to be part of our public API, as they are a technical implementation detail. Initially, we used Auth0, but due to the exponential increase in pricing, we switched to AWS Cognito. Before we can remove the Auth0-related code, all customers need to switch to using Cognito. Unfortunately, one of our customers has created scripts that rely on the fact that we use an Auth0 token. This means that we depend on them before we can finish the migration.
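To illustrate how such a migration can proceed without breaking existing clients, here is a hedged sketch (the issuer URLs are placeholders and this is not Conveyor's actual code) that accepts tokens from both identity providers by looking at the token's iss claim, so Auth0 and Cognito tokens can coexist until every customer has switched:

```python
# Hypothetical sketch: accepting tokens from two identity providers during a
# migration. The issuer URLs are placeholders, not a real configuration.
import jwt  # PyJWT
from jwt import PyJWKClient

AUTH0_ISSUER = "https://example.eu.auth0.com/"  # placeholder
COGNITO_ISSUER = "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_example"  # placeholder

JWKS_URLS = {
    AUTH0_ISSUER: AUTH0_ISSUER + ".well-known/jwks.json",
    COGNITO_ISSUER: COGNITO_ISSUER + "/.well-known/jwks.json",
}

def validate_token(token: str) -> dict:
    # Peek at the issuer without verifying, then verify against that issuer's keys.
    unverified = jwt.decode(token, options={"verify_signature": False})
    issuer = unverified["iss"]
    if issuer not in JWKS_URLS:
        raise ValueError(f"Unknown token issuer: {issuer}")
    signing_key = PyJWKClient(JWKS_URLS[issuer]).get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        issuer=issuer,
        options={"verify_aud": False},  # audience checks omitted in this sketch
    )
```

Once the last Auth0-dependent customer has migrated, the Auth0 issuer can simply be dropped from the mapping and the related code deleted.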

A similar issue could have arisen with our managed Airflow instances, as we bundle Airflow core with various Airflow provider packages. To avoid issues with customers including hundreds of providers, we deliberately chose to support only a minimal list of Airflow providers. This allows us to update Airflow for all customers, as we do not need to worry about random breaking changes in one of the providers.
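As an illustration of how such a policy could be enforced (the allow-list below is made up, not Conveyor's real provider list), a small test can fail the build whenever a provider outside the approved set ends up in the Airflow image:

```python
# Hypothetical sketch: guard against unreviewed Airflow providers sneaking into
# the image. The allow-list is invented for illustration purposes.
from importlib.metadata import distributions

ALLOWED_PROVIDERS = {
    "apache-airflow-providers-amazon",
    "apache-airflow-providers-cncf-kubernetes",
    "apache-airflow-providers-microsoft-azure",
}

def test_only_allowed_providers_installed() -> None:
    installed = {
        (dist.metadata["Name"] or "").lower()
        for dist in distributions()
        if (dist.metadata["Name"] or "").lower().startswith("apache-airflow-providers-")
    }
    unexpected = installed - ALLOWED_PROVIDERS
    assert not unexpected, f"Unexpected Airflow providers installed: {unexpected}"
```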

Updating dependencies never stops

A significant portion of my time is dedicated to updating the dependencies of our product. These updates can be broken down into three categories: updating code dependencies, updating the open-source components of our product, and upgrading Kubernetes.

For code dependencies, we rely on Dependabot to notify us of newer versions as well as track vulnerabilities in the libraries we depend on. Dependabot helps us minimize the time required to update external libraries. Only in about 5% of cases do we need to modify our code due to breaking changes introduced by a version update. To detect these issues, we use static code checks and an extensive test suite.

Updating the components of our platform is a manual task, as they typically require thorough testing before being included in a release. Examples of such components include Airflow or components installed in our Kubernetes clusters (e.g., Karpenter, the CSI driver, Fluent Bit, …).

Lastly, upgrading Kubernetes is the most challenging task, as it serves as the computing infrastructure for Conveyor. Without an operational Kubernetes cluster, our customers cannot perform any meaningful operation. To minimize the impact on our customers, we rigorously test each Kubernetes upgrade on our staging environments. These staging environments mimic the different setups that exist at customers (e.g. different clouds, public/private access only, …). The steps for upgrading Kubernetes are generally as follows:

  • Update any Kubernetes components that rely on the deprecated functionality of the Kubernetes API.
  • Refer to your cloud provider’s migration guide for any necessary changes. This guide often outlines changes to addons, …
  • Update the Kubernetes control plane. If you use a managed Kubernetes service, this step is typically handled by the cloud provider (e.g., AWS, Azure, …).
  • Configure the Cluster-autoscaler/Karpenter to use only nodes of the latest Kubernetes version.
  • After a day, cordon the nodes that still use the old Kubernetes version. A few hours later, you can drain and delete these nodes.

Conveyor primarily manages batch data pipelines, which consist of many “short” running tasks that cannot easily be restarted while processing. Simply draining and deleting the existing nodes is not feasible, as it would cause the batch jobs running on those nodes to fail. Waiting a day before deleting the old nodes simplifies our job, as most of the nodes will already have been replaced by new ones.

In the future, we would like to automate the last part (cordon-drain-delete), but we have not yet implemented this feature.
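As a rough sketch of what that automation could look like (this is not Conveyor's implementation; the target version and the use of the official Kubernetes Python client are assumptions), the cordon step boils down to marking every node that still runs the old kubelet version as unschedulable:

```python
# Rough sketch of automating the cordon step after a Kubernetes upgrade.
# Not Conveyor's actual implementation; the target version is an assumption.
from kubernetes import client, config

TARGET_VERSION_PREFIX = "v1.29"  # hypothetical target Kubernetes minor version

def cordon_outdated_nodes() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        kubelet_version = node.status.node_info.kubelet_version
        if kubelet_version.startswith(TARGET_VERSION_PREFIX):
            continue  # node already runs the new version
        if node.spec.unschedulable:
            continue  # already cordoned

        # Cordon: mark the node unschedulable so no new batch jobs land on it.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"Cordoned {node.metadata.name} ({kubelet_version})")

if __name__ == "__main__":
    cordon_outdated_nodes()
```

Draining and deleting those nodes a few hours later could follow the same pattern: evict the remaining pods and then remove the node objects from the cluster.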


Conclusion

In this post, I discussed three key takeaways that I have learned in two years of building a data engineering platform:

  1. Adding code is easier than deleting code
  2. Poor design decisions will bite you
  3. Updating dependencies never stops

I hope that sharing these lessons gives you new insights and perhaps prevents you from making some of our mistakes. If you think of another takeaway, please share it in the comments.
