Why we decided on Kotlin in our Data Engineering architecture

Published in YAZIO Engineering · 5 min read · Mar 1, 2023

Choosing or changing a software architecture is not an easy task: it requires careful consideration of many factors. The decision can have long-lasting effects on the development process, maintenance, and scalability of the system, but also on developer happiness and employee churn. Nobody likes working with bad legacy code or fixing architectures that do not scale to the needs of an organisation.

Turn back time to 2021 and you’d see some basic Kotlin code, a pile of Python, some SQL, and a lot of YAML for Kubernetes resources in our tech stack. At that point the team already had a lot of experience in building software in general and data systems like ETL pipelines in particular, but we weren’t very happy with our process at the time, and least of all with Python.

Although it allowed us to quickly ingest new data from HTTP APIs, we often experienced bugs in our software that were very basic: an AttributeError here, an edge case resulting in a KeyError there — typical things that can be prevented at compile time in another language.
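To make that concrete, here is a minimal Kotlin sketch of how both kinds of error surface at compile time instead of in production. The User type and the two helper functions are made up for illustration and are not taken from our code base.

```kotlin
// Hypothetical payload type, used only for illustration.
data class User(val id: Long, val email: String?)

// A Map lookup returns a nullable value, so the missing-key case
// (Python's KeyError) has to be handled before the code compiles.
fun countryOf(payload: Map<String, String>): String =
    payload["country"] ?: "unknown"

// email is declared nullable, so calling substringAfter directly on it
// is a compile error; the safe call makes the "no value" case explicit.
fun emailDomain(user: User): String? =
    user.email?.substringAfter("@")

fun main() {
    println(countryOf(mapOf("name" to "Ada")))        // prints "unknown"
    println(emailDomain(User(1, "ada@example.com")))  // prints "example.com"
    println(emailDomain(User(2, null)))               // prints "null"

    // A misspelled property (Python's AttributeError) does not compile:
    // println(User(1, null).emial)
}
```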

What about a typed language then?

We tried annotating our code with type hints using the typing module, but the interpreter does not enforce them at runtime, and they are only really helpful if the PyPI packages you depend on also ship type information.

As mentioned earlier, we already had some Kotlin code running in production because we share code with our mobile clients for iOS and Android (using Kotlin Multiplatform), and we were very pleased with it. Since most of the team also had prior experience running Kotlin backend services in production, we decided to go all-in on Kotlin (where possible™). These are the reasons why we think it’s a good idea to this day:

stability and consistency
ability to use the vast amount of libraries for the JVM
being able to share code (for event tracking) using Kotlin Multiplatform
previous positive experience running Kotlin in backend systems
still finding colleagues in the future (… looking at you there Scala!)
sharing knowledge across the company with mobile app developers

It just feels solid

Kotlin itself is a stable language: new versions mostly target improvements to the compiler and tooling, seldom introduce new language features, and even more rarely add new language constructs these days. It is also a very consistent language; once you have learned it, there are few surprises in how methods are named or how things work. I think it is easy to get started with, but you can also get very far with it: Android, Kotlin/Native, or Kotlin/JS, anyone?

Running on the JVM gives us the opportunity to use the vast ecosystem that the Java and JVM community provides, be it smaller libraries like Apache Parquet, the Hadoop file system abstractions to access AWS S3 and GCS, or giants like Spring Boot. There are also some libraries targeted at Kotlin for Data Engineering, and although these are very young projects, they work very well for us. Most prominently, everything we run on Spark we build with Kotlin for Apache Spark.

Let’s look at an example

A good example of what makes Kotlin so great is our Parquet writer: using Java classes generated with the Apache Avro tools, we have a generic Parquet writer class written in Kotlin that takes care of creating .parquet files with the correct schema. It is very concise, easy to reason about, and free of unnecessary clutter. Thanks to generics it can be used with any type that implements SpecificRecord, while the IDE provides helpful insights when using the class.

The writer also shows the effect of integrating JVM libraries: thanks to HadoopOutputFile, which builds on the Apache Hadoop FS abstractions, it can handle S3, GCS, paths in your local file system, and many more.
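A writer along these lines could look roughly like the sketch below. It is an illustration under assumptions, not our exact production class: it assumes parquet-avro and the Hadoop client libraries are on the classpath, and the class name ParquetFileWriter and its constructor parameters are made up for this example.

```kotlin
import org.apache.avro.Schema
import org.apache.avro.specific.SpecificRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter
import org.apache.parquet.hadoop.util.HadoopOutputFile

// Hypothetical generic writer: T is restricted to Avro-generated record classes.
class ParquetFileWriter<T : SpecificRecord>(
    schema: Schema,
    path: String,
    conf: Configuration = Configuration(),
) : AutoCloseable {

    // HadoopOutputFile resolves the file system from the path's scheme,
    // for example s3a://, gs:// or file://, given matching Hadoop connectors.
    private val writer: ParquetWriter<T> =
        AvroParquetWriter.builder<T>(HadoopOutputFile.fromPath(Path(path), conf))
            .withSchema(schema)
            .build()

    fun write(record: T) = writer.write(record)

    fun writeAll(records: Iterable<T>) = records.forEach { writer.write(it) }

    // Closing the writer flushes buffered row groups and writes the Parquet footer.
    override fun close() = writer.close()
}
```

With an Avro-generated class, here called Ad purely as an assumption, usage would be along the lines of `ParquetFileWriter<Ad>(Ad.getClassSchema(), "s3a://some-bucket/ads.parquet").use { it.write(ad) }`.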

Building this kind of abstraction is of course possible in Python, too. But it would not provide the level of safety Kotlin as a statically typed language does.

Imagine the following scenario: you build an ad analysis system consisting of two types of records, ad and campaign. The Kotlin solution would prevent writing ads into the campaign writer as early as compile time. It would use the types provided via the Avro schemas, resulting in consistent types in the files being written. The Python solution wouldn’t provide that safety net at all, and we’d need to check every type conversion ourselves, risking invalid or incompatible schemas being persisted in our data lake.
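The scenario can be modelled with nothing but the standard library. In this minimal sketch, Record is a stand-in for Avro’s SpecificRecord, and Ad, Campaign, and RecordWriter are hypothetical names that collect records in memory instead of writing files.

```kotlin
// Stand-in for Avro's SpecificRecord, so the sketch has no dependencies.
interface Record

// Hypothetical record types; in a real setup these would be generated from Avro schemas.
data class Ad(val id: Long, val campaignId: Long, val headline: String) : Record
data class Campaign(val id: Long, val name: String) : Record

// A writer bound to exactly one record type via its type parameter.
class RecordWriter<T : Record>(val target: String) {
    private val buffer = mutableListOf<T>()
    val written: List<T> get() = buffer

    fun write(record: T) {
        buffer += record
    }
}

fun main() {
    val adWriter = RecordWriter<Ad>("ads.parquet")
    val campaignWriter = RecordWriter<Campaign>("campaigns.parquet")

    adWriter.write(Ad(id = 1, campaignId = 10, headline = "Eat better"))
    campaignWriter.write(Campaign(id = 10, name = "Spring"))

    // Writing an ad into the campaign writer is rejected by the compiler
    // with a type mismatch, so it can never reach the data lake:
    // campaignWriter.write(Ad(id = 2, campaignId = 10, headline = "Oops"))

    println(adWriter.written.size)        // prints "1"
    println(campaignWriter.written.size)  // prints "1"
}
```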

… but Scala is JVM for Data Engineering, right?

Although Scala is nice in terms of language features and has a solid foundation in the Data Engineering community, we didn’t really consider it in our decision. The main reason is that it’s not widely used outside of Data Engineering, which makes it hard to hire for in the future.

Google Trends Kotlin vs. Scala 2004–2023

According to Google Trends, Kotlin (blue) has been increasing in popularity since its first stable release in 2016, while Scala (red) has been declining since around the same time. This suggests that Kotlin is gaining more attention in the JVM community, making it a more attractive option to build upon.

Similarly, we also didn’t go for Java. Although it is quickly evolving into a more modern version of itself (lately introducing records, sealed classes, switch expressions, etc.), many developers still hesitate to adopt new versions of the language and still run Java 8 (released 2014) or 11 (released 2018) in production.

There’s still some Python left

There definitely is, and there will still be Python in our tech stack in the future. The reason is that our architecture also uses Dagster, great_expectations, dbt, and other tools that are written in Python. Python has always been and probably will always be a foundation of data architectures.

And that’s okay as long as we avoid its pitfalls and have a tool at the ready for whenever we need more guardrails to keep us from making the same basic errors over and over. You shouldn’t underestimate the confidence static typing gives you in your code.