Building Data Platforms III — The evolution of the Software Engineer
Originally, I wanted to launch this article one month after part II. However, it took a lot longer for various reasons. Work got in the way and the pandemic was calmer, so I was able to travel a bit and fully disconnect. That being said, let’s get back to what really brought you here!
In the previous article of the Building Data Platforms series, I talked about the end of Data Engineers being just plumbers and the need to decentralize Data. In this article, I am going to tell you about one of the most important changes that must take place in the next couple of years: the evolution of the Software Engineer.
Software Engineers, the eaters of the world
A Software Engineer is an individual that solves complex problems by designing a set of software components that are brought to life using code in a given programming language. Throughout the years, a lot of tooling, best practices and ways of working have been put together to allow Software Engineers to become better at what they need to do. The rise of language tooling like Gradle and npm to package and share software modules, and of amazing tools such as Git to develop code as a team, enabled Software Engineers to be a lot more productive in managing dependencies and collaborating. Powerful Virtual Machines like the JVM enable everyone to run software on pretty much every device, so they don’t have to worry about the runtime and device specifications. Container orchestration technologies such as Kubernetes (sorry Mesos and Swarm but K8s has won) have totally changed the way we ship and scale applications. In the realm of best practices, things like TDD and CI/CD allow engineers to ship code faster than ever (without breaking things).
A lot has been done in order to make Software Engineers super productive and that has led to software eating the world. If you look at Software Engineering today, we could say it contains some of the disciplines mentioned in the figure below.
The fundamental truth about computer programs
Now that we have covered a bit of the current Software Engineer, let’s go a little deeper into the outcome of their work: computer programs.
If you think about computer programs you might be thinking about programming languages, frameworks, functions, methods, classes or about the endless debate between functional programming and object orientation. Although those are very interesting and valuable discussions, I believe it is a lot more interesting to see computer programs as transducers where, instead of energy, one has data.
Today, regardless of whether your program is designed as purely functional or object oriented, the most important thing that it does is transform an input into a certain output. In other words:
The most important side effect of any computer program is the data that it generates
The sentence above is quite simple and obvious but it is also quite powerful and has a lot of implications, as you will see. It is very important to mention that we are considering all pieces of software, from the frontend code that lives in your React, Flutter or Vue apps all the way back to some Flask or Spring Boot endpoint that handles login requests. This means that the sentence above applies to all engineers: frontend, backend and fullstack. All the data that is produced by that code might be used for a wide variety of purposes such as business analytics, product features powered by Machine Learning, state of the art research, generation of high quality datasets that can be sold, legal and compliance requirements, etc.
Data has become the most valuable asset in the modern age, and that data is produced by code that is written by Software Engineers. The relationship between code and data is so strong that I can’t resist sharing one of the sentences that those who have been dragged into the LISP world (courtesy of Clojure) might know.
Code as data and data as code.
That duality is very powerful and should be something that is embedded in the practices of today’s Software Engineers. However, I don’t think that is the case because most of the time the focus is only on building features, and that requires:
- Business logic code — to solve a concrete business problem
- Tests — to ensure the code behaves as expected
- CI/CD pipelines — to ship business logic code faster
- Infrastructure — well…code needs to run somewhere (these days it would be most likely Kubernetes).
The modern Software Engineer
You might have figured out by now that the fundamental thing that Software Engineers must change is their approach to Data, because the balance of power has changed:
It’s not Software that is eating the World anymore, it is Data
Because of this shift, modern Software Engineers must change the way they work and make Data a key part of their daily work.
Become Data aware
Data awareness is one of the most important capabilities Software Engineers need to have these days. Questions such as “What relevant Data is my system producing?”, “Who might be interested in using this Data?” and “How can we leverage the Data that the system is producing to make it better?” will make Software Engineers more capable of seeing the true impact of the systems they are building.
A side effect of this will be the death of ETL or in some cases ELT (more info on ETL in Part I). The classic ETL is pull based where a centralized Data team has a set of components that call database replicas or Data Extraction APIs to pull data for a variety of cases. In the modern world, with everything moving so fast, the pull paradigm is outdated and doesn’t make sense anymore. We must change to a push based paradigm where Software Engineers make use of domain events to make their systems publish facts that happened.
The Extraction (E) of ETL/ELT will become Publish (P) leading to a new category PTL/PLT
Publishing domain events to a streaming platform, the P part, is not enough. Modern Software Engineers also need to ensure that Data is available for other purposes. This is best explained with a use case, so let’s take Data Analytics as an example. In the new world, modern Software Engineers are responsible for the tables in the Cloud Data Warehouse that belong to their domain. They need to understand and own the transformation logic, the T part, and ensure the result is loaded properly, the L part. To do it, they need to rely on standards, tools and best practices that are the responsibility of the modern Data Engineer.
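To make the push based flow a bit more concrete, here is a minimal Python sketch of the P, T and L steps. Everything in it is an assumption made for illustration: the OrderPlaced event, the InMemoryStream class (a stand-in for a real streaming platform) and the warehouse_orders list (a stand-in for a table in a Cloud Data Warehouse) are hypothetical names, not part of any specific product.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical domain event: an immutable fact that already happened."""
    order_id: str
    customer_id: str
    amount_cents: int
    occurred_at: str  # ISO 8601 timestamp in UTC


class InMemoryStream:
    """Stand-in for a streaming platform topic (assumption for this sketch)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        # P: the owning system pushes the fact; nobody pulls from its database.
        record = {"type": type(event).__name__, **asdict(event)}
        for handler in self._subscribers:
            handler(record)


# Stand-in for a warehouse table owned by the same team that owns the event.
warehouse_orders = []


def transform(record):
    # T: shape the raw event into the row the analytics table expects.
    return {
        "order_id": record["order_id"],
        "customer_id": record["customer_id"],
        "amount": record["amount_cents"] / 100,
        "order_date": record["occurred_at"][:10],
    }


def load(record):
    # L: append the transformed row to the table.
    warehouse_orders.append(transform(record))


def place_order(order_id, customer_id, amount_cents, stream):
    # Business logic would run here; publishing the fact is part of the feature.
    event = OrderPlaced(
        order_id=order_id,
        customer_id=customer_id,
        amount_cents=amount_cents,
        occurred_at=datetime.now(timezone.utc).isoformat(),
    )
    stream.publish(event)
    return event


stream = InMemoryStream()
stream.subscribe(load)
place_order("order-1", "customer-7", 1999, stream)
print(warehouse_orders[-1])
```

The point of the sketch is the direction of the flow: the system that owns the domain publishes the event as part of the feature, and the transformation and load steps sit next to it instead of inside a centralized extraction job.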
Rely on Data Standards
In the new world I have presented, one thing is clear: the scope of the work of Software Engineers has increased. The next decade will be the decade of Data and of making Software Engineers as productive in working with and owning Data as they are today with features. For that they need to rely on standards. Here are some things I consider to be crucial.
Declarative definition of Data Products
A Data Product has three main components: one or more data sources, transformation/business logic and a set of outputs where end users consume it. However, most Data Products aren’t built with that mindset, and that makes it harder for engineers and everyone else to understand how they work, what they depend on and what their outputs are. I believe a declarative way to spin up Data Products will make everyone’s lives easier. An example is presented below.
What do we get from this simple YAML file? First, all the dependencies and components are visible. We know we are dealing with Kafka, Redshift and a stream processor. The three major components of the Data Product are identified. The transformation logic can be anything from a Kafka Streams application to ksqlDB or a classic Consumer. This declarative structure is agnostic to it. Even more interesting is the metadata that we can incorporate here, such as name, ownership, versioning and descriptions. Joining that metadata with some level of automation, so that we can spin up a Data Product by running kubectl apply -f <file_name>, will lead to another level of productivity and will allow us to tackle two very important aspects in the data space: lineage and quality monitoring.
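The same three-part structure can also be sketched as plain data plus a validation step in Python. Every field name below (name, owner, version, description, sources, transformation, outputs) and every value (orders-daily, checkout-team, the topic and table names) is an assumption made for illustration; this is not the schema of the YAML file discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A source or an output of a Data Product (hypothetical shape)."""
    kind: str  # e.g. "kafka" or "redshift"
    name: str  # e.g. a topic or a table name


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner: str
    version: str
    description: str
    sources: tuple
    transformation: str
    outputs: tuple


REQUIRED_FIELDS = (
    "name", "owner", "version", "description",
    "sources", "transformation", "outputs",
)


def parse_data_product(spec):
    """Validate a declarative spec (a plain dict) and return a DataProduct."""
    missing = [f for f in REQUIRED_FIELDS if f not in spec]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    if not spec["sources"] or not spec["outputs"]:
        raise ValueError("a Data Product needs at least one source and one output")
    return DataProduct(
        name=spec["name"],
        owner=spec["owner"],
        version=spec["version"],
        description=spec["description"],
        sources=tuple(Endpoint(**s) for s in spec["sources"]),
        transformation=spec["transformation"],
        outputs=tuple(Endpoint(**o) for o in spec["outputs"]),
    )


# Illustrative spec; in practice this dict could be loaded from a YAML file.
orders_daily_spec = {
    "name": "orders-daily",
    "owner": "checkout-team",
    "version": "1.0.0",
    "description": "Daily order totals per customer.",
    "sources": [{"kind": "kafka", "name": "orders.order-placed"}],
    "transformation": "stream-processor",
    "outputs": [{"kind": "redshift", "name": "analytics.orders_daily"}],
}

orders_daily = parse_data_product(orders_daily_spec)
print(orders_daily.name, "owned by", orders_daily.owner)
```

Rejecting a spec that has no owner or no outputs at parse time is one way the metadata stops being optional documentation and becomes something automation can rely on.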
Automatic support for Data Discovery, Lineage and Quality monitoring
“Where can I find information about X?”, “That number on the dashboard seems wrong?”, “Where does that data come from?”. Those are questions that anyone who has worked as a Data Engineer has been asked at some point in their career. In order to allow modern Software Engineers to own their Data, they need to have ways of providing those answers to the rest of the teams. The only way to do it is via standards and automation. With a configuration like the one described above, we have a solid foundation to start tackling lineage and discovery because we can start to build a dependency graph between Data Products. There are very interesting open source projects like OpenMetadata that act as a central hub for relevant metadata, which can be consumed by both technical and non-technical users. Another important aspect is monitoring Data quality. More than writing unit tests, it is crucial that engineers have tools to ensure that the Data they are providing is correct. This is a field that has seen great traction over the last year with open source solutions such as Great Expectations, TensorFlow’s Data Validation module and Apache Griffin as well as companies such as Monte Carlo.
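The dependency graph mentioned above can be derived from declarative definitions with very little code. The sketch below is only an illustration under stated assumptions: the three Data Products, their dataset identifiers and the helper names build_lineage and all_upstream are hypothetical, and a real setup would more likely feed this information into a metadata tool than keep it in a dict.

```python
# Hypothetical Data Products: each one lists the datasets it reads and writes.
PRODUCTS = {
    "orders-cleaned": {
        "sources": ["kafka:orders.order-placed"],
        "outputs": ["redshift:analytics.orders"],
    },
    "orders-daily": {
        "sources": ["redshift:analytics.orders"],
        "outputs": ["redshift:analytics.orders_daily"],
    },
    "revenue-dashboard": {
        "sources": ["redshift:analytics.orders_daily"],
        "outputs": ["dashboard:revenue"],
    },
}


def build_lineage(products):
    """Map each product to the set of products it directly depends on."""
    # First pass: record which product writes each dataset.
    producer_of = {}
    for name, spec in products.items():
        for dataset in spec["outputs"]:
            producer_of[dataset] = name

    # Second pass: a product depends on whoever produces its sources.
    upstream = {name: set() for name in products}
    for name, spec in products.items():
        for dataset in spec["sources"]:
            producer = producer_of.get(dataset)
            if producer is not None and producer != name:
                upstream[name].add(producer)
    return upstream


def all_upstream(product, lineage):
    """Every product that `product` depends on, directly or transitively."""
    seen = set()
    stack = [product]
    while stack:
        current = stack.pop()
        for parent in lineage[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = build_lineage(PRODUCTS)
# Answers "where does the number on the revenue dashboard come from?"
print(sorted(all_upstream("revenue-dashboard", lineage)))
# prints ['orders-cleaned', 'orders-daily']
```

Because the graph is computed from the definitions themselves, it stays up to date as long as the definitions are what actually gets deployed.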
Final note on Building Data Platforms
This article is the last in the Building Data Platforms series, where I tried to show why Software Engineers need to evolve their ways of working to help companies become more agile in building better products and services. As of today, all of those products are driven by Data, and every system that is designed and built generates Data that might be used in a variety of scenarios. That is why this change is as important as DevOps was a decade ago.
This article is the conclusion of the journey that started with “The ETL Bias” and continued with “The Age of Plumber is over”. All of it is based on my professional experience, lessons learned, books, podcasts, papers, talks and thoughts that I have consumed and digested for many hours, nights and years. It aims to be a reflection on what needs to change in the way our industry works with and manages Data. I don’t expect you to agree with everything that was written here, but I would like you to stop for a moment and question some of the conclusions and assumptions that were presented here.