Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot.
In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. This post distills fragments of wisdom accumulated while working at Yahoo, Facebook, Airbnb and Lyft, with the perspective of well over a decade of data warehousing and data engineering experience.
Let’s start with a quick primer/refresher on what functional programming is about, from the functional programming Wikipedia page:
In computer science, functional programming is a programming paradigm — a style of building the structure and elements of computer programs — that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data. It is a declarative programming paradigm, which means programming is done with expressions or declarations instead of statements. In functional code, the output value of a function depends only on the arguments that are passed to the function, so calling a function f twice with the same value for an argument x produces the same result f(x)each time; this is in contrast to procedures depending on a local or global state, which may produce different results at different times when called with the same arguments but a different program state. Eliminating side effects, i.e., changes in state that do not depend on the function inputs, can make it much easier to understand and predict the behavior of a program, which is one of the key motivations for the development of functional programming. …
This post follows up on The Rise of the Data Engineer, a recent post that was an attempt at defining data engineering and described how this new role relates to historical and modern roles in the data space.
In this post, I want to expose the challenges and risks that cripple data engineers and enumerates the forces that work against this discipline as it goes through its adolescence.
Every once in a while I read a post about the future of tech that resonates with clarity.
A few weeks ago it was The Rise of the Data Engineer by Maxime Beauchemin, a data engineer at Airbnb and creator of their data pipeline framework, Apache Airflow. At Astronomer, Apache Airflow is at the very core of our tech stack: our integration workflows are defined by data pipelines built in Apache Airflow as directed acyclic graphs (DAGs). …
I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.
I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.
My team was at forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs to traditional methods.
We were pioneers. We were data engineers!
Data science as a discipline was going through its adolescence of self-affirming and defining itself. At the same time, data engineering was the slightly younger sibling, but it was going through something similar. The data engineering discipline took cues from its sibling, while also defining itself in opposition, and finding its own identity. …