Data pipeline programming principles

Clement Joudet
Published in 100m.io · Oct 1, 2018

Building a data pipeline can be a long and tricky task. Without the right structure, the code can drift into a complex black box. When approaching a data-centric engineering problem, it is necessary to keep some core values in mind: reproducibility, atomicity, and readability. From my experience at 100m.io and from these values, I have drawn some principles for building a robust and efficient data pipeline.

ETL

Extract, Transform and Load (ETL) is a time-proven common approach for data pipelines. The idea is to separate the pipeline into three big chunks:

  • Extract, where you get the raw data from heterogeneous sources
  • Transform, where you modify and compute the data you want
  • Load, where you save or move it where you need it

Organizing your code in an ETL structure is a good start for a data pipeline. Separating the processing steps like this lets you build independent blocks and makes the pipeline more robust. Although ETL doesn’t necessarily fit every data project, it is always useful to have in mind. ETL is only there to build more atomic blocks and to ensure that your pipeline is modular enough.
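As a minimal sketch of what this separation could look like in Python (the column names, file formats and function names here are illustrative choices, not prescribed by ETL itself):

```python
import pandas as pd

def extract(source_path: str) -> pd.DataFrame:
    """Extract: get the raw data from its source (here, a CSV file)."""
    return pd.read_csv(source_path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform: compute the data you want from the raw input."""
    df = raw.dropna(subset=["price"]).copy()
    df["price_eur"] = df["price"] * df["exchange_rate"]
    return df

def load(df: pd.DataFrame, target_path: str) -> None:
    """Load: save the result where you need it."""
    df.to_parquet(target_path)

def run_pipeline(source_path: str, target_path: str) -> None:
    load(transform(extract(source_path)), target_path)
```

Each block can now be replaced, reused or tested on its own.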

In data projects, once your pipeline is set up, modifications to the code usually take place in the Transform section. The extract and load steps rarely need changing, so it is useful to save the data locally between every step of the ETL. Separating the pipeline into these three parts allows you to run only the transformations you are working on, and to save a lot of time by skipping all the other steps.
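One possible way to get this behaviour, sketched below, is to checkpoint each step’s output on disk and reuse it when it already exists. The `cached` helper and the file paths are illustrative, reusing the `extract` and `transform` functions sketched above:

```python
from pathlib import Path
import pandas as pd

def cached(path: str, step, *args) -> pd.DataFrame:
    """Run `step` only if its output is not already checkpointed on disk."""
    target = Path(path)
    if target.exists():
        return pd.read_parquet(target)
    result = step(*args)
    result.to_parquet(target)
    return result

# While iterating on the transform, the extract output is read from disk
# instead of being recomputed (delete data/raw.parquet to force a re-run).
raw = cached("data/raw.parquet", extract, "source.csv")
clean = cached("data/clean.parquet", transform, raw)
```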

Functional programming

Before going further, let’s recall what functional programming is. From Wikipedia:

In computer science, functional programming is a programming paradigm — a style of building the structure and elements of computer programs — that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data. It is a declarative programming paradigm, which means programming is done with expressions or declarations instead of statements. In functional code, the output value of a function depends only on the arguments that are passed to the function, so calling a function f twice with the same value for an argument x produces the same result f(x) each time; this is in contrast to procedures depending on a local or global state, which may produce different results at different times when called with the same arguments but a different program state. Eliminating side effects, i.e., changes in state that do not depend on the function inputs, can make it much easier to understand and predict the behavior of a program, which is one of the key motivations for the development of functional programming.

Functional programming emphasizes coding atomic functions without side effects. A function or expression is said to have a side effect if it modifies some state outside its local environment or has an observable interaction with the outside world besides returning a value.

Examples of side effects in a program:
- performing I/O
- modifying a non-local variable
- modifying a static local variable
- modifying an argument passed by reference

Side effects are bad for reproducibility because they rely on state. You may run the same function with the same arguments twice, but if side effects are involved the outputs can differ. This happens quite often with class methods that use class attributes in their logic.
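To make the class-attribute case concrete, here is a small illustration (the names are hypothetical): the method’s result depends on hidden state, while the pure version depends only on its arguments:

```python
class Converter:
    def __init__(self) -> None:
        self.rate = 1.0                 # hidden state

    def set_rate(self, rate: float) -> None:
        self.rate = rate                # side effect: mutates the instance

    def to_eur_method(self, amount: float) -> float:
        # Same argument, different results depending on earlier set_rate calls.
        return amount * self.rate

def to_eur(amount: float, rate: float) -> float:
    """Pure version: the output depends only on the arguments."""
    return amount * rate
```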

In functional programming, all the arguments and outputs of a function are listed. The function is stateless and doesn’t depend on anything other than its arguments. An immediate consequence is readability: no knowledge from outside the scope of the function is needed to understand the code.

Functional programming is enabled by manipulating immutable objects and constants. Mutating objects can be seen as a side effect, and mutating variables makes it harder to prove formally what your program is doing. If you assign your constant only once, it is easier to understand how this constant was created and what led to its value.
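As a small sketch of the difference, the first function below mutates its argument while the second builds a new object and leaves its input untouched:

```python
def add_vat_in_place(prices: dict) -> None:
    # Mutates the caller's dict: a side effect that is easy to miss.
    for name in prices:
        prices[name] *= 1.2

def add_vat(prices: dict) -> dict:
    # Returns a new dict; the input is left untouched.
    return {name: price * 1.2 for name, price in prices.items()}
```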

Functional programming is also math. It came from lambda calculus, and it makes mathematical computations easier to read and understand. In a data pipeline, your core transformations are always mathematical. Functional programming provides a formal system to define and apply logic to your data in a robust way.

On top of the previous principles that come with functional programming, we added a few to help us improve our code at 100m.io:

  • Functions should be kept small: 10–15 lines
  • Naming functions and variables is of foremost importance: names have to be meaningful.

Once we had these two principles, we understood that readability and maintainability were also core principles. This led us to optimize for readability when coding. Except for the few functions that need to be optimized for execution time, we always prefer to trade execution time for readability.

Functional programming brings clarity: when each function is atomic and independent of any external state, debugging becomes easy. Don’t get me wrong, I’m not saying Object-Oriented Programming (OOP) is bad. It has many useful applications and can actually be paired with functional programming in a meaningful way. But at heart, the core operations in your data pipeline should be described as simple, stateless functions.

Tests

When developing, testing is necessary to ensure the quality of the processing and of the output.

Tests on output data should focus on the business: they should check the quality of the data. These tests range from simple checks (missing data, NaN values, data types) to business-specific tests.
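For instance, a few of the simple checks mentioned above could look like the sketch below (the column name and the business rule are placeholders):

```python
import pandas as pd

def check_output(df: pd.DataFrame) -> None:
    """Basic data-quality checks on the pipeline output."""
    assert not df.empty, "output is empty"
    assert df["price_eur"].notna().all(), "missing values in price_eur"
    assert pd.api.types.is_float_dtype(df["price_eur"]), "price_eur should be a float column"
    assert (df["price_eur"] >= 0).all(), "business rule: prices must be non-negative"
```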

But your tests should not be limited to the output data. ETL and functional programming are a solid base to test your data processing. Short and atomic functions in your program are the easiest to test. Writing tests ensures your functions behave as predicted and makes your pipeline more robust. Functional programming paves the way for easy testing.

Having an ETL pipeline allows you to test output data at every step. The more atomic your processing steps are, the more in control you are. You can test inputs and outputs of every step.
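Because each step is a stateless function, unit-testing it only requires building an input and checking the output. A sketch with pytest, reusing the hypothetical `transform` from the earlier ETL example:

```python
import pandas as pd

def test_transform_converts_prices():
    raw = pd.DataFrame({
        "price": [10.0, None, 20.0],
        "exchange_rate": [2.0, 2.0, 2.0],
    })
    result = transform(raw)
    # The row with a missing price is dropped, the others are converted.
    assert len(result) == 2
    assert result["price_eur"].tolist() == [20.0, 40.0]
```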

Conclusion

An ETL structure, functional programming and tests allow us to build a sane base for our data pipelines. All these principles come from core values: reproducibility, atomicity, and readability.
