Abstract and Conquer Data Engineering

Abstraction and simplification helped me discover a clear path to learn data engineering

Bernd Wessely
4 min read · Apr 2, 2024

I feel for you young data engineers. With the sheer amount of data tools and databases available on the market, how on earth should you get started? Sure, you could try to decipher one of the tremendously detailed landscape maps to get an overview… just an example:

I hear you: “Oh my god, this is insane” and yes, you’re right. This landscape picture may provide good information for experienced data engineers, but it does not qualify as a map that gives learners a clear overview.

I used a different approach, with some fundamental abstractions that helped me build a simpler mental model. Let me show you these high-level abstractions; they may also help you find your way through the data engineering jungle.

Let’s start with compute (transformation logic) and storage (data saved in durable form). The transformation logic implements business requirements, consuming and producing data in the process. Data itself can be in motion, meaning it is being processed or forwarded to subsequent processing, or it can come to rest and optionally be persisted on a storage medium. Data at rest is needed when there is no consumer to process the data further right away. Persisted data is always data at rest, but data at rest can also be kept only in memory and hence not be persisted. Data at rest can be mobilized again at any time and then turns back into data in motion.

Data is input to transformation logic that applies (business) rules to produce output data without affecting the input data. The output data can in turn be input to a subsequent transformation step that produces further output data. If we interconnect all transformation steps like this, we get the well-known directed acyclic graph (DAG) of transformations connected by data inputs and outputs.
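To make the DAG idea tangible, here is a minimal Python sketch. The step names (clean, total, count, average) and the sample orders are made up for illustration, and the standard-library graphlib module (Python 3.9+) is just one possible way to order the steps. Each step reads its inputs and returns new output without touching the input data.

```python
from graphlib import TopologicalSorter

# Hypothetical transformation steps: each returns new data, inputs stay untouched.
def clean(orders):
    return [o for o in orders if o["amount"] > 0]

def total(cleaned):
    return sum(o["amount"] for o in cleaned)

def count(cleaned):
    return len(cleaned)

def average(total_value, count_value):
    return total_value / count_value if count_value else 0.0

# Step name -> (transformation logic, names of its input data)
STEPS = {
    "clean": (clean, ["orders"]),
    "total": (total, ["clean"]),
    "count": (count, ["clean"]),
    "average": (average, ["total", "count"]),
}

def run_dag(steps, source_data):
    # Map every step to the set of data it depends on (its predecessors).
    graph = {name: set(inputs) for name, (_, inputs) in steps.items()}
    results = dict(source_data)
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # source data, nothing to compute
        func, inputs = steps[name]
        results[name] = func(*(results[i] for i in inputs))
    return results

orders = [{"amount": 10.0}, {"amount": -5.0}, {"amount": 30.0}]
results = run_dag(STEPS, {"orders": orders})
print(results["average"])
```

The topological order guarantees that a step only runs once all of its input data exists, which is all a DAG of transformations really demands.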

Applications act as a bracket around these individual transformation steps. An application can be a big monolithic software product or as small as a single microservice. In any case, these applications aggregate and organize all available transformation steps for the enterprise (or even for larger realms…).

The data itself is organized and managed in data storage systems. Data storage systems, whether application-specific storage systems or general database management systems, keep all data in motion as well as all data at rest. A data channel is a special data storage system that transports data between the components of an application and also between different applications in the enterprise. Data inside the data channel is always data in motion. Even when data is temporarily put at rest to be buffered, it’s still considered data in motion because the main purpose of the channel remains to transport data from source to destination. A data store, on the other hand, is a data storage system for data at rest. It optionally persists the data in several files, each limited in terms of both time and content. A data file can be updated, and the result can then be identified as a new file version.
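The difference between a data channel and a data store can be sketched in a few lines of Python. The class names DataChannel and DataStore and their methods are my own illustrative choices, not the API of any real product. Both are kept in memory here, so the store holds data at rest without persisting it.

```python
from collections import deque

class DataChannel:
    """Transports records from producer to consumer; buffering is only temporary."""

    def __init__(self):
        self._buffer = deque()

    def send(self, record):
        self._buffer.append(record)

    def receive(self):
        # Hand over the oldest buffered record, or None if the channel is empty.
        return self._buffer.popleft() if self._buffer else None

class DataStore:
    """Keeps data at rest in files; every update becomes a new file version."""

    def __init__(self):
        self._files = {}  # file name -> list of versions, oldest first

    def write(self, name, content):
        versions = self._files.setdefault(name, [])
        versions.append(content)
        return len(versions)  # 1-based version number

    def read(self, name, version=None):
        versions = self._files[name]
        return versions[-1] if version is None else versions[version - 1]

channel = DataChannel()
store = DataStore()

channel.send({"sensor": "t1", "value": 21.5})
record = channel.receive()  # data in motion reaches its destination
v1 = store.write("readings", [record])  # data comes to rest
v2 = store.write("readings", [record, {"sensor": "t1", "value": 22.0}])
```

Once received, a record is gone from the channel, while the store still answers for both versions of the file.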

We divide applications into source systems and transformation systems. Source systems produce source data that has not previously been available in electronic form. This data is new in the sense that it cannot be derived from already existing data. Instead, it is recorded by humans or automatically generated by sensors, optical and audio systems, other devices (see IoT), or monitoring/logging systems. Transformation systems do not produce new data but solely derive data from already existing source data. All source data together with all derived data represents the complete information in an enterprise.
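One way to picture this split is a simple lineage mapping. In the Python sketch below, the data set names and the lineage dictionary are invented for illustration: a data set without inputs counts as source data, and everything else is derived and can be traced back to its sources.

```python
# Hypothetical lineage: data set name -> names of the data sets it is derived from.
lineage = {
    "sensor_readings": [],   # generated by sensors
    "manual_orders": [],     # recorded by humans
    "daily_average": ["sensor_readings"],
    "order_report": ["manual_orders", "daily_average"],
}

def classify(lineage):
    """Label each data set as source data (no inputs) or derived data."""
    return {
        name: "source" if not inputs else "derived"
        for name, inputs in lineage.items()
    }

def source_roots(lineage, name):
    """Trace a data set back to the source data it ultimately derives from."""
    inputs = lineage[name]
    if not inputs:
        return {name}
    roots = set()
    for input_name in inputs:
        roots |= source_roots(lineage, input_name)
    return roots

kinds = classify(lineage)
roots = source_roots(lineage, "order_report")
```

In this toy lineage, order_report traces back to both manual_orders and sensor_readings, even though it only reads one of them directly.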

With these few abstractions I was able to categorize all kinds of tools and map them to my mental model. I also used them to define well-known data architectures like Data Warehouse, Data Lake(house), Data Fabric, Data Mesh, Data ? (you name it).

Just as an example, let’s use it to categorize and define the classical Data Warehouse:

A classical Data Warehouse is a transformation system that transforms source data extracted from source systems into derived data optimized for analytical purposes. The derived data is organized, and partially also transformed, in relational data storage systems (relational database management systems, or RDBMS, are the backbone of a classical Data Warehouse) and persisted in these systems’ data stores (a data store being a collection of data files with file versions). The data storage system also provides query and transformation capabilities that client applications can use to derive insight from the data kept in the data store (= Business Intelligence). The data can even be queried by AI applications for training processes that produce transformation logic and derived data saved as ML models.
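A toy version of this definition, as a rough Python sketch: SQLite from the standard library stands in for the RDBMS, and the table names, columns, and sample rows are invented for illustration. The database lives in memory here, so nothing is persisted; a real warehouse would of course write to durable storage.

```python
import sqlite3

# Source data as it might be extracted from a source system (made-up sample rows).
source_orders = [
    ("2024-04-01", "books", 12.5),
    ("2024-04-01", "games", 40.0),
    ("2024-04-02", "books", 7.5),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the relational storage system

# Load the extracted source data into the relational data storage system.
conn.execute("CREATE TABLE orders (order_date TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", source_orders)

# Transformation inside the storage system: derived data optimized for analysis.
conn.execute(
    """
    CREATE TABLE sales_by_category AS
    SELECT category, SUM(amount) AS revenue
    FROM orders
    GROUP BY category
    """
)

# A client application (Business Intelligence) queries the derived data.
report = conn.execute(
    "SELECT category, revenue FROM sales_by_category ORDER BY category"
).fetchall()
conn.close()

for category, revenue in report:
    print(category, revenue)
```

The source table stays untouched while the derived table carries the analytical view, which mirrors the split between source data and derived data described earlier.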

If you think you need additional abstractions to complete your mental model for a specific tool or architecture, please let me know. I am very curious which mental model helps you keep the overview.
