How to make my Data Engineering department shine again

Alexandre Colicchio
papernest
8 min read · Mar 24, 2022
Image by Towfiqu barbhuiya from unsplash.com

Introduction

Even the most productive gold mine may not be profitable!

In my time as a Data Engineer and Head of Data Engineering, I have had the chance to see (and catalogue) many mistakes that can drive fantastic projects to ruin.

You undoubtedly have a fantastic product/company, but you have this unpleasant feeling of being held back in your ascent by the same recurring reason: data.

Delays, errors, and long development cycles for every new need are symptoms that your data architecture needs a little cleaning, and a sign that this article will interest you!

The purpose of this article is not to detail each concept, but to help you pave the way towards regained stability with a proven method.

Summary

  1. Do I need Data Engineers?
  2. Prospecting: (Re)Discover your data (Data Catalog)
  3. Traceability: Reduce investigation times (Data Lineage)
  4. Prioritize: Root-cause analysis and classification (Data Stewardship)
  5. Isolate: Review your ETL process

Bonus: Fight obsolescence: don't be technology-dependent!

Do I need Data Engineers?

There is only one possible answer: yes.

Data engineering is a combination of several disciplines that have existed for many years: software engineering, DevOps, and data analysis. Do not use this new name as an excuse to reinvent everything; instead, build on the many years of feedback from these disciplines to create a stable and reliable system. The proof: the approach in this article is nothing more, nothing less than an adaptation of the ITIL method, well known in the world of IT service management.

However, the success of a technical project often depends on the team capitalizing on past mistakes, and recruiting people from the field helps you avoid the usual pitfalls.

Tip #1: Trust your team's judgment. In the world of data, operational knowledge is vital for adapting decision-making. Having a macro view is essential to stay on course, but the devil is in the details. You will never be able to know everything in detail, so consult your teams and do not try to decide everything in advance.

Prospecting: (Re)Discover your data (Data Catalog)

Documentation is mandatory. That's the theory. The reality is quite different.

Over time, knowledge of historical flows and transformations ends up being eroded. Moreover, in architectures built iteratively over many years, the documentation has often evolved more slowly than the pipelines.

It’s time to grasp the nettle and rediscover your data.

To do this, list your data sources, starting with the most important from a business point of view, and for each one record (a minimal catalog entry is sketched after this list):

  • the source and the extraction method (FTP, API, etc.)
  • the extraction frequency
  • the transformation steps
  • the destination(s)
  • the audience (at least external vs. internal)
  • the frequency of use
  • bonus: the error rate (relative to time, such as n bugs/week)
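
To make this concrete, here is a minimal sketch of what one catalog entry could look like as a plain Python dataclass; the fields mirror the list above, and the flow name, values, and field names are purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One entry of a minimal, hand-maintained data catalog (illustrative fields)."""
    name: str                      # human-readable name of the flow
    source: str                    # where the data comes from
    extraction_method: str         # FTP, API, database dump, ...
    extraction_frequency: str      # e.g. "hourly", "daily"
    transformation_steps: list = field(default_factory=list)
    destinations: list = field(default_factory=list)
    audience: str = "internal"     # at least internal vs. external
    usage_frequency: str = ""      # how often consumers read the output
    error_rate: str = ""           # bonus: e.g. "2 bugs/week"

# Example entry for a hypothetical billing flow
billing_invoices = CatalogEntry(
    name="billing_invoices",
    source="billing SFTP server",
    extraction_method="FTP",
    extraction_frequency="daily",
    transformation_steps=["parse CSV", "deduplicate", "join with customers"],
    destinations=["data warehouse: billing.invoices"],
    audience="internal",
    usage_frequency="daily (finance dashboard)",
    error_rate="~1 bug/week",
)
```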

This work is laborious but necessary. In addition, it will greatly help the refactoring of your flows by guiding prioritization.

Tip #2: It will initially be impossible to dissect everything, especially given the urgency of making things reliable. Priority goes to the most important business flows. Powerful tools exist on the market to automate this task; however, their implementation is long and expensive, especially on a still fragile architecture. Use a simple diagramming tool so as not to waste time, and focus on making your architecture reliable. But do not forget to iterate on your Data Catalog early in the implementation of your reliable solution, otherwise you will be back in the same situation in a few months or years.

Traceability: Reduce investigation times (Data Lineage)

I have bad news: you will not solve all your problems in a few hours. Nor in a few days, or even a few months. And especially not with the pretty diagrams in this article!

It is thanks to the painstaking work of your team, under your leadership, that the situation will improve little by little. In any case, if you try to change everything at once, you risk failing or, worse, crashing your business.

Now that you know your data and the daily maintenance work of your team, it's time to save them the time needed to stabilize the architecture. Your team is probably already swamped by debugging tasks; you have to give them oxygen so they can make the solution more reliable.

When designing a technical solution, we often forget to include an efficient debugging solution: this is where Data Lineage comes in.

It must be simple, and it must not grow the technical stack or the processing time. Take a database you already work with and create a set of tables there to store processing metadata. Don't be shy: store everything.

When we ingest a file from a drive, we log it in the data lineage with its name, date, stacktrace, result (success or failure), where we took it from, and where we stored it. All of this information is readily available in your system during processing, so storing it is only a minor development. On the other hand, when a problem occurs, we can directly locate the problematic file and the reason, which drastically speeds up resolution time.
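
As a minimal sketch of that idea, assuming SQLite stands in for a database you already work with; the table and column names below are illustrative.

```python
import sqlite3
import traceback
from datetime import datetime, timezone

# A database you already work with; SQLite stands in here for illustration.
conn = sqlite3.connect("lineage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_ingestions (
        file_name    TEXT,
        ingested_at  TEXT,
        source       TEXT,
        destination  TEXT,
        status       TEXT,   -- 'success' or 'failure'
        stacktrace   TEXT
    )
""")

def ingest_with_lineage(file_name, source, destination, ingest):
    """Run the ingestion callable and record everything we already know about it."""
    status, stacktrace = "success", None
    try:
        ingest(file_name, source, destination)   # your existing ingestion logic
    except Exception:
        status, stacktrace = "failure", traceback.format_exc()
        raise
    finally:
        # Everything logged here is already in memory at this point,
        # so the lineage costs almost nothing at processing time.
        conn.execute(
            "INSERT INTO file_ingestions VALUES (?, ?, ?, ?, ?, ?)",
            (file_name, datetime.now(timezone.utc).isoformat(),
             source, destination, status, stacktrace),
        )
        conn.commit()
```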

Tip #3: nothing beats communication. Having bugs is normal. The Data Engineer's job is thankless in the sense that it is normal when everything goes well, so the perception of a bug from other departments is generally severe. Setting up dashboards on this new Data Lineage makes it possible to communicate on refresh states and on current bugs. The tool itself must remain internal to Data Engineering, but when your dashboard reports a bug, communicate as quickly as possible with a forecast ETA. Over time, other departments will be reassured about error handling and will find temporary workarounds more easily.

Don't forget to reassure your team: the data lineage is not a tool to track their failures, but a tool for them, making error management far more comfortable.

Prioritize: Root-cause analysis and classification (Data Stewardship)

Back on solid ground: at this stage you have been able to provide visibility into your ETL processes. Your team has saved time resolving common bugs, and you have probably already been able to make some of your most critical flows more reliable.

Now is the time to tackle the underlying issues and make your company more aware of data issues and of your newfound place in the organization.

We will have to introduce a real data-driven approach to the prioritization of problems, and for this you will find your heart's desire in Data Stewardship!

A very simple method consists of starting from the problems (that is, the consequences), which everyone encounters in their daily work and which everyone is able to formulate. These consequences are then categorized and linked to groups of known causes. This last step makes it possible to rank the causes having the most consequences and therefore to take targeted actions to correct them (a toy example follows). Applying this method systematically, involving the stakeholders, makes the company responsible for data issues and allows you to put medium/long-term roadmaps back in place.
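
Here is a toy sketch of that classification step; both the reported problems and the cause groups are invented for illustration.

```python
from collections import Counter

# Consequences reported by the business, tagged with their known root-cause group
# (both the examples and the cause groups are illustrative).
reported_problems = [
    ("dashboard showed stale figures on Monday", "late upstream extraction"),
    ("marketing export missing a column", "schema change not propagated"),
    ("finance report had duplicated rows", "non-idempotent loading"),
    ("dashboard showed stale figures again", "late upstream extraction"),
    ("API export failed overnight", "late upstream extraction"),
]

# Rank cause groups by the number of consequences they produce:
# the top of this list is where targeted corrective actions pay off most.
cause_ranking = Counter(cause for _, cause in reported_problems).most_common()
for cause, count in cause_ranking:
    print(f"{count:2d} consequence(s) <- {cause}")
```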

Tip #4: Do not underestimate the impact that getting teams out of a purely reactive mode can have on their state of mind. Being able to plan medium/long-term tasks brings a lot of peace of mind and guarantees the timing of the migrations and improvements to be undertaken. Coupled with good communication, it will further improve visibility for other departments.

Method: https://tdan.com/a-practical-tool-for-data-stewards/19764

Isolate: Review your ETL process

Last but not least: separate extraction from transformation and, by extension, from loading.

Indeed, most of your bugs will concern the transformations performed on the data, which evolve over time and often depend on the data schema. Whatever your architecture, you must ensure at all costs that you never lose data.

To do this, the extraction of data from the different sources must be completely independent of any transformation concerns, so that you are sure to keep a raw copy on your systems. You win on all counts: you protect yourself from data loss, you can regenerate the refined data at any time, and you reduce the consequences when a long-running bug gets resolved.

You got it: making data extraction reliable at all costs should be the first step in upgrading your architecture.
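
As a minimal sketch of this separation, assume a local directory stands in for your raw landing zone and fetch is a hypothetical callable returning the payload as bytes.

```python
from datetime import datetime, timezone
from pathlib import Path

RAW_ZONE = Path("raw")  # illustrative landing zone; could be object storage in practice

def extract(source_name, fetch):
    """Fetch raw bytes from a source and persist them untouched, before any transformation."""
    raw_bytes = fetch()  # hypothetical callable returning the payload as bytes
    target = RAW_ZONE / source_name / f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.raw"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(raw_bytes)
    return target  # transformation jobs read from here and can always be replayed

def transform(raw_file):
    """Separate step: if parsing fails, the raw copy is still safe on disk."""
    data = raw_file.read_bytes()
    ...  # parsing, typing, and business rules live here, never in extract()
```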

As a second step, quick actions can be taken to deeply improve the loading part of your ETL process: most errors are related to schema changes (columns added or removed, types changed). We have all faced them; I can see you smiling.

And yet, it is not inevitable. Avoid serializing or typing your data too early; here again, split formatting from loading. Add a layer, hidden from your users, in your Data Warehouse or public database where all your data is pushed as STRINGs with no restriction on the schema. For your users, expose views instead: they are much more flexible with regard to changes in schemas and types, updated in real time, and above all they meet the need exactly (a small sketch follows).
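
Here is a small sketch of that two-layer idea, using SQLite as a stand-in for the warehouse; the table, view, and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse here

# Hidden layer: everything lands as text, no constraint on the schema.
conn.execute("""
    CREATE TABLE raw_orders (
        order_id   TEXT,
        amount     TEXT,
        created_at TEXT
    )
""")
conn.execute("INSERT INTO raw_orders VALUES ('42', '19.90', '2022-03-24')")

# User-facing layer: a view that casts and renames. A schema change upstream
# only requires updating the view, not reloading the data.
conn.execute("""
    CREATE VIEW orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount   AS REAL)    AS amount,
           created_at
    FROM raw_orders
""")

print(conn.execute("SELECT * FROM orders").fetchall())  # [(42, 19.9, '2022-03-24')]
```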

Tip #5: One last point: get rid of any SELECT * lying around in your systems, and be selective and restrictive when consuming data rather than when simply moving it between systems. You will then see a significant part of the transformations that happened too early migrate into views over your raw data: without realizing it, you have gone from ETL to ELT. Bravo!

Bonus

An additional point for systems making data available to users: always make sure to put an API between your databases and your users. In addition to allowing customized control of quotas and rights, you will be able, for example, to change DBMS technologies without your users even noticing. Putting such APIs in place represents an almost zero development cost.
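
As a minimal sketch of the idea, assuming FastAPI (not necessarily your stack) and a hypothetical get_customer() that hides the storage layer:

```python
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()

def get_customer(customer_id: int) -> Optional[dict]:
    # Illustrative in-memory store standing in for the real database layer;
    # swapping the DBMS behind this function never touches the route contract.
    fake_db = {1: {"id": 1, "name": "Ada"}}
    return fake_db.get(customer_id)

@app.get("/customers/{customer_id}")
def read_customer(customer_id: int):
    customer = get_customer(customer_id)
    if customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    return customer
```

Served with uvicorn (for example, `uvicorn api:app`), users only ever see the /customers route; as long as that contract stays stable, the database behind it can change without them noticing.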

This advice is also valid in micro-service architectures, but don't do it too soon: depending on the volumes, these seemingly simple APIs can become real bottlenecks. Your choice!
