My data portfolio — Case #01: Simple principles in Data Processing w/ huge impacts

Jonas Scherer
5 min read · Jan 15, 2024


Photo by JJ Ying on Unsplash

Here is my approach to helping a Data Team deliver better and faster before worrying too much about the perfect setup for your data infrastructure.

I call it “principles” because it is a series of concepts and tools that guide a Data Team to its own “status quo”.

Even without a single line of code, I would still call it a portfolio case, because it comes from my own experience in a very specific context and it is decoupled from technologies (or could easily be adapted to many of them).

Kick-starting — Challenges along the way:

Let’s say that I had just landed in the company as a Data Engineer, and it took me less than two weeks to understand a few things about how data was being processed across the company:

  • Hundreds of ETL jobs running on different servers through cron jobs (old but gold 😅);
  • No documentation about the concurrency or schedules of those flows;
  • A lot of ETL flows with duplicated code/functions and some data being duplicated across multiple databases;
  • Data team deliveries were very slow because of the lack of visibility and understanding of the data and business rules.

Can we compress all of that into a single word? I call it LEGACY… We could also use a fancier word for it: Data Mesh (or maybe even Data Swamp?). Anyway, a team without vision and visibility is a lost team, always trying to deliver only what internal customers ask for, in a very inefficient way.

Value delivery:

After 16 months of rebuilding the data vision through many projects across the company (tech details come afterwards), I was able to:

  1. Reduce data bugs by 10x and increase team development deliveries by 5x. A clear view of “what is running”, “when it is running” and “why it stopped working” was the key point here;
  2. Deliver data transformations up to 30x faster after implementing better modeling concepts to integrate with BI tools and help internal customers in their analyses;
  3. Help reduce the company’s churn rate by 5% through data quality and governance, acting in a more proactive way to solve data issues;
  4. Deliver transformations that significantly increased the company’s MRR (monthly recurring revenue — meaning customers paying more for data solutions 🤑).

The Idea:

Implementing tools to increase the visibility of data in transit and data at rest for all legacy flows, and cleaning things up using best practices for data/development, towards team efficiency and assertiveness. Seems easy, right? The long explanation is here:

First of all, we needed an orchestration tool and some effort to migrate those cron jobs to something that gives us visibility into what is happening. Prefect comes in here as a great alternative to Airflow, since its requirements are cheap and there are no huge dependencies to run it. I like to call it the “Christmas tree” for data teams, where we check whether something is blinking green or red (a minimal flow sketch follows the example screenshot below):

Example Prefect screenshot, from https://docs.dados.rio/guia-desenvolvedores/pipelines-cloud/
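To make this more concrete, here is a minimal sketch of what one of those cron jobs could look like once migrated — assuming Prefect 1.x (the version that uses labels and flow registration); the flow name, schedule and project name are hypothetical placeholders, not the actual production setup:

```python
from datetime import timedelta

from prefect import Flow, task
from prefect.schedules import CronSchedule


@task(max_retries=3, retry_delay=timedelta(minutes=5))
def extract():
    # Pull rows from the source system (placeholder logic).
    return [{"id": 1, "value": 10}]


@task
def transform(rows):
    # Apply business rules (placeholder logic).
    return [{**row, "value": row["value"] * 2} for row in rows]


@task
def load(rows):
    # Write the result to the target database (placeholder logic).
    print(f"loaded {len(rows)} rows")


# Same cron expression the old crontab used, but now run history,
# retries and failures all show up in the Prefect UI.
with Flow("customers-daily-etl", schedule=CronSchedule("0 3 * * *")) as flow:
    load(transform(extract()))


if __name__ == "__main__":
    flow.register(project_name="legacy-migration")  # hypothetical project name
```

The retries alone already remove a whole class of “the cron job silently died at 3 AM” problems, and the schedule now lives next to the code instead of in a server’s crontab.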

After migrating hundreds of flows, it’s time to clean house, since we now have much more transparency about what is happening. Some requirements that were met:

  • A single repository for data processing and code reviews — source code was reviewed and centralized to avoid duplication, and the team was leveled up to open pull requests and do code reviews;
  • No duplicated code across flows and no hauling data around to duplicate it across multiple databases — basically DRY concepts for developing flows; if one database must access another, it should be done through database mappings to avoid inconsistencies;
  • ETLs producing unused data should be terminated — the work here was dealing with people and creating a plan to retire things without impacting production datasets and their customers. It is always worth the effort, because it reduces points of failure;
  • Data flows should be labeled to avoid concurrency issues — Prefect allows us to set labels and timeouts to avoid database overload (see the Prefect sketch after this list);
  • Data flows should be deployed through CI/CD — we are talking about TDD here, avoiding manual work across the data team (a tiny test sketch also follows this list);
  • Every flow must log its output and alert if something is going wrong — it’s easier to fix bugs when we don’t need to run everything on our own machines. Also, with Prefect there’s no need to dig through server logs; it is all consolidated in a single platform;
  • Data processing should be segregated from data viewing — this comes from data being processed by the data team and accessed by customers in the same database, causing load issues in the main platform. To solve that, Dremio was implemented as an analytics/acceleration tool for internal BI platforms, and the data for platform access was offloaded from the main database to a view-only database serving external customers;
  • Data infrastructure should be the same across environments — here Docker/Compose comes in handy. Every flow runs in a Docker container with the same packages in development, staging and production;
  • Data quality checks come together with ETL development — as a new requirement for an ETL comes in, every output of that flow should have another flow auditing its quality. I deployed SODA SQL here to check data at rest and send alerts if something bad is happening (a rough sketch follows the screenshot below):
SODA SQL UI example from https://www.soda.io/resources/how-to-get-started-managing-data-quality-with-sql-and-scale-2
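To show how that last requirement can be wired into a flow, here is a rough sketch using soda-sql’s programmatic scan interface — treat it as an assumption-heavy illustration: the YAML paths are hypothetical and method names may differ between soda-sql versions:

```python
from sodasql.scan.scan_builder import ScanBuilder


def run_quality_scan() -> None:
    # Build a scan from the warehouse connection settings and the table's
    # checks (row counts, missing values, validity tests) defined in YAML.
    scan_builder = ScanBuilder()
    scan_builder.warehouse_yml_file = "warehouse.yml"   # hypothetical path
    scan_builder.scan_yml_file = "tables/orders.yml"    # hypothetical path
    scan = scan_builder.build()

    scan_result = scan.execute()

    # Fail loudly so the orchestrator marks the audit flow as red
    # and the alerting kicks in.
    if scan_result.has_test_failures():
        raise RuntimeError("Data quality tests failed for tables/orders.yml")
```

In practice this runs as its own flow right after the ETL it audits, so a broken dataset turns into a visible red run instead of a silent surprise for the customer.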
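For the labeling, timeout, logging and alerting requirements above, a minimal Prefect 1.x sketch could look like the following — the label name, flow name and the print-based alert are hypothetical (Prefect also ships ready-made notifiers, e.g. a Slack state handler, that could replace the custom one here):

```python
import prefect
from prefect import Flow, task
from prefect.run_configs import LocalRun


def alert_on_failure(obj, old_state, new_state):
    # State handler: called on every state change; alert only on failures.
    if new_state.is_failed():
        # Replace the print with a call to your alerting channel of choice.
        print(f"ALERT: {obj.name} failed: {new_state.message}")
    return new_state


@task(timeout=1800, state_handlers=[alert_on_failure])  # kill the task after 30 minutes
def heavy_transformation():
    logger = prefect.context.get("logger")  # these logs land in the Prefect UI
    logger.info("Starting heavy transformation on the reporting tables")
    # ... SQL / pandas work goes here ...
    logger.info("Done")


with Flow("reporting-refresh", state_handlers=[alert_on_failure]) as flow:
    heavy_transformation()

# The label routes this flow to the agent dedicated to the reporting database,
# so heavy flows don't pile up concurrently on the same server.
flow.run_config = LocalRun(labels=["reporting-db"])
```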
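And for the CI/CD + TDD point, the idea is simply that each flow’s transformation logic is plain Python that the pipeline can test before anything gets deployed; a tiny illustrative pytest example (module and function names are hypothetical):

```python
# test_transformations.py — runs in the CI pipeline before the flow is redeployed.
from my_flows.transformations import normalize_customer  # hypothetical module


def test_normalize_customer_trims_and_lowercases_email():
    raw = {"id": 42, "email": "  Jane.Doe@Example.COM "}
    assert normalize_customer(raw)["email"] == "jane.doe@example.com"


def test_normalize_customer_keeps_id_untouched():
    raw = {"id": 42, "email": "jane@example.com"}
    assert normalize_customer(raw)["id"] == 42
```

If these tests pass, the CI job builds the Docker image and registers the flow; if they fail, nothing reaches staging or production.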

Design Overview

Some key points:

  • The development environment runs on each Data Engineer’s/Scientist’s machine and is the same as production. Data for development is processed locally and never loaded into production databases.
  • All production data is processed inside our cloud. Only metadata is sent to 3rd parties (like schedules, logs, flow states, etc…).
  • Prefect + Soda Cloud versions are an extra. We could run everything using their open-source versions and monitor internally with a little effort.

That’s it? What else?

One orchestration tool, one data quality tool, one acceleration tool and a few constraints… Is that all? Really? Go back to the “Value delivery” section and read it again.

Never underestimate the power of a few changes in a huge complex environment.

A few things came afterwards, but that’s a topic for another case. If you made it this far, thanks for reading and for your time. Feel free to get in touch with me through LN or the DataHackers BR community.
