Solving The Root Cause as a Data Engineer

Rendy B. Junior
Ruangguru
Nov 1, 2022 · 6 min read

What are the responsibilities of data engineers?

If you ask data engineers this question, the answers will revolve around data wrangling, data preparation, data cleansing, data collection, and so on. Yes, those are correct answers, but have you ever wondered why these problems exist in the first place? Why do you have to cleanse the data so that data scientists can use it? Why isn’t the data clean from the start, like clean water flowing from a mountain spring?

Photo by Jeffrey Workman on Unsplash

I would like to argue that most, if not all, of the problems data engineers try to solve are avoidable in the first place. I would also argue that solving a problem upstream is a lot cheaper than dealing with it in the data pipeline. In this post, I will share what we did and what we are working on at Ruangguru.

The Data Problems

First, let’s break down the common problems we data engineers are trying to solve.

  • Data silos: data is scattered across many places (MySQL, Postgres, log files, CSV, etc.) and needs to be collected in one.
  • Dirty data: test accounts, wrong string formats, duplicates.
  • Unsure which data we should use: Payment data or transaction data? They both talk about revenue and sales.

These are common issues I have found, or heard about, at several companies. To solve them, data engineers usually build a “data pipeline” that collects, cleanses, and processes data into one place called a “data lake” or “data warehouse”. But do the data pipeline and data warehouse solve all the problems? Sometimes they do, and sometimes they don’t. Here are a few other common complaints I have heard about the results of such implementations:

  • A new data warehouse = yet another copy of the data, with the old data still being used. The SSOT (single source of truth) never materializes, and the bills and resources required to store the data go up.
  • Data is not being used because it is hard to understand.
  • Data is not being used because it is not trusted (different number compared to the current report).
  • Failing too often, unreliable, and not timely. This is common as data pipelines usually combine a lot of sources, and if one of them fails, everything fails.
  • Takes a long time to implement, so users end up creating their own data pipelines.

Even with battle-proven methods and patterns, and tons of best practices out there, a bad implementation can lead to disastrous and costly results. So, what is the alternative way to solve these problems? Can we prevent them in the first place? Yes, I think we can prevent at least some of them.

Stop the Dirty Data, Early

This is the typical “data pipeline” data engineers create to solve the problem. We might have 10 data sources, from which we craft 100 tables in the data warehouse, from which we present 1,000 reports. The later we deal with dirty data, the more costly the cleansing effort becomes, because one dirty source contaminates every table and report downstream. We all know that bad data leads to wrong decisions, which in turn lead to wrong business actions.

Cost of Fixing Data

The premise is this: we avoid software bugs by having unit tests, integration tests, automated tests, and even manual tests by QA. I believe we can keep dirty data out of the production environment the same way we keep software bugs out.

At Ruangguru, we try to stop dirty data early, starting from the source: the data producers. Apps and microservices stream data to the data team; we call this tracking. To ensure this data is clean, we do a few things:

  • In the planning phase, the product teams are expected to create something we call a “tracking schema”. This schema consists of field names, field types, descriptions, and data quality rules to check (a sketch of what this might look like follows this list). The schema then has to be approved by a data analyst to ensure its quality, for example, whether the column descriptions are clear enough and not confusing.
  • In the development phase, engineers can try sending data to our tracking service. If the data does not fit the schema, it goes to a separate table we call the “dead letter”. This is a common pattern in data engineering.
  • In the testing phase, QA has to check whether the data is coming in. We also provide QA with automated tools for simple rule checks, such as matching a regex pattern against a string field.
  • In the production phase, we have data quality monitoring that checks for faulty data and summarizes the errors for each engineering team.

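To make this concrete, here is a minimal sketch in Python of how a tracking schema with quality rules and a dead-letter table might fit together. This is not our actual code; the schema, field names, and rules are hypothetical, purely for illustration.

    import re

    # Hypothetical tracking schema: field name -> type, description, quality rule.
    TRACKING_SCHEMA = {
        "user_id": {"type": str, "description": "Unique user identifier",
                    "rule": r"^[0-9a-f]{32}$"},
        "event_name": {"type": str, "description": "Name of the tracked event",
                       "rule": r"^[a-z_]+$"},
        "timestamp_ms": {"type": int, "description": "Event time, Unix epoch millis",
                         "rule": None},
    }

    clean_table, dead_letter_table = [], []

    def validate(event: dict) -> list:
        """Return the list of schema violations for one tracking event."""
        errors = []
        for field, spec in TRACKING_SCHEMA.items():
            if field not in event:
                errors.append("missing field: " + field)
            elif not isinstance(event[field], spec["type"]):
                errors.append("wrong type for field: " + field)
            elif spec["rule"] and not re.fullmatch(spec["rule"], str(event[field])):
                errors.append("quality rule failed for field: " + field)
        return errors

    def ingest(event: dict) -> None:
        """Route valid events to the clean table, invalid ones to the dead letter."""
        errors = validate(event)
        if errors:
            dead_letter_table.append({"event": event, "errors": errors})
        else:
            clean_table.append(event)

    # The second event has a malformed user_id, so it lands in the dead letter.
    ingest({"user_id": "a" * 32, "event_name": "lesson_completed",
            "timestamp_ms": 1667260800000})
    ingest({"user_id": "not-an-id", "event_name": "lesson_completed",
            "timestamp_ms": 1667260800000})

The point of the dead-letter table is that rejected events are kept together with the reasons they were rejected, so engineers can debug and resend them instead of silently losing data.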
We enforce the process above using an internal tool called Optimus (short for Operational Tracking Management System). Each user has a role (PM, Eng, QA, or DA), and approval for each stage depends on your role. For example, no tracking is allowed to go to production without approval from QA, who has to ensure that the data coming into production is already clean.
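As an illustration of how such role-based gates can be enforced (the real workflow in Optimus is richer than this; the stages and roles below are only the ones mentioned above), the core check might reduce to something like:

    # Hypothetical stage gates: each stage must be approved by a specific role
    # before a tracking schema can advance toward production.
    STAGE_GATES = {
        "planning": "DA",  # a data analyst approves the tracking schema
        "testing": "QA",   # QA confirms that clean data is actually coming in
    }

    def can_advance(stage: str, approvals: dict) -> bool:
        """A tracking schema passes a stage only if the gating role approved it."""
        required_role = STAGE_GATES.get(stage)
        return required_role is None or approvals.get(required_role, False)

    print(can_advance("testing", {"DA": True}))              # False: QA has not approved
    print(can_advance("testing", {"DA": True, "QA": True}))  # True: ready for production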

Embrace “Data Mesh” Mindset

How about data silos? Aren’t they inevitable, since each team has its own application-specific needs? Here, we take root cause analysis to the next level: we look at the problem, and hence the solution, from a “data ownership” point of view.

Nowadays, the term “data mesh” has gained traction since it was first coined by Zhamak Dehghani a few years ago. However, I feel the term is often misunderstood. Data mesh is not about technology; it is about mindset. It is a pretty fancy and cool way for data engineers to say “I am giving the responsibility for data collection and wrangling back to software engineers. You produce the data, you know your data, you own it. Just expose the data to us so we can use it”. Basically, data ownership is given back to each team.

Data Mesh Paradigm Shift by Zhamak Dehghani

We apply these principles at Ruangguru, where data collection/ingestion from each database into the company data lake is not managed by data engineers. The data engineering team provides the necessary tooling, but each engineering team configures its own data ingestion. The same principle applies when software engineers need a data pipeline to support their APIs: we provide real-time processing tooling for them to configure themselves. The data analyst team uses the same tooling to create their own data marts.
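For illustration, a self-service ingestion configuration can be as simple as a declarative spec that the owning team commits and the shared tooling executes. The field names and values below are hypothetical, not our actual tooling interface:

    # Hypothetical declarative ingestion spec, owned by the producing team;
    # the shared data-engineering tooling reads and executes it.
    INGESTION_SPEC = {
        "owner_team": "payments",                   # the team that owns this data
        "source": {
            "type": "mysql",
            "host": "payments-db.internal",         # hypothetical host
            "tables": ["invoices", "refunds"],
        },
        "destination": {
            "type": "data_lake",
            "path": "s3://company-lake/payments/",  # hypothetical bucket
            "format": "parquet",
        },
        "schedule": "0 * * * *",                    # hourly, in cron syntax
        "mode": "incremental",                      # ingest only new/changed rows
    }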

What’s Left and What’s Next

We avoid dirty data by putting many safety measures in the early phases (planning, development, and testing). We democratize data collection and processing to solve data silos, giving data back to its real owners. We enforce data description and cataloging from the beginning. So, what’s left?

Data engineers still have a data pipeline to manage, but we put more focus on company-wide data such as transactional and user behavior data. We also focus on improving the user experience and reliability of the tools we provide, while exploring new technology. For example, we would like to democratize external data collection, and Airbyte seems like a promising tool to try.

At Ruangguru, we strive to be not just developers who implement features, but problem solvers. By taking a step back and thinking deeply about the root cause of our problems, we do not merely treat the symptoms; we stop the problems from happening, and so we work more efficiently.

Data Engineering Warehouse and Analytics (DEWA) Team during our Happy Hour event (yes, we currently have a Work From Anywhere policy).
