Data Engineering for Martech — Data Quality — Data Engineering Series (Part II)

DP6 Team · Published in DP6 US · Jan 3, 2022

Data Governance is a subject that has become increasingly important in the Digital Marketing area: we even talked about it a little in a previous post. Concern about data management and security has grown a lot, mainly due to the implementation of the LGPD (Brazil's General Data Protection Law). In addition, one area of governance that has gained particular prominence is Data Quality.

Data Quality can be defined as an indication of how reliable a piece of data is. To establish that, you need to understand your data and define metrics and criteria that indicate its level of confidence. Then you must evaluate these metrics constantly and make decisions that improve and guarantee quality.
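As a minimal illustration, a quality metric can be as simple as the share of non-null values in a key column, evaluated against an agreed threshold. The sketch below assumes a pandas DataFrame with a hypothetical user_id column; the sample data and the threshold are illustrative, not a prescription:

```python
import pandas as pd

# Hypothetical sample; in practice this would come from your collection or database.
df = pd.DataFrame({
    "user_id": ["u1", "u2", None, "u4"],
    "revenue": [10.0, 5.5, 3.2, None],
})

def completeness(frame: pd.DataFrame, column: str) -> float:
    """A simple quality metric: the share of non-null values in a column."""
    return frame[column].notna().mean()

threshold = 0.95  # illustrative acceptance criterion agreed with the business
score = completeness(df, "user_id")
status = "OK" if score >= threshold else "BELOW THRESHOLD"
print(f"user_id completeness: {score:.0%} ({status})")  # 75% (BELOW THRESHOLD)
```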

For good analytics it is necessary to have the highest level of data reliability, as incorrect data can bias a result or generate flawed insights, which means your investment won’t be optimized. In Digital Marketing, the central object used for decision-making is Data Collection. Without it, understanding user behavior on a website or app is much more difficult.

However, reliable collection isn't the only part that is essential. It is also important to ensure the quality of data throughout its lifecycle, from creation through to maintenance, processing and even archiving (if applicable). So, how can we guarantee the quality of this data? What actions need to be taken? What is the role of the data engineer in this whole process?

At DP6, many of these processes are the responsibility of the engineer. Not only do they carry out the collection, but they also generate the inputs that help ensure the quality of the entire universe of data used. So, let's try to understand a little of the process that the data engineer must follow when implementing Data Quality in a project.

Data quality in practice

There are many Data Quality tools on the market these days, but simply adopting one of them may not be as effective as expected. To ensure data quality you must go beyond mere tools; you have to understand your complete data structure and have in-depth knowledge of the information lifecycle. You need proper management of the origins, changes, integrations and useful life of the data. When dealing with Data Quality, it is important to keep these main concepts in mind:

  • Knowledge of the data — enables you to find hidden problems, wherever they are.
  • Lifecycle management of data — managing the growth, transformation and disposal of data, as well as staying aware of security and optimizing processes.
  • Constant monitoring of quality — tracking quality issues across the full range of enterprise data, making sure that quality expectations are always met.

Now, let’s learn a little more about the step-by-step process of implementing Data Quality, and the data engineer’s responsibilities.

1 — Understanding the data lifecycle

Each project has specific requirements in relation to data, but generally speaking they all follow a lifecycle, which you need to know very well. You must understand the data sources (which can be collection, API downloads, third-party tools, other databases etc.), where the data is stored (spreadsheets, a database or Google Analytics itself), the processing and transformation it undergoes, what analysis it can be used for, and finally, whether it goes for disposal or archiving.

2 — Deciding when the data will be validated

Once you understand the entire lifecycle of your data, it’s time to establish the strategic points for its validation. Yes, Data Quality can (and should) be implemented at different points in the cycle.

Why is it necessary to have more than one Data Quality check? Firstly, because your data will likely come from different sources. Secondly, depending on the data structure, there will probably be more than one database used for analysis, so it will not be possible to simply validate a single final base. Last but not least, if you can guarantee quality from one stage to the next, it is much easier to find the source of any potential problem. For example, it is no use validating the data collection alone if transformations are made on the data later; and if the validation happens only at the final point (or base), you will likely need to reverse engineer the whole pipeline until you find where the data went wrong.

Some examples of strategic points where data can be validated:

  • Data layer: it is possible to guarantee that the data is going to the layer correctly, and that no PII (personally identifiable information) is being sent (see the sketch after this list).
  • Data collection for Google Analytics: validation of events and pageviews that arrive in GA according to the collection map.
  • BigQuery databases: ensure that you have all the data you need for analysis, that it has not been corrupted by transformations and calculations, and check whether it has been rendered incomplete due to a processing failure.
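As an illustration of the first point, here is a minimal sketch of a PII scan over a data layer object before it is pushed. The payload structure and the regex patterns are assumptions and would need to be adapted to your own collection contract:

```python
import re

# Illustrative patterns for common PII; tune these to your own data layer contract.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d{2}\s?\(?\d{2}\)?\s?\d{4,5}-?\d{4}"),
}

def scan_for_pii(obj, path=""):
    """Recursively walk a data layer object, yielding (path, pattern name) hits."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from scan_for_pii(value, f"{path}.{key}" if path else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from scan_for_pii(value, f"{path}[{i}]")
    elif isinstance(obj, str):
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(obj):
                yield path, name

# Hypothetical data layer push, as it might look before being sent to GA.
data_layer = {"event": "purchase", "user": {"id": "12345", "contact": "ana@example.com"}}

for path, kind in scan_for_pii(data_layer):
    print(f"possible {kind} found at {path}")  # possible email found at user.contact
```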

3 — Defining quality metrics

It is very important to know your data, to be able to say what is right and what is wrong. How do you know what the correct data volume is? How do you determine whether a user ID is being collected correctly or not? These are some examples of the questions that need to be asked, so that it is possible to apply criteria for accepting the data and establishing its reliability.

This process may also require the help of a Business Analyst. Together with the Data Engineer they will be able to define which metrics need to be applied, according to the analysis of current data and business understanding.

At this stage, it is worth learning more about TDQM (Total Data Quality Management), which is a methodology used to define metrics, measure quality, analyze the causes of inconsistencies and improve data quality. With this methodology, it is possible to use 16 quality dimensions to help define metrics that will be used to monitor data. It is not necessary to use all 16 dimensions to evaluate data: once you understand the needs of your business in relation to data, it will be possible to choose the dimensions that make the most sense for each group of data evaluated.
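As a sketch of how a few of those dimensions might translate into concrete checks, the snippet below maps three TDQM-style dimensions (completeness, validity, timeliness) to measurable rules. The sample table, the allowed event names and the date window are all hypothetical:

```python
import pandas as pd

# Hypothetical extract of a collection table to be evaluated.
events = pd.DataFrame({
    "event_name": ["page_view", "purchase", "page_view", None],
    "session_id": ["s1", "s2", "s2", "s3"],
    "event_ts": pd.to_datetime(["2022-01-03 10:00", "2022-01-03 10:05",
                                "2022-01-03 10:05", "2022-01-03 10:07"]),
})

# Each chosen quality dimension maps to a measurable rule for this data set.
checks = {
    # Completeness: share of events that actually have a name.
    "completeness": lambda df: df["event_name"].notna().mean(),
    # Validity: only event names present in the collection map are acceptable.
    "validity": lambda df: df["event_name"].isin(["page_view", "purchase"]).mean(),
    # Timeliness: share of events inside the expected window (illustrative).
    "timeliness": lambda df: (df["event_ts"] >= "2022-01-03").mean(),
}

for dimension, check in checks.items():
    print(f"{dimension}: {check(events):.0%}")
```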

4 — Designing the validation framework

After defining where the data will be validated, it is time to design the process architecture. Just as data can and should be validated at more than one point in its lifecycle, there are several ways to design this process. You can use APIs, spreadsheets, or even plugins to validate data collection if you don't have a cloud environment. If there is a cloud environment (GCP, AWS, Azure etc.), the range of possibilities expands even further: it is possible to develop automated scripts that validate both the collection and the data in a database, automate the generation of alerts (see the next step), and create a robust monitoring system. Whatever the choice, the architecture must be well structured from end to end, including all of the tools you are going to use. This step is as important as any other when implementing Data Quality: the better the structure is documented, the easier it will be to maintain and to improve later.
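For example, on GCP one building block could be a scheduled script that runs each check as a query. A minimal sketch, assuming the google-cloud-bigquery client library with default credentials and a hypothetical events table that has event_timestamp and user_id fields:

```python
from google.cloud import bigquery

# Hypothetical table and checks; in a real project these would come from the
# validation framework's configuration rather than being hard-coded.
TABLE = "my-project.analytics.events"  # placeholder
CHECKS = {
    "rows_yesterday": f"""
        SELECT COUNT(*) AS value
        FROM `{TABLE}`
        WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    """,
    "null_user_ids": f"""
        SELECT COUNTIF(user_id IS NULL) AS value FROM `{TABLE}`
    """,
}

def run_checks(client: bigquery.Client) -> dict:
    """Run each check query and collect its single-value result."""
    results = {}
    for name, sql in CHECKS.items():
        row = next(iter(client.query(sql).result()))
        results[name] = row["value"]
    return results

if __name__ == "__main__":
    client = bigquery.Client()  # uses the environment's default GCP credentials
    print(run_checks(client))
```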

5 — Creating an alert mechanism and/or a basis for follow-up (monitoring)

Finally, we come to the execution and implementation of Data Quality. After all the planning, it's time for the data engineer to get their hands dirty and create the mechanisms for validation (as designed in the previous step), which is also where the final delivery occurs. This can be an email alert about a problem identified at the time of validation, a monitoring dashboard, or both. Here it will be possible to monitor how the data evolves and how correct it is relative to the past, or whether it has already achieved the stability determined by the quality metrics. It is important to monitor the quality of the data and, on this basis, make decisions both for its improvement and its reliability.
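A minimal sketch of the alerting side, wiring check results like those from the previous step to an email using Python's standard smtplib; the thresholds, addresses and SMTP host are placeholders:

```python
import smtplib
from email.message import EmailMessage

# Hypothetical validation output, e.g. produced by the checks from step 4,
# and the accepted (min, max) range for each one.
results = {"rows_yesterday": 0, "null_user_ids": 152}
thresholds = {"rows_yesterday": (1, float("inf")), "null_user_ids": (0, 100)}

# Keep only the checks whose value fell outside the accepted range.
failures = {
    name: value
    for name, value in results.items()
    if not (thresholds[name][0] <= value <= thresholds[name][1])
}

if failures:
    msg = EmailMessage()
    msg["Subject"] = "[Data Quality] validation failures detected"
    msg["From"] = "dq-bot@example.com"    # placeholder
    msg["To"] = "data-team@example.com"   # placeholder
    msg.set_content("\n".join(f"{k}: {v}" for k, v in failures.items()))
    with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder host
        smtp.send_message(msg)
```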

Another important point is that you should always be aware of any new data that is made available. Data collection evolves and other database sources may emerge. New metrics may be needed, and there must be continuous improvement of the process to meet the evolving needs of the business and the data itself. Therefore, it is imperative that you keep track of all processes and regularly review the steps described above.

Implementing Data Quality seems simple at first, but it can be quite a complex task. Applied correctly, it provides assurance that the data used is truly reliable. The volume, variety and velocity of data only tend to grow, and robust mechanisms that guarantee its quality are essential for any delivery that depends on it, whether you're doing a simple analysis or using artificial intelligence. The data engineer is the professional with the greatest responsibility for developing these processes.

At DP6 we are dedicated to Data Governance, and we are building a stack of open source tools that can serve a variety of points in the lifecycle of data for Digital Marketing. We will post more about these tools here very soon, so keep an eye out!

Profile of the author: Angélica Fatarelli | With a Bachelor's in Information Systems and an MBA in Data Science, she worked for many years in software development. Today she works in Data Engineering, bringing technological solutions to Digital Marketing with DP6 in the financial, educational and health sectors.
