What Is Data Integrity, How Do You Achieve It, and Why Is It Important?

Idan Shafat
Riskified Tech

--

Data Integrity Engineers, aka DIs, wear many hats. A major part of the role is connecting data engineering with the various data analysis teams, as well as analyzing data quality.

The focus of the DI team differs from company to company. At Riskified, we get a taste of all aspects, such as the integrity of data sources, data stream monitoring, client integrations, customizations, and innovations.

Why is data integrity important?

Maintaining data integrity can save you resources otherwise spent making decisions based on incorrect or incomplete data.

DI is a fast-growing role in the data world, especially in the US: there are currently ~300K Data Analysts and ~95K DIs, roughly a third as many.

Companies without Data Integrity Engineers risk having an insufficient data management strategy in place. This means they lack the right processes and tools to ensure data quality and security, which affects the company’s performance and reputation. Without proper data management procedures and monitoring, the company may also be at a higher risk of data privacy and security breaches.

Main tasks of the DI at Riskified

The DI is in charge of validations and tests throughout every stage of the integration process. This ensures that when a client (or product) goes live, the models work as expected.

New products and features rely heavily on historical data to make decisions. The data must be accurate and reliable, as maintaining data integrity helps companies identify new opportunities and trends.

Onboarding processes for new clients

The DI checks the data that the company receives through all of its sources and is the go-to person for technical questions about data flows or requests that arise during the integration. Their main tasks are validating production and historical data, making sure it’s populated correctly and unified, and monitoring issues.

Historical data

This data helps us calibrate our models before production. We receive large historical datasets from the client, transform them to fit our system, and validate and analyze the data before uploading it to our DB.

These validations help us ensure the integrity of the data we receive after going live with the client, making the most of the client’s capabilities and of the customizations available to us.

Our validations combine automated reports, generated via tools that we built, with custom validations per client flow.
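To give a feel for these checks, here is a minimal sketch of the kind of validation pass we might run over a batch of historical orders in a Databricks notebook (PySpark). The table name, column names, and date window are illustrative assumptions, not our actual schema.

# Minimal sketch of a historical-data validation pass (PySpark on Databricks).
# "spark" is the SparkSession Databricks provides in every notebook.
# Table and column names below are hypothetical.
from pyspark.sql import functions as F

orders = spark.table("staging.historical_orders")  # hypothetical staging table

checks = {
    # Required fields must be populated
    "missing_order_id": orders.filter(F.col("order_id").isNull()).count(),
    "missing_email": orders.filter(F.col("customer_email").isNull()).count(),
    # Order IDs should be unique before we load them into our DB
    "duplicate_order_ids": orders.count() - orders.select("order_id").distinct().count(),
    # Amounts should be positive
    "non_positive_amounts": orders.filter(F.col("total_amount") <= 0).count(),
    # Timestamps should fall inside the agreed historical window
    "out_of_range_dates": orders.filter(
        ~F.col("created_at").between("2020-01-01", "2023-12-31")
    ).count(),
}

for name, failures in checks.items():
    status = "OK" if failures == 0 else f"FAILED ({failures} rows)"
    print(f"{name}: {status}")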

Production

Without proper data validation procedures, the data may contain errors and inconsistencies, leading to inaccurate insights and decision-making. After all, data-driven decisions can only be as strong as the data they’re based on.

On production, we get live data from our client and validate the different flows, pipelines, and APIs, and confirm that our converters and ETLs are functioning correctly (for example, for orders placed physically in a store, the IP data point is treated differently). If any adaptations, manipulations, or fixes are needed, this is the time to customize the systems for the new segment.

Some of our validations are repetitive and necessary for every integration, while some are more client specific. The DI creates automations for repetitive tasks so that we can focus on more specific validations and edge cases.

For example, one of the automated validation reports we created in Databricks for one of our data sources lets colleagues enter the relevant client ID and the dates they want to analyze in widgets and then run the report. The notebook caches the main data views and runs validations in Spark and SQL to check every aspect of the data, saving a lot of time and surfacing issues in a convenient report with visualizations.
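A stripped-down sketch of that pattern might look like the following; the widget names, the lake.orders table, and the checks are illustrative assumptions rather than our actual report.

# Sketch of the widget-driven validation report (Databricks notebook, PySpark).
# "spark", "dbutils", and "display" are provided by Databricks; everything else is hypothetical.
from pyspark.sql import functions as F

# Colleagues fill these widgets in the notebook UI, then run the report.
dbutils.widgets.text("client_id", "")
dbutils.widgets.text("start_date", "2024-01-01")
dbutils.widgets.text("end_date", "2024-01-31")

client_id = dbutils.widgets.get("client_id")
start_date = dbutils.widgets.get("start_date")
end_date = dbutils.widgets.get("end_date")

# Cache the main data view once so the individual validations run quickly.
orders = (
    spark.table("lake.orders")  # hypothetical data-lake table
    .filter((F.col("client_id") == client_id)
            & F.col("order_date").between(start_date, end_date))
    .cache()
)
orders.createOrReplaceTempView("client_orders")

# Validations can then mix PySpark and SQL over the cached view.
missing_ip = spark.sql(
    "SELECT COUNT(*) AS n FROM client_orders WHERE ip_address IS NULL"
).first()["n"]
print(f"Orders with a missing IP: {missing_ip}")
display(orders.groupBy("order_status").count())  # quick visualization in the report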

The notebook runs on a dedicated cluster that operates at low cost.

The DI understands both the analytical and business aspects of data-based decisions, and is therefore in charge of documenting all known data issues and flows and communicating them to the relevant teams. We always try to fix issues on our end first, but if a fix is needed on the client’s side, the DI weighs the severity of the issue, whether it’s a blocker, and whether the additional friction with the client is justified. In these cases, the DI helps communicate the issue to the client and asks for another iteration of data.

Part of achieving data integrity is knowing where to be flexible and adapt to our client’s capabilities, and where we need to declare that something is a blocker.

For instance, if a customer can’t send us an indication that an order came from the ‘desktop_web’ source and instead sends the more general ‘web’, it’s not a blocker. If the customer sends us all of their orders with the same IP (the store’s IP rather than the end user’s IP), it is a blocker, and the DI has to ask the client to fix the issue before we can progress.
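The second case can even be caught automatically. Here is a hedged sketch of such a check; the threshold, view name, and column names are illustrative.

from pyspark.sql import functions as F

orders = spark.table("client_orders")  # hypothetical per-client view

total = orders.count()
distinct_ips = orders.select("ip_address").distinct().count()

# If almost every order arrives with the same IP, the client is probably sending
# the store's IP rather than the end user's IP - a blocker for our models.
if total > 0 and distinct_ips / total < 0.01:  # illustrative threshold
    print(f"BLOCKER: only {distinct_ips} distinct IPs across {total} orders")
else:
    print("IP distribution looks reasonable")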

Integrations vary, so achieving data integrity takes a different, customized effort each time. Part of the challenge is understanding how flexible we can be.

Customization by industry

We work with shops and marketplaces from industries like fashion, travel, events, and more, and we request different data from each industry. The JSON files for orders look different per industry, and the data passes through different models. The models take the different data points into account and assign different weights to features by industry to provide accurate analysis.
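As a purely hypothetical illustration of how payloads differ, a fashion order and a travel order might carry very different data points:

# Hypothetical, heavily simplified order payloads - real schemas differ per integration.
fashion_order = {
    "order_id": "F-1001",
    "line_items": [{"sku": "JKT-42", "size": "M", "quantity": 1}],
    "shipping_method": "standard",
    "shipping_address": {"country": "US", "zip": "10001"},
}

travel_order = {
    "order_id": "T-2002",
    "passengers": [{"name": "J. Doe", "document_type": "passport"}],
    "departure_date": "2024-07-01",
    "route": {"origin": "JFK", "destination": "LHR"},
}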

Without proper data policies, the company may struggle to enforce consistent data management practices, leading to poor data quality and integrity. Customizing our product for different customers may result in variations in definitions, structures, and formats, which can lead to inconsistencies in the data. It’s problematic to integrate new data if it’s not unified with our settings and protocols.

That’s why the DI establishes best practices and guidelines for how we can customize the data in different cases, making sure there are data sources where all the data is unified with our company’s dictionaries and mappings.

Customization by merchant capabilities

Some clients can’t send us all the required data points. In these cases, we attempt to find replacement features and document the missing values.

If needed, the DI maps and manipulates some of the data points sent (for example, mapping ‘standard_shipping’ and ‘standard’ to the same value so the analysis is clearer).
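A minimal sketch of such a normalization mapping (the values are illustrative, not our actual dictionaries):

SHIPPING_METHOD_MAP = {
    "standard_shipping": "standard",
    "standard": "standard",
    "express_shipping": "express",
    "express": "express",
}

def normalize_shipping_method(raw_value: str) -> str:
    # Map a client-specific shipping method to a unified vocabulary.
    return SHIPPING_METHOD_MAP.get(raw_value.strip().lower(), "other")

print(normalize_shipping_method("Standard_Shipping"))  # -> "standard"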

If the client sends extra data we didn’t request, the DI tests whether we can make use of it and communicates it to the development department.

Alerts and monitoring

After ensuring data integrity and going live with a new client, the next task is preserving integrity. This means making sure that no new data or flow issues arise, and that we get alerts in case of an issue.

That way, we manage to maintain and improve our performance.

Data issues

The DI monitors data through dashboards and alerts. If a new issue arises, the DI investigates the incident to understand what caused the metric drop, for example, when we stop receiving a critical data point, or a more macro issue such as receiving spam data on user behavior at the startup stage of different applications.

The DI attempts to fix the issue on our end or loops in the account manager to ask the client for a fix on their end.

Flow issues

The DI monitors the different flows and notifies the relevant teams if a client’s API is running into issues. We might see duplicated API calls, incorrectly populated order IDs, and other cases that can cause data overrides or other issues in our DB.
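A hedged sketch of a duplicate-call check, using hypothetical table and column names:

from pyspark.sql import functions as F

api_calls = spark.table("lake.order_api_calls")  # hypothetical API log table

# Order IDs submitted more than once in the last day can indicate duplicated
# API calls or data overrides in our DB.
duplicates = (
    api_calls
    .filter(F.col("received_at") >= F.date_sub(F.current_date(), 1))
    .groupBy("client_id", "order_id")
    .count()
    .filter(F.col("count") > 1)
)

if duplicates.count() > 0:
    display(duplicates.orderBy(F.desc("count")))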

Data issues can be found through BI metrics dashboards, trends, research, manual data review, and DI-based dashboards (using mostly Databricks to work with data lakes).

Automations and alerts

The DI builds and establishes automations for repeating processes: dashboards (like a data lake table health check that spots lags), workflows, reports, alerts (e.g., tracking whether a client deployed a requested fix), and jobs for spotting spam data in our systems (useless or duplicated data), so we can point out where it’s possible to save on storage and improve performance.
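The table health check, for example, might boil down to something like this sketch, where the table list, timestamp column, and lag threshold are assumptions:

from datetime import datetime, timezone
from pyspark.sql import functions as F

TABLES_TO_CHECK = ["lake.orders", "lake.chargebacks", "lake.user_events"]  # hypothetical
MAX_LAG_HOURS = 6  # illustrative threshold

for table_name in TABLES_TO_CHECK:
    # Assumes each table has an "updated_at" timestamp stored in UTC.
    latest = spark.table(table_name).agg(F.max("updated_at").alias("latest")).first()["latest"]
    lag_hours = (datetime.now(timezone.utc).replace(tzinfo=None) - latest).total_seconds() / 3600
    if lag_hours > MAX_LAG_HOURS:
        print(f"ALERT: {table_name} is lagging by {lag_hours:.1f} hours")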

When building automations, the DI needs to keep computing costs to a minimum and keep the processes efficient.

DIs understand their company’s and clients’ interests, in addition to being the connecting link between the data analysis (DA) and data engineering (DE) departments. Therefore, they take part in the process of releasing new products.

New products

When a new product is developed, the DI oversees that the relevant data is collected and stored in the correct places so that analysts and data scientists can reach useful insights.

The DI participates in discussions about new products, and characterizes and builds tools that help monitor and ensure the data integrity of those products. Because the DI knows the data inside out, they take part in different tests of the products and help improve performance.

In addition, each DI takes a domain to master; it can be a new product, an internal tool, workflows (like Airflow), and more.

DI’s main tools

Being a DI requires you to use and access a variety of tools. You will specialize in tools like Databricks, PyCharm, RubyMine, Git, BI tools (Tableau, Looker, Anodot), Coralogix, Logz, New Relic, and internal tools and packages.

Databricks

Databricks enables us to cache data views from our data lakes and DBs (in multiple coding languages), create monitors, run multiple automated jobs, control the computation power and resources used for our validations, analyze big data files through Amazon S3 or DBFS, create alerts, write data to research tables, and more.

Through Databricks, we can use our DI repository containing packages with functions, reports, and different scripts we write, and constantly update them to make our lives easier and preserve best practices across the team.

We also use other platforms for some of these tasks, though most of them can be done in Databricks. For example, when 8-digit credit card BINs had just been introduced to the world, we created a Databricks query to alert us whenever we received one.
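The original query isn’t reproduced here, but a simplified sketch of the idea, assuming a hypothetical lake.payments table with a card_bin column, might look like this:

# Simplified sketch of the 8-digit BIN alert; table and column names are assumptions.
eight_digit_bins = spark.sql("""
    SELECT client_id, card_bin, COUNT(*) AS occurrences
    FROM lake.payments
    WHERE LENGTH(card_bin) = 8
      AND order_date >= date_sub(current_date(), 7)
    GROUP BY client_id, card_bin
    ORDER BY occurrences DESC
""")

if eight_digit_bins.count() > 0:
    display(eight_digit_bins)  # a scheduled job then emails the result to subscribers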

The output is then sent to the subscribed email addresses.

Wrapping up

Looking at the main responsibilities of a DI, it’s clear that these domains should be coordinated, and that this connecting link is an important part of a data company.

DI teams make it easier for other departments to focus on their goals, trusting the data’s accuracy and potential while the DI takes care of monitoring and tracking data issues, flows, pipelines, and edge cases. The DI can improve performance and decision-making, reduce costs, and prevent mistakes.

I hope this blog was helpful, offering insights into the data integrity world and increasing awareness.
