Data Quality Automation With Apache Spark

Tanya Lutsaievska
People.ai Engineering
6 min read · Aug 1, 2019


How People.ai built an automated data quality system.

“Data that is loved tends to survive.” — Kurt Bollacker

Data dashboard displays possible root causes of incorrectly computed time spent on daily sales activities. Notice how the main root cause changes over time: you always need fresh data to make relevant decisions.

Monitoring data quality is crucial to understand customers and provide them with a great product experience. Data quality informs decision-making in business and drives product development. For example, one of People.ai’s features is capturing all activity from Sales and Marketing. We analyze activities ingested from a user’s email inbox, calendar, and CRM system and display actionable insights to help sales and marketing teams take the best next action.

As our system rapidly scaled, we began to see abnormal numbers for certain users, such as 70 hours spent in a day. This number seemed unrealistic (unless you time travel with a Time-Turner). When we manually investigated the problem, we didn't find any bugs: the algorithm worked as expected. However, we identified several edge cases that were impacting the reports, and identifying them helped us improve our model.

The goal around monitoring activity data quality at People.ai is to identify outliers and various edge cases without customer involvement and improve our platform to provide the best experience for every user. We set out on a journey to build a rigorous quality assurance system that verifies data at every stage of a pipeline. Over the last three years, we have iterated our data quality validation flow from manual investigations and ad-hoc queries, to automated tests in CircleCI, to a fully automated Apache Spark pipeline.

Semi-manual checks. In the early days, we manually investigated edge cases by running ad-hoc scripts and queries in a Jupyter notebook. To illustrate, below is pseudo-code of a test that verifies whether we have filtered out too many emails sent by users during the intake and analysis:

analyzed_emails_count = len([entry for entry in data if entry['outbound']])
total_emails_count = len([entry for entry in data if entry['from'] == user_info.user_email])
emails_mismatch = 100 - analyzed_emails_count * 100 / total_emails_count
assert emails_mismatch <= 5, 'Filtered emails value exceeds allowed threshold of 5 percent'

To define a threshold for the test, we analyzed data derived from emails sent by our users. The analysis showed that, on average, we filter out five percent or less of sent emails, usually non-work-related ones. We therefore use five percent as a threshold to verify whether we missed any edge cases that would cause over-filtering. If the share of filtered emails exceeds that threshold, we look for root causes, such as aliases a user might use that we failed to detect.

Those queries became part of our data quality analysis routine. But as our customer base grew, the data became too large to check manually: validation took more than 20 hours of work a week.

Next came automation with CircleCI. In this phase, we wrote down all checks we performed manually as unit tests and built them into the pipeline. We used those tests to verify activity data for every new user upon registration.

Streamlining semi-automated testing flow with CircleCI

Registration of a new user would trigger a “data quality checking job” in CircleCI. If at least one of the data quality tests failed, an internal ticket was created to investigate a root cause before the system marked the user account as ready.

We saw three main challenges with this approach:

  1. The flow still involved a manual step to review failed tests and identify a root cause.
  2. User registration was not complete until the issues were fixed. Whenever a test failed, we had only 48 hours to fix the issue so the new user's registration could complete.
  3. Tests ran only for new users upon registration. We wanted data validation tests in place at every stage of the pipeline, every time we ingested activities for existing users as well.

It didn’t take long for us to realize that running data quality tests for each user in CircleCI was not scalable, involved too much time to identify the root causes, and caused production database overload. This led us to Apache Spark.

Fully Automated Data Quality Apache Spark Job Streaming

As our customer base became bigger, we started seeing a dramatic increase in the amount of data we ingested from users’ mailboxes, calendars, and CRM systems. Tests in CircleCI set up to run at the user level were no longer scalable.

To improve the flow we had automated with CircleCI, we replicated all of its tests in a scheduled Spark job in Databricks. One of the advantages of Apache Spark is that it executes code in parallel across many machines. By introducing Spark jobs, we moved from running CircleCI tests only upon user registration to a weekly job that evaluates data at the user level for all users.
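To make this concrete, here is a minimal sketch (not our exact production code) of how the filtered-emails check from the earlier pseudo-code can be expressed as a Spark aggregation, so it runs for every user in parallel. The table and column names (user_id, outbound, from_address, user_email) are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical Hive table holding a snapshot of users' email activities.
emails = spark.table("data_quality.email_activities_snapshot")

per_user = (
    emails
    .groupBy("user_id")
    .agg(
        F.sum(F.when(F.col("outbound"), 1).otherwise(0)).alias("analyzed_emails"),
        F.sum(F.when(F.col("from_address") == F.col("user_email"), 1).otherwise(0)).alias("total_emails"),
    )
    .withColumn(
        "emails_mismatch_pct",
        100 - F.col("analyzed_emails") * 100 / F.col("total_emails"),
    )
)

# Users whose share of filtered-out emails exceeds the 5 percent threshold are
# flagged for investigation instead of failing a per-user CI job.
over_filtering_violations = per_user.filter(F.col("emails_mismatch_pct") > 5)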

From a technical standpoint, we chose Databricks mainly for its job scheduling capability. We also wanted cluster configuration simple enough not to require DevOps engineer involvement, which Databricks makes possible by providing Apache Spark as a hosted solution.

We also automated investigations that our support engineering team usually performs manually to check for all possible root causes. This helped us eliminate the manual step we had in the previous flow.

As of right now, we have numerous tests in place that validate data quality at the user level and are scheduled to run weekly. Our goal is to run the Spark job daily and eventually embed the data validation checks into the pipeline.

Data Quality Pipeline

The Spark Data Quality Pipeline

The Spark pipeline includes three jobs:

  1. ETL layer
  2. Data quality checking layer
  3. Reporting layer

The ETL layer involves a Spark job that extracts a snapshot from multiple production databases, checks and corrects data type inconsistencies, and moves the transformed data into a Hive table. This step is important to keep the data clean and avoid production database overload.
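As an illustration, a simplified version of such an ETL job might look like the following. The JDBC connection, table names, and columns are placeholders rather than our actual schema; the job writes the snapshot table read by the check sketched earlier:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read a snapshot from a production read replica (never the primary) over JDBC.
activities = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://prod-replica:5432/activities")
    .option("dbtable", "public.email_activities")
    .option("user", "etl_reader")
    .option("password", "<redacted>")
    .load()
)

# Normalize data types so downstream checks compare like with like.
snapshot = (
    activities
    .withColumn("sent_at", F.to_timestamp("sent_at"))
    .withColumn("duration_minutes", F.col("duration_minutes").cast("double"))
)

# Persist the snapshot to Hive so quality checks never touch the production database.
snapshot.write.mode("overwrite").saveAsTable("data_quality.email_activities_snapshot")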

The data quality checking layer contains the code with the set of rules for checking and measuring data quality. We apply these rules to measure data quality metrics along the following dimensions (a simplified sketch of such rules follows the list):

  • Data completeness. Verify that there was no unexpected loss of data during activity intake or due to over-filtering.
  • Data accuracy and integrity. A set of rules to identify outliers and edge cases.
  • Data consistency and stability. Verify that data is accurate, consistent, and stable for all users, and that the numbers we report stay within an expected range.
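Beyond the completeness check sketched earlier, the accuracy and consistency rules can be expressed in the same style. The thresholds below (16 hours of activity per day, three standard deviations of weekly volume) are illustrative examples rather than our exact values:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
activities = spark.table("data_quality.email_activities_snapshot")

# Accuracy and integrity: flag user-days where reported time spent is implausible.
accuracy_violations = (
    activities
    .groupBy("user_id", F.to_date("sent_at").alias("day"))
    .agg((F.sum("duration_minutes") / 60).alias("hours_spent"))
    .filter(F.col("hours_spent") > 16)
)

# Consistency and stability: flag users whose weekly activity volume swings far
# outside their own historical range.
weekly = activities.groupBy("user_id", F.weekofyear("sent_at").alias("week")).count()
weekly_stats = weekly.groupBy("user_id").agg(
    F.mean("count").alias("mean_count"),
    F.stddev("count").alias("stddev_count"),
)
consistency_violations = (
    weekly
    .join(weekly_stats, "user_id")
    .filter(F.abs(F.col("count") - F.col("mean_count")) > 3 * F.col("stddev_count"))
)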

Lastly, the reporting layer includes a data dashboard that displays the state of data quality across the board and email notifications sent to the support engineering team whenever data anomalies or outliers are detected, such as failure to detect email blasts.
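Continuing the sketches above, the reporting step might persist per-rule violation counts for the dashboard and email the support engineering team only when something was flagged. The results table, SMTP host, and addresses are placeholders:

import smtplib
from email.message import EmailMessage

from pyspark.sql import functions as F

# Violation counts from the checks sketched above (completeness, accuracy, consistency).
results = {
    "completeness_over_filtering": over_filtering_violations.count(),
    "accuracy_daily_hours": accuracy_violations.count(),
    "consistency_weekly_volume": consistency_violations.count(),
}

# Persist per-rule counts so the dashboard can plot data quality over time.
spark.createDataFrame(
    [(rule, count) for rule, count in results.items()], ["rule", "violations"]
).withColumn("run_date", F.current_date()).write.mode("append").saveAsTable(
    "data_quality.check_results"
)

# Notify support engineering only when at least one rule flagged violations.
failing = {rule: count for rule, count in results.items() if count > 0}
if failing:
    msg = EmailMessage()
    msg["Subject"] = f"Data quality alerts: {len(failing)} rule(s) flagged violations"
    msg["From"] = "data-quality@example.com"
    msg["To"] = "support-engineering@example.com"
    msg.set_content("\n".join(f"{rule}: {count} users flagged" for rule, count in failing.items()))
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)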

Data dashboard that shows the quality of job title data detected by People.ai algorithms for people with whom sales reps engage

The chart above shows whether People.ai detected a person's job title correctly (such as Senior Account Manager), failed to capture the correct job title, or detected an incomplete one. For example, an incomplete job title might be "Director" without a department, or "Business Development" without seniority.

Data Quality Validation is Our Responsibility, Not the Customer’s

Data quality is a critical component of our system. At People.ai, data quality is placed at the same level of importance as integration testing is for software development.

We found the best way to validate activity data quality is to ask the data itself what does not work and why. Finding an edge case and addressing it early helps us provide a smoother user onboarding process and a better customer experience.

Activity data quality validations should be automated as much as possible. Our goal is to embed validations into the data pipeline code, but in a manner that allows them to be changed effortlessly. Depending on the criticality of the data and the validation, we want the pipeline either to fail completely or to flag and report the issue and continue processing.
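One way to realize that behavior, sketched here with purely illustrative names, is to attach a severity to each validation so that only critical failures stop the pipeline while everything else is reported and processing continues:

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Validation:
    name: str
    severity: str             # "critical" aborts the pipeline; "warning" is only reported
    check: Callable[[], int]  # returns the number of violations found

def run_validations(validations: Iterable[Validation], report: Callable[[str], None]) -> None:
    for validation in validations:
        violations = validation.check()
        if violations == 0:
            continue
        if validation.severity == "critical":
            # Critical data problems fail the pipeline completely.
            raise RuntimeError(f"{validation.name}: {violations} violations, aborting pipeline")
        # Everything else is flagged, reported, and processing continues.
        report(f"{validation.name}: {violations} violations, continuing")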

This has been our journey so far at People.ai. We continue to learn and grow over time to ensure that our customers can have full faith in the quality of the data we deliver.

See All Open Opportunities By Visiting: https://people.ai/careers/
