Getting started with building a quality data pipeline

Thomas Cardenas
Ancestry Product & Technology
Sep 15, 2022 · 5 min read

Summary: This article is for teams that are new to data pipelines and want to improve trust in their data outputs. Below I share our insights on the tools we used and what to check before using data from dependencies. The three examples we cover are Airflow external task sensors, confirming that S3 locations exist, and using Amazon Deequ to run error-level checks that fail the pipeline.

I’ve recently started building data pipelines that support machine learning inferences. My team and I were new to data lakes, data pipelines, Apache Spark, Apache Airflow, and data engineering in general. We were tasked with building training and inference data pipelines for a hint recommendation model.

Ancestry manages tens of billions of hints for customers, and the new data pipeline needed to pull a subset in the tens of millions efficiently. Ancestry’s hints change continuously, which added the challenge of keeping track of accepted, rejected, and deleted hints across all of the subsets we were using.

With this new pipeline, we wanted to focus on improving data quality by using purpose-built tooling for each of the issues we had seen. In past integrations, pipelines failed frequently, data went missing or was duplicated, and data sources were misunderstood. This challenge was an exciting opportunity for my team to build a pipeline that improved data quality from the start.

A few helpful terms before reading on

  • Apache Airflow — tool for scheduling and orchestrating pipelines
  • Directed Acyclic Graph (DAG) — represents a series of orchestrated tasks
  • Apache Spark — general-purpose data processing framework; loading data and performing transformations is a common use case
  • Amazon EMR (Elastic MapReduce) — compute resource offering that can run various Big Data applications such as Apache Spark
Figure-1: Pipeline visual of using external task sensors to connect separate data pipelines (Source: Author)

Airflow External Task Sensors

The dependency team used an Airflow DAG for each data acquisition, and those DAGs lived on the same Airflow server as my team’s DAGs. When two DAGs have a dependency relationship, the built-in Airflow ExternalTaskSensor operator can be used to check whether the upstream DAG has finished and to alert if it has failed. Figure 1 above shows one sensor task per data source, checking that the other data pipelines have finished before continuing to the Inference ETL.

Adding this gave my team quick insight into why pipelines failed without having to open logs. It also lets Airflow build a dependency graph across all of the DAGs. Using sensors like this is not always an option, but it is nice when it is.
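
For concreteness, here is a minimal sketch of what such a sensor task can look like in Airflow 2.x. The DAG and task IDs (inference_etl, hint_acquisition) are hypothetical, and it assumes both DAGs run on the same schedule (otherwise execution_delta or execution_date_fn is needed):

from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

# Hypothetical DAG and task IDs; assumes the acquisition DAG runs on the same schedule.
with DAG(
    dag_id="inference_etl",
    start_date=datetime(2022, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_hint_acquisition = ExternalTaskSensor(
        task_id="wait_for_hint_acquisition",
        external_dag_id="hint_acquisition",  # the dependency team's acquisition DAG
        external_task_id=None,               # None waits for the whole DAG, not a single task
        allowed_states=["success"],
        failed_states=["failed"],            # surface upstream failures without opening logs
        mode="reschedule",                   # free the worker slot between checks
        timeout=6 * 60 * 60,
    )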

Figure-2: Pipeline visual of extending task sensors with S3 List Operator (Source: Author)

Ensuring the S3 files exist

The external task sensors often reported that the data acquisition DAG had succeeded, yet our tasks still failed. The files at the expected S3 location did not exist because the dependency had pulled zero rows.

The Amazon provider package for Airflow includes S3ListOperator, which can quickly be wrapped in a custom operator to confirm that the files exist. The number of files and their names are unimportant; what matters is that at least one file exists.

Adding this allowed us to diagnose problems faster. As we all know, the fewer clicks required to figure out an issue, the better. It also prevented us from spinning up clusters (an expensive process) when the data wasn’t there.

Figure 2 shows how this task can be added alongside the dependency status check; a sketch of such an operator follows below.
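
As a rough sketch of that wrapper (assuming a recent Airflow Amazon provider package, where the operator lives in airflow.providers.amazon.aws.operators.s3; the bucket and prefix below are placeholders):

from airflow.exceptions import AirflowFailException
from airflow.providers.amazon.aws.operators.s3 import S3ListOperator

class S3FilesExistOperator(S3ListOperator):
    """Hypothetical wrapper: fail fast when the upstream DAG produced no files."""

    def execute(self, context):
        keys = super().execute(context)  # S3ListOperator returns the matching keys
        if not keys:
            raise AirflowFailException(
                f"No files found under s3://{self.bucket}/{self.prefix}; "
                "skipping the expensive EMR launch."
            )
        return keys

check_hint_files_exist = S3FilesExistOperator(
    task_id="check_hint_files_exist",
    bucket="my-data-lake",     # placeholder bucket
    prefix="hints/{{ ds }}/",  # placeholder partitioned prefix
)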

Figure-3: Pipeline visual with Data Constraint check added (source: Author)

Amazon Deequ

Before using data from dependencies, it’s crucial to make sure any assumptions about that data are accurate. It’s also essential to review the requirements of the final data deliverable to ensure they are met. To do this, an Airflow operator submits a Spark step to an EMR cluster, and that step uses Deequ to validate those requirements and assumptions; a minimal sketch of the step submission is shown below.
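
As an Airflow-side sketch of that submission (the cluster task ID, jar location, and Spark class below are hypothetical, and import paths assume a recent Amazon provider package):

from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Hypothetical Spark step; the class name, jar, and S3 paths are placeholders.
DEEQU_VALIDATION_STEP = [
    {
        "Name": "deequ_data_validation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "com.example.DeequValidationJob",
                "s3://my-artifacts/deequ-validation.jar",
                "--input", "s3://my-data-lake/hints/{{ ds }}/",
            ],
        },
    }
]

submit_deequ_validation = EmrAddStepsOperator(
    task_id="submit_deequ_validation",
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",  # cluster id from an earlier task
    steps=DEEQU_VALIDATION_STEP,
)

wait_for_deequ_validation = EmrStepSensor(
    task_id="wait_for_deequ_validation",
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
    step_id="{{ ti.xcom_pull(task_ids='submit_deequ_validation')[0] }}",  # first (only) submitted step
)

submit_deequ_validation >> wait_for_deequ_validation

Because EmrStepSensor fails when the step fails, a Spark job that exits non-zero on an error-level Deequ check will stop the pipeline at this point.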

Assumptions may be things such as the row count being in an expected range, columns not being null, columns being unique, and so on. There are many possible things to check, and Amazon Deequ is the tool we use to express those checks.

Deequ lets you add Checks, which are specific constraints on the data, such as making sure a column is not null, as mentioned above, or that its values are contained in a list of allowed enums. Checks at the error level cause the pipeline to fail and stop moving forward; there is also a warning level for less critical checks.

val schemaChecks = Check(CheckLevel.Error, "Schema Failure Constraints")
  .isComplete("user_id")   // user_id must never be null
  .isComplete("job_title") // job_title must never be null
  .isUnique("user_id")     // exactly one row per user

Additional metrics can be computed alongside checks; these are called analyzers. For example, one might add an analyzer that measures the uniqueness of a column or counts its distinct values. I use these frequently to build dashboards, and they don’t fail our pipelines.

val analyzerResult = VerificationSuite()
  .onData(data)
  .addRequiredAnalyzers(Seq(Uniqueness("job_title"))) // computes the metric without failing the run
  .run()

There is a third option: anomaly detection. It’s an additional layer of quality checks that are broader than very specific constraints, such as verifying that the null count stays within a certain percentage range or that the row count hasn’t changed by more than some number of rows since the previous run. Like error-level Checks, anomaly checks can cause the pipeline to fail.

VerificationSuite()
  .onData(data)
  .useRepository(metricsRepository) // repository of metrics from previous runs
  .saveOrAppendResult(key)          // key identifying this run's metrics
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateIncrease = Some(100000.0)),
    Size())                         // compare today's row count with the previous one
  .run()

Deequ is not just for data acquired from other teams; I use it after every transformation and before delivering data to clients. This helps guarantee that no bugs were introduced at any point in the pipeline.

When a failure happens, the team must look at the logs to see which quality constraints failed. Deequ doesn’t remove the bad data; it only generates an analysis of the data profile.

Note: Deequ can check that rows exist, but it’s much better to confirm the data exists before running any quality checks. Submitting a job to an EMR cluster just for it to fail because the data isn’t there is a needless expense.

Final Thoughts

There’s a multitude of tools out there providing a full range of functionality. What’s described above doesn’t cover every aspect of data quality or observability, but it can help those getting started to improve data quality.

Our team is currently expanding the use of Deequ for data profiling and metric reporting to build dashboards for a more straightforward analysis of the metrics.

If you’re interested in joining Ancestry, we’re hiring! Feel free to check out our careers page for more info. Also, please see our Medium page to see what Ancestry is up to and read more articles like this.
