A Paved Road for Data Pipelines

Seshu Adunuthula
Intuit Engineering
Aug 18, 2020

Data is a key bet for Intuit as we invest heavily in new customer experiences: a platform that connects experts anywhere in the world with customers and small business owners, a platform that connects to thousands of institutions and aggregates financial information to simplify user workflows, customer care interactions made more effective with data and AI, and more. Data pipelines that capture data from source systems, transform it, and make it available to the machine learning (ML) and analytics platforms are critical for enabling these experiences.

With the move to cloud data lakes, data engineers now have a multitude of processing runtimes and tools available to build these data pipelines. The wealth of choices has led to silos of computation, inconsistent implementations of the pipelines, and an overall reduction in how effectively we extract insights from data. In this blog article, we describe a “paved road” for creating, managing and monitoring data pipelines that eliminates these silos and increases the effectiveness of processing in the data lake.

Processing in the Data Lake

Data is ingested into the lake from a variety of internal and external sources, then cleansed, augmented, transformed, and made available to ML and analytics platforms for insight. We have different types of pipelines to ingest data into the data lake, curate it, transform it, and load it into data marts.

Ingestion Pipelines

A key tenet of our ingestion strategy is to ensure that all data ingested into the data lake is made available in a format that is easily discoverable. We standardized on Parquet as the file format for all ingestion into the data lake, with support for materialization (mutable datasets). The bulk of our datasets are materialized through Intuit’s own materialization engine, though Delta Lake is rapidly gaining momentum as a materialization format of choice. A data catalog built using Apache Atlas is used for searching and discovering datasets, while Apache Superset is used for exploring them.
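
To make this concrete, here is a minimal PySpark sketch of the pattern, assuming illustrative table names and S3 paths (they are not Intuit’s actual datasets): a batch is landed as Parquet, and for mutable datasets it is merged into a Delta Lake table rather than appended.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("ingest-transactions").getOrCreate()

# Land the raw batch as Parquet so it is immediately queryable in the lake.
batch = spark.read.json("s3://raw-bucket/transactions/2020-08-18/")  # illustrative source
batch.write.mode("append").parquet("s3://lake-bucket/ingested/transactions/")

# For mutable datasets, merge the batch into a Delta table so that updates and
# deletes from the source system are reflected in the lake.
target = DeltaTable.forPath(spark, "s3://lake-bucket/curated/transactions_delta/")
(target.alias("t")
    .merge(batch.alias("s"), "t.transaction_id = s.transaction_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```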

ETL Pipelines & Data Streams

Before data in the lake is consumed by the ML and analytics platforms, it needs to be transformed (cleansed, augmented from additional sources, aggregated, and so on). The bulk of these transformations run periodically on a schedule: once a day, once every few hours, etc. As we begin to embrace real-time processing, however, there has been an uptick in converting batch-oriented pipelines to streaming.

Batch processing in the lake is done primarily through Hive and Spark SQL jobs. More complex transformations that cannot be represented in SQL are done using Spark Core. The main engine today for batch processing is AWS EMR. Scheduling of the batch jobs is done through an enterprise scheduler with more than 20K jobs scheduled daily.
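
As an illustration of what one of these scheduled jobs looks like, here is a small, hypothetical Spark SQL transformation (the table names and aggregation are made up) of the kind that runs daily on EMR:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-revenue-rollup").getOrCreate()

# Read from lake tables, aggregate, and write the result back for downstream consumers.
daily_revenue = spark.sql("""
    SELECT customer_id,
           DATE(event_time) AS event_date,
           SUM(amount)      AS total_amount
    FROM   lake.transactions                 -- illustrative input table
    WHERE  DATE(event_time) = DATE_SUB(CURRENT_DATE, 1)
    GROUP  BY customer_id, DATE(event_time)
""")

# Write the result partitioned by date for downstream consumers.
(daily_revenue.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://lake-bucket/marts/daily_revenue/"))   # illustrative output location
```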

Stream pipelines process messages read from a Kafka-based event bus, using Apache Beam to express the analysis of the data streams and Apache Flink as the engine for stateful computation on them.
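
The sketch below shows what a minimal Beam pipeline of this shape could look like in Python; the broker address, topics, and filtering logic are illustrative, and in practice the pipeline would be submitted to a runner such as Flink (e.g., --runner=FlinkRunner).

```python
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Submit with a runner of choice, e.g. --runner=FlinkRunner, to execute on Flink.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> ReadFromKafka(
           consumer_config={"bootstrap.servers": "event-bus:9092"},  # illustrative broker
           topics=["payment-events"])                                # illustrative topic
     | "DecodeJson" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
     | "FilterLargePayments" >> beam.Filter(lambda event: event.get("amount", 0) > 1000)
     | "ReEncode" >> beam.Map(lambda event: (event["customer_id"].encode("utf-8"),
                                             json.dumps(event).encode("utf-8")))
     | "WriteAlerts" >> WriteToKafka(
           producer_config={"bootstrap.servers": "event-bus:9092"},
           topic="large-payment-alerts"))                            # illustrative topic
```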

Why do we need a “Paved Road”?

With the advent of cloud data lakes and open source, data engineers have a wealth of choices for implementing pipelines. For batch computation, users can pick from AWS EMR, AWS Glue, AWS Batch, and AWS Lambda on the AWS side, or Apache Airflow data pipelines, Apache Spark, and more from open-source and enterprise vendors. Data streams can be implemented on AWS Kinesis streams, Apache Beam, Spark Streaming, Apache Flink, and others.

Though choice is a good thing to aspire to, if not applied carefully it can lead to fragmentation and islands of computing. As users adopt different infrastructure and tools for their pipelines, they can inadvertently create silos and inconsistencies in the capabilities across pipelines:

  • Lineage: Different infrastructure and tools provide different levels of lineage (in some cases none at all) and do not integrate with each other. For example, pipelines built using EMR do not share lineage with pipelines built using other frameworks.
  • Pipeline Management: Creating and managing pipelines can be inconsistent across different pipeline infrastructures.
  • Monitoring & Alerting: Monitoring and alerting are not standardized across different pipeline infrastructures.

A Paved Road for Data Pipelines

A paved road for data pipelines provides a consistent set of infrastructure components and tools for implementing them:

  • A standard way to create and manage pipelines.
  • A standard way to promote pipelines from development/QA to production environments.
  • A standard way to monitor, debug, analyze failures, and remediate errors in the pipelines.
  • Pipeline tools such as lineage, data anomaly detection, and data parity checks that work consistently across all pipelines.
  • A small set of execution environments that host the pipelines and provide a consistent experience to their users.

The Paved Road begins with Intuit’s Development Portal where data engineers manage their pipelines.

Intuit Development Portal

Our development portal is the entry point for all developers at Intuit to manage their web applications, microservices, AWS accounts, and other types of assets.

We extended the development portal to allow data engineers to manage their data pipelines. It is a central location for data engineers to create, manage, monitor and remediate their pipelines.

Processors & Pipelines

Processors are reusable code artifacts that represent a task within a data pipeline. In batch pipelines, they correspond to Hive SQL or Spark SQL code that performs a transformation by reading data from input tables in the data lake and writing the transformed data back to another table. In stream pipelines, they read messages from the event bus, transform them, and write them back to the event bus.

Pipelines are a series of processors chained together to perform an activity/job. Batch pipelines are typically scheduled or triggered on the completion of other batch pipelines. Stream pipelines execute when messages arrive on the event bus.
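
A rough sketch of these two abstractions, with hypothetical class names and interfaces (not Intuit’s actual SDK), might look like this:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Processor:
    """A reusable task: reads input tables, runs a SQL transform, writes an output table."""
    name: str
    sql: str
    input_tables: List[str]
    output_table: str

    def run(self, spark):
        # The declared inputs/outputs double as lineage metadata for the pipeline tools.
        result = spark.sql(self.sql)
        result.write.mode("overwrite").saveAsTable(self.output_table)


@dataclass
class Pipeline:
    """A series of processors chained together and executed in order."""
    name: str
    processors: List[Processor] = field(default_factory=list)
    schedule: str = "@daily"  # or triggered by an upstream pipeline / event-bus message

    def run(self, spark):
        for processor in self.processors:
            processor.run(spark)
```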

Defining the Pipelines

Intuit data engineers create pipelines using the data pipeline widget in the development portal. During the pipeline creation, data engineers implement the pipeline’s processors, define its schedule, and specify upstream dependencies or additional triggers required for initiating the pipelines.

Processors within a pipeline specify the datasets they work on and the datasets they output to define the lineage. Pipelines are defined in development/QA environments, tested and promoted to production.
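
Purely as an illustration (the field names below are hypothetical, not Intuit’s actual pipeline specification), a declarative definition like the following captures the processors, their input and output datasets, the schedule, and upstream dependencies, and dataset-level lineage falls directly out of it:

```python
# Hypothetical declarative pipeline definition; field names are illustrative.
pipeline_spec = {
    "name": "daily_revenue_rollup",
    "schedule": "0 6 * * *",                      # or omit and trigger on upstream completion
    "upstream_dependencies": ["transactions_ingest"],
    "environment": "qa",                          # promoted to "prod" after testing
    "processors": [
        {
            "name": "cleanse_transactions",
            "inputs": ["lake.transactions_raw"],
            "outputs": ["lake.transactions_clean"],
        },
        {
            "name": "aggregate_daily_revenue",
            "inputs": ["lake.transactions_clean", "lake.customers"],
            "outputs": ["marts.daily_revenue"],
        },
    ],
}

# Dataset-level lineage edges derived from the declared inputs and outputs.
lineage_edges = [(src, proc["name"], out)
                 for proc in pipeline_spec["processors"]
                 for src in proc["inputs"]
                 for out in proc["outputs"]]
```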

Managing the Pipelines

From the development portal, users can navigate to their pipelines and manage them. Each pipeline has a custom monitoring dashboard that displays its currently active instances as well as historical instances. The dashboard also has widgets for metrics such as execution time, CPU, and memory usage. A pipeline-specific logging dashboard lets users inspect the pipeline logs and debug in case of errors.

Users can edit the pipelines to add or delete processors, change the schedules and upstream dependencies, etc. as a part of the day-to-day operations for managing the pipelines.

Pipeline Execution Environments

The primary execution environment for our batch pipelines is AWS EMR. These pipelines are scheduled using an enterprise scheduler. This environment has been the workhorse and will continue to remain so, but it has started to show its age. The scheduler was built for an on-premises enterprise world and has struggled to make the transition to the cloud. Hadoop/YARN, which form the basis of AWS EMR, have not kept pace with advances in container runtimes. In the target state for batch pipelines, we are working towards execution environments that are optimized for container runtimes and cloud-native schedulers.

We’re also investing in reducing the friction of switching pipelines from one execution environment to another. To change the execution environment, for example from Hadoop/YARN to Kubernetes, all a data engineer needs to do is redeploy the pipeline to the new environment.

Pipeline Tools

A key aspect of a paved road is a comprehensive set of tools for capabilities such as lineage, data parity, anomaly detection, etc. Consistency of tools across all pipelines and their execution environments is crucial for increasing the value we extract from the data and confidence/trust we instill in consumers of this data.

Lineage

A lineage tool is critical to data engineers’ productivity and their ability to operate data pipelines: it tracks the lineage of every pipeline from the source systems, through the ingestion frameworks and the data lake, to the analytics and reporting systems.

Data Anomaly Detection

Another important tool in the data pipeline arsenal is data anomaly detection. There is a multitude of anomalies to consider, including stale data (freshness), no new data arriving, and missing or duplicated data.

Data anomaly detection tools increase the confidence/trust in the correctness of the data. The anomaly detection algorithms model seasonality, dynamically adjust thresholds, and alert consumers when anomalies are detected.
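
As a minimal sketch of the idea (the actual detection models are more sophisticated and are not described here), a volume check might compare today’s row count against a same-weekday baseline, a crude way of handling weekly seasonality, and alert when it drifts beyond a dynamic threshold:

```python
import statistics

def detect_volume_anomaly(daily_counts, new_count, z_threshold=3.0):
    """Flag today's row count if it deviates too far from the recent baseline.

    daily_counts: row counts for the same weekday over recent weeks (a crude
    model of weekly seasonality); new_count: today's observed count.
    """
    baseline = statistics.mean(daily_counts)
    spread = statistics.stdev(daily_counts) or 1.0
    z_score = (new_count - baseline) / spread
    return abs(z_score) > z_threshold, z_score


# Example: counts from the last six Mondays vs. today's (Monday) load.
history = [98_200, 101_400, 99_800, 102_100, 100_500, 99_300]
is_anomaly, score = detect_volume_anomaly(history, 62_000)
if is_anomaly:
    print(f"Volume anomaly detected (z={score:.1f}); alerting dataset consumers")
```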

Data Parity

Data parity checks are performed at multiple stages of the data pipelines to ensure the correctness of the data as it flows through the pipeline. Parity checks are another key capability for addressing compliance requirements such as SOX.
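
One simple form such a check can take, sketched below with illustrative table names, is comparing row counts and an aggregate checksum between the tables on either side of a stage that should preserve every record:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parity-check").getOrCreate()

def parity_metrics(table, amount_col="amount"):
    """Row count plus a simple aggregate 'checksum' over a business-critical column."""
    row = (spark.table(table)
           .agg(F.count("*").alias("rows"),
                F.sum(F.hash(F.col(amount_col))).alias("checksum"))
           .first())
    return row["rows"], row["checksum"]

# Illustrative stage boundary: the ingested table vs. the table derived from it.
source_rows, source_sum = parity_metrics("lake.transactions_raw")
target_rows, target_sum = parity_metrics("lake.transactions_clean")

if (source_rows, source_sum) != (target_rows, target_sum):
    raise ValueError(
        f"Parity check failed: source={source_rows}/{source_sum}, "
        f"target={target_rows}/{target_sum}")
```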

Conclusion & Future Work

Intuit has thousands of data pipelines that span all business units and various functions such as marketing, customer success, and risk. These pipelines are critical to enabling data-driven experiences. The paved road described here provides a consistent environment for managing them, but it’s only the beginning of our data pipeline journey.

Pipelines & Entity Graphs

Data lakes are a collection of thousands of tables that are hard to discover and explore, poorly documented, and difficult to use because they don’t capture the entities that describe a business and the relationships between them. In the future, we envision entity graphs that represent how businesses use data and extract insights from it. The data pipelines that acquire, transform, and serve data will evolve to understand these entity graphs.

Data Mesh

In her paper on “Distributed Data Mesh,” Zhamak Dehghani, principal consultant, member of technical advisory board, and portfolio director at ThoughtWorks, lays the foundation for domain-oriented decomposition and ownership of data pipelines. To realize the vision of a data mesh and successfully enable domain owners to define and manage their own data pipelines, the “paved road” for data pipelines described here is a foundational stepping stone.


Seshu Adunuthula is the Head of Data Lake Infrastructure at Intuit. He is passionate about the Next Generation Data Platform built on public cloud.