A Paved Road for Data Pipelines
Data is a key bet for Intuit as we invest heavily in new customer experiences: a platform that connects experts anywhere in the world with customers and small business owners, a platform that connects to thousands of institutions and aggregates financial information to simplify user workflows, and customer care interactions made more effective through data and AI. Data pipelines that capture data from source systems, transform it, and make it available to the machine learning (ML) and analytics platforms are critical to enabling these experiences.
With the move to cloud data lakes, data engineers now have a multitude of processing runtimes and tools available for building these data pipelines. This wealth of choices has led to silos of computation, inconsistent pipeline implementations, and an overall reduction in how efficiently data insights can be extracted. In this blog article we describe a “paved road” for creating, managing and monitoring data pipelines that eliminates these silos and increases the effectiveness of processing in the data lake.
Processing in the Data Lake
Data is ingested into the lake from a variety of internal and external sources, then cleansed, augmented, transformed and made available to the ML and analytics platforms for insights. We run different types of pipelines: pipelines that ingest data into the data lake, curate the data, transform it, and load it into data marts.
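The stages above (ingest, curate, transform) can be sketched as a simple chain of functions over records. This is a minimal illustration of the pattern, not Intuit's actual implementation; the field names and cleansing rule are hypothetical.

```python
from typing import Callable

Record = dict
Stage = Callable[[list[Record]], list[Record]]

def ingest(rows: list[Record]) -> list[Record]:
    # Capture raw rows from a source system as-is.
    return list(rows)

def curate(rows: list[Record]) -> list[Record]:
    # Cleansing step: drop rows missing a required field.
    return [r for r in rows if r.get("amount") is not None]

def transform(rows: list[Record]) -> list[Record]:
    # Derive a field for downstream analytics consumers.
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]

def run_pipeline(rows: list[Record], stages: list[Stage]) -> list[Record]:
    # Apply each stage in order, feeding its output to the next.
    for stage in stages:
        rows = stage(rows)
    return rows

result = run_pipeline(
    [{"id": 1, "amount": 12.5}, {"id": 2, "amount": None}],
    [ingest, curate, transform],
)
```

Each pipeline type in the lake composes stages like these, differing mainly in sources, sinks and the runtime executing them.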
A key tenet of data transformation is to ensure that all data ingested into the data lake is made available in a format that is easily discoverable. We standardized on Parquet as the file format for all ingestion into the data lake, with support for materialization (mutable datasets). The bulk of our datasets are materialized through Intuit’s own materialization engine, though Delta Lake is rapidly gaining momentum as a materialization format of choice. A data catalog built on Apache Atlas is used to search for and discover datasets, while Apache Superset is used to explore them.
ETL Pipelines & Data Streams
Before data in the lake is consumed by the ML and analytics platforms, it needs to be transformed…