Embrace the Data Lake Architecture
Oftentimes, data engineers build data pipelines to extract data from external sources, transform it, and enable other parts of the organization to query the resulting datasets. While it’s easier in the short term to build all of this as a single-stage pipeline, a more thoughtful data architecture is needed to scale this model to thousands of datasets spanning multiple terabytes or petabytes.
First, let’s understand some common pitfalls with the simplistic approach detailed above.
- Scalability: Typically, the input data to such a pipeline is obtained by scanning upstream databases (RDBMSes or NoSQL stores), which are poorly equipped to balance performance across a mix of OLAP and regular OLTP workloads. Going directly to these source systems from each such pipeline would ultimately stress those systems and even result in outages.
- Data Quality: Source systems use different data formats (e.g. JSON, Thrift, Avro) and data models (e.g. relational, document-oriented). Accessing such data directly leaves little room for standardization across pipelines (e.g. standard timestamps, key fields) and causes constant breakages due to the lack of agreed-upon schemas describing the data.
- Engineering Velocity: An important, but often ignored, aspect of authoring data pipelines is managing the backfilling/recomputation of output data when the transformation/business logic changes. When managing thousands of pipelines, such recomputation off the source systems would have to be carefully coordinated across teams, due to the tight coupling of business logic with source data.
- Data Silos: Finally, not all data is available in a single place for data scientists or other data consumers in the organization to cross-correlate for insights (e.g. joining a database table with CRM data). Even for individual sources, since only selected fields are extracted, data scientists might, for example, miss out on important features that could improve machine learning models.
In recent years, the Data Lake architecture has grown in popularity. In this model, source data is first extracted with little-to-no transformation into a first set of raw datasets. The goal of these raw datasets is to faithfully model an upstream source system and its data, but in a way that scales for OLAP workloads (e.g. using columnar file formats). All data pipelines that express business-specific transformations are then executed on top of these raw datasets instead.
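A minimal sketch of such an ingestion step, in plain Python, illustrates the principle: land source records as-is, adding only standard metadata, and keep business logic out. The dataset and source names here are hypothetical, and a real implementation would write columnar files (e.g. Parquet) rather than an in-memory list.

```python
from datetime import datetime, timezone

def ingest_raw(source_rows, raw_store):
    """Land source records into a raw dataset with little-to-no transformation.

    Only standard metadata fields are added; business-specific transformation
    happens later, in pipelines that read off this raw dataset.
    """
    batch_ts = datetime.now(timezone.utc).isoformat()
    for row in source_rows:
        record = dict(row)                 # preserve all source fields as-is
        record["_ingested_at"] = batch_ts  # standardized ingestion timestamp
        record["_source"] = "orders_db"    # hypothetical source identifier
        raw_store.append(record)
    return len(source_rows)

raw_orders = []
count = ingest_raw(
    [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}],
    raw_orders,
)
```

Because every field from the source survives into the raw dataset, downstream consumers are never locked out of data that the original pipeline author didn't anticipate needing.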
This approach has several advantages over the simplistic approach, avoiding all the pitfalls listed above.
- Scalable Design: Since the data is extracted once from each source system, the additional load on the source system is drastically reduced. Further, the extracted data is stored in optimized file formats on petabyte-scale storage systems (e.g. HDFS, cloud stores), which are specifically optimized for OLAP workloads.
- Standardized & Schematized: During ingestion of the raw datasets, a number of standardization steps can be performed and a schema can be enforced on the data, validating both structural integrity and semantics. This prevents bad data from ever making its way into the data lake.
- Nimble: Data engineers can develop, test, and deploy changes to transformation/business logic independently, with access to large-scale parallel processing using thousands of compute cores.
- Unlock Data: Such a data lake houses all source data next to each other, with rich access to SQL and other tools to explore the data, derive business insights by joining datasets, and even produce derived datasets. Machine learning models gain unfettered access to all of the data.
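The cross-source insight mentioned earlier (joining a database table with CRM data) becomes a plain SQL query once both datasets live in the lake. The sketch below uses Python's built-in sqlite3 as a stand-in for a lake SQL engine; the table names and data are invented for illustration.

```python
import sqlite3

# Two raw datasets landed side by side in the lake: orders from a
# transactional database and accounts from a CRM export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INT, account_id INT, amount REAL)")
conn.execute("CREATE TABLE raw_crm_accounts (account_id INT, segment TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 100, 20.0), (2, 100, 35.5), (3, 200, 10.0)])
conn.executemany("INSERT INTO raw_crm_accounts VALUES (?, ?)",
                 [(100, "enterprise"), (200, "self-serve")])

# A cross-source question neither system could answer alone:
# revenue broken down by CRM segment.
rows = conn.execute("""
    SELECT c.segment, SUM(o.amount)
    FROM raw_orders o
    JOIN raw_crm_accounts c ON o.account_id = c.account_id
    GROUP BY c.segment ORDER BY c.segment
""").fetchall()
# rows == [('enterprise', 55.5), ('self-serve', 10.0)]
```

In a real lake the same query would run over Hive/Presto/Spark SQL against the raw tables, but the point is identical: once the silos are gone, a join is all it takes.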
In this section, we will look into further improvements to the data lake architecture and its implementation, based on lessons learned from actually building large data lakes.
- Consider ingesting databases incrementally, using change-capture systems or even JDBC-based approaches (e.g. Apache Sqoop), to improve data freshness as well as further reduce load on the databases.
- Standardize both event streams from applications and such database change streams into a single event bus (e.g. Apache Kafka, Apache Pulsar).
- Ingest the raw datasets off the event bus using technologies that support upserts, to compact database changes into a snapshot (e.g. Apache Kudu, Apache Hudi, Apache Hive ACID).
- Design the raw datasets to support efficient retrieval of new records (similar to change capture), by either partitioning the data (e.g. using the Apache Hive metastore) or using a system that supports change streams (e.g. Apache Hudi).
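The last two points can be sketched in a few lines of plain Python: an upsert that compacts change events into a keyed snapshot, and an incremental read that pulls only records changed after a checkpoint. The event shape and field names here are illustrative, not the actual API of Hudi, Kudu, or Hive ACID, which implement the same ideas at file/table scale.

```python
def compact_changes(snapshot, change_events):
    """Upsert a stream of change-capture events into a keyed snapshot.

    Each event carries a record key, a monotonically increasing commit
    timestamp, and the new row (None signals a delete/tombstone).
    """
    for event in change_events:
        key, commit_ts, row = event["key"], event["commit_ts"], event["row"]
        if row is None:
            snapshot.pop(key, None)  # tombstone: drop the record
        else:
            # Latest write wins; stamp the record with its commit time.
            snapshot[key] = {**row, "_commit_ts": commit_ts}
    return snapshot

def incremental_read(snapshot, since_ts):
    """Pull only records committed after a checkpoint, so downstream
    pipelines can consume the raw dataset itself like a change stream."""
    return [r for r in snapshot.values() if r["_commit_ts"] > since_ts]

snap = {}
compact_changes(snap, [
    {"key": 1, "commit_ts": 10, "row": {"id": 1, "status": "new"}},
    {"key": 2, "commit_ts": 11, "row": {"id": 2, "status": "new"}},
    {"key": 1, "commit_ts": 12, "row": {"id": 1, "status": "shipped"}},  # update
    {"key": 2, "commit_ts": 13, "row": None},                            # delete
])
changed = incremental_read(snap, since_ts=11)
```

After compaction the snapshot holds a single record (id 1, status "shipped"), and the incremental read returns just that record, since it is the only one committed after the checkpoint.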
Disclaimer: I am a member of the PPMC of Apache Hudi, as well as its co-creator.