Universal Data Lake: The Future of Data
The Current State of Data Lakes
Today, data lakes are used as a dumping ground for production data. This can include clickstream or ad data and other types of external data, as well as data sourced from a production database like MySQL, Postgres, MongoDB, Cassandra, or DynamoDB. Typically, consumers of data from the lake (such as analytics, machine learning, and business teams) re-ingest data — which is a euphemism for transforming data into a proprietary format — into the optimized formats of their choice and process it so that they can run their applications.
Rather than re-ingest data into one execution framework (whether that be Spark, Hive, plain MapReduce, Presto, Athena, SageMaker, or something else), consumers typically re-ingest into several. The reasoning here is that each framework has its own pros and cons, so using more can help mitigate some of the risks of sticking with just one. But this approach isn't without its own downsides, as relying on several frameworks can cost data portability and reusability.
Using several execution frameworks also creates several copies of the data. Additionally, almost all of the execution frameworks have their own proprietary metastores to store the metadata associated with their copy of the data. For example, Spark as one consumer keeps its own metastore and local copies of the data, while Presto as a second consumer maintains its own metastore and local copies as well.
Some obvious challenges with this model are not only the multiple copies of raw data floating around but also the difficulty of keeping these copies in sync and ensuring that the latest version of the data is used by every consumer. Consumers of Presto could see very different results from similar queries in Spark, depending on when the data was ingested into Spark versus Presto. From an OpEx perspective, there are multiple batch jobs consuming data from the data lake to create custom copies, not to mention the storage overhead of storing and accessing these copies for different execution frameworks.
What’s Next: Universal Data Lake
A universal data lake is a concept where data is ingested only once and is consumed directly from the data lake without further re-ingestion. In this approach, engines such as Spark, Athena, and Presto are used purely as execution frameworks that can reuse data in the data lake as well as persist data back to the lake in a portable manner for use by a different framework. Some of the technical goals could include:
- Data portability: Minimize the movement of data out of the lake and reduce the number of copies of data across various execution environments.
- Data reusability: Store data in an open-source format that translates across various frameworks; ideally, consumers would also be able to pass in-memory data frames between execution frameworks (see the sketch after this list).
- Globally accessible data storage: Store data in storage locations that are globally accessible while providing sufficiently robust RBAC infrastructure to support granular access control.
- Globally accessible metadata: Ensure that metadata is global and accessible across different frameworks.
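To make the reusability goal concrete, here is a minimal sketch: data is persisted once in Parquet on S3, and two different consumers read the same files directly, with no re-ingestion. The bucket, paths, and columns are made up for illustration.

```python
# Write-once, read-from-many: one producer persists data in an open format,
# and different consumers read the same files without creating new copies.
# Bucket names, paths, and the schema below are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-once").getOrCreate()

# Producer: persist a dataset to the lake once, in Parquet.
orders = spark.createDataFrame(
    [(1, "2024-01-01", 42.0), (2, "2024-01-02", 17.5)],
    ["order_id", "order_date", "amount"],
)
orders.write.mode("overwrite").parquet("s3a://example-lake/curated/orders/")

# Consumer A: another Spark job reads the same files directly.
spark.read.parquet("s3a://example-lake/curated/orders/").show()

# Consumer B: a non-Spark consumer (e.g., a notebook using pandas/pyarrow,
# with s3fs or pyarrow S3 support installed) reads the very same files.
df = pd.read_parquet("s3://example-lake/curated/orders/")
```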
Some of the business goals could include:
- OpEx reduction: Reduce spend on redundant batch jobs and duplicate copies of data across execution environments.
- Consumer flexibility: Provide flexibility for consumers to use different types of execution frameworks depending on their business and technical requirements.
- Minimization of data discrepancies: Ensure that all consumers work on the same, current version of the data.
The Architecture of a Universal Data Lake
To meet the above requirements of a universal data lake, the data lake should have the following features:
- Easily accessible data: This implies that data needs to be stored in a storage system that can be accessed concurrently by various instances. Object stores such as S3 or Azure Blob Storage provide an excellent storage solution since they are affordable and accessible by a huge number of instances simultaneously without a “mount” process.
- A universal data format: Data stored in open-source columnar formats such as Parquet, or open table formats such as Delta Lake, is accessible by most execution frameworks.
- An external metastore: A Hive metastore backed by MySQL or Postgres is read/write accessible to all of the execution frameworks, providing global schema access (see the configuration sketch after this list).
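As a concrete illustration of how these pieces fit together, the sketch below configures Spark against a shared external Hive metastore and an S3-backed warehouse. The metastore URI and bucket names are placeholders, not a prescribed setup.

```python
# Wiring one execution framework (Spark) to the shared pieces of a universal
# data lake: S3 for storage and an external Hive metastore for globally
# accessible metadata. The metastore URI and bucket below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("universal-lake-consumer")
    # Point Spark at the shared Hive metastore (e.g., Hive + MySQL on RDS).
    .config("hive.metastore.uris", "thrift://metastore.example.internal:9083")
    # Keep table data in the lake itself, not in a framework-local warehouse.
    .config("spark.sql.warehouse.dir", "s3a://example-lake/warehouse/")
    .enableHiveSupport()
    .getOrCreate()
)

# Any table registered here is visible to every framework that reads the same
# metastore, and the underlying files stay in open Parquet on S3.
spark.sql("SHOW TABLES").show()
```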
Implementing a Universal Data Lake
For background, at BFA we use a microservices-based architecture on AWS to serve our production data. Data is ingested from the production databases into the data lake, which is hosted on S3, using DMS and our event infrastructure. We use Spark to preprocess and ETL the data and to create the final version of the data in Parquet. The metastore is hosted on RDS using Hive and MySQL. And since our analytics team, machine learning team, and various business teams all consume data from the data lake, IAM role-based access helps fence off the data and provide siloed access to the relevant teams.
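A rough sketch of that ETL step might look like the following; the paths, database, table, and columns are hypothetical, and the actual transformations differ per dataset. The key point is that the final Parquet data lands in the lake and is registered in the shared metastore rather than in a framework-local copy.

```python
# Read raw data staged by the ingest pipeline, clean it, and persist the final
# version as Parquet registered in the shared Hive metastore so downstream
# teams can consume it directly. All names and paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-etl").enableHiveSupport().getOrCreate()

raw = spark.read.parquet("s3a://example-lake/raw/dms/orders/")

curated = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_ts"))
)

spark.sql("CREATE DATABASE IF NOT EXISTS curated")

(
    curated.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("order_date")
    .option("path", "s3a://example-lake/curated/orders/")  # data stays in the lake
    .saveAsTable("curated.orders")  # metadata goes to the shared metastore
)
```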
Because all of our infrastructure is hosted on AWS, there are several advantages to making the data universal. Data is stored in S3, which lets us create suitable IAM roles and policies to provide granular access to the data there. Additionally, data in S3 is accessible across the various EC2 instances and EKS infrastructure used for compute. The metastore is hosted externally in RDS and, again, is globally accessible, while the data formats are Parquet and Delta Lake. By making the metastore externally accessible and ensuring that data is always persisted in Parquet, consumers can use different types of execution frameworks to achieve their objectives. The machine learning team can use both MLlib and AWS SageMaker to build and train models, and the analytics team can use Athena for interactive queries and dashboarding without having to re-ingest data into an Athena-specific format. Batch queries are executed using Spark to create views and projections. The data persisted by any of the execution frameworks can be reused by other frameworks.
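As one example of a consumer that needs no re-ingestion, the sketch below runs an interactive Athena query over the same curated Parquet table. It assumes the table is visible to Athena's catalog (for example through the Glue Data Catalog or Athena's external Hive metastore support); the database, table, bucket, and region are placeholders rather than our exact configuration.

```python
# An analytics consumer querying the curated Parquet data interactively with
# Athena, with no Athena-specific copy of the data. Names are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT order_date, SUM(amount) AS revenue "
        "FROM curated.orders GROUP BY order_date"
    ),
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://example-lake/athena-results/"},
)
print(response["QueryExecutionId"])
```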
A key benefit is that the freshness of the data is determined by the ingest infrastructure, since all consumers read data directly from the lake. By making the ingest infrastructure event-based, near real-time data is available in the data lake. Any views or projections created by one execution framework are available for consumption by a different framework. Given that OpEx is a major factor in the cloud environment, this architecture reduces spend by significantly reducing the number of batch jobs that need to be executed to copy data into different types of execution environments.
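To illustrate the event-based ingest idea, here is a minimal sketch of a streaming job that continuously appends files to the lake so every downstream framework sees near real-time data. A Kafka-compatible event stream, the topic, and the paths are assumptions for illustration, not our exact ingest stack, and the job requires the Spark Kafka connector package.

```python
# Event-based ingest keeping the lake fresh: a Spark Structured Streaming job
# reads from an event stream and continuously appends Parquet files to S3.
# The broker, topic, and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-ingest").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.internal:9092")
    .option("subscribe", "orders-events")
    .load()
)

query = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream
    .format("parquet")
    .option("path", "s3a://example-lake/raw/orders-events/")
    .option("checkpointLocation", "s3a://example-lake/checkpoints/orders-events/")
    .start()
)
query.awaitTermination()
```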
In conclusion, a universal data lake provides several advantages to the consumers of data in terms of choice of execution frameworks while also ensuring that data is portable and reusable.