By Gil Benghiat
In the world of analytics and big data, the term ‘data lake’ is getting increased press and attention. At the same time, the idea of a data lake is surrounded by confusion and controversy. The industry quips about the data lake getting out of control and turning into a data swamp. What is a data lake and what is it good for? How do I build one? This post will give DataKitchen’s practitioner view of a data lake and discuss how a data lake can be used and not abused.
Introduction to the Data Lake
DataKitchen sees the data lake as a design pattern. Design patterns are formalized best practices that one can use to solve common problems when designing a system. Most simply stated, a data lake is the practice of storing data that comes directly from a supplier or an operational system.
Data Lakes have four key characteristics:
- The data is unprocessed (ok, or lightly processed).
- The data is saved as long as possible.
- Data lakes are coupled with the ability to manage the transformations of the data. More on transformations later.
- They can support schema on read.
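As a minimal sketch of schema on read, assuming the raw data is CSV text, the example below stores nothing but strings and lets each consumer apply its own types when it reads. The sample data, the `FINANCE_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Raw text exactly as a supplier might deliver it; nothing is typed at write time.
RAW_CSV = "order_id,amount,order_date\n1001,19.99,2024-01-15\n1002,5.00,2024-01-16\n"

# Hypothetical schema: each consumer declares the fields and types it needs.
FINANCE_SCHEMA = {"order_id": int, "amount": float}


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, keeping and casting only the fields in the schema."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {field: cast(row[field]) for field, cast in schema.items()}
        for row in reader
    ]


rows = read_with_schema(RAW_CSV, FINANCE_SCHEMA)
print(rows)
```

A different consumer could read the same raw text with a different schema, for instance one that keeps `order_date`, without anyone rewriting the stored data.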
Many assume that the only way to implement a data lake is with HDFS and that the data lake is just for Big Data. Not true! There are many technology choices, and not every lake has to contain Big Data. The data lake pattern is also ideal for “Medium Data” and “Little Data”. Technology choices include HDFS, AWS S3, distributed file systems, and others. Many vendors, such as Microsoft, Amazon, EMC, Teradata, and Hortonworks, sell these technologies. Finally, data lakes can be on premises or in the cloud.
Data Lake Operational Components
The following diagram shows the complete data lake pattern:
On the left are the data sources. These can be operational systems, like the SalesForce.com customer relationship management system or the NetSuite inventory management system. Other example data sources are syndicated data from IMS or Symphony, zip-code-to-territory mappings, or groupings of products into a hierarchy. There needs to be some process that loads the data into the data lake.
The data lake should hold all the raw data in its unprocessed form and data should never be deleted. That said, if there are space limitations, data should be retained for as long as possible.
The data transforms shape the raw data for each need and put it into a data mart or data warehouse on the right of the diagram. For the remainder of this post, we will call the right side the data warehouse. For example, looking at two uses for sales data, one transformation may create a data warehouse that combines the sales data with the full region-district-territory hierarchy, and another transformation may create a data warehouse with aggregations at the region level for fast and easy export to Excel. Sometimes one team requires extra processing of existing data. To meet that need, one would string two transformations together and create yet another purpose-built data warehouse. Finally, the transformations should contain Data Tests so the organization has high confidence in the resultant data warehouse.
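The region-level aggregation and its Data Tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the `TERRITORY_TO_REGION` mapping (with the district level collapsed for brevity), and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw sales rows as they might sit in the data lake.
RAW_SALES = [
    {"territory": "T1", "amount": 100.0},
    {"territory": "T2", "amount": 250.0},
    {"territory": "T3", "amount": 75.0},
]

# Hypothetical territory-to-region mapping.
TERRITORY_TO_REGION = {"T1": "East", "T2": "East", "T3": "West"}


def aggregate_by_region(rows, territory_to_region):
    """Transform: roll raw sales up to region-level totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[territory_to_region[row["territory"]]] += row["amount"]
    return dict(totals)


def run_data_tests(rows, region_totals):
    """Data tests: fail loudly if the output disagrees with the raw input."""
    raw_total = sum(row["amount"] for row in rows)
    assert abs(sum(region_totals.values()) - raw_total) < 1e-9, "totals differ"
    assert all(total >= 0 for total in region_totals.values()), "negative total"


region_totals = aggregate_by_region(RAW_SALES, TERRITORY_TO_REGION)
run_data_tests(RAW_SALES, region_totals)
print(region_totals)
```

Because the tests compare the output against the untouched raw rows, they can be rerun unchanged every time the transformation is updated.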
Once the data is ready for each need, data analysts and data scientists can access the data with their favorite tools such as Tableau, Excel, QlikView, Alteryx, R, SAS, SPSS, etc. The organization can also use the data for operational purposes such as automated decision support or to drive the content of email marketing.
Getting started with a Data Lake
Resist the urge to fill the data lake with all available data from the entire enterprise (and create the Great Lake :-). Place only data sets that you need in the data lake and only when there are identified consumers for the data. However, if you need some fields from a source, add all fields from that source since you are incurring the expense to implement the integration. Once a data source is in the data lake, work in an Agile way with your customers to select just enough data to be cleaned, curated, and transformed into a data warehouse. Designers often use a Star Schema for the data warehouse.
How to use a Data Lake
A handy practice is to place certain metadata into the name of the object in the data lake. Your situation may merit including a data arrival time stamp, source name, confidentiality indication, retention period, and data quality. For example:
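One hypothetical convention, sketched in Python, might look like this. The path layout, the field order, and the `build_object_name` and `parse_object_name` helpers are assumptions made for illustration, not DataKitchen's actual scheme, and the data quality field is left out for brevity.

```python
from datetime import datetime, timezone


def build_object_name(source, dataset, arrived_at, confidentiality, retention_days):
    """Compose a lake object name that carries its own metadata."""
    stamp = arrived_at.strftime("%Y%m%dT%H%M%SZ")
    return f"{source}/{dataset}/{stamp}_{confidentiality}_ret{retention_days}d.csv"


def parse_object_name(name):
    """Recover the metadata from a name produced by build_object_name."""
    source, dataset, filename = name.split("/")
    stem = filename.rsplit(".", 1)[0]
    # Assumes the confidentiality label itself contains no underscore.
    stamp, confidentiality, retention = stem.split("_")
    arrived_at = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return {
        "source": source,
        "dataset": dataset,
        "arrived_at": arrived_at,
        "confidentiality": confidentiality,
        "retention_days": int(retention[3:-1]),  # strip "ret" prefix and "d" suffix
    }


arrival = datetime(2024, 1, 15, 8, 30, 0, tzinfo=timezone.utc)
name = build_object_name("salesforce", "accounts", arrival, "internal", 365)
print(name)
print(parse_object_name(name))
```

Keeping the name parseable means a transform can filter on arrival time or confidentiality without opening the object.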
In the data lake pattern, the transforms are dynamic and fluid and should quickly evolve to keep up with the demands of the analytic consumer. As requirements change, simply update the transformation and create a new data mart or data warehouse. These are examples of events that merit a transformation update:
- More data fields are required in the data warehouse from the data lake
- New transformation logic or business rules are needed
- A new aggregation is needed
- A better data cleaning implementation is available
- A new data source is required
Once the new data warehouse is created and it passes all of the data tests, the operations person can swap it for the old data warehouse.
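A minimal sketch of that swap, assuming consumers find the live warehouse through a small pointer file: the candidate is published only when every data test passes, otherwise the old warehouse stays live. The pointer file, the test functions, and `publish_if_valid` are hypothetical names for illustration; a real deployment might swap a database schema or view instead.

```python
import json
import os
import tempfile


def publish_if_valid(candidate_rows, data_tests, pointer_path, candidate_name):
    """Point consumers at the new warehouse only if every data test passes."""
    failures = [test.__name__ for test in data_tests if not test(candidate_rows)]
    if failures:
        return failures  # the old warehouse stays live
    tmp_path = pointer_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"current": candidate_name}, handle)
    os.replace(tmp_path, pointer_path)  # replace the pointer file in one step
    return []


def has_rows(rows):
    return len(rows) > 0


def no_negative_amounts(rows):
    return all(row["amount"] >= 0 for row in rows)


workdir = tempfile.mkdtemp()
pointer = os.path.join(workdir, "current_warehouse.json")
tests = [has_rows, no_negative_amounts]

failed = publish_if_valid([{"amount": 10.0}], tests, pointer, "sales_v2")
print(failed)
```

Writing the pointer to a temporary file and then renaming it means readers see either the old warehouse name or the new one, never a half-written file.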
The final use of the data lake is the ability to implement a “time machine”, namely the ability to re-create a data warehouse as it stood at a given point in time in the past. Remember, the date is embedded in the data’s name. A best practice is to parameterize the data transforms so they can be programmed to grab any time slice of data.
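A parameterized time slice can be sketched as a function that takes an "as of" date and selects only the objects that had arrived by then. The object names below, which assume an ISO arrival date at the start of each name, and the `objects_as_of` helper are hypothetical.

```python
from datetime import date

# Hypothetical lake contents: each object name begins with its ISO arrival date.
LAKE_OBJECTS = [
    "2024-01-01_sales.csv",
    "2024-02-01_sales.csv",
    "2024-03-01_sales.csv",
]


def objects_as_of(object_names, as_of):
    """Return the objects that had arrived on or before the as_of date."""
    selected = []
    for name in object_names:
        arrived = date.fromisoformat(name.split("_", 1)[0])
        if arrived <= as_of:
            selected.append(name)
    return sorted(selected)


print(objects_as_of(LAKE_OBJECTS, date(2024, 2, 15)))
```

Feeding the selected objects into the same transformations that build today's data warehouse would rebuild it as it stood on the chosen date.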
How to abuse a Data Lake
There are four ways to abuse a data lake and get stuck with a data swamp!
First, create a data lake without also crafting data warehouses. This would put the entire task of data cleaning, semantics, and data organization on all of the end users for every project. It is imperative to have a group of Data Engineers who manage the transformations and, in doing so, superpower the Data Analysts and Data Scientists. That said, the analytic consumers should have access to the data lake so they can experiment, innovate, or simply get at the data they need to do their job.
Second, as mentioned above, it is an abuse of the data lake to pour data in without a clear purpose for the data.
Third, ignore data governance including data semantics, quality, and lineage.
Finally, do not put any access controls on the data lake. For example, if a public company puts all of its financial information in a data lake open to all employees, then all employees suddenly become Wall Street insiders. Not good.
DataKitchen does not see the data lake as a particular technology. The data lake is a design pattern that can superpower your analytic team if used and not abused. If you are interested in data lakes in S3, let us know.