Detailed Differences between Data Lakes and Data Warehouses

Data Warehouse

It represents an abstracted picture of the business organized by subject area.

It is structured and highly transformed.

Data is not loaded into the data warehouse until a use for it has been defined.

Data Lake

All data is loaded from source systems. No data is turned away.

Data is stored at the leaf level in an untransformed or nearly untransformed state.

Data is transformed and schema is applied to fulfill the needs of analysis.

Next, let’s highlight five key differentiators of a data lake and how they contrast with the data warehouse approach.

Data Lakes Retain All Data

During the development of a data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes and profiling data. The outcome is a highly structured data model designed for reporting. A big part of this process involves deciding which data to include in the warehouse and which to leave out. Often, if data isn’t used to answer specific questions, it may be excluded from the warehouse. This is generally done to simplify the data model and also to conserve space on the expensive disk storage that is used to make the data warehouse performant.

In contrast, the data lake keeps ALL data: not only data that is in use today, but also data that may be used and even data that may never be used, just because it MIGHT be used in the future. Data is also kept for all time so that we can go back to any point in time to do analysis.
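To make the "keep everything, for all time" idea concrete, here is a minimal sketch of an append-only store that can be read as of any past date. The class name `AppendOnlyLake`, its methods and the sample records are all hypothetical and exist only for illustration; a real data lake would sit on distributed storage rather than an in-memory list.

```python
from datetime import date


class AppendOnlyLake:
    """Hypothetical append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []  # list of (ingested_on, payload) tuples

    def ingest(self, ingested_on, payload):
        # No filtering by "is this useful?": every record is kept.
        self._records.append((ingested_on, payload))

    def as_of(self, day):
        """Return every payload that had been ingested on or before `day`."""
        return [payload for ingested_on, payload in self._records if ingested_on <= day]


lake = AppendOnlyLake()
lake.ingest(date(2023, 1, 5), {"customer": 1, "tier": "basic"})
lake.ingest(date(2023, 6, 1), {"customer": 1, "tier": "gold"})
lake.ingest(date(2024, 2, 9), {"sensor": "t-17", "reading": 21.4})

# Go back in time: what did the lake contain at the end of 2023?
snapshot = lake.as_of(date(2023, 12, 31))
```

Because nothing is overwritten, both the older "basic" record and the later "gold" record for the same customer remain available, which is what makes analysis at an arbitrary point in time possible.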

Data Lakes Support All Data Types

Data warehouses commonly consist of data extracted from transactional systems, made up of quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, social network activity, sensor data, text and images are mostly ignored. New uses for these data types continue to be found, but consuming and storing them can be costly and difficult.

The data lake approach embraces these non-traditional data types. In the data lake, we keep all data regardless of source and structure. We keep it in its raw form and only transform it when we’re ready to use it. This approach is known as “Schema on Read” vs. the “Schema on Write” approach used in the data warehouse.
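The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample records, the column names and the helper functions `load_schema_on_write` and `read_with_schema` are hypothetical, not part of any particular product.

```python
import json

# Raw records as they might land in the lake: mixed sources, untransformed.
RAW_EVENTS = [
    '{"source": "pos", "amount": "19.99", "store": "A12"}',
    '{"source": "weblog", "path": "/cart", "status": 200}',
    '{"source": "pos", "amount": "5.00", "store": "B07", "coupon": "X1"}',
]

# Schema on write: the target columns are fixed before loading.
WAREHOUSE_COLUMNS = {"amount": float, "store": str}


def load_schema_on_write(lines):
    """Transform at load time; records that do not fit the model are never stored."""
    table = []
    for line in lines:
        record = json.loads(line)
        if not all(col in record for col in WAREHOUSE_COLUMNS):
            continue  # e.g. the web log line is dropped here for good
        table.append({col: cast(record[col]) for col, cast in WAREHOUSE_COLUMNS.items()})
    return table


def read_with_schema(lines, schema):
    """Schema on read: the raw lines stay as they are; a schema is applied per query."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(col in record for col in schema):
            rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


warehouse_table = load_schema_on_write(RAW_EVENTS)

# The same raw data can answer different questions with different schemas.
sales_view = read_with_schema(RAW_EVENTS, {"amount": float, "store": str})
traffic_view = read_with_schema(RAW_EVENTS, {"path": str, "status": int})
```

In this sketch the warehouse table never sees the web log record, while the lake can still serve it later through `traffic_view`, because the decision about structure was postponed until read time.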

Data Lakes Support All Users

In most organizations, 80% or more of users are “operational”. They want to get their reports, see their key performance metrics or look at the same set of data in a spreadsheet every day. The data warehouse is generally ideal for these users because it is well structured, straightforward and simple to understand, and it is purpose-built to answer their questions.

The next 10% do more data analysis. They use the data warehouse as a source but generally go back to source systems to get data that is not contained in the warehouse, and sometimes bring in data from outside the organization. Their favorite tool is the spreadsheet, and they create new reports that are often distributed throughout the organization. The data warehouse is their go-to source for data, but they often go beyond its bounds.

Finally, the last few percent of users do intensive analysis. They may create completely new data sources based on research. They mash up various kinds of data and come up with completely new questions to be answered. These users may use the data warehouse but often ignore it, as they are generally charged with going beyond its capabilities. These users include the Data Scientists, and they may use advanced analytic tools and capabilities like statistical analysis and predictive modeling.

The data lake approach supports all of these users equally well. The data scientists can go to the lake and work with the very large and varied data sets they require, while other users make use of more structured views of the data provided for their use.

Data Lakes Adapt Easily to Changes

One of the main complaints about data warehouses is how long it takes to change them. Considerable time is spent up front during development getting the warehouse’s structure right. A good warehouse design can adapt to change but because of the complexity of the data loading process and the work done to make analysis and reporting easy, these changes will necessarily consume some developer resources and take some time.

Many business questions can’t wait for the data warehouse team to adapt their system to answer them. The ever-increasing demand for faster answers is what has given rise to the concept of self-service business intelligence.

In the data lake, on the other hand, since all data is stored in its raw form and is always accessible to anyone who needs to use it, users are empowered to go beyond the structure of the warehouse to explore data in novel ways and answer their questions at their own pace.

If the result of an exploration is shown to be useful and there is a desire to repeat it, then a more formal schema can be applied to it, and automation and reusability can be developed to help extend the results to a broader audience. If it is determined that the result is not useful, it can be discarded with no changes made to the data structures and no development resources consumed.

Data Lakes Provide Faster Insights

This last difference is really the result of the other four. Because data lakes contain all data and data types, and because they let users access data before it has been transformed, cleansed and structured, users can get to their results faster than with the traditional data warehouse approach.

However, this early access to the data comes at a price. The work typically done by the data warehouse development team may not be done for some or all of the data sources required to do an analysis. This leaves users in the driver’s seat to explore and use the data as they see fit, but the first tier of business users I described above may not want to do that work. They still just want their reports and KPIs.

In the data lake, these operational report consumers will make use of more structured views of the data that resemble what they have always had in the data warehouse. The difference is that these views exist primarily as metadata that sits over the data in the lake, rather than as physically rigid tables that require a developer to change.
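One way to picture a view that exists only as metadata is a mapping from report columns to small functions over the raw records. The sketch below assumes made-up data and names (`RAW_SALES`, `VIEWS`, `query_view`, `daily_sales_report`); it is not how any specific lake engine implements views.

```python
# Raw data stays untouched in the lake.
RAW_SALES = [
    {"ts": "2024-03-01", "store_id": "A12", "amt_cents": 1999, "coupon": None},
    {"ts": "2024-03-01", "store_id": "B07", "amt_cents": 500, "coupon": "X1"},
]

# A "view" here is only metadata: output column -> function over a raw record.
VIEWS = {
    "daily_sales_report": {
        "date": lambda r: r["ts"],
        "store": lambda r: r["store_id"],
        "revenue": lambda r: r["amt_cents"] / 100,
    }
}


def query_view(name, raw_records):
    """Apply the view definition to the raw records at query time."""
    columns = VIEWS[name]
    return [{col: fn(r) for col, fn in columns.items()} for r in raw_records]


report = query_view("daily_sales_report", RAW_SALES)

# Changing the view is a metadata edit; RAW_SALES is neither reloaded nor altered.
VIEWS["daily_sales_report"]["used_coupon"] = lambda r: r["coupon"] is not None
report_v2 = query_view("daily_sales_report", RAW_SALES)
```

The operational user still sees a tidy report with familiar columns, but adding the `used_coupon` column required no change to stored data, only to the view definition.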