How can you unlock the potential of your data lake?
You’ve unlocked so much potential within your data lake; now you should be able to take advantage of it.
When I speak with people who have never worked with a data lake before, I like to use the metaphor of an actual lake. Streams, creeks, and rivers carry water into a lake, where it is cleaned by the natural processes that inhabit the lake. That clean water is then fed out of the lake by other streams, creeks, or rivers. Replace the word “water” with the word “data”, and you have the basic definition of a data lake.
In my first post in this data lake series (here), I wrote about how to best establish the foundation for your data lake. In my second post (here), I wrote about how to expand your data lake and process the data to generate more insights more effectively. Once your data lake has a solid foundation and data flowing through it efficiently, the final piece of the puzzle is to make it available to your applications, both operational and analytical. While that sounds simple, it can become a huge logistical headache if you try to make your lake itself a one-stop shop for all access.
Being able to combine data from every incoming data stream is a huge benefit of a data lake, but even with the most solid foundation, and even if you’re processing data like crazy through the lake, it’s all for nothing if you can’t get that data out. In the physical lake metaphor I described at the beginning, the third part of the journey is the clean water leaving the lake via a river or a variety of streams. Understanding how that river or stream is created and maintained is key to maximizing consumption without duplicating your entire lake into databases and systems across your architecture.
As your data lake processes generate more and more data, and provide cleaner views into your organization, the value of that data consistently increases. However, there are multiple types of systems that may need to access the data, which could change how you want to make the data available.
As an Analytics Consultant, my first instinct is always to think about the analytics implications of a data solution. That mindset doesn’t change when it comes to data lakes. Many times, we see an initial implementation of a data lake used to combine data from many sources for analytical use cases, like combining data for dashboards and reports. An analytical use case does not require the same level of data availability and processing speed that an operational use case does, so it allows an organization to get comfortable with the new approaches and technologies.
There are multiple ways to consume analytical outcomes from a data lake, and they each depend on a number of factors: performance, frequency, modeling, and concurrency. Each of these factors generally has its own priority within a particular organization, but it’s the collective requirements that help to drive how an analytics consumption layer is set up on top of your data lake. There is never a wrong answer when it comes to how your analytics applications interact with your data lake, but there are certainly more efficient approaches based on the needs of your organization.
For example, if you are looking to provide near-real-time analytics to an application, you may want to build a query layer directly on top of the data lake, allowing the dashboard to access the data as it’s updated in the lake. This goes back to the frequency factor: the dashboard itself will require frequent refreshes. One example is a customer service dashboard displayed on screens in your call center, providing analytics on calls as they happen. This data, as it’s fed into the lake, should be available on the dashboard to better serve the needs of your call center team.
Let’s look at performance for a minute, though. If that call center dashboard were up on every single computer in the building, rather than one screen, having it pull live from the data lake might not be the best idea. Data lakes are fantastic for storing and processing data, but they leave a lot to be desired when it comes to actually querying the data. In a data lake, the data is partitioned in ways that are efficient for storage, not for querying. Data warehouses store the data in ways that are efficient for querying. So, if you need a performant way to query the data, pushing the data from the lake to a warehouse would be the best way to surface it for consumption.
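To make the partitioning point concrete, here is a minimal sketch of why storage-oriented partitioning hurts ad hoc queries. The paths, partition key, and column names are all invented for illustration; the idea is simply that partition pruning only helps when a query filters on the partition key.

```python
# Hypothetical lake layout: call data partitioned by ingest date.
lake_files = [
    {"path": "s3://lake/calls/date=2024-01-01/part-0.parquet", "date": "2024-01-01"},
    {"path": "s3://lake/calls/date=2024-01-02/part-0.parquet", "date": "2024-01-02"},
    {"path": "s3://lake/calls/date=2024-01-03/part-0.parquet", "date": "2024-01-03"},
]

def files_to_scan(files, date_filter=None):
    """Return the files a query engine would have to open."""
    if date_filter is None:
        # Query filters on a non-partition column (say, agent_id):
        # every file must be opened and scanned.
        return [f["path"] for f in files]
    # Query filters on the partition key: irrelevant files are skipped.
    return [f["path"] for f in files if f["date"] == date_filter]

print(len(files_to_scan(lake_files)))                # 3 -> full scan
print(len(files_to_scan(lake_files, "2024-01-02")))  # 1 -> pruned
```

A warehouse, by contrast, organizes and indexes the data around the queries it expects, which is why pushing data from the lake to a warehouse pays off once query volume grows.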
These are just two examples of how these different factors can impact how you determine what a consumption layer from your data lake would look like for an analytics application, but it’s crucial to look at all the factors when making those decisions. You may have a real-time requirement, but if that real-time requirement is spread across hundreds of users accessing an analytics application, there are ways to maintain that real-time need while publishing to a data warehouse to optimize query performance. By thinking holistically, you can make your analytics application much more efficient.
One of the biggest benefits of a data lake is the ability to combine data from many different systems and gain a more comprehensive view of your organization. While there are plenty of analytics outcomes from that view, there are also many ways to benefit your operational processes by providing them better data on which to operate. And while the data is used differently in operational systems, the approach to architecting your consumption patterns should follow the same path.
As I said in the previous section, performance, frequency, modeling, and concurrency are the key factors when building the patterns for your applications to consume the data. And while the previous section focused on analytics-based applications, the same mindset should be used when thinking about how operational applications consume from the data lake. There are generally more operational applications than analytics applications within an enterprise’s technical footprint, which means more consumption patterns will need to be developed.
A true microservices model would separate an application from dependence on any other application within the stack. The application would hold its own transactional data store and send data out to the data lake in a separate stream. This model is great for keeping applications segregated and minimizing the impact of changes to one application on the others. However, it can create some hurdles when consuming from the data lake. If all the applications are reading and writing directly from the data lake, changes to the lake could impact them downstream. So, how do our factors play a role in designing the consumption patterns? Let’s use a shared customer taxonomy as an example of how best to design your patterns.
In retail, there are generally multiple doors a customer can come through to buy your product: online, in-store, business-to-business, etc. If each of these doors has its own application, and each application has its own data store, then it becomes very difficult for the enterprise to truly understand how many customers it has. Each application may have a different definition of a customer, and none would have information about a customer who only comes through a different door. Many enterprises will leverage their data lake to create a shared customer taxonomy model, giving their applications a shared understanding of their customers. If that taxonomy lives in the operational layer of your data lake, your applications should be able to enhance their existing data sets with the information in that model. But how they enhance the data depends on the need.
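The core of that shared taxonomy is simple: merge per-channel customer records into one entry per real customer. This is a minimal sketch under the assumption that a shared identifier (email, here) exists across channels; the field names and records are invented for illustration.

```python
def build_taxonomy(records):
    """Merge per-channel customer records into one entry per customer,
    keyed on a shared identifier (email, for simplicity)."""
    customers = {}
    for rec in records:
        entry = customers.setdefault(
            rec["email"], {"email": rec["email"], "channels": set()}
        )
        entry["channels"].add(rec["channel"])
    return customers

# The same person buying online and in-store collapses to one customer.
records = [
    {"email": "a@example.com", "channel": "online"},
    {"email": "a@example.com", "channel": "in-store"},
    {"email": "b@example.com", "channel": "b2b"},
]
taxonomy = build_taxonomy(records)
print(len(taxonomy))                                   # 2 distinct customers
print(sorted(taxonomy["a@example.com"]["channels"]))   # ['in-store', 'online']
```

In practice, identity resolution is far messier than a shared email, which is exactly why building the taxonomy once in the lake beats having each application attempt it independently.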
If frequency is truly the driving factor, and the data needs to be updated in real time as customers make purchases, the application should either interact with the lake directly or, depending on the performance needs, have the data streamed back to its database on a real-time or micro-batch basis. However, if all applications need the data on a near-real-time basis, and the concurrency is such that streaming it all back to the individual stores is impractical, a separate, centralized database should be created to optimize query performance. These are not either/or patterns. The goal is to provide the right pattern for the right consumption use case, and truly think through which factors are important.
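That decision logic can be sketched as a simple routing function. This is an illustrative sketch, not a prescription: the threshold and pattern names are assumptions, and a real architecture would weigh all four factors, not two.

```python
def consumption_pattern(needs_realtime: bool, consumer_count: int) -> str:
    """Pick a consumption pattern from two of the factors discussed above.
    The consumer-count threshold is invented for illustration."""
    if not needs_realtime:
        # Low frequency: publish to a warehouse on a schedule.
        return "batch publish to warehouse"
    if consumer_count <= 5:
        # Real-time, few consumers: stream changes back to each
        # application's own data store.
        return "stream back to application stores"
    # Real-time with high concurrency: one centralized serving database
    # optimized for query performance.
    return "centralized serving database"

print(consumption_pattern(False, 100))  # batch publish to warehouse
print(consumption_pattern(True, 2))     # stream back to application stores
print(consumption_pattern(True, 200))   # centralized serving database
```

The point is not the thresholds themselves but that different use cases land on different patterns, and the same lake can serve all three at once.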
A data lake architecture should never be looked at as a single solution. There are technologies that claim to be the one-stop shop for all things data lake related, from ingestion, through processing, to consumption. These tools may have plenty of power and capabilities built in, but they force an enterprise to fit itself to the tool’s patterns, rather than developing the right patterns for the right use cases. If you want to build a data lake architecture that can truly transform your business, the solution has to be thought about holistically, and patterns need to be drawn out and developed based on how the organization wants to leverage the data in the lake. Then, an enterprise can truly begin to move towards becoming data-driven.