So you have a Data Lake. Now what?

How can you get the most value out of the investment in your data lake?

In part one of my data lake series (here), I talked through how to best establish your data lake so that it is clean and usable. Building a strong foundation is the largest hurdle in a successful data lake architecture, and once it’s established, an enterprise can really begin to accelerate its adoption of the lake. As adoption picks up, understanding the lineage of the data, and making sure it’s being processed effectively and efficiently, is pushed to the forefront.

Whenever I’m in a discussion with someone who is new to data lakes and is trying to understand the concepts, I like to use the metaphor of an actual lake. Multiple flows of water enter a lake from a variety of places and in varying states of cleanliness. All that water enters the lake and sits there. Then, the inhabitants of the lake, both flora and fauna, clean the water in a variety of ways. That clean water then flows out of the lake through a river or a number of streams. Now, replace the word water with the word data, and you’ve got the basic definition of a data lake. Still, the biggest mystery in that whole process is how the water (or data) is cleaned by the various inhabitants of the lake.

There are many different ways to process, maintain, and catalog the data that you have in a data lake. And while there is never a single right way of doing it, there are certainly some things that you may want to avoid when you begin to clean and catalog your data in the lake.

Tracking Data Lineage

In part one, I spoke about planning ahead in terms of how you organize your data lake. That helps to make it more searchable, and the data is more easily accessed and worked with. This same mindset definitely needs to be applied when it comes to how you process data, and how you track your processing as well.

Data Lineage has always been an important aspect of a well-maintained data warehouse, and really an important aspect of any data platform. There are many tools and technologies that were built to solve this exact technical problem, with varying degrees of success. On many data projects that we go through, one of the first things we try to build is a Source to Target Mapping, which we use to drive how our data engineering team builds their processes. Now that’s a straightforward task within a fairly structured data warehousing model. But with all the different data types and data streams that are involved in a data lake, a simple Source to Target Mapping document is going to become very messy very quickly.

Rather than trying to document every single field and the lineage of that field throughout the lake, there are a few ways that you can simplify that process. First, put some rules in place as to how different data types or data domains are processed in the lake. This allows an organization to set blanket rules that are much easier to document and enforce than more specific rules by column and object. For example, if you have multiple order management systems (OMS) that are batched into your lake in a fairly structured way, you may want to add a column to the raw table that shows which OMS each record came from. That column can then be used programmatically throughout the processing cycle, without having to know which tables or files hold data from a specific OMS.
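As a minimal sketch of that blanket-rule idea, the snippet below tags each raw record with its origin at ingestion and then picks a processing rule from that tag rather than from a table or file name. The column name source_oms, the system names oms_east and oms_west, and the normalization rules themselves are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def tag_with_source(records: List[Record], source_oms: str) -> List[Record]:
    """Add a lineage column to every raw record as it lands in the lake."""
    return [{**record, "source_oms": source_oms} for record in records]


def normalize_east(record: Record) -> Record:
    # Assumption: this hypothetical OMS reports amounts in cents.
    return {**record, "amount": record["amount"] / 100}


def normalize_west(record: Record) -> Record:
    # Assumption: this hypothetical OMS already reports amounts in dollars.
    return dict(record)


# One blanket rule per source system, instead of one rule per table or file.
RULES: Dict[str, Callable[[Record], Record]] = {
    "oms_east": normalize_east,
    "oms_west": normalize_west,
}


def process(records: List[Record]) -> List[Record]:
    """Route each record through the rule for the system it came from."""
    return [RULES[record["source_oms"]](record) for record in records]


raw = (
    tag_with_source([{"order_id": 1, "amount": 1250}], "oms_east")
    + tag_with_source([{"order_id": 2, "amount": 8.0}], "oms_west")
)
processed = process(raw)
```

Because the lineage column travels with the record, downstream jobs only need the RULES table to know how a record should be handled, and the documentation only needs one entry per source system.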

Second, there are many tools that can crawl through the data in the lake in order to create a catalog. Leverage them as much as you can, and make sure that the process is auditable. For example, AWS Glue provides a crawler capability that reads through the objects that exist within S3 and writes the results out to a searchable data catalog. While this crawler isn’t always going to be able to read every single file or object within the S3 bucket, its logs can show what WASN’T read, allowing other processes to catalog those objects as such. This also helps to keep track of new data streams that are added to the data lake, without a lot of manual pre-work on cataloging everything in the feed.
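To make the "auditable crawl" idea concrete, here is a small local stand-in written with only the Python standard library. It is not the AWS Glue API; the crawl function, the file names, and the two supported formats are assumptions made for this sketch. It walks a directory, records the column names of every file it can read, and keeps an audit list of everything it could not.

```python
import csv
import json
import tempfile
from pathlib import Path


def crawl(root):
    """Return (catalog, skipped): column names per readable file, plus an audit list."""
    root = Path(root)
    catalog, skipped = {}, []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(root).as_posix()
        try:
            if path.suffix == ".csv":
                with path.open(newline="") as f:
                    catalog[key] = next(csv.reader(f))  # header row only
            elif path.suffix == ".json":
                with path.open() as f:
                    catalog[key] = sorted(json.load(f).keys())
            else:
                skipped.append((key, "unsupported format"))
        except (StopIteration, ValueError, AttributeError) as exc:
            # Empty CSV, malformed JSON, or JSON that is not an object.
            skipped.append((key, f"unreadable: {type(exc).__name__}"))
    return catalog, skipped


# Build a throwaway "lake" so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    (lake / "orders").mkdir()
    (lake / "orders" / "2020.csv").write_text("order_id,amount\n1,12.50\n")
    (lake / "customer.json").write_text('{"id": 1, "name": "Ana"}')
    (lake / "broken.json").write_text("{not valid json")
    (lake / "scan.bin").write_bytes(b"\x00\x01")
    catalog, skipped = crawl(lake)
```

The skipped list is the important part: it is the record of what was not cataloged, which a follow-up process (or a person) can work through instead of discovering the gaps later.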

Read vs Write Access

Every data scientist I’ve ever worked with is looking for unlimited access to a data lake. They feel that they cannot do their job without the ability to work with the data in its rawest form, rather than a specific subset of data that is provided to them by an engineering team. On the other hand, the engineering team is usually not very interested in providing the data science team any real access to their lake, for fear that they will mess with something and introduce new, un-auditable data to the lake. Both trains of thought are absolutely correct, and unfortunately they directly contradict each other.

Data lakes are designed to be a shared source of information for anyone and everyone to access, so that more and better insight can be developed from a shared location. Many data scientists want to write models and processes that read a large amount of data and then output a result of some sort. That exact process can be supported inside a data lake by defining read access and write access separately for different parts of the lake.

A production data lake should only ever be written into by some automated and auditable process (see my previous section as to why). However, everyone who needs to should be able to read data from the data lake. By creating a “data science sandbox” section of the data lake, data scientists can begin to develop and train models, with the outputs being written to the sandbox for further review and training. Once the models and processes have been developed through a standard Software Development Lifecycle (SDLC), they can be promoted to the production data lake by the engineering team, following the same auditable patterns as all the rest of the jobs that write data to the lake. This gives the data science team a place to do their work without any major reliance on the engineering team, and gives the engineers the cover they need to maintain a clean lake.
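The read-versus-write split can be expressed as a very small policy. The sketch below is only an illustration of the rule described above, not a real access control system: the role names, the sandbox/ prefix, and the example paths are hypothetical, and in practice this logic would live in your platform’s own permission model.

```python
from dataclasses import dataclass

SANDBOX_PREFIX = "sandbox/"  # hypothetical location of the data science sandbox


@dataclass(frozen=True)
class Principal:
    name: str
    role: str  # hypothetical roles: "pipeline", "data_scientist", "analyst"


def can_read(principal: Principal, path: str) -> bool:
    """Everyone who needs the data can read anywhere in the lake."""
    return True


def can_write(principal: Principal, path: str) -> bool:
    """Only automated pipelines write to production; data scientists write to the sandbox."""
    if path.startswith(SANDBOX_PREFIX):
        return principal.role == "data_scientist"
    return principal.role == "pipeline"


nightly_job = Principal("orders_nightly_load", "pipeline")
scientist = Principal("dana", "data_scientist")
```

With this split, a model that graduates from the sandbox is rewritten as a pipeline job, so its output reaches production through the same audited path as every other write.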

Understanding Your Usage Patterns

Every data lake is established for a different reason, though most are driven by the desire to centralize all the data. In my experience, many data lakes are initially established to replace a monolithic data warehouse, so the use cases for combining and processing data are analytical. This can be done in a batch, or even micro-batch, way in order to create an analytical schema for consumption by analytics platforms. That is just one of many different usage patterns that can be leveraged within a data lake architecture.

In part one of this series, I talked a lot about being flexible in your ingestion patterns based on the types of data that are being loaded into the data lake. That same sentiment is key when thinking about how best to process that data through the lake. Rather than thinking only about the types of data that are being processed, it’s key to think about how the data is being used downstream. Many enterprises are trying to move to a model where applications can read from data sources that have been enriched with the combined insights of all data-producing applications. This would mean that the data lake usage would expand, and would require an operational data store to be built within a layer of the data lake.

While building an operational data store (ODS) is no easy feat, what is more relevant is the speed at which the data is written to the ODS. An ODS requires instant read-after-write for the transactions that have been created, so the speed at which the data is ingested and made available is key. The processing patterns leveraged would be drastically different from the batch, or even micro-batch, processing that the analytical layer of your data lake would require. So the question becomes: do you maintain a small transactional database with each application and provide a shared layer that holds a common data set driving the applications, or do you design an immediately available data lake layer for operational data stores?

If you are leveraging a classic HDFS-based data lake, with the compute and storage completely integrated with each other, it becomes much easier to create an operational data store within that lake, since it can be treated more like a database with immediate availability. While that is certainly a pro, a classic HDFS-based data lake can be very cost-prohibitive. It also requires a much more segmented data lake architecture, so that the resources used by the data science team do not impact the performance of your applications.

If you are leveraging a cloud-based object storage data lake (AWS S3 or Azure Data Lake Store, for example), building an operational data store in the lake becomes much more difficult, and potentially impossible depending on the requirements of the application. However, that does not mean that you cannot drive certain parts of the application from the data lake. For example, if you have five different applications that all interact with different sections of your enterprise, they may all have different definitions of what a “customer” is. If you completely isolate each application from the others, each with its own operational data store that runs the application in its entirety, you will never get to a common definition of customer. If you instead push all transactions both to your operational data store database and to your data lake, and process them through in real time, your applications could all read from the same Customer ODS within the data lake and leverage this shared definition of a customer.
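Here is a rough sketch of what a shared customer definition could look like in code. Everything specific in it is an assumption made for illustration: the application names billing and support, their field names, and the choice of a normalized email address as the matching key. Real customer matching is usually far messier than this.

```python
from typing import Dict

# Hypothetical mapping from each application's field names to the shared definition.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "billing": {"cust_email": "email", "cust_name": "name"},
    "support": {"contact_email": "email", "full_name": "name", "tier": "support_tier"},
}


def to_common(app: str, event: dict) -> dict:
    """Translate an application-specific event into the shared customer schema."""
    mapping = FIELD_MAPS[app]
    return {common: event[src] for src, common in mapping.items() if src in event}


def upsert(store: dict, app: str, event: dict) -> dict:
    """Merge an event into the shared customer store, keyed by normalized email."""
    record = to_common(app, event)
    key = record["email"].strip().lower()  # assumption: email identifies a customer
    record["email"] = key
    store[key] = {**store.get(key, {}), **record}
    return store[key]


customer_ods: dict = {}
upsert(customer_ods, "billing", {"cust_email": "Ana@Example.com", "cust_name": "Ana Diaz"})
upsert(customer_ods, "support", {"contact_email": "ana@example.com", "tier": "gold"})
```

After both events, the store holds a single customer record that combines what billing and support each know, which is the shared definition every application would read from.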

This leads to the last question that needs to be answered when it comes to building out your data lake: how do you make all this data available for consumption, and when cleaned, where is all this water going to go?