How do you keep your Data Lake from turning into a Data Swamp?

James Anderson
Slalom Technology
Mar 15, 2019 · 7 min read
Which one looks more appealing to you?

Every data-focused enterprise has investigated or implemented some level of data lake architecture within their organization, whether they know it or not. Everyone wants to consolidate their data and provide a single source of truth, for both their operational systems and their analytics. However, far too often, an organization gets too excited about its lake, scales too fast and without any governance, and the data lake very quickly turns into a data swamp. It becomes hard to navigate, difficult to find the data you need, and pretty soon no one except the one data lake SME is leveraging the system. That SME becomes a bottleneck, and very quickly, usage of the data lake disappears.

This is the worst possible outcome, not only because you’ve now sunk this massive technical investment into your organization, but because you’ve centralized all your data into a system you now can’t get it back out of. You’re stuck in this swamp, with no clear path to salvage the investment. And the worst part is, you’ve created all these ingestion patterns into the lake, and your data is continuing to accumulate. You don’t want to turn off the ingestion, because you have nowhere else to collect this data. Navigating your way out of the swamp becomes more and more difficult.

So, when implementing a new data lake, how can you make sure it stays clean and useful? Keep these key guiding principles in mind when planning for a data lake, and an enterprise can go a long way toward making sure it doesn’t become a data swamp.

Plan Ahead

One of the beauties of a data lake is that you can store any type of data: structured, semi-structured, and unstructured. Database feeds, event messages, streaming events, file storage, even audio and visual data. This is a huge benefit of centralizing your data within a data lake, since data that could not be combined in a standard database or data warehouse all of a sudden can be analyzed without a whole lot of pre-processing.

While that concept can be incredibly enticing, and seems incredibly easy, it’s very important not to rush to get everything flowing into your lake. As you open the floodgates, all that data needs to go somewhere. If it is not organized, that data becomes harder and harder to find in an ever-growing platform. Trying to fix it later is much harder than one might expect; it’s like trying to build a dam while the river’s flow keeps growing. So, how can you make sure you get all the benefits of a data lake without making it impossible to navigate? Have a plan!

When you begin to plan out your data lake, it’s incredibly important to come up with an organizational structure that is flexible, but strictly enforced. The structure should be able to apply to any type of data being ingested, but not be so restrictive that it becomes overly difficult to ingest the data. Generally, I use a path of Data Type/Source System/Source Table (or Flow Name/File) to keep it simple. The key is that the organization can quickly sort all the raw data that is being ingested, and a developer should be able to quickly find the data they are looking for.
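
To make that convention concrete, here is a minimal sketch in Python of how such a raw-zone path could be built. The helper name build_raw_path, the date partition, and the example values are my own illustration of the idea, not part of any specific lake platform.

```python
# Hypothetical sketch of the Data Type/Source System/Source Table convention.
from datetime import date


def build_raw_path(data_type: str, source_system: str, source_entity: str,
                   ingest_date: date) -> str:
    """Build a consistent raw-zone key: data_type/source_system/source_entity/date."""
    return "/".join([
        data_type.lower(),        # e.g. "structured", "semi-structured", "files"
        source_system.lower(),    # e.g. "order_mgmt"
        source_entity.lower(),    # a source table, flow name, or file prefix
        ingest_date.isoformat(),  # partitioning by load date keeps raw data easy to scan
    ])


print(build_raw_path("structured", "order_mgmt", "customers", date(2019, 3, 15)))
# structured/order_mgmt/customers/2019-03-15
```

The exact levels matter less than the fact that every ingestion flow lands in a predictable place, so a developer can find data without asking the one SME who set it up.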

Don’t Over-commit

As your lake begins to grow, the ability for applications and users to work with the data becomes a priority. Everyone within the organization will start banging on the door of IT, trying to make their initiative the priority for IT to enable. And, one of the benefits of a data lake is that the landscape of tools that can be plugged into the lake is vast, and each has a completely valid use case that allows them to be connected to your lake. So, just to get these business units out of your hair, you enable whatever it is that they are asking for, no questions asked. Then, all of a sudden, you have 40+ applications reading and writing data through a hodge-podge of different channels, all being maintained by IT.

While that does sound like you’ve solved a big business need, in fact you’ve made it a lot more difficult for the enterprise to scale the usage of the data lake. First of all, none of these tools you’ve tacked onto the data lake are free. Without realizing it, you’ve probably spun up a huge set of investments that are now all in production and cannot be easily changed without a big impact assessment. Also, you now need to hire people with expertise in each of these individual tools, which can lead to a lot more overhead cost than expected. So, how can we avoid this scenario while still enabling the business units and their initiatives? By consolidating all requests and coming up with an application-neutral approach.

Picture an architecture where there is really only one way in and one way out of the data lake. Now, that’s not 100% realistic for every example, every time. However, if you ask the organization to meet you in the middle, and provide a single API interface that is flexible enough to be leveraged by many different types of applications, you get a more streamlined ingest and consumption model. Each application within your organization is going to have a different architecture within its service layer, but the application owners should be responsible for architecting a solution WITH the data lake team, not putting the data lake team in charge of dealing with every little difference between the applications themselves. Then, when an application needs to change or upgrade its own architecture, the impact to the rest of the applications and their relationships with the data lake is minimal.
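
As a rough illustration of that single, application-neutral interface, here is a hedged Python sketch. The LakeGateway class, its ingest and consume methods, and the in-memory store are hypothetical stand-ins for whatever gateway layer a data lake team would actually build and version.

```python
# Hypothetical "one way in, one way out" gateway; not a real product API.
import json
from typing import Dict, List


class LakeGateway:
    """Single entry/exit point that the data lake team owns and versions."""

    def __init__(self) -> None:
        self._store: Dict[str, List[dict]] = {}  # stand-in for object storage

    def ingest(self, source_system: str, entity: str, records: List[dict]) -> int:
        """Accept records from any application in one agreed-upon envelope."""
        key = f"raw/{source_system}/{entity}"
        self._store.setdefault(key, []).extend(records)
        return len(records)

    def consume(self, source_system: str, entity: str) -> str:
        """Serve data back in one neutral format (JSON here), regardless of producer."""
        key = f"raw/{source_system}/{entity}"
        return json.dumps(self._store.get(key, []))


# Every application speaks to the same two calls, so changing one application's
# internal architecture does not ripple into every other consumer of the lake.
gateway = LakeGateway()
gateway.ingest("order_mgmt", "orders", [{"order_id": 1, "total": 42.5}])
print(gateway.consume("order_mgmt", "orders"))
```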

Keep an open mind

Now, I know I just spent the last five minutes drilling the idea of a single point of entry into your head. However, it is also important to keep an open mind when it comes to how applications and users interact with your data lake. Not every application is going to be able to provide a stream of data, and the data that’s being sent to the lake is going to come in many different formats. There is no way that every single thing that comes into your data lake is going to be able to follow the same pattern, from ingestion all the way to consumption. One example is document storage.

Someone asked me the other day if it was possible to use a data lake as a document repository, so that different applications could still source the same documents from the same place. Their example was based on people’s resumes: the knowledge experts application could pull a resume, and the board of directors search application could pull from the same pool of resumes. Those applications are not about to stream resume files into the data lake the way transactions are streamed from an order management system, nor are they going to consume the resumes from the data lake the way the order management system consumes the normalized customer data. The patterns are going to be completely different.

The key to a well designed data lake is to be flexible with your patterns, from ingestion, through processing, and to consumption, but not so flexible that every little difference between applications requires a new pattern. I know it sounds like I just said to go eat a cheeseburger and maintain a low-fat diet at the same time, but the key here is balance. If you have one application that streams JSON messages, one application that streams XML, and one application that sends parquet files, you should be able to leverage ONE pattern to manage the ingestion of them all. But if you have one application that sends JSON messages and another transactional system that can only be ingested through a batch pull, you will need to leverage two different patterns. And that’s OK.
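
As a sketch of what “one pattern” for JSON, XML, and Parquet might look like, here is a small Python example where only the parser varies per format while the ingestion entry point stays the same. The function names and parser registry are illustrative assumptions; Parquet is left as a comment because it would pull in an extra dependency such as pyarrow.

```python
# Hypothetical single ingestion pattern: one entry point, pluggable parsers.
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


def parse_json(payload: bytes) -> List[dict]:
    record = json.loads(payload)
    return record if isinstance(record, list) else [record]


def parse_xml(payload: bytes) -> List[dict]:
    root = ET.fromstring(payload)
    return [{child.tag: child.text for child in item} for item in root]


PARSERS: Dict[str, Callable[[bytes], List[dict]]] = {
    "json": parse_json,
    "xml": parse_xml,
    # "parquet" could be registered here too, e.g. via pyarrow, without touching ingest()
}


def ingest(payload: bytes, fmt: str) -> List[dict]:
    """Single ingestion entry point: only the parser changes per format."""
    if fmt not in PARSERS:
        raise ValueError(f"No parser registered for format: {fmt}")
    return PARSERS[fmt](payload)


print(ingest(b'{"id": 1}', "json"))
print(ingest(b"<rows><row><id>2</id></row></rows>", "xml"))
```

A batch pull from a transactional system, on the other hand, genuinely needs its own pattern, which is exactly the kind of second pattern that’s OK to maintain.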

An easy way to make sure you’re being flexible with your patterns is to go through a journey-mapping exercise. By working with each application owner, and each business unit affected by that application, it becomes easier to map different usage patterns to different ingestion and consumption patterns for the data in the lake. This helps you architect a solution that fits the needs of the entire enterprise much more efficiently.

There are many different ways to architect a data lake, and each of them absolutely has its own merit. Every organization is going to have a different need for the lake, and the usage of the lake is going to differ as well. However, as long as an organization plans ahead, doesn’t over-commit, and keeps an open mind, it will be well on its way to scaling a successful data lake, rather than building a fairly unhelpful and non-scalable data swamp. Once you’ve established a strong data lake foundation, the next step is to better understand the lineage of your data, and how it is processed within your data lake for consumption by applications and analytics.


Sales Engineering Leader @ Snowflake. All opinions expressed are my own.