Building a data lake without drowning

Isa Huerga
5 min read · Aug 6, 2020


Lessons learned that will help you avoid common pitfalls during your data lake implementation

In recent years, I’ve worked with a number of companies as they implemented a data lake. Here are some of the key lessons learned.

There are many reasons why companies decide to build data lakes. Data lakes deliver many benefits, like the ability to store raw data in structured and unstructured formats, at massive scale and low cost.

However, be aware that:

Putting data in a data lake will not automatically bring value to your business.

One of the keys to success when building a data lake is having a clear and specific goal related to the value that will be extracted from the data.

Moving data to the data lake, consolidating different datasets, collecting data from new sources, or accelerating intake, are not goals. They are ways to achieve the goals. Goals should be related to the business or provide a benefit for your customers (which indirectly benefits the business).

Ask yourself: What is the value your data lake is bringing to your business or your customers?

Business goals are usually related to increasing sales, decreasing costs, or increasing efficiency. Once you have clear, SMART goals, you can start building a data lake as part of the process to achieve them.

The fact that data lakes can be a key component of your data strategy doesn’t mean they should replace other components, like databases and data warehouses.

Databases and data warehouses remain a good approach for different scenarios. In fact, AWS offers 15 different purpose-built databases to satisfy these specific requirements, including relational, key-value, document, in-memory, graph, time series, and ledger databases.

These components can be integrated with the data lake. The new data lakehouse paradigm, enabled for example by Amazon Redshift, allows queries to span different components without the need to reload data.
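To make the lakehouse idea concrete, here is a minimal sketch (not from this article) of a query that spans both worlds: a table loaded into Redshift joined with an external table that stays in S3, submitted through the boto3 Redshift Data API. The cluster, schema, and table names are hypothetical.

```python
# Illustrative sketch: one query that joins a table stored in Redshift with an
# external table backed by S3, submitted via the Redshift Data API (boto3).
# Cluster, database, schema, and table names are hypothetical placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

sql = """
    SELECT o.order_id, o.amount, c.segment
    FROM sales.orders o            -- table loaded into Redshift
    JOIN spectrum_lake.customers c -- external table, data stays in S3
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2020-01-01'
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",  # hypothetical cluster
    Database="analytics",            # hypothetical database
    DbUser="analyst",                # hypothetical user
    Sql=sql,
)
# Statement id; results can later be fetched with get_statement_result.
print(response["Id"])
```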

Tight integration across a variety of components makes more sense than trying to use a single tool for every job. So, rather than choosing technology such as a data lake, focus on each use case to choose the right technology for it.

When it comes to implementation, the most successful approach to building a data lake that I’ve seen is to focus on use cases that keep the broader goal in mind. This is the same approach I recommend for migrations to the cloud.

Starting with a data puddle is ok, so long as you grow that data puddle as opposed to building a lot of data puddles.

Start small, think big.

Selecting a single use case, or just a few, allows you to validate the process and get more familiar with the new tools or analytics approaches you are implementing.

You can build on this experience, incorporating the lessons learned in subsequent use cases that you add to the data lake. Standardize what works, and change or improve what doesn’t.

I purposely refer to use cases and not datasets or sources. Instead of just focusing on the destination, focus on the full process:

  • Where will data be stored?
  • What security and compliance requirements need to be fulfilled?
  • How will information be processed, contextualized and consumed?
  • How will the data, metadata and your platform be governed?

Remember that the goal of the data lake (and your data strategy) should be to extract value from data. But keep in mind that:

To be able to extract value from data you need the right speed, the right people, and the right data.

Right speed refers to the time-sensitivity aspect of analytics. If the time from data collection to consumption is too long, the data or analysis will lose relevance.

This is aggravated by today’s pace of change. The capacity to adjust quickly to changes, or even anticipate them, can provide an advantage over competitors: for example, moving from reactive to predictive models for maintenance, or adopting pricing optimization strategies.

Right people means people with the skills to interpret and contextualize data. This means the data needs to be consumable in different ways to fulfil different requirements.

Data lakes are ideal for this as they support schema on read, which means that the same data can be looked at in different ways, by different teams. Also, while you can store raw data as-is, sometimes you might want to consider transforming it from row-based to columnar formats (keeping a copy of the original) to improve the performance of your analytics and even reduce cost.
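As a small, hedged illustration of that transformation, the sketch below converts a raw CSV file into Parquet (a columnar format) with pandas, keeping the original file in place. The bucket and paths are placeholders, and it assumes pandas, pyarrow, and s3fs are installed.

```python
# Illustrative sketch: convert raw row-oriented data (CSV) to a columnar format
# (Parquet) while leaving the original file untouched in the raw zone.
# Paths and bucket names are hypothetical; requires pandas, pyarrow, and s3fs.
import pandas as pd

# Read the raw data as-is from the lake's raw zone.
df = pd.read_csv("s3://my-data-lake/raw/events/2020-08-06.csv")

# Write a columnar copy under a separate "curated" prefix. Snappy compression
# is a common default and usually reduces both scan time and storage cost.
df.to_parquet(
    "s3://my-data-lake/curated/events/2020-08-06.parquet",
    compression="snappy",
    index=False,
)
```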

With data lakes it is easy to make data accessible via a variety of tools: interactive SQL queries, notebooks, visual dashboards, and reports, among others.
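For instance, an interactive SQL query against the lake with Amazon Athena through boto3 might look like the following sketch; the database, table, and output location are hypothetical.

```python
# Illustrative sketch: running an interactive SQL query directly against data
# in the lake with Amazon Athena. Database, table, and output bucket are
# hypothetical placeholders.
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT segment, COUNT(*) AS orders FROM events GROUP BY segment",
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
# Poll get_query_execution / get_query_results with this id to read the output.
print(query["QueryExecutionId"])
```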

Ultimately, data needs to be consumable, otherwise it remains unused and delivers no value.

Right data is data that can be trusted. A data lake is a single centralized repository that can act as a single source of truth. However, if it is contaminated with bad data, it can skew the outcome of your analysis, resulting in wrong conclusions or invalidating the analysis altogether.

Data is the most valuable asset in a company, so ensuring its quality and integrity cannot simply be left to good intentions — it requires mechanisms.

To ensure data quality and integrity, and to avoid building data swamps, you must have governance. There are many tools and mechanisms that can help; some of them, such as data catalogs, should be a must-have for any data lake.

A data catalog is a detailed, searchable inventory of all the data assets in the organization, which lets users see all available datasets and their metadata. In addition to complementing, and sometimes even replacing, data lineage solutions, data catalogs enable self-service consumption of the data, which extends access to data and insights to non-technical users and business stakeholders.
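On AWS, the Glue Data Catalog can play this role. The short boto3 sketch below, using a hypothetical database name, lists the datasets registered in the catalog along with their locations and columns.

```python
# Illustrative sketch: browsing a data catalog programmatically with the AWS
# Glue Data Catalog. The database name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue")

# List the tables (datasets) registered under one catalog database
# (pagination and edge cases omitted for brevity).
for table in glue.get_tables(DatabaseName="curated")["TableList"]:
    descriptor = table.get("StorageDescriptor", {})
    columns = [col["Name"] for col in descriptor.get("Columns", [])]
    print(table["Name"], descriptor.get("Location", ""), columns)
```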

Data lakes democratize access to data, so it’s fundamental to implement governance.

In addition to clear processes and tools, governance includes having well-defined responsibilities. You need to designate data owners, or perhaps — depending on your needs and the size of your organization — consider dividing responsibilities into separate roles, such as data stewards and custodians.

Building data lakes is more than just moving data to a new type of repository — it requires knowledge in different areas.

There are many tools and frameworks available to build data lakes and perform different types of analytics. You need to consider which are best for your scenarios, how they integrate with each other, and your available skillset when making your choices.

The cloud makes building data lakes easier by providing access to on-demand, pay-as-you-go services that enable easy experimentation with low risk. You can choose from different options, including managed services for open-source solutions. The cloud also provides easier integration between components and a consistent experience, allowing you to focus on managing data instead of setting up and managing infrastructure.

An easy way to build a data lake on AWS is with AWS Lake Formation, a service that simplifies setting up a secure data lake. For more information, check out this workshop: https://lakeformation.aworkshop.io/
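As a hedged taste of what that looks like programmatically, the boto3 sketch below grants a hypothetical analyst role SELECT access to a single table through Lake Formation, rather than managing bucket-level policies directly.

```python
# Illustrative sketch: granting fine-grained, table-level access with AWS Lake
# Formation via boto3. The IAM role ARN, database, and table are hypothetical.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "curated", "Name": "events"}},
    Permissions=["SELECT"],
)
```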

— Isa Huerga
