Getting started with Data Lakes in AWS

Carlos Cruz
Published in NicaSource · 6 min read · Sep 28, 2022

In this new digital era, businesses are racing to integrate technology into their products and services in an attempt to retain customers and create lasting value. Every company hopes digital transformation will give it an edge over its competitors. To gain a significant advantage in a highly competitive market, businesses must know their customers and understand how they interact with their offerings. Yet understanding who their clients are and what they need is a complex process.

Understanding your client may require data research, architectural design, extraction, transformation, and analysis. For all of this to happen and produce valuable results, you need to combine distinct data sources (relational databases, non-relational databases, APIs, data warehouses, etc.). Sometimes structured data is not sufficient, and semi-structured and unstructured data become part of the equation when designing a feasible solution.

Is it even possible to keep up with all these demands in an elegant, scalable, and cost-effective way? The answer is a big yes, and it is called a data lake. In this series of articles, we will build a Proof of Concept (PoC) of a data lake that will help you understand how it works and the basic design principles needed to achieve a company’s goals.

In the following GitHub repository, you will find the required resources and a more beginner-friendly guide on how to set up everything: https://github.com/carloscruzns/datalakepoc#datalakepoc.

Key Concepts

Before we start designing our diagrams, we first need to understand some essential concepts.

What is a Data Lake?

According to AWS, “a data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.”

Slightly at odds with AWS’ definition, Databricks states that “a data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage to store the data.”

I tend to disagree with the latter definition, because something as small as a few Excel files holding information that is invaluable to a company can start a data lake. Don’t get me wrong: big tech companies sometimes start data lakes with the petabytes of data they generate daily. But a petabyte is not the minimum requirement to begin a data lake.

It is undeniable that data lakes and big data go hand in hand. Luckily, cloud computing allows us to design solutions that work in both scenarios, whether there is a lot of data or very little, without financially compromising a company in the process. At the end of the day, the total amount invested in setting up the project presented throughout this series should be less than a dollar if done correctly.

Data Lake Layers

To manage any amount of data from different sources and of different types, a data lake must have a flexible and scalable architecture. A key component of this kind of architecture is the data lake layer, which isolates information based on its level of quality and value. This gives us the convenience of creating as many layers as a project’s requirements call for.

There are different theories about the ideal number of layers, but the standard, essential ones are the following:

  • Raw data layer: In this layer, we focus on ingesting data as-is from our data sources as efficiently as possible. This data may originate from an API, a data warehouse, a database, IoT devices, etc. We must ensure it keeps its original structure and, if possible, its original file format. It is also known as the Ingestion layer.
  • Curation data layer: The data is denormalized, cleansed, and derived at this stage. Our objective is to make it standard and uniform. It is also known as the Distillation layer.
  • Application data layer: Here, we implement business logic and analytical applications that consume the data. The goal is to generate value by creating insights and useful data models. It is also known as the Trusted layer.

Some projects might need a different number of layers, but this proposal is enough for our data source. It is sometimes recommended to rename the layers to make them easier to reason about. For this exercise, we will call them the coal, pressure, and diamond layers (see the sketch below). I’m aware diamonds don’t actually come from coal, but it’s a cool visual metaphor.
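To make the layer idea concrete, here is a minimal sketch of how objects could be organized once the buckets exist. The bucket names, source name, and key pattern are illustrative assumptions for this PoC, not an official convention:

```python
from datetime import date

# Hypothetical bucket names for the three layers
# (coal = raw, pressure = curated, diamond = application).
LAYER_BUCKETS = {
    "coal": "datalakepoc-coal",
    "pressure": "datalakepoc-pressure",
    "diamond": "datalakepoc-diamond",
}

def build_key(source: str, table: str, ingestion_date: date, extension: str) -> str:
    """Build an S3 key that keeps data isolated by source, table, and ingestion date."""
    return f"{source}/{table}/{ingestion_date:%Y-%m-%d}/{table}.{extension}"

# Example: where a raw dump of the Sakila 'film' table could land in the coal layer.
print(LAYER_BUCKETS["coal"], build_key("sakila", "film", date(2022, 9, 28), "csv"))
```

Partitioning the raw keys by ingestion date is a common choice because it lets you reprocess a single day downstream without touching everything else.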

Data Lake Layers

Data Lake — Architecture

We now have the required knowledge to start working on our PoC. For this trial, we will use AWS as our cloud provider, and the following services will be required:

  • RDS: our data source will be a MySQL database loaded with the Sakila sample database.
  • Lambda: it will provide the processing power for ELT operations.
  • S3: it will be the default service for storing files.
  • Glue Data Catalog: it will contain our data’s location and schema.
  • Secrets Manager: it will store the credentials for our data source (see the sketch after this list).
  • Athena: it allows us to run interactive queries against our data in Amazon S3 using standard SQL.
  • QuickSight: it will allow us to access the data models after they are created. Additionally, we will be able to create interactive dashboards and obtain useful and valuable insights.
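As a taste of how these services fit together, here is a minimal sketch of how a Lambda function could read the database credentials from Secrets Manager with boto3. The secret name, key names, and region are assumptions for this PoC; use whatever you chose when creating the secret:

```python
import json

import boto3

# Hypothetical secret name; replace it with the one you created for the RDS credentials.
SECRET_NAME = "datalakepoc/sakila"

def get_db_credentials(secret_name: str = SECRET_NAME, region: str = "us-east-1") -> dict:
    """Fetch the MySQL credentials stored as a JSON secret in AWS Secrets Manager."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

if __name__ == "__main__":
    credentials = get_db_credentials()
    print(sorted(credentials))  # e.g. host, password, port, username
```

Keeping credentials in Secrets Manager instead of hardcoding them in the Lambda code means they can be rotated without redeploying anything.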

We will take advantage of the AWS Free Tier, since it gives us enough leeway to keep our costs at zero or as close to it as possible.

Project Architecture

Finance Pro Tip: If this is a personal project, make sure you delete everything after you’re done.

Data Lake Sources

Our first step will be to set up our data source. For simplicity, we will create a MySQL database with the RDS service. The data will come from Sakila, a well-known sample database that simulates a movie rental business, with information about films, actors, stores, rentals, inventory, etc. It provides enough tables and data to obtain a few insights.

Once the database is up and running, we should connect to it and run the scripts. Make sure to use the ones in the GitHub repo, since this project assumes a specific schema and a specific amount of data.
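A quick way to confirm the load worked is to connect and count rows in a few tables. The sketch below is just a sanity check; it assumes the pymysql package, a hypothetical RDS endpoint, and the credentials created earlier, and any MySQL client would do the same job:

```python
import pymysql

# Hypothetical endpoint and credentials; replace them with your RDS instance values
# (or fetch them from Secrets Manager as shown earlier).
connection = pymysql.connect(
    host="datalakepoc.abc123xyz.us-east-1.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="sakila",
)

with connection.cursor() as cursor:
    for table in ("film", "actor", "rental"):
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        (count,) = cursor.fetchone()
        print(f"{table}: {count} rows")

connection.close()
```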

S3 Data Lake Layers

Now that our data source is set up, we must prepare our layers to store the data. We will create a separate bucket for each layer, which gives us the flexibility to tailor each bucket’s settings (access, versioning, etc.) to its specific requirements. We also need an additional bucket for resources and miscellaneous files that other services will use. In this project, we will keep the default settings as they are, ending up with one bucket per layer plus the resources bucket.
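If you prefer to script this step rather than click through the console, a minimal boto3 sketch could look like the following. The bucket names and region are assumptions; S3 bucket names are globally unique, so prefix them with something of your own:

```python
import boto3

REGION = "us-east-1"  # assumed region; change it to the one you are using
BUCKETS = [
    "datalakepoc-coal",       # raw layer
    "datalakepoc-pressure",   # curation layer
    "datalakepoc-diamond",    # application layer
    "datalakepoc-resources",  # scripts and miscellaneous assets
]

s3 = boto3.client("s3", region_name=REGION)

for bucket in BUCKETS:
    # us-east-1 does not accept a LocationConstraint; other regions require one.
    if REGION == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )
    print(f"Created bucket: {bucket}")
```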

Final Thoughts

Let’s take a look at our progress so far. We learned a few key concepts for designing and implementing a data lake. We also defined an architecture that will work for our sample database in RDS, Sakila. And last but not least, we created the layers so we can isolate and store our data. With everything in place, we have the pillars to continue building our fun-sized data lake. The next step is to process this information and create data models to extract useful insights from it.

See you next article, my dear fellows!
