Pasha Mahmoudzadeh
5 min read · May 9, 2019


Building a Data Lake on AWS

Many customers need a data storage and analytics solution that is more flexible and robust than traditional data management systems. Amazon Web Services (AWS) offers the data lake, a new way to securely store and analyze huge amounts of data at low cost. It supports easy search and analysis capabilities for a variety of data types.

A data lake on AWS is a central storage repository that lets you store both structured and unstructured data at any scale. With a data lake in an AWS architecture, data can be stored as-is (without the need to first structure it), and different kinds of analytics, from big data processing to machine learning, can be run on it for better decision-making. For this purpose, AWS offers an automated, cost-effective, and highly available data lake architecture with a user-friendly console for searching and requesting datasets in real time.

The core AWS services are automatically configured for easy tagging, searching, sharing, and governing of specific data subsets across an organization or with external users. Using the data lake architecture, users can catalog and upload new datasets of any size with searchable metadata, and do the same for existing datasets in Amazon S3, with minimal effort. These datasets integrate with AWS Glue and Amazon Athena for further transformation and analysis.

Figure 1: Data Lake on AWS

Why a Data Lake?

A data lake helps identify and act on opportunities for faster business growth while attracting and retaining customers through better decision making. A data lake offers:

· Unlimited Data Management: A data lake stores an unlimited amount of data in its original format and acts as an online system where data is always available for queries.

· Cost Reduction and Faster Data Preparation: When high performance is required, data processing workloads can easily be migrated to a data lake at low cost and run in parallel, much faster than before.

· Analytic Agility: A data lake provides a self-service environment in which analysts and data scientists can rapidly integrate, explore, and analyze the data they require. In addition, structure can be applied incrementally, at the right time, rather than being defined entirely up front.

· Not Limited to Standard SQL: A data lake supports machine learning, full-text search, scripting, and connectivity to data discovery, existing business intelligence, and analytics platforms, without being limited to standard SQL. This makes it a cost-effective solution for running data-oriented experiments and analyses over an unlimited amount of data.

Components of a Data Lake

A data lake involves three main operations: data ingestion, catalog building, and processing. For each of these, AWS offers multiple services. Following are a few:

Ingestion

Amazon Kinesis is one of the ingestion options that AWS offers. It provides easy data streaming and helps in building custom applications that use standard SQL queries to analyze or process streaming data. With Kinesis, users create streams divided into shards; data is collected from various sources, such as mobile applications and websites, pushed into these streams, and can then be archived or analyzed over sliding windows, with results stored in DynamoDB.
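As a minimal sketch of this ingestion path, the snippet below serializes a clickstream event and pushes it into a Kinesis stream with boto3. The stream name and event fields are hypothetical; the live call requires AWS credentials and an existing stream.

```python
import json

STREAM_NAME = "clickstream-events"  # hypothetical stream name


def make_record(user_id, page):
    """Serialize one clickstream event for Kinesis. The user id doubles as
    the partition key, so events from the same user land on the same shard."""
    payload = json.dumps({"user_id": user_id, "page": page})
    return {"Data": payload.encode("utf-8"), "PartitionKey": user_id}


def send_event(user_id, page):
    # boto3 is only needed for the live call; requires AWS credentials
    # and an existing Kinesis stream.
    import boto3

    kinesis = boto3.client("kinesis")
    return kinesis.put_record(StreamName=STREAM_NAME, **make_record(user_id, page))
```

Partitioning by a stable key (here the user id) preserves per-user ordering within a shard, which matters for sliding-window analysis downstream.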

Building Catalog

The catalog holds information about key aspects of the stored data, such as its format, its classification, and the tags used to search metadata in the data lake; the catalog itself can be kept in any storage location.

Figure 2: Building a Catalogue

You can build a catalog using the following steps:

· Put an object into an Amazon S3 bucket. When the object is stored, an event is triggered that invokes an AWS Lambda function, a piece of code that runs without you managing any infrastructure.

· The invoked code extracts the metadata of the stored object. The extracted metadata is then stored in a NoSQL database such as Amazon DynamoDB.

· AWS Lambda then picks up this metadata and pushes it into an Elasticsearch cluster, which different teams query to skim through the catalog.
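The steps above can be sketched as a Lambda handler: it reads each S3 ObjectCreated record from the event, builds a searchable metadata item, and writes it to DynamoDB. The table name and item shape are assumptions for illustration, not the solution's actual schema.

```python
import os
from datetime import datetime, timezone

# Hypothetical catalog table name, passed to Lambda via environment variable.
CATALOG_TABLE = os.environ.get("CATALOG_TABLE", "data-lake-catalog")


def build_catalog_item(s3_record):
    """Extract searchable metadata from one S3 ObjectCreated event record.
    (Real S3 event keys are URL-encoded; decoding is omitted for brevity.)"""
    bucket = s3_record["s3"]["bucket"]["name"]
    obj = s3_record["s3"]["object"]
    key = obj["key"]
    return {
        "dataset_id": f"{bucket}/{key}",  # partition key in DynamoDB
        "bucket": bucket,
        "key": key,
        "size_bytes": obj.get("size", 0),
        "format": key.rsplit(".", 1)[-1] if "." in key else "unknown",
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def handler(event, context):
    # Lambda entry point; boto3 is preinstalled in the Lambda runtime.
    import boto3

    table = boto3.resource("dynamodb").Table(CATALOG_TABLE)
    for record in event["Records"]:
        table.put_item(Item=build_catalog_item(record))
```

Keeping the metadata extraction in a pure function (`build_catalog_item`) makes the handler easy to test without any AWS calls.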

In AWS, users or teams access the metadata via APIs built on top of it. The Amazon API Gateway service is used to build a website that helps search the data lake. These APIs connect to backend components such as AWS Lambda, EC2, or a public endpoint where the catalog is built.

Processing

Processing unlimited data serves many different uses, and different techniques suit different workloads. AWS has a variety of data processing services, such as Amazon EMR, Athena, and Redshift. Amazon Athena is widely used for processing data in data lakes and offers the following benefits:

· Serverless (no ETL)

· Pay per query (pay only for data scanned)

· Built on Presto (runs standard SQL)

· Fast and interactive performance for large datasets

· Highly available

· Secure
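A minimal sketch of querying the lake with Athena through boto3 follows. The database name and result bucket are hypothetical; Athena needs an S3 output location for query results, and you pay per byte scanned, so selecting only needed columns matters more for cost than `LIMIT`.

```python
ATHENA_DATABASE = "data_lake"                 # hypothetical Glue/Athena database
RESULT_LOCATION = "s3://my-athena-results/"   # hypothetical output bucket


def preview_query(table, limit=10):
    """Build a simple preview query. Note: Athena bills per byte scanned,
    so LIMIT alone does not reduce cost -- narrow the selected columns too."""
    return f"SELECT * FROM {table} LIMIT {limit}"


def run_query(sql):
    # Live call: requires boto3, AWS credentials, and an existing database.
    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": ATHENA_DATABASE},
        ResultConfiguration={"OutputLocation": RESULT_LOCATION},
    )
    return resp["QueryExecutionId"]
```

`start_query_execution` is asynchronous; in practice you poll `get_query_execution` until the state is `SUCCEEDED` before fetching results.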

Benefits of Data Lake Building on AWS

Following are some advantages that make AWS a strong choice on which to build a data lake:

· Flexibility: Supports and stores data at any scale using Amazon S3, regardless of format or volume.

· Most Comprehensive Platform: The most comprehensive platform for building data lakes, offering security, agility, flexibility, and lower TCO.

· Security and Compliance: Supports easy encryption of all data lake data and can meet regulatory compliance standards such as PCI DSS and HIPAA.

Building a Data Lake on AWS

To build a data lake on AWS, you use an AWS CloudFormation template to configure the solution, including AWS services such as Amazon S3 for unlimited data storage, Amazon Cognito for authentication, Amazon Elasticsearch for strong search capabilities, AWS Lambda for microservices, AWS Glue for data transformation, and Amazon Athena for data analytics. The following figure represents the complete architecture of a data lake built on AWS using these services.
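Launching such a template can be scripted as well. The sketch below creates the stack with boto3; the stack name, template URL, and parameter names are hypothetical placeholders, not the actual solution's values.

```python
STACK_NAME = "data-lake-solution"  # hypothetical stack name
# Hypothetical template location; substitute the real template URL.
TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/data-lake.template"


def stack_parameters(admin_email):
    """Map console inputs to CloudFormation parameter structures.
    The parameter key here is an assumed example."""
    return [{"ParameterKey": "AdministratorEmail", "ParameterValue": admin_email}]


def create_stack(admin_email):
    # Live call: requires boto3 and AWS credentials; the template creates
    # IAM resources, hence the capability acknowledgement.
    import boto3

    cfn = boto3.client("cloudformation")
    resp = cfn.create_stack(
        StackName=STACK_NAME,
        TemplateURL=TEMPLATE_URL,
        Parameters=stack_parameters(admin_email),
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return resp["StackId"]
```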

Figure 3: Data Lake Architecture on AWS

The AWS data lake architecture leverages the durability, security, and scalability of Amazon S3 to store unlimited data, and uses Amazon DynamoDB to manage a persistent catalog of business datasets and their metadata. Once data is cataloged, its attributes and tags can be used to search and browse datasets from the solution console.
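To make the tag-based search concrete, here is a small sketch against a DynamoDB catalog table like the one built earlier. The table name and the `tags` attribute are assumptions; the first helper filters client-side and is testable offline, the second runs a server-side filtered scan.

```python
def filter_by_tag(items, tag):
    """Client-side filter over catalog items, each carrying a 'tags' list."""
    return [item for item in items if tag in item.get("tags", [])]


def scan_catalog_by_tag(table_name, tag):
    # Live call: server-side filter expression; needs boto3 and credentials.
    # Note a scan still reads the whole table -- a GSI on tags scales better.
    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource("dynamodb").Table(table_name)
    return table.scan(FilterExpression=Attr("tags").contains(tag))["Items"]
```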
