Rajesh Kumar
4 min read · May 18, 2020

How to get started with AWS Lake Formation?

This is my first effort to write about the emerging technologies around us. Yes friends, I'm talking about AWS. Last week I participated in the AWS Online Summit, and I thought I'd write about a few interesting features AWS introduced.

“AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.”

Architecture of a data lake for zipcodes in New York City

So I'm going to walk you through how to set up a data lake with AWS Lake Formation.

Step 1. Assign data lake administrator

The first step in creating your data lake in Lake Formation is to define one or more administrators. Administrators have full access to the Lake Formation system and control the initial data configuration and access permissions.
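The console step above can also be scripted. Below is a minimal boto3 sketch of naming a data lake administrator; the account ID and IAM user name are hypothetical placeholders, and `apply_settings` is defined but not invoked here since it needs live AWS credentials:

```python
# Hypothetical principal: replace with your own IAM user or role ARN.
ADMIN_ARN = "arn:aws:iam::123456789012:user/datalake-admin"

def build_admin_settings(admin_arn):
    """Build the DataLakeSettings payload that names a data lake administrator."""
    return {"DataLakeAdmins": [{"DataLakePrincipalIdentifier": admin_arn}]}

def apply_settings(admin_arn):
    """Apply the settings via the Lake Formation API (requires AWS credentials)."""
    import boto3
    lf = boto3.client("lakeformation")
    lf.put_data_lake_settings(DataLakeSettings=build_admin_settings(admin_arn))
```

Note that `put_data_lake_settings` replaces the whole admin list, so include every existing administrator when you call it.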

AWS Lake Formation in the services panel

After opening AWS Lake Formation, go to “Permissions” at the bottom left of the panel.

Add an administrator under admins and database creators
The administrator has been updated successfully in AWS Lake Formation

Step 2. Now it’s time to set up the data lake. It will be created in three stages.

Stage 1. Register your Amazon S3 Storage

Choose your S3 bucket to set up the data lake
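Registering the S3 location can likewise be done through the API. A minimal sketch follows; the bucket name and prefix are hypothetical, and the `register` function is left un-invoked because it requires credentials:

```python
# Hypothetical S3 location holding the NYC zipcode data.
S3_ARN = "arn:aws:s3:::my-zipcode-datalake/zipcodes/"

def build_register_request(resource_arn):
    """Parameters for lakeformation.register_resource; the service-linked
    role lets Lake Formation access the registered S3 location on your behalf."""
    return {"ResourceArn": resource_arn, "UseServiceLinkedRole": True}

def register(resource_arn):
    """Register the S3 path with Lake Formation (requires AWS credentials)."""
    import boto3
    lf = boto3.client("lakeformation")
    lf.register_resource(**build_register_request(resource_arn))
```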

Step 3. Once the S3 path is registered, we need to create a database in the Data Catalog to hold the table metadata, which analytics services (Athena, Redshift Spectrum, etc.) can then query.

Stage 2. Create a Database for data cataloging

Step 4. As per the design, the database is created through the AWS Glue Data Catalog, so we need to grant the permissions for Glue.

Zipcode-db is a database that has the AWSGlueServiceRole permission
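For reference, creating the catalog database itself is a single Glue API call. A sketch under the walkthrough's naming (the S3 location URI is a hypothetical placeholder; `create_database` is not invoked here):

```python
# Database name from the walkthrough; the location URI is a hypothetical placeholder.
DB_NAME = "zipcode-db"
LOCATION = "s3://my-zipcode-datalake/zipcodes/"

def build_database_input(name, location_uri):
    """DatabaseInput payload for glue.create_database."""
    return {
        "Name": name,
        "LocationUri": location_uri,
        "Description": "Catalog database for NYC zipcode data",
    }

def create_database(name, location_uri):
    """Create the catalog database (requires AWS credentials)."""
    import boto3
    glue = boto3.client("glue")
    glue.create_database(DatabaseInput=build_database_input(name, location_uri))
```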

Step 5. Create and run an AWS Glue crawler to load the data into Zipcode-db.

After a successful crawler run, a zipcode table has been populated
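The crawler step can be scripted as well. A sketch with hypothetical names (crawler name, role ARN, and S3 path are placeholders; the role must carry the AWSGlueServiceRole policy, and `run_crawler` is not invoked here):

```python
# Hypothetical names; adjust to your own account and bucket.
CRAWLER = "zipcode-crawler"
ROLE_ARN = "arn:aws:iam::123456789012:role/AWSGlueServiceRole-zipcode"
S3_PATH = "s3://my-zipcode-datalake/zipcodes/"

def build_crawler_request(name, role, database, s3_path):
    """Parameters for glue.create_crawler: crawl the S3 path and write
    the discovered table definitions into the given catalog database."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def run_crawler(name, role, database, s3_path):
    """Create and start the crawler (requires AWS credentials)."""
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**build_crawler_request(name, role, database, s3_path))
    glue.start_crawler(Name=name)
```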

Step 6. Once the table is there, we need permission to read it.

We can also restrict the permissions by user role (Data Analyst, Data Scientist, or Business Analyst). This addresses several concerns at once: cost management (only the permitted columns are accessed), secure views of the data, and protection against malicious access or data theft.

Read access has been granted on two columns for a data analyst
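The column-level grant shown above maps to Lake Formation's `grant_permissions` call with a `TableWithColumns` resource. A minimal sketch; the analyst principal and column names are hypothetical placeholders, and `grant` is not invoked here:

```python
# Hypothetical analyst principal; table and database names follow the walkthrough.
ANALYST_ARN = "arn:aws:iam::123456789012:user/data-analyst"

def build_column_grant(principal_arn, database, table, columns):
    """Parameters for lakeformation.grant_permissions, restricting SELECT
    to specific columns of a catalog table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnNames": columns,
        }},
        "Permissions": ["SELECT"],
    }

def grant(principal_arn, database, table, columns):
    """Grant the column-restricted read (requires AWS credentials)."""
    import boto3
    lf = boto3.client("lakeformation")
    lf.grant_permissions(**build_column_grant(principal_arn, database, table, columns))
```

With this grant in place, the analyst querying the table through Athena sees only the listed columns.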

Conclusion:

Amazon S3 is the foundation of a data lake. We can keep the data lake private, encrypt everything, and grant specific access (Data Analyst, Data Scientist, Data Engineer, etc.) to and from it. This improves performance through parallelized access and horizontal scaling. This architecture can also be leveraged to improve data governance, data management, and efficiency.

References:

  1. Data Lake Formation — https://aws.amazon.com/lake-formation/
  2. AWS Summit Online Takeaways

You can have data without information, but you cannot have information without data — Technical Lead at Lumiq.ai (AWS, GCP, Azure & Snowflake)