Data Lake vs Lake Formation

Rajesh Kumar
2 min read · Jun 15, 2020

Why do we actually need a data lake?

  • More data keeps arriving (growing exponentially)
  • Needs that go well beyond the traditional data warehouse
  • More experiments and user experiences (different kinds of users)

That’s why we came up with the data lake: a centralized repository that lets you store all your structured and unstructured data at any scale.

Setting up a Data Lake in the AWS Environment

If we dive deep into the execution flow above, the pipeline involves a lot of manual provisioning: creating an S3 bucket, spinning up an RDS database, creating different policies on the S3 bucket, defining the schema of the tables, and plenty of other development work. This raises the concern that your data lake might be:

“Manual → Error-Prone → Time-Consuming”
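
To make that manual burden concrete, here is a minimal Python sketch of just one of those steps: hand-writing a read-only S3 bucket policy for a single consumer. The bucket name, prefix, and role ARN are hypothetical placeholders, not values from our setup. The function only builds the policy document; applying it would be a separate call (boto3's `put_bucket_policy`), shown commented out because it needs AWS credentials.

```python
import json


def build_read_only_bucket_policy(bucket, prefix, role_arn):
    """Build an S3 bucket policy that lets one IAM role read one prefix.

    All arguments are hypothetical placeholders used for illustration.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, but only keys under the given prefix.
                "Sid": "AllowListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading the objects under that prefix.
                "Sid": "AllowReadObjects",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = build_read_only_bucket_policy(
    bucket="example-data-lake",
    prefix="sales/",
    role_arn="arn:aws:iam::111122223333:role/ExampleAnalystRole",
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)

# Applying the policy (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_policy(Bucket="example-data-lake", Policy=policy_json)
```

Every new consumer, prefix, or bucket means another hand-edited statement like these, which is exactly where the errors and the lost time come from.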

To address these weaknesses of a hand-built AWS data lake, AWS came up with a new service that we already explored in our previous story. To see how to set up Lake Formation, please refer to this link (AWS Lake Formation).

So this blog is all about Data Lake vs Lake Formation. After some research and a few POCs, we found that Lake Formation is well equipped to resolve the complexities of a data lake, especially around IAM policies and administrator rights. Now let's discuss the major differences.

Lake Formation provides its own permissions model that augments the AWS IAM permissions model. This centrally defined model enables fine-grained access to data stored in data lakes through a simple grant/revoke mechanism.

Lake Formation permissions are enforced at the table and column level across the full portfolio of AWS analytics and machine learning services, including Amazon Athena and Amazon Redshift.
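
As a rough illustration of that grant/revoke mechanism at the column level, here is a minimal Python sketch. It only builds the request for a `SELECT` grant on a few columns; the database, table, column names, and role ARN are hypothetical. The actual calls (`grant_permissions` and `revoke_permissions` on the boto3 `lakeformation` client) are left commented out because they need AWS credentials and an existing Data Catalog table.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a Lake Formation column-level SELECT grant.

    The same dictionary can be passed to revoke_permissions to undo the grant.
    All values used below are hypothetical placeholders.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


request = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/ExampleAnalystRole",
    database="sales_db",
    table="orders",
    columns=["order_id", "order_date", "amount"],
)
print(request)

# Sending the request (requires boto3 and AWS credentials):
# import boto3
# lf = boto3.client("lakeformation")
# lf.grant_permissions(**request)    # grant SELECT on the listed columns
# lf.revoke_permissions(**request)   # revoke the same permission
```

Compared with a hand-written bucket policy, the grant names a table and its columns rather than S3 paths, and the same request shape serves for both granting and revoking.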

AWS has announced a few R&D items and upcoming releases for AWS Lake Formation.

Releasing soon (AWS R&D)

  • Moving Hadoop-based data directly to S3 using Lake Formation
  • Automatic scaling of the number of DPUs required in AWS Glue
  • Row-Level Security (RLS) is in progress

In an upcoming blog, I will write a new story on security and access control over the Data Catalog in Lake Formation. It will help us understand how to build better security and access policies on a centralized data catalog.

See you soon…

PS: “Security is always excessive until it’s not enough.”

Rajesh Kumar

You can have data without information, but you cannot have information without data. Technical Lead at Lumiq.ai (AWS, GCP, Azure & Snowflake)