AWS Lake Formation Permissions with Examples
This blog was originally published at https://tokern.io/blog/lake-formation-permissions
AWS Lake Formations helps to setup a secure data lake on AWS S3. One of the main goals of the product is Simplified Security Management. The central tenet to this goal is to define security, governance and audit policies in a single location. Lake Formation permissions go beyond IAM policies to manage secure access to databases, table and columns as well.
The previous blog post explained how to setup a data lake with NYC Trip Data.
This tutorial uses the same data lake and shows how to setup basic permissions for two personas: data engineer and data analyst.
Rest of the blog explains resources and permissions. Then it sets up two users and their permissions (data engineer and data analyst) to explain the concepts and how they help to secure the data lake.
Resources and Permissions
There are two types of resources in AWS Lake Formation:
- Metadata is stored in AWS Glue Data Catalog and consist of databases, tables and columns.
- Data is stored in AWS S3 locations or in databases.
Metadata
The table below describes the metadata permissions available for each of the resources in the catalog.
Data
Data Location permissions are applicable to AWS S3 locations registered with Lake Formation. The only permission available as of this article is DATA_LOCATION_ACCESS. DATA_LOCATION_ACCESS allows users to create tables without the need to have IAM permissions for a S3 location.
Similarly, SELECT, INSERT and DELETE permissions are available only on registered S3 locations.
An important point is that Lake Formation users do not need IAM permissions to access tables in a registered data location. Lake Formation’s permissions ensure secure access instead of AWS IAM permissions.
Note that for database locations such as RDS or Redshift the permissions are determined by the user specified in the JDBC connection string and the GRANT permissions in the database.
Permissions on NYC Trip Data data lake
Pre-requisites
- AWS Glue, AWS Athena and AWS Lake Formation have been setup.
- NYC Taxi Trip data set is available in the data lake in taxidata database.
- lakeadmin is the AWS Lake Formation administrator.
- lakecli and athenacli are setup and configured in your system.
Create Policies and Users
First create a few managed policies.
Navigate to AWS Console > IAM > Policies > Create Policy and create the following policies
- DataEngineerLakeFormationPolicy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"lakeformation:GrantPermissions",
"lakeformation:RevokePermissions",
"lakeformation:BatchGrantPermissions",
"lakeformation:BatchRevokePermissions",
"lakeformation:ListPermissions" ],
"Resource": "*"
}
]
}
- DataEngineerPassRole
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PassRolePermissions",
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": [
"arn:aws:iam::account-id:role/workflow_role"
]
}
]
}
- DataAnalystGlueAccess
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"glue:GetTable",
"glue:GetTables",
"glue:SearchTables",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartitions"
],
"Resource": "*"
}
]
}
- S3NycBucketRead is required because nyc-tlc is not registered as a data location and explicit IAM permissions are required.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::nyc-tlc",
"arn:aws:s3:::nyc-tlc/*"
]
}
]
}
Navigate to AWS Console > IAM > Create User and create the following users with attached policies.
Register a data location
- Create an S3 bucket called prefix_datalake. Choose a unique prefix.
- In AWS Console, go to AWS Lake Formation > Databases > Create Database
- Create a database with the following details:
Data Engineer Setup
This section sets up the permission for datalake_engg. datalake_engg creates a table in taxilake from a table in taxidata.
datalake_engg needs CREATE_TABLE permission to create and manage tables in a database.
# lakeadmin in lakecli
lakecli> select * from database_privileges where principal like '%datalake_engg%';
0 rows in set
Time: 0.001s
lakecli> GRANT CREATE_TABLE ON DATABASE 'taxilake' to 'user/datalake_engg';
GRANT
Time: 3.071s
lakecli> select * from database_privileges where principal like '%datalake_engg%';
+----+-------------+--------------------+--------------+-------+
| id | schema_name | principal | permission | grant |
+----+-------------+--------------------+--------------+-------+
| 4 | taxilake | user/datalake_engg | CREATE_TABLE | 0 |
+----+-------------+--------------------+--------------+-------+
1 row in set
Time: 0.027s
lakecli> GRANT SELECT ON TABLE 'taxidata'.'csv_misc' to 'user/datalake_engg';
GRANT
Time: 3.279s
lakecli> select * from table_privileges where principal like '%datalake_engg%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 3 | taxidata | csv_misc | user/datalake_engg | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
1 row in set
Time: 0.024s
#datalake_engg in athenacli
us-east-2:default> select zone from taxidata.csv_misc limit 10;
+---------------------------+
| zone |
+---------------------------+
| "Newark Airport" |
| "Jamaica Bay" |
| "Allerton/Pelham Gardens" |
| "Alphabet City" |
| "Arden Heights" |
| "Arrochar/Fort Wadsworth" |
| "Astoria" |
| "Astoria Park" |
| "Auburndale" |
| "Baisley Park" |
+---------------------------+
10 rows in set
Time: 4.064sus-east-2:default> create table taxilake.zones as select zone from taxidata.csv_misc;
0 rows in set
Time: 11.644sus-east-2:default> select count(*) from taxilake.zones;
+-----------+
| _col0 |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 4.030s
us-east-2:default> select count(*) from taxidata.csv_misc;
+-----------+
| _col0 |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 4.844s
The data engineer automatically gets SELECT, INSERT, DELETE, ALTER and DROP permissions with GRANT capability.
\r:iamdb> select * from table_privileges where principal like '%datalake_engg%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 1 | taxilake | zones | user/datalake_engg | ALL | 1 |
| 2 | taxilake | zones | user/datalake_engg | ALTER | 1 |
| 3 | taxilake | zones | user/datalake_engg | DELETE | 1 |
| 4 | taxilake | zones | user/datalake_engg | DROP | 1 |
| 5 | taxilake | zones | user/datalake_engg | INSERT | 1 |
| 6 | taxilake | zones | user/datalake_engg | SELECT | 1 |
| 7 | taxidata | csv_misc | user/datalake_engg | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
Analyst/Data Scientist
datalake_user needs SELECT permission on a table to read data.
# lakeadmin in lakecli
lakecli> select * from table_privileges where principal like '%datalake_user%';
0 rows in set
Time: 0.002s
datalake_user also cannot run any queries in Athena.
# datalake_user in athenacli
us-east-2:default> SELECT COUNT(*) FROM taxidata.csv_misc;
Insufficient permissions to execute the query. Insufficient Lake Formation permission(s) on csv_misc
datalake_user has access to the table data once SELECT permission on the table is granted.
# lakeadmin in lakecli
lakecli> GRANT SELECT ON TABLE 'taxilake'.'zones' to 'user/datalake_user';
GRANT
Time: 1.499s
lakecli> select * from table_privileges where principal like '%datalake_user%';
+----+-------------+------------+--------------------+------------+-------+
| id | schema_name | table_name | principal | permission | grant |
+----+-------------+------------+--------------------+------------+-------+
| 2 | taxilake | csv_misc | user/datalake_user | SELECT | 0 |
+----+-------------+------------+--------------------+------------+-------+
1 row in set
Time: 0.024s# datalake_user in athenacli
us-east-2:default> SELECT COUNT(*) FROM taxilake.zones;
+-----------+
| _col0 |
+-----------+
| 107003212 |
+-----------+
1 row in set
Time: 6.932s
Summary
AWS Lake Formation Permissions are better suited than IAM permissions to secure a data lake. The two main reasons are
- Lake Formation Permissions are on logical objects like a database, table or column instead of files and directories.
- Lake Formation Permissions provide granular control for column-level access.
This blog explores the permissions required for a basic setup of a dataset, administrator, data engineer and an analyst.