AWS Lake Formation And Glue Access Analyzer
AWS Lake Formation permissions control access to data sets in your AWS data lake at table-level and column-level granularity. For a quick primer, read the Lake Permissions by Example blog post.
Once access policies are set up in AWS Lake Formation, it is important to check regularly that the policies are up to date and do not leak unintended privileges. In this article, two utilities, lakecli and piicatcher, are combined to check for privilege leaks in a data lake built on Glue, Lake Formation and S3 with a single SQL statement.
piicatcher tags all columns that contain critical data like PII & PHI in the AWS Glue catalog.
lakecli provides a SQL interface to find all privileged users.
Once the list is reviewed, it can be used to ensure that there is no leak in privileged access through a scheduled automated process.
Prerequisites
The article assumes the AWS account has a data lake set up using the following technologies:
- AWS Glue
- AWS Lake Formation
- AWS Athena
- AWS CloudTrail
AWS Athena is used by data analysts and scientists to access the data. If you use another product, ensure that it uses the Glue catalog as its metadata store.
Check the Secure Data Lake Tutorial to set up a secure data lake using the New York City Taxi and Limousine Commission (TLC) Trip Record Data.
Discover and categorize data
The first step in analyzing access is to categorize data sets. Typically, access policies are defined per category. Every business has its own categories and its own patterns for recognizing them. Common categories of data are:
- Personally Identifiable Information (PII)
- Protected Health Information (PHI)
- Business-specific critical information such as sales and financial data.
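As a rough illustration of what "patterns to recognize a category" can mean, the sketch below matches column names against regular expressions. The pattern lists and the `categorize_column` helper are hypothetical and are not how PiiCatcher works internally; real rules depend on the business and usually also inspect the data itself.

```python
import re

# Hypothetical category patterns, for illustration only.
# Real deployments define their own categories and rules.
CATEGORY_PATTERNS = {
    "PII": re.compile(r"(name|email|phone|address|ssn)", re.IGNORECASE),
    "PHI": re.compile(r"(diagnosis|medication|patient)", re.IGNORECASE),
    "FINANCIAL": re.compile(r"(revenue|salary|invoice)", re.IGNORECASE),
}


def categorize_column(column_name):
    """Return the first category whose pattern matches the column name, else None."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(column_name):
            return category
    return None
```

Name-based matching is cheap but misses columns with uninformative names, which is why a scanner that samples column values is the more reliable option.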
PiiCatcher
Run PiiCatcher to discover PII data in the NYC Trip data set.
piicatcher aws -r <region> --list-all
PiiCatcher finds PII data in taxidata.csv_misc.
Augment AWS Glue Catalog with categories
Finding PII data is not sufficient on its own. The tables and columns must be tagged with the category permanently so that other utilities can use the tags for analysis. The AWS Glue Catalog allows custom metadata to be stored in a field called Parameters for every column. For example, the columns for taxidata.csv_misc are:
'Columns': [
{ 'Name': 'locationid', 'Type': 'bigint' },
{ 'Name': 'borough', 'Type': 'string' },
{ 'Name': 'zone', 'Type': 'string' },
{ 'Name': 'service_zone', 'Type': 'string' }
]
PiiCatcher adds a parameter to store the type of PII data found in the column when run with the command below:
piicatcher aws -r <region> --output-format glue
After the run, each column where PII was found has a new parameter with the key PII and the PII category as its value. The same table now has the following metadata:
'Columns': [
{ 'Name': 'locationid', 'Type': 'bigint' },
{ 'Name': 'borough', 'Type': 'string', 'Parameters': { 'PII': 'PiiTypes.ADDRESS' } },
{ 'Name': 'zone', 'Type': 'string', 'Parameters': { 'PII': 'PiiTypes.ADDRESS' } },
{ 'Name': 'service_zone', 'Type': 'string', 'Parameters': { 'PII': 'PiiTypes.ADDRESS' } }
]
The columns with the PII parameter can now be used by lakecli to analyze privileged access.
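The tagging step can be pictured as a small transformation on the Glue column list. The sketch below is not PiiCatcher's implementation; `tag_pii_columns` and the `findings` mapping are hypothetical names, and the column list is hard-coded here rather than read from the Glue catalog. In a real run the list would be fetched from and written back to Glue, for example through the boto3 Glue client.

```python
import copy


def tag_pii_columns(columns, pii_by_column):
    """Return a copy of Glue-style column dicts with a 'PII' parameter added.

    columns: list of dicts shaped like the Glue 'Columns' field.
    pii_by_column: mapping of column name -> PII category string.
    """
    tagged = copy.deepcopy(columns)  # leave the caller's list untouched
    for column in tagged:
        pii_type = pii_by_column.get(column["Name"])
        if pii_type is not None:
            # Create 'Parameters' if absent, then record the category.
            column.setdefault("Parameters", {})["PII"] = pii_type
    return tagged


columns = [
    {"Name": "locationid", "Type": "bigint"},
    {"Name": "borough", "Type": "string"},
    {"Name": "zone", "Type": "string"},
    {"Name": "service_zone", "Type": "string"},
]
# Hypothetical scan result mirroring the metadata shown above.
findings = {
    "borough": "PiiTypes.ADDRESS",
    "zone": "PiiTypes.ADDRESS",
    "service_zone": "PiiTypes.ADDRESS",
}
tagged = tag_pii_columns(columns, findings)
```

Columns without a finding, such as locationid, are left without a Parameters entry, which is what lets a later query distinguish tagged from untagged columns.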
Access Table Properties in Information Schema
lakecli provides an information schema for AWS Lake Formation. The information schema provides a SQL interface to the Glue catalog and Lake Formation permissions for easy analysis.
The COLUMNS table records which columns have PII data.
\r:iamdb> SELECT ORDINAL, TABLE_NAME, COLUMN_NAME, PII FROM COLUMNS;
The TABLE_PRIVILEGES table stores all the privileges defined on tables.
\r:iamdb> SELECT * FROM TABLE_PRIVILEGES where principal like 'user%';
The query below joins these tables and lists the principals who have access to columns with PII data.
SELECT
DISTINCT `PRINCIPAL`
FROM
`COLUMNS` INNER JOIN `TABLE_PRIVILEGES`
ON
`COLUMNS`.`TABLE_SCHEMA` = `TABLE_PRIVILEGES`.`SCHEMA_NAME` AND
`COLUMNS`.`TABLE_NAME` = `TABLE_PRIVILEGES`.`TABLE_NAME`
WHERE
`COLUMNS`.`PII` IS NOT NULL AND
`TABLE_PRIVILEGES`.`PERMISSION` IN ('ALL', 'SELECT')
ORDER BY 1
This list can be reviewed to ensure that the right principals have access to PII data. A similar process of discovery and analysis can be extended to all types of critical data.
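To see how the join behaves, the sketch below reproduces it against an in-memory SQLite database instead of lakecli. The table and column names follow the query above, but the rows, the principals and the csv_trips table are invented for illustration, and the real information schema may have additional columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-ins for the two information schema tables (sample rows are made up).
conn.executescript("""
CREATE TABLE COLUMNS (TABLE_SCHEMA TEXT, TABLE_NAME TEXT, COLUMN_NAME TEXT, PII TEXT);
CREATE TABLE TABLE_PRIVILEGES (SCHEMA_NAME TEXT, TABLE_NAME TEXT, PRINCIPAL TEXT, PERMISSION TEXT);
INSERT INTO COLUMNS VALUES
  ('taxidata', 'csv_misc', 'locationid', NULL),
  ('taxidata', 'csv_misc', 'borough', 'PiiTypes.ADDRESS'),
  ('taxidata', 'csv_trips', 'fare', NULL);
INSERT INTO TABLE_PRIVILEGES VALUES
  ('taxidata', 'csv_misc', 'user/alice', 'SELECT'),
  ('taxidata', 'csv_misc', 'user/bob', 'DESCRIBE'),
  ('taxidata', 'csv_trips', 'user/carol', 'ALL');
""")

PII_PRINCIPALS_SQL = """
SELECT DISTINCT p.PRINCIPAL
FROM COLUMNS AS c
INNER JOIN TABLE_PRIVILEGES AS p
  ON c.TABLE_SCHEMA = p.SCHEMA_NAME AND c.TABLE_NAME = p.TABLE_NAME
WHERE c.PII IS NOT NULL AND p.PERMISSION IN ('ALL', 'SELECT')
ORDER BY 1
"""
principals = [row[0] for row in conn.execute(PII_PRINCIPALS_SQL)]
```

With this sample data only user/alice is returned: user/bob lacks a read permission on the table, and user/carol has access only to a table with no PII columns.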
Continuous Access Analysis
The process above can be automated with a scheduling system such as cron or Apache Airflow, so that privileged principals are reviewed against a canonical list on a regular basis.
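The check that such a scheduled job performs can be as simple as a set difference. The sketch below assumes the current principals come from the SQL query shown earlier and that the approved list is maintained by hand; `find_unapproved_principals` and the sample principals are hypothetical.

```python
def find_unapproved_principals(current, approved):
    """Return principals that have PII access but are not on the approved list."""
    return sorted(set(current) - set(approved))


# Hypothetical inputs: the reviewed canonical list and the latest query result.
approved = {"user/alice"}
current = ["user/alice", "user/mallory"]

leaks = find_unapproved_principals(current, approved)
if leaks:
    # A scheduled job would typically alert or fail here.
    print("Unapproved principals with PII access:", ", ".join(leaks))
```

Failing the job when the difference is non-empty makes a privilege leak visible in whatever alerting the scheduler already provides.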
Conclusion
This article described how two utilities, lakecli and piicatcher, can be combined to check for privilege leaks in an AWS data lake built on AWS Glue, Lake Formation and S3.
Originally published at https://tokern.io.