Published in

Funding Circle

8 min read · Jul 20, 2022

At Funding Circle we treat data as a first-class citizen, as it drives all our critical business decisions. We collect and process data through various channels. The data we deal with daily may contain sensitive and personally identifiable information (PII), and protecting that data and processing it in a controlled manner is one of the highest priorities in the company. In this blog post we want to talk about the various approaches we used along the way to control access to the data in the data lake, and our journey towards using LakeFormation.

About the technologies

If you are familiar with AWS IAM, Glue and LakeFormation, then you can skip this section.

AWS Glue

Glue is a managed ETL service from AWS, which can be used for extracting data from various sources, transforming it with Spark and loading it into various targets. It comes with a metastore called Glue catalog where metadata about the data is stored as databases, tables and columns.

AWS Identity and Access Management (IAM)

AWS IAM provides fine-grained access control over AWS resources. It can be used to restrict access to AWS services and resources. Permissions are usually defined as JSON documents called policies, which are attached to principals (users/roles) to grant or restrict permissions. In this blog post, when we refer to IAM policies, we mean policies granting access to S3 resources (data under S3 prefixes) and Glue catalog resources (databases and tables).
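As a rough illustration, a policy of the kind described here might look like the sketch below. The bucket, prefix, region, account id, database and table names are all hypothetical, and the exact set of actions a real reader role needs will differ; only the policy document layout and ARN formats are standard.

```python
import json

# Hypothetical identifiers, used only for illustration.
BUCKET = "example-data-lake"
PREFIX = "loans/"
REGION = "eu-west-1"
ACCOUNT_ID = "111122223333"
DATABASE = "loans_db"
TABLE = "applications"


def build_read_policy() -> dict:
    """Return an IAM policy document granting read access to one table:
    one statement for the S3 data, one for the Glue catalog metadata."""
    glue_arn = f"arn:aws:glue:{REGION}:{ACCOUNT_ID}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadS3Prefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}*"],
            },
            {
                "Sid": "ReadGlueMetadata",
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable"],
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/{DATABASE}",
                    f"{glue_arn}:table/{DATABASE}/{TABLE}",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Print the policy as the JSON document that would be attached to a role.
    print(json.dumps(build_read_policy(), indent=2))
```

Note that the finest Glue resource that can be named here is a table, which is the granularity limit discussed later in the post.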

AWS LakeFormation

AWS LakeFormation is a managed service for setting up a data lake in the AWS cloud. In our use case we primarily focus on the access control feature provided by LakeFormation.

Background

We built our data platform on top of AWS, and use S3, Glue, Athena and other services to fulfil our analytics requirements. We store the data in S3 with the corresponding schema/metadata stored in the Glue Data Catalog, and the data service layer uses Athena. We adhere to the principle of least privilege and want applications and users to have access solely to the data they require. We also want to secure, audit and control access to sensitive and PII data. One more driving principle is to follow mesh-style data product ownership, where the owner of a dataset grants or restricts access to the data. We are going to talk about the old IAM-based approach, the issues with it, and then how LakeFormation was used to address those issues.

IAM-Based Permission Model

We used to have IAM-based access control on the Glue Data Catalog and the underlying S3 data to enable specialised access. IAM allows defining permission policies on S3 prefixes and Glue Data Catalog resources with table-level granularity. It has a few limitations though:

We cannot categorise what is PII and what is non-PII.
We have to create principal or resource-based policies to enable specialised access, which leads to a large number of IAM policies to be created and maintained.
We can’t grant permission at a granular column level using IAM policies. Column-level control is required for having different views of the same underlying S3 data when it contains both PII and non-PII data. With IAM alone, this can only be achieved by creating different copies of the data: one copy contains the PII and sensitive data, while the other has it deleted or redacted. Different users/roles are then granted permissions on different copies of the data based on their needs.
We can’t determine what users/roles have permissions on a specific dataset in a centralised manner from IAM policies alone.

LakeFormation Access Control

After realising the limitations with the IAM-based permission model for data, we looked into the access control feature provided by AWS LakeFormation. We found the following main features that would address issues we came across with the IAM-based approach:

Ability to grant/revoke permissions at a granular column level, so we no longer have to create and maintain different copies of the data.
RDBMS-style grant/revoke permissions model on databases, tables and columns.
The ability to tag Glue databases, tables and columns for different confidentiality levels.
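To make the column-level grant concrete, here is a minimal sketch that builds the arguments for such a grant. The dictionary shape follows the boto3 Lake Formation `grant_permissions` call as we understand it; the role ARN, database, table and column names are hypothetical.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build keyword arguments shaped like boto3's Lake Formation
    grant_permissions call, granting SELECT on specific columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role that may read only the non-PII columns.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/example-analyst",
    "loans_db",
    "applications",
    ["application_id", "amount", "created_at"],
)
```

With credentials in place, a request like this could be passed as `boto3.client("lakeformation").grant_permissions(**request)`. The PII columns are simply left out of `ColumnNames`, so no redacted copy of the data is needed.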

Our initiative with LakeFormation was not a greenfield project and we had to work in parallel with the existing IAM permissions model. AWS documentation recommends retaining global data lake settings of “Use only IAM access control for new databases” and permissions for IAMAllowedPrincipals to be backward compatible.

We had to migrate our existing Glue datasets gradually, dataset by dataset. Once a dataset is migrated over to Lake Formation, the permissions that ensure backward compatibility with IAM permissions (IAMAllowedPrincipals) are revoked for that dataset. We also disabled the LakeFormation default settings that state “Use only IAM access control for new databases” and “Use only IAM access control for new tables in new databases” to enforce that new datasets implement the LakeFormation access control as soon as they are created. In the following sections we are going to talk in detail about the different types of permission models available in LakeFormation.
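The two migration steps can be sketched as request payloads. This is an illustration under assumptions, not our actual tooling (which is Terraform): the shapes follow the boto3 Lake Formation `revoke_permissions` and `put_data_lake_settings` calls as we understand them, and should be checked against the current API reference before use. Database and table names are hypothetical.

```python
import copy

# Special principal Lake Formation uses for IAM backward compatibility.
IAM_ALLOWED_PRINCIPALS = "IAM_ALLOWED_PRINCIPALS"


def revoke_iam_compat_request(database, table):
    """Arguments for revoking the backward-compatible grant on one table,
    shaped like boto3's revoke_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": IAM_ALLOWED_PRINCIPALS},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["ALL"],
    }


def disable_iam_defaults(settings):
    """Return a copy of the data lake settings with the 'Use only IAM
    access control' defaults switched off for new databases and tables.
    Other settings (for example the admin list) are preserved, because the
    settings call replaces the whole settings object."""
    updated = copy.deepcopy(settings)
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    return updated
```

In practice the current settings would be read first, passed through `disable_iam_defaults`, and written back, so that existing administrators are not dropped by accident.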

Named Resource Based Access Control (NRBAC)

In the beginning Lake Formation supported the named resource based method only, where we had to grant permissions to IAM principals on specific databases, tables or columns. The goal was to have an easy-to-implement and reusable solution that others could follow to migrate their datasets to LF. We started with a centralised GitHub repository where we used Terraform to define the LF admin settings and the LF permissions for IAM principals, and to register S3 locations with LF. We soon realised the centralised approach went against data mesh principles, which promote that access control for a dataset should be owned and maintained by the dataset owners, close to where the data product is created.

Since everything is configured in Terraform, we came up with reusable Terraform modules that could be used to group permissions either by IAM principal or by Glue catalogue object (database, table, column etc). Named resource-based access control works very well and provides an RDBMS-style grant/revoke permission management model. However, it scales poorly as the number of IAM principals or Glue catalogue objects grows. E.g., if we have N resources (columns, tables, databases) and M principals, then in the worst case we would have to create N * M grant expressions.
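The worst case is easy to see by enumerating the grants. The role and table names below are made up purely for the arithmetic.

```python
from itertools import product


def named_grants(principals, resources):
    """Worst case for named-resource access control: every principal
    needs an explicit grant on every resource."""
    return [(p, r) for p, r in product(principals, resources)]


principals = [f"role-{i}" for i in range(4)]       # M = 4 principals
resources = [f"db.table_{i}" for i in range(25)]   # N = 25 resources

grants = named_grants(principals, resources)
print(len(grants))  # 100, i.e. N * M

# Adding a single new table requires one more grant per principal.
more = named_grants(principals, resources + ["db.table_new"])
print(len(more) - len(grants))  # 4
```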

Figure: LF named resource based permission model

Tag Based Access Control (TBAC)

In May 2021 AWS announced support for tag based access control (TBAC), which is similar to attribute-based access control for other services. In TBAC, instead of granting permissions on resources by name, we tag the databases/tables/columns with attributes (tag=value) and then grant permissions on the tags through a permission policy. This reduces the number of access policies that have to be added or changed every time we add a new database/table/column.
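The idea can be illustrated with a toy model. This is not how Lake Formation evaluates tags internally (real tag expressions can combine several keys, and tags are inherited from database to table to column); it only shows why a new resource needs a tag rather than a new grant. The `confidentiality` tag key, its values and the principal names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagGrant:
    """A permission granted on a tag rather than on a named resource."""
    principal: str
    tag_key: str
    tag_values: frozenset
    permission: str = "SELECT"


def is_allowed(principal, resource_tags, grants, permission="SELECT"):
    """True if any grant for this principal matches the resource's tags."""
    return any(
        g.principal == principal
        and g.permission == permission
        and resource_tags.get(g.tag_key) in g.tag_values
        for g in grants
    )


# Two grants cover every current and future column.
grants = [
    TagGrant("analyst", "confidentiality", frozenset({"public"})),
    TagGrant("compliance", "confidentiality", frozenset({"public", "pii"})),
]

# Columns carry tags; no principal is mentioned here.
catalog = {
    "loans_db.applications.amount": {"confidentiality": "public"},
    "loans_db.applications.email": {"confidentiality": "pii"},
}

# A new column only needs to be tagged; the grants stay unchanged.
catalog["loans_db.applications.postcode"] = {"confidentiality": "pii"}
```

Here `is_allowed("analyst", catalog["loans_db.applications.email"], grants)` is false while the same check for `"compliance"` is true, and neither grant had to change when the `postcode` column was added.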

Again the goal was self-service and decentralisation, therefore we decided to keep permissions for any dataset/IAM principal close to the project where they are created. Since tags defined across various repositories could collide, we decided to keep tag management centralised.

Terraform didn’t support tag based access control at that time, so we created a PR in the provider repository, which has been merged and is available from v4.20.0 onwards. We use Terraform for defining permission policies and registering S3 locations. We also developed a Drone (our CI/CD platform) plugin for associating tags with Glue catalog resources (database/table/column). There are some limitations with this approach though, as Glue resources need to be present before tags can be attached to them.
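A config-driven tagging step could be sketched roughly as follows. The config layout is invented for this example and is not the actual plugin format; the output dictionaries are shaped like the arguments of the boto3 Lake Formation `add_lf_tags_to_resource` call as we understand it.

```python
def tagging_requests(config):
    """Expand a list of tagging entries into one request per entry.
    Each entry names a database, optionally a table and columns, and the
    tags to attach (hypothetical config layout)."""
    requests = []
    for entry in config:
        database = entry["database"]
        table = entry.get("table")
        columns = entry.get("columns")
        if table is None:
            resource = {"Database": {"Name": database}}
        elif not columns:
            resource = {"Table": {"DatabaseName": database, "Name": table}}
        else:
            resource = {
                "TableWithColumns": {
                    "DatabaseName": database,
                    "Name": table,
                    "ColumnNames": list(columns),
                }
            }
        lf_tags = [
            {"TagKey": key, "TagValues": [value]}
            for key, value in sorted(entry["tags"].items())
        ]
        requests.append({"Resource": resource, "LFTags": lf_tags})
    return requests


# Hypothetical config: tag the table as public, then override PII columns.
config = [
    {"database": "loans_db", "table": "applications",
     "tags": {"confidentiality": "public"}},
    {"database": "loans_db", "table": "applications",
     "columns": ["email", "postcode"], "tags": {"confidentiality": "pii"}},
]
requests = tagging_requests(config)
```

Because every request names an existing database, table or column, a step like this can only run after the ETL has created those resources, which is the limitation mentioned above.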

Final Solution — Hybrid of TBAC/NRBAC

We ended up with a solution which is a mix of TBAC and NRBAC.

We use TBAC for read permissions, because we only have to define a few tags, associate them with resources and then grant permissions on them.
For write permissions we use NRBAC, since usually only one application role/user is required to write to a database/table. This helped in avoiding unnecessary data product-specific tag creation.
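The split can be sketched as two small request builders, one per model. The shapes follow the boto3 Lake Formation `grant_permissions` call as we understand it; the tag key, role ARNs and the exact permission lists are assumptions for illustration, not the permissions we actually grant.

```python
def read_grant_by_tag(principal_arn, tag_key, tag_values):
    """TBAC: grant read access on every table matching a tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": tag_key, "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def write_grant_by_name(principal_arn, database, table):
    """NRBAC: grant write access to the single role that owns the table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["INSERT", "DELETE", "ALTER"],
    }


# Many readers share one tag-based grant each; the one writer is named.
reader = read_grant_by_tag(
    "arn:aws:iam::111122223333:role/example-analyst",
    "confidentiality", ["public"],
)
writer = write_grant_by_name(
    "arn:aws:iam::111122223333:role/example-etl",
    "loans_db", "applications",
)
```

Using a named grant for the writer avoids inventing a tag whose only purpose would be to single out one table for one role.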

Automated Tests for Validating Permissions

We also defined a testing strategy that would partially validate the permissions of IAM principals after any change is made. We have some automated tests as a part of our CI/CD pipeline that validate the permissions for the non-application users/roles, for every change we deploy.
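One way to structure such a check is to compare an expected permission matrix with what is actually in effect. The sketch below shows only the comparison; how the actual permissions are collected (for example by listing LF permissions or running probe queries) is left out, and all principal and resource names are hypothetical.

```python
def find_violations(expected, actual):
    """Compare two mappings of principal -> set of (resource, permission).
    Returns a sorted list of (principal, kind, item) tuples, where kind is
    'unexpected' for access that should not exist and 'missing' for access
    that should exist but does not."""
    violations = []
    for principal in sorted(set(expected) | set(actual)):
        want = expected.get(principal, set())
        have = actual.get(principal, set())
        for item in sorted(have - want):
            violations.append((principal, "unexpected", item))
        for item in sorted(want - have):
            violations.append((principal, "missing", item))
    return violations


expected = {
    "analyst": {("loans_db.applications.amount", "SELECT")},
}
actual = {
    "analyst": {
        ("loans_db.applications.amount", "SELECT"),
        ("loans_db.applications.email", "SELECT"),  # PII leak
    },
}

violations = find_violations(expected, actual)
```

A CI step could fail the deployment whenever `violations` is non-empty, which catches both accidental over-granting and grants that were lost in a change.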

Limitations

There are a few cons of using Lake Formation for managing access to the data:

It takes quite a few steps to onboard a dataset and it is not very straightforward. From registering the location to granting permissions through IAM policies for individual users/roles, we don’t really get the experience of the RDBMS grant/revoke model.
There are no CloudWatch metrics that can be configured for LF, so we don’t have monitoring on LF changes.
It only augments IAM and won’t restrict S3 read access for locations registered in LF if there is an IAM policy allowing that. Users/roles will still be able to download data from S3 even if LF doesn’t allow it.

Key Decisions

Lake Formation provides a relational style grant/revoke permissions model at the cost of complexity. There are quite a few steps to onboard a dataset in Lake Formation. To make it self-service we made a few key design decisions that helped us achieve an acceptable solution in the end.

Use Terraform for granting and revoking the permissions on tags and catalogue resources (database/table/column) and use a config-driven plugin for tagging the resources. This enables users with no Terraform knowledge to use the solution for tagging their datasets, and granting permissions on tags is a one time operation that can be done by an engineer or with help from an engineer.
Create and manage tags in a centralised location to avoid any collisions. Keep the permissions and tag associations in the individual project repositories. Permissions for console/human roles are an exception, which we keep in a centralised location.
Use TBAC for read permissions and NRBAC for write permissions.
Keep separate state files for LF global settings and admin configurations.
Disable IAM-based global permissions after rolling out the tooling, to enforce using the LF permissions model for new datasets. This helped with faster adoption.
Tag resources at deployment time as opposed to run time. Since for most of our ETLs the table structure is defined as part of the process, it was challenging to tag them at deployment time. However we kept it simple, so for new datasets the end-user would have to run two deployments — one for the ETL and another for tagging the resources.

Conclusion

AWS LakeFormation is a complete suite for building a data lake or a lake house in AWS, and it has more features besides access control. The tag-based access control is quite impressive and eases a lot of the burden in handling access to data stored in S3. However, it comes with implementation complexities and requires additional tools to be built around it to make it more robust and self-service.

N.B: I want to give a shout out to Maiara Reinaldo for her review and valuable input, Daniel Messias for his contribution to the Terraform AWS provider for tagging support, and the whole Pink Scorpions team for their outstanding work.