Achieving least-privilege at FollowAnalytics with Repokid, Aardvark and ConsoleMe
We first heard about ConsoleMe at re:Invent 2020, when Netflix decided to open-source their internal tool along with its CLI utility, Weep. Their motivation was to reduce the time their Cloud Infrastructure Security team spent handling the many AWS IAM Policy requests it received. Their approach was to let users request the AWS IAM Policies they wanted themselves, leaving the Security team with the sole responsibility of approving or denying those requests. It also allowed them to grant only the privileges their users needed to perform their tasks.
We were very interested in the idea of granting least privileges to our users and reducing the time we spent handling AWS access requests. After a proof of concept and some tests, we decided to deploy ConsoleMe internally and start a culture change to make it our main tool for controlling users’ AWS permissions.
In our search to get even closer to the principle of least privilege, we found two other Netflix tools, open-sourced in 2017, called Repokid and Aardvark. These tools work together to remove users’ unused IAM Policies. How they work and how we deployed them at FollowAnalytics are detailed later in this article.
The problem of static AWS keys
Before we deployed ConsoleMe internally, our users accessed AWS resources using their own AWS IAM users and the associated AWS keys. Users made AWS access requests through Slack messages, which the SRE team then handled. The whole procedure was very time-consuming for both users and the SRE team. In addition, most users didn’t rotate their AWS keys very often, which exposed them and the company to potential security risks. It would be possible to implement a system that automatically rotates those keys as described here, but its architecture isn’t trivial, and it would require some maintenance. So we decided to go with a model that is secure by default.
While we were testing ConsoleMe, we also started to use the AWS SSO service to centralize access to all the services we use, including AWS resources and internal and external services. Since ConsoleMe supports SAML identity providers, we could use AWS SSO to authenticate users in ConsoleMe.
The integration of ConsoleMe and AWS SSO, in addition to the ability of ConsoleMe to assume AWS Roles, allowed us to enforce the usage of AWS Roles instead of AWS IAM Users. Now we authenticate using AWS SSO and let ConsoleMe take care of assuming the AWS Roles.
We created a first set of AWS Roles for each user, allowing them to access S3 buckets, RDS databases and Redshift. You can read more about how we use ConsoleMe to allow our users to generate ephemeral passwords to connect to our databases here.
Users can assume AWS Roles directly from the ConsoleMe home page or use the CLI tool, Weep, to generate temporary keys. A user can assume an AWS Role only if the Role has a tag named consoleme-authorized with the user’s email address as its value. In our case, we integrated AWS SSO with ConsoleMe, which also lets us use SSO groups in the Role tags to give access to a group of users. All AWS Roles managed by ConsoleMe also need to list the ConsoleMe Instance Profile in their Assume Role Policy so that ConsoleMe can assume them on behalf of the users.
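The tag-based rule described above can be sketched as a small check. This is only an illustration, not ConsoleMe’s actual code: the function name can_assume is hypothetical, and we assume here that the tag value holds a colon-delimited list of user emails or SSO group names.

```python
from typing import Dict, Iterable

AUTHORIZED_TAG = "consoleme-authorized"  # tag name taken from the article


def can_assume(role_tags: Dict[str, str], user_email: str,
               user_groups: Iterable[str] = ()) -> bool:
    """Return True if the user, or one of their SSO groups, is listed in the tag.

    Assumption: the tag value is a colon-delimited list of identities.
    """
    raw = role_tags.get(AUTHORIZED_TAG, "")
    # Normalize the entries so the comparison ignores case and stray spaces.
    allowed = {entry.strip().lower() for entry in raw.split(":") if entry.strip()}
    identities = {user_email.lower(), *(group.lower() for group in user_groups)}
    return bool(allowed & identities)
```

A Role without the tag is never assumable under this sketch, which matches the secure-by-default model we were aiming for.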
The users are free to create new IAM roles without policies attached and ask for additional permissions using the ConsoleMe interface. The SRE team is responsible for approving or denying those new permissions after receiving an email and a Slack notification informing about the new request.
How Repokid works
At FollowAnalytics we value automation for recurrent tasks, so we use Terraform scripts to create policies and roles. The problem we faced with this approach was that we sometimes created policies that users ended up not using, and those policies usually persisted for a long time.
Repokid helps us address this problem by automatically analyzing how the inline IAM Policies of IAM Roles are used and removing those that have been unused for more than 90 days.
Because Repokid analyzes only inline Policies, we could manage the lifecycle of the applications’ policies separately from that of the users’ Policies. We could keep using Terraform to manage the applications’ permissions and start using ConsoleMe to manage users’ permissions through inline Policies. The security gains were also remarkable: we could apply the principle of least privilege to our users’ permissions, which tend to change more often than the applications’ privileges, in an almost completely automatic way.
Repokid depends on an external service called Aardvark, a caching layer that stores information about IAM Policy usage, such as when the Policies were last used to access an AWS service. Aardvark stores its data in a PostgreSQL database and exposes an API that Repokid consumes.
Repokid is a CLI tool allowing users to run analysis over a list of Roles and identify their unused inline Policies. It uses a verb called “repo” for the act of taking back unused permissions (“repo” is shortened from repossess). With the CLI we can run a schedule command to analyze Roles and plan the deletion after a certain amount of time, which is configurable, and all Role’s data is stored in a DynamoDB table.
It’s also important to highlight that Repokid maintains a cache of previous policy versions in case a rollback is needed to restore a previous policy state.
Given this dependency, we decided to deploy Aardvark in our EKS cluster as a Kubernetes deployment and expose its API through a dedicated endpoint. Aardvark requires a PostgreSQL-compatible database, and since we run Repokid only once a day, we could use AWS Aurora Serverless to benefit from its elasticity and scale the instance down to zero when there are no requests. This reduced what we pay for database running time.
We chose Jenkins cronjobs to run the Repokid CLI because it was simple to deploy and maintain. Also, the history of the cronjobs executions in Jenkins is very helpful to monitor and keep track of the permissions changes.
We created two cronjobs running every day. The first one was responsible for running the analysis and scheduling the eligible Roles to be repoed in a week, and the second one was responsible for repoing the Roles with expired scheduled time.
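The decision logic behind these two jobs can be sketched as follows. This is a simplified model of the behaviour described in this article, not Repokid’s implementation: the function names are hypothetical, and we assume the schedule is stored as an epoch timestamp where zero means "nothing scheduled".

```python
from datetime import datetime, timedelta, timezone

UNUSED_THRESHOLD = timedelta(days=90)  # unused period mentioned in the article
GRACE_PERIOD = timedelta(days=7)       # delay between scheduling and repoing


def schedule_if_unused(last_used: datetime, repo_scheduled: int, now: datetime) -> int:
    """First job: return the new schedule as epoch seconds (0 = not scheduled)."""
    if now - last_used <= UNUSED_THRESHOLD:
        return 0  # used recently: nothing to schedule, any pending date is cleared
    if repo_scheduled:
        return repo_scheduled  # already scheduled: keep the existing date
    return int((now + GRACE_PERIOD).timestamp())


def is_due(repo_scheduled: int, now: datetime) -> bool:
    """Second job: a Role is repoed once its scheduled time has expired."""
    return repo_scheduled != 0 and repo_scheduled <= int(now.timestamp())
```

Keeping an existing date in the first job means that running the analysis every day does not keep pushing the deadline back.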
Repokid has an option in its configuration to apply filters using patterns to exclude or include Roles. We wanted to be sure that only Roles managed by ConsoleMe would be analyzed by Repokid, so before running the CLI we list, using the AWS CLI, the Roles with the ConsoleMe Assume Role Policy, and inject the result list in the Repokid configuration file. The CLI command we use with its query to filter Roles is the one below:
aws iam list-roles --region <AWS Region> --query 'Roles[?AssumeRolePolicyDocument.Statement[?Principal.AWS==`arn:aws:iam::<AWS Account Number>:role/ConsoleMeInstanceProfile`]].RoleName'
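For readers who prefer to do this filtering in code, here is a hedged Python sketch of the same idea. It works on role dictionaries shaped like the output of list-roles; the function names are hypothetical. Unlike the query above, it also accepts trust policies where Principal.AWS is a list rather than a single string, and it only considers Allow statements.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM policy fields may hold a single value or a list; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def roles_trusting(roles: List[Dict[str, Any]], principal_arn: str) -> List[str]:
    """Return the names of the roles whose trust policy allows principal_arn."""
    names = []
    for role in roles:
        document = role.get("AssumeRolePolicyDocument", {})
        for statement in _as_list(document.get("Statement")):
            principal = statement.get("Principal", {})
            # Principal can also be the plain string "*", which has no "AWS" key.
            aws = principal.get("AWS") if isinstance(principal, dict) else None
            if statement.get("Effect") == "Allow" and principal_arn in _as_list(aws):
                names.append(role["RoleName"])
                break  # one matching statement is enough for this role
    return names
```

The resulting list of names is what would be injected into the Repokid configuration file.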
There is no notification feature in Repokid itself, meaning that running only the cronjobs we designed would result in policy deletions without warning users and the SRE team. To solve this problem, we decided to add two more components in our architecture responsible for notifying us about upcoming policy deletions.
The first component is an AWS Lambda function, triggered once a week, that notifies the SRE team and the Role owners about Roles scheduled to be repoed. The first part of the script builds a list of all Roles scheduled for the week, with their names, ARNs, and scheduled dates, and sends it to the SRE team mailbox and a dedicated Slack channel. The second part goes through each Role in that list and sends an email to the Role owner informing them about the scheduled policy deletion.
This feature allows us to notify users by email 7 days before the policy deletion and gives them time to reevaluate their access needs. If they don’t want to lose the access granted by the scheduled policy, they can simply use it once to avoid it being deleted, but we advise them to let it be deleted if they don’t need it anymore.
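The first part of that weekly script, building the list for the SRE team, could look roughly like the sketch below. The item shape (RoleName, Arn, and RepoScheduled as an epoch timestamp) and the function name are assumptions made for illustration; sending the email and the Slack message is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def weekly_digest(roles: List[Dict[str, Any]], now: datetime) -> List[str]:
    """Return one line per Role scheduled to be repoed within the next 7 days."""
    start = int(now.timestamp())
    end = int((now + timedelta(days=7)).timestamp())
    # A RepoScheduled of 0 (or a missing value) means nothing is scheduled.
    upcoming = [r for r in roles if start <= r.get("RepoScheduled", 0) <= end]
    upcoming.sort(key=lambda r: r["RepoScheduled"])
    lines = []
    for role in upcoming:
        when = datetime.fromtimestamp(role["RepoScheduled"], tz=timezone.utc)
        lines.append(f"{role['RoleName']} ({role['Arn']}) scheduled for {when:%Y-%m-%d}")
    return lines
```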
The second component is another AWS Lambda function, responsible for informing users after their Roles have been repoed and their policies deleted. It’s triggered by changes in the Repokid DynamoDB table; more precisely, it relies on changes to the attribute that stores the scheduled date as a timestamp. This attribute, called RepoScheduled, stores the date when the Role is scheduled to be repoed, or zero when no repo is scheduled. The attribute is reset to zero when the Role is repoed. This change triggers the Lambda function, which sends an email to users informing them about their policy deletion. We use the Role tag consoleme-authorized mentioned above to retrieve the user’s email address.
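As a rough sketch of that detection step, the function below scans a DynamoDB Streams event and keeps the Roles whose RepoScheduled attribute went from a non-zero value to zero. It assumes the stream is configured to send both old and new images, and that the table item exposes the Role name in an attribute called RoleName; the function names are hypothetical, and this is not our production code.

```python
from typing import Any, Dict, List


def _scheduled(image: Dict[str, Any]) -> int:
    """Read RepoScheduled from a DynamoDB-typed image; missing counts as 0."""
    return int(float(image.get("RepoScheduled", {}).get("N", "0")))


def repoed_roles(event: Dict[str, Any]) -> List[str]:
    """Return the names of Roles whose schedule was reset to zero in this event."""
    names = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # inserts and removals are not repo events
        data = record.get("dynamodb", {})
        old, new = data.get("OldImage", {}), data.get("NewImage", {})
        if _scheduled(old) > 0 and _scheduled(new) == 0:
            # "RoleName" is an assumed attribute name for the Role identifier.
            names.append(new.get("RoleName", {}).get("S", "<unknown>"))
    return names
```

One caveat with this approach: if a schedule is ever reset to zero for another reason, such as being cancelled, the same transition would appear in the stream, so a stricter handler may want to check an additional attribute before sending the email.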
We use the managed service AWS Proton to deploy both Lambda functions and infrastructure resources automatically. It also provides a continuous integration pipeline with AWS CodeBuild offering a quick build and deployment of new versions of our code.
We simplified the management of this architecture by splitting it into three separate projects with their own GitHub repositories, one for Aardvark, one for Repokid, and the last one for the notifiers.
The Aardvark project stores the helm template and the necessary scripts to perform the deployment in our EKS. The Repokid project has the Jenkins jobs definitions in DSL and the Groovy scripts to be executed.
The last project is dedicated to the AWS Lambda functions definitions and the source code. As mentioned above, we use the service AWS Proton to create the infrastructure and the two Lambda functions. AWS Proton uses CloudFormation templates to create the infrastructure, the Lambda functions, the Continuous Integration (CI) pipeline in their CI service called AWS CodeBuild, and a Continuous Deployment (CD) pipeline.
AWS offers many application samples here to be used with their services. We used this sample project to create our AWS Proton templates, with just a few modifications. The AWS Lambda function definitions were created from an application generated with the AWS Serverless Application Model (SAM) CLI, with very few changes and the addition of a second AWS Lambda function.
All three projects follow the GitOps flow, meaning that every change in our git repositories automatically triggers a new deployment. We spent two days building the CI and CD pipelines, and once they were in place we considerably reduced the time needed to deploy new versions of our components.
Today we only inform users of the name of the repoed Role, not the deleted policies attached to it. Our users receive an email like the one below warning them that a Role is being repoed:
The SRE team also receives a weekly email with the list of Roles being repoed:
In the new version of our notification system, we’re planning to add more information about which policies are being deleted, because some of our users have Roles with multiple policies. So, from their perspective, receiving a more detailed notification could help them to quickly spot the policies they don’t use anymore.
Using ConsoleMe internally has given us better control over user access to different AWS resources. In addition, it considerably reduced the time the SRE team spends handling access requests. Before, depending on the access requested, a request could take us a few hours of work; now it’s as easy as pressing a button and usually takes minutes.
From the user’s perspective, ConsoleMe was added to their routine as an intermediate tool for connecting to AWS. They now connect to the AWS Console through the ConsoleMe interface, or they use the AWS CLI with the temporary keys generated by Weep.
The use of AWS Roles via ConsoleMe to access AWS resources improved our security and reduced the risk of AWS keys being used inappropriately or by someone who shouldn’t have access to our resources. It certainly is less practical than using static AWS keys, but it adds a great improvement in our internal security. Another advantage is the fact that ConsoleMe is deployed behind our VPN, blocking all public access.
Combining ConsoleMe and Repokid helped us to get even closer to the principle of least privilege, because we easily control which resources the users have access to during the ConsoleMe access request review and automatically delete all unused access with Repokid.
It took approximately two weeks to design, implement, and deploy the solution, with one person working exclusively on the project. We were also able to contribute to the Repokid open-source project. We’re constantly reviewing our procedures and solutions; so far we have received very positive feedback, and the improvement in security is remarkable. We would like to thank the Netflix teams for open-sourcing the ConsoleMe, Repokid, and Aardvark projects.