ChatOps for Production Access Control
Access control is a key component of data security. In simple terms, access control means regulating who has the ability to access resources in a computing environment. At Policygenius, we implemented an access control policy around our Google Cloud resources following the principle of least privilege.
The principle of least privilege promotes minimal user privileges on computing resources, based on users’ job necessities. Ideally, each user should have the least authority necessary to perform their duties. This helps reduce the “attack surface” of the computing resources by eliminating unnecessary privileges that can result in network exploits and system compromises.
Weighing our options
The main requirement was an approval workflow in which an engineer could access Google Cloud Platform (GCP) resources only after management approval and only for a limited amount of time. Additionally, we wanted to log all related activity and store the logs for auditability.
One potential solution was an emergency access account, also known as a break-glass account. We looked into using HashiCorp Vault’s open-source offering to safeguard the shared account’s password. However, this solution does not offer an approval workflow, one of our key requirements. A shared account would also make it challenging to trace actions back to an individual user.
Another suitable option was Gimme, an open-source access control solution developed by Spotify. Google had recently released a feature called IAM Conditions (beta at the time), which makes it possible to provision access that automatically expires after a specific amount of time. Gimme uses this feature to limit access to Google Cloud resources to a specific duration. It has an approval workflow, but approving a request takes considerably more effort from the manager than clicking an approve/deny button. The code also needed significant customization to accommodate all our requirements, and the GitHub repository was later archived.
We also looked into several other full-blown access control products. Most of them were costly and did not integrate easily with GCP’s access management. Although some of these tools were really great (we almost ended up using Gimme), we decided to write our own tool, primarily so that we could customize it to meet our exact requirements.
Building on Geniusbot, a ChatOps project built during our hackathon, we decided to approach the problem from a different perspective. Since managers wanted to be able to approve requests quickly, even after hours, we decided the most user-friendly solution would be a Slack bot. ChatOps is conversation-driven development: developers type a command into a chat room, and a chatbot configured with custom scripts executes it.
Our ChatOps tool consists of three main components:
- Slack user interface for users and approvers.
- Google Cloud Functions written in Python for the backend.
- GCP’s IAM Conditions to handle access provisioning.
Like Gimme, we use IAM Conditions to automatically expire provisioned access after a set amount of time. This ensures that an engineer has only the privileges required to perform their duties at any given point in time. This is just one of the IAM Conditions features we use. It has other capabilities too, such as access schedules and resource-based access provisioning.
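To make the expiring-access idea concrete, here is a minimal sketch of a role binding with a time-based condition. IAM Conditions are written in Common Expression Language, and an expiry is expressed by comparing `request.time` with a timestamp. The helper name `build_expiring_binding`, the condition title, and the example member and role are illustrative assumptions, not values from our production tool.

```python
from datetime import datetime, timedelta, timezone


def build_expiring_binding(member, role, minutes, now=None):
    """Return an IAM binding dict whose condition expires after `minutes`.

    `now` is injectable so the function is easy to test; it must be a
    timezone-aware datetime.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    expiry = (now + timedelta(minutes=minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "role": role,
        "members": [member],
        "condition": {
            # Title and description are free-form labels (hypothetical values).
            "title": f"expires_after_{minutes}_min",
            "description": "Temporary access granted through the approval workflow",
            # The binding only applies while the request time is before the expiry.
            "expression": f'request.time < timestamp("{expiry}")',
        },
    }


binding = build_expiring_binding(
    "user:engineer@example.com",
    "roles/viewer",
    60,
    now=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(binding["condition"]["expression"])
```

Once the expiry time passes, the condition evaluates to false and the binding no longer grants the role, so nothing has to run later to revoke the access.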
A user initiates a request in the Slack channel by typing a simple command. A message with the request details is posted to the channel, with the on-call manager tagged. The manager can either approve or deny the request.
If the request is denied, the workflow stops and the user is notified in the Slack channel. If it is approved, a browser window opens prompting the approver to authenticate with their Google credentials. On successful authentication, the process is complete and the user is notified in the Slack channel.
By default, access is valid for 60 minutes, but a user can request more or less time by passing a parameter in their Slack command.
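The handling of the optional duration could look roughly like the sketch below. It assumes the parameter is a whole number of minutes passed as the first word of the command text; the exact command syntax is not described here, and the upper bound `MAX_MINUTES` is a hypothetical safeguard rather than a documented limit.

```python
DEFAULT_MINUTES = 60  # default duration stated above
MAX_MINUTES = 480     # hypothetical upper bound, chosen for illustration


def parse_duration(command_text):
    """Return the requested duration in minutes, or the default if none is given."""
    text = (command_text or "").strip()
    if not text:
        return DEFAULT_MINUTES
    try:
        minutes = int(text.split()[0])
    except ValueError:
        raise ValueError(
            f"Duration must be a whole number of minutes, got {text!r}"
        ) from None
    if not 1 <= minutes <= MAX_MINUTES:
        raise ValueError(f"Duration must be between 1 and {MAX_MINUTES} minutes")
    return minutes
```

Rejecting out-of-range values at this point keeps an invalid request from ever reaching the approver.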
To better understand the architecture, we will split it into smaller parts.
Slack has a feature called “slash commands,” which trigger HTTP endpoints. In our application, the slash command calls the endpoint of the Request Handler cloud function, which handles all incoming requests. It generates a Slack response message that includes details about the request as well as “Approve” and “Deny” buttons. The function also queries PagerDuty to identify the on-call manager and tags that manager in the message.
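As an illustration of the response message, the sketch below builds a Slack Block Kit payload with the two buttons. The function name, the `action_id` values, and the `approve_url` parameter are assumptions made for this example; the PagerDuty lookup is left out, and the on-call manager's Slack user ID is simply passed in.

```python
import json


def build_approval_message(requester_id, manager_id, role, minutes, approve_url):
    """Build a Slack message that tags the manager and offers Approve/Deny buttons."""
    summary = (
        f"<@{requester_id}> is requesting `{role}` for {minutes} minutes. "
        f"<@{manager_id}>, please review."
    )
    return {
        "response_type": "in_channel",  # visible to the whole channel
        "text": summary,                # fallback text for notifications
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "action_id": "approve",  # hypothetical identifier
                        "style": "primary",
                        "text": {"type": "plain_text", "text": "Approve"},
                        # Clicking opens the authentication page in a browser.
                        "url": approve_url,
                    },
                    {
                        "type": "button",
                        "action_id": "deny",  # hypothetical identifier
                        "style": "danger",
                        "text": {"type": "plain_text", "text": "Deny"},
                    },
                ],
            },
        ],
    }


message = build_approval_message(
    "U111", "U222", "roles/viewer", 60, "https://example.com/approve?token=opaque"
)
print(json.dumps(message, indent=2))
```

The `<@USER_ID>` syntax is what makes Slack notify the tagged manager.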
The approving manager then reviews the request and either approves or denies it. This triggers the Acknowledger function, which acknowledges the action via a Slack message. The workflow stops if the request is denied. If it is approved, an authentication page opens in a web browser. Once authentication succeeds, the tool checks whether the approver is authorized to approve the request based on their IAM privileges. This step covers both authentication and authorization of the approver. Google Key Management Service is used to decrypt the URL parameters embedded in the Slack buttons.
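A rough sketch of the acknowledgement step follows. Slack delivers button clicks as a form-encoded request body whose `payload` field contains JSON. The function names and the `approve`/`deny` action identifiers are assumptions for illustration, and the KMS decryption and Google authentication steps are deliberately omitted.

```python
import json
from urllib.parse import parse_qs, urlencode


def parse_button_click(raw_body):
    """Extract who clicked which button from a Slack interaction request body."""
    form = parse_qs(raw_body)
    payload = json.loads(form["payload"][0])
    action = payload["actions"][0]
    return {
        "approver_id": payload["user"]["id"],
        "decision": action["action_id"],  # "approve" or "deny" in this sketch
    }


def acknowledgement_text(click):
    """Return the message posted back to the channel for a button click."""
    if click["decision"] == "approve":
        return f"<@{click['approver_id']}> approved the request; authentication pending."
    return f"<@{click['approver_id']}> denied the request."


# Example body shaped like a Slack block_actions interaction (trimmed down).
sample_body = urlencode(
    {
        "payload": json.dumps(
            {
                "type": "block_actions",
                "user": {"id": "U222"},
                "actions": [{"action_id": "deny"}],
            }
        )
    }
)
click = parse_button_click(sample_body)
print(acknowledgement_text(click))
```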
Once both authentication and authorization pass, the Provisioner function creates the IAM condition based on the request. The approver’s authentication token is used to push the change to the IAM policy of the Google Cloud project.
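Updating a project's IAM policy is a read-modify-write cycle: fetch the policy, add the binding, and write it back. The sketch below covers only the "modify" step on a plain dictionary, so it runs without any Google client library. The function name is hypothetical, and the policy and binding values are made-up examples.

```python
import copy


def add_conditional_binding(policy, binding):
    """Return a copy of `policy` with `binding` appended.

    The original policy dict is left untouched. The `etag` field is carried
    over so the write can be rejected if the policy changed in the meantime.
    """
    updated = copy.deepcopy(policy)
    updated.setdefault("bindings", []).append(binding)
    # Policies that contain conditional bindings must use policy version 3.
    updated["version"] = 3
    return updated


current_policy = {
    "version": 1,
    "etag": "BwWKmjvelug=",
    "bindings": [{"role": "roles/owner", "members": ["user:owner@example.com"]}],
}
temporary_binding = {
    "role": "roles/viewer",
    "members": ["user:engineer@example.com"],
    "condition": {
        "title": "expires_after_60_min",
        "expression": 'request.time < timestamp("2020-01-01T13:00:00Z")',
    },
}
new_policy = add_conditional_binding(current_policy, temporary_binding)
print(new_policy["version"], len(new_policy["bindings"]))
```

When reading the policy, version 3 should also be requested; otherwise conditions on existing bindings may be omitted from the response and lost on the write.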
For auditability, we log details about processed requests, requesters, and approvers. The logs are also a rich source of information when debugging the application. Cloud Function logging is an out-of-the-box feature in GCP, and our Python cloud functions log all of their activity to it. We have also set up a log sink that continuously copies these logs to a storage bucket for long-term storage.
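One way to produce audit-friendly entries from a Python cloud function is to write one JSON object per line to standard output, which Cloud Logging can ingest as a structured entry with `severity` and `message` fields. The sketch below assumes that approach; the field names other than `severity` and `message` are our own illustrative choices, not a documented schema.

```python
import json
import sys
from datetime import datetime, timezone


def audit_record(event, requester, approver, role, minutes):
    """Build a structured audit entry for one step of the approval workflow."""
    return {
        "severity": "INFO",
        "message": f"access request {event}",
        # The fields below are hypothetical names chosen for this example.
        "event": event,
        "requester": requester,
        "approver": approver,
        "role": role,
        "duration_minutes": minutes,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def log_audit(record, stream=None):
    """Write the record as a single JSON line (stdout by default)."""
    (stream or sys.stdout).write(json.dumps(record) + "\n")


log_audit(
    audit_record("approved", "engineer@example.com", "manager@example.com", "roles/viewer", 60)
)
```

Keeping requester, approver, and role as separate fields makes it straightforward to filter the exported logs later when answering audit questions.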
Below we have a complete architecture diagram after combining all the parts.
After implementing this tool, we are now able to enforce the principle of least privilege. When required, access can be provisioned for a limited duration after going through a management approval workflow. Importantly, we can now audit who had access to production resources at any given time.
Since launching this ChatOps command, we have extended the tool with other distinct Slack commands. Each command identifies the IAM role the user is requesting. We are working on expanding the tool to other resources and IAM roles within our Google Cloud setup.
At Policygenius Engineering, we work on solving challenging problems with innovative approaches. If you are interested in the type of work we do and wish to experience our highly collaborative culture, check out our careers page!