Automating Response to Security Events on Google Cloud Platform

Amber Shafi
Published in GSK Tech
Sep 15, 2020 · 4 min read

We frequently see data breaches dominating news headlines, but did you know that nearly 95 percent of cloud security failures are due to misconfigurations by the cloud customers themselves? There is a big price to pay as well: cloud misconfiguration breaches are estimated to have cost companies worldwide upwards of $5 trillion over the past two years alone.

Public cloud providers, like Google Cloud Platform and Microsoft Azure, take their security and compliance responsibilities very seriously, but it would be naive to believe that choosing a strong cloud provider alone is enough. Due to the customisable nature of cloud services, public cloud providers operate a shared responsibility model: at the infrastructure as a service (IaaS) layer, only the hardware, storage, and network are the provider's responsibility, while at the software as a service (SaaS) layer, almost everything except the content and its access is the provider's responsibility.

With a large degree of control still left with the customer, a typical organisation can encounter hundreds of security incidents per day caused by inadvertent infrastructure changes. For instance, a developer may accidentally make a storage bucket accessible to the public or to users outside the company's domains, exposing private user data. This makes it critical to detect and respond to such events quickly without increasing toil for DevOps and security teams. Multiple sources capture events in our Google Cloud Platform environment; in this blog we outline a solution that remediates access misconfigurations by detecting and automatically responding to specific Cloud Logging events in real time.

Automating response to Cloud Logging events

Automated security response implementation overview

Logging is a critical component of our cloud platform environment as it provides valuable insight into the performance of our systems and applications by allowing us to store, search, analyse, monitor, and alert on logging data and events. As such, in the case of an inadvertent access grant to unwanted users, access and permissions audit logs are a vital source of information for a security response. Changes to access captured in these logs trigger our code, which removes any unwanted users and informs us of the change immediately.

Our team has many guardrails in place to avoid such events in the first place: resource behaviours are controlled using org policy constraints; IAM permissions are restricted so that users have only enough access to perform their necessary work; all but our sandbox environments are fully automated; and any code changes require reviewer approval. This work ensures that, in the rare occurrence that an access misconfiguration slips past our guardrails, we have a way to automatically detect and remediate it and notify the relevant teams right away.

This solution is currently only in place for our team’s GCP environment in GSK although we are in discussions on how best to approach this for the entire organisation.

Implementation

In Google Cloud Platform, access can be granted at different hierarchical levels: organisation, folder, project or resource. We need to make sure that our security response mechanisms remediate access misconfigurations at all of these levels. To avoid having to write a solution from scratch for each type of security event, we configure a reusable module which creates a log sink, a Pub/Sub topic and an event-based Cloud Function. At the resource level, we deploy this module in our GCP organisation for BigQuery and Google Cloud Storage (GCS), as these are the main services we use for our data storage. We specify the log filter, naming and details about the function using the module as below (if you want to learn more about modules, Terraform publishes a great guide):
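To give a flavour of what such a module invocation might look like, here is a minimal Terraform sketch. The module path, variable names and log filter are illustrative only, not our actual configuration:

```hcl
# Hypothetical invocation of the reusable security-response module.
# It wires together a log sink, a Pub/Sub topic and a Cloud Function.
module "gcs_iam_response" {
  source = "./modules/security-event-response"

  project_id    = var.security_project_id
  function_name = "gcs-iam-remediator"

  # Capture audit log entries for changes to GCS bucket IAM policies.
  log_filter = <<-EOT
    resource.type="gcs_bucket"
    protoPayload.methodName="storage.setIamPermissions"
  EOT
}
```

A second invocation of the same module with a BigQuery-specific filter covers dataset access changes, which is what makes the module approach reusable.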

We place this solution in a highly restricted GCP project which is only used for implementing security foundation elements.

With the basic infrastructure set up, the next step is to write the function logic to respond to the event. In the case of BigQuery and GCS, this means ensuring that the function inspects the logs for any changes to access and removes any new public users or users outside of our whitelisted domains. In our implementation we used Go to write this function, but Cloud Functions also supports Node.js, Python and Java.
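As a rough illustration of the filtering step, here is a self-contained Go sketch. The domain whitelist and helper names are hypothetical; a real function would additionally parse the audit log entry delivered via Pub/Sub and call the GCS or BigQuery APIs to rewrite the policy:

```go
package main

import (
	"fmt"
	"strings"
)

// allowedDomains is a hypothetical whitelist of company domains.
var allowedDomains = []string{"gsk.com"}

// isUnwanted reports whether an IAM member should be removed:
// any public principal, or any principal outside whitelisted domains.
func isUnwanted(member string) bool {
	if member == "allUsers" || member == "allAuthenticatedUsers" {
		return true
	}
	// Members look like "user:alice@example.com" or "group:team@gsk.com".
	parts := strings.SplitN(member, ":", 2)
	if len(parts) != 2 {
		return false
	}
	for _, d := range allowedDomains {
		if strings.HasSuffix(parts[1], "@"+d) {
			return false
		}
	}
	return true
}

// unwantedMembers filters the members added by a policy change down
// to those that the function should revoke.
func unwantedMembers(members []string) []string {
	var out []string
	for _, m := range members {
		if isUnwanted(m) {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	added := []string{"allUsers", "user:alice@gsk.com", "user:bob@example.com"}
	fmt.Println(unwantedMembers(added)) // prints [allUsers user:bob@example.com]
}
```

The same predicate works for both services, since GCS and BigQuery IAM bindings use the same member syntax.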

If you prefer a different language, you could also explore Cloud Run, which supports any language.

Once deployed, our function takes just a few seconds to remediate any inadvertent or malicious access grant. The remediation actions are captured in logs and also published as a message to Pub/Sub. This enables us to alert the relevant team members with details of the event, whether it was performed by an authorised user, and the remediation action that took place.

Outcome and next steps

A misconfigured bucket or dataset can expose sensitive data and cause significant damage to an organisation. This work allows us to monitor and instantaneously deal with any occurrence where private user data in GCS or BigQuery becomes inadvertently accessible to the public or to users outside our domains. The implementation also autoscales with the volume of log events without any additional work or interaction from the DevOps and security teams, all the while keeping them notified of the remediations taking place.

There is a wide range of situations where you can implement an event-driven system to process and respond to events. We are also using our module to deal with unwanted IAM permission changes at the project, folder and organisation levels in the same way. We will next build upon this work with further use cases, looking at Event Threat Detection as a source as well as integrity validation and anomaly detection.

Please reach out if you have any questions or feedback!
