Big red button for incident response: How to approach security incidents in Cloud

Julio Diez
Google Cloud - Community
7 min read · Aug 7, 2020

I was faced with a difficult question: in the case of a security incident, one customer desired a “big red button” to completely shut down Google Cloud Platform (GCP). But how to do that? Is there a right way to do it? Let’s see.

The problem

Imagine an attacker could hack our Cloud systems and exfiltrate sensitive data, or abuse the system to launch an attack. How do we keep control? Can we just shut down GCP? Would it stop the attack and contain the damage?

Organizations in regulated industries often feel obligated to demonstrate this level of control so that regulators believe the organization has done its due diligence. This is especially true if GCP communicates with internal systems and could become a channel back to the core on-premises network.

Attacks may be carried out via a number of external-facing systems such as web servers or Content Management Systems (CMS). Other scenarios include leaked or stolen SSH keys, or IAM credential files left in a public repository. These attacks assume external actors. However, the risk grows when we add internal attackers to the mix.

Security in the Cloud focuses on a shared responsibility model that proves essential for businesses to succeed with cloud technology. Typically, the Cloud Service Provider is responsible for securing the infrastructure, and the customer is responsible for securing the data and applications. Google Cloud helps you protect your systems and data with products and services. But no matter how many defenses you put in place, you should be prepared for bad things to happen.

If an unfortunate event occurs, the customer wants to keep control and considers shutting down GCP a good mechanism to do so. Wouldn’t this stop the attack and contain the damage?

Shutting down GCP is more difficult than it may seem. First, we should define what “shutting down” really means. Different customers may have different interpretations, but it can mean blocking access to resources, isolating systems, or destroying resources. Let’s take a deeper look at the problem and possible solutions.

The technical side

We can think about this problem in a traditional way. One of the typical protections that on-premises infrastructure relies on is the network. Much like a user might react to an attack by pulling the Ethernet cable from a computer, some companies immediately restrict access to critical systems and the internet when they detect that their systems are compromised. In GCP, we might want to mark critical firewall rules with a tag such as “ripcord”; then, in an emergency, we can quickly identify and delete those rules in one pass.
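As a minimal sketch of the “ripcord” idea, the Python below models firewall rules as plain records and selects the ones carrying the marker, so the list can be reviewed before anything is deleted. It does not call any Google Cloud API. The `FirewallRule` type, the `select_ripcord_rules` helper, the rule names, and the convention of placing the marker in a rule's description are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    """Hypothetical, simplified view of a firewall rule (not a real API type)."""
    name: str
    description: str
    direction: str  # "INGRESS" or "EGRESS"


def select_ripcord_rules(rules, marker="ripcord"):
    """Return the rules whose description contains the marker (case-insensitive)."""
    needle = marker.lower()
    return [rule for rule in rules if needle in rule.description.lower()]


# Illustrative inventory; in practice this would come from your own tooling.
inventory = [
    FirewallRule("allow-egress-internet", "ripcord: outbound internet access", "EGRESS"),
    FirewallRule("allow-onprem-vpn", "RIPCORD: channel to on-premises network", "INGRESS"),
    FirewallRule("allow-health-checks", "load balancer health checks", "INGRESS"),
]

to_delete = select_ripcord_rules(inventory)
for rule in to_delete:
    # Dry run: print the plan instead of deleting anything.
    print(f"would delete {rule.name} ({rule.direction})")
```

Keeping the selection step separate from the deletion step makes it possible to rehearse the procedure without touching production rules.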

While this digital version of “pulling the cord” may block traffic from exiting your systems, it may not be clear how it would be implemented or whether it would help in all situations. Attackers may use public storage buckets or other managed services to dump sensitive information, so a security analysis that considers only firewall rules is incomplete.

Luckily, in Cloud, you can take advantage of how quickly and easily access can be restricted through Cloud IAM. For example, if you are using customer-managed encryption keys (CMEK), which let you manage your own encryption keys, you have an effective way to block access to your data: revoke the IAM permissions on the CMEK keys or key rings. You can apply the same concept to other critical roles. However, as in the firewall rules example above, it may not be clear exactly how to leverage IAM to block an attack. If you suffer a ransomware attack, whose goal is to prevent data access through encryption, revoking IAM access to CMEK won’t help you. Additionally, if the attackers are using your platform to launch DDoS or other targeted attacks against other systems, IAM won’t help either.
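To make the IAM approach concrete, here is a small sketch that removes one member from a role binding in a policy document shaped like the JSON that Cloud IAM uses (`bindings`, each with a `role` and a list of `members`). It only edits a local dictionary and does not talk to GCP. The `revoke_member` helper, the example members, and the project name are hypothetical, and conditional bindings and the policy `etag` are ignored for brevity.

```python
import copy

# Predefined Cloud KMS role that allows encrypting and decrypting with a key.
KMS_ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"


def revoke_member(policy, role, member):
    """Return a copy of `policy` with `member` removed from bindings for `role`.

    Bindings left with no members are dropped. The input is not modified.
    """
    updated = copy.deepcopy(policy)
    kept = []
    for binding in updated.get("bindings", []):
        if binding["role"] == role:
            binding["members"] = [m for m in binding["members"] if m != member]
        if binding["members"]:
            kept.append(binding)
    updated["bindings"] = kept
    return updated


# Illustrative policy for a key ring; the members are made up.
app_account = "serviceAccount:app@example-project.iam.gserviceaccount.com"
policy = {
    "bindings": [
        {"role": KMS_ROLE, "members": [app_account, "user:analyst@example.com"]},
        {"role": "roles/cloudkms.admin", "members": ["user:security@example.com"]},
    ]
}

locked = revoke_member(policy, KMS_ROLE, app_account)
print(locked["bindings"])
```

Returning a new policy instead of mutating the original keeps a record of the previous state, which helps when access has to be restored after the incident.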

Google Cloud offers VPC Service Controls (VPC SC), which lets you define security perimeters to mitigate data exfiltration risks due to misconfigured access controls, malicious users copying data to unauthorized cloud resources, and attackers attempting to access sensitive data in GCP resources from the internet. VPC SC is a powerful mechanism to enhance your security posture, and I encourage you to explore how it can be used in your environment. A word of caution: VPC SC is not a trivial solution and requires adoption at the architecture phase. If we are already under attack, applying VPC SC after the fact is not realistic.

Besides VPC SC, there are additional, more drastic mechanisms you can put into practice. In a cloud-native environment where Infrastructure as Code is used to manage your deployments, you could consider automating everything to remove all resources with the press of a button, allowing redeployment later. Be aware that this approach covers only infrastructure: redeployment restores the resources but not the live data, which would be lost.

You may have already realized there is no single solution: every situation requires a different approach, or a mix of approaches, tailored to the specific circumstances. One solution may invoke a mechanism which not only fails to address the problem but also hides or destroys evidence of what really happened, leaving you exposed. Worse, some solutions may impact future business. Blocking client access to services, or losing control of or access to your systems, may damage your business reputation more than the attack itself.
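The limitations described in this section can be collected into a small lookup, sketched below. The mechanism and scenario names are labels chosen for this example, and the entries only restate caveats already mentioned in the text. It is not a complete decision tool, and an empty result does not mean a mechanism is safe to use.

```python
# Applies regardless of the attack scenario.
ANY_SCENARIO = "any"

# Caveats per mechanism, restating only what the text above describes.
LIMITATIONS = {
    "delete_ripcord_firewall_rules": {
        "exfiltration_via_managed_services":
            "data can leave through public buckets or other managed services",
    },
    "revoke_cmek_iam_access": {
        "ransomware": "the attacker already aims to prevent data access",
        "outbound_ddos": "IAM does not stop attacks launched from your platform",
    },
    "apply_vpc_service_controls": {
        ANY_SCENARIO: "requires adoption at the architecture phase",
    },
    "destroy_all_resources": {
        ANY_SCENARIO: "live data is lost and evidence may be destroyed",
    },
}


def known_caveats(mechanism, scenario):
    """List the caveats recorded for using `mechanism` in `scenario`."""
    entries = LIMITATIONS.get(mechanism, {})
    return [
        reason
        for key, reason in entries.items()
        if key in (scenario, ANY_SCENARIO)
    ]


for mechanism in LIMITATIONS:
    print(mechanism, "->", known_caveats(mechanism, "ransomware"))
```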

Remember, you should ensure the confidentiality, integrity, and availability of the data when creating a security program. The limited mechanisms discussed so far may help protect your systems, but there are still many loose ends to address:

  • What type of incidents or attacks should you plan to address? You can’t protect against something you don’t know.
  • How do you identify and qualify incidents?
  • What are the processes to decide if a solution should be applied?
  • How will you protect your systems and data during the incident?
  • How will you recover?
  • Will you meet regulatory and compliance requirements?

If the sole tool you have to manage an incident is a big red button, you will eventually press it. That makes your button the perfect tool for an attacker to weaponize, and the fastest way to put your company in the headlines.

To avoid the catastrophic consequences of a “big red button plan”, you should have an effective Incident Response plan. This plan leverages systems, teams, and processes to manage security incidents. The previous questions, and more, should be taken into account when defining this plan:

  • How will you perform a forensic analysis?
  • What teams should be involved, and how will they coordinate?
  • How will you communicate the incident to your customers?
  • How can you learn from the incident and improve?

Investing in a plan will put you in a much better position to manage security incidents when they happen, lower the associated risks, and improve your confidence and security posture.

Google has a rigorous Incident Response process divided into the following phases:

  • Identification. This phase focuses on monitoring security events to detect potential vulnerabilities and incidents, and report to the incident response team.
  • Coordination. When an incident is reported, a triage will take place to evaluate the nature and severity of the incident and engage the response team if needed.
  • Resolution. At this phase, we will investigate the root cause, resolve immediate security issues if any, and coordinate tasks to contain and recover. Communication plans are also developed if needed.
  • Closure and Continuous improvement. We analyze each incident to gain new insights and learn lessons to improve our tools, training, and processes for our overall security.
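The four phases can be pictured as a simple linear progression. The sketch below is only an illustration of that ordering, not Google's tooling: the `Phase`, `Incident`, and `next_phase` names are hypothetical, and a real process would also allow for cases such as triage deciding that no response is needed.

```python
from enum import Enum


class Phase(Enum):
    IDENTIFICATION = 1
    COORDINATION = 2
    RESOLUTION = 3
    CLOSURE = 4


def next_phase(current):
    """Return the phase that follows `current`, or None after closure."""
    if current is Phase.CLOSURE:
        return None
    return Phase(current.value + 1)


class Incident:
    """Tracks which phase an incident is in and the phases it went through."""

    def __init__(self, summary):
        self.summary = summary
        self.phase = Phase.IDENTIFICATION
        self.history = [Phase.IDENTIFICATION]

    def advance(self):
        following = next_phase(self.phase)
        if following is None:
            raise ValueError("incident is already closed")
        self.phase = following
        self.history.append(following)
        return following


incident = Incident("unexpected outbound traffic")
while next_phase(incident.phase) is not None:
    incident.advance()

print([phase.name for phase in incident.history])
```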

In this white paper, you can gain more insights into Google’s approach to incident response.

The psychological side

We have talked about the challenges that come up when dealing with an incident. As explained, technically, the best path is to have an Incident Response plan to manage these situations. Still, having a proper plan is not easy. It requires time, effort, resources, qualified people, and commitment from the CxOs to invest in the plan's success. In the absence of that, some customers may think they are safer with a big red button than without one. We have discussed technical reasons why this is not a good idea. But the problem is not solely a technical one; it is also psychological.

Bruce Schneier wrote: “Security is both a feeling and a reality, and they’re different. You can feel secure even though you’re not, and you can be secure even though you don’t feel it.” Regretfully, sometimes people make decisions based on the feeling of security rather than its actuality. When the feeling of security doesn’t correspond to reality, feelings can be an enemy.

Imagine you discover malicious behavior on your systems, or that your data is compromised. You have neither an incident response plan nor much information about the attack and what has occurred. But you have that button. This situation, where you are risking your business, is an emotional rollercoaster, especially without a plan.

You may feel more secure with the option of pressing a button. You know doing so has risks, but many people prefer known risks to unknown ones. We perceive risks according to several factors like (un)familiarity and fear; we assign higher risk to threats that are new and not well known. This phenomenon of perceived risk applies even more when we must make quick decisions under pressure. We can’t help but feel emotional reactions to the various options available, which makes inappropriate decisions far more likely.

Conclusion

A security incident, and how you respond to it, has many consequences. You can improve your security posture if you assess past incidents thoroughly and consider future threats. You should have an Incident Response plan with several components. First, include written processes and specific procedures to follow for each situation. Second, train your people to follow the processes and procedures. Third, provide support when they must execute them. Finally, complete retrospectives to learn from the experience. Your ultimate goal is to plan ahead of time to be safe!

Julio Diez
Google Cloud - Community

Strategic Cloud Engineer at Google Cloud, focused on Networking and Security