Resilience: Understanding security at scale

Jorge O'Higgins
Dec 5, 2022


Scaling security means creatively combining the old triad of people, technology, and processes: finding talented people to build technology and designing processes in such a way that companies are resilient against attacks while also becoming more agile.

Security at scale is also thinking about how to apply controls independently of people. How do I control the security posture of the five hundred instances that my organization has? What if there are thousands? What if there are hundreds of thousands? But let’s start with one of the main aspects of what it means to do security at scale: resiliency.

What is resilience?

In just a few words, resilience is the capacity to recover quickly from difficulties.

The resilience of companies is based on assuming that, sooner or later, they will suffer a security incident. That is out of your control (the dichotomy of control is a Stoic practice that applies to cybersecurity). Therefore, following Stoic philosophy, a modern security team must focus on what it can control. And what you can control is preparing for the worst (even while hoping for the best). In other words, you have to know your capabilities (technical and human) very well in order to build an environment that is resilient to attacks.

A few years after I started in the world of security, in the early 2000s, I often heard security heads say the phrase "nothing ever happened to me," in clear reference to how "effective" they were in their roles. At the time, I was a young pentester (who had also spent a period implementing security solutions and knew how easy it was to make a mistake and expose a vulnerability), and that phrase challenged me deeply; I clearly questioned it. Who can be so naive as to claim to guarantee absolute security?

But one day, many years ago, I read the first versions of Carnegie Mellon's OCTAVE and CERT Resilience Management Model methodologies, which helped me understand the secret of a good security professional.

The secret is not to brag about how unbeatable the solutions you design and implement are, but to design safe fallbacks in the event of a failure or security compromise. Those papers were not only the first I read that mentioned the concept of resilience in terms of security; they also marked a before and after in my own conception of security. However, it was only years later that I was able to apply it.

How can resilience be achieved?

Resilience is not a solution, it is a process. Let's start with the first step: knowing where to put our (generally scarce) effort so that we are efficient in protecting our organization's resources.

And we achieve this by knowing the main risks of our organization. Based on this, we must choose our initiatives (EPICs or BAUs) in such a way that they build controls that help us reduce those risks. In other words, before doing something, you must know whether or not it will serve to reduce a risk. If not, you shouldn't consider it.

As we can see in the graph below, the intersection of these three dimensions is the optimal point where we should focus our efforts to reduce risks. This means that the initiatives we are carrying out implement controls that are effective in reducing risk and we are efficient in the way of doing it.

Other possible readings emerge from the graph. For example:

Neglected risks: what about risks that have no associated controls or initiatives? Should I worry about them? Should we focus our efforts on reducing those risks, or do we simply accept them as they are?

Effort optimization: what about the initiatives I carry out that do not build controls that reduce risks? Is there an opportunity to optimize resources? Am I developing something that I shouldn't? Are my efforts well directed?

Control optimization: what about controls that do not reduce risks? Are these controls an unnecessary overhead? Is it worth keeping them, or could we eliminate them and thereby not only free up resources but also reduce friction for our users?
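The three readings above can be sketched as simple set operations. This is a minimal illustration of the model, not a real tool: the risks, controls, and initiatives (and the mappings between them) are entirely hypothetical.

```python
# Hypothetical sketch of the risk/control/initiative intersection model.
risks = {"account-takeover", "data-exfiltration", "ddos", "insider-abuse"}

# Which risks each control reduces (illustrative mapping).
controls = {
    "mfa": {"account-takeover"},
    "dlp": {"data-exfiltration"},
    "legacy-av": set(),  # reduces no identified risk
}

# Which controls each initiative builds or improves (illustrative).
initiatives = {
    "rollout-mfa": {"mfa"},
    "tune-dlp": {"dlp"},
    "new-dashboard": set(),  # builds no control
}

risks_with_controls = set().union(*controls.values())

# Neglected risks: no control (and hence no initiative) reduces them.
neglected_risks = risks - risks_with_controls

# Effort optimization: initiatives that build no control.
idle_initiatives = {i for i, cs in initiatives.items() if not cs}

# Control optimization: controls that reduce no risk.
idle_controls = {c for c, rs in controls.items() if not rs & risks}

print(sorted(neglected_risks))   # ['ddos', 'insider-abuse']
print(sorted(idle_initiatives))  # ['new-dashboard']
print(sorted(idle_controls))     # ['legacy-av']
```

Each of the three resulting sets is a question for the team: should a neglected risk be accepted or addressed, and should an idle initiative or control be redirected or retired?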

Obviously, this is an abstract model and we will probably find valid counterexamples for each of the alternative readings I mention. My intention is that this model serves as a trigger to question whether what we do is worth doing or if we have an opportunity for efficiency. Challenging the status quo of processes is a central part of innovation.

So far, nothing new. Just remember that you have to focus on the risks. A separate chapter is how to present these risks in a way the business can understand, but that is the subject of another post.

What follows is a security model that takes resilience as a central point. It is the model we have applied at MercadoLibre (Nasdaq: MELI) for several years.

MercadoLibre Inc. (Nasdaq: MELI) security model

This model contains four pillars that sustain the resilience capability.

Zero Trust: this pillar begins by assuming that we should not trust anything. I first saw this idea, now fashionable, in the model Google applied for its employees' access (BeyondCorp); then, digging into the idea that the traditional perimeter was "dead," I found the origin of this idea, which was revolutionary to me: Forrester's Zero Trust. In a few words, this model invites us to assume that we must operate in hostile environments regardless of where our collaborators are.

Anomaly Behavior Analysis: starting from the assumption that we have to operate in a hostile environment, we must build controls based on the detection of anomalies rather than on prior knowledge of attacks; that is, we must know the behavior of our users and our ecosystem. Understanding it allows us to detect anomalous behaviors that an adversary cannot avoid (or can only avoid with great difficulty), because while they can control their own behavior, they cannot control all the behaviors of the environment they are intervening in. On the other hand, even if they knew how other users and/or systems behaved, they could not alter those behaviors.
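As a toy illustration of behavior-based detection (not the actual detection logic used at MercadoLibre), consider flagging activity that deviates strongly from a user's own historical baseline:

```python
# Hypothetical sketch: flag an observation as anomalous when it is more
# than k standard deviations above the user's historical baseline.
from statistics import mean, stdev

def is_anomalous(history, observed, k=3.0):
    """Return True if `observed` deviates more than k standard
    deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return observed != mu
    return (observed - mu) / sigma > k

# A user who usually downloads ~10 files/day suddenly downloads 500.
baseline = [8, 12, 9, 11, 10, 13, 9]
print(is_anomalous(baseline, 500))  # True
print(is_anomalous(baseline, 12))   # False
```

The key point is that the baseline is learned from the environment's behavior, which the adversary does not control; real systems would model many more signals than a single count.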

Automatic Response (aka SOCless): Given that what we want to achieve is the resilience of our organization, what we must build is a defense system that responds as quickly as possible. In this sense, a main objective is to reduce human intervention as much as possible. Can you imagine a missile system where a human operator must manually calculate the coordinates to intercept a military jet? Neither do I. So why do we see value in having a room full of people looking at monitors like in a Security Operations Center? What happens when an operator has accumulated fatigue and has a night shift? How long does the response take when three or more teams need to be involved to make a decision? As you can imagine, each of these answers shows us possible opportunities for improvement in our resilience process.

In short, the more automatic our response, the more resilient we will be. Of course, it is not all about responding automatically. When things go wrong, we will need experienced, trained people with courage and tenacity to recover from an adverse situation.
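The SOCless idea can be sketched as a playbook that maps detection events directly to automated actions, with no analyst in the loop for routine cases. Everything here is illustrative: the event format, function names, and actions are assumptions, not a real MercadoLibre API.

```python
# Hypothetical SOCless-style sketch: detections trigger automatic
# responses; humans are involved only when no playbook entry exists.

def block_ip(ip):
    # In a real system this would call a firewall/WAF API.
    return {"action": "block", "target": ip}

def notify_owner(user):
    # Decentralization: the resource owner validates, not security.
    return {"action": "confirm-with-owner", "target": user}

# Playbook: map detection types directly to automated actions.
PLAYBOOK = {
    "ddos-source": lambda e: block_ip(e["ip"]),
    "massive-download": lambda e: notify_owner(e["user"]),
}

def respond(event):
    handler = PLAYBOOK.get(event["type"])
    # Unknown detections escalate to humans; known ones are automatic.
    return handler(event) if handler else {"action": "escalate"}

print(respond({"type": "ddos-source", "ip": "203.0.113.7"}))
print(respond({"type": "massive-download", "user": "jdoe"}))
```

Measuring how many events fall through to "escalate" gives a rough health indicator of how automatic (and therefore how fast) the response process actually is.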

Decentralization: an old idea holds that the security team must fill every role needed to validate a control: "access to a database? to be validated by security"; "the definition of roles? to be done by security"; "the reasonableness analysis of an action executed by a collaborator? security gives the OK." These examples (and they are just a few) not only fail to provide greater security, they also generate unnecessary friction. Who better to decide whether an access was correct than the person who knows the value of the information that was accessed? This idea can be applied to granting access permissions, responding to DDoS attacks, following up when a collaborator performs a massive download of information, and so on. The result: higher-quality, more agile responses and less operational burden for the security teams.

Although I will delve into these concepts and different application examples in upcoming posts, for those who cannot wait, here is an example of how we detect and respond to DDoS attacks:

https://aws.amazon.com/es/blogs/architecture/mercado-libre-how-to-block-malicious-traffic-in-a-dynamic-environment/

By applying these ideas, we not only gain greater defensive capacity and greater involvement from the organization's collaborators, but also greater agility in the interaction of the different actors that are part of a control. Moreover, automating processes lets us obtain health indicators for them, which in turn generates greater traceability. This serves not only the security team, to measure its effectiveness, but also third parties who need to audit these processes.

EOF

Keep hacking!


I'm Jorge O'Higgins. I have two passions: Infosec and leadership. I will mainly talk about them. More about me? https://www.linkedin.com/in/jorgeohiggins/