Kubernetes Security monitoring at scale with Sysdig Falco
Two years ago at Skyscanner we decided to start moving our workloads to Kubernetes. Today, even though this transition is far from complete, our infrastructure uses >2000 nodes spread across 30 different clusters to power a fleet of >160 services.
As part of the transition to Kubernetes, the Security team had to come up with a way to detect malicious activity in Skyscanner’s Kubernetes clusters. Due to the sheer size of the target we want to monitor, the solution we chose needed to scale as much as our most demanding services without hindering their performance, while alerting us immediately if a machine in any cluster is compromised. Beyond scaling to all the Kubernetes clusters, we also need an automated way of mapping and contacting the owners of the affected services if anything were to happen. Luckily we’ve dealt with this issue before and we have solid service mapping and ChatOps tools, so we needed a tool that could integrate with them.
After studying some options we decided to use Falco (https://falco.org/), the open source tool from Sysdig. Falco is a container-native runtime security solution focused on intrusion and anomaly detection. It uses the open source Linux kernel tooling built by Sysdig to generate alerts based on a custom rules and macros engine (all the rules and macros are also available as part of the open-source project). It was exactly what we needed to add to our security tool roster when moving to Kubernetes.
Why Sysdig Falco?
Some of the key features that made us choose Sysdig Falco were:
- Complete container visibility through a single sensor that allows us to gain insight into application and container behavior.
- Easy installation as a Daemonset, ready for Kubernetes
- Adopted into the Cloud Native Computing Foundation (Incubated project)
- Active open-source community. We ❤️ Open Source at Skyscanner.
A powerful ruleset
Sysdig Falco is based on rules that trigger alerts when their conditions are met. For example, by default Falco can detect when:
- A shell is run inside a container
- A server process spawns a child process of an unexpected type
- A sensitive file, like /etc/shadow, is unexpectedly read
All of the above are examples of the first actions that an attacker would take if they managed to access one of our services, so it’s of the utmost importance that we’re able to detect this activity on any of our machines (and act on it as soon as possible).
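To make the idea of condition-based rules concrete, here is a toy Python sketch. It is not Falco’s rule engine or rule syntax (Falco rules are written in YAML and evaluated against kernel events); the event fields and rule names below are hypothetical and only illustrate how a condition maps an event to an alert.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]  # hypothetical flattened event, not Falco's real schema


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]
    priority: str


# Two illustrative rules mirroring the examples in the list above.
RULES: List[Rule] = [
    Rule(
        name="shell_in_container",
        condition=lambda e: e.get("in_container") == "true"
        and e.get("proc_name") in {"bash", "sh", "zsh"},
        priority="WARNING",
    ),
    Rule(
        name="sensitive_file_read",
        condition=lambda e: e.get("event_type") == "open"
        and e.get("file_path") == "/etc/shadow",
        priority="CRITICAL",
    ),
]


def evaluate(event: Event) -> List[str]:
    """Return the names of all rules whose condition matches the event."""
    return [rule.name for rule in RULES if rule.condition(event)]
```

Calling `evaluate` with an event describing a `bash` process inside a container returns only the shell rule, while an unrelated process returns an empty list.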
The Falco rules engine also allows easy creation of new rules and macros, and modification of the existing ones. Knowing that in the future we’ll be able to tailor rules that target our mission critical services was a strong factor in adopting Falco. As a bonus, we can also share custom rules with other Falco users who run pieces of the same stack.
Starting with Falco in K8s: using a Daemonset vs. a Falco daemon in each container
Once we decided to go with Falco, we had two main options for deploying it easily without increasing the friction of adoption (while considering our Cell Architecture). We could either install the Falco daemon as part of the Docker container in each pod, or run Falco as a Daemonset, so that each node that spins up gets a Falco pod that monitors the other pods on the same node. In the end we chose the Daemonset installation, because we can update the Daemonset template and redeploy it without having to rebuild every other service. The only downside is that this option is specific to Kubernetes, so services running in EC2 won’t be monitored; however, we’re moving to be fully in K8s by the end of the year, so this will not actually affect us.
Deploying Falco at Skyscanner scale
You might be thinking: adding a tool that inspects every system call sounds like it could have a big impact on the performance of our services, right? In our testing the answer was no: deploying Falco as a Daemonset in a K8s cluster hasn’t shown any negative effects on service performance (believe me, we really tried to break it). We worked with our K8s team to evaluate the potential impact, testing Falco with very fast services that respond in a couple of milliseconds and with services that perform heavy computation and take much longer to respond on average. We ran load tests in every way we could think of (we even tried making every request to an API trigger an alert in Falco) and we didn’t manage to cause any negative effect on performance.
In addition, once Falco was deployed successfully for the first time, we set up Prometheus alerts that notify us immediately of sudden drops in the number of Falco pods, too many pods waiting to start at the same time, or increased memory usage, all of which can indicate that the Daemonset is unhealthy. At Skyscanner we put great effort into making sure that all our services emit meaningful metrics that let us track their state, and we must not forget to monitor the monitoring tools! Who watches the watcher?
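The three health signals mentioned above (pod count drops, pods stuck waiting to start, memory growth) can be sketched as a simple check. This is a minimal Python illustration rather than our Prometheus alert definitions; the thresholds, field names and alert names are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DaemonSetSnapshot:
    desired_pods: int      # nodes that should be running a Falco pod
    ready_pods: int        # Falco pods currently ready
    pending_pods: int      # Falco pods waiting to start
    max_memory_mib: float  # highest memory usage across Falco pods


def health_alerts(
    snap: DaemonSetSnapshot,
    min_ready_ratio: float = 0.9,     # assumed threshold
    max_pending: int = 5,             # assumed threshold
    memory_limit_mib: float = 512.0,  # assumed threshold
) -> List[str]:
    """Return the names of the (hypothetical) alerts that should fire."""
    alerts: List[str] = []
    # Guard against division by zero when no pods are expected.
    if snap.desired_pods and snap.ready_pods / snap.desired_pods < min_ready_ratio:
        alerts.append("falco_pods_dropped")
    if snap.pending_pods > max_pending:
        alerts.append("falco_pods_pending")
    if snap.max_memory_mib > memory_limit_mib:
        alerts.append("falco_memory_high")
    return alerts
```

In a real setup the same conditions would live in the alerting system itself, so that they keep firing even if the component computing them is down.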
High level architecture
The current infrastructure that we have built around Falco works as follows.
As mentioned before, deploying Falco as a daemonset ensures that each new node that spins up will contain a Falco pod. Those pods post the payload of the generated events to the forwarder Lambda, which pushes each event to Slack, Splunk and our internal Security Aggregation and Visualisation System (Talos).
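A forwarder of this kind can be sketched as a Lambda-style handler that fans each finding out to several destinations. This is a minimal illustration, not our actual forwarder: the `make_handler` helper and the sink callables are hypothetical, and it assumes the finding arrives as a JSON string under the `body` key of the incoming event.

```python
import json
from typing import Any, Callable, Dict, List

Sink = Callable[[Dict[str, Any]], None]


def make_handler(sinks: Dict[str, Sink]) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a handler that sends every finding to all configured sinks."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        finding = json.loads(event["body"])  # assumed payload location
        failed: List[str] = []
        for name, send in sinks.items():
            try:
                send(finding)
            except Exception:
                # One failing destination must not block the others.
                failed.append(name)
        status = 502 if failed else 200
        return {"statusCode": status, "body": json.dumps({"failed": failed})}

    return handler
```

Passing the sinks in as plain callables keeps the handler easy to test: in production they would wrap the Slack, Splunk and Talos clients, while a test can pass functions that append to a list.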
To be able to update Falco without stopping the service (and to make sure that updates don’t cause failures in the service) we use Kubernetes RollingUpdates. With this mechanism, once the Falco daemonset template is updated, the old pods are replaced by pods with the new Falco configuration in a controlled manner: if the new pods don’t reach the “Ready” state on a node, the update won’t continue. This allows us to apply updates as frequently as we want without worrying about downtime during updates.
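As a rough illustration, the part of a DaemonSet spec that controls this behaviour can be built as a Python dictionary and printed as JSON. The field names (`updateStrategy`, `rollingUpdate.maxUnavailable`, `minReadySeconds`) come from the Kubernetes `apps/v1` DaemonSet API; the values and the helper name are assumptions for the example, not our production settings.

```python
import json


def daemonset_update_patch(max_unavailable: int = 1, min_ready_seconds: int = 30) -> dict:
    """Return a spec fragment enabling rolling updates for a DaemonSet."""
    return {
        "spec": {
            # A new pod must stay ready this long before it counts as available.
            "minReadySeconds": min_ready_seconds,
            "updateStrategy": {
                "type": "RollingUpdate",
                # At most this many nodes are without a ready pod during the rollout.
                "rollingUpdate": {"maxUnavailable": max_unavailable},
            },
        }
    }


print(json.dumps(daemonset_update_patch(), indent=2))
```

With `maxUnavailable` set to 1, only one node at a time has its Falco pod replaced, so a broken configuration stalls on the first node instead of spreading across the cluster.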
First round: Sandbox
We started by deploying Falco in our Sandbox environment to test performance, and also to get an idea of the results that Falco generates with the ruleset we chose. Ruleset selection is an important part of the process: depending on the characteristics and size of your environment the default rules can be very noisy, and we want to reduce noise to focus only on actionable alerts. As an example, during this round we realised that the HAproxy ingress (https://haproxy-ingress.github.io/) pods were generating many “Write below etc” alerts during startup. This happens because the service needs to update configuration files on startup (so it’s expected behaviour) and the custom process that we use to start the proxy was not one of the whitelisted HAproxy programs (Falco’s default rules already take HAproxy into account, so for most users this shouldn’t be an issue). Because we didn’t want this to generate noise, we updated the “haproxy_writing_conf” macro used in the “write_etc_common” rule available in the public version of the Falco rules (here: https://github.com/falcosecurity/falco/blob/373d2bfd890be8928813410bb7687edf0ab80f01/rules/falco_rules.yaml#L1176) so that it would also take into account the process spawned by our custom haproxy containers.
Once the noisiest events were removed we started to see some interesting findings. One of the most interesting (and alarming!) was containers running with root privileges. This is dangerous because a containerized app running as root can, for instance, allow attackers to move into the host machine and escalate from there (more details in this article). After we found this, we wanted to understand why it was happening. At Skyscanner we have custom Docker images for each of our most used languages (Python, Java, Node and now Go as well). Those images were created in collaboration with Security, so they follow all our standards (not running as root is one of them). In some special cases, however, teams need to use a different base image or language, and the base images available online don’t usually set up a user, so they end up running as root. This highlighted the need to review our strategy on Pod Security policies to prevent this going forward.
Second Round: Deploying to all cells in production
After successfully deploying the Falco daemonset in Sandbox and making the necessary config changes to minimise noise, we set out to deploy it in all our production clusters. In 2020 Skyscanner will be moving to a Kubernetes Cell architecture, which means each service can (and should) be deployed across multiple clusters, in multiple Availability Zones, across multiple accounts per region, so that in the event of a cluster outage the traffic can be redirected to a healthy cluster without affecting the Travellers. For Falco there is no difference between sandbox and production clusters, as they’re all just Kubernetes clusters (even though the production ones have much more traffic and more resources available). We updated the Falco config to include data in the request sent to the forwarder Lambda that lets us tell where each finding comes from (adding a cluster name and region made the results much more readable), and we proceeded to roll out Falco in the first production cluster.
There was a predictable spike in received findings as we went from only ~20 nodes to ~120 sending findings (we deployed Falco first in the cluster with the least traffic to be safe). Because we were using a Lambda, whenever more than one finding was posted concurrently the necessary Lambda instances spun up to absorb the traffic. If the increase in findings had kept the Lambda running constantly, we would have needed to move it to a microservice in EC2 or Kubernetes instead, but as that was not the case we made no changes to the infrastructure.
One of the most prominent findings in production is users running commands inside pods. This happens because some services in production are used for load testing, and devs need to access the pods to start and stop the tests. When this happens we still want to notify the user, as we consider a shell spawning to be a critical enough event that we must require a confirmation from the user (more on this in the future).
Storing and aggregating the results
The Falco findings allow us to see many potential security issues across all our Kubernetes services, but we cannot act just on raw data, so once the findings started being sent we had to store them and visualize them before we could take action.
The information that Falco gives us is limited to what we can extract from the running pod in Kubernetes, and most of the time it is not enough to know who we should contact to address the issue. Copying a pod name into one of our engineering Slack channels and hoping that the service owner reads the message and reacts to it is not a good way of finding a service owner. Luckily, in the Security Engineering team we’ve had to deal with this issue before. To solve it we built a system that collects ownership and resource information about each of our services (obtained by scanning the metadata available in the project’s repository, our AWS accounts and a few other places, such as data received from our deployment pipeline). We updated the Falco processor Lambda (the one receiving the Falco events) so that it pushes information to this mapping service and voilà! We can now accurately know things like who owns each pod that triggered an alert, which Slack channel we can use to contact them, and other contact and service information.
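The enrichment step can be pictured with a small Python sketch. It is only an illustration of the idea, not our mapping service: the `OWNERS` table, the team and channel names, and the rule that a service name is the pod name minus its last two suffixes (the pattern Kubernetes Deployments use for pod names) are all assumptions. The `output_fields` and `k8s.pod.name` keys follow Falco’s JSON output format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Owner:
    team: str
    slack_channel: str


# Hypothetical ownership table; the real data comes from repositories,
# AWS accounts and the deployment pipeline.
OWNERS: Dict[str, Owner] = {
    "flight-search": Owner(team="search-squad", slack_channel="#search-squad-alerts"),
}


def service_from_pod(pod_name: str) -> str:
    """Strip the ReplicaSet hash and pod suffix that a Deployment appends."""
    parts = pod_name.rsplit("-", 2)
    return parts[0] if len(parts) == 3 else pod_name


def enrich(finding: dict) -> dict:
    """Return a copy of the finding with owner details, or None if unknown."""
    pod = finding.get("output_fields", {}).get("k8s.pod.name", "")
    owner: Optional[Owner] = OWNERS.get(service_from_pod(pod))
    enriched = dict(finding)
    enriched["owner"] = owner.team if owner else None
    enriched["slack_channel"] = owner.slack_channel if owner else None
    return enriched
```

Findings whose pod cannot be mapped keep `None` in both fields, which makes unowned workloads easy to spot and follow up on.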
As we continue gathering information from AWS, Github and vulnerability sources like Snyk (dependency security) and Clair (Docker vulnerabilities), we can also know which AWS roles or infrastructure are linked to the same service, which language and libraries the service uses, and whether any of those libraries is affected by vulnerabilities that an attacker could be exploiting. Finally, all this data is stored in Elasticsearch, and using Kibana we built dashboards that let us easily monitor the number of findings in each cluster, filtering by owner, service criticality, rule triggered and so on.
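The kind of breakdown those dashboards show can be reproduced in memory with a few lines of Python. This is a sketch of the aggregation only, not of our Elasticsearch queries; it assumes each stored finding carries a `cluster` field (added by us) next to Falco’s `rule` field.

```python
from collections import Counter
from typing import Iterable


def count_by_cluster_and_rule(findings: Iterable[dict]) -> Counter:
    """Count findings per (cluster, rule) pair, tolerating missing fields."""
    return Counter(
        (f.get("cluster", "unknown"), f.get("rule", "unknown")) for f in findings
    )
```

Calling `most_common()` on the result lists the noisiest cluster and rule combinations first, which is usually where rule tuning should start.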
Actioning the alerts
As we have seen from the results of running Falco in Sandbox, the tool is helping us surface many issues that we’d otherwise not know about. Pushing the results to our in-house data mapping service lets us gather extra information about the affected pod, so we can act on the alerts knowing who to contact and what other services may be affected (in the case of an attack). Having this level of visibility is great, and it is the first step in allowing the Security Engineering team to make decisions and prioritize actions. But considering we own more than 5000 services and our Security team is 22 people, doing this manually doesn’t scale, so we need to automate the process. But how?
We action the results with our ChatSecOps bot ‘Hermes’. The bot is configured with a series of rules that scan the data received from a source (in this case the Falco findings, with the information added by our mapping tool) and use it to contact the owner of the resource where an anomaly is detected and inform them about the event. Finally, we are working on making the bot ask the owner of the service whether they are the ones executing the action that triggered the rule. If the owner of the resource responds ‘No’, the bot will trigger an Incident Response process.
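The confirmation flow can be sketched as a small decision function. This is not Hermes’ code; the action names are hypothetical, and only the ‘No’ branch comes from the flow described above. Closing the finding on ‘Yes’ and waiting on a missing or unrecognised reply are assumptions made for the sketch.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT = "wait_for_owner"
    CLOSE = "close_finding"
    INCIDENT = "trigger_incident_response"


def next_action(owner_reply: Optional[str]) -> Action:
    """Decide what the bot does after asking 'Was this you?'."""
    if owner_reply is None:
        return Action.WAIT  # no answer yet (assumed behaviour)
    answer = owner_reply.strip().lower()
    if answer == "no":
        return Action.INCIDENT  # owner denies the action: start Incident Response
    if answer == "yes":
        return Action.CLOSE  # assumed: a confirmed action needs no follow-up
    return Action.WAIT  # unrecognised reply: keep waiting (assumed)
```

A real bot would also need a timeout, so that a finding nobody answers is escalated rather than left waiting forever.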
The way forward
Now that Falco is running in the Kubernetes clusters we have a solid baseline of rules that are checked against all pods from the moment they start running in one of our clusters. However, the work to secure our Kubernetes clusters is far from over, and we plan to build on this baseline to ensure each service deployed in Kubernetes is as secure as possible. We wanted to achieve full visibility from the security point of view in order to have situational awareness and accurate data to make decisions.
The next steps for this process are the following. On the one hand, we’ll continue polishing our Falco ruleset and adding new rules to it (we’re looking at a few that target specific technologies that we use, like access to Elasticsearch instances). On the other hand, we’ll also be focusing on pre-emptive actions, taking a closer look at how our pod security policies are configured and tailoring the permissions given to each user to grant least privilege wherever possible, minimizing the blast radius if any of our services were to be compromised.
If you’re interested in finding out more about our security automation tooling, we’ll be releasing another post where we talk in depth about what tools we built to support and enable us to manage all aspects of security at our scale, so stay in touch!
About the authors:
Nacho recently completed the Graduate programme at Skyscanner as a software engineer. He’s passionate about all things automation, especially chatbots, and currently works in the Platform Security and Automation team, developing services to automate the security processes in Skyscanner. Nacho is based in Barcelona.
Christian has extensive experience in the Security Industry. He is currently leading Security Engineering at Skyscanner, focusing on delivering security at scale. Christian is based in Barcelona.