Our Application Security Journey (Part 1)
This is the first in a series of articles on the state of Application Security at Wise, describing the integration of security in the Software Development Lifecycle.
The Application Security team at Wise protects software application code and data against cyber threats and ensures customers’ assets are safe. We identify and report vulnerabilities across the company and work with product teams to ensure security practices are followed and mitigations are applied throughout the whole software development lifecycle.
How we started
The Software Composition Analysis (SCA) program led by the Wise Application Security team is critical for the security of the entire organisation, as thousands of open-source libraries are used on a daily basis. Manually tracking open-source libraries is an impossible mission given today's fast-paced development practices, and an unreasonable burden to place on an organisation, its security department and its engineering teams.
At Wise, we heavily use Kubernetes clusters to deploy our product services. Docker images are built as part of our Continuous Integration (CI) pipeline, and our platform team maintains our own Docker base images, which bundle various tools and libraries for wide engineering use during CI builds.
Product engineering teams build their services on top of a base image and are responsible only for vulnerabilities in the code and libraries their services use. You can learn more about how the Continuous Delivery team at Wise works, and the exciting journey they are going through, in this recent blog post by Massimo Pacher.
Ownership is therefore split between teams: vulnerabilities coming from the base images belong to the platform team, while vulnerabilities introduced specifically by product services belong to the respective product teams.
After evaluating different solutions and tools and iterating on a feedback process across engineering, the Application Security team identified these common pain points shared across the organisation:
- Assigning the right vulnerabilities to the right owners
  - For example, excluding vulnerabilities belonging to base images (OS vulnerabilities) from product teams' reports
- Gaining a global view of the security state of the services
- Easily developing dashboards based on the differing requirements of stakeholders across the organisation
- The difficulty of introducing yet another UI to the various product engineering teams
After some brainstorming and proof-of-concept (PoC) sessions, we decided it was feasible to automate running scans and assigning ownership of services and vulnerabilities to the right teams, paired with an easy-to-use tool for visualising the vulnerabilities. We started building automation capabilities around an open-source tool with the aim of solving the challenges identified.
In this post, we will detail how we built our tooling, how a cross-team initiative drove our development process, what we learned, and our next steps to continue improving our service for engineering teams at Wise.
Developing an in-house solution
The discovery process led us to evaluate multiple open-source tools for executing scans and identifying vulnerabilities, both those coming from container OS packages and those from libraries included in the applications running within the container. The final decision was to use Trivy.
Trivy has all the characteristics we need: it is well supported and maintained, and can scan containers for operating-system-based vulnerabilities. This increases our visibility into package and library updates for applications running in containers.
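As an illustration, here is a minimal sketch of how a Trivy scan might be invoked and its JSON report flattened. The helper names are hypothetical, but the `trivy image --format json` invocation and the `Results`/`Vulnerabilities` keys match Trivy's JSON output format:

```python
import json
import subprocess

def scan_image(image: str) -> dict:
    """Run Trivy against a container image and return the parsed JSON report."""
    proc = subprocess.run(
        ["trivy", "image", "--format", "json", "--quiet", image],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

def extract_vulns(report: dict) -> list:
    """Flatten a Trivy report into (vulnerability ID, package, severity) tuples."""
    vulns = []
    for result in report.get("Results", []):
        # Trivy omits or nulls the key when a target has no findings
        for v in result.get("Vulnerabilities") or []:
            vulns.append((v["VulnerabilityID"], v["PkgName"], v["Severity"]))
    return vulns
```

The flattened tuples are convenient for de-duplication and for comparing base-image findings against product-image findings later in the pipeline.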
The next steps were to identify:
- A service catalogue/registry for our services
- Ownership of the services
- When new images were built and then deployed to production
- Where to store scan data (Vulnerability Management)
- How to build custom dashboards, metrics and publish data to a tool familiar to product engineering teams (which we will explore in the second part of this blog post)
Service registry, ownership and identification of the services
At Wise, the Platform teams have built a very convenient asset catalogue that stores metadata about product services. Services are categorised in a tiering system, which also gives us a good indicator of each service's criticality.
The catalogue exposes APIs that list metadata for every service deployed in our production environment and, most importantly, contain a mapping between services and their owning teams.
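The real catalogue API and its field names are internal to Wise, so the shape below is purely illustrative, but it shows the kind of service-to-owner mapping such a response can be folded into:

```python
def build_ownership_map(catalogue: list) -> dict:
    """Map each service name to its owning team and criticality tier.

    `catalogue` is assumed to be a list of dicts as a catalogue API might
    return them; the "service"/"team"/"tier" keys are hypothetical.
    """
    ownership = {}
    for entry in catalogue:
        ownership[entry["service"]] = {
            "team": entry["team"],
            "tier": entry.get("tier", "unknown"),
        }
    return ownership
```

With this map in hand, every finding produced later in the pipeline can be routed to the team that actually owns the affected service.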
One of the biggest challenges security teams face is assigning vulnerabilities to the right area of ownership. Failing to do so means vulnerabilities are ignored for lack of context and, most of the time, left unpatched, as the recipient team wouldn't necessarily know how to fix them.
Another useful system discovered during the investigation was our internally developed continuous deployment automation service. It allows us to identify the images currently running in the production environment using the API exposed by the service itself.
To tie all of these findings together in one central location, our investigation led us to use a very popular open-source vulnerability management tool, DefectDojo.
DefectDojo removes much of the complexity around managing vulnerabilities and offers additional features that help us classify and categorise applications accurately:
- De-duplication of findings from scans
- Writing new parsers for any tool we want to integrate
- Flexibility to use any tool at any stage of the CI/CD
- Easy to use APIs
- Consolidation of data in structured JSON
- Tagging system for the products
- Assignment of ownership to relevant teams
- SLA counter for vulnerabilities
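DefectDojo's REST API exposes an `/api/v2/import-scan/` endpoint with a built-in parser for Trivy reports. A sketch of the form fields such an import might use, where the engagement ID and the service/team tagging scheme are illustrative rather than Wise's actual configuration:

```python
def build_import_payload(engagement_id: int, service: str, team: str) -> dict:
    """Form fields for a DefectDojo /api/v2/import-scan/ request.

    In practice these are sent as a multipart POST together with the
    Trivy JSON report file; the tag scheme here is hypothetical.
    """
    return {
        "scan_type": "Trivy Scan",      # parser shipped with DefectDojo
        "engagement": engagement_id,
        "tags": f"{service},{team}",    # tags drive per-team ownership views
        "minimum_severity": "Low",
        "active": True,                 # findings start as active...
        "verified": False,              # ...but unverified until triaged
    }
```

Re-importing results for the same engagement is what lets DefectDojo's de-duplication and SLA counters do their work across successive scans.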
Dashboard, metrics and visibility
Engineering teams at Wise, particularly data and analytics teams, use Looker, a data visualisation and analytics tool, to build their dashboards and share them across the organisation.
One of the main pieces of feedback our Application Security team received from numerous other engineering teams was that they already had too many tools, and one more would be difficult to introduce and adopt.
We went through a process of discovery and analysis and concluded that our process should be adapted to fit the data into Looker, making it available company-wide, powered by the Snowflake data warehouse.
Looker showed the flexibility we needed and most importantly, we would have the option to create dashboards from YAML files with a simple commit to a repository.
Moreover, Looker enables us to explore the data, whatever that data may be, and lets different teams visualise it according to their needs.
All the pieces of the puzzle were fitting together!
To better understand the diagram above, the next section delves into the workflow of our security automation and how the services interact with each other.
Application security services
It is important to mention that this first release of our security automation services currently acts as a workflow separate from our CI and CD workflows.
In this revision, our service is hosted in an EKS cluster and uses an RDS database within AWS to store data.
This initial separation is important: it lets us start scanning, releasing data and building a global view of the wider vulnerability state of our services without impacting CI/CD speed, as we deemed the hosting infrastructure and current architecture of our services not capable of supporting the almost 12k daily jobs initiated by CI.
Based on the lessons we learnt about monitoring the performance and speed of our services, we will try to connect them to the CI and CD workflows in the following iterations.
The service is built in Python using technologies such as FastAPI. It exposes APIs to allow interaction from other services and for future use. The various steps of the scans are orchestrated with APScheduler, which starts the processes that fetch metadata, scan, and publish reports.
We also use webhooks between the different services to initiate synchronisations. The scans are performed by executing Trivy, mentioned above, as a subprocess.
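Stripped of the APScheduler and webhook specifics, the fetch → scan → publish cycle can be sketched as a plain function; the function names are illustrative, and in production a scheduler triggers the cycle on an interval rather than it being called directly:

```python
def run_pipeline(fetch_metadata, scan, publish):
    """One end-to-end cycle: fetch service metadata, scan each service's
    image, then publish the collected reports downstream."""
    services = fetch_metadata()
    reports = [scan(service) for service in services]
    publish(reports)
    return reports

# With APScheduler, the wiring would look roughly like:
#   scheduler.add_job(lambda: run_pipeline(fetch, scan, publish),
#                     "interval", hours=1)
```

Keeping each stage behind a simple callable boundary makes it easy to swap stages (for example, a different scanner or publisher) without touching the orchestration.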
Our service collates and maps information from different tools to describe every service identified in our asset catalogue system. This way, the automation service can build a comprehensive list of attributes associated with each product service as structured JSON, which can be queried via API to retrieve a wealth of useful information for a specific product service.
We use GitHub workflows as our CI system, and every repository has its own dedicated pipeline. Wise runs several hundred services, and one of the biggest obstacles in our metadata-gathering phase has been GitHub API rate limiting.
Gathering metadata for product services is the most critical and lengthy part of the process. Critical, because the metadata is used to tag services (e.g. production image tag, artefact versions/tags), assign ownership and identify the Docker base image used to build the service. Lengthy, because GitHub API rate limiting slows it down.
During their lifetime, services can change attributes dynamically (e.g. team ownership, team name or repository). Once the metadata has been acquired and populated or updated, the scanning phase starts by finding vulnerabilities in the Docker base images. This lets the service build a dictionary of base images and their associated vulnerabilities. When the final product images are scanned, any findings already present in the base image are removed from the product results, separating the vulnerabilities and assigning them correctly to the relevant teams.
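The base-image/product-image separation described above amounts to a set difference keyed on vulnerability and package. A simplified sketch, assuming Trivy-style field names for the findings:

```python
def split_by_origin(product_vulns: list, base_vulns: list) -> tuple:
    """Separate product-image findings into those inherited from the base
    image and those introduced by the service itself, keyed on
    (vulnerability ID, package name)."""
    base_keys = {(v["VulnerabilityID"], v["PkgName"]) for v in base_vulns}
    inherited, service_owned = [], []
    for v in product_vulns:
        if (v["VulnerabilityID"], v["PkgName"]) in base_keys:
            inherited.append(v)       # platform team's responsibility
        else:
            service_owned.append(v)   # product team's responsibility
    return inherited, service_owned
```

Because base images are shared, the base-image scan runs once per image and its result is reused across every product service built on top of it.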
When scanning of the base and product images is complete, our synchronisation service synchronises the latest scan results and produces messages that are shipped to a Kafka topic. These messages are then stored in Snowflake using a pipeline made available by the analytics team, which synchronises every hour.
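A sketch of how one scan result might be serialised into a Kafka message value; the topic name and payload schema are illustrative, not Wise's actual ones:

```python
import json

def to_kafka_message(service: str, scan_id: str, vulns: list) -> bytes:
    """Serialise one scan result as a JSON-encoded Kafka message value.

    The field names here are hypothetical; the real schema would match
    whatever the downstream Snowflake pipeline expects.
    """
    payload = {
        "service": service,
        "scan_id": scan_id,
        "vulnerability_count": len(vulns),
        "vulnerabilities": vulns,
    }
    # sort_keys keeps the encoding deterministic, which helps debugging
    return json.dumps(payload, sort_keys=True).encode("utf-8")

# With a producer client (e.g. kafka-python), this would be sent roughly as:
#   producer.send("vulnerability-scans", value=to_kafka_message(...))
```

Shipping one self-describing message per scan keeps the hourly Snowflake sync simple: each landed row maps directly to one scan of one service.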
Once data lands in Snowflake it is ready to populate the vulnerability dashboards created in Looker.
Once we started analysing the data produced and the services reached reasonable stability, we began developing dashboards from the available data and structuring a process for distributing and remediating these vulnerabilities.
In this article we shared how we developed an in-house solution using an open-source tool and built automation to start our Software Composition Analysis (SCA) program. We covered our services' ability to collate information from multiple sources into structured JSON that provides detailed information for each production service.
Finally, we discussed the workflow, including metadata acquisition, scanning, reporting, assigning vulnerabilities to the relevant owners and visualising the data.
In the next part of this series, we will share how we created dynamic dashboards and started the distribution and mitigation of the vulnerabilities.
The Wise Application Security team will continue to learn as we implement the above improvements and features moving forward.
So, watch this space! We will publish further updates on our team's vision, sharing what we are working on to evolve this process, as it will bring speed, real-time feedback and, more importantly, scalability. It also encompasses other exciting areas of the automation program, with features like DAST, SAST and secrets scanning.
If you enjoyed reading this post and like the presented challenges, keep an eye out for open Application Security Engineering roles here.
Disclaimer: Product names mentioned in this article are property of their respective owners. This article is not aimed at publicising or endorsing any of the brands used or mentioned in the article.