Our Application Security Journey (Part 2)
This is the second in a series of articles on the state of Application Security at Wise, describing our integration of security in the Software Development Lifecycle.
In Part 1 of this series, we presented how we planned and built our Software Composition Analysis (SCA) service, some of the challenges we faced along the way, and how we started to think about visualising all of this data. In this final article, we will detail the steps we went through to roll out the SCA tooling dashboards and how we collaborated with the wider Wise Engineering teams to work towards a common goal.
Exposing Vulnerability Dashboards to Product Teams
As we detailed in Part 1 of this series of blog posts by Cristiano Corrado, a requirement for our system was to have a comprehensive dashboard for the whole of Engineering to view their vulnerabilities. We started aggregating information around container and library vulnerabilities for teams and products in a single visualisation dashboard. Our first attempt produced a view similar to Figure 1:
We initially thought this worked quite well, but something we hadn’t fully taken into account was how usable it would be for the rest of the engineering teams.

When we shared the dashboard, there was a lot of confusion among the wider engineering teams about how to navigate these views.

We realised we had made too many assumptions about how teams would use this data to fix vulnerabilities, and, as a consequence, we had not gathered enough feedback from the rest of the engineering teams. So, back to the planning table we went!
Identifying Potential Bulk Mitigation Solutions
As mentioned in Part 1, our service assigns vulnerabilities to the relevant teams:
- Vulnerabilities detected in the base images are assigned to the Site Reliability Engineering (SRE) team
- Vulnerabilities detected in product images are assigned to the relevant engineering teams
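As a rough illustration of this routing rule, the sketch below assigns a finding to an owner based on whether the affected layer belongs to the base image. The field names, layer digests and team labels are hypothetical, not our real data model.

```python
# Hypothetical sketch: route a container vulnerability to an owning team
# based on whether the affected image layer comes from the shared base image.

def assign_owner(finding: dict, base_image_layers: set) -> str:
    """Return the team responsible for fixing this finding."""
    if finding["layer_digest"] in base_image_layers:
        return "SRE"                    # base-image vulnerability
    return finding["service_team"]      # product-image vulnerability

# Illustrative data only.
base_layers = {"sha256:aaa", "sha256:bbb"}
in_base = {"layer_digest": "sha256:aaa", "service_team": "payments"}
in_product = {"layer_digest": "sha256:ccc", "service_team": "payments"}

print(assign_owner(in_base, base_layers))     # SRE
print(assign_owner(in_product, base_layers))  # payments
```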
In light of this, the Application Security team and SRE came up with a plan to mitigate as many vulnerabilities as possible within the base images first, before exposing the remaining findings to Engineering Teams.
Doing this would significantly reduce the effort required from engineering teams: instead of duplicating the work of fixing the same vulnerability across hundreds of services, all teams would need to do is rebuild their services with the latest available base image, eliminating a large swath of vulnerabilities at the source.
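To make that reasoning concrete, here is a minimal sketch, under assumed field names, of how one might estimate how many per-service findings a single base-image fix would eliminate:

```python
from collections import Counter

# Hypothetical sketch: a CVE fixed once in the shared base image removes
# every service-level finding that stems from it. Field names are illustrative.

def base_image_fix_impact(findings: list) -> Counter:
    """Count, per CVE, how many service findings a base-image fix removes."""
    impact = Counter()
    for f in findings:
        if f["in_base_image"]:
            impact[f["cve"]] += 1
    return impact

# The same base-image CVE surfaces in two different services.
findings = [
    {"cve": "CVE-2021-0001", "service": "svc-a", "in_base_image": True},
    {"cve": "CVE-2021-0001", "service": "svc-b", "in_base_image": True},
    {"cve": "CVE-2021-0002", "service": "svc-a", "in_base_image": False},
]

print(base_image_fix_impact(findings))  # one base-image fix clears two findings
```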
Addressing Initial Feedback
Once we had planned and agreed on the base image upgrades, our next step was to implement the feedback we had received. This included general visualisation improvements, better documentation on how to use the service, and detailed guidance on how to identify direct and transitive dependencies and start fixing them.
Setting an SLA (Service Level Agreement) Process Around Vulnerabilities
We iterated through this feedback process for around one to two weeks, having teams try the improvements out, and found what worked and what did not. One of the biggest improvements was adding our SLA policy for each identified vulnerability, stating its resolution timeline. This information was displayed as a widget on our dashboard, shown in Figure 2, giving teams not only the total number of critical, high, medium and low vulnerabilities, but also how many of those were breaching SLA. With this data visible, engineers were able to put mitigation plans in place with accurate timelines, ensuring that all vulnerabilities could stay within SLA.
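As a hedged sketch of how such a widget could be driven, the snippet below flags findings whose age exceeds a per-severity resolution window. The windows shown are purely illustrative, not Wise's actual SLA policy.

```python
from datetime import date

# Hypothetical per-severity SLA windows (days), for illustration only.
SLA_DAYS = {"critical": 7, "high": 30, "medium": 60, "low": 90}

def breaching_sla(findings: list, today: date) -> list:
    """Return findings whose age exceeds the SLA window for their severity."""
    return [
        f for f in findings
        if (today - f["detected"]).days > SLA_DAYS[f["severity"]]
    ]

today = date(2022, 3, 1)
findings = [
    {"id": 1, "severity": "critical", "detected": date(2022, 2, 1)},   # 28 days old
    {"id": 2, "severity": "high", "detected": date(2022, 2, 20)},      # 9 days old
]

print([f["id"] for f in breaching_sla(findings, today)])  # [1]
```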
Improving Dashboards and Documentation
As mentioned above, direct feedback from various engineering teams indicated a need to focus on visualisation clarity and to document in better detail how our tooling runs. Acting on this feedback has left our main engineering dashboard looking a little different: it is now more tailored to our engineers’ needs and allows Wise to adopt a shared responsibility model for owning and fixing container vulnerabilities. An example of this updated dashboard at the time of writing is shown in Figure 3.
Engineering Wide Collaboration
The next stage of the process was to begin working with other engineers across Wise to start getting the dashboard used and findings mitigated. In this section, we will cover exactly how we communicated, organised and eventually mitigated our findings.
In the Application Security team, a key part of what we do is working with engineers to eliminate security vulnerabilities, and this was no exception. While we could simply have written our documentation and asked teams to start fixing vulnerabilities, we opted for a different approach and rolled out this process using what we call at Wise a swarm week.
A swarm day/week is a time when engineers across the company get together and spend time on cross-team tech initiatives. The purpose was to bring the rest of Engineering together to understand how to investigate and mitigate container vulnerabilities as a wider group.
We organised a week in March where we asked each team to have at least one engineer sign up to a session tailored, where possible, to the timezone in which they were located.
Each day we would have an introductory context-sharing session lasting around 15 minutes, detailing the structure of the day, including any guides (similar to the TL;DR in Figure 4) or tooling that could be used to get started with identifying and fixing container vulnerabilities. We kept the Zoom session open for 2-hour slots in case anyone wanted to ask the Application Security team or SRE any questions. Additionally, we had a dedicated Slack channel set up to provide further support.
Putting this week together and sharing all available resources and context enabled the whole of engineering to work towards a shared goal. All teams ended up with increased visibility, knowledge and understanding of the vulnerabilities belonging to their services. It became extremely clear who owned each part of the remediation process and which timelines had to be met. Overall, swarm week was a great success, measured by the significant downtrend in open vulnerabilities we began to see.
This journey was about building a service, learning from mistakes throughout the process, and showing positive results as well as the negative ones. We learned that engineering at Wise is a collaborative and positive place where each part of the organisation is willing to work together to solve complex problems.
This is only the start of what we want to build with our container and library security service.
Moving forward, we plan to continue to improve our service offering to the rest of Wise in some of the following ways:
- Creating a highly configurable and flexible notification service connected to different services (e.g. Slack or Jira)
- Moving towards real-time notification of new vulnerabilities
- Continuing to implement feedback as recommended by the rest of Wise
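As a sketch of what the first of these could look like, the snippet below formats a newly detected vulnerability as a Slack message and posts it to an incoming webhook. The webhook URL, payload shape and field names are placeholders, not a description of our actual service.

```python
import json
import urllib.request

# Hypothetical sketch of a vulnerability notification hook targeting a
# Slack incoming webhook. All field names and the URL are placeholders.

def format_alert(vuln: dict) -> dict:
    """Build a Slack webhook payload for a newly detected vulnerability."""
    return {
        "text": (
            f":rotating_light: New {vuln['severity'].upper()} vulnerability "
            f"{vuln['cve']} detected in {vuln['service']}"
        )
    }

def send_alert(webhook_url: str, vuln: dict) -> None:
    """POST the formatted alert to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(format_alert(vuln)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = format_alert(
    {"cve": "CVE-2021-0001", "severity": "high", "service": "example-svc"}
)
print(payload["text"])
```

A real implementation would add retries, per-team routing and severity filters, but the shape of the integration stays the same.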
Disclaimer: The example images and graphs are not based on past or current real representation of our state of risk or real vulnerabilities. Images are only shown for visual representation of the work performed.
If you enjoyed reading this post and like the presented challenges, keep an eye out for open Application Security Engineering roles here.