Scaling Vulnerability Management across a thousand image repositories

Published in

Agoda Engineering & Design

6 min read · Apr 9, 2021

Container hardening is a proven control and part of any strong IT security strategy. One important component of container hardening is using images without known vulnerabilities. Adding an image scanner and control policies to the CI/CD pipeline, along with policies at deployment chokepoints, prevents vulnerable images from being deployed, but it does not proactively eliminate vulnerabilities in images that are already in the registry.

As part of our active reconnaissance strategy, the DevSecOps team went through all images in Agoda’s registry, looking for vulnerabilities. The results came back with more than a hundred thousand findings. The team now has to come up with a solution that can address the findings with minimal resources.

The Current Scenario

“How can we patch more than 100,000 vulnerability findings in images across more than 1,300 image repositories, without chasing around 100 project teams and 1,000+ engineers?”

We encountered this problem recently, as the DevSecOps team is in the process of assessing the security posture of Agoda. During our last sprint, our goal was to gather data regarding the threat landscape of our container images. We wanted to assess where we are and what relevant security controls we will need to put in place.

After scanning more than 1,300 image repositories in our Harbor registry, we found more than 100,000 vulnerabilities across the latest container images. Most of them are fixable by updating the image packages, using the latest image version, or using a different hardened base image.

Vulnerability Management

The usual vulnerability management cycle follows these six stages:

· Scanning the asset

· Evaluating each finding’s impact on the business

· Prioritizing by severity

· Reporting to the asset owner

· Remediation by an accountable team within security policy SLA

· Validation by the Application Security team that the vulnerability has been fixed
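
As a rough illustration of the prioritizing stage, the sketch below orders findings by severity and, within the same severity, puts fixable findings first. The `Finding` record, its field names, and the severity labels are assumptions made for this example; they are not the output format of any particular scanner.

```python
from dataclasses import dataclass

# Assumed severity labels, ordered from most to least urgent.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "UNKNOWN": 4}


@dataclass(frozen=True)
class Finding:
    """Hypothetical, simplified shape of one scanner finding."""
    repository: str  # e.g. "project/app"
    vuln_id: str     # placeholder identifier, not a real CVE
    severity: str    # one of the SEVERITY_RANK keys
    fixable: bool    # True if a fixed package version is available


def prioritize(findings):
    """Return findings sorted by severity, fixable ones first within a level."""
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_RANK.get(f.severity, len(SEVERITY_RANK)),  # unknown labels last
            not f.fixable,                                      # False sorts before True
        ),
    )
```

Sorting on a tuple keeps the rule easy to change, for example to weigh business impact before severity.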

The IT-Security team has been using this process for some time. While it does work, it has its own set of challenges when applied to this situation.

Prioritizing and reporting 100,000 findings to asset owners requires a sprint or two, and that is only if we can easily identify who the owners are and contact them immediately. This is not always the case. How can we avoid “Where’s Waldo” scenarios and automatically associate the findings with the owner and team?
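
One possible way to associate findings with owners automatically is to derive the owner from the repository name. Harbor repositories are named `project/repository`, so the minimal sketch below assumes that the project segment identifies the owning team. The `PROJECT_OWNERS` mapping and the team names are hypothetical, and nothing here describes how ownership is actually tracked at Agoda.

```python
from collections import defaultdict

# Hypothetical mapping from registry project name to owning team.
PROJECT_OWNERS = {
    "payments": "team-payments",
    "search": "team-search",
}


def owner_of(repository, owners=PROJECT_OWNERS):
    """Resolve the owner from the project part of "project/repository"."""
    project = repository.split("/", 1)[0]
    return owners.get(project, "unassigned")


def group_by_owner(findings, owners=PROJECT_OWNERS):
    """Group (repository, vuln_id) pairs by owning team."""
    grouped = defaultdict(list)
    for repository, vuln_id in findings:
        grouped[owner_of(repository, owners)].append((repository, vuln_id))
    return dict(grouped)
```

Findings that end up under "unassigned" are the ones that would still need a manual owner lookup.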

Depending on the severity, remediating the finding would require the team to make time in their sprint. This causes disruption. How can we reduce the impact and involvement of the team during remediation?

Experiment

The fix requires project teams to do one of the following:

· update the image packages

· use the latest image version

· use a different hardened base image

For the second option, the project team would need to update the version of images in their Dockerfiles, make a pull request, and have the CI/CD do its work to push the hardened image to the image registry.

I looked into how to automate the whole dependency update process while maintaining the existing developer workflow. One solution I looked at is a GitHub app that can automatically identify the latest image versions and create the necessary pull requests. Teams would only need to approve the pull request created by the bot, which goes through the same CI/CD workflow.

One of the popular dependency management apps in the GitHub Marketplace is “Dependabot”, which is owned by GitHub.

To test this out, I used a Dockerfile based on a stale image version, golang:1.12.

I scanned it using two different image scanners (Trivy and Anchore) for benchmarking.

The next step is to log in to Dependabot and add the repository so that the bot knows I want it to manage the repo. Once added, wait for the bot to pick it up from the queue. Depending on how long the queue is, this may take several minutes.

After several minutes, a pull request automatically appeared in my GitHub PR list.

Using the PR’s comment section, I can command Dependabot to perform different actions, such as:

· “@dependabot ignore this major version” will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

· “@dependabot ignore this dependency” will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

· “@dependabot use these assignees” will set the current assignees as the default for future PRs for this repo and language

· “@dependabot use these reviewers” will set the current reviewers as the default for future PRs for this repo and language

· And many more!

Inspecting the pull request’s file changes reveals that Dependabot has detected the stale version and automatically suggested the latest version.
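
The kind of file change such a pull request contains can be sketched in a few lines. The example below rewrites the tag of a base image in a Dockerfile's FROM lines. It is a simplified illustration only, not how Dependabot is implemented: the function name `bump_base_image` is made up for this example, and image digests, registry ports, and ARG-based tags are deliberately not handled.

```python
import re

# Matches "FROM [--platform=...] image[:tag] [AS name]".
# Simplified: digests (@sha256:...) and registries with ports are not handled.
FROM_LINE = re.compile(
    r"^(?P<prefix>\s*FROM\s+(?:--platform=\S+\s+)?)"
    r"(?P<image>[^\s:@]+)"
    r"(?::(?P<tag>[^\s@]+))?"
    r"(?P<rest>.*)$",
    re.IGNORECASE,
)


def bump_base_image(dockerfile_text, image, new_tag):
    """Return dockerfile_text with the tag of `image` in FROM lines set to new_tag."""
    out = []
    for line in dockerfile_text.splitlines():
        m = FROM_LINE.match(line)
        if m and m.group("image") == image:
            line = f"{m.group('prefix')}{image}:{new_tag}{m.group('rest')}"
        out.append(line)
    result = "\n".join(out)
    # splitlines() drops a trailing newline, so restore it if it was there.
    if dockerfile_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    stale = "FROM golang:1.12 AS build\nRUN go build ./...\n"
    print(bump_base_image(stale, "golang", "1.14.2"))
```

Lines that are not FROM instructions, and FROM lines for other images, pass through unchanged.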

Combined with a proper, passing test flow in the CI, you can be confident that the dependency update won’t introduce breaking changes. All that’s left is for someone to review and merge the PR, and the image’s dependency version is up to date.

Validation: Assessing the Results

As you may recall, the last stage of vulnerability management is to validate whether the remediation actually fixed the finding. Using our benchmark scanners, Anchore and Trivy, I scanned the image to see whether the vulnerabilities had been remediated.

Analyzing the results, we can see that updating the image to its latest version does not completely remediate the vulnerabilities. It did significantly reduce the high and medium findings, but not to the extent we were expecting.
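
A validation step like this boils down to comparing two scan results. The minimal sketch below assumes each scan has already been reduced to a dictionary of vulnerability identifier to severity; that shape and the function name `compare_scans` are assumptions for illustration, not the native output of Trivy or Anchore.

```python
from collections import Counter


def compare_scans(before, after):
    """Compare two scans given as {vuln_id: severity} dictionaries.

    Returns three Counters keyed by severity:
    fixed (gone after the update), remaining (present in both scans),
    and introduced (only present after the update).
    """
    fixed = Counter(sev for vid, sev in before.items() if vid not in after)
    remaining = Counter(sev for vid, sev in after.items() if vid in before)
    introduced = Counter(sev for vid, sev in after.items() if vid not in before)
    return fixed, remaining, introduced


if __name__ == "__main__":
    # Placeholder identifiers, not real scan data.
    old_scan = {"VULN-1": "HIGH", "VULN-2": "HIGH", "VULN-3": "MEDIUM"}
    new_scan = {"VULN-2": "HIGH", "VULN-4": "LOW"}
    fixed, remaining, introduced = compare_scans(old_scan, new_scan)
    print("fixed:", dict(fixed))
    print("remaining:", dict(remaining))
    print("introduced:", dict(introduced))
```

Tracking the "introduced" bucket separately matters because a newer image version can bring in findings that the old one did not have.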

This leaves me with two remaining options for remediating the findings: either I stick with the latest image version and update each vulnerable package, or I try a new hardened base image or a distroless image.

Using Anchore and Trivy, I scanned Google’s base distroless debian10 image. As expected, the number of findings is very low. I decided not to include Agoda’s base image, for security reasons.

Scan results comparison for golang 1.12 vs golang 1.14.2 vs gcr.io/distroless/base/debian10

What I found interesting in this experiment is that Trivy and Anchore reported findings that were not being reported by Clair, our current image scanner. Based on this experiment, we will explore how we can run Trivy’s container scanning in parallel with Anchore’s policy-based security and compliance controls.

Conclusion

Is Dependabot the solution for our problem? With the current version, no.

While it is clear that Dependabot and its workflow make it a powerful app for automating dependency upgrades that can scale to meet our goals, it does not completely solve our problem of fixing vulnerable images. However, we will look into extending its capabilities. Maybe we can later add an option that lets the developer choose a different image source rather than sticking with the version update.

Something like: “@dependabot use ORGANIZATION/INPUT_IMAGE:VERSION”
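
To make the idea concrete, here is a minimal sketch of how a bot could parse such a comment. This command is purely hypothetical: Dependabot does not provide it, and the pattern, the `ImageRef` type, and the function name are all invented for this illustration.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical syntax: "@dependabot use ORGANIZATION/INPUT_IMAGE:VERSION".
# This is NOT an existing Dependabot command.
USE_COMMAND = re.compile(
    r"^@dependabot\s+use\s+"
    r"(?P<org>[\w.-]+)/(?P<image>[\w.-]+):(?P<version>[\w.-]+)$"
)


class ImageRef(NamedTuple):
    organization: str
    image: str
    version: str


def parse_use_command(comment: str) -> Optional[ImageRef]:
    """Return the requested image, or None if the comment is not this command."""
    m = USE_COMMAND.match(comment.strip())
    if m is None:
        return None
    return ImageRef(m.group("org"), m.group("image"), m.group("version"))
```

Because the pattern requires the `organization/image:version` shape, existing comments such as "@dependabot use these reviewers" would not be mistaken for it.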

Will we use Dependabot for automating dependency upgrades for development language packages such as JavaScript, .NET, Java, Rust, Python, Go, etc.?

We’ll see.

References:

Dependabot: https://dependabot.com/

Vulnerable repository: https://github.com/jemueldalino/vuln-image

Written by Jemuel Dalino