Driving Secure Releases

Tales from Platform Engineering Program Management

Guidewire Engineering Team
Guidewire Engineering Blog
6 min read · Jun 4, 2024


By: Umang Jain (Director, Program Management) and Yoganand Ghati (Senior Program Manager, Engineering)

This is the third in a series of blog posts about Guidewire's journey from on-prem to the cloud and how program management in our cloud platform engineering team has been instrumental in enabling our teams and stakeholders to deliver consistently and predictably. If you haven't read our first blog, Organizing Work for Simplicity and Improved Collaboration, or our second, Designing and Managing Support Processes for Internal Platform Users, we encourage you to read them first to understand the role of program managers within platform engineering here at Guidewire.

We do not intend to claim that “we have figured it all out” or “this is the way to make the journey and therefore every organization should subscribe to it.” Instead, we intend to share some of the problems we’ve faced and how we’ve solved them so that other program managers facing similar challenges don’t have to start from scratch. Additionally, we are approaching this blog from the perspective of wanting to learn from others as well. While reading this, if you think of alternative suggestions we could explore, we would love to hear about them in your comments. We are a team of adaptive individuals who invest in experimenting with new approaches and are open to new ideas.

Driving Secure Releases

Context: “When evaluating business applications, we look for feature completeness. Security is just an optional feature that is nice to have, but not required.” — said no enterprise procurement officer ever

Applications in the cloud are a gateway to networks and servers, which makes them an ideal attack vector for malicious actors intent on exploiting any security misconfiguration or known vulnerability. The rising number of attacks, and their cost to the organization in both money and brand value, have made security a top priority for organizations committed to their customers' success. Guidewire has been a customer-focused organization since its inception, so we've always taken security seriously. But establishing intent is just the starting point; the real challenge is execution. As we began our move to the cloud, our decisions were guided by the following questions:

  1. What do progress and success look like for this initiative?
  2. How do teams that value security and shift security left behave day to day?
  3. How do we leverage tools to create an effective, efficient, repeatable process?

Solution

Our first experiment within platform engineering to make security part of feature completeness was to explicitly reserve 20% of each sprint's capacity for security issues. By including security in sprint planning, teams could make progress on this effort with fewer decision-making tradeoffs. We planned to track the effort by labeling security-related issues in Jira to give visibility into each team's progress. This experiment taught us an important lesson: metrics are an essential tool for driving team behavior.

When deciding which metrics to measure, it's important to validate that pursuing those metrics will generate the desired behaviors. In this case, our sprint reports showed that teams were spending effort on improving security, yet we were still accruing security debt. When we dug deeper, we found that teams were interpreting and using the 20% security capacity in slightly different ways.

  1. Some teams looked at their security issues and used the allocated time to design long-term solutions that would fix those problems with less toil.
  2. Some teams applied the “security” label only to feature stories requested by InfoSec, not to other security work.
  3. And some teams, who carried much more security debt, addressed only what fit into their 20% capacity for the sprint and left the rest for the next one.

The cumulative result was that our teams met the ask of investing 20% of sprint capacity in security-related issues. However, we didn't move the needle on the thing that really mattered: meeting our security vulnerability SLA with every release.

After reviewing the situation, we took action and changed the metric. If security is the highest priority, then it should trump all other work unless an explicit exception is obtained. Teams were no longer bound to 20% of sprint capacity for tackling their security debt. Success was now measured as zero vulnerabilities in overdue status or resolved outside their SLA in a release.
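To make that gate concrete, here is a minimal sketch of how such a check could be expressed. The `Vulnerability` shape and its field names are our illustration, not the schema of any tool we use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Vulnerability:
    vuln_id: str
    sla_due: date             # SLA deadline for remediation
    resolved: Optional[date]  # None while the vulnerability is still open

def release_gate_passes(vulns: list[Vulnerability], release_date: date) -> bool:
    """True only if no vulnerability is overdue or was resolved past its SLA."""
    for v in vulns:
        if v.resolved is None and release_date > v.sla_due:
            return False  # still open and past its SLA due date
        if v.resolved is not None and v.resolved > v.sla_due:
            return False  # fixed, but outside the SLA window
    return True
```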

To help us in this journey, we partnered with our InfoSec team. They provided a new tool that gives our teams crucial information within a single portal:

  • All the open vulnerabilities for a given team
  • SLA due dates
  • Triage status

Thanks to our partnership with InfoSec and the tools they provided to our engineering teams, it was now easy for us to do the following (a short sketch follows the list):

  • Filter security issues assigned to our team.
  • Sort security issues by their SLA due dates.
  • Validate whether the team has triaged a security issue. Triage here means either a Jira ticket indicating that the team intends to fix the issue in the next few days, or a risk treatment workflow when the issue is a false positive or should be risk accepted.
  • Identify which teams have missed SLAs in the past so that we can work with them to fix the systemic problem.
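As an illustration of the kind of worklist this enables, here is a small sketch. It assumes the portal can export open vulnerabilities as records with a team, an SLA due date, and a link to a triage ticket; all field names below are hypothetical.

```python
from datetime import date

# Hypothetical export format; the real portal's schema will differ.
open_vulns = [
    {"id": "CVE-2024-0001", "team": "platform-core", "sla_due": "2024-06-10", "triage_ticket": "SEC-123"},
    {"id": "CVE-2024-0002", "team": "platform-core", "sla_due": "2024-06-03", "triage_ticket": None},
    {"id": "CVE-2024-0003", "team": "observability", "sla_due": "2024-06-20", "triage_ticket": "SEC-130"},
]

def team_worklist(vulns, team):
    """Open vulnerabilities for one team, most urgent SLA first."""
    return sorted((v for v in vulns if v["team"] == team), key=lambda v: v["sla_due"])

for v in team_worklist(open_vulns, "platform-core"):
    overdue = date.fromisoformat(v["sla_due"]) < date.today()
    flags = (" OVERDUE" if overdue else "") + ("" if v["triage_ticket"] else " NEEDS TRIAGE")
    print(f"{v['id']}: due {v['sla_due']}{flags}")
```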

When we combined our new approach to metrics, our partnership with InfoSec, and our focus on continuous improvement, teams started leveraging automation to improve how they address security debt. One automation improvement we adopted changed how we consume and upgrade Open Source Software (OSS), an integral part of modern cloud-native applications.

OSS arguably becomes more secure as more organizations consume it, because each organization's contributions add to its evolution and success. For example, suppose a vulnerability is discovered in an OSS component that many organizations use. Instead of just waiting for the component's maintainers to fix it, engineers from the consuming organizations work together on a fix.

As new patches are issued, it is vital to keep OSS updated to utilize new features and benefit from the improved security posture. If you don't keep up with patches, vulnerabilities in these OSS components will contribute significantly to your security debt.

Previously, this upgrade process was manual, which made keeping our tech stack updated an increasing burden. To reduce time spent on maintenance, our team designed an automation that searches for newer versions of the OSS components we use:

  • IF the script finds a new version of an OSS component,
  • THEN it scans that new version to evaluate whether upgrading would improve our security posture;
  • IF YES (the security posture would improve), the script runs a new build using the new version and validates regression;
  • IF the build passes, the component is upgraded;
  • ELSE the component is not upgraded.

When the run completes, the process posts a summary to our team's Slack channel listing which OSS image versions were updated as part of that run (a sketch of this flow follows).
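Here is a minimal sketch of that flow. Every helper (`find_newer_version`, `vuln_count`, `build_and_test`, `post_to_slack`) is a stub standing in for whatever version feed, scanner, CI pipeline, and Slack integration a team actually uses; only the control flow mirrors the steps above.

```python
def find_newer_version(component: str, current: str) -> str | None:
    """Stub: in practice, query the component's registry or release feed."""
    return None

def vuln_count(component: str, version: str) -> int:
    """Stub: in practice, run a vulnerability scan and count known issues."""
    return 0

def build_and_test(component: str, version: str) -> bool:
    """Stub: in practice, trigger a CI build and the regression suite."""
    return False

def post_to_slack(message: str) -> None:
    """Stub: in practice, call a Slack webhook; here we just print."""
    print(message)

def upgrade_run(components: dict[str, str]) -> None:
    summary = []
    for name, current in components.items():
        new = find_newer_version(name, current)
        if new is None:
            continue                              # no newer version published
        # Scan the candidate: proceed only if it improves our security posture.
        if vuln_count(name, new) >= vuln_count(name, current):
            continue                              # not a security improvement
        if build_and_test(name, new):             # new build + regression check
            components[name] = new                # build passed: upgrade
            summary.append(f"{name}: {current} -> {new}")
        # else: build failed, the component stays on its current version
    if summary:
        post_to_slack("OSS versions upgraded in this run:\n" + "\n".join(summary))

upgrade_run({"openssl": "3.0.7", "zlib": "1.2.13"})
```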

This automation eliminated the burden of manually checking for updates and dramatically reduced the security debt our teams need to triage.

Note: Multiple open source products offer these capabilities, and we are exploring them as our landscape continues to change.

Measure of Success

  1. SLA compliance on our security debt is above 95%.
  2. Increased efficiency in sprint planning and feature prioritization.
  3. Less developer frustration when identifying which security backlog items need action.
  4. Improved collaboration between Engineers and InfoSec.

Key Takeaways

  • Allocating a fixed share of sprint capacity to security issues does not work on its own; measure the outcome you actually care about (for us, zero vulnerabilities out of SLA per release).
  • When it comes to security management, your choice of tool matters. To minimize conflicts and improve outcomes, ensure that your engineering teams and InfoSec are referencing the same dashboard.
  • Vulnerability remediation should be run like a program that needs to be orchestrated and actively monitored until the teams have developed the necessary automation.

We hope you enjoyed reading this post. If you have questions or feedback, please leave them in the comments. We are constantly experimenting and learning new things, so be on the lookout for more stories like this.

If you want to work on our Engineering teams building cutting-edge cloud technologies that make Guidewire the cloud leader in P&C insurance, please apply at https://careers.guidewire.com.
