Automating Security — Turning it up to 11!

Richard Haigh
Compliance at Velocity

--

BY RICHARD HAIGH

My name is Rich, and I am an automation addict — there, I’ve said it! For the last decade I have been evangelising and putting automation into action to help provide repeatable and reproducible (more on these two important words later) systems to deliver code and monitor systems.

If you looked at the projects I have delivered you would suggest my main drivers have been to provide low-risk systems that can operate at scale — satisfying both the Devs’ appetite for pace, and the Ops’ appetite for stability. On one level, this is true. However, in reality I have been working to make people happier — saving them from day-to-day drudgery and freeing up their time to focus on interesting and challenging work. My recipe is fairly simple. Find a process (one that gets repeated a lot) and codify it where you can — you know the story.

I’ve had a lot of success over the years putting these sorts of systems in place, but there has always been one area that has eluded me, my automation nemesis: security and governance.

This changed in March 2014 after I joined Betfair. I was new to the business and working hard to understand the key drivers and blockers for automating their software delivery in a standard way. For context, Betfair runs a microservices estate with some 300 components and an engineering team numbering in the hundreds that is split across several core products. Naturally, during my investigation, I crossed paths with the security team. I expected resistance; my expectations could not have been more wrong.

Betfair’s security team wanted automation. They wanted it as much as the devs and for all the right reasons:

  • They wanted to run security testing the same way every time. A pass or fail should be repeatable and there should be tangible reasons for the result. There should be nothing that was a subjective pass one day and a fail the next, nor should a test execute differently from one run to the next.
  • Regardless of who was running the tests, the security team wanted a consistent result. This would allow them to share testing frameworks across teams and compare results across teams — apples for apples.
  • They wanted to provide instant feedback to the delivery teams, other security teams and management on the status of each test. No waiting for reports to run or aggregate, no parallel, slow-running manual effort. They wanted an instant pass/warn/fail.

To this end, the team had been working on a tool called the ‘Application Security Risk Calculator’ to replace a form-based manual risk assessment. This clever idea had the concepts of inherent and dynamic risk at its core. A certain component would carry some inherent risk that would never change. An example might be that it exists because of a certain regulatory requirement, or that it handles personally-identifiable information. Delivering that component would carry some dynamic risk such as the number of lines of code changed, the time of day or the unit test coverage. Together, these two concepts gave an overall risk score to the deployment. What was even more powerful was that by using a consistent test framework the security team could set a standard across the business for delivery risk. The tool would gather the information and report it into the configuration management database (CMDB) where it could be stored and reported on.
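To make the two concepts concrete, here is a minimal sketch of how inherent and dynamic risk might combine into a deployment score. The field names, weights and caps are my own illustrative assumptions, not the actual Application Security Risk Calculator:

```python
# Illustrative sketch only: the weights and factors below are assumptions,
# not the real Application Security Risk Calculator.
from dataclasses import dataclass


@dataclass
class Component:
    """Things that never change between deployments of this component."""
    name: str
    handles_pii: bool = False   # inherent: touches personally-identifiable information
    regulatory: bool = False    # inherent: exists because of a regulatory requirement


@dataclass
class Change:
    """Things specific to this particular delivery."""
    lines_changed: int          # dynamic: size of the diff
    unit_test_coverage: float   # dynamic: 0.0 to 1.0
    out_of_hours: bool = False  # dynamic: time of day


def inherent_risk(c: Component) -> int:
    """Risk the component always carries, regardless of the change."""
    return 30 * c.handles_pii + 20 * c.regulatory


def dynamic_risk(ch: Change) -> int:
    """Risk carried by this specific delivery."""
    score = min(ch.lines_changed // 100, 30)          # bigger diffs score higher, capped
    score += round((1.0 - ch.unit_test_coverage) * 30)  # poor coverage scores higher
    score += 10 if ch.out_of_hours else 0
    return score


def deployment_risk(c: Component, ch: Change) -> int:
    """Overall score for the deployment, as would be reported to the CMDB."""
    return inherent_risk(c) + dynamic_risk(ch)
```

For example, a hypothetical PII-handling `wallet` component shipping a 250-line change with 80% coverage would score 30 inherent plus 8 dynamic. The useful property is the split itself: the inherent part is computed once per component, while the dynamic part is recomputed per delivery.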

For me this was great news — but it needed to go further — it needed to be embedded as a component within the delivery pipelines themselves. This would offer instant feedback but also allow a level of interaction with the delivery pipeline itself. I pitched my thoughts to the team, selling them on the following benefits:

  • If the test fails then we can let the team know through their delivery tools ASAP, much in the same way you want your CI testing to be fast so the committing team does not context switch. Let them know it failed, the reason, and where to look to fix the problem.
  • By embedding this sort of process into the delivery pipelines themselves, you can block a deployment if it fails the delivery risk assessment whilst letting all other changes sail through unimpeded. It gives the level of assurance and control needed whilst still allowing a risk-based decision around when and how we do manual assessments (within or outside of the pipeline). It gets better though — you can warn if a risk rating is getting too high without blocking the deployment. Not only are delivery managers informed that they need to take medium-term action but they can also look back and see the change in score over previous releases.
  • Instead of trying to link change tickets to deliveries to manual risk ratings, you can simply export the database and show the risk scores for each and every deployment across the business.
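The pass/warn/fail behaviour described above can be sketched as a simple gate function in the pipeline. The thresholds here are placeholder assumptions; in practice they would be set by the security team:

```python
# Hypothetical pipeline gate; the threshold values are illustrative assumptions.
def gate(risk_score: int, warn_at: int = 40, fail_at: int = 70) -> str:
    """Return an instant verdict the delivery pipeline can act on.

    'fail' blocks the deployment and tells the team why; 'warn' lets the
    change through but flags delivery managers to take medium-term action;
    'pass' sails through unimpeded.
    """
    if risk_score >= fail_at:
        return "fail"
    if risk_score >= warn_at:
        return "warn"
    return "pass"
```

Because the verdict is computed from the stored risk score rather than a manual assessment, the same score that blocks or warns in the pipeline is the one later exported from the database for business-wide reporting.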

There was one final advantage which, to me, made this approach ground-breaking. With this sort of tooling, which gave us a standard risk framework and a standard control feature, we could go one step further. We could introduce a risk-volume dial. On days where we wished to limit our delivery risk (large sporting events, periods of high change elsewhere in the business, Christmas) we could turn the dial to a low setting and reduce the allowable risk for each delivery pipeline. This would stop some of the inherently high-risk pipelines regardless of dynamic risk and give less wiggle room for others — exactly what the business does via a heightened change awareness period. When life got back to normal, we could twist the dial back up to 11 and let changes flow through the system faster than ever — providing a level of assurance without impacting the pace of the delivery teams.
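The dial idea can be sketched as a single scaling factor applied to the fail threshold across every pipeline. The 1–11 range and the linear scaling are my assumptions for illustration:

```python
# Sketch of the 'risk-volume dial'; the scaling scheme is an illustrative assumption.
def dialed_threshold(base_fail_at: int, dial: int) -> int:
    """Shrink the allowable-risk threshold when the dial is turned down.

    Dial runs 1-11: 11 is normal flow, lower settings tighten every
    pipeline at once during heightened change-awareness periods.
    """
    if not 1 <= dial <= 11:
        raise ValueError("dial must be between 1 and 11")
    return base_fail_at * dial // 11


def gate_with_dial(risk_score: int, dial: int, base_fail_at: int = 70) -> str:
    """Block any deployment whose score exceeds the dialed-down threshold."""
    return "fail" if risk_score >= dialed_threshold(base_fail_at, dial) else "pass"
```

At dial 11 the threshold is unchanged, so changes flow as normal; at a low setting the threshold drops below the inherent risk of the riskiest components, stopping them regardless of how small the dynamic risk of a given change is.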

And that is exactly what Betfair has done with the vast majority of its components that have the Application Security Risk Calculator in their pipelines. The dev teams no longer have to fill in forms and the security teams no longer need to repeat manual testing and deal with subjective data — pace and stability co-exist. And, as important as ever, we have manufactured happiness as a by-product of automation. Our talented teams are now able to focus their time on more challenging activities. Betfair is not alone, however. The benefits of this approach are starting to be appreciated elsewhere. Some of the DevOps household names — such as Chef — are working towards ways to integrate similar ‘compliance at speed’ features into their tooling to allow almost everyone to benefit by automating their main security and compliance testing.

Richard Haigh is the Global Head of Reliability and Operations at Betfair

--
