Secure DevOps #2 — Environment

Leigh · Published in SecurityBytes · Jan 31, 2017 · 10 min read

This is the third in a series of posts that will outline a framework I’ve developed and successfully deployed for applying effective security to a DevOps environment.

Part 0 — Setting The Scene
Part 1 — Identity
Part 2 — Environment
Part 3 — Secure SDLC
Part 4 — Embedded Expertise

Now we have Identity nailed and surreptitiously acting as a security control — and no one suspects a thing — we can move on to doing some things that the DevOps people don’t like, so that we can get back to being Business Prevention Officers.

Just kidding. Here goes.

Hands-off Dev

Yeah, you heard right. Hands off.

Leave Dev alone.

The fact of the matter is that Dev is where smart people do smart things and having them jump through production-like hoops from 15 years ago is wholly unnecessary.

What’s needed here is a deal, a deal between security and DevOps teams. I’ve made deals, I’ve made the best deals. Great deals. So strong, so powerful. Let’s make security great again.

The deal is this:

  • No sensitive data in Dev
  • Dev separate from Prod

= security leaves Dev alone.

The Change & Release section below will detail one slight addition to this, namely knowing what is leaving Dev and on its way into Production. But this is buoyed by our Identity cornerstone too: we can attribute everything to an individual, which gives us accountability (and non-repudiation, should it ever come to that).

What this deal essentially boils down to is focusing the security effort where the security effort is needed: where we’ve got valuable assets to protect, rather than having the security function waste its time chasing its tail over a constant stream of alerts and changes to an environment pretty much defined by its need to change.

This goes a long way to building a working relationship between security and the people we’re supposed to be protecting. This tells them that we understand what they’re doing, we understand that they’re capable of looking after themselves, but crucially it also allows us to not materially weaken any of our controls. (There’s another blogpost to be made one day about the wasteland that is security’s dedication to failing to secure the endpoint whilst fretting about it exhaustively and yet, paradoxically, selectively…perhaps one for the podcast, oldmanpete?)

Change & Release

This is already happening. We’re not proposing anything new here. In fact, security is almost certainly a part of this process anyway. But what you may find is that security’s part in this is a bit…token.

Your change and release teams are there to protect service and to keep the existing stuff working whilst new stuff is added. They do not want security things going wrong, so they get security sign off for new things going in.

Existing implementations of this control that I’ve seen have been mostly illusory though. The market forces of a business quickly undermine the gamut of checks that need to be made — unless those checks can be made quickly and cheaply.

What we need to do here is leverage the existing choke point between Dev and Production. We do this by implementing a known-good build bypass, some on-demand scanning, and a simple scorecard leading to a pass/fail outcome.

We’ll explore known-good builds in the next section, below, and we’ll expand on on-demand scanning in the next article, but for now let’s concentrate on how the overall control will work.

The Change and Release teams want to get things released as cleanly and safely as possible. To them, the difference between a software flaw and a security flaw is uninteresting: they’re equally willing to reject a new release for either failure, or, more accurately, they’d equally rather not have to reject a release for either failure.

Building the security scorecard with an (ideally automated) set of tests leading to a pass/fail result means two things. Firstly, if it fails then it’s not ‘security says no’, it’s a failure of the known release criteria. Secondly, by having your security controls tightly coupled to the release process, widely known, publicised and, crucially, understood by everyone, we allow teams to help themselves by clearly laying out what the minimum expectations are.
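To make that concrete, here’s a minimal sketch of what an automated scorecard could look like. The check names, and the rule that any single failure fails the release, are illustrative assumptions rather than a prescribed implementation; in practice each check would call out to your own build and scanning tooling.

```python
# A minimal, illustrative release scorecard: every check must pass.
# The check names and lambdas are placeholders for real automated tests.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Check:
    name: str
    run: Callable[[], bool]   # returns True when the release criterion is met

def release_scorecard(checks: List[Check]) -> bool:
    """Run every check, print a simple scorecard, and fail the release on any failure."""
    all_passed = True
    for check in checks:
        ok = check.run()
        print(f"{'PASS' if ok else 'FAIL'}  {check.name}")
        all_passed = all_passed and ok
    return all_passed

if __name__ == "__main__":
    checks = [
        Check("Built from an approved Known Good Build manifest", lambda: True),
        Check("No critical findings in the latest vulnerability scan", lambda: True),
        Check("Hardening baseline applied", lambda: True),
    ]
    if not release_scorecard(checks):
        raise SystemExit("Release rejected: known release criteria not met")
```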

Baseline Builds — Automated

Known-good builds are one of the strongest benefits that we can take from the rapid, automated deployment of infrastructure and solutions through the DevOps practice.

Baseline builds are nothing new in security. You almost certainly already have them, and you almost certainly have a harder time than you should over implementing what amounts to the standardisation of IT.

The Center for Internet Security offers what tends to be a comprehensive set of baseline controls for most systems, and is a good place to start if you don’t already have build standards.

But what is new here is that since the infrastructure we’re getting is coming through in an automated manner, it’s a very natural leap for us to automate the security of these builds in a standard way by default too.

It’s important here that we get in on the ground floor though. This is another example of a control needing to be transparent to be effective, and of a control needing to be there by default rather than relying on someone to implement the control successfully each time. Operating system hardening can be easily scripted and should be built into the build pipeline. Whatever the standard build you agree on, based on your own organisation, you should be confident that it is being consistently applied.
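As an illustration only, here’s a small sketch of what scripting one slice of that hardening might look like as a build-pipeline step. The two sshd settings are example values standing in for whatever standard your organisation agrees; the file path is the usual one on most Linux distributions.

```python
# Illustrative hardening step for an image build: enforce a couple of
# example sshd settings. The managed settings here stand in for whatever
# build standard your organisation has agreed.
from pathlib import Path

HARDENING_SETTINGS = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
}

def harden_sshd(config_path: str = "/etc/ssh/sshd_config") -> None:
    """Rewrite sshd_config so the managed settings hold the declared values."""
    path = Path(config_path)
    kept = []
    for line in path.read_text().splitlines():
        parts = line.split()
        key = parts[0] if parts else ""
        if key not in HARDENING_SETTINGS:
            kept.append(line)              # leave unmanaged settings untouched
    kept += [f"{key} {value}" for key, value in HARDENING_SETTINGS.items()]
    path.write_text("\n".join(kept) + "\n")

if __name__ == "__main__":
    harden_sshd()   # run as one step in the image build, before the image is baked
```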

This is another scenario where source code control can be invaluable, as the configuration can be stored as a declarative list of settings, and if it changes, there’s an audit trail. Moreover, we should be scanning the infrastructure to provide a belt-and-braces control, since this is another one of those fundamental things that we need to get right. If we don’t know what our environment is supposed to look like, how are we supposed to tell when something is different? More on source code control and scanning in the next article, though.

The Known Good Build paradigm isn’t just about us getting the basics right. It allows us to front-load security effort, and minimise wasted effort too. And when we get to implementing at-scale, this minimised effort can represent a substantial saving.

The Known Good Build should encompass Operating System, Hardening, and Application.

Here’s how it works.

The versions of all the software in use get recorded in a manifest:

  • RHEL : 6.8
  • PHP : 4.3.2
  • Tomcat : 8.5.11
  • CIS : 2.0
  • Java : 1.7.0
  • Amazing Super Money Making Application : 1.0
  • etc
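As a sketch, that manifest can be recorded as plain data and checked into source control alongside the build pipeline. The file name and the fingerprinting below are my assumptions; the point is simply that the manifest is versioned and can be referred to unambiguously later.

```python
# The build manifest as plain, version-controllable data, plus a stable
# fingerprint that later steps can use to refer to this exact combination.
import hashlib
import json

MANIFEST = {
    "RHEL": "6.8",
    "PHP": "4.3.2",
    "Tomcat": "8.5.11",
    "CIS": "2.0",
    "Java": "1.7.0",
    "Amazing Super Money Making Application": "1.0",
}

def manifest_fingerprint(manifest: dict) -> str:
    """A stable hash of the manifest, handy for tagging builds and scan results."""
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

if __name__ == "__main__":
    with open("manifest.json", "w") as fh:       # this file lives in source control
        json.dump(MANIFEST, fh, indent=2, sort_keys=True)
    print("manifest fingerprint:", manifest_fingerprint(MANIFEST))
```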

Then trigger an automated build for that virtual machine. Once the machine is ready, perform all manner of security testing on it — use your vulnerability scanning software, arrange penetration tests against it, etc.

For every finding returned, get it addressed. Is there an unnecessary service running? Update the hardening. Is there a bug in the software or your application? Fix it, upgrade it, or work out how else you might mitigate it.

Then tear down, rebuild, and re-scan.

Eventually you will get to the point where the results of your scanning reach a consistent set of results which you’re happy with — because the risk is either below your threshold or mitigated elsewhere. And this is your Known Good Build.
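In rough Python, the loop looks something like the sketch below. The build, scan, tear-down, and fix helpers are stubs standing in for whatever orchestration and scanning tooling you actually use; only the shape of the loop matters.

```python
# Sketch of the tear-down / rebuild / re-scan loop that converges on a
# Known Good Build. The helpers are stubs for your real tooling.

def build_vm(manifest: dict) -> str:
    return "vm-0001"                      # stub: trigger your automated build

def run_scans(vm: str) -> list:
    return []                             # stub: vuln scan and pen test findings

def tear_down(vm: str) -> None:
    pass                                  # stub: destroy the instance

def address_finding(finding: dict) -> None:
    pass                                  # stub: harden, patch, upgrade, or mitigate

def establish_known_good_build(manifest: dict, risk_threshold: int = 0) -> dict:
    """Rebuild and re-scan until the findings are stable and acceptable."""
    while True:
        vm = build_vm(manifest)
        findings = run_scans(vm)
        tear_down(vm)
        actionable = [f for f in findings
                      if f["severity"] > risk_threshold and not f.get("mitigated")]
        if not actionable:
            # Consistent, acceptable results: this manifest is a Known Good Build.
            return {"manifest": manifest, "accepted_findings": findings}
        for finding in actionable:
            address_finding(finding)
```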

What was all this effort for? Well, since you did this well in advance of any need to deploy to a production environment, when it comes to the time to make an actual deployment, if the same manifest is being used (…which can be version controlled too…) then exactly how much involvement do security need to have at the point of deployment for that first server?

Nil.

And what about the second? Third? Hundredth?

Still nil.

Because you’ve front-loaded your basic compliance checks for this infrastructure, platform, and application. All of a sudden, security are completely off the critical path for new deployments. The magic of a sub-1-hour deployment of a new stack of infrastructure doesn’t bring with it any need to be concerned, or any need to spend so much as a minute checking.
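One way to make that ‘nil’ enforceable rather than aspirational is to have the pipeline check the deployment manifest against the fingerprints of builds security has already approved. The approved_builds.json file and its layout below are assumptions for illustration.

```python
# Sketch of a deployment-time gate: if this exact manifest has already been
# approved as a Known Good Build, no security involvement is required.
import hashlib
import json

def fingerprint(manifest: dict) -> str:
    return hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()

def security_review_needed(manifest: dict,
                           approved_path: str = "approved_builds.json") -> bool:
    """Return False when this manifest's fingerprint is already on the approved list."""
    try:
        with open(approved_path) as fh:
            approved = set(json.load(fh))   # a list of approved fingerprints
    except FileNotFoundError:
        approved = set()
    return fingerprint(manifest) not in approved
```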

And when the version of something changes? Well, you re-scan, you re-harden, and you return to your Known Good Build as quickly as possible. And the beauty is that everyone can understand this process well in advance so it needn’t come as a surprise to anyone — particularly at the point where a service is minutes away from going live.

No More Tweaking

One of the most powerful fallouts from the infrastructure-as-code model is, as described above, the idea that everything gets built in a set, known, and approved manner, and therefore works in a set, known, and predictable way.

Suddenly, the operational nightmare of one of five servers behaving oddly is removed as a factor: all the servers are precisely as defined, so if one server is doing something different from the others, it’s because it’s being subjected to something different from the rest, not because it might be misconfigured.

A very learned former colleague of mine described this as “treating your servers like cattle, not pets”. This is the best way I can think to describe it, so I’m going to stick with that analogy.

Your standard support paradigm is one where something is working in production, it stops working at 3am, you jump on the box and fix it swearing that you’ll figure out the root cause in the morning, but then get sidetracked forevermore such that the root cause analysis never happens. And why should it? You’ve fixed the problem, it’s still working, and there are other things on fire. Que sera.

Your infrastructure being defined and deployed via code completely kills this scenario. Furthermore, it saves DevOps from being an untenable free-for-all of intertwined support nightmares. If everything needs a bespoke support profile then support costs are going to skyrocket faster than your ability to innovate.

So, to the point.

Your Known Good Builds aren’t just a security thing. They’re the only thing. You’re going to get your orchestration software to alert on deviations from its standard build for you at a minimum and you’re going to choose to have it enforce your standard build where you can.
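Mature orchestration and configuration-management tools do this deviation detection natively, but the idea is simple enough to sketch: compare what a server is actually running against what the build standard declares, and alert on any difference. The data shapes below are illustrative.

```python
# Illustrative drift check: compare observed component versions on a server
# against the declared build standard and flag anything that deviates.

def detect_drift(declared: dict, observed: dict) -> dict:
    """Return {component: (declared_version, observed_version)} for any deviation."""
    return {component: (version, observed.get(component))
            for component, version in declared.items()
            if observed.get(component) != version}

if __name__ == "__main__":
    declared = {"RHEL": "6.8", "Tomcat": "8.5.11"}
    observed = {"RHEL": "6.8", "Tomcat": "8.5.12"}   # someone tweaked Tomcat in place
    for component, (want, got) in detect_drift(declared, observed).items():
        print(f"ALERT: {component} is {got}, the build standard says {want}")
```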

Why is this viable?

Because you’re no longer building your solutions at the point of delivery. When you’re ready to deliver, your solutions have existed through your Dev and Test environments for long enough to have been thoroughly tested and approved. Tweaking your servers at this point would not only break your production pipeline, it would be perverse: it would undermine the whole point of the DevOps capability you’re building.

This represents the first real compromise that an organisation looking to implement DevOps is going to need to make. And it will be a hard compromise — all of your existing Operations instincts are going to be questioned and undermined. But that’s the choice we’ve made in implementing DevOps and we need to position ourselves to make the most out of it.

No More Patching

How often do you find yourself pulling your hair out in meetings because your security guys are telling you that you shouldn’t be patching your live deployments?

Bane of your life, right?

Quit patching in live! Period.

I hope by this stage it’s obvious why (even if the alternative isn’t yet obvious). If you’re patching in live then you’re still babysitting your deployments. You’re still treating them like pets, and you’re still doing your security activities to the right of the line, when we’re working to move them leftwards.

Remember: if you’re trying to do your security at the point of delivery, then you’re not going to have enough time to succeed in this new world.

So how do we patch? Continuous rolling releases.

Now, this has a number of moving parts:

At the point of delivery, your operating system and applications all pick up the latest patches available in your internal repository and have them applied to their build.

This pre-supposes that you have a patching cycle, of course. If you don’t have a patching cycle, well, what can I say? Get a patching cycle. Patching is still hard, but in a DevOps environment it can be massively simplified. Instead of having to run patch tests against hundreds of types and permutations of servers, you’re running them against a tiny number — only they’re deployed on a much grander scale.

Whilst your efforts may impact hundreds of servers, there are only a handful of configurations, so the effort of getting the patches ready is minimised.

Your orchestration software needs to be set to cycle all of your infrastructure regularly. I’d suggest no longer than every 30 days. This means that you’re always running on fresh servers holding the latest patches for everything on them.
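The cycling rule itself is trivial to express. In the sketch below, the instance records and the 30-day maximum are illustrative; the rebuild itself would be a call into your orchestration platform.

```python
# Sketch of the regular cycling rule: anything older than the agreed maximum
# age gets rebuilt from the current Known Good Build.
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)

def instances_to_recycle(instances: list, now: datetime) -> list:
    """Return the instances whose age exceeds the agreed maximum."""
    return [i for i in instances if now - i["launched"] > MAX_AGE]

if __name__ == "__main__":
    now = datetime(2017, 1, 31)
    fleet = [
        {"name": "web-01", "launched": datetime(2017, 1, 25)},
        {"name": "web-02", "launched": datetime(2016, 12, 1)},
    ]
    for instance in instances_to_recycle(fleet, now):
        print(f"recycle {instance['name']}: rebuild from the current Known Good Build")
```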

With IaaS and PaaS, this is entirely achievable — all of the major providers offer solutions. Even with in-house deployments this is achievable with a number of vendors already having completely opened their APIs for orchestration.

Patching should be a transparent control that is there by default. I see a theme developing here.

Summary

With Identity nailed down, what we need next is consistency. The controls we’ve built in here all utilise the identity provision we’ve put in place, and build on it by providing us with consistency and predictability.

Firstly, we’ve isolated environments which change rapidly by definition and left them to their own devices. Secondly, we’ve leveraged an existing control mechanism to give us some surety that the crazy we’ve enabled remains corralled, and only the things we’re happy with can ‘get out’ of that environment.

Finally, we’ve leveraged some of the key technologies we’re playing with to provide us with something special: scalable, secured, commodity IaaS and PaaS with controls built into their very core, such that they can be deployed without oversight and without constant validation. Such that they can be deployed by practically anyone in any circumstances and still not cost the security team any sleep.
