Continuous Security Compliance: OS Hardening

Financial Engines TechBlog
Nov 27, 2018

By: Alex Demitri, Daniel Richardson

At Edelman Financial Engines, we run a vast array of platforms and servers. Our infrastructure is elegant yet complex, which makes reliable replication paramount. The challenges of maintaining an immutable infrastructure also raise questions of scale and security.

While it is obviously crucial for us to keep our infrastructure replicable and codified, we also want to make sure we can consistently test and ship security remediations in the fastest and most reliable way possible.

In AWS (or any cloud provider for that matter), it all starts with a strong and secure foundation: the image bakery.

The bakery

The problem at hand was clear: patching, and maintaining an agreed-upon and respected patch compliance policy, is one of the cornerstones of securing infrastructure.

To increase flexibility and speed up implementation and execution, agile companies rely on infrastructure-as-code to ship configuration changes. But what about security remediations?

And how do we reliably test some of these delicate structural changes at different runtime levels, and at the operating system level? How can we say, with the highest degree of certainty, that we can rapidly revert to different granular points of our image configuration?

Back in my on-prem cloud virtualization days, sure, we relied heavily on snapshots. But more “agile/DevOps-ish” needs are pushing us to respond quickly and with a high degree of confidence when doing rollbacks. Moreover, we want to be able to operate in a “surgical” way and remove the exact “configuration culprit” where needed. Rollbacks shouldn’t impede the speed and progress we have made iteratively, hence the limitations of snapshots.

This is where HashiCorp’s Packer makes a significant difference and enables us to reinvent this step of the process.

Enter HashiCorp Packer to help:

{
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": ""
  },
  "builders": [{
    "type": "amazon-ebs",
    "access_key": "{{user `aws_access_key`}}",
    "secret_key": "{{user `aws_secret_key`}}",
    "region": "{{user `aws_region`}}",
    "vpc_id": "{{user `vpc`}}",
    "subnet_id": "{{user `subnet`}}",
    "security_group_id": "{{user `security_group`}}",
    "instance_type": "your.instance.size",
    "source_ami_filter": {
      "filters": {
        "virtualization-type": "hvm",
        "name": "*ubuntu-trusty-14.04-amd64-server*",
        "root-device-type": "ebs"
      },
      "owners": ["your.number"],
      "most_recent": true
    },
    "ssh_username": "ubuntu",
    "ami_name": "desired-AMI-name",
    "encrypt_boot": false,
    "tags": {
      "packer_timestamp": "{{timestamp}}",
      "launch_date": "{{isotime \"2006-01-02\"}}",
      "packer_managed": "true",
      "os_version": "Ubuntu",
      "release": "14.04",
      "env": "packer",
      "owner": "you@yourcompany.com"
    }
  }],
  "provisioners": [
    {
      "type": "shell",
      "inline": [ "sleep 30" ]
    },
    {
      "type": "shell",
      "inline": [
        "echo [INFO] Putting `cat /etc/hostname` in /etc/hosts",
        "sudo sh -c 'cat /etc/hosts > /tmp/etc_hosts'",
        "sudo sh -c 'echo 127.0.0.1 `cat /etc/hostname` >> /etc/hosts'",
        "cat /etc/hosts"
      ]
    },
    {
      "type": "shell",
      "execute_command": "sudo {{.Vars}} sh {{.Path}}",
      "scripts": [
        "scripts/install_packages.sh",
        "scripts/update_os.sh",
        "scripts/security/cis-scripts.sh"
      ]
    },
    {
      "type": "shell",
      "inline": [
        "echo [INFO] Restoring Original",
        "sudo sh -c 'cat /tmp/etc_hosts > /etc/hosts'",
        "cat /etc/hosts",
        "sudo sh -c 'rm -f /etc/facter/facts.d/server_facts.txt'"
      ]
    },
    {
      "type": "file",
      "source": "scripts/security/files/cis-unmount.sh",
      "destination": "/tmp/cis-unmount.sh"
    },
    {
      "type": "file",
      "source": "scripts/security/files/98-cis-settings.conf",
      "destination": "/tmp/98-cis-settings.conf"
    },
    {
      "type": "file",
      "source": "scripts/security/files/securetty",
      "destination": "/tmp/securetty"
    },
    {
      "type": "shell",
      "inline": [
        "echo [INFO] Copying filesystem unmounts",
        "sudo sh -c 'cp /tmp/cis-unmount.sh /etc/modprobe.d/cis-unmount.sh'",
        "sudo sh -c 'cp /tmp/98-cis-settings.conf /etc/sysctl.d/98-cis-settings.conf'",
        "sudo sh -c 'cp /tmp/securetty /etc/securetty'"
      ]
    }
  ]
}

Packer allows you to choose a starting point image from AWS (described under source_ami_filter) and pile on the changes you desire at every step.
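The file provisioners above stage a few small hardening fragments that later get copied into place under /etc. As an illustrative sketch only (these are common CIS-style settings, not our exact file contents), they might look like:

# cis-unmount.sh -- disable uncommon filesystems, per the CIS filesystem recommendations
install cramfs /bin/true
install freevxfs /bin/true
install udf /bin/true

# 98-cis-settings.conf -- kernel hardening values applied via sysctl
fs.suid_dumpable = 0
kernel.randomize_va_space = 2
net.ipv4.conf.all.send_redirects = 0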

Once changes are completed and checked in to Git, Jenkins automation kicks off the Packer build. Packer spins up a temporary EC2 instance, SSHes into it, applies all of the changes described in code, and copies files over via provisioners. Once the process completes without errors, the temporary machine is automatically shut down and packaged into an AMI in your AWS account, ready to be used.
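A minimal sketch of what that Jenkins stage can run (template and variable file names here are assumptions, not our exact job definition):

# Validate the template first, then bake; Jenkins fails the stage on a non-zero exit code.
packer validate -var-file=vars/prod.json hardened-ubuntu.json
packer build    -var-file=vars/prod.json hardened-ubuntu.json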

With a detailed and well documented list of configuration changes in code, we can quickly identify what needs to be added and/or removed to prevent vulnerable infrastructure.

The final step is to patch the image to the latest version of packages from our repositories, and to take an inventory scan via AWS SSM that we ingest and use as a baseline to track compliance.
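As a hedged sketch, the patching step can be as simple as an apt upgrade inside update_os.sh, and the SSM inventory baseline can be gathered with the managed AWS-GatherSoftwareInventory document (the association shown here is illustrative, not our exact configuration):

# update_os.sh -- bring the image up to the latest packages in our repositories
sudo apt-get update -y && sudo apt-get upgrade -y

# Collect a software inventory baseline for every instance tagged as packer-managed
aws ssm create-association \
  --name "AWS-GatherSoftwareInventory" \
  --targets "Key=tag:packer_managed,Values=true" \
  --schedule-expression "rate(1 day)"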

At Edelman Financial Engines, we have created the following process to stay secure, agile and responsive to our platform needs.

CIS benchmarks as reference

While it was clear to us that we should follow best practices around OS configuration for image hardening, we wondered where we could find a list of documented and trackable recommendations. We found the answer in the CIS Benchmarks.

These are well-described, industry-standard, auditable, and repeatable configuration changes, shipped in a detailed document based on the operating system of choice.

Follow these recommendations and add them to your operating system images and configuration. Packer proves to be an excellent tool for applying the changes, via code and scripts, at each image iteration:

# CIS Benchmark AMI v1.3.0

## CIS 14.04 v2.1.0 change 5.3.3
echo "[INFO] Changing configuration for password reuse to {x} old pwd"
sudo sh -c 'echo "password required pam_pwhistory.so remember={x}" >> /etc/pam.d/common-password'

## CIS 14.04 v2.1.0 change 5.2.5
echo "[INFO] Allowing max retry of SSH connect to <={x}"
sudo sh -c 'echo "MaxAuthTries {x}" >> /etc/ssh/sshd_config'

In code, it is fairly simple to add new changes and document them properly. The added flexibility allows you to:

  • Quickly identify what broke an image and remove it
  • Point specifically at the CIS Benchmark change via comments
  • Pin a change to a specific version of the AMI

What makes the CIS Benchmarks even more attractive is the ability to measure, reliably, how a resource’s configuration stacks up against an ideal score, expressed as a percentage (you gotta love this for KPIs):

Example: CIS benchmarks compliance scores tracking ensures forward motion and progress
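As a rough illustration of the arithmetic behind such a score (a hypothetical helper, not part of our pipeline or of Rapid7), the percentage is simply passed checks over total applicable checks:

# score.sh -- hypothetical: compute a compliance percentage from an audit results file
# results.txt is assumed to contain one line per check, ending in PASS or FAIL
PASS=$(grep -c 'PASS$' results.txt)
TOTAL=$(wc -l < results.txt)
echo "CIS compliance score: $(( 100 * PASS / TOTAL ))%"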

Once the security team sets a schedule for working on additional changes, a security developer can iterate on them to increase the score and mark more items off the list, making the image progressively more secure.

To reliably measure our score against the benchmarks, we use Rapid7 to scan for results. Not only does it keep a history of our scans, it also lets you mark configuration overrides in case your organization uses a different solution than the one the scanner looks for (e.g., using Splunk instead of syslog).

Example: Scan results in Rapid7 — drill down on what is left

The pipeline verification

So far, all sounds good. But what do you do in case of errors? How do you automate the addition and removal of security changes together with a verification step? And once the checks have passed, how can you copy encrypted versions of the golden hardened AMI across all your accounts so that automation picks up the new AMI from then on?

Enter our own DevSecOps Pipeline:

Illustrating the AMI Hardening Compliance Pipeline
  1. The code is checked in and Packer completes baking a new AMI;
  2. A testing environment is spun up via Terraform, with the new image fed in as the AMI to use. Once the environment is up, our latest build is pushed to it;
  3. Tests verify that all our services come up correctly. If the tests don’t pass, we go back to the first stage and keep working on the scripts that secure the image; if they do pass:
  4. We tag the new golden hardened image;
  5. We encrypt and copy the AMI to all of our AWS accounts (sketched below).
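A hedged sketch of what steps 4 and 5 can look like with the AWS CLI (tag names, account IDs, regions, and the KMS key alias are placeholders, not our exact setup):

# 4. Tag the newly baked image as the golden hardened AMI
aws ec2 create-tags --resources "$NEW_AMI_ID" \
  --tags Key=golden_hardened,Value=true

# 5. Share the AMI with a target account, then (from that account) copy it with encryption
#    enabled. The underlying EBS snapshot must be shared as well; omitted here for brevity.
aws ec2 modify-image-attribute --image-id "$NEW_AMI_ID" \
  --launch-permission "Add=[{UserId=111122223333}]"
aws ec2 copy-image --profile target-account --region us-west-2 \
  --source-image-id "$NEW_AMI_ID" --source-region us-west-2 \
  --name "golden-hardened-$(date +%Y%m%d)" \
  --encrypted --kms-key-id alias/your-cmk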

The whole process is automated and kicks off the moment security scripts are checked in and a new Packer build is initiated.

The beauty of this process is that:

  • It is extremely fast, giving us immediate feedback;
  • Once an image passes, any other environments we maintain will autoscale (when needed) onto the new version of the image (see the sketch after this list);
  • Any operating system can be the starting point of the pipeline. We only need to reference a different edition of the CIS Benchmarks and adapt our scripts to work with that OS (various flavors of Linux and Windows are supported, and even containers and Kubernetes have CIS Benchmarks of their own).
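For instance, downstream automation can resolve "the new version of the image" by querying for the most recent AMI carrying our tags; a hedged sketch using the AWS CLI (the tag names echo the Packer template above, the rest is illustrative):

# Resolve the latest packer-managed, golden hardened AMI for launch configurations/templates
aws ec2 describe-images --owners self \
  --filters "Name=tag:packer_managed,Values=true" "Name=tag:golden_hardened,Values=true" \
  --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text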

And especially:

To patch servers and entire environments, we just need to replace nodes in autoscaling groups in a blue/green deployment fashion, patching everything during business hours without downtime for our customers.
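One way to drive that replacement, as a sketch (the group name is a placeholder), is to terminate instances one at a time and let the autoscaling group bring up replacements built from the new AMI:

# Roll an autoscaling group onto the new image by replacing one instance at a time
for id in $(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names my-service-asg \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text); do
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$id" --no-should-decrement-desired-capacity
  sleep 300  # give the replacement node time to come into service before the next one
done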

This process has improved our confidence in our platforms and gives us an immediate way to track where we stand and how much more secure we become with every iteration.

What to do when things go wrong

Now imagine something goes wrong with your automation. Murphy’s law knows it: if it can happen, it will happen. How can you quickly revert and prevent your automated pipelines from breaking everywhere?

Recently, a bug in cloud-init and aws-cli inflicted a significant amount of pain on many cloud implementations, with shops all over scrambling to find a solution: https://github.com/urllib3/urllib3/issues/1456. To make a long story short (since this specific issue is not our focus), a broken dependency between the requests and urllib3 Python libraries broke the cloud-init automation responsible for finishing server setup in AWS.

Once we identified the issue and reported a solution (https://github.com/aws/aws-cli/issues/3678), we pinned the desired version of requests in our AMI code and shipped it through our continuous compliance pipeline; the tests passed and we were back in business, confident that all our checks and verifications had passed.
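In practice, such a pin can live in one of the Packer provisioning scripts; a minimal sketch, with the version number purely illustrative (pin whatever is validated in your own environment):

# Pin the Python requests library to a known-good version so cloud-init keeps working
# (2.20.1 is illustrative -- use the version you have verified against your urllib3)
sudo pip install 'requests==2.20.1'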

All in a matter of seconds:

  • code was changed, giving us the ability to edit the exact line responsible for the mayhem;
  • we stayed secure and committed additional hardening scripts;
  • we avoided scrambling or regressing by reactivating old snapshots that would have anchored us to the past; and especially
  • we shipped a new AMI, and all automation started feeding from it like nothing had happened.

Conclusion

Staying secure and compliant is crucial for us. We want to ensure that our customers, and the data flowing through our infrastructure, get the highest level of security and compliance available at the moment. The ability to pivot and adjust quickly when code turns out to be incompatible, and to rapidly verify that our infrastructure code works as expected, is what makes this a true DevSecOps pipeline. The continuous compliance pipeline lets us maintain a secure yet innovative delivery process, verified by our tests, and ship infrastructure with the highest level of confidence possible.

Security does not have to mean a slow pace. Compliance does not have to mean bureaucracy and sluggish procedures. We can be fast, secure, and confident while staying highly available.
