Planning Labs is a team of four that has produced a remarkable amount of high-quality software in its 13-month existence. In that time, however, the team realized there was room for improvement in how their software is deployed and operated, and in ensuring that it runs reliably and securely. I was asked to look at their current practices, help implement better solutions, and leave them with a stronger operating model going forward.
This team’s mission is to deliver tools for their agency and the public, not to worry about infrastructure. My goal was to find the biggest, quickest wins with the lowest ongoing effort.
Where do we start?
With a short-term engagement focused on infrastructure and security, it’s tempting to go off of a wish list of “well, we’ve been meaning to do X for a while.” While checking these kinds of tasks off can be satisfying, this is not actually a good way to prioritize. How could we make sure we were focusing on tasks with the highest value?
I started by interviewing the team. I asked each person to explain their different tools and systems, how they all fit together, and where the known weak spots and pain points were. From there, I extracted relevant items into a (public) Trello board, organized into columns of Concerns, Underlying Problems, and Solutions.
Concerns (also framed as “threats”) were essentially answers to “what keeps you up at night?” A Concern might be something like a server getting hacked.
The Underlying Problems were the reasons a Concern might become reality. In the server-hacking example, one Underlying Problem might be that servers aren’t getting patched regularly. Other Problems relate to recovering from an incident, such as being unable to recreate a server quickly.
Some cards came directly from statements in the interviews, while others were derived from other Concerns/Problems/Solutions that came up. The Concerns link to their Underlying Problems, and the Problems link to their corresponding Solutions, so the relationship can be traced all the way through.
Once we had most of the cards in place, I held a session with the team to prioritize the Concerns through dot voting, ignoring all the other columns. With the Concerns ranked, we then sorted the corresponding Underlying Problems and Solutions to match. The Solutions that rose to the top weren’t necessarily what the team had expected to focus on, but instead reflected the highest-value work that could be done to mitigate the top Concerns. These Solutions became the backlog (short-term TODOs).
Configuration as code
Now that we had prioritized tasks, it was time to start implementing. Planning Labs manages five servers at this point: not enough to justify beefy orchestration tools, but enough that maintaining them by hand was getting unwieldy.
My early focus was to start building up configuration that should be common across all of the servers, and getting that into code that could ensure consistency. This included things like:
- Managing the users
- Locking down (“hardening”) SSH
- Setting up monitoring
- Ensuring the servers were being backed up
We chose Ansible to manage this configuration, for a few reasons:

- It is open source
- It is readable
- It is relatively easy to get started
- You don’t need to run infrastructure for the configuration management itself
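To make this concrete, here is a minimal sketch of what a base playbook along these lines might look like. The user names and settings are illustrative assumptions, not the team’s actual configuration:

```yaml
# base.yml - a sketch of configuration applied to every server.
# All names here are hypothetical examples, not Planning Labs' real setup.
- hosts: all
  become: true
  tasks:
    - name: Create a login account for each team member
      ansible.builtin.user:
        name: "{{ item }}"
        groups: sudo
        shell: /bin/bash
      loop:
        - alice   # hypothetical team member
        - bob     # hypothetical team member

    - name: Harden SSH by disabling password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
      notify: Restart sshd

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: ssh
        state: restarted
```

Because plays like this are idempotent, they can be re-run against every server to converge it on the same known-good state.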
With that base configuration, we then wrote up Ansible code for configuring particular types of servers. This code can be used to spin up test servers, or in case an existing server is hacked or destroyed.
Not all went smoothly
Doing a deep dive on infrastructure can turn up problems you didn’t know you had, or create problems you didn’t expect. First, we found a secret committed to a public GitHub repository. Thankfully, this particular secret wasn’t especially valuable (and didn’t seem to be in use), but finding it gave us the opportunity to practice credential rotation.
Next, we took down most of the Planning Labs sites by submitting a DNS change request with a typo.
Last but not least, we deployed a change to the SSH configuration that we hadn’t tested thoroughly enough, and it locked us out of all the machines.
While investigating this incident, we found that some malicious actor had made thousands of attempts to SSH into the servers using common (but random) usernames. Good thing we were making those connections more restrictive!
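With hindsight, two stock Ansible safeguards would have limited the blast radius of an untested SSH change like this: having sshd validate the rendered configuration before it replaces the live one, and rolling the change out one host at a time. A sketch (the template name is a hypothetical placeholder):

```yaml
# Roll out SSH configuration changes cautiously.
- hosts: all
  become: true
  serial: 1   # one host at a time, so a bad change strands at most one server
  tasks:
    - name: Install hardened sshd_config
      ansible.builtin.template:
        src: sshd_config.j2   # hypothetical template name
        dest: /etc/ssh/sshd_config
        # sshd syntax-checks the rendered file before it is installed;
        # the task fails instead of breaking SSH access.
        validate: '/usr/sbin/sshd -t -f %s'
      notify: Restart sshd

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: ssh
        state: restarted
```

Even with these safeguards, it’s worth keeping an existing SSH session open while testing a new connection, so you can roll back if the change misbehaves.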
With each incident, we held a blameless post-mortem to understand the root causes of the issues and learn from them. Kudos to the team for keeping calm throughout!
While my time with Planning Labs was short, we wanted to ensure that the team maintained good operational and security practices going forward. They started a team handbook to lay out everything from their core values to how to get a badge for a visitor.
The handbook also ended up containing a list of the software-as-a-service products used by the team, which numbered more than anyone expected. This list is particularly useful for offboarding, because the team needs to know which services outgoing staff must be removed from.
There is also now a checklist of what needs to be in place for every server, much of which is one-time setup to ensure the server’s safe ongoing operation.
We didn’t get through everything in the backlog in the two months, but made good progress in raising the DevOps and security maturity of the team. A few lessons I left them with:
- More servers, more problems. Things like SSH hardening are only necessary if you’re managing your own servers, so look for places where you can offload that burden to third parties. One way is to use higher-level abstractions: a platform-as-a-service (PaaS) rather than infrastructure-as-a-service (IaaS).
- DevOps and security work is hard to prioritize. When you are working on features that provide value to your users, it’s hard to set aside time for things like security. Try doing things the right way the first time, rather than saying “oh, I’ll come back and automate that later,” because later rarely comes.
- Protections work best in layers. A single layer of security, whether it’s a firewall, SSH hardening, or alerts, isn’t enough to protect your servers on its own. These need to be used in conjunction with each other to fill in where the others fall short.
Hoping this is helpful to other teams out there getting started with DevOps and modern security practices!