Infrastructure Management with AWS Systems Manager
At Disney Streaming Services we run infrastructure in multiple Amazon Web Services (AWS) regions and multiple AWS accounts. Some teams use AWS Systems Manager to help maintain and troubleshoot infrastructure with minimal overhead. In this post we’ll walk through some of the problems AWS Systems Manager solves and open source tools we wrote to help use it at a global scale.
Systems Manager Introduction
In an ideal world nothing would ever break: software would be bug-free and hardware would never fail. In reality, even edge cases that occur a fraction of a percent of the time will eventually cause failures. When you operate at scale, things break all the time.
It is a constant source of job security for engineers to fix known bugs in their environments, but there will always be new unknown unknowns which cannot be prevented or tooled for. Trying to prevent unknowns is usually wasted engineering time and doesn’t add business value. At some point, the most efficient way to debug a hard problem is by running a command or interactive shell to gather more information.
Many companies accomplish running commands with tools like Chef Knife, MCollective, SaltStack, or Ansible. All of these tools are capable of solving the needs of engineers to gather more information or temporarily fix problems across a fleet of servers, but they all have trade-offs.
SSH is an obvious choice for interactive access, but it carries a significant management burden: you have to manage POSIX users, SSH keys or certificates, or centralized authentication services like LDAP or Active Directory. Newer tools have been released, such as Netflix’s BLESS and Gravitational’s Teleport, which add features like auditing and certificate revocation, but there is still a management cost for bastion servers, open firewall ports, or Lambda functions to access machines.
These tools require you to set up, secure, and maintain infrastructure with persistent connections to your servers, or to open network ports to allow engineers’ computers to communicate directly with infrastructure. Even with the additional overhead and trade-offs, these are still viable solutions depending on your environment and capability to scale and maintain the tools.
Amazon Web Services provides Systems Manager (often referred to as SSM, in reference to its original name of “Simple Systems Manager”) for exactly these tasks. Systems Manager is free for the use cases described above, and if you’re using Amazon Linux it’s probably already installed. Once you configure the required IAM permissions you’re ready to go.
Some of the benefits we’ve found with Systems Manager include:
- User management is tied to IAM permissions. One place to allow or deny access to systems via tagging reduces the burden of user management.
- No direct network access or VPN required. Users only need access to the AWS API and systems only need access to AWS VPC endpoints.
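To make the tag-based access control above concrete, here is a minimal sketch of an IAM policy that only allows Session Manager sessions on instances carrying a particular tag. The function name print_dev_session_policy and the env=dev tag are assumptions for illustration, and a real policy would likely need more statements (for example, permission to terminate or resume your own sessions).

```shell
# Print an IAM policy document that allows starting Session Manager
# sessions only on EC2 instances tagged env=dev.
# The tag key and value are illustrative assumptions, not from the post.
print_dev_session_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ssm:StartSession",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": { "ssm:resourceTag/env": "dev" }
      }
    }
  ]
}
EOF
}
```

You could save the output with print_dev_session_policy > policy.json and register it with aws iam create-policy --policy-name dev-session-access --policy-document file://policy.json, where the policy name is again hypothetical.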
We’re not going to show you how to configure Systems Manager in your environment, but we’d like to focus on some of the tools we developed to make the user experience and features of Systems Manager better for engineers.
One thing we would highly recommend is setting up automatic updates for the Systems Manager agent in your infrastructure. Yes, this means the agent updates itself in place rather than being updated through immutable OS deployments, which may be a cause for concern in some environments and for some administrators. However, we have found that keeping the agent up to date has been a huge time saver and has eliminated a lot of time spent coordinating and testing new OS deployments.
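One way to schedule agent updates is a State Manager association that runs the AWS-managed AWS-UpdateSSMAgent document. The sketch below is an assumption about how you might wire this up, not a description of our setup; the function name schedule_agent_updates, the tag-based targeting, and the weekly cron schedule are all illustrative.

```shell
# Create a State Manager association that runs the AWS-UpdateSSMAgent
# document on a weekly schedule for instances matching a tag.
#   $1 = AWS profile, $2 = region, $3 = tag filter written as key=value
# The schedule (Sundays at 02:00 UTC) is an illustrative choice.
schedule_agent_updates() {
  aws ssm create-association \
    --profile "$1" --region "$2" \
    --name AWS-UpdateSSMAgent \
    --targets "Key=tag:${3%%=*},Values=${3#*=}" \
    --schedule-expression "cron(0 2 ? * SUN *)"
}
```

For example, schedule_agent_updates dev us-east-1 env=dev would create the association for env=dev instances in one account and region; you would repeat the call for each account and region you run in.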
While AWS Systems Manager solves a lot of problems, it leaves a lot to be desired in the user experience (UX). The aws command line isn’t always the most intuitive and the session manager plugin isn’t open source which can make the tools harder to use and debug.
To make Session Manager easier to use we’ve developed some tools for our engineers which we have open sourced on GitHub. We’ll look at the tools briefly to show how they can be used in your environment.
SSM RUN
First up is ssm-run which can be used to run a command across a fleet of EC2 instances based on tag matching. Like some of the other tools described above, it supports batching command execution and different levels of log output for interactive troubleshooting or automation.
ssm-run is a standalone binary so you don’t need the aws cli installed. It works with your ~/.aws/config and ~/.aws/credentials to give you access to infrastructure.
One of the really nice things about ssm-run is the ability to work across multiple AWS accounts and regions. This helps a lot to avoid one-off scripts and for loops that we often needed to use to manage our global fleet.
An example would be if you want to run uptime on all of your accounts and in 3 regions on instances with the env=dev tag. With ssm-run you might use something like this:
ssm-run --region us-east-1,us-west-1,eu-west-2 --all-profiles --filter env=dev --commands 'uptime'
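For comparison, the kind of one-off for loop this replaces might look like the sketch below, written against the plain aws command line. The function name run_everywhere and its argument layout are assumptions for illustration; it is not how ssm-run is implemented.

```shell
# Run one shell command on every instance matching a tag, once per
# profile/region pair, using the AWS-managed AWS-RunShellScript document.
#   $1 = space-separated profile names
#   $2 = space-separated regions
#   $3 = tag filter written as key=value
#   $4 = command to run (simple commands only: the CLI shorthand syntax
#        used for --parameters treats commas as separators)
run_everywhere() {
  for profile in $1; do
    for region in $2; do
      aws ssm send-command \
        --profile "$profile" --region "$region" \
        --document-name AWS-RunShellScript \
        --targets "Key=tag:${3%%=*},Values=${3#*=}" \
        --parameters "commands=$4"
    done
  done
}
```

A call such as run_everywhere "dev prod" "us-east-1 us-west-1 eu-west-2" env=dev uptime issues six separate requests. Each send-command call returns a command ID right away rather than the output, so you would still need to collect results separately, for example with aws ssm list-command-invocations.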
SSM SESSION
The unfortunately named Systems Manager Session Manager needed some experience management to manage user expectations in our environment. To make the command line experience better for our users we wrote ssm-session.
Session Manager doesn’t require a bastion host or open ports in your VPC. No local user provisioning is required and sessions can be recorded and audited for greater security.
ssm-session can be used as a replacement for ssh for connecting to an interactive shell on a single instance.
ssm-session with a single instance
It can also be used to start multiple interactive shells on instances via tmux. Just like ssm-run it can be used across multiple AWS accounts and regions at the same time. We find it really handy for debugging and verifying systems.
ssm-session with multiple instances in multiple profiles and regions
Because the session manager plugin and session protocol are not open source or publicly documented, ssm-session is a Go wrapper around aws ssm start-session... so you’ll need to have the AWS CLI installed along with the session manager plugin. In the future we would like to also make this a static binary, but we think the tool has enough utility to open source now.
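To show what the underlying call looks like, here is a minimal sketch that finds a running instance by tag and opens a session on it with the plain AWS CLI. The function name session_by_tag is an assumption for illustration; this is not the ssm-session source, and it needs the AWS CLI and the session manager plugin installed.

```shell
# Open an interactive shell on the first running instance matching a tag.
#   $1 = AWS profile, $2 = region, $3 = tag filter written as key=value
session_by_tag() {
  # Ask EC2 for the first matching instance ID; with --output text an
  # empty result is printed as the literal string "None".
  instance_id=$(aws ec2 describe-instances \
    --profile "$1" --region "$2" \
    --filters "Name=tag:${3%%=*},Values=${3#*=}" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId | [0]' \
    --output text) || return 1

  if [ -z "$instance_id" ] || [ "$instance_id" = "None" ]; then
    echo "no running instance matches $3" >&2
    return 1
  fi

  aws ssm start-session --profile "$1" --region "$2" --target "$instance_id"
}
```

Running session_by_tag dev us-east-1 env=dev would drop you into a shell on one matching instance in a single account and region.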
To help with this limitation we also package the tools as Docker containers with all of the prerequisites installed and ready to go. A simple shell alias will let you mount your AWS config and run the tools without installation.
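Such an alias might look like the sketch below. The image name is a hypothetical placeholder, and the sketch assumes the container runs as root (so the AWS config lives at /root/.aws) and that the image’s entrypoint is the tool itself; check the project README for the real image name and mount path.

```shell
# Hypothetical image name: replace with the image you actually pull or build.
SSM_TOOLS_IMAGE="${SSM_TOOLS_IMAGE:-example/ssm-tools:latest}"

# Mount the local AWS config read-only and pass any arguments through to
# the container. Assumes the container user is root and the image
# entrypoint is the ssm-run binary.
alias ssm-run="docker run --rm -it -v \"\$HOME/.aws:/root/.aws:ro\" $SSM_TOOLS_IMAGE"
```

With the alias in your shell profile, ssm-run can be invoked with the same arguments as a locally installed binary.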
We’re glad we could open source these tools and welcome community contributions to make them better. We also open sourced the go-ssmhelpers library we used, so you can make your own tools, and enhanced an existing gomux library to automate tmux the way we needed.
We hope these tools and libraries help you build simpler and more secure systems and hope that all of your unknown unknowns will become known in no time.
This post was written in collaboration with Justin Garrison.