Keeping servers healthy
Part II — Running your server
Building the server is only step one (if you missed that post, we recommend reading it before you start this one).
Now that it’s up and running, we can revisit the list of packages Puppet installs and look at what that means for our engineers on a day-to-day basis.
Two-factor authentication software
We run LinOTP to provide a second factor of authentication, proving people’s identity in a number of applications, including logging in to ‘sensitive’ (usually live production) servers. When you first join Code Enigma, one of our engineers will set you up with a temporary password so you can register your phone (or a YubiKey, if you bought one from us) with LinOTP. From then on, when you log in to a server you will be prompted for your username, password and the code from your 2FA device, enforcing the ‘something you know and something you own’ double verification of identity.
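To make the 'code from your 2FA device' concrete, here is a minimal sketch of how a time-based one-time password can be generated and checked. It assumes a TOTP token as described in RFC 6238 (LinOTP supports several token types, and the text above does not say which is used); the function names, the 30-second step and the one-step drift window are illustrative assumptions, not LinOTP's own code.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble selects the offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based variant (RFC 6238): the counter is the current time step."""
    return hotp(secret, unix_time // step, digits)


def verify(secret: bytes, submitted: str, unix_time: int, window: int = 1) -> bool:
    """Accept codes from neighbouring time steps to tolerate small clock drift."""
    step = 30
    return any(
        hmac.compare_digest(
            totp(secret, max(0, unix_time + drift * step), step), submitted
        )
        for drift in range(-window, window + 1)
    )
```

The server and the device share the secret at registration time, so both can compute the same code independently; the password covers 'something you know' and the device holding the secret covers 'something you own'.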
Directory server connections (for user management)
As part of our server management service we run a user management dashboard for our clients. Via this dashboard our clients can manage users under their account within our user directory, selecting their level of access, notifications and also managing things like their SSH key(s), passwords, additional contact information and so on.
ClamAV and its settings
One of the things we manage is the antivirus software, and we respond to any threats it finds. We get nightly emails from ClamAV if there are any issues, and we respond accordingly, sometimes quarantining and purging bad files, sometimes just letting clients know we have what we think is a false positive. Either way, if ClamAV triggers on one of your servers, you’ll get a ticket from someone on the systems team, even if it’s just to tell you not to worry.
No backup is really a backup if it isn’t tested, so while we discussed setting up backup software, we also add the server to a rolling calendar for backup testing. Once a quarter, your off-site Duplicity backups will be restored entirely to a temporary AWS EC2 machine in our private account and manually inspected by an engineer, to ensure we’re seeing the data we expect to see.
While most environments have built-in physical firewall protection or some equivalent (for example, the AWS Security Groups feature is basically a managed firewall for your cloud network), we operate pretty paranoid software firewalls as well. These are centrally controlled by Puppet to a degree (we ensure critical files are in place, the init.d script is not altered, and so on), but we also manage exceptions via the Ansible playbook for ‘iptables’, which is still our kernel firewall management tool of choice. We have rules as standard for both inbound and outbound traffic. That way, even if a machine gets infected, perhaps by a badly configured website that allows an upload which doesn’t get immediately picked up by virus scanners, it’s hard for that software to ‘call out’ and announce its success to its creator. We also block all inbound traffic that isn’t specifically permitted, and restrict certain activities to certain addresses. For example, secure shell is only permitted from our VPN and other IP addresses specified by the customer.
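The 'block everything inbound unless specifically permitted' policy can be modelled in a few lines. This is only a sketch of the decision logic, not iptables syntax or the actual rule set: the ports, the VPN range 10.8.0.0/24 and the customer address 203.0.113.10 are made-up values for illustration.

```python
import ipaddress

# Hypothetical inbound rules: (protocol, port, permitted source networks).
INBOUND_RULES = [
    ("tcp", 80, ["0.0.0.0/0"]),
    ("tcp", 443, ["0.0.0.0/0"]),
    # SSH only from the (assumed) VPN range and one customer-specified address.
    ("tcp", 22, ["10.8.0.0/24", "203.0.113.10/32"]),
]


def inbound_allowed(protocol: str, port: int, source: str) -> bool:
    """Return True only if an explicit rule permits this connection."""
    address = ipaddress.ip_address(source)
    for rule_protocol, rule_port, networks in INBOUND_RULES:
        if (protocol, port) != (rule_protocol, rule_port):
            continue
        if any(address in ipaddress.ip_network(network) for network in networks):
            return True
    return False  # default deny: nothing matched, so the traffic is dropped
```

The important design choice is the final `return False`: a connection is refused unless a rule says otherwise, so forgetting a rule fails closed rather than open.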
We do a lot of monitoring! In all cases we install Nagios, the industry-standard real-time alerting system, and we run two separate servers on separate sites to help identify false positives. We use this to monitor pretty much everything: all services, disk space, RAM usage, system load, DNS health, even VMware balloon size (if applicable). We monitor the balloon because it can be a warning of impending RAM starvation for a VMware guest.
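For readers who have not seen one, a Nagios check is just a small program that reports a status code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a one-line message. The sketch below shows what a disk space check might look like; the 80% and 90% thresholds and the function names are assumptions chosen for illustration, not the thresholds used in practice.

```python
import shutil

# Status codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent: float, warn: float = 80.0, crit: float = 90.0):
    """Map a usage percentage to a (status, message) pair."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0):
    """Measure disk usage for a path and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, f"DISK UNKNOWN - {error}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return classify(100.0 * usage.used / usage.total, warn, crit)
```

A real plugin would print the message and exit with the status code, which is how Nagios decides whether to raise an alert.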
We also set up key web pages with StatusCake, our external monitoring partner, so we get SMS and email alerts if anything goes away. This is also connected to Nagios: we maintain a Nagios integration module, which is open source on GitHub. For machines in environments where there is no additional historical recording of performance data, we install Munin, which tracks a machine’s health over time, to help us spot trends in resource usage and act before there’s a capacity problem.
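Spotting a trend before it becomes a capacity problem usually means extrapolating from historical samples. Munin itself draws the graphs and an engineer reads them; the sketch below is a hypothetical helper (not part of Munin) that fits a straight line to daily disk usage readings and estimates how many days remain before the disk is full.

```python
def days_until_full(samples, capacity):
    """Estimate days until usage reaches capacity.

    samples: list of (day_number, used_amount) pairs, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    count = len(samples)
    if count < 2:
        return None
    mean_day = sum(day for day, _ in samples) / count
    mean_used = sum(used for _, used in samples) / count
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        return None
    # Ordinary least-squares slope: growth in usage per day.
    slope = sum(
        (day - mean_day) * (used - mean_used) for day, used in samples
    ) / spread
    if slope <= 0:
        return None
    intercept = mean_used - slope * mean_day
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - samples[-1][0])
```

For example, readings of 50, 52, 54 and 56 GB on four consecutive days against a 100 GB disk grow by 2 GB a day, which leaves 22 days from the last reading.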
For clients on AWS we configure and manage CloudWatch, including alerts for services such as ElastiCache and RDS, so we are warned in real-time if there are resource problems.
Our IPS, OSSEC, is running all the time across all machines. It automatically triggers blocking at a kernel firewall level on all our infrastructure if a particular address breaks one of the ‘rules’ in place. It also sends a report of all suspicious activity throughout the day, across the network, every single day. Our engineers check these reports line by line to ensure there are no holes or new attempts we should concern ourselves with, rules we could add or strengthen, and so on.
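The 'break a rule, get blocked' behaviour boils down to counting offences per source address within a time window. The class below is an illustrative model only: OSSEC's real rules and active-response scripts are far richer, and the threshold of five events in sixty seconds is an assumption made for the example.

```python
from collections import defaultdict, deque


class OffenderTracker:
    """Block a source address once it triggers too many rules too quickly."""

    def __init__(self, max_events: int = 5, window_seconds: int = 60):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # address -> recent event timestamps
        self.blocked = set()

    def record(self, source_ip: str, timestamp: float) -> bool:
        """Register a rule violation; return True if the address is blocked."""
        if source_ip in self.blocked:
            return True
        recent = self.events[source_ip]
        recent.append(timestamp)
        # Forget events that have fallen out of the sliding window.
        while recent and recent[0] <= timestamp - self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_events:
            self.blocked.add(source_ip)
            return True
        return False
```

In a real deployment the point where an address joins `blocked` is where a firewall rule would be added, and the list of recorded events is what feeds the daily report.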
We also get automated warnings nightly from the rootkit detection software, rkhunter. This is a simple piece of software that monitors the checksums of applications on the system and warns us if any of them change. If we’re expecting the change (because of patching or an upgrade) we can just update rkhunter’s database and carry on. If we weren’t, it’s a “heads up” that something more sinister may be happening!
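The checksum idea is simple enough to sketch. The helpers below are hypothetical and much cruder than rkhunter (which checks a good deal more than file hashes), but they show the two halves of the workflow: record a baseline, then report anything that no longer matches it.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large binaries are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(paths) -> dict:
    """Record the current checksum of every monitored file."""
    return {str(path): file_digest(Path(path)) for path in paths}


def changed_files(baseline: dict) -> list:
    """Return monitored files that are missing or whose checksum differs."""
    return sorted(
        name
        for name, expected in baseline.items()
        if not Path(name).is_file() or file_digest(Path(name)) != expected
    )
```

After an expected change such as a package upgrade, calling `build_baseline` again plays the role of updating the database; an unexpected entry in `changed_files` is the warning worth investigating.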
That’s an overview of all the software on these servers that we’re continuously maintaining and monitoring. But there’s more! Critically, there’s package management. All the software needs updating regularly, so we run a calendar with all our servers in it and we aim to patch many servers a day, every day, so that over a three-week cycle every single machine gets refreshed to the latest versions of software and rebooted if necessary. We do this manually because, with the best will in the world, and regardless of how robust the process usually is, upgrading software is a risky business. It’s always better to have a human watching a dry run and applying the updates than leaving the computer to update itself and finding it dead in the morning.
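A rolling patch calendar like this can be thought of as spreading the server list evenly over the days in the cycle. The sketch below assumes a 21-day cycle and a simple round-robin assignment; the function name and the assignment strategy are illustrative guesses, since the text does not describe how the calendar is actually built.

```python
def build_patch_cycle(servers, cycle_days: int = 21):
    """Assign each server to one day of the cycle, keeping days balanced."""
    schedule = [[] for _ in range(cycle_days)]
    # Sorting first makes the calendar stable from one cycle to the next.
    for index, server in enumerate(sorted(servers)):
        schedule[index % cycle_days].append(server)
    return schedule
```

Round-robin keeps the daily workload within one server of every other day, and because each server appears exactly once, nothing goes longer than one full cycle without being patched.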
What if it all goes wrong?
That’s when we earn our management fee!
Firstly, there’s emergency patching. Sometimes a ‘zero day’ vulnerability will emerge and there will not yet be a solution available. We use Debian for most things because the Debian security team are really sharp, and usually one of the first, if not the first, to release a security update. We also monitor security forums and feeds, and if there are mitigations and patches to critical software we can apply to servers while we’re waiting, we do! We typically do this by testing on a few of our servers first and, if all looks good, applying rapidly across the network with a custom Ansible playbook. We keep some templates handy to make the process faster in an actual emergency.
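The 'test on a few, then roll out everywhere' approach is a staged rollout. In practice the work is done by an Ansible playbook; the function below only models the decision logic, and its name, the `apply_patch` and `is_healthy` callbacks, and the default of three test servers are all assumptions for illustration.

```python
def staged_rollout(servers, apply_patch, is_healthy, canary_count: int = 3):
    """Patch a few servers first; continue only if they all stay healthy."""
    servers = list(servers)
    canaries = servers[:canary_count]
    remainder = servers[canary_count:]
    patched = []

    for server in canaries:
        apply_patch(server)
        patched.append(server)
        if not is_healthy(server):
            # Stop immediately so a bad patch never reaches the wider network.
            pending = [name for name in servers if name not in patched]
            return {"patched": patched, "halted_at": server, "pending": pending}

    for server in remainder:
        apply_patch(server)
        patched.append(server)
    return {"patched": patched, "halted_at": None, "pending": []}
```

If any test server fails its health check, the rollout halts and reports which machines are still untouched, which is the point at which a human takes over.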
Then of course there’s incident response. The whole point of all the monitoring is that we know almost the instant there’s a problem and can intervene. Sometimes we’ll make adjustments to software configuration, sometimes we’ll need to restart services, sometimes we’ll make infrastructure changes like adding disk, adding a server to an autoscale cluster, and so on. The main point is that the monitoring we have in place allows us to do so in a timely fashion, so unless there’s a catastrophic failure we’ll be able to have your website back up in minutes.
And if there is a catastrophic failure, we’ll be pulling your backups. Hopefully your local snapshot backups, as they’re far faster to restore, but worst case your tested off-site backups will be decrypted to temporary AWS machines and brought online.
That’s all. We hope you enjoyed learning more about what’s really involved in proper, professional server management. If you missed the article about how we manage cloud services, you can find it here.