#chatops
The joke that runs our life.
A term was introduced to our Technical Operations team a few months back by one of our vendors, Leo. I believe the comment was “Hey, you guys should give a talk about how you use chat ops.” In response, our small team stared at him like he was speaking gibberish. He then went on to explain that the usage of chat bots to interact with infrastructure (something we do often) is called chat ops.
Ever since then, we have been laughing at the amazing buzzword. It is so silly and embodies the hilarity of the tech zeitgeist so perfectly that our four-person team often prefixes it with a hashtag when talking about it during standups. And so #chatops has become our answer to most problems. Need to quickly ban an IP at the edge of our network? #chatops! Write a chat bot that accepts “ban 54.45.12.34” and modifies the list of banned CIDR ranges on our edge. Need to audit employees who don’t have multi-factor authentication enabled on their accounts? #chatops! Write a bot that posts daily with a list of users we need to shame.
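To make the ban example concrete, here is a minimal sketch of how such a command could be parsed. It is not our actual bot: `handle_ban` and the in-memory `banned_cidrs` set are hypothetical stand-ins, and a real bot would call the edge provider's API rather than mutate a local set.

```python
import ipaddress

# Hypothetical in-memory stand-in for the edge's ban list; a real bot would
# call the edge provider's API instead.
banned_cidrs = set()


def handle_ban(message: str) -> str:
    """Parse 'ban <ip-or-cidr>' and add the network to the ban list."""
    parts = message.split()
    if len(parts) != 2 or parts[0] != "ban":
        return "usage: ban <ip-or-cidr>"
    try:
        # A bare address becomes a single-host network (/32 for IPv4).
        network = ipaddress.ip_network(parts[1], strict=False)
    except ValueError:
        return f"'{parts[1]}' is not a valid IP address or CIDR range"
    banned_cidrs.add(str(network))
    return f"banned {network}"
```

Normalizing a bare address into a CIDR range up front means the ban list only ever holds one kind of entry.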
Our team follows a pretty common playbook for a web operations or site reliability engineering team. We always have a primary and a secondary oncall who act as the first responders to incidents across our system. We have a microservice architecture, all behind a third-party edge cache. We have lots of vendors that we receive data from and send data to. The main difference is that we’re moving to a world where our chat rooms are basically a shared terminal for our team. We’re not as advanced as some of the early proponents of #chatops like GitHub, but we use #chatops as a way to constantly surface what is going on in our infrastructure.
For example, a lot of the different parts of our infrastructure heal themselves, or operate in such a way that failures don’t take down our more important operations (serving content and accepting donations). That being said, we want to know if they are in a degraded state, but we don’t want to page people. Said more simply: we don’t want to be woken up at three in the morning unless we absolutely have to be. Because of this, we have three levels of alerting, and bots to tell us when things aren’t perfect.
The first level is a page. It is fired from our monitoring service and attempts to wake up our oncall. This level is only for incredibly urgent things. If something pages, it needs to have actionable output and be a customer-affecting outage. We also have a chat bot that accepts commands like “page ops Everything is on fire!” or “page events I can’t seem to create an event!” if we need to quickly get hold of the oncall person for a specific service.
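The page command has a simple shape: the word “page”, a service, then free text. A rough sketch of parsing it might look like the following. The `ONCALL_ROTATIONS` set and `parse_page` function are illustrative names, and actually delivering the page to a paging service is left out.

```python
from typing import Optional, Tuple

# Hypothetical set of service rotations the bot knows how to page.
ONCALL_ROTATIONS = {"ops", "events"}


def parse_page(message: str) -> Optional[Tuple[str, str]]:
    """Return (rotation, text) for 'page <rotation> <text>', else None."""
    # Split into at most three pieces so the free text keeps its spaces.
    parts = message.split(maxsplit=2)
    if len(parts) != 3 or parts[0] != "page":
        return None
    rotation, text = parts[1], parts[2]
    if rotation not in ONCALL_ROTATIONS:
        return None
    return rotation, text
```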
The second level is a warning in our #tech-alerts chat room. This is just a simple message to warn the current oncall that something isn’t great. We assume that these will be looked at, but if they need action, they should page. They tend to have links to metrics to dig further into the problem, or suggested bots to message to get more information.
The third level is an audit message. These arrive in our #tech-audits room and tend to be prompts for things to research that are not health related. Things we get audit messages about include: unusual security groups and load balancers, users on vendors that don’t have multi-factor authentication enabled, log analysis from the previous day, and general statistics about our organization.
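The three levels boil down to a small routing decision. The sketch below is one way to express it, assuming each event carries three yes/no flags. The flags, the `classify` and `route` functions, and the `"pager"` destination are made up for illustration; only the two room names come from our setup.

```python
from enum import Enum


class Level(Enum):
    PAGE = "page"
    WARNING = "warning"
    AUDIT = "audit"


# Room names are the ones described above; "pager" is a placeholder for
# whatever paging service wakes up the oncall.
DESTINATIONS = {
    Level.PAGE: "pager",
    Level.WARNING: "#tech-alerts",
    Level.AUDIT: "#tech-audits",
}


def classify(customer_affecting: bool, actionable: bool, health_related: bool) -> Level:
    """Pick an alert level using the rules described in the text."""
    # Only actionable, customer-affecting outages are allowed to page.
    if customer_affecting and actionable:
        return Level.PAGE
    # Degraded but not urgent: warn the current oncall in chat.
    if health_related:
        return Level.WARNING
    # Everything else is something to research later.
    return Level.AUDIT


def route(customer_affecting: bool, actionable: bool, health_related: bool) -> str:
    """Return the destination for an event with the given flags."""
    return DESTINATIONS[classify(customer_affecting, actionable, health_related)]
```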
A lot of these bots can also be called on demand to provide analysis for the last hour, for example if we want to see the top IPs generating 4xx HTTP responses at our edge, or instances created recently. An example conversation might be:
> Me: opsbot top 4xx by ip
> opsbot: Top 10 4XX Last Hour:
| 104.2.1.1 | 404 | GET | /blargh | 4001
| 207.1.2.3 | 400 | POST | /foo | 3210
| 105.2.1.1 | 404 | GET | /blargh | 912
| 201.1.2.3 | 400 | POST | /foo | 805
| 102.2.1.1 | 404 | GET | /blargh | 730
| 203.1.2.3 | 400 | POST | /foo | 608
| 109.2.1.1 | 404 | GET | /blargh | 592
| 201.1.2.4 | 400 | POST | /foo | 455
| 102.2.1.1 | 404 | GET | /blargh | 372
| 201.1.2.3 | 400 | POST | /foo | 201
As you can see, most of our bots provide read operations into our infrastructure. We have other bots that report on other services we work with; they post messages when new GitHub pull requests are opened, when new tickets are created, and when certain accounts tweet.
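A read-only query like the “top 4xx by ip” conversation above is mostly counting and formatting. Here is a small sketch of that, assuming the last hour of edge logs has already been fetched as `(ip, status, method, path)` tuples. That record shape and the function names are assumptions for the example, not our real log format.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical log record shape: (ip, status, method, path).
Record = Tuple[str, int, str, str]


def top_4xx(records: Iterable[Record], limit: int = 10) -> List[Tuple[Record, int]]:
    """Count 4xx records and return the most frequent ones first."""
    counts = Counter(r for r in records if 400 <= r[1] <= 499)
    return counts.most_common(limit)


def format_rows(rows: List[Tuple[Record, int]]) -> str:
    """Render rows in the pipe-separated style the bot posts to chat."""
    return "\n".join(
        f"| {ip} | {status} | {method} | {path} | {count}"
        for (ip, status, method, path), count in rows
    )
```

Grouping on the whole tuple rather than on the IP alone keeps the method and path visible, which is usually what tells you whether the traffic is a broken client or something hostile.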
The area where we are still looking to make large improvements to our chat bots is write operations to our infrastructure. We are looking into bots that would allow people to kick off service deploys from chat, add new URL redirects to our routing map, and run database queries against our data warehouse. The biggest concern here is security. Right now, for the write actions that we allow, we use a whitelist and an audit log. To improve security, some ideas have been floated, including private messages asking for a two-factor auth code or sending the requester to a Google OAuth page. But given the time constraints of the campaign, if it can’t be built in an hour or so, it usually gets set aside in favor of something higher priority.
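The whitelist-plus-audit-log approach can be sketched in a few lines. This is an illustration under assumptions rather than our production code: the `ALLOWED` mapping, the user names, and `run_write_command` are hypothetical, and the audit log here is an in-memory list where a real bot would write to durable storage.

```python
import json
import time
from typing import Callable, Dict, List, Set

# Hypothetical whitelist: command name -> chat users allowed to run it.
ALLOWED: Dict[str, Set[str]] = {"ban": {"alice", "bob"}}

# In-memory audit log for the sketch; a real bot would use durable storage.
audit_log: List[str] = []


def run_write_command(
    user: str, command: str, argument: str, action: Callable[[str], str]
) -> str:
    """Run a write action only for whitelisted users, logging every attempt."""
    allowed = user in ALLOWED.get(command, set())
    # Record the attempt before acting, so denied requests are logged too.
    audit_log.append(
        json.dumps(
            {
                "ts": time.time(),
                "user": user,
                "command": command,
                "argument": argument,
                "allowed": allowed,
            }
        )
    )
    if not allowed:
        return f"{user} is not allowed to run '{command}'"
    return action(argument)
```

Logging denied attempts alongside successful ones is a deliberate choice in this sketch, since a failed write request in a shared chat room is often the more interesting audit entry.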
So, despite how hilarious the term #chatops is, we use it daily. It provides visibility for our entire organization into what is going on with our infrastructure. If this sort of environment interests you, we’re hiring! https://hillaryclinton.com/tech