(Especially if you’re On-Call)
In its most basic form, ChatOps is nothing more than adding context to the persistent “group chat” conversations that take place throughout the day about a team’s operations.
Each team has its own roles and responsibilities, but through modern software delivery philosophies such as Agile and DevOps, teams are finding real benefits and efficiencies simply by collaborating and sharing more about what takes place. By sharing the conversations, the context, and (where it makes sense) the actions operators take on a variety of tasks, they build greater awareness across teams, locations, and organizations.
For on-call teams aiming to provide high availability and uptime, moving conversations and context into “chat” can dramatically reduce Mean Time To Repair (MTTR).
It is now widely accepted that failure is inevitable and cannot be engineered out of complex systems. Newer technologies and best practices can minimize the impact on end users, but the overall focus is now on reducing the time it takes to recover from a disruption rather than attempting to remove the possibility of a disruption altogether.
The fastest way to repair services is to ensure that the right person is found as quickly as possible, that they are given information that is actually actionable, that they are given supporting information and context about the disruption, and that they can collaborate and take action with other domain experts on their team.
A ChatOps approach means moving information from anomaly detection services into a common area (timeline or chat room) where everyone has the same visibility into what is happening. It means allowing operators to share what they are seeing AND doing in that same common area.
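As a rough illustration of the “common area” idea, here is a minimal sketch in Python. The `Entry` and `Timeline` names, the sample alert, and the operator names are all hypothetical and are not taken from any particular product; the point is only that alerts, messages, and actions land in one ordered record that everyone reads.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Entry:
    source: str  # e.g. "monitoring" or an operator's name (hypothetical values)
    kind: str    # "alert", "message", or "action"
    text: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Timeline:
    """A shared, append-only record that every participant reads."""

    def __init__(self):
        self._entries = []

    def post(self, source, kind, text):
        entry = Entry(source, kind, text)
        self._entries.append(entry)
        return entry

    def entries(self):
        # Everyone gets the same entries in the same order.
        return list(self._entries)


timeline = Timeline()
timeline.post("monitoring", "alert", "disk usage above 90% on db-01")
timeline.post("alice", "message", "Looking at db-01 now")
timeline.post("alice", "action", "rotated logs on db-01")

for e in timeline.entries():
    print(f"[{e.kind}] {e.source}: {e.text}")
```

In a real deployment the timeline would be the chat room or incident timeline itself; the in-memory list simply stands in for it.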
This can be automated through features like the VictorOps “Transmogrifier”, or it can be as simple as an operator copying and pasting images, graphs, or links to logs and additional resources into the common area. More advanced teams that have seen a dramatic reduction in MTTR have even implemented bots that operators drive from the common area, effectively making the timeline or chat room behave very much like a terminal Command Line Interface (CLI).
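To make the “chat room as a CLI” idea concrete, here is a minimal sketch of how a bot might turn chat messages into commands. Everything in it is an assumption for illustration: the `!` prefix, the `status` and `restart` commands, and the `handle_message` function are hypothetical, and the handlers only return text rather than touching real services.

```python
import shlex

HANDLERS = {}  # command name -> function (hypothetical registry)


def command(name):
    """Register a function as a chat command."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@command("status")
def status(service):
    # Placeholder: a real handler would query the service.
    return f"{service}: status check requested"


@command("restart")
def restart(service):
    # Placeholder: a real handler would call a vetted script.
    return f"{service}: restart requested"


def handle_message(user, text, room):
    """Treat messages starting with '!' as commands and echo the result to the room."""
    if not text.startswith("!"):
        return None  # ordinary conversation, not a command
    try:
        parts = shlex.split(text[1:])
    except ValueError:
        parts = []  # e.g. an unbalanced quote
    if not parts:
        return None
    name, args = parts[0], parts[1:]
    handler = HANDLERS.get(name)
    if handler is None:
        reply = f"unknown command: {name}"
    else:
        try:
            reply = handler(*args)
        except TypeError:
            # Crude check for a wrong number of arguments; fine for a sketch.
            reply = f"usage error for command: {name}"
    # Both the command and its result are visible to everyone in the room.
    room.append(f"{user} ran: {text}")
    room.append(f"bot: {reply}")
    return reply


room = []
handle_message("alice", "!restart web-01", room)
handle_message("bob", "thanks, watching the graphs", room)
print("\n".join(room))
```

The command and the bot's reply are both appended to the room, so running a command and telling the team about it are the same step.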
When teams have reached this point of efficiency and sophistication, they find that running commands from the timeline or chat room (through the use of a bot) provides not only the quickest path to resolution, but additional by-product benefits.
First and most importantly, the tasks to repair a service are completed quickly without the operator having to leave a single, common interface (timeline or chat). The execution of commands and the results of those actions are immediately echoed to everyone at the same time. This means that as operators perform actions, they are immediately communicating what they are doing to the rest of the team. Everyone is aware in real time of what the operator is doing and whether it is working.
Operators don’t have to take the extra time to communicate back to everyone (in a different interface) what step they are on, what they are seeing, and whether their actions are making a positive impact. Because everyone is able to see the same context, conversations, and actions, others on the team who have joined the firefight have real-time awareness of what is taking place, can jump in to support without duplicating efforts, and can offer additional thoughts on how to mitigate and repair.
Everyone involved knows what was done and how. Knowledge sharing is maximized. Junior members of the team, as well as seasoned veterans, have all just taken part in a tutorial on how to solve that type of problem.
Learning is an extremely important component that is often overlooked by IT teams, and ChatOps provides a very natural way for teams to learn from each other how to “get work done”, even in the most stressful scenarios.
Because chat bots have limited capabilities and can only run scripts that someone has given them explicit privileges to execute, the possibility of running a bad command from a CLI is largely removed. This means efforts to repair services from a timeline or chat room (through the use of a bot) are much more secure than an operator accessing a host with ‘root’ or ‘sudo’ privileges and possibly causing more harm than good. All executed commands are logged in the timeline and everyone is completely aware of who is doing what. Compliance control and action logging are baked in without any additional effort.
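The two properties described above, explicit privileges and built-in logging, can be sketched in a few lines of Python. This is only an illustration under assumptions: the `ALLOWED` table, the user names, and the `authorize_and_log` function are hypothetical, and a real bot would tie this to its chat platform's identity and permission model.

```python
from datetime import datetime, timezone

# Hypothetical allowlist: command name -> users permitted to run it.
ALLOWED = {
    "restart": {"alice", "bob"},
    "status": {"alice", "bob", "carol"},
}

audit_log = []  # stands in for the shared timeline


def authorize_and_log(user, command, target):
    """Return True only if the command is allowlisted for this user; log every attempt."""
    permitted = user in ALLOWED.get(command, set())
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "command": command,
        "target": target,
        "permitted": permitted,
    })
    return permitted


print(authorize_and_log("carol", "status", "web-01"))   # True
print(authorize_and_log("carol", "restart", "web-01"))  # False: not on the list
print(authorize_and_log("alice", "rm", "web-01"))       # False: unknown command
```

Anything not on the allowlist is refused by default, and refused attempts are recorded alongside successful ones, so the log shows who tried what as well as who did what.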
At the end of the day, reducing the time it takes to recover from an incident is the goal.
By collaborating as much as possible on not only the conversations but also the context and actions from a common timeline, everything happens much more quickly.
Timelines and persistent group chat rooms are the interfaces where everything we do takes place, and that shortens every feedback loop so that tasks (good and bad) can be accomplished as fast as possible.