SRE Effectiveness Needs Slack Integration

What if I’d had SlackOps Earlier in my Career?

Learning from Historic Operations Battles

Mark Williams
SlackOps Magazine

--

Like Slack’s rise to prominence in the last four years, my Ops Leadership role at Zynga spanned its most rapid pace of market share acquisition, and as a result, the infrastructure we wielded became one of the largest footprints of hybrid (public and private) cloud technology of its time. The communication tools of this period (2008–2012) were never scalable enough for Zynga, and these inadequacies often led to my least favorite situation…learning about an incident from my customer instead of being alerted by my team. That is embarrassing and stressful, and without the right tools it can become a pattern that leads your customer to lose confidence.

Looking at how Slack has become the de facto standard today, I’m replaying some of those historic Ops battles back through my head and seeing the way things could have been with Slack.

I first go back to what we in Ops all learned together and prioritized as an organization. Ops Imperatives:

  • K.I.S.S. (Less is more).
  • Simplicity scales.
  • Work smarter, not harder.
  • Good news should travel fast. Bad news should travel faster.
  • You can’t fix what you don’t measure.

Through the lens of these imperatives, Slack scores well on each front.

Collaboration: Keep It Simple Stupid

We had to have real-time communications, but with so many chat tools to choose from, we couldn’t count on different teams and studios to be on the same chat platform. Now, that’s simple: Slack is widely used in the Enterprise, and the problems of scaling chat channels, multiple threads per user, and optimizing for signal over noise have been largely conquered through the flexibility that Slack’s app and platform offer. It harmonizes exceptionally well with Ops groups serving Cloud, IT, Development, Financial Services, and Professional Services.

The ability to integrate disparate tools and systems into the Slack interface is where one might ask, “Is Slack providing something really ‘Simple’ here?” Absolutely, it is! If one platform or tool can replace multiple disparate systems (like email) or make a smattering of different browser-based tools accessible without having to venture away from the Slack interface, then you’ve delivered simplicity. Switching context out of Slack to go get something or look something up to bring back to the conversation consumes valuable time. Instrumenting Slack with bots and add-ons to automatically bring that information to the thread preserves the focus where it needs to be — in the incident.

Less is more applies here too. Imagine fewer distractions from other apps trying to pull your focus elsewhere with notifications. Some companies have abandoned email altogether and replaced it with Slack. The ability to mute specific threads and stay in the zone with “Do not disturb” helps preserve that precious focus and yields better problem solving.

Collaboration Automation: Simplicity Scales

Slack crushes it when it comes to reliably handling multiple concurrent users on its platform. Orthogonal to that is another vector of simplicity that Slack has designed into its Slack Apps architecture. Opening the APIs and hooks to let others integrate their wares with Slack provides scalability through flexibility. Existing collaborative tools and automation now have a shared space where they can be summoned, and once again focus is preserved because the information from these connected tools is consumed within the app.

The Apps capability coupled with the multicast nature of Slack channels affords the next level of scalability with bots designed to respond on behalf of the human users. These Collaborative AI tools are the onramp to providing the most mature capabilities to the incident response process. This dovetails nicely with the next imperative…

DevOps: Work Smarter, Not Harder

The time it takes a human responder to switch out of Slack to fetch information from a different tool and come back isn’t just a measure of the typing and clicking done elsewhere in the OS. That time outside of the Ops channel adds up, and jumping between tools and browsers can stress participants out. It’s also a measure of how competent that user is with each of those disparate tools. Compare two cases. In the first, a user ventures out to a browser, finds the resource or tool they were thinking of, captures some artifact (e.g. with copy and paste), and returns to the channel to resume the dialog. In the second, a Slack App with a bot is instrumented to summon that same tool and provide the information in the channel. Bots like this can be summoned on demand through a command-line request, or they can use Machine Learning to detect relevant dialog in the channel and engage automatically based on past observations.
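To make the on-demand case concrete, here is a minimal sketch of how a bot might route a typed command to a tool. It deliberately avoids any real Slack API calls; the command names (`graph`, `oncall`), the handler functions, and the reply strings are all hypothetical placeholders, not part of any actual product.

```python
from typing import Callable, Dict

# Hypothetical handlers: a real bot would call the team's own tools here.
def graph_handler(arg: str) -> str:
    return f"Latest graph for {arg}: <link placeholder>"

def oncall_handler(arg: str) -> str:
    return f"On-call lookup requested for team {arg}"

# Registry mapping a command word to the function that serves it.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "graph": graph_handler,
    "oncall": oncall_handler,
}

def known_commands() -> str:
    return ", ".join(sorted(HANDLERS))

def dispatch(text: str) -> str:
    """Route command text such as 'graph web-01' to the matching handler."""
    parts = text.strip().split(maxsplit=1)
    if not parts:
        return "Usage: <command> <argument>. Known commands: " + known_commands()
    command = parts[0].lower()
    argument = parts[1] if len(parts) > 1 else ""
    handler = HANDLERS.get(command)
    if handler is None:
        return f"Unknown command '{command}'. Known commands: " + known_commands()
    return handler(argument)

if __name__ == "__main__":
    print(dispatch("graph web-01"))
```

The registry keeps the barrier to adding a tool low: publishing a new capability to the channel is one function plus one dictionary entry.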

Ops teams excel at tool innovation. DevOps culture effectively mandates that you never do the same thing manually twice. Making a new automated solution available to a wider audience is often the challenge. Where should it live? Does it have to align or integrate with existing tools or a product roadmap? In a fast-paced, scaling Ops environment, a lower barrier to publishing a useful tool removes a lot of friction and provides a platform for continuous improvement across the whole team. Slack’s extensibility is a catalyst for making home-brewed tools work effectively for others in the workspace.

SRE: Good news should travel fast. Bad news should travel faster

This one is simple. There are no long-term career opportunities for those that hide the bad news. Being professional and forthright as an Ops employee builds and sustains trustworthiness. That may be all well and good, but when “it” hits the fan, how can you cover all the communication requirements accurately and quickly? In our Zynga days, the SRE would maintain a document of phone numbers in a decision tree and start war dialing. Today, PagerDuty and similar offerings systematize this decision tree, and as you might have guessed, Collaborative AI apps are easy to instrument into Ops channels to not only answer the simple question of “Who’s on Call?”, but take the next step to summon PagerDuty to contact the person and connect them to the Slack channel of interest.
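The phone-number decision tree described above can be sketched as a simple escalation lookup. This is an illustration only: the services, the responder names, and the `who_is_on_call` function are hypothetical, and the sketch does not call PagerDuty or Slack, both of which would handle the actual paging and channel invitation in practice.

```python
from typing import Dict, List, Optional, Set

# Hypothetical escalation chains; service and responder names are placeholders.
ESCALATION: Dict[str, List[str]] = {
    "payments": ["alice", "bob", "ops-manager"],
    "game-servers": ["carol", "dave", "ops-manager"],
}

def who_is_on_call(service: str, unavailable: Optional[Set[str]] = None) -> Optional[str]:
    """Return the first reachable responder in the chain for a service.

    Walks the chain in order and skips anyone marked unavailable.
    Returns None if the service is unknown or nobody can be reached.
    """
    skip = unavailable or set()
    for person in ESCALATION.get(service, []):
        if person not in skip:
            return person
    return None

if __name__ == "__main__":
    print(who_is_on_call("payments"))
    print(who_is_on_call("payments", unavailable={"alice"}))
```

A bot answering “Who’s on Call?” would wrap a lookup like this and then hand the result to the paging service rather than leaving a human to dial.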

Learning: You can’t fix what you don’t measure

Ops organizations tend to be good at quantitatively assessing the performance of the organization and the assets it supports. When it comes to measuring downtime and its business impact, it’s table stakes to know the revenue at risk per second of downtime. SRE and DevOps teams must know that downtime can cost tens of thousands of dollars per minute, and measuring this helped rationalize the investment in tools that would help prevent or minimize such impact. The aforementioned optimizations in response time develop over time, and the most mature organizations tend to invest in automated solutions to eke out ever more of that increasingly precious time. AI/ML tools are often leveraged, and many can accelerate the identification of the help component and make patterns of human behavior more repeatable by suggesting workflows that ensure methodical and efficient engagement by the incident handlers.
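The revenue-at-risk arithmetic is simple enough to write down. The sketch below assumes a flat cost per minute; the figure of 20,000 dollars per minute and the 90 seconds of response time saved are made-up inputs chosen only to illustrate the calculation, not measurements from Zynga or anywhere else.

```python
def revenue_at_risk(cost_per_minute: float, downtime_seconds: float) -> float:
    """Estimate revenue at risk for an outage, assuming a flat cost rate."""
    if cost_per_minute < 0 or downtime_seconds < 0:
        raise ValueError("cost and duration must be non-negative")
    # Multiply first, then convert minutes to seconds, to limit rounding error.
    return cost_per_minute * downtime_seconds / 60.0

if __name__ == "__main__":
    # Hypothetical inputs: 20,000 dollars per minute, 90 seconds saved per incident.
    saved = revenue_at_risk(cost_per_minute=20_000, downtime_seconds=90)
    print(f"Revenue protected by responding 90 seconds sooner: ${saved:,.0f}")
```

Framing tool investment this way makes the comparison direct: the cost of building an integration is weighed against the seconds it removes from each incident.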

Parting Shot

I will make this quick. If we’d had Slack tied to our Ops and Incident Management applications back in those days, it would have helped improve our team’s performance by about 30%. Several of the tools that SRE developed for their own use could have found more useful homes tied into Slack, and the automation supporting critical processes would have shortened a lot of the repetitive work cycles. Going forward, I can tell you that Ops teams will be under more pressure to get work done, and unless you have cash to throw at the problem, I can’t imagine not collaboratively automating repetitive human tasks in Slack. This would be SlackOps.

--
