How to diagnose incidents (and sleep better while on call)

Published in relay-sh · 5 min read · Jun 23, 2020

Incident management is an always-on responsibility, but after your last sprint, you deserve a long rest and peace of mind. Your incident response plan may be the furthest thing from your mind, and if it’s well configured, you can finally sleep soundly. A robust incident response plan gives your organization the coverage and confidence to identify emerging security threats and rapidly execute the appropriate response. In this post, we’ll cover detection and communications workflows that will enhance your incident response program and take some stress out of your on-call cycle.

We’ll assume you’ve done some preparation to create an action plan with stakeholders and you have some detection methods in place, but there’s always plenty of room to streamline, especially around reporting and team collaboration.

Let’s drill down into the first response detection and communication duties of an action plan. Even better, we’ll look at how to optimize them with an event-driven approach — the sort you can get from Relay’s DevOps automation.

Optimize Response Plans with Smart Monitoring

A lot can go wrong in the time between an unexpected login and disabling the suspicious account involved. One of the foremost objectives in detecting and acting on a threat is limiting that amount of time and thereby limiting the risk. In this case, the key to fast remediation is to recognize abnormal network behavior and classify its potential impact. A versatile tool for assessing both the suspicious login and its inherent danger is an incident classification matrix. Such matrices are useful artifacts of your preparation plans: they layer richer intelligence into your action plan and provide data points that can act as conditional triggers to drive resolution.

You can classify threats by various taxonomies, such as likelihood, severity, or type. Taking severity as an example, maybe the abnormal login originated in Germany, which is outside of the normal range of user activity, but not too far out of range for some traveling consultants in your company. Tag a login from Germany as a minor risk with "possible" likelihood.

But if it came from an infrequently visited destination, tag it as an incident of unlikely or rare likelihood, with potentially higher impact, to inform the next steps. If you created the matrix with stakeholders in mind, events categorized this way can trigger appropriately scoped remediation plans and reduce the time spent dwelling on assessment.
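To make the matrix concrete, here is a minimal Python sketch of the login example. Everything in it is an assumption for illustration: the country sets, the likelihood and impact labels, and the `low`/`high` severity names are hypothetical placeholders that would come from your own planning sessions, not from Relay.

```python
from typing import Optional, Tuple

# Hypothetical country sets; real values come from your own login history.
EXPECTED_COUNTRIES = {"US"}
OCCASIONAL_COUNTRIES = {"DE"}  # e.g. traveling consultants

# (likelihood, impact) -> severity label that later response steps can key on.
SEVERITY_MATRIX = {
    ("possible", "minor"): "low",
    ("possible", "major"): "high",
    ("rare", "minor"): "low",
    ("rare", "major"): "high",
}


def classify_login(country: str) -> Optional[Tuple[str, str]]:
    """Return the (likelihood, impact) cell for a login, or None if it is expected."""
    if country in EXPECTED_COUNTRIES:
        return None
    if country in OCCASIONAL_COUNTRIES:
        return ("possible", "minor")
    # Anywhere else is treated as a rare origin with potentially higher impact.
    return ("rare", "major")


def severity_for_login(country: str) -> Optional[str]:
    """Look up the severity label for a login's matrix cell."""
    cell = classify_login(country)
    return None if cell is None else SEVERITY_MATRIX[cell]
```

Because the matrix is plain data, stakeholders can review and adjust it without touching the classification logic.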

Know Your Audience

A communication plan is another important piece of guidance for incident response management. If properly designed, it accounts for all need-to-know parties and assigns them to different stages of response. Not only does that encourage buy-in, but it also reduces noise and confusion during a potential incident by broadening communications only when necessary. With different incident and severity types mapped to specific response teams and communication channels, much less time is spent working out which communications manager to inform. Moreover, the notification system can be moderated from blaring midnight phone calls to targeted (yet emphatic) pings on Slack.

Imagine that the security engineer on call deems a mysterious login suspicious enough to warrant alarm. They need to activate a response team with a single notification. This is where binding specific names to specific communication channels shortens the time it takes to get started. Group these response teams into dedicated channels in your internal messaging tool, such as Slack.
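One way to picture that binding is a small lookup table from severity to a channel and its members. The sketch below is only an illustration: the channel names, member handles, and severity labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResponseTeam:
    channel: str
    members: Tuple[str, ...]


# Hypothetical communication plan: severity label -> who hears about it, and where.
COMMUNICATION_PLAN = {
    "low": ResponseTeam("#sec-triage", ("oncall-security",)),
    "high": ResponseTeam(
        "#sec-incident",
        ("oncall-security", "security-lead", "comms-manager"),
    ),
}


def team_for(severity: str) -> ResponseTeam:
    """Return the team to notify; unknown severities fall back to the broadest team."""
    return COMMUNICATION_PLAN.get(severity, COMMUNICATION_PLAN["high"])
```

Falling back to the broadest team for an unrecognized severity is a design choice in this sketch: it trades some noise for the assurance that nothing goes unannounced.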

Better yet, if you’ve classified events by severity, as described above, the event’s classification can serve as a trigger to automatically generate a Slack channel for the invitees dictated by the communication plan. Then, streamline the plan even further through customizations like Slackbots that enforce an update cadence or extend more invites if the incident remains unresolved beyond a critical threshold.
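The update cadence and escalation threshold boil down to two timers. Here is a rough Python sketch of the decision a bot might make each time it wakes up; the 30-minute cadence, the two-hour threshold, and the action names are assumptions chosen for illustration, not Slack or Relay features.

```python
from datetime import datetime, timedelta
from typing import List

# Hypothetical values; your communication plan would set these per severity.
UPDATE_CADENCE = timedelta(minutes=30)
ESCALATION_THRESHOLD = timedelta(hours=2)


def next_actions(
    opened_at: datetime,
    last_update_at: datetime,
    now: datetime,
    resolved: bool = False,
) -> List[str]:
    """Return the follow-up actions a bot would take at time `now`."""
    if resolved:
        return []
    actions = []
    # Nudge the channel if nobody has posted an update within the cadence.
    if now - last_update_at >= UPDATE_CADENCE:
        actions.append("request_status_update")
    # Widen the audience once the incident has been open too long.
    if now - opened_at >= ESCALATION_THRESHOLD:
        actions.append("invite_escalation_contacts")
    return actions
```

Keeping the decision a pure function of timestamps makes it easy to test without waiting for real time to pass.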

Streamline Your Response Plan with Modular Tooling

Now you have some ideas to take into your next incident response planning session. Make plans to score potential threats and organize them into a reference matrix, then group your response teams according to those classifications. Remember to examine your current DevOps environment and consider how to leverage these new diagnostic tools within your ecosystem. The important final step of incident remediation is to update and refine your response plan. Pick a tool that lets you easily update the order of steps and the services your plan requires.

Relay allows you to organize and adapt your incident response plan in a single, convenient and repeatable environment that integrates with the tools you already use. It enables workflows-as-code by translating the plan into YAML and sequencing it stepwise with the appropriate triggers. From there, each trigger leads to an automated set of scalable actions.

For example, here’s how a PagerDuty trigger can be configured in Relay:

triggers:
  - name: pagerduty
    source:
      type: webhook
      image: relaysh/pagerduty-trigger-incident-triggered
    binding:
      parameters:
        incidentTitle: !Data title
        incidentUrgency: !Data urgency
        incidentURL: !Data appURL
        serviceName: !Data serviceName

Run conditional actions based on incident severity categorization to trigger on-call Slack notifications:

steps:
- name: high-urgency-message
  image: relaysh/slack-step-message-send
  when: !Fn.notEquals [!Parameter incidentUrgency, low]
  spec:
    connection: !Connection {type: slack, name: my-slack-account}
    channel: !Parameter highUrgencySlackChannel
    username: PagerDuty via Relay
    # *message is a YAML alias; the &message anchor it refers to is not shown in this excerpt
    message: *message
- name: low-urgency-message
  image: relaysh/slack-step-message-send
  when: !Fn.equals [!Parameter incidentUrgency, low]
  spec:
    connection: !Connection {type: slack, name: my-slack-account}
    channel: !Parameter lowUrgencySlackChannel
    username: PagerDuty via Relay
    message: *message

Full example can be found here

If you find the triggers are too sensitive, clamp down on their actions by requiring manual approval for escalation patterns. Moreover, if your team switches from Slack to Teams, just swap the configurations and proceed as normal. Check out some sample workflow references and see how Relay can speed up your incident response practices.

Written by Adam DuVander