Trust the Process: A Look at Handling Production Issues

James Louie
Pragmatic Programming
10 min read · May 27, 2019


Today’s systems are complex. Distributed, data-driven, continuously deployed systems can go wrong at any point, at any hour of the day. Even the greatest organizations experience issues that escape suites of integration tests, QA team approvals, and advanced release mechanisms. These can be some of the most chaotic times for development, product, and management because of the pressure and the impact on the business. One popular saying is that when a production issue occurs, it’s the development team’s responsibility to “put out the fire”. The saying is apt: as a developer, it can feel exactly like that when you get CC’ed on an email in which your boss’s boss’s boss says your team’s system is causing the whole website to crash.


But it doesn’t have to be that way. Just as we are trained to follow a set of procedures for natural disasters, knowing the process beforehand greatly increases the effectiveness of recovery efforts. When teams know exactly how to react to production issues, they are more likely to resolve them quickly and effectively.

When we think about production issues, the first thing we think about is fixing the issue, but that’s only the tip of the iceberg. If an organization is constantly putting out fires, it won’t have time to devote to building business value, and the organization comes to a halt. The most important areas of improvement come before and after the issue has been fixed; they determine how effective you are at reacting to issues and how well you can prepare your organization for the future.

The process can be summarized in six steps:

  1. Design: Design your system for visibility
  2. Notify: Implement a mechanism to notify the right people
  3. Prepare: Define a strategy for how your organization will respond
  4. Assess: Evaluate the severity of the issue and who is responsible for it
  5. Fix: Resolve the issue
  6. Review: Review current processes and designs to determine preventative actions

Step 1: Design

Design is the first step of the process: the development team enables visibility into the inner workings of the system. Before we can fix an issue, we need to know that it is happening. We need to be able to see when errors occur, what errors occur, and how often they occur to determine whether there really is an issue and whether someone needs to be alerted.

There are several strategies we can use to expose the health of our system:

  • Structured logs: Logs in a consistent format can be monitored for errors. For example, we may have a consistent level field that defines the severity of the log: whether it is a standard information level or a critical level that needs to be addressed immediately. The fields will depend on your organization’s strategy for handling errors, but some common fields are correlation id (Rapid7 blog), service name, environment, and team ownership.
  • Centralized logging: Following on from structured logging, aggregate the logs of all your systems into one location, where you can write queries to compute advanced metrics about your system. This also allows you to define queries that monitor the health of all systems and create alerts.
  • Custom reporting: Sometimes logs don’t have enough information, or don’t live long enough, to give you the statistics you need. You may find yourself building custom application logic to generate reports that use the raw or transformed data to identify issues. Some examples would be variance over time or business-heavy data transformations.
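As a rough illustration of the structured-log idea, here is a minimal sketch using Python’s standard logging module. The field names (correlation_id, service, environment, team) and the JsonFormatter and build_logger helpers are assumptions made for this example, not a prescribed format; your organization would agree on its own fields.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields; use whatever your organization agrees on.
    CONTEXT_FIELDS = ("correlation_id", "service", "environment", "team")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            # Values passed via `extra=` become attributes on the record;
            # a field that was not supplied is emitted as null.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes structured JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.critical(
        "payment provider timeout",
        extra={
            "correlation_id": "abc-123",
            "service": "checkout",
            "environment": "prod",
            "team": "payments",
        },
    )
```

Because every line carries the same keys, a centralized log store can filter on level or group by service without parsing free-form text.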

Step 2: Notify

Notify is the second step: identifying when issues occur, based on the data provided in the Design step, and notifying the correct people. The goal should be a consistent notification mechanism, regardless of where the data originates (logs or reporting). This streamlines the processes downstream and allows for more effective handling, because it gives the issue handlers a consistent view of the issue at hand.
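One way to picture a consistent notification mechanism is a single notification shape that every data source is converted into. The sketch below is only an illustration: the Notification class, the input dictionaries, and the 50% deviation threshold are all hypothetical choices, not something the process prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    """The one shape every handler sees, whatever produced the alert."""
    source: str    # "logs" or "report"
    service: str
    severity: str
    summary: str


def from_log_entry(entry: dict) -> Notification:
    """Convert a structured log entry (hypothetical keys) into a notification."""
    return Notification("logs", entry["service"], entry["level"], entry["message"])


def from_report_row(row: dict) -> Notification:
    """Convert a custom report row into a notification.

    Assumes the row compares an observed metric to a nonzero expected value.
    The 50% cutoff for CRITICAL is an arbitrary example threshold.
    """
    deviation = abs(row["observed"] - row["expected"]) / row["expected"]
    severity = "CRITICAL" if deviation > 0.5 else "ERROR"
    summary = f"{row['metric']} deviates {deviation:.0%} from expected"
    return Notification("report", row["service"], severity, summary)
```

Whatever sends the alert (email, paging, chat) then only needs to understand Notification, not each individual log format or report.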


Some common solutions that organizations use may be:

  • Email: This is common when you’re first setting up your system and want a minimal implementation to get the notification out. Typically the email goes to a mailing list based on the issue, and is then delegated to the handler. The key to a good notification email is the consistent structure we went over in the Design step. Use an HTML email template to create an easy-to-read report that covers the initial findings, so that the issue can be delegated easily and fixed in a timely manner.
  • Incident management software: Many products offer fully integrated incident management systems with the core features that a NOC (Network Operations Center) team would manage, such as history tracking, on-call tracking, incident escalation, and many custom integrations (phone, text, task tracking).
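For the email option, a templated message might look like the following sketch, built with Python’s standard email and string modules. The alert keys, the sender address, and the template layout are assumptions for illustration; actually sending the message (for example with smtplib) is left out.

```python
from email.message import EmailMessage
from html import escape
from string import Template

# Hypothetical template; the placeholders mirror the structured log fields.
HTML_TEMPLATE = Template("""\
<h2>[$severity] $service</h2>
<p>$summary</p>
<ul>
  <li>Environment: $environment</li>
  <li>Correlation id: $correlation_id</li>
</ul>
""")


def build_alert_email(alert: dict, mailing_list: str) -> EmailMessage:
    """Build a plain-text plus HTML alert email from a structured alert."""
    # Escape values so log content cannot inject markup into the HTML part.
    safe = {key: escape(str(value)) for key, value in alert.items()}

    message = EmailMessage()
    message["Subject"] = f"[{alert['severity']}] {alert['service']}: {alert['summary']}"
    message["From"] = "alerts@example.com"  # placeholder sender
    message["To"] = mailing_list
    # Plain-text fallback for mail clients that do not render HTML.
    message.set_content(
        f"[{alert['severity']}] {alert['service']}: {alert['summary']}\n"
        f"Environment: {alert['environment']}\n"
        f"Correlation id: {alert['correlation_id']}\n"
    )
    message.add_alternative(HTML_TEMPLATE.substitute(safe), subtype="html")
    return message
```

Keeping severity and service name at the front of the subject line lets the mailing list be filtered or triaged without opening the message.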

Step 3: Prepare

Prepare is the third step: getting your organization ready for when issues happen by defining your strategy and structuring your organization. From an organizational perspective, this is the most important step, and it will determine how effective the entire process is. You can have the best tooling and architecture, but if your organization is unprepared to handle an issue, many more things can go wrong: communication can break down, and the issue can go unresolved.


Defining your Strategy

This will vary between organizations, and each organization will have different strengths and weaknesses that lead to different strategies. The takeaway from this section is to formalize the process with your organization and make sure that your solution covers the following:

  • Decide on a structured log format that your organization will follow
  • Decide on the tooling and notification system that your organization will use
  • Decide how your organization will respond to different severity levels, and who will get involved. Does the director really need to be notified if the log level is only error?
  • Decide how cross-team issues will be handled. How will communication channels be set up? Who leads the discussions?
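The severity decision in particular is easy to write down as data. The sketch below shows one hypothetical escalation policy; the level names and recipient roles are placeholders, and the real mapping is whatever your organization agrees on.

```python
# Hypothetical escalation policy: log level -> roles to notify.
ESCALATION_POLICY = {
    "WARNING": ["owning_team_channel"],
    "ERROR": ["owning_team_on_call"],
    "CRITICAL": ["owning_team_on_call", "engineering_manager", "director"],
}

# Fallback for levels the policy does not mention (an assumption of this sketch).
DEFAULT_RECIPIENTS = ["owning_team_on_call"]


def recipients_for(level: str) -> list[str]:
    """Return the roles to notify for a given log level."""
    return list(ESCALATION_POLICY.get(level.upper(), DEFAULT_RECIPIENTS))
```

With a policy like this, an error-level log pages the on-call developer but leaves the director alone, which answers the question above explicitly rather than case by case.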

Structure your Organization

Currently there are two trains of thought: Ops and NoOps (Full cycle developers).

In an Ops approach, issue tracking starts with the Ops team (NOC), which then leads and executes the strategy. They can still reach out to the development teams, but they can also resolve issues via a runbook of procedures composed of historical fixes and development team knowledge.

Pros:

  • Centralizes responsibility of incident management, enabling them to act as coordinators between interested parties
  • Are dedicated resources for incident handling, and usually have more experience handling issues
  • Have a higher level view of the system and can see more of how the system is impacted to make judgments

Cons:

  • With constantly evolving systems, the Ops team doesn’t know when or what changes are being released, and therefore doesn’t know which change to the system may have caused the error.
  • Errors are usually only understood by the development team, which forces the Ops team to rely on the development team anyway to decipher what is going on.
  • Communication lag between Ops and development team may cause the issue to take longer to resolve.

In a NoOps approach, issue tracking starts with the developers, hence the alternative name “full cycle developer” at Netflix. There is a great presentation on this at InfoQ that goes over what it means, but in short, the developers are in charge of the full development-release-maintenance software life cycle. Issues are forwarded directly to the development team for which the alert is set up. This approach is becoming popular with the rise of continuous development strategies, where services are constantly being deployed and the complexity of the system becomes too much for one team (Ops) to handle. Ops teams don’t really have context on what the systems are doing, and understanding what is causing an issue usually falls to the developer. I highly suggest using a managed incident management service with this approach, as it will assist with many of the duties that a traditional Ops team would perform.

Pros:

  • More effective handling of issues, because the development team knows more about the system.

Cons:

  • Requires more involvement and discipline from the development team to be able to handle issues and follow processes.
  • Teams usually have only a narrow view of their own systems and don’t have a full view of which other systems are being impacted, which may lead to incomplete assessments.

In a hybrid approach, we can take the best of both worlds by having the development team be the first line of defense, similar to the NoOps approach, while an Ops team member plays a supporting role. This seems to make the most sense: the development team can handle most problems, but in some cases they don’t have the complete view and need the Ops team’s vision to look at the bigger picture and communicate with other teams.

Pros:

  • Agility of NoOps approach for most cases
  • Complete view of the system for better assessments

Cons:

  • More personnel burden to have people on-call

Step 4: Assess

Assess is the fourth step of collecting information about the issue. It is important to have the details correct and clear before deciding on any corrective action.

The staff that is handling the issue has three important questions to answer:

  1. What is the issue? What is happening to your system?
  2. How severe is the issue? How urgently does it need to be fixed?
  3. Who needs to be involved? Determine the scope of people who need to take part.

These questions rely on the work that we did in the first step, Design. We need proper visibility into the current behavior of the system to diagnose what is going wrong. With a highly visible system, you will be able to pinpoint the issue faster. With an agreed logging format, you have metadata about the error to make better decisions. Based on the log level, your organization can categorize the urgency of the bug. You probably don’t need to worry about a description not being loaded for a toolbar, but if your signup funnel breaks you’ll probably need all hands available. The log can also carry information about potential issues: if you are expecting data from an external team or source, that assumption can be called out in the log and relayed back to the external party so that they can resolve the issue instead.
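To make the link between log metadata and the three questions concrete, here is a small sketch that summarizes a batch of structured log entries. The entry keys (level, message, team) and the ranking of levels are assumptions for the example; a real assessment would also involve human judgment.

```python
from collections import Counter

# Assumed ordering of log levels, lowest to highest urgency.
LEVEL_RANK = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}


def assess(entries: list[dict]) -> dict:
    """Give a first-pass answer to: what is the issue, how severe, who is involved."""
    # Keep only entries at ERROR level or above; unknown levels are ignored.
    errors = [
        e for e in entries
        if LEVEL_RANK.get(e["level"], 0) >= LEVEL_RANK["ERROR"]
    ]
    if not errors:
        return {"issue": None, "severity": None, "involve": []}

    # What: the most frequent error message.
    issue, _count = Counter(e["message"] for e in errors).most_common(1)[0]
    # How severe: the highest level seen among the errors.
    severity = max((e["level"] for e in errors), key=LEVEL_RANK.__getitem__)
    # Who: every team that owns a service currently reporting errors.
    involve = sorted({e["team"] for e in errors})
    return {"issue": issue, "severity": severity, "involve": involve}
```

A summary like this is only a starting point for the assessor, but it shows why the team ownership and level fields are worth agreeing on in advance.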

Step 5: Fix

Step 5 is fixing the issue. Do what you do best, and get that problem resolved. Focus only on fixing the immediate problem. Don’t worry about the long-term solution. If the problem can be resolved by simply rolling back, do it! Your job is to get the system back to stability. Production is not meant to be a troubleshooting environment. If you have a better solution that would take some time, get the quick fix in, test your change in lower environments, and push it out once it has been verified. Introducing bigger changes while there is an existing issue in the environment is a recipe for disaster, potentially compounding the issue at hand.

Step 6: Review

Review is the last step of the process: going over the details of the issue and determining how to resolve similar issues more effectively in the future. The discussion should cover both the technical aspects and the process aspects of the issue.

From a technical perspective, what was the root cause, and how do we prevent this and similarly scoped issues from happening?

From a process perspective, was the process in place able to handle the issue, and what parts could be changed or added to make it more effective?

The review process should be approached with an incremental mindset: how do we improve what exists so that it will be better in the future? To be successful with this mindset, everyone must agree that problems need to be identified without blame. This will give you the most accurate depiction of the events that took place. When blame comes into play, you get ineffective communication, inaccurate recollections of events, and general distrust among all parties.

Who should attend?

  • The assessor
  • The fixer(s)
  • (If applicable) Cooperating resources/teams
  • Direct managers of those parties involved
  • (Optional) Depending on the severity of the issue, upper management can be involved — but usually the report and action items should be enough for them
  • (If applicable) If the assessor and fixer are the same person and there are no participating teams, it is a good idea to invite a third person from the team (in addition to the manager) to get more feedback.

What should be the outcome?

From this meeting, the group should come up with a set of action items for improving the design and process. These items should be logged with whatever issue tracking software your team uses with some level of priority in relation to the impact it caused. This would be where you can plan to enact larger scale efforts to fix the greater issue you may have passed up in the Fix step.

Documentation should be created to detail the issue, how it was resolved, and what action items were generated from it. This should be made available to the organization, and be distributed to the relevant team(s) and upper management. It is important for the team to know of the issue for contextual information, such as making design decisions, or knowing how to solve similar situations in the future. It is important for upper management to get this information because it gives them more insight into the status of the organization to determine what higher level decisions need to be made.

The Takeaways

  • No matter how large and advanced your organization may be, production issues are inevitable.
  • Define a process ahead of time to reduce confusion and miscommunication and to make your development teams more effective at fixing issues.
  • Iterate. Iterate. Iterate. There is no one solution that all organizations can follow to fix issues. Always look to see what you can improve in your design and processes to reduce the future number of issues.
  • Don’t play the blame game. Things happen, but it is in everyone’s best interest if they can speak honestly without judgement, so that the organization can move past this and be better for it.
