Jon Anhold
Published in Listen To My Story
3 min read · Feb 11, 2016

Troubleshooting Basics: 4 Steps to a Better DevOps Workflow

Troubleshooting technology, specifically production incident troubleshooting, requires an analytical mind. Identifying the cause, remediating the incident, and automating around the problem so it does not return are more art than science, and they come largely from experience rather than the classroom. Here are 4 steps to guide you through troubleshooting today’s technology and applying DevOps concepts to help smooth the way.

Defining The Problem

Before you start the troubleshooting process, take some time to evaluate how much you know about the incident.

  • Where is the problem you need to solve? Are you sure?
  • What other systems does the one you care about connect to or rely on? Are you sure?
  • Do you have data / documentation / monitoring to show it?
  • Do your tools give you a complete picture?

Make sure you can answer these questions before proceeding. Troubleshooting an unknown error as an independent, isolated issue without context could result in unnecessary churn, more work, and additional or prolonged downtime.

Infrastructure today is complicated. You need to be aware of all the moving parts to find where the issue may lie. This approach to incident troubleshooting gives you a more holistic understanding of the problem.
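The question of what a system connects to or relies on can be made concrete with a dependency map. The following is a minimal sketch, assuming you keep such a map somewhere; the service names and the `systems_in_scope` helper are hypothetical and only illustrate the idea of listing everything that could be involved before you start.

```python
from collections import deque

# Hypothetical dependency map: each service lists the systems it relies on.
DEPENDENCIES = {
    "storefront": ["checkout-api", "cdn"],
    "checkout-api": ["payments", "orders-db"],
    "payments": ["third-party-gateway"],
    "orders-db": [],
    "cdn": [],
    "third-party-gateway": [],
}


def systems_in_scope(service, dependencies):
    """Return every system the service relies on, directly or
    indirectly, in breadth-first order (closest dependencies first)."""
    seen = {service}
    order = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


if __name__ == "__main__":
    for name in systems_in_scope("storefront", DEPENDENCIES):
        print(name)
```

If the list that comes back surprises you, or you cannot build the map at all, that is a sign the "Are you sure?" questions above still need answers.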

Using Tools

It’s important to employ the right tools to help you with infrastructure monitoring. Monitoring and analytics solutions like Ruxit, Dynatrace and Sensu let you visualize events and data in real time to measure performance and detect issues. Choose wisely. The right monitoring tool depends largely on your business, your process, and your infrastructure.

Unfortunately, even the best tools have limitations. A developer can’t leave diagnosis entirely up to tools. This is where experience is key. Once a performance issue has been detected, you may need to investigate further to find the root cause.

  • Is it your service or a third party service causing the issue?
  • Was there new code deployed?
  • Are there any recent environmental changes?
  • How scalable are your tools and apps? Are they running optimally?
  • Does data seem to be missing?

All these scenarios need to be considered and accounted for through automation, with added support from your own investigative efforts.
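Questions like "Was there new code deployed?" and "Are there any recent environmental changes?" are good candidates for automation. Here is a small sketch, assuming deploys and configuration changes are recorded with timestamps; the change log entries, field names, and the `changes_before_incident` helper are made up for illustration.

```python
from datetime import datetime, timedelta


def changes_before_incident(changes, incident_start, window=timedelta(hours=2)):
    """Return changes that landed within `window` before the incident
    started, newest first."""
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c["at"] <= incident_start]
    return sorted(recent, key=lambda c: c["at"], reverse=True)


# Hypothetical change log; in practice this might come from a deploy
# tool or a configuration management system.
CHANGES = [
    {"at": datetime(2016, 2, 10, 9, 0), "kind": "deploy", "what": "checkout-api release"},
    {"at": datetime(2016, 2, 10, 13, 30), "kind": "config", "what": "load balancer timeout change"},
    {"at": datetime(2016, 2, 10, 14, 10), "kind": "deploy", "what": "storefront release"},
]

if __name__ == "__main__":
    incident_start = datetime(2016, 2, 10, 14, 30)
    for change in changes_before_incident(CHANGES, incident_start):
        print(change["at"], change["kind"], change["what"])
```

A change that lands shortly before an incident is only a suspect, not a verdict. The list narrows where to look first; your own investigation still has to confirm the cause.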

Showing Your Solution

Once you’ve detected and diagnosed the problem, it’s time to fix it. But remember, fixing a problem isn’t a one-and-done matter. After you find your solution, it has to be applied and tested in real time, and what you learned has to be documented. There are a few questions you should be asking yourself:

  • Can you reproduce the test results?
  • Does your solution work in different environments?
  • Will the service behave differently after changes take effect?
  • Is this desirable?
  • How is the service performing for real customers?

Your solution should produce positive test results on all tiers of the system, in various environments, while maintaining or improving service quality. Your solution is only successful if everything is running optimally and the end users are happy.

Showing Your Work

Show your work. This won’t be the last time you see this particular problem. Fix it in such a way that you minimize the chances of it recurring, and document it in such a way that if it does, it’s easier and faster to resolve.

  • What did that error message really mean?
  • Did you write it down?
  • Did you write it down where other people can find it?
  • Would some additional automation prevent this issue in the future? (This is something we have done at Razorfish and Rosetta with our productized offerings for DevOps and Managed Services with AEM, WebSphere Commerce and Hybris on the cloud.)
  • Did your monitors catch it? Why not?

Tracking your progress while you’re solving an issue with your technology is crucial. This way you can give an accurate account of how you reached your solution, even after it’s been applied and integrated and everything is back to normal.

You and other members of the DevOps team will have a storied history of your environment’s challenges, solutions, and impact. You can also better anticipate connected or future performance issues or incidents, and address them proactively.
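Writing things down where other people can find them works best when the record has a consistent shape. The sketch below is one possible shape, not a prescribed format; the `IncidentRecord` fields, the sample entry, and the `find_similar` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    title: str
    error_message: str  # what the error message said
    root_cause: str     # what it really meant
    fix: str            # what resolved it
    follow_ups: list = field(default_factory=list)  # automation or monitoring gaps to close


def find_similar(records, error_text):
    """Return past incidents whose recorded error message overlaps the
    new error text (case-insensitive substring match in either direction)."""
    needle = error_text.strip().lower()
    if not needle:
        return []
    matches = []
    for record in records:
        known = record.error_message.strip().lower()
        if known and (known in needle or needle in known):
            matches.append(record)
    return matches


# Hypothetical history with a single past incident.
HISTORY = [
    IncidentRecord(
        title="Checkout timeouts",
        error_message="connection pool exhausted",
        root_cause="Pool size too small for peak traffic",
        fix="Raised pool size and added a saturation alert",
        follow_ups=["Alert when pool usage stays above 80%"],
    ),
]

if __name__ == "__main__":
    for past in find_similar(HISTORY, "ERROR: Connection pool exhausted after 30s"):
        print(past.title, "->", past.fix)
```

Substring matching is deliberately crude here. The point is that a past incident is only useful if the next person can find it from the error in front of them.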

DevOps is still a new concept. It means different things to different people. But this lean approach to cross-functional development and operations also calls for new thinking. Color outside the lines of outdated textbook procedures to engage all levels and all parts of your technology with a customized combination of automation and expertise as unique as your or your client’s business.

Jon Anhold
Listen To My Story

dad, geek, hamradio, linux, golf — VP, Technology at Rosetta / Razorfish. Opinions are my own.