Production issues: the owl effect

Edwige Fiaclou
8 min read · Sep 13, 2021

As a Manager and Software Engineer with over 10 years of experience, I have seen numerous complex and problematic production issues. However, legacy systems are no different from any other system when it comes to identifying a problem.

The complexity of legacy systems is the driver that enables my team and me to better manage production outages. Over the years, my mindset has grown from a constant state of emergency management into a well-oiled outage-prediction machine: from implementing monitoring probes before developing fancy features, to having a recovery application take over after production issues. The sum of my experiences has proven useful to date, and today I would like to share a few of them with you.

Let us start with a few existential questions: can we really be experts? Are we fully aware of all the hidden blocks in our system?

As an engineer, your main goal is to design, develop, and deliver the projects that your clients and/or company desire. Your ideal state is to build a system that functions flawlessly. It is a path to value creation, to developing and deploying new functionality, and, of course, a way to justify your salary!

A — On the road to a good mindset

(Careful! The following paragraph may bring back bad memories for the most sensitive developers and support people.)

Do you remember that time when your system crashed? Or maybe that other time, when an unknown-unknown bug happened? Yes, you read that correctly: UNKNOWN. Let me share with you several feelings that some of us have already experienced.

The first thing you felt was STRESS, followed, maybe, by FEAR! Numerous questions then instantly popped into your head:

What are the impacts? Stakeholders/Clients — will they be unhappy? What will be lost? But how could it happen? Why have I not detected this before? How much is the company losing? Let’s check the logs! Look at the monitoring board! Dang, there’s nothing there!

Imagine that your system is like a car. When you look at the car bumper below, do you see anything out of place?

Tip: did you notice the OWL? 😉

The owl hidden in the car’s grille above is just like a small bug hidden in your system.

The irony of this kind of situation is that you cannot anticipate what you do not know. This exactly portrays my initial feelings when a crash happens: despite all the tests, validations, and checks that were completed, we did not detect the small bugs…

“The problem with experts is that they do not know what they do not know!”

Nassim Nicholas Taleb, The Black Swan: The Impact of the Highly Improbable

Keep calm: it is impossible to imagine something that you have never heard of before. When you created your system, you gained a false sense of confidence, the feeling that you were becoming an expert. You believed that your application was under control. On top of that, you were sure that no bugs could happen; nothing you could imagine was able to disturb the logic of your code!

The first step is to admit that we are not the experts we think we are, and the second is to build an incredible system to manage our mistakes. The cascading failure effect that happens in microservice-oriented systems can actually be prevented. So how do we get to this ideal state, you may be asking yourself? By creating a fully revved-up, kitted-out, incredibly robust recovery-mode system that will drive us right back to normality!

B — How to scale your application?

How do we eliminate the blind spots? What do we need to do to recover 20/20 vision?

In literature, writers work with blindspot matrices, and surprisingly, this methodology can also be applied to software development.

Our natural mindset is always focused on the comfort zone:

Blindspot Matrix (to orient strategic perspectives)

Our goal, however, was to push out of our comfort zone and into uncertainty in order to manage these blindspots via other known skills. See the matrix above for a visualisation.

  • Known weaknesses: help in the creation of a system with minimal client input requirements and no required understanding of application logs
  • Known strengths: help create strong systems with minimal required metrics
  • Unknown strengths: help to construct part of the uncertainty areas and definitely provide an indicator on what we do not know

The Blindspot matrix pushed me to analyze my needs more deeply and to shift my mindset towards prediction.

Based on the latest talk given by Pierre Vincent from Weareglofox, we learned that, to anticipate the red blindspot zone in the above matrix, the following three pillars of observability may prove useful (a minimal sketch follows the list):

  • Metrics: Grafana board, Prometheus, JMX; all these tools help you monitor the overall health of a system, and the aggregation of these metrics will provide you with the events that generated trouble or issues
  • Traces: CorrelationId for end-to-end overview
  • Logs: to better understand the process and story in your system
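To make these three pillars concrete, here is a minimal Java sketch of how they can show up together in a single handler. It assumes the Prometheus simpleclient and SLF4J are available; the metric name, the OrderHandler class, and the MDC key are illustrative choices, not taken from a specific system.

```java
import io.prometheus.client.Counter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.util.UUID;

public class OrderHandler {

    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    // Metrics pillar: a counter that a Grafana board can watch and alert on.
    private static final Counter orderFailures = Counter.build()
            .name("order_failures_total")
            .help("Number of orders that failed processing.")
            .register();

    public void handle(String orderPayload) {
        // Traces pillar: one correlation id follows the order through every log line.
        String correlationId = UUID.randomUUID().toString();
        MDC.put("correlationId", correlationId);
        try {
            // Logs pillar: the story of what the system did, step by step.
            log.info("Order received, size={} bytes", orderPayload.length());
            process(orderPayload);
            log.info("Order processed successfully");
        } catch (RuntimeException e) {
            orderFailures.inc();
            log.error("Order processing failed", e);
            throw e;
        } finally {
            MDC.remove("correlationId");
        }
    }

    private void process(String orderPayload) {
        // Business logic goes here.
    }
}
```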

Below is what I used as a developer to introduce more resilience and possible exploration points:

Mandatory logs to understand what’s happening!
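As a rough illustration of those mandatory log lines, here is a small, hypothetical helper that wraps each workflow step and always emits the step name, the outcome, and the duration. SLF4J is assumed; the StepLogger name and the log fields are invented for the example.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.function.Supplier;

// Wraps a workflow step so that every execution leaves the same "mandatory" trail:
// step name, status, duration. Support can then replay the story from the logs alone.
public final class StepLogger {

    private static final Logger log = LoggerFactory.getLogger(StepLogger.class);

    public static <T> T logged(String stepName, Supplier<T> step) {
        long start = System.nanoTime();
        log.info("step={} status=STARTED", stepName);
        try {
            T result = step.get();
            log.info("step={} status=OK durationMs={}", stepName, elapsedMs(start));
            return result;
        } catch (RuntimeException e) {
            log.error("step={} status=FAILED durationMs={}", stepName, elapsedMs(start), e);
            throw e;
        }
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```

A step is then invoked as, for example, `StepLogger.logged("validate-order", () -> validate(payload))`, so the logs alone tell you which step broke and how long it took.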

Based on the above workflow, we were able to identify issues. This greatly helped my operations colleagues; with this information, they quickly knew how to proceed. In the case of an outage or an issue in your system, what is your first reflex?

  1. Rollback
  2. Reduce the load
  3. Act on the load balancing
  4. Check the logs
  5. Communicate and explain the issue to the stakeholders
  6. Attempt to reproduce the issue on your own (when it is possible, obviously), but this may increase stress if the issue is confirmed

There are many books that cover observability patterns, and they can precisely explain how to implement the best pattern in your system to predict or prevent issues. Now that the cat is out of the bag on the issue(s), we need to get our system back to normal.

Time is scarce. You have to prioritize which metrics, safety mechanisms, and dashboards you should implement first. My teammates and I learned valuable lessons post-chaos; all consequences need managing. While metrics are useful in a system, a recovery system is vital.

In the next section, let us review the key takeaways from my experiences.

C — Main lessons learned

Lesson #1 | As a developer, I want to be able to explain to my stakeholders what happened, with precision.

https://www.monkeyuser.com/2018/final-patch/

After recovering our systems, we need to know how to recover the core business (transactions, appointments, orders, articles, etc.) and guarantee that nothing was lost during the blackout.

Thus, we implemented our own creation, lovingly called the Workflow re-processor. Such a big name sounds like a steamroller; it is not, but it really saved our lives and our sanity!

So, how did we build it?

We achieved our build by simply understanding every step, every detail, every nook and cranny of our workflow, and how our system was orchestrated. At every step of this system, we identified the essentials needed to keep the system efficient. These essentials included, but were not limited to, the caching system, the database, buffers, memory…
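To give an idea of the shape such a re-processor can take, here is a simplified, hypothetical Java sketch: each message persists the last step it completed, and after an outage we replay it from the step that follows. The MessageStore and WorkflowStep abstractions (and the use of a Java record) are inventions for the example, not the real implementation.

```java
import java.util.List;

public class WorkflowReprocessor {

    // Assumed abstractions: a store of persisted messages and the ordered workflow steps.
    public interface MessageStore {
        List<StoredMessage> findInFlight();      // messages that never reached the final step
        void markCompleted(String messageId);
    }

    public interface WorkflowStep {
        void apply(StoredMessage message);
    }

    public record StoredMessage(String id, int lastCompletedStep, String payload) {}

    private final MessageStore store;
    private final List<WorkflowStep> steps;

    public WorkflowReprocessor(MessageStore store, List<WorkflowStep> steps) {
        this.store = store;
        this.steps = steps;
    }

    /** Replays every in-flight message, starting after the last step it completed. */
    public void replay() {
        for (StoredMessage message : store.findInFlight()) {
            for (int i = message.lastCompletedStep() + 1; i < steps.size(); i++) {
                steps.get(i).apply(message);
            }
            store.markCompleted(message.id());
        }
    }
}
```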

At some point, depending on the legacy items you are managing, you will have to make the strategic decision between monitoring and debugging. This means you have a choice: analyze all the metrics on your wonderful dashboard, which will point you to the specific point of failure, or try to reproduce the issue step by step in a SAFE environment.

To monitor or to debug, that is the question.

Lesson #2 | As a developer and third-level support, I should be able to have a proper state at each stage of my workflow.

This lesson taught me that we need to be able to extract each stage of our core business from our workflow. This enables us to understand what each part is in charge of, and truly follow the process behind the scenes.

Example of module interaction with caching and persistence combined

In the above example, you can see that each component is independent. Unfortunately, in the case of an outage, the communication between two system components is not automatically preserved in memory. Hence, we needed to find a way to resend messages between components. We decided to use Redis as a caching system in addition to a database persistence layer. This enables us to restore the core business orders after an outage.

Consequently, in our case, all messages are saved and cached between each component and can be replayed at each step to guarantee reprocessing in case of an error. You are probably asking yourself right now: “Is it possible that a component reprocesses a message?” Each component is idempotent, to be sure that a single message cannot be processed several times.
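As an illustration of that idempotency guard, here is a minimal sketch assuming the Jedis client for Redis; the key prefix and the OrderRepository interface are invented for the example.

```java
import redis.clients.jedis.Jedis;

public class IdempotentOrderConsumer {

    private final Jedis redis;
    private final OrderRepository repository;   // hypothetical persistence layer

    public IdempotentOrderConsumer(Jedis redis, OrderRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }

    public void onMessage(String messageId, String payload) {
        // SETNX succeeds only for the first delivery of a given message id,
        // so a replayed duplicate is acknowledged but never applied twice.
        long firstDelivery = redis.setnx("processed:" + messageId, "1");
        if (firstDelivery == 0) {
            return; // already processed: safe to drop the duplicate
        }
        repository.save(messageId, payload);    // keep a replayable copy in the database
        // ...apply the business logic, then hand the message to the next component...
    }

    public interface OrderRepository {
        void save(String messageId, String payload);
    }
}
```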

What are the benefits of this architecture?

  • The caching system helps save work in progress; we always have a “clear” status before continuing the workflow
  • All messages, regardless of status, are displayed in the web application; we can then reprocess a message as we wish at any stage of the workflow
  • The technology stays simple: no big-name technology like Kafka…

Through this workflow, we guarantee that every message can be reprocessed in the workflow as if it were the first input.

Lesson #3 | Observability is not only mandatory but should be introduced at the beginning of each artifact created in any distributed system.

Perspective reinvented?

It has been my experience that it is always preferable to prevent and predict when a system will become unstable rather than frantically managing through the chaos of a breakdown!

I like to think that my experiences as a software engineer run parallel to those of my peers in the medical sciences. Both sciences deal with a complex mass of interacting components, where symptoms are difficult to diagnose and there is no one-size-fits-all diagnostic. It is simply not always obvious what is a hiccup in a body or a system and what is a full-blown disaster.

Now that we have introduced a way to manage a hiccup, what about a solution to prevent a disaster?

In disaster prevention, always manage events through a variety of diagnostics in order to reinforce your system’s predictability (a small sketch follows the list):

  • Aggregation of events provides us with the right metrics to initialize
  • Correlation of events can improve traceability and highlight the workflow from producer to consumer
  • Indexing of events in the logs can show you the full picture
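One lightweight way to support all three diagnostics is to emit every event as a single structured log line, so the log pipeline can aggregate by event name, correlate by correlation ID, and index by component and time. The sketch below assumes SLF4J; the field names are invented for the example.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Instant;

public final class EventEmitter {

    private static final Logger log = LoggerFactory.getLogger(EventEmitter.class);

    // One line per event: "event" supports aggregation, "correlationId" supports
    // correlation, and "component" plus the timestamp support indexing and search.
    public static void emit(String component, String event, String correlationId) {
        log.info("ts={} component={} event={} correlationId={}",
                Instant.now(), component, event, correlationId);
    }
}
```

For example, `EventEmitter.emit("order-service", "ORDER_REJECTED", correlationId)` produces a line that a log indexer can count, link to the rest of the flow, and surface on a dashboard.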

Remember, when you “inherit” legacy code, ask yourself this question: how do I apply the above observability examples to prevent any unmonitored outages in this new system?

I hope this article serves as a guide to help you avoid chaotic messes and common big system mistakes.

May the force of observability be with you!
