Four guiding principles for dealing with software problems to achieve a great client experience

Daniel Moldovan
Published in DevOps Dudes · 7 min read · Oct 10, 2021

You build a great software product. You offer it as a service. Your business grows. Clients demand new features. Load is increasing on your service. Clients start to rely more and more on your service and expect a good quality of service.

You realize that to grow and be successful, you need happy clients. How do you keep them happy? By providing the best client experience possible. But how do you optimize for client experience?

Well, what negatively impacts client experience the most? When your service is slow. When your service is down. When your service is not functioning correctly. But we all know that problems are a fact of life. Software will be slow from time to time. It will crash. Bugs will creep in. Problems with your service are guaranteed to occur sooner or later. And each such problem decreases client satisfaction.

There are a few guiding principles around dealing with problems that I find useful for achieving the best client experience:

  • minimize problem occurrence rate
  • minimize problem detection time
  • minimize problem impact
  • minimize problem recovery time

There are multiple techniques and mechanisms that help us follow these principles. Below I name a few as examples, to give a better view of when and how the above principles can help us.

Principle 1: Minimize problem occurrence rate

Of course, the first instinct is to reduce as much as possible the rate at which problems with your service appear. By reducing the sheer number of problems, we take the first step towards improving our client experience. Here we can focus on pre-emptive approaches.

  1. Functional testing. Problems can appear after a code change. Maybe a newly added feature breaks old functionality or introduces incorrect computations. We can rely on functional testing to weed out as many such problems as possible. We can use unit testing to spot problems in the new code, integration testing to spot problems in how the new code interacts with other components, and end-to-end testing to determine how new code behaves in complete client use cases. We can integrate and constantly execute all tests in a Continuous Integration setup: execute the tests after each code change, and execute them periodically to validate changes in the components you are integrating with (see the unit-test sketch after this list).
  2. Performance testing. Another class of problems that can appear is performance problems. Newly added code slows down your service. Or your service starts to slow down after a certain load level. We can rely on performance testing to catch such problems outside of our production environment. We can use load testing to understand how the service behaves at certain load levels, endurance testing to understand how the service handles certain load levels for long periods of time, and stress testing to understand the system’s load limits. So we can understand when the service will behave badly, design scaling strategies, perform capacity planning, and determine if the software needs further changes to handle the expected client load (see the load-test sketch after this list).
  3. Continuous delivery. The risk of problems appearing after a code change increases with the size of the change, with how many new features are deployed to production at once. What was once counterintuitive is now common sense: to reduce the risk of new software causing problems, you increase the number of releases you do, decreasing the number of changes in each release. Through Continuous Delivery, we release new code as soon as it is ready and tested. Releasing often helps us release fewer changes at once. A smaller release can be tracked and tested more easily and makes it easier to spot problems.
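
As an illustration of functional testing, here is a minimal unit-test sketch in pytest style. The compute_invoice_total function and its discount rules are hypothetical stand-ins for whatever new code a feature introduces; the point is that each behaviour, including the error case, is pinned down by a test that runs on every change.

```python
# test_invoice.py: a minimal functional (unit) test sketch using pytest.
import pytest


def compute_invoice_total(items, discount_percent=0):
    """Hypothetical code under test: sum item prices, apply an optional discount."""
    if not 0 <= discount_percent <= 100:
        raise ValueError("discount_percent must be between 0 and 100")
    subtotal = sum(price for _, price in items)
    return round(subtotal * (1 - discount_percent / 100), 2)


def test_total_without_discount():
    items = [("book", 10.0), ("pen", 2.5)]
    assert compute_invoice_total(items) == 12.5


def test_total_with_discount():
    items = [("book", 10.0), ("pen", 2.5)]
    assert compute_invoice_total(items, discount_percent=10) == 11.25


def test_invalid_discount_is_rejected():
    with pytest.raises(ValueError):
        compute_invoice_total([("book", 10.0)], discount_percent=150)
```

Running pytest on every commit in the Continuous Integration pipeline keeps these checks constantly executed.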
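
For performance testing, dedicated tools such as Locust, JMeter, or k6 are the usual choice, but the idea can be sketched in plain Python: fire concurrent requests at an endpoint and report latency percentiles and errors. The URL and the worker counts below are assumptions, not recommendations.

```python
# load_test.py: a tiny load-test sketch with concurrent workers and latency percentiles.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

TARGET_URL = "https://staging.example.com/api/health"  # hypothetical test endpoint
WORKERS = 20              # concurrent simulated clients
REQUESTS_PER_WORKER = 50  # total load = WORKERS * REQUESTS_PER_WORKER requests


def one_request(_):
    """Issue a single request and return (latency in seconds, success flag)."""
    start = time.monotonic()
    ok = False
    try:
        ok = requests.get(TARGET_URL, timeout=5).status_code == 200
    except requests.RequestException:
        pass
    return time.monotonic() - start, ok


def main():
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(one_request, range(WORKERS * REQUESTS_PER_WORKER)))
    latencies = sorted(lat for lat, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    p50 = statistics.median(latencies)
    p99 = latencies[max(0, int(len(latencies) * 0.99) - 1)]
    print(f"requests={len(results)} errors={errors} p50={p50:.3f}s p99={p99:.3f}s")


if __name__ == "__main__":
    main()
```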

Principle 2: Minimize problem detection time

We all know that problems will always occur, regardless of how much we try to avoid them. So, the next three principles focus on “making the best of a bad situation”, helping us mitigate the negative consequences of problems.

The first thing to do when problems occur is to detect that a problem has occurred. There is generally a big decrease in client satisfaction if the client detects the problems and reports them before you do. So what can we do?

  1. Continuous testing. We can periodically execute end-to-end tests that exercise production flows, to determine quickly if a production flow no longer functions as expected, and alert the appropriate team.
  2. Monitoring and alerting on metrics related to client experience. We can monitor metrics strongly correlated with client experience, such as response time, error rate, and processing latency, and alert on those metrics to quickly detect and notify that your service is not fulfilling its SLAs (see the alerting sketch after this list).
  3. External monitoring. It helps to use an additional external monitoring system, running somewhere outside of your infrastructure, to check the health of your service. Clients call your service from various external networks, and it needs to be accessible and work for them. An external system that calls your service and alerts on availability, response time, or error rate is a great way to double-check that your service is usable by your clients (see the probe sketch after this list).
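
As a sketch of alerting on client experience metrics, the snippet below evaluates a window of request measurements against error-rate and p99 latency thresholds. The thresholds and the page_oncall notification hook are assumptions; in practice this logic usually lives in a monitoring stack such as Prometheus and Alertmanager rather than in application code.

```python
# slo_alert.py: evaluate a window of request samples against simple SLO thresholds.
from dataclasses import dataclass
from typing import List

ERROR_RATE_THRESHOLD = 0.01    # alert if more than 1% of requests fail (assumed SLO)
P99_LATENCY_THRESHOLD = 0.500  # alert if p99 latency exceeds 500 ms (assumed SLO)


@dataclass
class Sample:
    latency_seconds: float
    is_error: bool


def page_oncall(message: str) -> None:
    """Hypothetical notification hook; replace with your paging integration."""
    print(f"ALERT: {message}")


def evaluate_window(samples: List[Sample]) -> None:
    """Check one sliding window of measurements and alert on SLA violations."""
    if not samples:
        return
    error_rate = sum(s.is_error for s in samples) / len(samples)
    latencies = sorted(s.latency_seconds for s in samples)
    p99 = latencies[max(0, int(len(latencies) * 0.99) - 1)]

    if error_rate > ERROR_RATE_THRESHOLD:
        page_oncall(f"error rate {error_rate:.2%} above threshold")
    if p99 > P99_LATENCY_THRESHOLD:
        page_oncall(f"p99 latency {p99 * 1000:.0f} ms above threshold")
```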
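
And a minimal external probe, meant to run from outside your own infrastructure (for example as a cron job on a small machine in another network), could look like the sketch below. The endpoint and the response-time threshold are assumptions.

```python
# external_probe.py: call the public endpoint the way a client would and alert on failure.
import sys
import time

import requests  # pip install requests

ENDPOINT = "https://api.example.com/health"  # hypothetical public endpoint
MAX_RESPONSE_SECONDS = 2.0                   # assumed acceptable response time


def probe() -> bool:
    """Return True if the service looks healthy from the outside."""
    start = time.monotonic()
    try:
        response = requests.get(ENDPOINT, timeout=10)
    except requests.RequestException as exc:
        print(f"ALERT: service unreachable: {exc}")
        return False
    elapsed = time.monotonic() - start
    if response.status_code != 200:
        print(f"ALERT: unexpected status {response.status_code}")
        return False
    if elapsed > MAX_RESPONSE_SECONDS:
        print(f"ALERT: slow response: {elapsed:.2f}s")
        return False
    return True


if __name__ == "__main__":
    # Non-zero exit code lets the scheduler or alerting wrapper notify the team.
    sys.exit(0 if probe() else 1)
```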

Through a combination of monitoring client experience metrics and constant end-to-end testing, we can detect most problems before our customers do. Detecting problems early allows us to fix them faster, which in turn increases client satisfaction with our product.

Principle 3: Minimize problem impact

Problems will always occur, regardless of how much we try to avoid them. So, our next step is to reduce the blast radius of problems when they occur, limiting their impact to as few clients as possible.

  1. Infrastructure segregation. We can achieve this by isolating clients or client groups on dedicated resources. For example, clients with very large traffic, or with very strict SLAs, could be separated from other clients. In this way, you can tailor and control things better according to client particularities. This should not be abused, to avoid ending up with a snowflake setup for each client.
  2. Prepare your code for bad clients. At the code level, it is useful to avoid “stop the world” scenarios in case of errors. Take a processing pipeline, for example. If one client sends erroneous data that cannot be processed, it is best to continue processing data from the other clients, and retry the erroneous data in a separate flow or later. In this way, you happily process as many clients as you can, without making everyone wait behind a bad batch of data (see the pipeline sketch after this list).
  3. A/B testing, canary releases. Minimizing impact can also be done at the deploy phase of any new feature. Basically, for any change we make, we try to reduce as much as possible the number of affected clients in case things go bad. You can limit the number of impacted clients using canary releases. E.g. release a new feature to 1%-5% of your clients and evaluate the impact. If all is well, proceed to a larger set of your clients, evaluate, and so on until all clients use the new code. In this case, if something goes wrong, only a few of your clients are impacted until you roll back or fix the issue (see the canary bucketing sketch after this list).
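
To illustrate avoiding “stop the world” behaviour in a pipeline, here is a sketch that processes per-client batches independently and sets failed batches aside for a separate retry flow. The process_batch logic and the dead-letter list are hypothetical placeholders.

```python
# pipeline.py: keep processing other clients when one client's data is bad.
from typing import Dict, List


def process_batch(client_id: str, records: List[dict]) -> None:
    """Hypothetical per-client processing; raises ValueError on malformed data."""
    for record in records:
        if "value" not in record:
            raise ValueError(f"malformed record for client {client_id}")
        # ... real processing would go here ...


def process_all(batches: Dict[str, List[dict]]) -> List[str]:
    """Process every client's batch; return the clients whose data must be retried."""
    dead_letter: List[str] = []
    for client_id, records in batches.items():
        try:
            process_batch(client_id, records)
        except ValueError as exc:
            # Do not stop the world: record the failure and move on to the next client.
            print(f"skipping {client_id}: {exc}")
            dead_letter.append(client_id)
    return dead_letter


if __name__ == "__main__":
    batches = {
        "client-a": [{"value": 1}],
        "client-b": [{"wrong_key": 2}],  # bad data, should not block client-c
        "client-c": [{"value": 3}],
    }
    print("to retry later:", process_all(batches))
```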
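
And a common way to limit a canary release to a small, stable percentage of clients is consistent hash-based bucketing, sketched below. The 5% figure comes from the example above; the hashing scheme is an assumption, and real deployments usually rely on a feature-flag or traffic-routing system rather than hand-rolled code.

```python
# canary.py: consistently assign a fixed percentage of clients to the canary.
import hashlib

CANARY_PERCENT = 5  # start by exposing roughly 5% of clients to the new code


def in_canary(client_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Hash the client id into a bucket 0..99 and compare against the rollout percentage.

    The same client always lands in the same bucket, so its experience stays stable
    while the rollout percentage is gradually increased.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def handle_request(client_id: str) -> str:
    """Route a request to the new or old code path depending on the bucket."""
    if in_canary(client_id):
        return "new feature code path"
    return "existing code path"
```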

Principle 4: Minimize problem recovery time

Another important objective in ensuring client happiness is fixing any problem that occurs as fast as possible. Here we can split the discussion in two: recovering from a bad deployment, when the new code does not function as expected, and recovering from an issue that surfaces at runtime, without any changes to the system or code.

  1. Fast roll-back. To mitigate bad deployments, we can implement fast roll-back mechanisms to ensure that we can switch back to the old code as fast as possible. Blue-Green deployments come to mind here: we spin up an entire set of services with the new code, redirect traffic to them, and if something goes wrong, switch back quickly. Of course, this can be expensive in terms of infrastructure and traffic orchestration. Feature flags are another way of supporting fast roll-back. You set a configuration value that determines whether your feature is enabled. This implies that your code supports the flag internally and executes different code paths depending on the flag value. Once this is done, you can release the code to production, enable the feature, and instantly disable it if problems occur (see the feature-flag sketch after this list). A fast roll-back ensures clients are negatively affected by issues for the shortest timeframe possible. The alternative, roll-forward, is almost always slower and decreases client satisfaction.
  2. Standard operating procedures. Assume that your auto-remediation mechanisms have failed or cannot handle the problem that has occurred at runtime. Maybe the traffic pattern or level has suddenly changed, or some component entered a bad state. Basically, a human is needed to step in and recover your service. We want to recover as fast as possible from such situations, so we want to avoid a human single point of failure. We can achieve this by providing detailed, up-to-date, and complete standard operating procedures for troubleshooting and resolving problems. A good standard operating procedure is designed for a random person off the street: anyone, given the required access level, should be able to follow it and recover the service in 80% of all cases. Of course, there will be cases that require special knowledge for resolution. But quickly recovering from 80% of your problems will make a great positive impact on client experience.
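
As a sketch of the feature-flag approach to fast roll-back, the snippet below reads a flag from an external configuration source and branches between the old and new code paths. The flag name, the environment-variable source, and the handler functions are hypothetical; real setups typically use a flag service or dynamic configuration store so that flipping the flag takes effect without a redeploy.

```python
# feature_flag.py: branch between old and new code paths based on a runtime flag.
import os


def is_enabled(flag_name: str) -> bool:
    """Read the flag from an environment variable as a stand-in for a flag service."""
    return os.environ.get(flag_name, "false").lower() == "true"


def handle_request_old() -> str:
    return "result from the existing implementation"


def handle_request_new() -> str:
    return "result from the new implementation"


def handle_request() -> str:
    # Roll-back is instant: turn the flag off and traffic returns to the old path,
    # without redeploying or rolling back the binary.
    if is_enabled("NEW_CHECKOUT_FLOW"):  # hypothetical flag name
        return handle_request_new()
    return handle_request_old()
```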

tl;dr

To get the best client experience, we have to consider what negatively impacts it in each step of what we do. When we design a new feature. When we implement it. When we release it to production. When we operate and maintain our service. In everything we do, we can keep in mind a few guiding principles: minimize problem occurrence rate, minimize problem detection time, minimize problem impact, minimize problem recovery time.


Daniel Moldovan
DevOps Dudes

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.