How does Qoala become better at Production support?
We at Qoala are fully committed to making our customers' lives easier by getting them the right insurance products. We are obsessed with customer satisfaction, which is why we take production support very seriously. It's the bread and butter of our engineering practice.
For those who do not know, Qoala is an omnichannel insurance company with the purpose of democratizing, empowering, and redefining insurance for customers. Qoala believes that redefining insurance takes the involvement of all stakeholders. Hence, Qoala is committed to providing the utmost value to all of its stakeholders, including insurance partners, customers, and regulators.
Given our mission and the criticality of the business, supporting our customers 24/7 is super important for us. But support is not just setting up a Customer Service (CS) department; it's a process where everyone, from engineering to the business side, is involved in keeping the platform up and available for our customers. As a startup growing exponentially, it's critical for us to set up a process and create awareness about production support.
This is not a new practice in the industry, but some teams still struggle with this sensitive topic and do not implement it the right way.
Even with the best system, there will be issues in your production environment that you never anticipated. And when such an issue happens, if you have not educated your team and invested in managing production issues, your customers are going to be dissatisfied with your platform.
OK, now what? How do you approach production support for your team and company?
When we started, we wanted to be better at it, but there was no plan or objective to achieve it. We were sharing one account per team to reduce cost, but this was not reducing cost, it was increasing it, because there was no visibility into who was on call or whom to reach out to. We were thinking more about transactional than non-transactional output.
We did not have any formal SLA to meet; we just wanted to reduce the number of production issues and had a very optimistic view that they would be fixed ASAP. We did not even know how many issues were occurring per week, how many of them were recurring, and how many were tech-ops issues. We were doing RCAs, but they were not followed up in a disciplined manner. Being in this state is very normal for early-stage startups, as they prioritize building over optimizing the product. But we wanted to fix this broken loop in the organization as early as possible, to have more time to build and innovate.
We started with the very basic steps written below to become better at production support.
The first thing we did was educate the team about production support, and most importantly about why it is an important skill for an engineer. We were surprised that most of the engineers on our team wanted to learn and contribute during production support, but they were not aware of how to do it; some of them felt shy, or afraid of judgment from others. With time, we were able to build a no-blame culture in our organization and become more proactive and open about production issues.
When it comes to processes, engineers do not like the word itself 😐. So we were very careful not to make it feel like an additional job or task for the engineers. We introduced on-call engineers for our respective teams with flexible shifts and rotations; the idea was to make ourselves more comfortable around the on-call concept. We made dedicated on-call groups along with an SOP for what to do when someone is on call.
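A flexible weekly rotation like the one above can be expressed in a few lines. This is just a sketch, with made-up engineer names and a made-up start date, to show the round-robin idea; a real schedule would live in the on-call tool.

```python
from datetime import date

# Hypothetical rotation: the names and start date are illustrative.
ENGINEERS = ["alice", "budi", "citra", "dewi"]
ROTATION_START = date(2022, 1, 3)  # a Monday; week boundaries align to this day

def on_call_for(day: date) -> str:
    """Return who is on call for the week containing `day` (weekly round-robin)."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]

print(on_call_for(date(2022, 1, 3)))   # week 0 -> alice
print(on_call_for(date(2022, 1, 12)))  # week 1 -> budi
```

Swapping shifts then becomes a one-line change to the list, which keeps the process lightweight for engineers.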
We are also trying the runbook concept, where the on-call engineer maintains the mitigation and debugging steps so that knowledge sharing becomes easy.
We run recurring RCA (root cause analysis) retrospectives too, to review root causes together and learn from them so we can avoid them in the future.
The most important thing about production support is that on-call engineers should own their respective business, and ownership comes with both power and responsibility. We started with a very basic step: giving our engineers access to production. Once engineers had access and exposure to the production environment, they started checking on and owning the platform themselves.
This is the most important thing a company should do if it wants to become better at production support. For on-call/production issues, the most important metrics are MTTA (mean time to acknowledge) and MTTR (mean time to resolve). We started with a 30-minute MTTA and a 180-minute MTTR. This was an easy target to start with, and at the same time it introduced the team to SLAs and metrics around production issues. We are now targeting a <5-minute MTTA and a <90-minute MTTR.
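Both metrics are simple averages over incident timestamps. As a sketch (the incident records and field names here are illustrative, not our actual data model):

```python
from datetime import datetime

# Illustrative incidents with created / acknowledged / resolved timestamps.
incidents = [
    {"created": "2022-03-01 10:00", "acked": "2022-03-01 10:04", "resolved": "2022-03-01 11:10"},
    {"created": "2022-03-02 09:00", "acked": "2022-03-02 09:06", "resolved": "2022-03-02 10:00"},
]

def _minutes(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

def mtta(incidents) -> float:
    """Mean time to acknowledge, in minutes (created -> acked)."""
    return sum(_minutes(i["created"], i["acked"]) for i in incidents) / len(incidents)

def mttr(incidents) -> float:
    """Mean time to resolve, in minutes (created -> resolved)."""
    return sum(_minutes(i["created"], i["resolved"]) for i in incidents) / len(incidents)

print(f"MTTA: {mtta(incidents):.0f} min, MTTR: {mttr(incidents):.0f} min")
# -> MTTA: 5 min, MTTR: 65 min
```

In practice the on-call tool records these timestamps for you; the point is that the targets are concrete and checkable, not a feeling.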
To improve MTTA and MTTR, our system should be able to detect anomalies ASAP, and to detect anomalies we need good monitoring infrastructure. We use Datadog and CloudWatch extensively: we push our business metrics to Datadog per business unit, and we started using Datadog APM for our microservices to get service-level metrics. We invest a lot of time in capturing the right metrics so that we can create the right alerts/monitors. We now have more than 1,000 unique monitors to detect anomalies.
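Pushing a business metric is cheap at the application level: agents like DogStatsD accept a one-line text datagram over UDP. Below is a minimal, dependency-free sketch of that wire format; the metric name and tag are made up, and a real service would normally use the official client library instead.

```python
import socket

def send_metric(name: str, value: float, metric_type: str = "c", tags=None,
                host: str = "127.0.0.1", port: int = 8125) -> str:
    """Send one metric over UDP in the StatsD/DogStatsD text format.

    metric_type: "c" = counter, "g" = gauge, "h" = histogram.
    Returns the payload string so callers can inspect what was sent.
    """
    payload = f"{name}:{value}|{metric_type}"
    if tags:  # DogStatsD extension: |#tag1:val1,tag2:val2
        payload += "|#" + ",".join(tags)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("utf-8"), (host, port))
    sock.close()
    return payload

# Hypothetical business metric, tagged by business unit:
print(send_metric("policies.issued", 1, "c", tags=["business_unit:travel"]))
# -> policies.issued:1|c|#business_unit:travel
```

Because it is fire-and-forget UDP, instrumenting business code this way adds almost no latency, which is why we can afford to emit metrics generously and build monitors on top of them.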
Logging makes engineers' lives easier when debugging a production issue, and to find the root cause we need the right logs as early as possible. So the first thing we started doing was tagging our logs with a trace ID; this is something we are still improving. With a trace ID in the logs, we can easily follow the lifecycle of a particular request, which automatically reduces the time to resolve. The faster you know the root cause, the faster you fix it.
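One way to attach a trace ID to every log line is a logging filter plus a context variable, sketched below. The propagation part is an assumption: in a real service the ID would come from an incoming request header (e.g. something like `X-Trace-Id`) or be generated at the edge, not set inline as shown here.

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the trace id for the current request context ("-" when none is set).
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the current trace id into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop records, only annotate them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s [trace=%(trace_id)s] %(levelname)s %(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

# Simulate handling one request end to end under a single trace id:
trace_id_var.set(uuid.uuid4().hex[:8])
logger.info("payment request received")
logger.info("payment request completed")
```

Every line a request produces then carries the same tag, so grepping one trace ID in the log store reconstructs the request's whole lifecycle.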
Incident Management Tools
The right tool is very important, and many companies do not invest in on-call tools like PagerDuty, Opsgenie, Zenduty, etc. We started with Opsgenie and later moved to Zenduty for pricing/metrics/support reasons. Zenduty helps the team keep track of weekly occurring and recurring issues; we design the on-call schedule in the tool so it escalates alerts to the on-call engineer, and it provides a robust interface to manage incidents within Slack, our team communication channel. Lastly, MTTA and MTTR are recorded and visualized in the tool, helping us compare actual and target numbers for improvement. We are happy to pay for on-call tools because the value they add for our engineers and customers is much higher than the cost.
I hope I explained everything clearly enough for you to understand. If you have any questions about engineering practices, feel free to ask. You can find me on Twitter.
Make sure you click on the clap below and follow me for more stories about technology :)