How a cat inspired system reliability at Knowlarity
Imagine the reliability of Knowlarity’s systems as the body of a cat. A cat is a self-healing animal: its internal bodily mechanisms efficiently identify and heal wounds, broken bones, and scratches. Look at how Knowlarity operates and it is not hard to see the same trait: systems that fix themselves and run smoothly.
To explain how Knowlarity’s systems are kept at a high level of reliability, let us start with how the efficiency of a system should be measured. Consider this: whenever someone calls your business or you initiate a click-to-call, the outcome of the call is the only key metric for the business. More often than not, it falls into one of the following buckets:
- Was it a happy customer? Are we going to get repeat orders?
- Is it an angry customer? Do we need to revamp our delivery execution process?
- Did our agent who initiated the click-to-call mark the lead as “Promising” or “Not-ready-yet”?
When a business relies on cloud communication, it is questions like the ones above that should actually matter to the business, not the reliability of the cloud service itself. The effectiveness of a business’s operating model is marked by how smoothly and ‘invisibly’ it functions. The more calls one gets about queries, product services, confusion, and complaints, the more likely it is that the present system needs oiling, or perhaps a change of gears.
Asking the right questions
Ensuring high reliability is a process; it is not an inherent part of a product. Everything around the product needs to meet stringent standards for the reliability metrics to be met. At a bare minimum, these procedures ensure the consistency and dependability of a product. A product must be perceived as reliable, because without that it cannot sell, even if it is the best thing since sliced bread. Diligence is required to measure, benchmark, and monitor the product to ensure high uptime.
Knowlarity ensures system reliability by asking pertinent questions at every stage. What helps maintain the standard that empowers our work are the insights that come from questions such as these:
- Are we using carrier-grade servers to ensure reliability? Is L2 cache available to store codec transcoding tables for fast access?
- Are all switches and network devices in the network Gigabit-enabled switches?
- Do all datacenters have more than one network link, with auto-failover to ensure minimal outages in case one link goes down?
- Is all code going from development to production reviewed for reliability and high availability? (E.g., will my code retrieve IVR details from the backup database if the main database is not reachable?)
- Do we have failover infrastructure in all operational geographies? (This includes redundant power plugs for servers and routers.)
- Are we periodically measuring the performance of telcos for lag and voice quality? (E.g., what if calls to a particular region suddenly start experiencing low call quality? Will we come to know about it?)
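The code-review question above (falling back to a backup database for IVR details) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `fetch_with_failover`, `SourceUnavailable`, and the two stand-in fetchers are hypothetical and do not describe Knowlarity's actual code or schema.

```python
from typing import Callable, Optional, Sequence


class SourceUnavailable(Exception):
    """Raised by a fetcher when its database cannot be reached (hypothetical)."""


def fetch_with_failover(
    ivr_id: str,
    fetchers: Sequence[Callable[[str], dict]],
) -> dict:
    """Try each data source in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return fetch(ivr_id)
        except SourceUnavailable as exc:
            last_error = exc  # remember the failure and try the next source
    raise SourceUnavailable(f"all sources failed for {ivr_id}") from last_error


# Hypothetical stand-ins for real database clients.
def fetch_from_primary(ivr_id: str) -> dict:
    raise SourceUnavailable("primary database not reachable")


def fetch_from_backup(ivr_id: str) -> dict:
    return {"ivr_id": ivr_id, "greeting": "welcome.wav", "source": "backup"}


details = fetch_with_failover("ivr-42", [fetch_from_primary, fetch_from_backup])
print(details["source"])
```

Only the "unreachable" error triggers the fallback in this sketch; any other exception propagates, so a genuine bug is not silently masked by a read from the backup.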
What happens when (not if) things go down?
One is bound to face hurdles while attempting anything of magnitude. Things will go down. To ensure reliability in such cases, Knowlarity covers the following broad areas:
- Coming to know about failure: How would you come to know that a component or piece of hardware went down? At Knowlarity, we have something called “Zero Blindness” (ZB). Zero Blindness keeps everyone on constant high alert: as soon as a system goes down, the teams know about it and can initiate the redundancy workflow or failover. 95% of failovers across all components are automated, which also helps with constant tracking and reduces human error in follow-up. The SRE team ensures that no moving part exists anywhere without being monitored. This even includes entities that are not owned by Knowlarity but are required by Knowlarity for production.
- Creating redundancy and an actionable item on failures: The product development and engineering signoff phases incorporate failovers for the case that a component goes down. This enables us to catch mistakes and not make the same ones again. We believe in “Failure Driven Development”.
- Establishment of quality and reliability requirements for suppliers: Often, components outside one’s own scope are required to keep products up and running, so one must plan for eventualities around them to ensure high uptime. Knowlarity is only as good as its weakest link, hence a set of rigorous benchmarks is established for admitting a new supplier into the system.
- Collection of Field Data and Root Cause Analysis of failures: Anything that is a measurable number is tracked for performance measurement. If a number goes beyond a benchmarked threshold, it triggers an alarm and initiates a procedure for analysis.
- Monitoring and Alarming System: All components of all systems, whether hardware or software, emit events and metrics that aid in monitoring.
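The threshold-triggered alarm described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Knowlarity's tooling: the `Threshold` class, the `breached` function, and the metric names and limits are all hypothetical. In the spirit of “Zero Blindness”, the sketch assumes a metric that reports nothing at all is itself treated as an alarm.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool = True  # False for metrics where a low value is the problem


def breached(thresholds: List[Threshold], readings: Dict[str, float]) -> List[str]:
    """Return one alarm message per threshold that is breached or has no data."""
    alarms = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            # Assumption: a silent metric counts as a failure, not as "all fine".
            alarms.append(f"{t.metric}: no data")
            continue
        over = value > t.limit if t.higher_is_worse else value < t.limit
        if over:
            alarms.append(f"{t.metric}: {value} breaches {t.limit}")
    return alarms


# Hypothetical metric names and limits, for illustration only.
thresholds = [
    Threshold("call_setup_latency_ms", 800.0),
    Threshold("answer_rate_pct", 40.0, higher_is_worse=False),
    Threshold("packet_loss_pct", 1.0),
]
readings = {"call_setup_latency_ms": 950.0, "answer_rate_pct": 55.0}

alarms = breached(thresholds, readings)
for message in alarms:
    print(message)
```

Here the latency reading breaches its limit and `packet_loss_pct` is missing from the readings, so both raise an alarm, while the answer rate stays quiet.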
Designing efficient alerting metrics
Although most of our alarming is handled with industry-standard tools, we have reached a point where our metrics need alerting implemented at a larger scale. The alarming systems are automated: they trigger call/SMS alerts to the concerned teams in an escalating failover manner, with details of the issue delivered over TTS/SMS/WhatsApp. The failover calls are set up to ultimately reach the Chief Technical Officer, and then even the CEO. (Yes, you read that right. However, no calls so far have reached that level 😉.) Some sample questions that we ask ourselves while designing the alerting metrics are as follows:
- How do you know that your system is performing 5% lower at this moment compared to the average of the past 3 days? What are the sub-components that have caused this dip?
- How do you know that a particular Telco A has started sending call rejection messages in the past 7 minutes for all numbers destined for Telco B?
- How do you know that visitors from Australia are suddenly getting higher latency in opening application pages? (Fibre-cuts, damn it!)
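The first two questions above can be turned into simple checks. The sketch below is a hedged illustration, not a description of Knowlarity's alerting: the function and class names, the sample numbers, and the choice of a plain mean as the baseline are all assumptions. It also assumes baselines are non-zero and that rejection events are recorded in time order.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, List, Sequence, Tuple


def dip_vs_baseline(current: float, history: Sequence[float], max_drop: float = 0.05) -> bool:
    """True when `current` is more than `max_drop` below the mean of `history`."""
    baseline = mean(history)
    return current < baseline * (1 - max_drop)


def worst_components(current: Dict[str, float], baseline: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank sub-components by relative drop against their own baseline, largest first."""
    drops = [(name, (baseline[name] - value) / baseline[name]) for name, value in current.items()]
    return sorted(drops, key=lambda item: item[1], reverse=True)


class RejectionWindow:
    """Counts call rejections per (source, destination) route over a sliding window."""

    def __init__(self, window_seconds: int = 7 * 60) -> None:
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, Tuple[str, str]]] = deque()

    def record(self, timestamp: float, source: str, destination: str) -> None:
        self.events.append((timestamp, (source, destination)))

    def count(self, now: float, source: str, destination: str) -> int:
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()
        return sum(1 for _, route in self.events if route == (source, destination))


# Hypothetical daily success rates (percent) for the past 3 days.
is_dip = dip_vs_baseline(92.0, [98.0, 97.5, 98.5])
ranked = worst_components(
    {"sip_gateway": 80.0, "ivr_engine": 99.0},
    {"sip_gateway": 100.0, "ivr_engine": 100.0},
)

window = RejectionWindow()
for ts in (0.0, 100.0, 500.0):
    window.record(ts, "Telco A", "Telco B")
window.record(510.0, "Telco A", "Telco C")
recent = window.count(520.0, "Telco A", "Telco B")
```

Ranking each sub-component against its own baseline is one simple way to answer "what caused the dip"; a real system would likely also weight components by traffic volume.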
System reliability, therefore, is not just about ensuring that there are no large-scale outages. It is mostly about ensuring that the systems run the way you intended them to, and about knowing whether there is an impact on the system or on the customers. If all the components run by themselves and operate at the intended level, then the whole system is operating well.
So, now you know that all calls powered by Knowlarity are destined to reach their destination, provided the person being called has not gone off to the Himalayas for a break!
(Authored by Ashutosh Kumar, who drives System Reliability at Knowlarity and is a maniac about metrics going down in system graphs)