The Importance of Proactive Service Monitoring
When I started my journey in the IT world, my badge bore the motto of the company that saw fit to grant me this opportunity (Comverse technologies, you are forever in my mind 😃): “Our goal is to meet or exceed our customers’ expectations”.
Take a moment and think about this motto. We all want to do a good job in service of our customers, but what defines a truly successful service is foreseeing what your customers need before they know it themselves. I’ve done my best to follow this motto during my career and to implement it in every workplace I’ve had the good fortune to work at, and Gett is no exception.
The primary customers of Gett’s global tech support team are the front-line support tier: the regional customer care operators who take calls from riders and drivers and do their best to resolve their predicaments. In “Fire” incidents, they are the first line of defense, working alongside R&D to resolve the issue as soon as possible.
Meeting the customer’s expectations means resolving the incident as soon as possible. But what about exceeding their expectations? This is where our proactive monitoring mechanism comes into play: we foresee an incident before it starts and proactively alert our front-line support to be ready: “There’s a Storm Coming”.
In this blog post, I will review how we created this array of proactive monitors, which tools we use, and our roadmap for the project.
Monitor the world…
Gett uses 3 types of monitors:
- Applicative: we monitor each and every microservice we deploy using New Relic. This monitoring includes (but is not limited to) latency, error rate, throughput and several other parameters that tell us if something is wrong with the service.
- Infrastructure: we monitor the health of the system infrastructure using Datadog. This monitoring includes (but is not limited to) memory and CPU usage of databases, replication lag between replicas and master DBs, HTTP 500 errors recorded by the ELBs and more.
- Business (last, and most relevant for this blog): here we use a set of Grafana-based monitors that enable us to detect anomalies in specific business flows based on predetermined thresholds.
While we strive to be as proactive as possible in every monitoring aspect (for example, alerting our customers when we see an application flaw in a microservice, or a degradation in an infrastructure element, that can affect the business), it is the set of business monitors that is our primary aid in being proactive.
How did we get there?
As with every good thing, the road was long… As you know, Gett is a ride-hailing service first and foremost, so we huddled together with several business analysts and mapped what we call “the golden flows”:
- Login to the application (Rider and Driver).
- Get a taxi as soon as possible.
- Pay successfully for the ride (when using a credit card).
- Receive the ride summary report by mail on time.
- And a few more (the list is not that short… 😇)
Each and every one of these golden flows is mapped to a specific part of the code and assigned an event that is triggered on the success or failure of that action. These events are written to a database that can be queried, and the collected results are presented in Grafana. This way, any anomaly is visible even to the untrained eye and can be used to proactively alert our customers about an impending production issue.
For example, if we detect that timeouts from one of our third-party vendors exceed the usual threshold and, as a result, we fail to provide a critical service to our customers (such as delivering notifications on concluded transactions), we can assume something is wrong, send a proactive notification to our users, and advise that some service disruptions may occur in that specific area. Or, if we detect a problem with a specific cellular carrier, we can provide our Tier 1 operators with a tool that significantly reduces the MTTR of tickets raised by customers who use that carrier.
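To make the idea concrete, here is a minimal sketch in Python of such a check. It is not our production code: the `events` table, its columns, the threshold values and the helper names are all assumptions made up for illustration, and an in-memory SQLite database stands in for the real event store that Grafana queries.

```python
import sqlite3

# Hypothetical schema: one row per golden-flow event.
SCHEMA = """
CREATE TABLE events (
    flow    TEXT NOT NULL,     -- e.g. 'payment', 'login'
    carrier TEXT NOT NULL,     -- the user's cellular carrier
    success INTEGER NOT NULL   -- 1 = success, 0 = failure
)
"""

# Hypothetical thresholds: maximum tolerated failure rate per flow.
FAILURE_THRESHOLDS = {"payment": 0.05, "login": 0.02}


def failure_rates(conn, group_by="flow"):
    """Return {group: failure_rate}, grouped by flow or by carrier."""
    if group_by not in ("flow", "carrier"):
        raise ValueError("unsupported grouping column")
    rows = conn.execute(
        f"SELECT {group_by}, AVG(1 - success) FROM events GROUP BY {group_by}"
    )
    return dict(rows.fetchall())


def breached_flows(conn, thresholds=FAILURE_THRESHOLDS):
    """Return the flows whose failure rate exceeds their threshold."""
    rates = failure_rates(conn, "flow")
    # Flows without a configured threshold never alert (default 1.0).
    return sorted(
        flow for flow, rate in rates.items() if rate > thresholds.get(flow, 1.0)
    )


def build_demo_db():
    """Create an in-memory database with a few made-up events."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    events = (
        [("payment", "carrier_a", 1)] * 8
        + [("payment", "carrier_b", 0)] * 2
        + [("login", "carrier_a", 1)] * 50
        + [("login", "carrier_b", 1)] * 50
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print("Flows over threshold:", breached_flows(conn))
    print("Failure rate per carrier:", failure_rates(conn, "carrier"))
```

With the demo data, only the payment flow breaches its threshold, and grouping the same events by carrier shows that every failure comes from a single carrier. That second view is the kind of hint we would want to hand to Tier 1.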
As I always say, “A picture’s worth a thousand words” (well, maybe not an original saying of mine, but you get the point 😉), so below you can see examples of our Grafana graphs from occasions when a business-related anomaly took place and we successfully alerted our customers before they started to “feel the pain”.
As you can see from these dashboards, an anomaly was detected, so we reacted and gave our first line of defense the means to be ready for the storm.
So what’s next?
Well, that’s a good question, with a very simple answer: Be as proactive as possible and as collaborative as possible. Each new service we deploy that is part of any “golden flow” comes equipped with the needed events to trigger a “success” or “fail” alert which we then add to our Grafana monitors.
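As a rough illustration of what “comes equipped with the needed events” could look like, here is a small Python sketch. Everything in it is hypothetical: the `golden_flow` decorator, the `emit_event` helper and the in-memory `EVENTS` list (a stand-in for the real event database) are invented for this example and are not our actual instrumentation.

```python
import functools
import time

# Stand-in for the real event sink (a database, a message queue, ...).
EVENTS = []


def emit_event(flow, outcome, duration_ms):
    """Record one golden-flow event; here we only append to a list."""
    EVENTS.append({"flow": flow, "outcome": outcome, "duration_ms": duration_ms})


def golden_flow(flow):
    """Decorator that emits a 'success' or 'fail' event for each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Report the failure, then let the error propagate as usual.
                emit_event(flow, "fail", (time.monotonic() - started) * 1000)
                raise
            emit_event(flow, "success", (time.monotonic() - started) * 1000)
            return result
        return wrapper
    return decorator


@golden_flow("payment")
def charge_card(amount):
    """Toy payment function used only to exercise the decorator."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return f"charged {amount}"


if __name__ == "__main__":
    charge_card(10)
    try:
        charge_card(0)
    except ValueError:
        pass
    for event in EVENTS:
        print(event["flow"], event["outcome"])
```

The appeal of a decorator here is that a new service only needs one line per golden-flow entry point, and the success/fail bookkeeping stays out of the business logic.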
Additionally, we share these monitors with our Tier 1 support so they get notified along with the tech support team (keeping them less surprised by production incidents). The ultimate goal is to automate this procedure, and I’m sure we’ll get there eventually.
For now, this is your friendly Gett incident management team signing off and reminding you of the answer to life, the universe and everything…