Monitoring large scale e-commerce websites at MakeMyTrip — Part 2

Shashilpi Krishan
Dec 30, 2017 · 5 min read

Disclaimer: The objective of this blog series is to share how monitoring is done at MakeMyTrip. I will publish a series of 4–5 posts to cover it, so stay tuned. This part covers app monitoring aspects and the monitoring pipeline at MakeMyTrip.

“(What/Why/How) to monitor Multi-tier Web Apps”

A Few Concepts

Real-time monitoring with granularity: In a multi-tier, API-based application, what matters is how quickly the metrics become available to visualize, and with what granularity: one should be able to see overall app response time with various groupings/splits such as API-wise, server-wise, datacenter-wise and even traffic-source-wise (user vs. bot vs. internal). Having all these splits available expedites diagnosis and greatly improves MTTR (Mean Time to Recover). Equally important, the sooner you get the metrics the better; the MTTD (Mean Time to Detection) target should be within 2–3 minutes.

Response codes & request trends: In web applications, the vitality of the system also depends on the HTTP response codes sent back for client requests. A 4xx response (e.g. 400, 403, 404) means the client erred in sending the request, while a 5xx response clearly represents a server-side error: the server understood the request but was incapable of serving it, so ideally 5xx should be zero. The request trend is another must-have metric, since the request pattern drives app performance: a sudden spike in requests may cause response degradation if the request rate breaches the capacity limit, i.e. the application's maximum throughput. Another behaviour we have seen is that very low request volume can also show higher response times, because the chance of an outlier disturbing the latency percentiles becomes very high. Last but not least, having this trend is super-critical for capacity planning, as this metric is widely used as the unit of rated output/throughput of an application.

Now, how do we do this at MakeMyTrip?

Stage 1: [Log collection & shipping] It all starts with log collection on every application server, done through an ultra-lightweight syslog-ng agent deployed on each of them. Its purpose is to ship the logs to a centralized Logstash server on a specified port over UDP, so there is no overhead on the app server. Different logs are pushed to different ports of the Logstash server to aptly utilize the hardware resources. In syslog-ng.conf we define the source log file path and the destination (protocol, host & port), as sketched below.
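A minimal syslog-ng.conf sketch of this setup could look like the following; the log path, host name and port are illustrative placeholders rather than our actual values:

```
# Tail the application access log (path is a placeholder)
source s_app_access {
    file("/var/log/app/access.log" follow-freq(1));
};

# Ship log lines to the central Logstash server over UDP (host/port are placeholders)
destination d_logstash {
    udp("logstash.internal.example" port(5141));
};

# Wire the source to the destination
log {
    source(s_app_access);
    destination(d_logstash);
};
```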

Stage 2: [Log parsing & stashing] Logstash is an open-source, server-side log processing engine that ingests data from syslog-ng, transforms it using filters and codecs (grok, JSON) and then stashes it into Elasticsearch. Our central Logstash server has multiple isolated Logstash processes, where each process listens on a specific port to read the logs shipped in Stage 1 from multiple application servers. Currently we ingest ~700 GB of logs per day, processed by around a hundred Logstash processes hosted on multiple machines. Each process uses the grok parser to filter the relevant log lines, separates them into key-value fields like URL, METHOD and IP, and outputs them to two different Elasticsearch clusters. Below is an example grok pattern:

Request: 10.10.10.10 GET /api/index.html 15824 0.043

GROK: %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
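Putting the pieces together, a stripped-down Logstash pipeline for this stage could look like the sketch below; the port, index names and Elasticsearch hosts are assumptions for illustration only:

```
input {
  udp { port => 5141 }    # the port syslog-ng ships this log to
}

filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}

output {
  # fan the parsed events out to two Elasticsearch clusters
  elasticsearch { hosts => ["es-cluster-a:9200"] index => "app-access-%{+YYYY.MM.dd}" }
  elasticsearch { hosts => ["es-cluster-b:9200"] index => "app-access-%{+YYYY.MM.dd}" }
}
```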

Stage 3: [Metric extraction setup & aggregation] Now Elasticsearch has metrics for every application, but at the level of every single request event. We have a home-grown Django/Python based tool, "DATA MONSTER". Here we set up the metric aggregation jobs that will poll Elasticsearch at a defined interval, querying for the last 2 minutes, along with the ES index host URL and the request-body query. We also specify a JSON expression to extract the metric value from the JSON response sent by Elasticsearch; an illustrative query and extraction expression are sketched below.

[Figure: Data Monster UI metric configuration]
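As an illustration of what such a job configuration boils down to (the index, field names and values here are assumptions, not the real Data Monster config), the request body could be a percentile aggregation over the last 2 minutes:

```
POST /app-access-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-2m", "lt": "now" } } },
        { "term":  { "source": "user" } }
      ]
    }
  },
  "aggs": {
    "resp_time": { "percentiles": { "field": "duration", "percents": [95] } }
  }
}
```

The JSON expression for this job would then be a path into the aggregation result, something like aggregations.resp_time.values["95.0"].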

Stage 4: [Metrics computation] Once metrics are configured via Data Monster, all those configurations are stored in the Django-backed MySQL DB along with the URL & scheduling details. Celery Beat is a scheduler that kicks off tasks at regular intervals; in our case it picks tasks from the MySQL DB and queues them into RabbitMQ. From there they are picked up by Celery workers, which hit Elasticsearch with the specified query, extract the metric using a JSON parser, and finally generate OpenTSDB put statements and send them to a Kafka cluster. A minimal sketch of such a worker task follows.
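This is a hypothetical Python sketch of what one such worker task might look like; the broker/Kafka hosts, topic name and argument names are illustrative assumptions, not the actual Data Monster code:

```python
# Hypothetical sketch of a Data Monster worker task; hosts, topic and
# argument names are illustrative assumptions, not production code.
import time

import requests                      # query Elasticsearch over HTTP
from celery import Celery
from jsonpath_ng import parse        # evaluate the configured JSON expression
from kafka import KafkaProducer      # kafka-python producer

app = Celery("datamonster", broker="amqp://rabbitmq.internal.example//")
producer = KafkaProducer(bootstrap_servers=["kafka.internal.example:9092"])


@app.task
def compute_metric(es_url, query_body, json_expr, metric_name, tags):
    """Run one configured aggregation job and emit an OpenTSDB put statement."""
    # 1. Query Elasticsearch with the request body stored for this job
    resp = requests.post(es_url, json=query_body, timeout=10)
    resp.raise_for_status()

    # 2. Extract the metric value using the configured JSON expression
    value = parse(json_expr).find(resp.json())[0].value

    # 3. Build an OpenTSDB put statement and publish it to the Kafka cluster
    tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
    put_line = f"put {metric_name} {int(time.time())} {value} {tag_str}"
    producer.send("opentsdb-metrics", put_line.encode("utf-8"))
```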

Stage 5: [Metrics storage] The Kafka messages produced by the Celery workers are then consumed by a Storm-based processing engine, where a continuously running topology reads the messages from the Kafka queue, validates the put statements generated by Data Monster for errors, and inserts them into OpenTSDB, a time-series metric storage database that works on top of HBase. It has built-in metric aggregators like sum, zimsum, avg and percentiles, plus auto-downsampling capabilities.
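For reference, an OpenTSDB put statement is a single line of the form put &lt;metric&gt; &lt;timestamp&gt; &lt;value&gt; &lt;tag=value&gt; ..., and the aggregators and downsampling are applied at read time; the metric and tag names below are made up for illustration:

```
put app.response.time.p95 1514628120 182.4 api=search dc=chn source=user

# Read it back with a built-in aggregator and 1-minute average downsampling
GET /api/query?start=1h-ago&m=avg:1m-avg:app.response.time.p95{api=*}
```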

Stage 6: [Visualization & alerting] This is the final stage, or the harvesting stage, because it's time to finally reap and make use of the gathered metrics. We use Grafana for visualization, a tool that provides a powerful and elegant way to create, explore and share dashboards and data, with features like fully interactive graphs, multiple Y-axes, template variables, and plugins such as the panel and diagram plugins. The other use of these metrics is alerting, for which we have Zabbix (an enterprise-grade open-source monitoring platform). Zabbix handles all kinds of alerting through its key features like trigger dependencies, custom monitoring set up through scripts, and event handling (auto-remedy script triggering in case of an alert); a small example of a script-based check is sketched below.
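As a hedged illustration of the script-based setup (the item key, script path, host name and threshold are made up for this example), a custom check in Zabbix is just a UserParameter on the agent plus a trigger expression on the server:

```
# zabbix_agentd.conf: expose a custom script as an item key (path is a placeholder)
UserParameter=app.http.5xx.rate,/usr/local/bin/http_5xx_rate.sh

# Trigger expression (classic syntax): fire when the latest 5xx rate is above zero
{app-server-01:app.http.5xx.rate.last()}>0
```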

Let's call it a day, but not the end. The next part will cover business (BIZ) metrics & app-generated error codes.
