Monitoring large scale e-commerce websites at MakeMyTrip — Part 1

Disclaimer: The objective of this blog series is to share how monitoring is done at MakeMyTrip. I will publish a series of 4–5 blogs to cover it, so stay tuned. Here we will focus only on the system & network aspects of monitoring.

About MMT Infra

At MakeMyTrip (MMT), we host all business verticals on hundreds of applications running on several virtual & bare-metal machines. Our infra is spread across a hybrid environment of data centers and public & private cloud.

“(What/Why/How) to monitor”

Before jumping into how to monitor, one should first know what to monitor and why, i.e. what the different KPIs (key performance indicators) are for an e-commerce website.

System Metrics: CPU/Load Average, Memory (RAM, Swap, Heap), threads, connections, Disk space, Disk performance.

CPU Utilization vs Load Avg: Both can be checked at the same time using the top command. High CPU utilization is one factor that can affect performance, but performance can also suffer from high disk utilization or a memory shortage. The load average is the average number of processes that are running or waiting to run on a CPU; on Linux it also counts processes blocked waiting on disk IO. Hence if any system resource is choked, the load average goes up. CPU utilization is a good metric but it fluctuates, so the load average is always something to keep a tab on. Linux displays 3 load averages, e.g. load average: 0.75, 1.20, 3.30, which are the averages over the last 1, 5 and 15 minutes respectively. The 1-minute average may go high due to spikes, but if the 5-minute average goes higher than the number of CPU cores then one should be worried. E.g. if a system has 4 CPU cores, a load average of 4 means roughly that all 4 CPUs are fully occupied and any further processes will be queued.
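
The rule of thumb above (compare the 5-minute load average with the number of cores) can be sketched in a few lines of Python. This is only an illustration, not our production check: the function name load_status and the 0.7 early-warning ratio are assumptions made up for this example.

```python
import os


def load_status(load_5min, cpu_cores, warn_ratio=0.7):
    """Classify a 5-minute load average relative to the number of CPU cores.

    warn_ratio is a hypothetical early-warning level, not a value from the post.
    """
    per_core = load_5min / cpu_cores
    if per_core > 1.0:
        return "critical"  # more runnable/waiting tasks than cores
    if per_core >= warn_ratio:
        return "warning"   # getting close to full occupancy
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages (Unix only).
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load {one:.2f} {five:.2f} {fifteen:.2f} on {cores} cores:",
          load_status(five, cores))
```

With the sample values from the paragraph, a 5-minute average of 1.20 on a 4-core machine is classified as "ok", while anything above 4 would be "critical".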

Memory/Swapping: Zero swap usage is the ideal scenario, as swap usage directly impacts performance. It can be checked using the free -m command; if swap is being used, it will surely degrade performance. One good example: if someone runs grep or vi on a large log file, that process can eat up the available memory, and the application may end up being killed due to out-of-memory.
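
For a scripted version of the free -m check, swap usage can be derived from /proc/meminfo on Linux. The sketch below is a minimal illustration; the function name swap_used_mb is an assumption for this example, while SwapTotal and SwapFree are the standard /proc/meminfo fields.

```python
def swap_used_mb(meminfo_text):
    """Return swap currently in use (in MB) from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    used_kb = values.get("SwapTotal", 0) - values.get("SwapFree", 0)
    return used_kb / 1024


if __name__ == "__main__":
    with open("/proc/meminfo") as f:  # Linux only
        used = swap_used_mb(f.read())
    print(f"swap in use: {used:.0f} MB")
    if used > 0:
        print("swap is being used, expect degraded performance")
```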

Disk performance: Disk space utilization is something everyone monitors, but what gets overlooked is disk performance, i.e. read/write operations per second (tps), average wait time (await), disk utilization, etc. These can all be checked using commands like iostat, iotop and sar -d. As a rough guide, a SATA or SAS drive supports up to about 200 IOPS, whereas an SSD supports ~2000 or more. Pushing more IOPS than the disk supports can badly degrade performance, and this is mostly ignored while debugging problems.
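
To make the IOPS idea concrete, here is a rough sketch that derives IOPS from two snapshots of /proc/diskstats taken a known interval apart, in the same spirit as the tps column of iostat. The function names, the device name sda and the ceiling of 200 are assumptions for illustration; the field positions (reads completed and writes completed) follow the Linux /proc/diskstats layout.

```python
def completed_ios(diskstats_text, device):
    """Return reads completed + writes completed for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads_completed ... writes_completed ...
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]) + int(fields[7])
    raise ValueError(f"device {device!r} not found")


def iops(before_text, after_text, device, interval_seconds):
    """Average IO operations per second between two snapshots."""
    delta = completed_ios(after_text, device) - completed_ios(before_text, device)
    return delta / interval_seconds


def exceeds_ceiling(measured_iops, ceiling):
    """True when the measured IOPS is above what the disk is assumed to support."""
    return measured_iops > ceiling
```

In practice one would read /proc/diskstats, sleep for the interval, read it again, and compare the result with the ceiling for that disk type (for example 200 for a spinning disk, per the rough guide above).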

Network monitoring: Network availability is vital for an e-commerce company, and MakeMyTrip is no exception. Primarily we use PING checks for inter-DC & internet connectivity. For bandwidth monitoring at the network device level, we rely heavily on the low-maintenance monitoring platform “Observium”, and for external network content checks we use “Uptime Robot”, which checks our externally exposed domains from an external network every minute and alerts in case of unavailability.
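
A PING check of the kind mentioned above can be approximated with a small wrapper around the system ping command. This is a hedged sketch, not how Zabbix, Observium or Uptime Robot implement it: the function names are made up for this example, and the parsing assumes the usual Linux ping summary line ending in "% packet loss".

```python
import re
import subprocess


def packet_loss_percent(ping_output):
    """Extract the packet-loss percentage from the summary of ping output."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def ping_host(host, count=4):
    """Run the system ping and return the reported packet loss in percent.

    Raises ValueError when ping prints no summary (e.g. unresolvable host).
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return packet_loss_percent(result.stdout)
```

An alert could then fire when the returned loss stays above a chosen threshold across consecutive checks, which avoids paging on a single dropped packet.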

Heap/Threads/Connections: Many times we have seen that these metrics give the earliest pointer to a problem. One can prevent or considerably reduce the impact by detecting deviations in these KPIs early. Most of the time this group of metrics points directly to problems such as memory leaks or non-optimized thread pool configs, and hence must be part of your monitoring quiver.
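
One simple way to detect such a deviation is to compare the current value of a metric (say, the thread count) against a baseline built from recent samples. The sketch below uses a mean plus standard deviation rule; this is a generic technique chosen for illustration, not necessarily the rule our alerts use, and the name deviates and its default parameters are assumptions.

```python
import statistics


def deviates(history, current, sigmas=3.0, min_delta=1.0):
    """True when `current` is further from the recent mean than allowed.

    history   -- recent samples of the metric (e.g. thread counts)
    sigmas    -- how many standard deviations count as a deviation
    min_delta -- floor on the allowed distance, so a perfectly flat
                 history does not flag every tiny change
    """
    baseline = statistics.mean(history)
    spread = statistics.pstdev(history)
    allowed = max(sigmas * spread, min_delta)
    return abs(current - baseline) > allowed
```

For example, with recent thread counts of 100, 102, 98, 101 and 99, a reading of 150 is flagged while a reading of 101 is not.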

How we do system monitoring

Concept & Tools: Monitoring can be done through either a push or a pull model. For system-level monitoring, where you have to monitor so many individual servers, we find it better to rely on the push model. For that we have deployed the lightweight ZABBIX agent on each server. Zabbix is enterprise-level software designed for real-time monitoring of millions of system metrics collected from thousands of servers, services and network devices. It provides rich alerting along with visualization of captured metrics.

Visualization:

Grafana visualization screenshot

We also use Diamond to flush system metrics to OpenTSDB (a centralized time-series metric storage DB) every 2 minutes for all production systems, to visualize a comprehensive picture of system, application & biz metrics. For all production servers we collect 40 different metrics, which generates hundreds of thousands of metric data points every 2 minutes. We have a multi-node OpenTSDB cluster (setup details of which I will share in later blogs) to support 6 months' retention of captured metrics, which can be visualized on Grafana (our visualization dashboard tool).
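
To give a feel for what a flushed data point looks like, OpenTSDB accepts lines of the form "put <metric> <timestamp> <value> <tag=value ...>" on its telnet-style interface. The sketch below only formats such a line; it is not Diamond's own code, and the metric name sys.loadavg.05 and the host tag web01 are made-up examples.

```python
import time


def opentsdb_put_line(metric, timestamp, value, tags):
    """Format one data point as an OpenTSDB telnet-style put line."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    # Sort tags so the same data point always yields the same line.
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_text}"


if __name__ == "__main__":
    # Hypothetical metric and tag names, for illustration only.
    print(opentsdb_put_line("sys.loadavg.05", time.time(), 1.2, {"host": "web01"}))
```

Tagging every point with the host is what later lets a dashboard slice one metric per server or aggregate it across a whole group.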

Alerting:

Zabbix Alerts Screenshot

As already mentioned, we run the ZABBIX agent on each server, which sends metrics to the centralized ZABBIX server of the respective DC. One super good Zabbix feature we use is Auto-registration & Templates: the Zabbix agent is bundled into the OS kick-start image, so as soon as a new server is launched, whether in a DC or in the cloud, it automatically registers on Zabbix as a new host under the relevant host group and gets linked to the basic LINUX common template, which auto-creates the monitoring setup without any manual intervention. The basic LINUX common template has alerting for CPU utilization, 5-minute load average, Java threads, disk space usage, disk performance and ICMP ping. Besides the common template, each host group has another template with application-specific, content-check-based HEALTH CHECK alerts, PORT monitoring, application process monitoring, etc.
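
The two application-level checks named above, PORT monitoring and content-based health checks, boil down to very little logic. The sketch below shows that logic in plain Python; it is an illustration under assumed names (port_is_open, content_is_healthy), not the Zabbix template items themselves.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def content_is_healthy(status_code, body, expected_text):
    """Content check: HTTP 200 and the expected marker present in the body."""
    return status_code == 200 and expected_text in body
```

A content check catches the case a port check misses: the process is up and listening, but the application behind it is returning errors or an empty page.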

Let’s call it a day, but this is not the end. I will be back soon with the next part, where I will cover application SLA & business monitoring, including the architecture, tools and challenges we faced.