How To Monitor Your Production?
Automate the Process and Know Before Your Customers Get To Experience the Fault
I used to work on a system with a 24-hour SLA and around 3,000 users. Our production support process worked in the following order:
This process works well when the blast radius is small, that is, when an issue affects few users and few services. But in a huge system with more than 100 components and more than 1 million customers, this reactive model will fail.
The trick is to know that your system has an issue before your customers do.
For huge and complicated systems, it is recommended to automate production monitoring. If something goes wrong, alarms trigger so that engineers can be engaged to mitigate the issue.
Here are the most important components of an automated monitoring ecosystem:
1. Logs:
To debug any issue, you need logs. Logs are the source of truth for the stack traces of all errors. Every component, from application services to DNS servers, should publish its logs somewhere. To visualise application logs we can use the ELK stack or Splunk.
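As a minimal sketch of what publishing logs can look like inside a service, the snippet below uses Python's standard `logging` module to emit one JSON object per line, including the stack trace, which is a shape that log pipelines can typically parse. The field names and the service name `checkout-service` are assumptions made for illustration, not something a particular tool requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,  # assumption: logger name doubles as service name
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the full stack trace so the error can be debugged later.
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service_name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout-service")  # hypothetical service name
    try:
        1 / 0
    except ZeroDivisionError:
        # log.exception records the message at ERROR level plus the stack trace.
        log.exception("payment calculation failed")
```

Writing to standard output keeps the service unaware of where the logs end up; a separate shipper can forward them to whichever log store is in use.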
2. Dashboards:
Like logs, dashboards are one of the most important parts of monitoring. Dashboards display the basic stats of each service, such as:
Total requests, error rate, latencies (P9x), CPU usage, memory usage, pod health, upstream error rates and latencies, downstream hits and error rates, and database stats.
Apart from these basic dashboards, every service has its own core functionality, which should also be mapped in functional dashboards.
Grafana (MMS) dashboards can be used extensively, as they work on the metrics emitted by the different components, which are lightweight, real-time and scalable. Custom functional dashboards can be configured by writing custom queries in the log visualiser (Kibana/Splunk).
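To make a few of the basic stats above concrete, here is a small sketch that derives total requests, error rate and P9x latencies from a batch of request samples. In practice the metrics backend computes these; the `Request` type, the choice to count only 5xx responses as errors, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    status_code: int


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: List[Request]) -> Dict[str, float]:
    """Compute the headline numbers a basic service dashboard would show."""
    if not requests:
        raise ValueError("no requests to summarize")
    total = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status_code >= 500)
    latencies = [r.latency_ms for r in requests]
    return {
        "total_requests": total,
        "error_rate_pct": 100 * errors / total,
        "p90_ms": percentile(latencies, 90),
        "p99_ms": percentile(latencies, 99),
    }
```

Percentiles are used instead of averages because a handful of very slow requests can hide behind a healthy-looking mean.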
3. Alerts:
As part of the production monitoring ecosystem, alerts should be configured based on the historical usage and state of your application. Manual intervention is needed when something goes wrong, but nobody can diligently watch every dashboard all the time, and trying to do so is an extra cost to the company.
The automated approach with alerts lets the on-call team know that something is wrong. Alerts can have a severity of critical or warning, and typically cover conditions such as:
- Increase in error rate (as a percentage)
- Increase in API latencies
- Requests/sec higher than regular traffic
- Increase in System Memory Usage above a threshold
- Increase in CPU usage
- Restarts of Pods
- Cluster down
- Network errors
Each alert should have an associated playbook so that on-call engineers can work out what to do in minimal time.
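The idea of severity thresholds plus a playbook per alert can be sketched as follows. This is an illustrative model rather than the configuration format of any particular alerting tool; the `AlertRule` fields, the threshold values and the playbook path are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    metric: str
    warning_above: float
    critical_above: float
    playbook: str  # where the on-call engineer finds the mitigation steps


@dataclass(frozen=True)
class Alert:
    metric: str
    severity: str
    value: float
    playbook: str


def evaluate(rule: AlertRule, value: float) -> Optional[Alert]:
    """Return an alert if the value crosses a threshold, otherwise None."""
    if value > rule.critical_above:
        severity = "critical"
    elif value > rule.warning_above:
        severity = "warning"
    else:
        return None
    return Alert(rule.metric, severity, value, rule.playbook)


if __name__ == "__main__":
    # Hypothetical rule: thresholds would come from historical usage.
    rule = AlertRule(
        metric="error_rate_pct",
        warning_above=1.0,
        critical_above=5.0,
        playbook="playbooks/high-error-rate.md",
    )
    print(evaluate(rule, 7.2))
```

Carrying the playbook reference on the alert itself means the notification can point the engineer straight to the mitigation steps.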
4. Watcher / Alert Manager:
Alerts can be configured in different components, for example as Prometheus alerts or Splunk alerts. But once an alert goes off, there has to be a Watcher or Alert Manager that identifies the alert and lets the notification systems react as configured.
Spotlight does a great job of aggregating alerts from different systems and triggering notifications to different systems.
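A watcher of this kind can be pictured roughly as below: it accepts alerts from several sources, suppresses duplicates of an alert that is already active, and forwards new ones to the notifiers registered for that severity. This is a toy model and not how any specific product works; the alert dictionary shape and all names are assumptions.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical alert shape with keys: source, service, name, severity.
Alert = Dict[str, str]
Notifier = Callable[[Alert], None]


class AlertWatcher:
    """Collects alerts from several sources and forwards each new one once."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Notifier]] = {}
        self._active: Set[Tuple[str, str]] = set()

    def add_route(self, severity: str, notifier: Notifier) -> None:
        """Register a notifier to be called for alerts of this severity."""
        self._routes.setdefault(severity, []).append(notifier)

    def receive(self, alert: Alert) -> bool:
        """Forward the alert unless the same one is already active."""
        key = (alert["service"], alert["name"])
        if key in self._active:
            return False  # duplicate, e.g. the same problem seen by two sources
        self._active.add(key)
        for notify in self._routes.get(alert["severity"], []):
            notify(alert)
        return True

    def resolve(self, service: str, name: str) -> None:
        """Mark an alert as resolved so a later recurrence notifies again."""
        self._active.discard((service, name))
```

Deduplicating on service and alert name means that one incident reported by two monitoring systems results in a single notification instead of two.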
5. Notification Systems:
There are different ways of notifying a team about an alert, such as chat apps like Teams or Slack, or email. One of the easiest notification systems to use is xMatters, through which on-call engineers get calls on their registered numbers and have to acknowledge them.
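The call-and-acknowledge flow can be sketched as a simple escalation loop: page each on-call engineer in turn until someone acknowledges. The `page` callable here is a hypothetical stand-in for whatever paging integration is in use, and the function does not reflect the API of any real notification product.

```python
from typing import Callable, List, Optional


def page_until_acknowledged(
    on_call: List[str],
    page: Callable[[str], bool],
    max_rounds: int = 2,
) -> Optional[str]:
    """Page engineers in order until one acknowledges.

    `page` is a hypothetical function that contacts one engineer and
    returns True if they acknowledged. Returns the name of the engineer
    who acknowledged, or None if nobody did within `max_rounds` passes.
    """
    for _ in range(max_rounds):
        for engineer in on_call:
            if page(engineer):
                return engineer
    return None


if __name__ == "__main__":
    # Simulated paging: only "bob" picks up the call.
    who = page_until_acknowledged(["alice", "bob"], lambda name: name == "bob")
    print(f"acknowledged by: {who}")
```

A `None` result is the signal to escalate further, for example to a secondary rotation or a manager.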
Pro tips
- Identify the most important metrics of the product. Plot those metrics in Grafana alongside historical data from the same time of day over the last 2 weeks. This helps you see how the whole system is performing relative to external traffic.
- Stress test your system at regular intervals to maintain the reliability and performance of the product.
- Identify upcoming high-load events and keep the system ready for them.
- Set up proper rotations and handoffs between the on-call teams supporting the whole system.
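The first tip, comparing a metric against the same time in previous weeks, comes down to simple arithmetic. Here is a minimal sketch, assuming the historical samples have already been fetched from the metrics store; the function name and the example numbers are made up for illustration.

```python
from statistics import mean
from typing import List


def deviation_from_baseline(current: float, same_time_history: List[float]) -> float:
    """Percentage difference between the current value and the historical mean.

    `same_time_history` holds the metric's value at the same time of day
    on earlier dates, e.g. one sample per week for the last 2 weeks.
    """
    if not same_time_history:
        raise ValueError("no historical samples")
    baseline = mean(same_time_history)
    if baseline == 0:
        raise ValueError("baseline is zero; percentage deviation is undefined")
    return 100 * (current - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical numbers: requests/sec now vs. the same hour in the last 2 weeks.
    change = deviation_from_baseline(1500, [1000, 1000])
    print(f"traffic is {change:+.1f}% relative to baseline")
```

A large positive or negative deviation is a prompt to check whether traffic is genuinely unusual or whether something upstream has broken.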
Conclusion
You should design your monitoring process based on the complexity and SLA of your system. The examples I described above are the two ends of the scale for a production monitoring system. Depending on the use cases of the system and the cost appetite of the company, we have to decide how far along that scale we want to go.