Towards building a Self-Healing System — Make metrics first-class citizens

Neeraj Sharma
6 min read · Mar 26, 2018


There is a lot that goes on behind the scenes to keep systems available all the time

There is a lot of talk about making systems more and more intelligent. This is primarily to offload work that can be automated, so that we (humans) spend a lot less time doing the obvious (maybe-reset-the-system-to-recover). There are numerous times when it is not as simple as pushing that reset button (no pun intended). Additionally, ever more complex requirements translate into highly available, fault-tolerant systems in which issue detection and recovery become ever more difficult.

This post is a sequel to my earlier story on “Soft Real-time, Fault Tolerant and Scalable E-Commerce Transaction Platform @ redBus powered by Erlang/OTP” [1], so feel free to look that up for overall context. Having said that, any sufficiently complicated system requires provision for extensive real-time system and application monitoring. More often than not, application monitoring does not get enough attention and is left for later. Additionally, many solutions rely on retrofitting existing generic strategies to cater to application monitoring, which eventually adds very little value. Though quick initially, this strategy is bad for building towards self-healing systems, primarily because of the lack of planning: once most of the system is built, it is very challenging to fit application monitoring in justifiably.

There is a lot that can be copied as-is from the well-established telecommunications industry (where I spent most of my professional career), which has a very strong focus on application metrics and alarms. A lot of that is driven by regulations and standards bodies, but once you have worked with them it is very difficult to ignore the benefits they bring along. It is very important to give equal (if not more) treatment to monitoring in comparison to business features. This is one of the key ingredients for keeping up with the ever-growing demand of maintaining uptime of such complicated systems in production, in addition to paving the way for building autonomous self-healing systems.

Rules of the Game!

  • Measure whatever needs to be measured as per business and system requirements.
  • Handle the complexities of transactions (if any). Did I mention that business metrics must honor transactions? Any rollback must be handled as well (see the sketch right after this list).
  • Measure them in real-time (anything slower than per-second metric collection and publishing is too slow).
  • Ensure that business compute takes priority over system monitoring (do not spend too much time measuring stuff under heavy system load).
  • Employ high-performance monitoring infrastructure to handle massive load of metric collection.
  • Automate (program) what you learnt, so the system takes corrective actions automatically.
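
Handling the rollback case can be as simple as pairing every increment with a compensating decrement. Below is a minimal sketch in Erlang using a plain ETS counter; the table name, the counter name and the record_booking/1 wrapper are made up for illustration, and a real system would hook the same idea into whatever metric library it uses.

    -module(txn_metrics).
    -export([init/0, record_booking/1]).

    %% A single in-memory (ETS) table holding business counters.
    init() ->
        ets:new(biz_metrics, [named_table, public, set]),
        ets:insert(biz_metrics, {bookings_confirmed, 0}),
        ok.

    %% Count a booking optimistically, then roll the counter back if
    %% the surrounding transaction fails. TxnFun is any fun() that
    %% either returns a result or throws/exits on failure.
    record_booking(TxnFun) ->
        ets:update_counter(biz_metrics, bookings_confirmed, 1),
        try
            TxnFun()
        catch
            Class:Reason:Stack ->
                %% Compensate: the business event never actually happened.
                ets:update_counter(biz_metrics, bookings_confirmed, -1),
                erlang:raise(Class, Reason, Stack)
        end.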

In order to achieve the above, it is very important to spend sufficient time upfront worrying about the business cases, while ensuring that you do not go overboard and collect too much either. It is critical to understand that for per-second (soft real-time) metric collection the information must be stored in the fastest digital medium available, which at present is the system main memory. This fact alone makes log-based monitoring solutions obsolete for such a use case. Additionally, when you can separate important system attributes from the rest, the requirement to dump text into log files (which would then have to be parsed unnecessarily later) does not arise. If you look closely, a lot of compute and network I/O can be saved by taking that one (seemingly) simple step of planning up front and making metrics first-class citizens while designing your systems.
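
A per-second collection-and-publish loop does not need to be elaborate. The sketch below is a bare-bones Erlang gen_server that ticks every second; collect/0 and publish/1 are deliberately left as stubs standing in for reading the in-memory metrics and pushing them to the datastore, so none of it reflects the actual redBus implementation.

    -module(second_reporter).
    -behaviour(gen_server).
    -export([start_link/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    init([]) ->
        erlang:send_after(1000, self(), tick),
        {ok, #{}}.

    handle_info(tick, State) ->
        %% Re-arm the timer first so the flush itself does not stretch
        %% the interval; the flush is kept cheap (read from memory and
        %% hand off) so business processes never wait on it.
        erlang:send_after(1000, self(), tick),
        publish(collect()),
        {noreply, State};
    handle_info(_Msg, State) ->
        {noreply, State}.

    handle_call(_Req, _From, State) -> {reply, ok, State}.
    handle_cast(_Msg, State) -> {noreply, State}.

    %% Stubs: collect/0 would read the in-memory counters/histograms,
    %% publish/1 would push the data points to the metric datastore.
    collect() -> [].
    publish(_DataPoints) -> ok.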

You have help — Say hello to DalmatinerDB

There are numerous libraries (like folsom for Erlang) which allow easy storage and retrieval of in-memory metrics. Additionally, there are numerous alternatives for massive storage and retrieval of metrics. The one used within the redBus transaction platform (and which sparked this post) is DalmatinerDB, an amazingly fast and optimal metric datastore. Although it is a bit rough around the edges, it is totally worth it.
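
For context on the in-memory side, this is roughly what working with folsom looks like from an Erlang shell; a small sketch assuming folsom's counter and gauge API, with hypothetical metric names.

    %% From an Erlang shell, with folsom in the code path:
    {ok, _} = application:ensure_all_started(folsom).

    %% An in-memory counter for a business event.
    folsom_metrics:new_counter(tickets_booked).
    folsom_metrics:notify({tickets_booked, {inc, 1}}).
    folsom_metrics:get_metric_value(tickets_booked).   %% => 1

    %% A gauge for a sampled value, e.g. current SQL pool usage.
    folsom_metrics:new_gauge(sql_pool_in_use).
    folsom_metrics:notify({sql_pool_in_use, 7}).
    folsom_metrics:get_metric_value(sql_pool_in_use).  %% => 7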

A single hardware node of the redBus transaction platform generates more than 17 million metric data points in a single day (roughly 200 every second), while data is retained (nearly) forever, thanks to the massive compression made possible by both the DalmatinerDB internal file format and ZFS (as suggested by the DalmatinerDB project).

What is so special about Business Metrics?

It is no surprise that business metrics are very different from general system metrics (like CPU). The former must be accurate, while the latter can be interpolated when missed (or dropped). There are multiple challenges in capturing real-time business metrics at scale while ensuring very little impact on system latency. It is important to note that application nodes must be loosely coupled with one another, so the intent should be to push metrics independently and directly to the metric datastore. In this case the application nodes send metrics directly (through a carefully crafted soft real-time policy) to DalmatinerDB simultaneously. This is possible with the provision of dimensions, wherein different nodes can independently send metrics without colliding, yet in a way that keeps aggregation easy.
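
The “dimensions” mentioned above essentially come down to how the metric key is constructed: every node writes under its own path, so concurrent writers never collide, yet a query can still aggregate across the node dimension. A rough Erlang sketch of such a key follows; the bucket prefix and path components are placeholders, and the actual wire protocol towards DalmatinerDB is not shown.

    -module(metric_key).
    -export([path/1]).

    %% Build a per-node metric path, e.g. on node 'txn1@host':
    %%   path([<<"booking">>, <<"latency_ms">>])
    %%     => [<<"redbus">>, <<"txn1@host">>, <<"booking">>, <<"latency_ms">>]
    %% Each application node writes under its own path, so writes from
    %% different nodes never collide, while a query can still aggregate
    %% across the node element of the path.
    path(Parts) when is_list(Parts) ->
        [<<"redbus">>, atom_to_binary(node(), utf8) | Parts].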

DalmatinerDB automatically interpolates metric data, so a special mechanism is required (in this case my patch) to ensure that interpolation is switched off. See my replace_below_confidence patch, which resets data to a default value when the confidence level is below a certain threshold.

Why per-second metrics?

Isn’t that the same as asking why 99.999% uptime?

The faster you detect, or better yet predict, issues, the fewer failures you will ever have. There is one inevitable truth in a rapidly changing industry such as e-commerce: “Deploy new logic faster than ever”. There is no getting around that fact, which, coupled with limited resources (Dev/QA), makes system availability and consistency very challenging.

Needless to say, avoid too many man-in-the-middle components (for intermediate processing) while publishing metrics, because they add extra compute and latency. Let the application work directly with raw metrics (counters, etc.), and better yet employ in-memory histograms as well, so that computing the 99.9th percentile in real time becomes possible.
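
As a concrete example, the 99.9th percentile can be read straight out of an in-memory folsom histogram, with no intermediate processing pipeline. A minimal sketch, assuming folsom's slide histogram (a time window in seconds) and that its statistics expose a percentile proplist with a 999 entry; the metric name is hypothetical.

    -module(latency_p999).
    -export([init/0, observe/1, p999/0]).

    %% One sliding-window histogram covering the last 60 seconds.
    init() ->
        folsom_metrics:new_histogram(external_call_ms, slide, 60).

    %% Record one latency sample (in milliseconds).
    observe(Millis) ->
        folsom_metrics:notify({external_call_ms, Millis}).

    %% Read the 99.9th percentile of the current window.
    p999() ->
        Stats = folsom_metrics:get_histogram_statistics(external_call_ms),
        Percentiles = proplists:get_value(percentile, Stats),
        proplists:get_value(999, Percentiles).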

Detailed in-memory histogram for critical metrics

The above screenshot shows a sample metric collected on one of the production systems, capturing latency in milliseconds towards an external service (with the help of Folsom).

What about Self Healing?

The first step towards self-healing is capturing as much relevant information as possible (remember the upfront time you should have spent) with very low latency. It is important to note that information, and especially metrics, have a very short span of relevancy. A temporary blip in, say, database connections can have huge impact within seconds, so worry about what you capture and capture it fast enough to be able to act on it.

As a side note, it is easy to reroute calls within an Erlang application cluster when, say, you are able to detect unbalanced connection pool usage for the SQL database. This is useful when requests to your application cluster are unbalanced, wherein most of the slow external calls (say, costly SQL queries) land on a smaller subset of nodes.
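
To make that concrete, the decision itself can be as small as comparing a per-node metric and sending the expensive call to the least-loaded node. A hedged sketch: the PoolInUse fun stands in for however your system exposes per-node connection-pool usage (it is not a real library call), and rpc:call/4 is just one possible rerouting mechanism.

    -module(reroute).
    -export([least_loaded/2, dispatch/5]).

    %% Pick the node whose SQL connection pool is least used right now.
    %% PoolInUse is a fun(Node) -> non_neg_integer() that reads the
    %% per-node pool-usage metric, however your system exposes it.
    least_loaded(Nodes, PoolInUse) ->
        Scored = [{PoolInUse(N), N} || N <- Nodes],
        {_Load, Node} = lists:min(Scored),
        Node.

    %% Route a costly call (say, a heavy SQL query) to the least-loaded
    %% node instead of running it wherever the request happened to land.
    dispatch(Nodes, PoolInUse, M, F, A) ->
        Node = least_loaded(Nodes, PoolInUse),
        rpc:call(Node, M, F, A).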

There is a lot more to building self-healing systems than just collecting metrics, but then I did not say this was the end of it, much less that it is fully achieved. It is an ever-changing target, but the further you get, the less you will have to worry about your systems at night.

References

[1] Neeraj Sharma, “Soft Real-time, Fault Tolerant and Scalable E-Commerce Transaction Platform @ redBus powered by Erlang/OTP”, Medium.