Tale: One Line To Monitor Them All!

10x Engineering Series – Microservices

UC Blogger
Urban Company – Engineering
6 min read · May 24, 2021


At Urban Company Engineering, we take a lot of pride in the platforms we build. One of the most leveraged platforms is that of our microservices. This blog is a part of the 10x Engineering Series – Microservices.
If you haven’t read the introduction, go here.
This blog showcases an example of what-is-possible in our engineering world — based on just 2 simple steps: RPC, Directory of Services/DBs.

“We have grown to be a team known for bold interventions, timed well.”

No conversation is complete without bringing in metrics! So here it is: what does it take to measure a service at Urban Company?

One line!

/////////// server.js
const RPCFramework = require('openapi-rpc-node');
RPCFramework.initService();

The above is the standard way ALL services are initiated, and that’s what gives a service FULL monitoring capabilities by default. It’s all baked into our RPCFramework. Nothing more is needed. 100% adoption.

That's simple. But what does it actually allow?

Rather than talk about it, let us show you what our monitoring looks like.

The things you should take away are: the breadth of metrics captured, the depth of segmentation, a singular consolidated dashboard, and alerts on top.

We use Grafana with Prometheus on top of standardised logs / metrics to power our monitoring. We also have Elastic APM for a tracing use case.

What Do We Monitor?

Throughput, Response Time

Nothing special. The standard things – like p95s, p99s, last-week-same-time, etc.

Sensible Errors

We have a standard definition of an error called UCError. Any service error thrown knowingly (a dev-created error) or unknowingly (a system/framework error) is wrapped in this, following a simple structure. This allows us to standardise what we capture: error name, error stack, error parameters!
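
For illustration, a UCError-style wrapper might look roughly like this. This is a minimal sketch: the real UCError lives inside openapi-rpc-node and its exact fields may differ.

// Hypothetical sketch of a standardised error wrapper; field names are illustrative.
class UCError extends Error {
  constructor(err_name, message, err_params) {
    super(message);
    this.err_name = err_name;     // e.g. 'ORDER_NOT_FOUND'
    this.err_stack = this.stack;  // stack trace captured when the error is created
    this.err_params = err_params; // structured context, e.g. { order_id: 'abc-123' }
  }
}

// A "knowing" (dev-created) error, thrown with structured parameters.
function getOrder(orderId) {
  throw new UCError('ORDER_NOT_FOUND', 'No order found for the given id', { order_id: orderId });
}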

Segmenting Metrics on API & Client

All our service metrics are collected at a (service) x (api) x (caller) level. This allows us to view metrics at a very fine granularity, e.g. errors by calling service, per API!
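
Conceptually, this segmentation is just a set of labels on every metric. Here is a minimal sketch using prom-client (the metric name, buckets, and example values are assumptions, not our actual configuration):

const client = require('prom-client');

// Hypothetical histogram; the real metric names and buckets live inside the framework.
const apiLatency = new client.Histogram({
  name: 'rpc_api_latency_ms',
  help: 'API latency segmented by service, api and calling client',
  labelNames: ['service', 'api', 'caller'],
  buckets: [10, 50, 100, 250, 500, 1000, 2500]
});

// Recorded once per request by the RPC middleware.
apiLatency.observe(
  { service: 'order-service', api: 'createOrder', caller: 'cart-service' },
  42 // latency in ms
);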

Databases

Nothing special. Good APMs will give you this.

Events

We monitor all the async communication happening for a service (published/consumed events) with its throughput, delays, etc. This is structured by event name, making it very easy to understand.
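
A rough sketch of what this boils down to, again using prom-client with hypothetical metric names, attached to the framework's publish/consume hooks:

const client = require('prom-client');

// Hypothetical metrics; names and buckets are illustrative.
const eventsConsumed = new client.Counter({
  name: 'events_consumed_total',
  help: 'Events consumed, segmented by event name',
  labelNames: ['event_name']
});
const eventDelay = new client.Histogram({
  name: 'event_consume_delay_ms',
  help: 'Delay between publish and consume, by event name',
  labelNames: ['event_name'],
  buckets: [100, 500, 1000, 5000, 30000]
});

// Called by the consumer wrapper for every event it processes.
function onEventConsumed(eventName, publishedAtMs) {
  eventsConsumed.inc({ event_name: eventName });
  eventDelay.observe({ event_name: eventName }, Date.now() - publishedAtMs);
}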

Infra & Auto Scaling

We are able to track infra metrics for a service in the same monitoring tool. The most basic information, such as CPU usage or autoscaling in action, is easily displayed.

Cost

This one is interesting. Yes, at a service level, we are able to show the daily cost of running that service. This is split across application servers, database servers, queues, and scripts.

Examples: throughput, response time & errors for a service, available at the service, api, and calling-client level
Examples: databases, events & infra for a service

Alerting by Default!

Since we have good control on the platform and have standardised errors, we have also taken the opportunity to build service-level alerts by default!

Alerts bring our metrics to our doorstep!

The secret behind good alerting is having a good on-call process, a roster of on-calls that everyone knows, a way to reach out to the on-calls, and a good mechanism to figure out an alert.

All of these are well platformised for us.

  • We maintain a roster of all on-calls by teams.
  • We have all services mapped to teams. (allows us to map on-call to service)
  • We have standardised monitoring & error capture. (allows us to build common alerts)
  • We use common tools to track an alert and talk about it. (Jira, OpsGenie, Slack)

We are able to capture a wide variety of alerts – service downtimes, throughput or latency anomalies (service & databases), slow queries, etc.
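
For a flavour of what a default alert could look like, here is a purely hypothetical definition. The names, thresholds, and channels are illustrative and not our actual configuration.

// Illustrative shape of default service-level alerts; not our real config.
const defaultAlerts = [
  {
    name: 'high_error_rate',
    metric: 'rpc_api_errors_total',              // already segmented by service x api x caller
    condition: 'error rate > 5% over 5 minutes',
    notify: ['slack:#oncall-orders', 'opsgenie:order-team'] // routed via the service-to-team mapping
  },
  {
    name: 'latency_anomaly',
    metric: 'rpc_api_latency_ms',
    condition: 'p99 > 2x last-week-same-time',
    notify: ['slack:#oncall-orders']
  }
];

module.exports = defaultAlerts;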

What’s super helpful is that the conversation takes place in one common place, so that multiple people can chip in.

We use a ton of Slack! Alert tickets show up there, we open up threads, and get on debug video calls. It's all that simple. No chasing around!

(You can, of course, customise the alerts on top. That's easy to do.)

Simple example: Create Alert > On Call on it! > Closed

Monitoring Traces!

In our monitoring stack, we also track slow queries for a service. A sampled slow query can be expanded into its trace, showing the different services and databases called. This helps a ton while debugging.
(We use Elastic APM for this, which is integrated into our RPC framework.)
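
For context, wiring Elastic APM into a Node service is typically a one-time call at startup. A minimal sketch of the kind of setup the framework does on your behalf (the configuration values here are placeholders):

// Must be started before other modules are required, so that
// outgoing HTTP and database calls get instrumented automatically.
const apm = require('elastic-apm-node').start({
  serviceName: process.env.SERVICE_NAME,          // placeholder
  serverUrl: process.env.ELASTIC_APM_SERVER_URL,  // placeholder
  transactionSampleRate: 0.1                      // sample a fraction of requests for traces
});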

How Does This Work?

The secret sauce is in structuring the information and capturing it across all the right points. Since all our services follow a standard RPC framework, we can insert whatever we want as a middleware. Over time, we have been able to build the framework out with the best of tools and technologies to help our engineers.

If you recall the “Zero Boilerplate Service” blog, it walked you through how to build a simple RPC and why it’s awesome as a platform layer.

Let’s extend that to see how we capture simple things.
(this is a simple example to demonstrate how easy it is to get started!)

const express = require('express');

// Stand-in logger; in the real framework this feeds the standardised logs/metrics pipeline.
const Logger = {
  log: (type, payload) => console.log(type, JSON.stringify(payload))
};

const RPCFramework = {
  // `service` is a map of { methodName: handlerFunction } exposed over HTTP POST.
  createServer: function (port, service) {
    const app = express();
    app.use(express.json());

    Object.entries(service).forEach(function ([methodName, handler]) {
      app.post('/' + methodName, function (req, res) {
        const start_time_ms = Date.now();
        const params = req.body;

        Promise.resolve(handler(params))
          .then(function (data) {
            res.send({
              error: false,
              data: data
            });
          })
          .catch(function (err) { // ERROR HANDLING & LOGGING
            Logger.log('error', { err_name: err.err_name || err.name, err_stack: err.stack, err_params: err.err_params });
            res.send({
              error: true,
              err_name: err.err_name || err.name,
              err_stack: err.stack,
              err_params: err.err_params
            });
          })
          .finally(function () { // LOGGING API LEVEL MONITORING
            const latency_ms = Date.now() - start_time_ms;
            Logger.log('apm', { api: methodName, latency_ms: latency_ms });
          });
      });
    });

    app.listen(port);
  }
};

module.exports = RPCFramework;

Since this is a central framework, you can add a lot more to this. Our framework started off as simple as the above, but has evolved into a very mature codebase. We leverage existing technology wherever possible, or create our own tools. All of this is, however, easy to adopt because it belongs to a singular stack used across the whole of engineering.

Summary

We have invested in bringing all our errors and service tracking under a single platform that helps standardise and structure all the information. This has enabled us to build a powerful monitoring stack across all teams, one that gives them all the information about a service in a single place and brings alerts to our doorstep!

Sounds like fun?
If you enjoyed this blog post, please clap 👏 (as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc.).

You can read up more about us on our publications —
https://medium.com/uc-design
https://medium.com/uc-engineering
https://medium.com/uc-culture

https://www.urbancompany.com/blog/humans-of-urban-company/

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
