Published in Sinch Blog

What is a System Metric and why do we use it at Sinch?

Hello, I’m Regis Gabetta, and I’m currently working in the Sparta/Engineering team as a Senior Software Development Analyst. I have been working with our Latam platforms since February 2018.

Hello, I’m André Satorres, and I’m also on the Sparta team. Currently, I’m a Software Development Analyst, and I’ve been working here since January 2020.

In the Sparta team, we manage applications, services, and resources, making sure the platform runs correctly for all our customers.

Today we’ll be talking about system metrics: how we ensure that our services are working as expected, and how we detect and fix problems fast.

A system metric is basically any information about your application that you can use to analyze, understand, and track your system’s behavior. System metrics can be execution times, memory usage, logs, error rates, etc. We’ll discuss some of these types in this post.

The greatest gain that metrics provide is the ability to fix issues quickly. If, for example, a system goes down or errors begin to occur in the platform, we’re instantly alerted and start working on it right away. Since our platform handles thousands of messages per second, the more time we take to fix an issue, the more messages we lose, which directly impacts our revenue. It’s very important to always know whether the platform is up and working correctly.

Here’s an example: suppose we create a metric tracking how many messages flow through a system at each moment. Say that at 3pm we have 1,000 messages per second, and that for some reason, 30 minutes later the count drops to 0. Something’s wrong: we can’t have 0 messages, because there’s always at least one active customer. So, what happens when the number hits 0? Alerts ring on Slack, and a few teams receive them — Sparta is one of them. When we receive one of these alerts, we can check what happened in more detail in Grafana, an open-source visualization and analytics tool where we build dashboards.
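As a minimal sketch of the rule behind such an alert (the threshold and method name are illustrative — in practice the alert rule lives in the monitoring stack, not in application code):

```java
// Hedged sketch of a zero-traffic alert check.
public class ThroughputAlert {
    // Assumed floor: the platform always has at least one active customer,
    // so a sustained rate of 0 messages/second indicates a problem.
    static final double MIN_MESSAGES_PER_SECOND = 1.0;

    static boolean shouldAlert(double messagesPerSecond) {
        return messagesPerSecond < MIN_MESSAGES_PER_SECOND;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(1000.0)); // normal afternoon traffic
        System.out.println(shouldAlert(0.0));    // would page the teams on Slack
    }
}
```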

Teams cooperate to solve these issues. Some happen in the database, others in the infrastructure, so the DevOps team may help us. Sometimes the errors are related to the project itself, and then we on the Sparta team are in charge of fixing them, or perhaps a whole system is down and we must restart it.

One of the most important metrics to implement is the latency of synchronous calls. We have a system called Configuration-API that uses synchronous calls, which means we must wait for each response. We can use metrics to verify whether the latency is too high or the system too slow. For example, if the expected response time is 1 or 2 milliseconds but for some reason a call takes 1 second, something is wrong: the machine is not processing (or receiving) requests as fast as we expect, so we need to check for CPU or memory issues, network latency, and so on.

Number of requests per second vs. average response time
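A minimal way to capture this kind of latency measurement, assuming a simple timing wrapper (the 2 ms budget mirrors the expected values above, and the `Runnable` stands in for the real Configuration-API client call):

```java
import java.util.concurrent.TimeUnit;

// Hedged sketch: timing a synchronous call against a latency budget.
public class LatencyCheck {
    static final long BUDGET_MILLIS = 2; // assumed expected response time

    static long timedCallMillis(Runnable call) {
        long start = System.nanoTime();
        call.run(); // block until the synchronous response arrives
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long elapsed = timedCallMillis(() -> { /* stand-in for the real call */ });
        if (elapsed > BUDGET_MILLIS) {
            System.out.println("Latency above budget: " + elapsed + " ms");
        } else {
            System.out.println("Latency within budget: " + elapsed + " ms");
        }
    }
}
```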

A metric that we have for almost every system is throughput. It is used to check whether the system is processing the expected number of requests for that time of day. We also track heap size and JVM memory usage for all systems, to check that they’re healthy and not consuming too many resources or hitting memory errors. We generally have error metrics as well: if a system reports too many errors at once, we check its logs to find the problem.
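The heap usage these health checks rely on can be read straight from the JVM’s standard management API; a small sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hedged sketch: reading JVM heap usage via the standard MemoryMXBean.
public class HeapMetric {
    static long usedHeapMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / (1024 * 1024);
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long maxMb = heap.getMax() / (1024 * 1024); // getMax() may be -1 if undefined
        System.out.println("Heap used: " + usedHeapMb() + " MB (max: " + maxMb + " MB)");
    }
}
```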

Another very useful metric is the queue size. If it keeps growing, we may conclude that there’s a bug and the application has stopped consuming messages, and we must fix that. We also check whether the application is healthy and ready to receive requests. So, there are plenty of metrics that we use for different purposes.
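The queue-depth check can be sketched as a simple threshold comparison (the threshold is illustrative, and in practice the depth would come from the broker’s own metrics):

```java
// Hedged sketch of a queue-depth health check.
public class QueueDepthCheck {
    static final int MAX_DEPTH = 10_000; // assumed alert threshold

    static boolean isBacklogged(int currentDepth) {
        // A depth far above normal usually means consumers have stalled.
        return currentDepth > MAX_DEPTH;
    }

    public static void main(String[] args) {
        System.out.println(isBacklogged(150));    // normal operation
        System.out.println(isBacklogged(50_000)); // consumers likely stuck
    }
}
```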

Metrics can also be used to identify where your system uses the most resources or takes the most time to process something. You can use a metric to measure the time needed to complete an operation. Generally, we use a StopWatch implementation, a Java class that counts the time spent performing an operation. This way, we can see how long an operation takes to finish, which helps us understand whether it’s taking more time than it should, or spot a point in the application we can improve to get better performance, reduce costs, and be more profitable.
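The text doesn’t say which StopWatch class is used (both Spring and Apache Commons ship one), so here is a minimal equivalent built on `System.nanoTime()`:

```java
// Minimal StopWatch equivalent, sketched with System.nanoTime().
public class SimpleStopWatch {
    private long start;
    private long elapsedNanos;

    public void start() { start = System.nanoTime(); }
    public void stop()  { elapsedNanos = System.nanoTime() - start; }
    public long elapsedMillis() { return elapsedNanos / 1_000_000; }

    public static void main(String[] args) {
        SimpleStopWatch watch = new SimpleStopWatch();
        watch.start();
        // ... the operation being measured goes here ...
        watch.stop();
        System.out.println("Operation took " + watch.elapsedMillis() + " ms");
    }
}
```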

There are some tools that we use, such as Micrometer, Prometheus and Grafana. Micrometer exposes the application’s metrics, Prometheus scrapes and stores them, and Grafana queries Prometheus to display them. In Grafana we have custom dashboards, graphs where we can see all the metrics mentioned above and more, and every dashboard can be filtered by time.
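As a sketch of how that pipeline is wired up, a Prometheus scrape config might look like the following; the job name, port, and the `/actuator/prometheus` path (the Spring Boot default for a Micrometer endpoint) are all assumptions, not the actual Sinch setup:

```yaml
# Hypothetical scrape config: Prometheus pulls from the endpoint
# Micrometer exposes; Grafana then queries Prometheus.
scrape_configs:
  - job_name: 'configuration-api'        # assumed service name
    metrics_path: '/actuator/prometheus' # assumed Micrometer endpoint path
    static_configs:
      - targets: ['localhost:8080']      # assumed host:port
```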

These metrics are very important to the product we deliver, because our clients expect a system that runs without errors and sends their messages properly. If a client stopped receiving messages, without metrics they would have to contact us personally, or we would never know. With these metrics in place, the system alerts us, so the client doesn’t even need to contact us.

An important benefit is that with metrics we can define service level agreements (SLAs), such as “95% of our messages are going to be delivered”, and set rules to make that happen. The metrics tell us whether we are meeting this service level — whether we’re really delivering 95% or more of the messages. Metrics give us proof that our system is working correctly and delivering what our customers expect.

Another important SLA-related metric is how long a message stays in our platform. For example, if it takes 5 seconds to send a message, that’s a good result. But if for some reason it’s taking 15 or 20 seconds instead, we have a problem.
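The two checks described above can be sketched as follows; the 95% target and the 5-second budget come from the examples in the text, while the method names are illustrative:

```java
// Hedged sketch of the delivery-rate and time-in-platform SLA checks.
public class SlaCheck {
    static final double TARGET_DELIVERY_RATE = 0.95; // "95% delivered" example
    static final long MAX_SECONDS_IN_PLATFORM = 5;   // "5 seconds" example

    static boolean meetsDeliverySla(long delivered, long total) {
        return total > 0 && (double) delivered / total >= TARGET_DELIVERY_RATE;
    }

    static boolean meetsLatencySla(long secondsInPlatform) {
        return secondsInPlatform <= MAX_SECONDS_IN_PLATFORM;
    }

    public static void main(String[] args) {
        System.out.println(meetsDeliverySla(960, 1000)); // 96% delivered: within SLA
        System.out.println(meetsLatencySla(20));         // 20 s in platform: problem
    }
}
```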

Metrics can also help us track behaviors and build procedures around them. If a known error occurs when a system is down, we can link a procedure to the metric, so the support team can quickly solve the problem and the customer is happier. In this case, we as developers don’t even have to act, because the support team can solve it on their own. Of course, more specific issues still escalate to us.

We have highly qualified people on our team, so we’re constantly learning here. When there is a complex problem, you’re never alone, and the teams work together.

Besides that, we have monthly one-on-one meetings with our managers, which give us valuable feedback about our work, so we always know how we’re performing inside the company. These meetings are also a great place to share experiences, not only professional ones but also personal experiences that make us better people.

At Sinch, we’re empowered to share our ideas and be vocal about how we believe a problem should be solved and make decisions.
