Making Time for Instrumentation and Observability

Jack Yeh
TeamZeroLabs
Sep 13, 2020

How are the numbers looking?

Working in tech start-ups, we are often asked about the metrics of each technical component in production. Ambitious new startups have the “build it first, then they will come” mentality. Whether their offerings are going viral or dead in the water, they prefer shipping new features to thinking about platform reliability.

Bigger shops with an analytics person (or team) on board can put together an analysis of which direction the business should go. However, as the product becomes bigger and more complex, it grows to have a temperament of its own.

You may be familiar with these customer service reports:

“Customer is saying the form loads for them, but it does not complete.”

“Customer is saying their identification request has not completed in a week. Shouldn’t it finish in three days?”

“Customer is complaining about the product not loading and blank documents are shown.”

If you have heard these things, you may be familiar with this internal chatter:

“I thought we rolled out that fix last month, why is it coming back?”

“Didn’t we QA for international names?”

“Are we being hacked? Why is the site behaving so badly?”

For every customer report you hear about, ten to a hundred other customers reloaded, retried, and gave up. By the time it reaches the engineering team, it has impacted thousands or more.

How can we stay on top of these technical issues? This is where instrumentation and observability come into the picture.

Is your code instrumented today?

  • Can you give me the number of successfully handled requests vs failed ones? (An example follows this list.)
  • How about the number of times a customer order was inserted into the database?
  • If this data is spread across the database and remote log storage, how much effort would it take you to put together a report?
  • How about a report that updates hourly?

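The first question above maps directly to a single labeled counter. Here is a minimal sketch, assuming a Python service and the official prometheus_client library; the metric name and the process/handle_request functions are placeholders rather than anything your codebase already has:

    # pip install prometheus_client
    from prometheus_client import Counter

    # One counter, split into success/failure through a label.
    REQUESTS_HANDLED = Counter(
        "app_requests_handled_total",
        "Requests handled by the service, by outcome",
        ["outcome"],
    )

    def process(request):
        pass  # placeholder for the real business logic

    def handle_request(request):
        try:
            process(request)
            REQUESTS_HANDLED.labels("success").inc()
        except Exception:
            REQUESTS_HANDLED.labels("failure").inc()
            raise

With that counter in place, "how many requests succeeded vs failed this week?" becomes a single query rather than a log-archaeology project.
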
Conventional troubleshooting relies on building pattern matching rules on log files. In some cases, operators log into the server to look at logs directly. The more elements there are in the system, the more places errors can spontaneously appear out of thin air.
We know the mantra to keep everything as simple as possible, of course. But some problems do require taking on additional expertise and complexity. At the end of the day, you may have more logs than time available to scan through them all.

How do we detect warning signs before they impact business revenue, given limited team bandwidth?

The answer is: learn about open source instrumentation systems. The two products highly recommended by the community are Grafana and Prometheus.
https://radar.cncf.io/2020-09-observability

Forget about logs. Focus on metrics first.

Instrumentation allows us to keep tabs on a program's current state.

  • We can declare a counter that gets incremented whenever a record is successfully inserted into the database.
  • We can measure the amount of time an external system takes to handle our requests.
  • We can measure the current environment's CPU/memory usage to reflect on the possibility of memory leaks. (A sketch of all three follows this list.)

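As a minimal sketch of those three bullets, assuming a Python application and the official prometheus_client library (the metric names and the save_order/call_partner_api functions are placeholders for your own code):

    import time

    from prometheus_client import Counter, Histogram

    ORDERS_INSERTED = Counter(
        "orders_inserted_total",
        "Customer orders successfully inserted into the database",
    )

    PARTNER_LATENCY = Histogram(
        "partner_api_latency_seconds",
        "Time the external partner system takes to handle our requests",
    )

    def save_order(order, db):
        db.insert(order)       # placeholder for the real insert
        ORDERS_INSERTED.inc()  # only counted when the insert succeeds

    def call_partner_api(payload):
        with PARTNER_LATENCY.time():  # records the elapsed seconds as one observation
            time.sleep(0.1)           # placeholder for the real network call

For the third bullet, the Python client already exposes process-level metrics such as process_cpu_seconds_total and process_resident_memory_bytes by default on Linux, so a slow memory leak shows up as a steadily climbing line on a dashboard.
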
Whereas logs allow an investigator to pinpoint exactly where a user journey goes wrong, metrics build a top-level model for the team to operate with.

Imagine a person starting a diet and exercise routine:

They are instructed to keep tabs on the calorie content of the food they eat, and on the intensity and length of each exercise performed. Lastly, they must record their weight at regular intervals! When we are not thinking in terms of metrics, we lack the proper units to even frame our goals.
A diet routine without numbers may work very well, but we cannot be absolutely certain until we actually measure.

The same can be said for creating and maintaining software offerings. If we are not thinking in terms of metrics such as:

  • 99th percentile request response time
  • server uptime
  • error and disconnect rates

we are already lost in terms of quality. We can make tweaks, apply fixes, and push new features to our platform, but we are not sure whether they make matters better or worse. The only certainty when flying blind is that the errors in the logs have stopped. But was it due to our fixes? Or would a restart have fixed it anyway? Who knows?!

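Once those numbers are flowing into Prometheus, each of them is one query away. As a rough sketch, the hourly report asked about earlier could be a script this small (assuming the requests library, a Prometheus server at localhost:9090, a scrape job named my-app, and metric names such as http_request_duration_seconds and app_exceptions_total that your own instrumentation would define):

    # pip install requests
    import requests

    PROMETHEUS = "http://localhost:9090/api/v1/query"  # assumed Prometheus address

    QUERIES = {
        "99th percentile response time (s)":
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
        "uptime over the last 24h (0 to 1)":
            'avg_over_time(up{job="my-app"}[24h])',
        "error rate (per second)":
            'sum(rate(app_exceptions_total[5m]))',
    }

    for label, query in QUERIES.items():
        result = requests.get(PROMETHEUS, params={"query": query}).json()["data"]["result"]
        value = result[0]["value"][1] if result else "no data"
        print(f"{label}: {value}")

Run it from cron every hour and the "report that updates hourly" question from earlier stops being a project.
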
Forget about logs. Focus on metrics first.

“But we are too busy to spend time on measurements!”

No one is perfect. We have a limited amount of time available for choosing winning strategies and implementing them. Writing code without customer input and feedback, and without insight into how that code is running, is a frightening reality many developers face.

If you care about winning and staying in business, you need to keep your customers happy and your services reliable enough. Just as the dieting and exercising person must count calories, time their exercises, and record their weight, developer teams can sit down to figure out which numbers to measure, and leverage Prometheus and Grafana to keep measuring them.

You literally cannot set objectives without measurements.

“Ok, but how many steps are there?”

Now onto the business of monitoring itself. Here are the eight steps any engineer can follow to get insight into their platform:

  1. Install Prometheus. It retains two weeks of metrics by default, which is enough for most start-ups. If you are a bigger business that needs longer data retention, you know who to reach out to.
  2. Install Grafana. Ideally, it would use an external database (such as Postgres) instead of the internal sqlite3 database. We want all monitoring related components to be as reliable as possible.
  3. Install node_exporter on our virtual machines. Packages are available for Ubuntu, CentOS, and other flavors of Linux. This lightweight agent helps monitor resource usage on the machines.
  4. Import the Prometheus client library into the application, and start by tracking the number of errors and exceptions occurring in the system (a minimal sketch of this step follows the list).
  5. Configure Prometheus to scrape both application metrics and node_exporter machine metrics. Verify that samples are flowing through.
  6. Set up Prometheus as a data source in Grafana, and create your very first dashboard to see how many errors are occurring in the system.
  7. Set up a Slack notification channel in Grafana, so you can be warned when the error rate rises.
  8. Iterate, add, and refine metrics as the team becomes more knowledgeable on the types of metrics it cares about.
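
For step 4, here is a minimal sketch of what tracking errors and exceptions can look like, again assuming Python and the official prometheus_client library (the work loop and the metric name are placeholders):

    import time

    from prometheus_client import Counter, start_http_server

    EXCEPTIONS = Counter(
        "app_exceptions_total",
        "Exceptions caught in the main work loop",
    )

    def do_one_unit_of_work():
        pass  # placeholder for the application's real work

    if __name__ == "__main__":
        # Step 5 then amounts to pointing a Prometheus scrape job at this
        # process on port 8000, where /metrics is now served.
        start_http_server(8000)
        while True:
            try:
                do_one_unit_of_work()
            except Exception:
                EXCEPTIONS.inc()  # count the failure, keep the service running
            time.sleep(1)

Steps 6 and 7 then amount to charting rate(app_exceptions_total[5m]) in Grafana and attaching an alert to it.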

“How long will this effort take?”

How long should these items take?

For smaller footprints of under twenty machines and five applications, this exercise in instrumenting and monitoring will take a single engineer no longer than two weeks. That's 80 hours. Hire a contractor, and you will be done, complete with training and documentation, within a month.
This is much less time than the many hours the engineering team will otherwise spend reading through logs. Once the pipeline is established, more types of measurement become possible.

Better metrics can lead to better business decisions.

Conclusion

Just as you wouldn't trust a hospital's treatment if it did not take measurements, we cannot be sure of our product's reliability until we actually look at the numbers. Are you serious about service reliability, but not sure where to begin?

Reach out to us with questions about observability at info@teamzerolabs.com
