Clickstream analysis for monitoring microservices
Microservice architecture is an approach to building applications as a set of small, independent, and loosely coupled services, and it is used extensively across the industry.
It makes applications easier to scale and accelerates the development of new features.
While a microservices architecture brings many benefits to developers, it also has drawbacks; it is not a silver bullet. Monitoring, for example, is an Achilles’ heel of microservices: detecting and fixing errors in a microservice architecture is a time-consuming affair. Monitoring a monolithic application is easier than monitoring tens of microservices, each potentially running its own programming language and database.
Background
As Trendyol’s seller onboarding team, we rely heavily on monitoring tools to ensure our services run properly and provide the best user experience to our customers.
We have Grafana dashboards, Kibana log alerts, and New Relic alerts to monitor our microservices and detect any problem in our system early.
We use everything we have to ensure our production system is running as expected. For example, we have Kibana alerts that trace logs and detect unexpected errors; Kibana delivers these alerts as Slack messages.
New Relic is another monitoring tool we use for our microservices. We set alert rules for error rates, throughput, and transaction times, and if any anomaly is detected, it sends alert notifications to Slack channels.
Our problem and solution
New Relic provides a lot of information about service health; however, we may need additional information, such as:
- Which user is behind an increase in the error rate, so that we can tell whether there is an error in our service or whether a user is inflating the error rate with invalid operations. New Relic cannot provide such information.
- What happens during a user session. It is important to understand whether there is a general problem affecting the user, and we can visualize failed and successful requests within a session.
- The error distribution per user for our service. The most frequent errors per user can be visualized and used to improve the user experience.
- The error distribution per endpoint, to see whether we have a problematic endpoint.
What we need in order to implement a solution to the above requirements is simply to:
- Collect all HTTP requests and responses from the API.
- Store the collected data in a datastore.
- Visualize the data to give us a bigger picture.
In this post, we explain how we monitor our microservices with a different approach to understanding user behavior and service health.
Implementation
First of all, we need an approach to collect all HTTP requests and responses without changing the application code. It must be service-independent, reusable, and easy to add to each service.
We also need to store the data in a data source and, of course, a dashboard of some kind to visualize it.
Collecting HTTP call information
We need a way to collect HTTP request and response information, since we want to understand whether our REST endpoints are healthy. We have many different REST endpoints across our microservices, so gathering HTTP request data from each of them individually would be tedious and time-consuming; we need an aspect-oriented approach.
A Spring filter can be used for that purpose. The filter collects the data without interfering with the main purpose of the application.
The filter runs for every HTTP request and extracts information about the call, such as the URI, method, executor user, response body, and response code.
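As a minimal sketch of such a filter (the class name and the X-Executor-User header are illustrative, and it assumes a Spring Boot application on the Jakarta Servlet API), it could look like the following:
import com.fasterxml.jackson.databind.ObjectMapper;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;
import org.springframework.web.util.ContentCachingResponseWrapper;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Runs once per HTTP request and logs one JSON line describing the call.
@Component
public class HttpCallLoggingFilter extends OncePerRequestFilter {

    private static final Logger log = LoggerFactory.getLogger(HttpCallLoggingFilter.class);
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        // Wrap the response so its body can still be read after the chain has run.
        ContentCachingResponseWrapper wrappedResponse = new ContentCachingResponseWrapper(response);
        try {
            filterChain.doFilter(request, wrappedResponse);
        } finally {
            Map<String, Object> callInfo = new LinkedHashMap<>();
            callInfo.put("timestamp", System.currentTimeMillis());
            callInfo.put("method", request.getMethod());
            callInfo.put("requestURI", request.getRequestURI());
            callInfo.put("queryString", request.getQueryString());
            callInfo.put("responseStatus", wrappedResponse.getStatus());
            // "X-Executor-User" is an assumed header; the user id could equally
            // come from the security context.
            callInfo.put("executorUser", request.getHeader("X-Executor-User"));
            callInfo.put("responseBody",
                    new String(wrappedResponse.getContentAsByteArray(), StandardCharsets.UTF_8));

            log.info(objectMapper.writeValueAsString(callInfo));

            // Copy the cached body back so the client still receives the response.
            wrappedResponse.copyBodyToResponse();
        }
    }
}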
Storing user trace in session
So, at this point we have a filter to collect the data, but how should we store it?
We considered two options:
- Create a database to store the data
- Use Apache Kafka as a streaming platform and datastore
The first option has issues such as transaction management, added response-time overhead, and a single point of failure.
The second option looks more viable: Apache Kafka is a high-performance data streaming technology, so we don’t need to worry about latency, data load, or transaction management. Kafka can handle a lot of data with low latency.
Fluentbit and Kafka Producer
We have a solution for publishing Kafka events from application logs using Fluentbit. You can see the details in the post below.
Thanks to the Fluentbit solution above, we don’t need to add a Kafka dependency to our Spring filter. We can publish our HTTP call information with a simple application log line; the Fluentbit Kafka output plugin and elek-api do the rest.
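For illustration only, a single log line produced by the filter and forwarded by Fluentbit to the http-call-info-topic topic could look like this (the field values are made up):
{
  "timestamp": 1662289000000,
  "method": "POST",
  "queryString": null,
  "responseStatus": 400,
  "executorUser": "seller-12345",
  "responseBody": "{\"error\":\"invalid postal code\"}",
  "requestURI": "/sellers/addresses"
}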
Enhancing raw data with ksqlDB
We now have raw data on a Kafka topic. However, we need to enrich it according to our requirements so that we can build comprehensive dashboards such as:
- Total error counts per user
- Most occurred errors per user
- Error counts per Request URI
To derive such data from the raw data, we can use ksqlDB. ksqlDB is a database for building stream processing applications on top of Apache Kafka and Kafka Streams; it enables real-time stream processing with the ease of good old SQL syntax.
First of all, we create a ksqlDB stream from the Kafka topic to which we previously published the application logs.
CREATE STREAM request_logging_stream (
  timestamp bigint,
  method varchar,
  queryString varchar,
  responseStatus int,
  executorUser varchar,
  responseBody varchar,
  requestURI varchar
) WITH (
  kafka_topic = 'http-call-info-topic',
  value_format = 'json'
);
For example, we can create a ksqlDB table from the above stream as shown below. The table stores the total error count per user. We can extend the example with additional tables such as error counts per request URI (see the sketch after the table definition), the most frequent errors in the system, and the most frequent errors per user.
ksqlDB makes it simple and intuitive to enhance raw Kafka topic data with plain SQL syntax.
CREATE TABLE total_error_count_per_user WITH (
  KAFKA_TOPIC = 'total_error_count_per_user_topic',
  VALUE_FORMAT = 'json',
  PARTITIONS = 1,
  REPLICAS = 1
) AS
SELECT
  executorUser AS executorUser_KEY,
  AS_VALUE(executorUser) AS executorUser,
  WINDOWEND AS EVENT_TS,
  COUNT(*) AS errors
FROM request_logging_stream WINDOW SESSION (7 MINUTES)
WHERE responseStatus >= 400
GROUP BY executorUser;
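As a sketch of one such extension (the table and topic names here are only illustrative), error counts per request URI can be derived in the same way from the same stream:
CREATE TABLE error_count_per_request_uri WITH (
  KAFKA_TOPIC = 'error_count_per_request_uri_topic',
  VALUE_FORMAT = 'json',
  PARTITIONS = 1,
  REPLICAS = 1
) AS
SELECT
  requestURI AS requestURI_KEY,
  AS_VALUE(requestURI) AS requestURI,
  WINDOWEND AS EVENT_TS,
  COUNT(*) AS errors
FROM request_logging_stream WINDOW SESSION (7 MINUTES)
WHERE responseStatus >= 400
GROUP BY requestURI;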
Visualizing data
So far we have completed all the steps to collect, store and enhance the data. The last step is creating a fancy dashboard to visualize our data.
Grafana is an open-source platform for monitoring and observability. You can create great dashboards to visualize data from various sources such as relational databases, Elasticsearch, and Redis.
We are going to create a Grafana dashboard, but first we need a data source that Grafana can query. Unfortunately, there is no easy way to read data from a Kafka topic in Grafana yet, so we use Elasticsearch as the main data source.
The problem, then, is how to synchronize our Kafka topics with Elasticsearch. Kafka connectors provide a very efficient way to synchronize Kafka events to various platforms such as PostgreSQL, Elasticsearch, Redis, and RabbitMQ. These connectors are available in the Confluent Platform, an enterprise-grade distribution of Apache Kafka.
A simple Kafka sink connector can create an Elasticsearch index for each Kafka topic we have.
{
  "name": "elastic_connector",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "topics": "topic-name",
    "connection.url": "elastic-url",
    "max.retries": "3",
    "retry.backoff.ms": "3000",
    "key.ignore": "false",
    "schema.ignore": "true",
    "type.name": "_doc",
    "value.converter.schemas.enable": "false"
  }
}
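A configuration like this is typically registered by POSTing it to the Kafka Connect REST API’s /connectors endpoint; once the connector is running, it keeps the Elasticsearch index in sync with the topic.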
For more information about Kafka connectors, see the Confluent documentation.
In Grafana, we defined a new Elasticsearch data source. Grafana provides many visualization options such as tables, charts, and graphs, and we configured our dashboard as shown below.
Conclusion
Of course, there are tons of tools for monitoring microservices. But if you have more specific requirements, you can build your own data and use it to strengthen your monitoring muscles. This solution can be used as an additional way to observe critical services, and it gives you endless options since you have all the data from HTTP requests and responses. We use this approach as a supplement to our existing New Relic alerts, so we can eliminate false-positive alerts and understand user behavior from our dashboard.
Thank you for reading.