Monitoring and Alerting for Enhanced Application Experience

Gökten Karadağ
Trendyol Tech · Jul 21, 2023

In this article, we will look into the monitoring and alerting practices of the Tdesk team at Trendyol.

As Tdesk, we are responsible for the ticket management processes of Trendyol and its stakeholders.

Introduction

Problems inevitably occur in any software application. They can stem from many sources: human error, network issues, software bugs, integration issues, data inconsistencies, edge cases, and so on. So we need to be prepared, either to avoid these issues or to recover from them as quickly as possible.

Monitoring and alerting are essential processes in software development, enabling software teams to track and address issues for optimal performance and reliability.

Monitoring

Monitoring involves continuously observing the health and performance of software systems. It provides real-time insights into various metrics, such as resource utilization, response times and error rates. By monitoring key indicators, teams can identify bottlenecks, detect anomalies, and gain a comprehensive understanding of system behavior.

Alerting

Alerting complements monitoring by enabling proactive responses to critical events or conditions. Alerts are triggered based on predefined thresholds or specific conditions, notifying the relevant stakeholders about potential issues. Through alerting, teams can swiftly address emerging problems, mitigate downtime, and prevent further escalation.

How do we apply alerting & monitoring?

In Tdesk, we have multiple APIs, consumers, and jobs in our system, and we are integrated with numerous teams within Trendyol. Consequently, we use a variety of tools for our monitoring and alerting processes, striving to implement best practices.

Let’s dive into more detail and discuss the common practices & tools.

Logging

Logs are the most widely used approach to troubleshooting issues in enterprise applications. Applications can emit different types of log entries, such as errors, warnings, debug messages, and informational details, into log files using instrumentation.

We send our web application logs to Elasticsearch and analyze them using the Kibana interface.

Here are some logging best practices we are trying to follow:

  • Clear and Descriptive Messages
  • Define Correct Log Levels
  • Avoid Logging Sensitive Data
  • Implement Structured Logging (see the sketch after this list)
  • Include Contextual Information (timestamps, error codes, caller service, etc.)
  • Correlate Related Log Entries
  • Maintain Consistent Log Formatting
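
As an illustration of structured logging with contextual information, here is a minimal sketch assuming a Go service that uses the standard log/slog package; the service name, field names, and ticket scenario are illustrative, not our actual code.

package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler produces structured entries that Elasticsearch can
	// index field by field instead of as one free-text message.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo, // choose the correct log level per environment
	}))

	// Contextual fields are attached once and repeated on every entry
	// written with this logger; timestamps are added by the handler.
	ticketLogger := logger.With(
		slog.String("service", "tdesk-api"),    // illustrative service name
		slog.String("correlationId", "3f2a9c"), // normally taken from the request
	)

	ticketLogger.Info("ticket created", slog.Int("ticketId", 42))
	ticketLogger.Warn("assignee not found, falling back to default queue",
		slog.String("errorCode", "ASSIGNEE_NOT_FOUND"))
}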

Correlation Id

A Correlation ID is a unique identifier that is added to the first interaction (incoming request) to identify the context and is passed to all components involved in the transaction flow.

We implement a middleware or interceptor that retrieves the Correlation-ID from the request header. By adding this ID to the logging context, the logger can use the same context when writing messages. This approach eliminates the need to manually pass the Correlation-ID across multiple functions or methods, simplifying the logging process.
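
Below is a minimal sketch of such a middleware, assuming a Go HTTP service; the header name X-Correlation-ID, the google/uuid dependency, and the helper names illustrate the pattern rather than our exact implementation.

package middleware

import (
	"context"
	"net/http"

	"github.com/google/uuid"
)

type ctxKey struct{}

// CorrelationID reads the incoming correlation id, generates one for
// requests that arrive without it, stores it in the request context, and
// echoes it back in the response so callers can correlate their own logs.
func CorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = uuid.NewString()
		}

		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		w.Header().Set("X-Correlation-ID", id)

		// Downstream handlers and the logger read the id from the context,
		// so it never has to be passed around explicitly.
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// FromContext lets the logging layer pull the correlation id back out.
func FromContext(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}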

New Relic

We are trying to integrate New Relic into all of our applications, so we can easily monitor them through the interface provided by New Relic both before and after deployment.

Using New Relic’s anomaly detection and custom alerting functionalities, we can detect anomalies and receive alerts for application failures. These features keep us informed about any unexpected behavior in our application metrics, enabling us to promptly address errors and issues.

We create specialized dashboards for different purposes within New Relic. For instance, we had some applications with high memory consumption, so we created a general dashboard that examines the memory usage of all applications in our system. The query is below:

SELECT rate(average(apm.service.memory.heap.used), 1 second) * 100 as memoryUsage
FROM Metric
WHERE appName LIKE '%tdesk%' FACET appName TIMESERIES

New Relic provides metrics for all HTTP transactions, allowing us to define alerts that align with our business metrics under different conditions. For example, we can set up a query to identify users who encountered errors during the login process within the last X hours, and create a custom alert definition based on it.

SELECT count(*) FROM Transaction
WHERE (appName = 'tdesk-api' AND `http.statusCode` != 200 AND
name = 'WebTransaction/Auth/Login/{request}') SINCE 3 hours ago EXTRAPOLATE

Furthermore, issues in our applications can sometimes originate from the various integrated services we rely on. To address this, we use New Relic features such as Service Map, Dependencies, and External Services.
These screens provide a comprehensive view of the health status, response time, and other metrics for all applications related to our services.

Elasticsearch Alerting (Open Distro for Elasticsearch)

In some cases, we need to define alerts based on application logs or other data we keep in Elasticsearch. With Elasticsearch Alerting, we can define alerts directly on the records within Elasticsearch.

Example UI for ElasticSearch alerting

For example, in Tdesk we provide users with asynchronous reports through the UI. Users can create report requests from the interface, and we send the reports via email when they are ready. We define a monitor as follows so that we are aware of reports that have been stuck in the “Creating” (pending) phase for over an hour.

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "status.keyword": "Creating"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "createdDate": {
              "from": "{{period_end}}||-120m",
              "to": "{{period_end}}||-60m",
              "include_lower": false,
              "include_upper": false,
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

Then we add a trigger that sends notifications to the relevant Slack channel via a webhook, so we are quickly notified when user reports are still pending.

Example alert notification

Prometheus

Prometheus is an open-source monitoring tool based on a pull mechanism: it scrapes metrics, lets you query them, build dashboards on top of them, and define alerts based on alert rules. It supports PromQL for searching metrics, a powerful functional expression language that lets you filter using Prometheus’ multi-dimensional time-series labels.
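
To make the pull model concrete, here is a minimal sketch of how an application can expose metrics for Prometheus to scrape, assuming a Go service using the official client_golang library; the metric name, label, and port are illustrative.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ticketsProcessed is an illustrative counter; Prometheus reads its current
// value from the /metrics endpoint on every scrape.
var ticketsProcessed = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tdesk_tickets_processed_total",
		Help: "Number of processed tickets, labeled by outcome.",
	},
	[]string{"outcome"},
)

func main() {
	prometheus.MustRegister(ticketsProcessed)

	// Business code increments the counter; the label becomes a PromQL
	// dimension, e.g. rate(tdesk_tickets_processed_total{outcome="error"}[5m]).
	ticketsProcessed.WithLabelValues("success").Inc()

	// Expose the endpoint that Prometheus pulls from.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}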

Across Trendyol, there are teams that provide us with various infrastructure and cluster metrics through Prometheus.

This diagram illustrates the architecture of Prometheus and some of its ecosystem components [source]

Grafana

Grafana is a widely used open-source data visualization and monitoring tool. It offers a user-friendly interface for creating visually appealing dashboards that allow users to analyze and display data from different sources.
Grafana can query Prometheus, allowing us to interact with Prometheus metrics seamlessly.
There are shared dashboards available for all teams within Trendyol to use; we can select our cluster and conduct our analysis and monitoring within Grafana.

We also have screens in our monitoring system similar to the APM dashboard in New Relic. These screens allow us to view the applications across all Kubernetes clusters and provide insights into resource utilization at the pod and container levels. This enables us to monitor and analyze the performance and resource usage of our applications.

Elasticsearch overview Grafana dashboard

Alert Manager

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integrations such as E-mail and Slack with the webhook receiver.

We have dedicated Alertmanager servers per tribe, allowing each team to manage its own alert definitions as code.

In Tdesk, we stay informed about lag in our Kafka topics, Kubernetes cluster outages, and the status of the Elasticsearch cluster through alert definitions in Alertmanager. All of these alert rules are defined in a Git repository, and each commit updates the corresponding configuration on Alertmanager.

Example alert rule:

- alert: DeploymentStatusReplicasUnavailable
  expr: sum(kube_deployment_status_replicas_unavailable{namespace=~"tdesk"}) by (deployment) > 0
  for: 5m
  labels:
    subteam: tdesk
    domain: k8s
  interval: 1m

There is a project available that can be used for these alert queries: prometheus-alerts

Example route configuration

- receiver: tdesk_k8s_notify
  match:
    subteam: tdesk
    domain: k8s

Example receiver definition

- name: tdesk_k8s_notify
  slack_configs:
    - api_url: "http://example-slack-webhook-url"
      icon_url: ""
      send_resolved: true
      color: '{{ template "slack.color" . }}'
      title: '{{ template "slack.title" . }}'
      text: '{{ template "slack.text" . }}'
      actions:
        - type: button
          text: 'Silence :no_bell:'
          url: '{{ template "__alert_silence_link" . }}'

After making these definitions, we receive a Slack message like the one below whenever we encounter an error.

Continuous Review

We are trying to adopt these practices effectively, but it is critical that we regularly review our metrics and take action to improve them. Every month, we organize KPI review meetings, examine KPIs such as incidents, service response time, and error rates, and plan the activities needed to improve them.

Challenges & Improvements

In Tdesk, we are two teams, and both teams develop and share responsibility for the same services. The endpoints and use cases each team owns are separated, so it is critical to route an error to the relevant team. We are still working to improve this area.

New features are added to our applications frequently, so we need to add monitoring and alerting for each improvement. This is a topic that requires attention and should always be considered when developing applications.

As with any aspect of software development, there is always room for improvement in monitoring and alerting processes. We struggled with a high volume of alerts, which made it difficult to make sense of them. To avoid alert fatigue, we carefully review alerts and periodically remove unnecessary ones, focusing on taking action on the relevant ones. We categorize alerts by topic and severity so we can prioritize and address them accordingly.

Last words

Monitoring and alerting are critical components of software development, enabling teams to ensure optimal performance, reliability, and user experience. By implementing best practices in monitoring and alerting, organizations can proactively identify and address issues, minimizing downtime and maximizing the value of their software systems.


If you want to be part of a team that tries new technologies and want to experience a new challenge every day, come to us.
