Improving Observability with Datadog

Andrés Milla
Flux IT Thoughts

This document explains how Datadog was incorporated into a digital product of a Flux IT client, along with our suggestions for getting the most out of the tool.

Incorporating Datadog

The client is decoupling features into separate microservices, which makes an observability tool like Datadog crucial for understanding and managing the complexity that arises from interconnecting multiple services. Datadog provides real-time data such as metrics, logs, and traces, which are essential for quickly identifying and solving issues, improving performance, validating system reliability, and supporting informed decision-making.

To integrate it with the client’s product, Datadog agents were installed on the client’s server instances. These agents are responsible for collecting system metrics, traces, and logs, and then sending this data to Datadog’s SaaS platform.

Metrics include information about system performance, traces provide details on app execution, and logs offer detailed information about specific events. Datadog’s platform stores, processes, and displays this data, enabling users to effectively monitor and analyze the status and health of their infrastructure and apps. This approach provides a comprehensive overview of performance in distributed environments.

Log Indexation with Elastic Common Schema (ECS)

Elastic Common Schema (ECS) for logs provides a standardized structure that simplifies log indexing and search in distributed environments. By adopting ECS, we have achieved a consistent representation of log data, which facilitates event correlation and improves efficiency when monitoring and solving issues within the client’s infrastructure. Normalizing fields such as timestamp and message through ECS has strengthened our ability to consistently analyze logs generated by the different services and components of our architecture.

The key aspect of ECS is that it allows us to define schemas with the app’s data, thus enabling us to index, search for, and retain useful information.
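To make the idea of a shared schema concrete, here is a minimal sketch of a formatter that writes each log line as a JSON document using ECS field names (`@timestamp`, `log.level`, `message`, `log.logger`, `service.name`). It is a hand-rolled approximation built on Python's standard `logging` module, not the library used in the project; the class name `EcsLikeFormatter`, the service name, and the rule of copying every dotted key passed through `extra=` are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class EcsLikeFormatter(logging.Formatter):
    """Illustrative formatter: one JSON object per log line, keyed by ECS field names."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name  # hypothetical service label

    def format(self, record):
        document = {
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
            "log.logger": record.name,
            "service.name": self.service_name,
        }
        # Values passed via `extra=` become attributes on the record. Standard
        # LogRecord attributes contain no dots, so dotted keys are app data.
        for key, value in record.__dict__.items():
            if "." in key:
                document[key] = value
        return json.dumps(document)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(EcsLikeFormatter("checkout"))
    logger = logging.getLogger("shop.orders")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("order created", extra={"user.id": "42"})
```

Because every service emits the same keys, a query such as "all logs where `user.id` is 42" works across services without per-service parsing rules.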

Below, we describe how well-known observability techniques were implemented with ECS.

Trace IDs

Trace IDs (trace identifiers) are unique identifiers that follow a transaction or request as it moves through a distributed system, providing visibility into the execution flow and enabling a detailed analysis of the app’s performance.

On the client’s side, different Trace IDs were injected at different points in the app. This is extremely useful for the following purposes:

  1. Debugging: It simplifies debugging as filtering by Trace ID retrieves all execution traces from all services, thus enabling an efficient and quick debugging process.
  2. Statistics: We can infer real-time system status statistics. For instance, it allows us to determine the number of users at any given moment.
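The source does not say how the Trace IDs were generated or propagated, so the following is only a sketch of one common approach: a Django-style middleware stores the ID for the current request, a logging filter copies it onto every record as the ECS field `trace.id`, and a helper shows what filtering by Trace ID amounts to. The header name `X-Trace-Id` and all class and function names are hypothetical.

```python
import contextvars
import logging
import uuid

# Trace ID of the request currently being handled (isolated per thread / async task).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every log record as `trace.id`."""

    def filter(self, record):
        setattr(record, "trace.id", current_trace_id.get())
        return True  # never drops a record, only annotates it


class TraceIdMiddleware:
    """Django-style middleware: reuse the caller's trace ID or create a new one."""

    header_name = "X-Trace-Id"  # hypothetical header name

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        trace_id = request.headers.get(self.header_name) or uuid.uuid4().hex
        token = current_trace_id.set(trace_id)
        try:
            response = self.get_response(request)
        finally:
            # Restore the previous value so IDs never leak between requests.
            current_trace_id.reset(token)
        response[self.header_name] = trace_id  # echo the ID back to the caller
        return response


def filter_by_trace_id(log_documents, trace_id):
    """Return the log documents (dicts) that belong to one trace."""
    return [doc for doc in log_documents if doc.get("trace.id") == trace_id]
```

If each service forwards the same header on its outgoing calls, a single ID ties together the logs of every service that took part in a request, which is what makes the debugging use case above practical.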

Incorporation into Django

This is implemented with the following Python library: https://github.com/elastic/ecs-logging-python. It makes it possible to configure log metadata at a global level and to send parameters through logs.
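As a rough sketch of what such a setup can look like, the excerpt below wires the library's `ecs_logging.StdlibFormatter` into Django's `LOGGING` setting and passes per-event parameters through the standard `extra=` argument. The handler layout, logger name, helper function, and field values are assumptions for illustration; the client's actual configuration is not shown in this article.

```python
import logging

# settings.py (excerpt). Django passes this dict to logging.config.dictConfig.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # "()" tells dictConfig to build the formatter from this class path.
        "ecs": {"()": "ecs_logging.StdlibFormatter"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "ecs"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

logger = logging.getLogger("shop.orders")  # hypothetical logger name


def log_order_created(user_id):
    """Hypothetical helper: attach ECS fields to a single log event."""
    # Dotted keys in `extra` are emitted by the ECS formatter as fields of the
    # JSON document, next to the timestamp, level, and message.
    logger.info(
        "order created",
        extra={"user.id": user_id, "event.action": "order-created"},
    )
```

Recent versions of the library also appear to accept an `extra` argument on the formatter itself for fields shared by every record; check the library's documentation before relying on it.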

For instance, the attached image illustrates how a log can be structured with ECS indicators:

Metrics Generation

Thanks to the logs structured by ECS, it is easy to use Datadog to generate metrics regarding the app and even to set up alerts.

Some examples of metrics we can mention are:

  • Requests: metrics that show the number of requests in real time along with their associated information. This enables rapid error detection; for instance, if a request returns a 500 error, Datadog generates an alert and the error can be easily inspected. Additionally, it helps identify performance spikes and determine which endpoints need optimization.
  • Cron jobs: metrics that provide insight into the number of running jobs, which is useful for setting up alerts when a specific cron job has not been executed within a certain timeframe.
  • Analytics: thanks to ECS indexation, we were able to create real-time graphs depicting various user statistics without the need to interact with the database.
  • Resources: metrics that enable the monitoring of both the memory and CPU consumption of each running process, which is extremely useful for visualizing the system’s real-time status.
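In the project, these metrics and alerts are defined on Datadog's platform rather than in code. Purely to illustrate the logic behind the request and cron job examples above, the sketch below computes the same kind of values locally from a list of ECS-shaped log documents. The function names and the `cron.job` field are hypothetical; `http.response.status_code` and `@timestamp` are standard ECS field names.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def requests_by_status(log_documents):
    """Count request logs per HTTP status code."""
    return Counter(
        doc["http.response.status_code"]
        for doc in log_documents
        if "http.response.status_code" in doc
    )


def server_errors(log_documents):
    """Return the request logs whose status code is in the 5xx range."""
    return [
        doc
        for doc in log_documents
        if 500 <= doc.get("http.response.status_code", 0) <= 599
    ]


def cron_job_is_stale(log_documents, job_name, max_age, now):
    """True when `job_name` has no run logged within `max_age` before `now`."""
    runs = [
        datetime.fromisoformat(doc["@timestamp"])
        for doc in log_documents
        if doc.get("cron.job") == job_name  # hypothetical custom field
    ]
    return not runs or now - max(runs) > max_age


if __name__ == "__main__":
    logs = [
        {"@timestamp": "2024-01-01T10:00:00+00:00", "http.response.status_code": 200},
        {"@timestamp": "2024-01-01T10:00:05+00:00", "http.response.status_code": 500},
        {"@timestamp": "2024-01-01T09:00:00+00:00", "cron.job": "send_reports"},
    ]
    now = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
    print(requests_by_status(logs))
    print(len(server_errors(logs)), "server error(s)")
    print(cron_job_is_stale(logs, "send_reports", timedelta(hours=1), now))
```

An alert is then a threshold on one of these values, for example "any 5xx in the last five minutes" or "no run of a given job in the last hour".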

The main metrics that have been configured for the client are listed below.

Conclusion

To summarize, the introduction of Datadog into the client’s environment has been a crucial step to improve the monitoring and observability of its infrastructure.

First, indexing the logs through Elastic Common Schema (ECS) has provided a unified and coherent structure for logs, simplifying event correlation and facilitating data analysis in distributed systems.

Second, the implementation of trace IDs has strengthened the client’s ability to track transaction flow across different services, providing a comprehensive overview of its app’s performance.

Furthermore, the integration of Datadog into Django’s framework has improved observability in the client’s web apps, thus enabling effective generation of key metrics.

Collectively, these additions have significantly improved the client’s ability to detect issues, analyze performance, and optimize its operations in a distributed environment.
