Dynatrace vs Datadog: Monitoring — the real deal

Amrith Raj
5 min read · May 11, 2021

As explained in the previous blog, the goal of monitoring is to be notified when a defined threshold is breached.

Part of my previous job as the lead of an AWS practice was to ensure our SRE team had done the benchmarking needed so that we didn't miss any alerts.

Monitoring is the process of setting thresholds on metrics so that a notification is triggered when a metric crosses a defined value.

If we configure alerts at thresholds that are too low, we are flooded with alerts. If we keep the thresholds very high, it may be too late to act before the system becomes unavailable. This is what makes benchmarking difficult: we have to observe how a system behaves normally and at what point a deviation affects that normalcy, and then set thresholds that surface problems early.
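To make the idea concrete, here is a minimal sketch of a threshold check in shell. The 90% threshold and the root filesystem are illustrative choices; a monitoring agent effectively runs checks like this continuously.

```
# Minimal sketch of threshold-based monitoring (illustrative values).
# Read the usage percentage of the root filesystem and alert above 90%.
USAGE=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USAGE" -gt 90 ]; then
  echo "ALERT: root filesystem is at ${USAGE}% usage"
fi
```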

Continuing our experience with Dynatrace and Datadog, let us see how each of them detects an anomaly.

Dynatrace — Monitoring:

In the screenshots below of a server, we notice that the disk utilisation is at 41% (1). To see how Dynatrace reacts to a low disk space situation, I attempt to create a large file to consume the space. This is done with the Linux fallocate command; in my case, I created a 5.5 GB file (2).

As soon as this is done, the filesystem usage jumps to 99% (3). This should be treated as a critical event.

Commands to view file system usage and fallocate command to create a large file
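For reference, the commands in the screenshot were along these lines; the file path is a placeholder of my choosing:

```
df -h /                                    # filesystem usage before: 41%
sudo fallocate -l 5632M /var/tmp/bigfile   # allocate ~5.5 GB in one shot
df -h /                                    # filesystem usage after: 99%
```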

Filesystems are critical components as they hold the data required to run the application. Running out of space can bring the application down. In highly audited and secure environments, checks would be in place to shut the server down automatically when it runs out of space to store the audit logs. Audit logs are crucial for accountability, so if the system cannot store them, the server should be forcibly shut down to prevent unaccounted activity.
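As a concrete example, Linux auditd supports exactly this behaviour. A sketch of the relevant settings in /etc/audit/auditd.conf (assuming auditd is managing the audit logs; the values are illustrative):

```
# /etc/audit/auditd.conf (excerpt; illustrative values)
# Halt the machine when the partition holding audit logs is full,
# so that no unaudited activity can occur.
disk_full_action = halt
# Act earlier: drop to single-user mode when the admin reserve (in MB) is reached.
admin_space_left = 50
admin_space_left_action = single
```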

Immediately after this, I noticed a problem alert in Dynatrace:

In the three screenshots below, you can see that Dynatrace detected the problem, showed exactly where the issue was in the host section, and its AI analysed 30 dependencies to provide a root cause.

Problem alert highlighting the low disk space
Disk usage highlighted in red (left)
AIOps view showing the affected applications and services, and identifying the root cause and impact.

Datadog — Monitoring:

Let us try to do the same on the server which has the Datadog agent.

On the server with the Datadog agent, we notice that the disk utilisation is at 37% (1). To see how Datadog reacts to a low disk space situation, I again attempt to create a large file to consume the space with the Linux fallocate command; as before, a 5.5 GB file (2).

As soon as this is done, the filesystem usage jumps to 95% (3).

Datadog did not alert on this, as there were no Monitors configured. To monitor the disk, we need to create a new Monitor.

In the two screenshots below, you can see that I used the system.disk.free metric to check whether free space drops below 750 MiB and, if so, to send an alert called "Low disk space alert" as an email to myself. The value of 750 MiB is something I simply chose; depending on the disk size and the application's usage, it could be lower or higher.

Creating the user defined monitor
Creating the user defined monitor continued — Sending Notifications
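For those who prefer configuration as code, the same monitor can also be created through Datadog's monitors API. Below is a sketch using curl; the name, message, and 750 MiB threshold mirror the screenshots, while the API/application keys and the notification handle are placeholders:

```
# Create a metric alert when free disk space drops below 750 MiB
# (786432000 bytes, since system.disk.free reports bytes).
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
        "name": "Low disk space alert",
        "type": "metric alert",
        "query": "avg(last_5m):avg:system.disk.free{*} by {host,device} < 786432000",
        "message": "Disk free space is below 750 MiB. @your-email@example.com"
      }'
```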

After creating the monitor, I could see the alert:

Datadog provides recommended monitors, which are preconfigured monitors. However, the disk-full scenario in our case was not covered by any of the recommended monitors.

Conclusion:

Dynatrace and Datadog are completely different in this respect. Dynatrace not only detected the configuration, metrics, and applications automatically; it also benchmarked them, created baseline thresholds, and configured monitoring of those metrics. There was no need for me to create the monitor, the message, the thresholds, or the notification. Dynatrace allows customisation as required, but there was no need for me to configure any of these as they simply worked as expected.

Datadog, on the other hand, required me to configure all of them. The problem for an engineer is not the difficulty of configuring a single monitor; it is defining baseline thresholds and configuring them at scale. The path between the user and the server includes many components, and implementing monitoring across all of them would be a daunting exercise.

With AI at its core, Dynatrace does much of this automatically, which is crucial for delivering great software experiences to the end user, freeing engineers to focus on building great software products.

Continue comparing:

Dynatrace vs Datadog: Installing the agents

Dynatrace vs Datadog: Basics

Dynatrace vs Datadog: Metrics out of the Box

Dynatrace vs Datadog: SaaS and On-prem deployment offerings

Disclaimer: The author has been a Dynatrace and Datadog user in his previous job with a variety of clients. Both products are constantly adding features, and some features may change in the future. The author did this review during his interview process with both companies. The author now works for Dynatrace.
