Datadog Cost Optimisation 1: Understand your usage

Clues for identifying where your expenses are going

Chia Li Yun · Javarevisited · 4 min read · May 17, 2023

To effectively reduce costs, it is essential to first understand why our services are incurring the current level of expenditure. The article will approach the subject matter from a software engineering perspective and center on the essential observability features provided by Datadog.

After reading this, you should be better equipped to identify where your expenses are being incurred, which will help you narrow your focus and implement targeted cost-reduction measures.

Photo by Jp Valery on Unsplash

There are pre-built Datadog dashboards that can uncover your applications' usage.

We will now take an in-depth look at each component to understand how everything operates.

Application Performance Monitoring (APM)

Adapted from Datadog official website

The diagram above illustrates the trace pipeline. Traces first go through

1. ingestion sampling rules, followed by

2. retention filters, which determine how long your traces are kept on Datadog (naturally, the longer the retention period, the higher the cost).

Ingestion Sampling

The ingestion of the spans generated by your application into Datadog is determined by multiple mechanisms, governed by the tracing library and the Datadog Agent.

Each ingested span is accompanied by an ingestion reason. To observe the distribution of trace ingestion sources, you can use the APM ingestion reason dashboard, or query the underlying metric directly, as in the sketch below. Knowing the origin of your traces can help you identify areas where you can reduce volume.
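
For instance, here is a minimal sketch (Python + requests) that pulls the last day of estimated APM ingestion volume broken down by the ingestion_reason tag, via Datadog's v1 metrics query API. The DD_API_KEY / DD_APP_KEY environment variable names are my own convention, and you may need to set DD_SITE for your region:

```python
# Sketch: break APM ingestion volume down by ingestion reason.
# Assumes DD_API_KEY / DD_APP_KEY hold a valid API and application key.
import os
import time

import requests

now = int(time.time())
site = os.environ.get("DD_SITE", "datadoghq.com")

resp = requests.get(
    f"https://api.{site}/api/v1/query",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    params={
        "from": now - 24 * 3600,  # last 24 hours
        "to": now,
        # Group estimated ingested bytes by the ingestion_reason tag.
        "query": "sum:datadog.estimated_usage.apm.ingested_bytes{*} by {ingestion_reason}",
    },
)
resp.raise_for_status()

# Print the most recent datapoint for each ingestion reason.
for series in resp.json().get("series", []):
    print(series["scope"], series["pointlist"][-1][1])
```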

Below are some typical mechanisms used:

Head-based sampling (default sampling mechanism)

With this mechanism, the decision to retain a trace is made at the start of the root span and propagated to the spans of the other services involved. It can be configured at the agent level (reason = auto) or in the tracing library (reason = rule).
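
As a rough illustration, tracing libraries support sampling rules through the documented DD_TRACE_SAMPLING_RULES setting. The service name below is made up, and in practice you would set this in the service's deployment environment rather than in code:

```python
# Sketch: a head-based sampling rule for the tracing library.
# Shown in Python only to illustrate the expected shape of the value.
import json
import os

rules = [
    # Keep roughly 20% of traces whose root span starts in this
    # (hypothetical) service; matched traces carry reason = rule.
    {"service": "checkout-service", "sample_rate": 0.2},
]
os.environ["DD_TRACE_SAMPLING_RULES"] = json.dumps(rules)
```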

Error and rare traces

Along with the informative traces, you may also come across sampled error traces (reason = error), which provide insights into issues in your application. Furthermore, low-traffic services and resources are also monitored and retained using sampled rare traces (reason = rare).

There are a few other ingestion reasons such as manual, rum, lambda, xray, appsec and otel. To learn more about these reasons, please see this article.

Usage Metrics


To determine the ingestion volume, you can use the following metrics:

  • datadog.estimated_usage.apm.ingested_bytes (billing dimension)
  • datadog.estimated_usage.apm.ingested_spans
  • datadog.estimated_usage.apm.ingested_traces

To determine the index volume, you can use the following metric:

  • datadog.estimated_usage.apm.indexed_spans (billing dimension)

In fact, these metrics are used to construct the out-of-the-box dashboard mentioned earlier. You can also query them directly, as the sketch below shows.
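
As a quick sanity check, you could compare ingested and indexed span counts yourself. This minimal sketch sums the last day of both metrics via the v1 metrics query API; treat the result as a rough approximation, since it simply adds up the returned datapoints:

```python
# Sketch: roughly compare ingested vs. indexed spans over the last day.
# Assumes DD_API_KEY / DD_APP_KEY are set; DD_SITE defaults to the US site.
import os
import time

import requests


def last_day_total(query: str) -> float:
    """Sum every datapoint returned for a metric query over the last 24h."""
    now = int(time.time())
    resp = requests.get(
        f"https://api.{os.environ.get('DD_SITE', 'datadoghq.com')}/api/v1/query",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        params={"from": now - 24 * 3600, "to": now, "query": query},
    )
    resp.raise_for_status()
    return sum(
        point[1]
        for series in resp.json().get("series", [])
        for point in series["pointlist"]
        if point[1] is not None
    )


ingested = last_day_total("sum:datadog.estimated_usage.apm.ingested_spans{*}")
indexed = last_day_total("sum:datadog.estimated_usage.apm.indexed_spans{*}")
print(f"Indexed roughly {indexed / ingested:.1%} of ingested spans")
```

If the ratio is far higher than you expect, your retention filters are a natural first place to look.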

Apart from traces, there are infrastructure-related billing parameters that you could investigate as well:

  • APM host
  • APM & Continuous Profiler
  • Fargate

For more detailed information on the breakdown, please refer to this link.

Logs

Similar to APM, the concepts of

  • ingestion (logs being sent from your application to Datadog) and
  • indexing (log retention)

apply here as well.

Use log patterns to identify high-volume logs and evaluate whether they need to be logged at all.

Usage Metrics


To determine the ingestion volume, you can use the following metrics:

  • datadog.estimated_usage.logs.ingested_bytes
  • datadog.estimated_usage.logs.ingested_events

To determine the index volume, you can use the following metric, illustrated in the sketch below:

  • datadog.estimated_usage.logs.ingested_events (with tag = datadog_is_excluded:false)
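
Following the same pattern as the APM sketch above, you can estimate what share of ingested log events actually ends up indexed; the datadog_is_excluded:false filter keeps only the events that were not dropped by an exclusion filter:

```python
# Sketch: estimate the indexed share of ingested log events (last 24h).
# Same assumptions as earlier: DD_API_KEY / DD_APP_KEY, optional DD_SITE.
import os
import time

import requests

now = int(time.time())
queries = {
    "ingested": "sum:datadog.estimated_usage.logs.ingested_events{*}",
    "indexed": "sum:datadog.estimated_usage.logs.ingested_events{datadog_is_excluded:false}",
}

totals = {}
for name, query in queries.items():
    resp = requests.get(
        f"https://api.{os.environ.get('DD_SITE', 'datadoghq.com')}/api/v1/query",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        params={"from": now - 24 * 3600, "to": now, "query": query},
    )
    resp.raise_for_status()
    totals[name] = sum(
        point[1]
        for series in resp.json().get("series", [])
        for point in series["pointlist"]
        if point[1] is not None
    )

print(f"~{totals['indexed'] / totals['ingested']:.1%} of ingested log events are indexed")
```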

Custom Metrics / Log-Generated Metrics

Last but not least, review both custom metrics and log-generated metrics.

Generally, it is advisable to use metrics instead of logs as they offer a longer retention period at a lower cost.

However, it is crucial to be cautious when selecting tags, as they can significantly increase the cost of metrics. Using high-cardinality attributes can result in a large number of metrics being generated, which can quickly escalate costs; the toy example below illustrates why. You can refer to this article for a detailed explanation of how metrics are counted.
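
Every unique combination of tag values on a metric name counts as a distinct custom metric, so cardinalities multiply. The tag names and value counts below are hypothetical:

```python
# Toy illustration: tag cardinality multiplies the custom metric count.
# These tag names and value counts are made up for the example.
unique_values = {
    "endpoint": 50,     # bounded set of values: fine
    "status_code": 10,  # bounded: fine
    "user_id": 10_000,  # unbounded, high cardinality: dangerous
}

combinations = 1
for tag, count in unique_values.items():
    combinations *= count

# 50 * 10 = 500 combinations without user_id,
# but 50 * 10 * 10,000 = 5,000,000 with it.
print(f"Distinct tag combinations (billable custom metrics): {combinations:,}")
```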

You may check out the pricing here with some examples on APM billing. However, keep in mind that your company may have a different pricing structure based on enterprise agreements. To get an accurate understanding of the pricing, it’s best to consult with your company’s infrastructure team or designated Datadog support team.

Building your own Utilisation Dashboard

See this link for all the estimated usage metrics provided by Datadog. They are free, with a retention period of 15 months 🤑. The sketch below shows one way to wire a few of them into a dashboard programmatically.
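
Here is a minimal sketch that creates a simple usage dashboard through Datadog's v1 dashboard API. The widget selection and titles are my own; adapt the queries to the products you actually use:

```python
# Sketch: create a basic estimated-usage dashboard via the v1 dashboard API.
# Assumes DD_API_KEY / DD_APP_KEY are set; DD_SITE defaults to the US site.
import os

import requests

WIDGET_QUERIES = {
    "APM ingested bytes": "sum:datadog.estimated_usage.apm.ingested_bytes{*}",
    "APM indexed spans": "sum:datadog.estimated_usage.apm.indexed_spans{*}",
    "Logs ingested bytes": "sum:datadog.estimated_usage.logs.ingested_bytes{*}",
}

dashboard = {
    "title": "Datadog Estimated Usage",
    "layout_type": "ordered",
    "widgets": [
        {
            "definition": {
                "type": "timeseries",
                "title": title,
                "requests": [{"q": query, "display_type": "line"}],
            }
        }
        for title, query in WIDGET_QUERIES.items()
    ],
}

resp = requests.post(
    f"https://api.{os.environ.get('DD_SITE', 'datadoghq.com')}/api/v1/dashboard",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    },
    json=dashboard,
)
resp.raise_for_status()
print("Created dashboard:", resp.json().get("url"))
```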

That’s about it! This article has discussed various ways to investigate your application’s Datadog usage, and I hope it gives you a starting point. Next up, I will compile some strategies that you can use to reduce your Datadog expenses. Stay tuned! 👋

