Datadog Cost Optimisation 1: Understand your usage
Clues on identifying where your expenses are going
To effectively reduce costs, it is essential to first understand why your services are incurring their current level of expenditure. This article approaches the subject from a software engineering perspective and focuses on the essential observability features provided by Datadog.
After reading this, you should be better equipped to identify where your expenses are being incurred, which will help you narrow your focus and implement targeted cost-reduction measures.
There are pre-built Datadog dashboards that can uncover how your applications use Datadog:
We will now take an in-depth look at each component to understand how everything operates.
Application Performance Monitoring (APM)
The diagram above illustrates the trace pipeline. Traces first go through
1. ingestion sampling rules, followed by
2. retention filters, which determine how long your traces are kept in Datadog (naturally, the longer the duration, the higher the cost).
Ingestion Sampling
The ingestion of spans generated by your application into Datadog is determined by multiple mechanisms, governed by the tracing library and the Datadog Agent.
Each ingested span is accompanied by an ingestion reason. To observe the distribution of trace ingestion sources, you can use the APM ingestion reasons dashboard. Knowing the origin of your traces is helpful for identifying areas where you can potentially reduce volume.
Below are some typical mechanisms used:
Head-based sampling (default sampling mechanism)
With this mechanism, the decision to retain a trace is made at the start of the root span and propagated to the spans of the downstream services. It can be configured at the agent level (reason = auto) or in the tracing library (reason = rule).
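To make the idea concrete, here is a minimal, hypothetical sketch (not Datadog's actual implementation or the ddtrace API) of how a head-based decision made at the root span is inherited by every child span:

```python
import random

class Span:
    """Toy span model illustrating head-based sampling (illustrative only)."""

    def __init__(self, name: str, parent: "Span | None" = None, sample_rate: float = 1.0):
        self.name = name
        if parent is None:
            # The keep/drop decision is made once, at the root span...
            self.sampled = random.random() < sample_rate
        else:
            # ...and every downstream span simply inherits it.
            self.sampled = parent.sampled

root = Span("web.request", sample_rate=1.0)   # rate 1.0 => always kept
child = Span("db.query", parent=root)
print(child.sampled)  # True: the child follows the root's decision
```

Because the decision is made up front, lowering the sample rate reduces ingestion for the whole trace, not just individual spans.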
Error and rare traces
Alongside these, sampled error traces (reason = error) provide insight into issues in your application, while low-traffic services and resources are monitored and retained through sampled rare traces (reason = rare).
There are a few other ingestion reasons, such as manual, rum, lambda, xray, appsec and otel. To learn more about these reasons, please see this article.
Usage Metrics
(Source)
To determine the ingestion volume, you can use these metrics:
- datadog.estimated_usage.apm.ingested_bytes (billing dimension)
- datadog.estimated_usage.apm.ingested_spans
- datadog.estimated_usage.apm.ingested_traces
To determine the index volume, you can use this metric:
- datadog.estimated_usage.apm.indexed_spans (billing dimension)
In fact, these metrics are used to construct the out-of-the-box dashboard mentioned earlier.
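As a sketch of how you might slice these metrics yourself, the snippet below builds the parameters for Datadog's `GET /api/v1/query` endpoint (the endpoint is real; grouping by `service` is an assumption about how your workloads are tagged):

```python
import time

def usage_query_params(metric: str, window_hours: int = 24) -> dict:
    """Parameters for Datadog's GET /api/v1/query, grouping an
    estimated-usage metric by service over the last `window_hours`."""
    now = int(time.time())
    return {
        "from": now - window_hours * 3600,
        "to": now,
        "query": f"sum:{metric}{{*}} by {{service}}.as_count()",
    }

params = usage_query_params("datadog.estimated_usage.apm.ingested_bytes")
print(params["query"])
# sum:datadog.estimated_usage.apm.ingested_bytes{*} by {service}.as_count()
```

The same query string also works directly in a notebook or dashboard widget, which is usually quicker than going through the API.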
Apart from traces, there are infrastructure-related billing parameters that you can investigate as well:
- APM host
- APM & Continuous Profiler
- Fargate
For more detailed information on the breakdown, please refer to this link.
Logs
Similar to APM, the concepts of
- ingestion (logs being sent from your application to Datadog) and
- indexing (log retention)
apply here as well.
Use log patterns to identify high-volume logs and evaluate whether they need to be logged at all.
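Datadog's Log Patterns view does this clustering for you; conceptually it resembles the following sketch, which collapses the variable parts of log lines (just digits here, as a simplifying assumption) into templates and counts each template's volume:

```python
import re
from collections import Counter

def to_template(line: str) -> str:
    """Collapse numeric values so similar log lines share one template."""
    return re.sub(r"\d+", "<NUM>", line)

logs = [
    "user 42 logged in",
    "user 7 logged in",
    "payment 981 failed",
    "user 1001 logged in",
]
patterns = Counter(to_template(line) for line in logs)
print(patterns.most_common(1))
# [('user <NUM> logged in', 3)]
```

The templates at the top of the count are your best candidates for demotion to a lower log level or an exclusion filter.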
Usage Metrics
(Source)
To determine the ingestion volume, you can use these metrics:
- datadog.estimated_usage.logs.ingested_bytes
- datadog.estimated_usage.logs.ingested_events

To determine the index volume, you can use this metric:
- datadog.estimated_usage.logs.ingested_events (with the tag datadog_is_excluded:false)
Custom Metrics / Log-based Metrics
Last but not least, do review both
- custom metrics generated by your application
- log-based metrics
Generally, it is advisable to use metrics instead of logs as they offer a longer retention period at a lower cost.
However, be cautious when selecting tags, as they can significantly increase the cost of metrics: high-cardinality attributes can generate a large number of distinct timeseries, which quickly escalates costs. You can refer to this article for a detailed explanation of how metrics are counted.
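To see why, consider a hypothetical metric tagged with endpoint, status and user_id (the figures below are made up): in the worst case, each unique combination of tag values counts as a separate custom metric, so the cardinalities multiply:

```python
from math import prod

def worst_case_custom_metrics(tag_cardinalities: dict) -> int:
    """Worst-case number of distinct timeseries for one metric name:
    one per unique combination of tag values (hypothetical figures)."""
    return prod(tag_cardinalities.values())

n = worst_case_custom_metrics({"endpoint": 30, "status": 5, "user_id": 10_000})
print(n)  # 1500000 -- a single user_id tag turns 150 series into 1.5 million
```

A single unbounded tag like user_id or request_id dominates the product, which is why such identifiers belong in logs or traces rather than metric tags.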
You may check out the pricing here, along with some examples of APM billing. However, keep in mind that your company may have a different pricing structure based on enterprise agreements. For an accurate picture, it’s best to consult your company’s infrastructure team or designated Datadog support team.
Building your own Utilisation Dashboard
See this link for all the estimated usage metrics provided by Datadog. They are free, with a retention period of 15 months 🤑.
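If you prefer a custom view, a minimal sketch of a dashboard definition for Datadog's `POST /api/v1/dashboard` endpoint might look like this (the endpoint and widget schema are real; the title and choice of metric are assumptions for illustration):

```python
import json

def usage_dashboard() -> dict:
    """A minimal ordered dashboard with one timeseries widget
    plotting APM ingested bytes per service (illustrative sketch)."""
    return {
        "title": "Datadog Usage (sketch)",
        "layout_type": "ordered",
        "widgets": [
            {
                "definition": {
                    "type": "timeseries",
                    "title": "APM ingested bytes by service",
                    "requests": [
                        {"q": "sum:datadog.estimated_usage.apm.ingested_bytes{*} by {service}.as_count()"}
                    ],
                }
            }
        ],
    }

print(json.dumps(usage_dashboard(), indent=2))
```

You can extend the widgets list with the logs and custom-metrics usage metrics above to get a single pane covering all three cost drivers.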
That’s about it! This article has discussed various angles for investigating your application’s usage of Datadog, and I hope you now have a starting point. Next up, I will compile some strategies you can use to reduce your Datadog expenses! Stay tuned! 👋