How to integrate Datadog into Flask Applications
At AVIV Group, we use multiple monitoring tools for logging, error reporting, and application and infrastructure metrics. We want to investigate the opportunity of merging all of our monitoring into Datadog. This migration has two main goals:
- unify monitoring tools in order to ease the developer journey at Meilleurs Agents;
- unify monitoring tools across AVIV Group (Meilleurs Agents, SeLoger, Immoweb & Immowelt).
At Meilleurs Agents, our current backend stack mainly consists of Flask microservice applications. The main goal of this study was to validate whether the Datadog Flask integration meets our requirements.
The scope is the following:
- Application profiling
- Traces
- Error Tracking
- Custom Metrics (statsd + prometheus)
- Logging
One of the goals of this migration is to enable a single tool for the entirety of this scope, by ensuring we can easily jump from one concept to another. For instance: moving from traces to logs or correlating logs with metrics.
This study only covers the usage of the Datadog agent associated with an application (in the context of a Flask API). Further analysis will be performed to investigate the cloud provider integration for monitoring logging and pod metrics.
TL;DR — Show me the code
You can find the final implementation of Datadog on a Flask API application in this repository:
https://github.com/aviv-public/datadog_flask_poc
This Flask application exposes multiple endpoints, each of which tests an independent functionality described below. See testapi/__init__.py.
Architecture
General overview
As stated in the docker-compose file, the general overview of the test setup is the following:
In order to communicate with Datadog, we used:
- the containerized version of the datadog-agent: this agent runs in the same environment as the Flask API, collects the API's data, and sends it to Datadog;
- and inside our Flask application we used the following libraries:
- the datadog Python library inside the Flask application, to communicate with the agent;
- the dd-trace python library, to send traces to the agent;
- the prometheus-client library, to expose a /metrics endpoint with metrics for the agent to fetch.
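For reference, this dependency set can be captured in a requirements.txt along these lines (package names as published on PyPI; version pinning omitted):

```
datadog            # DogStatsD client and Datadog API
ddtrace            # tracing and profiling
prometheus-client  # /metrics endpoint exposition
Flask
```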
Detailed data flow
The following schema shows how each element is fetched by the Datadog agent and transferred to Datadog:
Note: a locust container is also added to generate traffic on the Flask application, so that data shows up on the Datadog dashboard.
Application Profiling & Traces
Profiling and traces were pretty straightforward following the documentation. They were enabled by setting the following environment variables in the agent container:
DD_APM_ENABLED=true
DD_APM_NON_LOCAL_TRAFFIC=true
DD_TRACE_ENABLED=true
DD_TRACE_CLI_ENABLED=true
DD_PROFILING_ENABLED=true
DD_PROCESS_AGENT_ENABLED=true
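As a sketch, the agent service in docker-compose can look like the following (the image tag and port mappings are illustrative; the actual setup is in the repository's docker-compose file):

```yaml
datadog-agent:
  image: gcr.io/datadoghq/agent:7
  env_file:
    - ./environment/datadog-agent.env  # DD_API_KEY plus the variables above
  ports:
    - "8126:8126"      # APM traces (TCP)
    - "8125:8125/udp"  # DogStatsD metrics (UDP)
```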
Then in the application initialization:
import os

from ddtrace import tracer

tracer.configure(
    hostname=os.environ.get('DD_AGENT_HOST'),
    port=8126,  # default trace intake port of the agent
)
As a result, profiling and traces are available for your application in your Datadog dashboard.
Error Tracking
While testing in debug (FLASK_DEBUG=1), Datadog error tracking worked out of the box. Every exception raised by the application was transmitted to Datadog properly and was reported in the Error tracking dashboard.
Then, when we switched to production settings (FLASK_DEBUG=0), we noticed that exceptions were no longer reported to Datadog: they were no longer caught by the Datadog library in the Flask application.
After investigating with Datadog support, we were given the following workaround to catch exceptions in a non-debug context. Until this is fixed in the ddtrace library, exceptions can be properly reported to Datadog with the following tweak:
import os
import sys

from ddtrace import tracer
from flask import Flask


def create_app(app: Flask) -> None:
    # configure the tracer with an ErrorFilter that processes traces
    tracer.configure(
        hostname=os.environ.get('DD_AGENT_HOST'),
        port=8126,
        settings={'FILTERS': [ErrorFilter()]}
    )

    # create a custom exception handler that explicitly sets the error spans
    @app.errorhandler(Exception)
    def add_datadog_spans(e):
        span = tracer.start_span('...')
        curr_span = tracer.current_span()
        # sys.exc_info() returns type, value (aka the message) and traceback,
        # which define error.type, error.message and error.stack
        curr_span.set_exc_info(*sys.exc_info())
        return e
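The ErrorFilter used above is a ddtrace trace filter; its exact implementation lives in the linked repository. As a minimal sketch of the interface (assuming only a process_trace method is needed, which returns the trace to keep it or None to drop it):

```python
class ErrorFilter:
    """Sketch of a ddtrace trace filter (the real ErrorFilter from the
    workaround is in the repository).

    ddtrace calls process_trace() for every trace before sending it to
    the agent: return the (possibly modified) trace to keep it, or None
    to drop it.
    """

    def process_trace(self, trace):
        for span in trace:
            if span.error:
                # an error span was found: this is where the workaround
                # can enrich or re-tag the span before it is sent
                pass
        return trace
```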
With this tweak, we are now able to get exceptions with full context in the error tracking dashboard.
Custom Metrics — StatsD
At AVIV, we use StatsD metrics to send custom metrics to our current monitoring stack (InfluxDB/Grafana). We wanted to check whether it was possible to send the same StatsD metrics to Datadog.
We used the DogStatsD metrics aggregation service included in the datadog-agent.
Sending StatsD custom metrics was easy to implement using the Datadog library tools:
from datadog import statsd

statsd.increment(
    metric="my_metric",
    value=1,
    tags=[],  # a list of "key:value" strings
    sample_rate=1,
)
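Under the hood, the client serializes each metric into a small UDP datagram for the agent (DogStatsD listens on port 8125 by default). A standard-library sketch of the counter datagram format helps when debugging what the agent actually receives (the metric name and tag below are illustrative):

```python
import socket

def dogstatsd_counter(metric, value=1, tags=None, sample_rate=1):
    # DogStatsD counter datagram: "<name>:<value>|c[|@<rate>][|#tag1,tag2]"
    packet = f"{metric}:{value}|c"
    if sample_rate != 1:
        packet += f"|@{sample_rate}"
    if tags:
        packet += "|#" + ",".join(tags)
    return packet

# fire-and-forget UDP send to the agent, like the real client does
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(dogstatsd_counter("my_metric", tags=["env:test"]).encode(),
            ("localhost", 8125))
```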
Further investigation is needed for the metric types that we do not currently use:
- SET-type metrics, which seem to set the metric value to 1 instead of the defined value;
- other metric types (histogram, etc.).
Prometheus metrics (OpenMetrics)
At AVIV, we also use Prometheus metrics to monitor endpoints execution time and report custom metrics to our current monitoring backend (Grafana). We wanted to investigate if we were able to report the same metrics to Datadog.
As recommended in the Datadog Documentation, we used the openmetrics check to fetch metrics from our application.
For information, exposing Prometheus metrics consists of serving a /metrics endpoint on the API for the agent to fetch.
Implementing an openmetrics endpoint was also easy using:
- the https://pypi.org/project/prometheus-client/ library;
- datadog-agent openmetrics configuration (see agent_conf/prometheus.d/conf.yaml).
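A minimal agent_conf/prometheus.d/conf.yaml for the openmetrics check looks roughly like this (the metric names are illustrative; the real file is in the repository):

```yaml
instances:
  - openmetrics_endpoint: http://datadog-testapi:5000/metrics
    namespace: testapi
    metrics:
      - request_count
      - request_latency_seconds
```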
Application global metrics
On application initialization:
- we register functions in the application before_request and after_request to measure endpoint timings;
- we expose a /metrics endpoint for the datadog-agent to fetch these metrics.
# testapi/datadog_utils/metrics_prometheus.py
import time

from flask import Flask, Response, request
from prometheus_client import (
    CollectorRegistry,
    Counter,
    Histogram,
    generate_latest,
)

registry = CollectorRegistry()

# metric definitions (names here are illustrative)
REQUEST_COUNT = Counter(
    'request_count', 'Number of requests received',
    ['app_name', 'method', 'endpoint', 'http_status'],
    registry=registry,
)
REQUEST_LATENCY = Histogram(
    'request_latency_seconds', 'Request latency in seconds',
    ['app_name', 'endpoint'],
    registry=registry,
)


def start_timer():
    request.start_time = time.time()


def stop_timer(response):
    resp_time = time.time() - request.start_time
    REQUEST_LATENCY.labels(
        'testapi',
        request.path
    ).observe(resp_time)
    return response


def record_request_data(response):
    REQUEST_COUNT.labels(
        'testapi',
        str(request.method),
        str(request.path),
        str(response.status_code)
    ).inc()
    return response


def setup_metrics(app: Flask):
    # register timing functions on before/after_request
    app.before_request(start_timer)
    # The order here matters: after_request callbacks run in reverse
    # registration order, so stop_timer (registered last) runs first.
    app.after_request(record_request_data)
    app.after_request(stop_timer)

    # expose a metrics endpoint
    @app.route('/metrics')
    def metrics():
        return Response(generate_latest(registry))
Note that this endpoint is declared in the Datadog agent configuration file agent_conf/prometheus.d/conf.yaml:
openmetrics_endpoint: http://datadog-testapi:5000/metrics
Custom metrics
Also, in the application, we can use custom metrics in endpoints, for instance:
from testapi.datadog_utils.metrics_prometheus import registry # import global registry
from prometheus_client import Counter
# create metric
CUSTOM_PROMETHEUS_COUNT = Counter(
'custom_metric',
'Custom Metric',
['type', 'value'],
registry=registry
)
# increment metric
CUSTOM_PROMETHEUS_COUNT.labels(
type="custom_type_name",
value=1,
).inc()
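For reference, once incremented, the counter above is rendered on /metrics in the Prometheus text format roughly as follows (prometheus-client appends the _total suffix to counters):

```
# HELP custom_metric_total Custom Metric
# TYPE custom_metric_total counter
custom_metric_total{type="custom_type_name",value="1"} 1.0
```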
Logs
We investigated the two logging options detailed in the Datadog Documentation:
- Application logging: send application logs to Datadog.
  - 🟢 Multiline logs are grouped in Live Tail.
  - 🟢 Log levels are reported to Datadog.
  - 🔴 Unhandled exceptions (without any explicit log) are not reported in Datadog logs.
- Container logging: the Datadog agent watches all application container logs.
  - 🟢 All container logs are reported in Datadog (including unhandled exceptions).
  - 🔴 Multiline logs (e.g., exception stack traces) are not grouped.
  - 🔴 All logs appear with level INFO in Datadog.
Application logs
When using the application logs option:
- the Flask application's explicit logs are written to a JSON file (see application logging with a file handler in /testapi/datadog_utils/logger);
- the JSON log file is mounted on the datadog-agent container (see the datadog-agent volumes in docker-compose);
- the datadog-agent reads the file and sends the logs to Datadog (see ./datadog-agent-python-logs-conf.yml).
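The file-handler step can be sketched with the standard logging module. A minimal JSON formatter (the field names and file name are illustrative; the actual implementation is in /testapi/datadog_utils/logger) could look like this:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so the Datadog
    agent can reliably parse the level and message."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# in practice the handler writes to the file mounted on the agent,
# e.g. logging.FileHandler("/var/log/testapi/testapi.json")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("testapi").addHandler(handler)
```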
Note: environment/datadog-agent.env lists the environment variables required to send logs.
To test logging, use the /logging/<level> endpoint of the provided application.
Container logs
Another option is to send all application container logs to Datadog. This reports every log the application container generates (including traces, etc.). Gathering such logs is normally done through the orchestrator in use, but for testing purposes you can do it by mounting the Docker socket. This represents a security risk and should not be done in production.
To report application traces, the tracing module (see above) seems better suited than this solution. In datadog-agent.env you need to set:
DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
In docker-compose you need to add:
volumes:
  # LOGGING[OPTION1]: configure logging from a file
  # - ./agent_conf/python.d/conf.yml:/etc/datadog-agent/conf.d/python.d/conf.yml:ro
  # - testapi-logs:/var/log/testapi:ro
  # LOGGING[OPTION2]: enable logging based on container autodiscovery
  - /var/run/docker.sock:/var/run/docker.sock:ro
  - /proc/:/host/proc/:ro
  - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
  - /etc/passwd:/etc/passwd:ro
Conclusion
We managed to use all the Datadog features we required on Flask, with only minor tweaks.