Analyzing Prometheus Metrics with Spark and Athena: Uncovering Hidden Insights

Moshe Zada
Jul 19, 2024

PromQL is a powerful language for querying time series data, but what happens when you want to run some analytics on top of Prometheus? Use analytics tools!

I stumbled upon this topic while dealing with troubling issues in our large Prometheus server.
We were facing instability problems caused by significant CPU and memory usage spikes, mainly because we were tracking a rapidly increasing number of time series.
It became clear that certain time series were contributing to these problems by generating excessive metrics and introducing high-cardinality data. Identifying and addressing these problematic time series proved to be challenging.
The main challenge: you are trying to investigate Prometheus performance issues using Prometheus itself ^_^

Converting Prometheus TSDB Files To Parquet

Prometheus stores its metrics data in a Time Series Database (TSDB), which is optimized for time-stamped or time series data.
So why is converting TSDB files to Parquet valuable? Many query engines can run SQL directly on top of Parquet files, which makes it easy to extract analytical insights. In addition, those queries run offline and don’t add load to an already overloaded Prometheus server.

So, I wrote a small CLI to convert Prometheus chunk files into Parquet file format.

git clone git@github.com:moshe/prometheus-tsdb-parquet.git
cd prometheus-tsdb-parquet
make
bin/prometheus-tsdb-dump \
  --block <path_to_chunk_dir> \
  --output <destination_directory> \
  --shard-size <optional: max number of series in a single Parquet file (default 3_000_000)>

Once we have the Parquet files, we can query them with SQL using Athena or Spark.
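For the Spark examples below, the output directory can be loaded straight into a DataFrame. This is a minimal sketch; the column names (MetricName, the Metric label map, and size) are assumed based on how they are used in the queries later in this post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prometheus-parquet").getOrCreate()

# Load every Parquet shard written by prometheus-tsdb-dump.
# Assumed schema (inferred from the queries below): one row per series with
# MetricName (string), Metric (map of label name -> label value), size (long).
df = spark.read.parquet("<destination_directory>")

# Optional: register a temp view so the SQL examples can also run in Spark.
df.createOrReplaceTempView("df")

On the Athena side, the same files can be exposed as an external table over the output location (for example via a Glue crawler) and queried as df in the next section.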

Example Queries

Athena — Finding High Cardinality Series

Putting a URL in a label is very common in Prometheus, but it can be tricky. Should we also include the query string in the URL? (No.)

The query below finds the top 1,000 series, sorted by the length of their labels. It is handy for finding long text accidentally saved as labels (like URL query strings).

SELECT *
FROM df
ORDER BY size DESC
LIMIT 1000
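The same analysis can also run in Spark if Athena isn’t handy. A rough sketch, using Spark’s higher-order aggregate function (Spark 3.1+) to compute the total label text length per series on the fly, in case you don’t want to rely on the precomputed size column:

from pyspark.sql import functions as F

# Rank series by the combined length of their label values.
# Assumes Metric is a map column of label name -> label value.
label_text_len = F.aggregate(
    F.map_values("Metric"),
    F.lit(0),
    lambda acc, value: acc + F.length(value),
)

(
    df.withColumn("label_size", label_text_len)
      .orderBy(F.col("label_size").desc())
      .limit(1000)
      .show(truncate=False)
)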

Spark — Finding High Cardinality Labels
Prometheus memory usage and performance are largely a function of the number of series kept in memory. The query below finds the labels that change the most across all metrics, i.e., labels that represent some high-cardinality identifier. The snippet uses posexplode to turn each series into one row per label and then counts distinct values per label key.

from pyspark.sql.functions import posexplode

# Explode the Metric label map into one row per (metric, label key, label value).
labels = df.select("MetricName", posexplode(df.Metric).alias("pos", "key", "value"))
labels.createOrReplaceTempView("labels")
# Count distinct values per label key -- high counts point to high-cardinality labels.
spark.sql("""
SELECT key, count(distinct value) as c
FROM labels
GROUP BY key
ORDER BY count(distinct value) DESC
LIMIT 1000
""").collect()

That’s it! With the help of Spark and Athena, we reduced our metric cardinality by 50% (!), while keeping the expensive analytical queries offline.
