Benchmarking data stores for event-based monitoring

Saiful Khan
Elucidata
Sep 19, 2022 · 12 min read

Monitoring a process using the events it generates is a very common application in computing. Based on the complexity of the process and the target observer, this can be implemented in different ways.

The most basic form of event-based monitoring, one that most developers will be familiar with, is logging. In such cases, the “process” is usually a computer program running either locally or in the cloud. The “events” are messages written to a log. By watching messages appear in this log over time, the developer can tell what is going on with their program.

An ETL job is another kind of process that data teams might want to monitor. Our data ingestion pipeline at Elucidata has hundreds of such ETL jobs running at any given point in time. Jobs can be of different types and varying scope, with different responsibilities, and may even run on distributed compute units. One of our near-term goals is to let our users add and manage their data on Polly by themselves. For this to be a successful feature, users also have to be able to see the progress of their data jobs and their logs as they go through various stages of processing. This calls for a job monitoring service of sorts.

We built this job monitoring service using Amazon Timestream as our event store. It looked like a promising product and we were excited to try it out for something useful. This post talks a little bit about our experience with Timestream and our benchmarks against another strong contender in this space — Elasticsearch. We included Cloudwatch (Logs & Insights) in our benchmarks as well, because our internal monitoring dashboard is running on this. Postgres on RDS was also included as a reference OLTP database.

Motivation

Let us quickly go over a simplified version of our requirements. For that, let's define the two main concepts in the system:

  • Jobs — A job transforms some piece of source data to generate some desired output data. Depending on the type of work being done, a job goes through a certain set of stages.
  • Batch — A collection of jobs, grouped based on some common relationship between them. The relationship itself is not important, other than the fact that it gives the user an easy way to treat a large number of jobs as a single unit. A batch has a status that is determined by the stages of its underlying jobs.
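
To make the data model concrete, here is a rough sketch of a single job event. The field names mirror the ones used in the example queries later in this post; the exact values are illustrative assumptions, not our production schema.

  # A hedged sketch of one job event; field names follow the example queries in
  # this post, the values are purely illustrative.
  example_event = {
      "batch_id": "16554252419__1661749456__1110",  # static: which batch the job belongs to
      "job_id": "job-000042",                       # static: identifies the job
      "user_id": "1110",                            # static: owner of the batch
      "repo_id": "repo-7",                          # static: destination repository
      "created_at": "2022-08-29T04:24:16Z",         # static: job creation time
      "stage": "Transforming",                      # dynamic: current stage of the job
      "stage_progress": 0.6,                        # dynamic: progress within the stage
      "finished": 0,                                # dynamic: becomes 1 when the job succeeds
      "errored": 0,                                 # dynamic: becomes 1 if the job fails
      "time": "2022-08-29T04:31:02Z",               # timestamp of this event
  }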

Using the events generated from each job, we want to be able to do:

  • Filtering — on static attributes of a batch, which are present in each event of each job in that batch
  • Aggregation — deduce the status of a batch, from the collective stages of its jobs
  • Filtering on aggregations — on dynamic attributes like status of a batch
  • Sorting — batches ordered based on creation time
  • Some time-series queries — fetch batches created in a certain date range

Currently, our internal data teams are monitoring the progress of their ETL jobs on a Cloudwatch Dashboard. Our jobs are already generating logs and from those logs, we are able to get the relevant stats. But the latency observed on every refresh of the dashboard is not great at all. We want to be able to show jobs as old as 3 months. Fetching stats from 3 months of logs is very inefficient.

We were looking for a data store on top of which we can build a custom job monitoring dashboard as a part of Polly itself. That’s when we discovered Timestream. It is primarily built for timeseries data and supports regular and irregular timeseries. Ours is irregular timeseries (events) data. Apart from being serverless, it supports SQL-based querying on its schemaless tables. This was a very attractive proposition and we ended up choosing Timestream for our job monitoring service.
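
For a flavour of what this looks like in practice, here is a minimal sketch of writing one event to Timestream and querying it back with boto3. The database and table names, and the split between dimensions and measures, are assumptions for illustration.

  import time
  import boto3

  # Minimal sketch: write one job event and run a SQL query against it.
  # Database/table names and the dimension layout are assumptions.
  write_client = boto3.client("timestream-write")
  query_client = boto3.client("timestream-query")

  write_client.write_records(
      DatabaseName="monitoring",
      TableName="monitoring_events",
      Records=[{
          "Dimensions": [
              {"Name": "batch_id", "Value": "16554252419__1661749456__1110"},
              {"Name": "job_id", "Value": "job-000042"},
              {"Name": "user_id", "Value": "1110"},
          ],
          "MeasureName": "stage_progress",
          "MeasureValue": "0.6",
          "MeasureValueType": "DOUBLE",
          "Time": str(int(time.time() * 1000)),  # milliseconds since epoch
      }],
  )

  # Plain SQL over a schemaless table; no index or schema management needed.
  result = query_client.query(
      QueryString="SELECT batch_id, COUNT(*) AS num_events "
                  "FROM monitoring.monitoring_events GROUP BY batch_id"
  )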

The problem

We had done some basic benchmarking of Timestream by running it on a large number of dummy events and calculating the average response time. Most of the queries gave us sub-second averages. So it was a surprise to us that our GET APIs (running Timestream queries internally) were routinely taking anywhere from 2 to 5 seconds to respond. On subsequent API calls the latency would improve, but the 5-second worst case still hurt. Our dashboard is supposed to have an auto-refresh rate of 1 minute, and it was those refresh requests that took the longest time.

Timestream’s contribution to our latency

It seems like Timestream has a bit of a “cold-start” problem, which is not uncommon in serverless systems. The gap between two cold-starts seems to be at least one minute. In the worst case, if we have just one user with their dashboard open and refreshing once a minute, they will notice this long wait every single time. At this rate, our custom dashboard, even though it looked prettier, would be no more performant than the existing Cloudwatch dashboard.
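
The pattern is easy to reproduce with a small script like the following (a sketch, not our production code), which times the same query back to back and then again after the service has been idle for a while.

  import time
  import boto3

  # Sketch: time the same Timestream query when "cold" and when "warm".
  # The query string and database name are placeholders.
  client = boto3.client("timestream-query")
  QUERY = "SELECT COUNT(*) FROM monitoring.monitoring_events"

  def timed_query():
      start = time.perf_counter()
      client.query(QueryString=QUERY)
      return time.perf_counter() - start

  print("first call:", timed_query())    # typically the slowest
  print("second call:", timed_query())   # noticeably faster
  time.sleep(90)                         # stay idle for more than a minute
  print("after idling:", timed_query())  # slow again, i.e. another cold start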

The reason we missed this behavior in our earlier, preliminary benchmarks was that we were looking at average times over a large number of requests. Knowing this, we now wanted to perform a more thorough benchmark. This time we would also throw in a few other data stores to see how Timestream fares against them.

Benchmarking setup

Our requirements are fairly common and prevalent in most analytical workloads. We have plenty of options today for storing and querying on event data. But we are a small team and we want to stick to services which we can get started with quickly and are easy to maintain. Also, at this point our entire data infrastructure is hosted on AWS, so it makes sense to start with AWS services too. Therefore, the benchmark was performed on the following data-stores:

  • Timestream — We were already using it.
  • Elasticsearch (on AWS) — Not just a search engine. Excellent support for all our requirements.
  • Cloudwatch (Logs + Insights) — Our existing internal dashboard is running on Cloudwatch Logs being queried through Cloudwatch Insights. It made sense to put this in the comparison.
  • Postgres (RDS) — Just added as a reference transactional database. Without any plugins, we are not expecting vanilla Postgres to perform well in this context.

Please note that, of the four above, only Timestream and Elasticsearch are true contenders for our use case.

Events were generated by a “writer” and written to each of the data stores. Events were then periodically consumed by a “reader”. Both wrote their measurements of write and read performance to a DynamoDB table.
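
As a sketch of what recording one such measurement might look like (the DynamoDB table name and attribute names here are assumptions):

  import time
  import uuid
  from decimal import Decimal

  import boto3

  # Sketch: push one timing measurement to DynamoDB. Table and attribute names
  # are assumptions for illustration.
  measurements = boto3.resource("dynamodb").Table("benchmark-measurements")

  def record_measurement(store, operation, seconds, scale):
      measurements.put_item(Item={
          "measurement_id": str(uuid.uuid4()),
          "data_store": store,                   # e.g. "timestream", "elasticsearch"
          "operation": operation,                # e.g. "write_100", "query_type_1"
          "duration_s": Decimal(str(seconds)),   # DynamoDB wants Decimal, not float
          "scale": scale,                        # 1x .. 8x
          "recorded_at": int(time.time()),
      })

  start = time.perf_counter()
  # ... perform one batched write or one query against a data store ...
  record_measurement("timestream", "query_type_1", time.perf_counter() - start, 8)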

Benchmark design

A series of artificial batches of jobs was generated and their events were written to each data store. This was done in a systematic manner:

  • 8 sets of 20 batches each were generated. This was done so as to increase the scale of the data after each iteration.
  • Each batch had about 7000 jobs on average. Each job generates 7 events at most. Ultimately, we had about 4 million records in total, in each data store.
  • The jobs spanned two days. These 566k jobs are a fair representation of the kind of workload we expect to see on our production infra.
  • Write performances were recorded for each batch of records written. Most databases perform better with batch writes. We chose batches of 100 messages each for Cloudwatch Logs. For Elasticsearch and Postgres, we had batches of 500 records per write. For Timestream, we had 10 concurrent threads writing 100 records each, and the cumulative time taken by the threads was taken as one write measurement (see the sketch after this list).
  • After each iteration we ran the query benchmarks, giving us read performance at each scale (from 1x to 8x). Each query was run 10 times to get a distribution of read times.
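
Here is a sketch of the Timestream write measurement mentioned in the list above, interpreting the cumulative time as the wall-clock time for all ten threads to finish. Record construction is elided and the names are assumptions.

  import time
  from concurrent.futures import ThreadPoolExecutor

  import boto3

  # Sketch: 10 threads, each writing a chunk of 100 Timestream records; the time
  # for all of them to finish is recorded as one write measurement.
  write_client = boto3.client("timestream-write")

  def write_chunk(records):
      write_client.write_records(
          DatabaseName="monitoring",
          TableName="monitoring_events",
          Records=records,
      )

  def timed_concurrent_write(chunks):
      # chunks: 10 lists of 100 Timestream record dicts each
      start = time.perf_counter()
      with ThreadPoolExecutor(max_workers=10) as pool:
          list(pool.map(write_chunk, chunks))  # consume to propagate any errors
      return time.perf_counter() - start       # one write measurement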

Each event contained a few static values and a few (dynamic) stage-based values. Normalization of data is clearly not our goal here.

Performance results

For all the quantitative benchmarks, we will be plotting the performance data as box plots. This not only shows us the “typical” write or read times but also the extremes. Knowing these extremes is important: they can tell us whether we need to take extra measures to maintain good write performance or low response latencies. Sometimes the worst-case performance may simply make a service unviable for your requirements.

Write performance

Comparing write performance is not the primary purpose of this exercise. However, it is still interesting to see how the different data stores fare against one another. The following plot shows the distributions of write times per 100 records.

Write performance of each data store

Key observations:

  • The average write performance for Cloudwatch Logs is pretty good, but at its slowest, it is many times worse. It is these kinds of observations that are lost in simple bar plots.
  • Elasticsearch has the healthiest write performance numbers, with 100 ms at its worst.
  • Timestream write performances show a large variation. It should be noted that with more concurrent writers, Timestream might be able to achieve much better throughput.

Query performance, Type I

We finally come to the query benchmarks. For each type of query, we will show an example SQL query to formally elucidate our intention.

The first query type is not at all useful for our monitoring dashboard, but it's something simple that we can start with — get the total number of events per batch:

  SELECT batch_id, COUNT(*) AS num_events
  FROM monitoring_events
  GROUP BY batch_id
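
For comparison, a roughly equivalent request on Elasticsearch is a terms aggregation on batch_id, with each bucket's doc_count giving the event count. The client setup, index name and keyword mapping in this sketch are assumptions.

  from elasticsearch import Elasticsearch

  # Sketch: Type I on Elasticsearch via a terms aggregation. Endpoint, index
  # name and the keyword mapping of batch_id are assumptions.
  es = Elasticsearch("http://localhost:9200")

  response = es.search(
      index="monitoring_events",
      body={
          "size": 0,  # we only want the aggregation buckets, not raw events
          "aggs": {
              "events_per_batch": {"terms": {"field": "batch_id", "size": 1000}}
          },
      },
  )
  for bucket in response["aggregations"]["events_per_batch"]["buckets"]:
      print(bucket["key"], bucket["doc_count"])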

Note: For each of the query benchmark types, we will show two plots. The first plot shows the distribution of query performances at 8x scale (highest number of records). The orange dot in the first plot is the read time when the first query request was made. The second plot shows how average query performance changes with scale (from 1x to 8x). In both plots, lower values are better.

Performance profile for Type I queries:

Query performance at 8x scale, Type I
Average query performance with increasing scale, Type I

Key observations:

  • Elasticsearch returns every single time in less than one-third of a second.
  • Timestream shows a large variance in query performances. The first call (cold-start?) is noticeably worse.

Query performance, Type II

In the second query, we will be doing multiple aggregations by batch ID. This is a more useful query and one that we may run in certain situations — get the latest stats of each batch:

  SELECT batch_id,
         COUNT(DISTINCT(job_id)) AS num_jobs,
         SUM(CAST(finished AS integer)) AS successful_jobs,
         SUM(CAST(errored AS integer)) AS errored_jobs,
         MIN(created_at) AS creation_time,
         MAX(time) AS last_updation_time
  FROM monitoring_events
  GROUP BY batch_id
  LIMIT 100

Performance profile for Type II queries:

Query performance at 8x scale, Type II
Average query performance with increasing scale, Type II

Key observations:

  • Elasticsearch still shows the best performance, with 75% of queries returning under half a second.
  • Timestream gives more predictable performance this time. The worst-case performance is still not good though.
  • Postgres likely has to do full table scans this time around.

Query performance, Type III

In this query, we will be doing multiple aggregations, but also applying filters on static as well as aggregated attributes. Finally, we will be sorting based on an aggregated attribute. This sort of query will routinely run on our dashboard — get batches that have completed successfully, sorted by creation time:

  SELECT batch_id,
         COUNT(DISTINCT(job_id)) AS num_jobs,
         SUM(CAST(finished AS integer)) AS successful_jobs,
         SUM(CAST(errored AS integer)) AS errored_jobs,
         MIN(created_at) AS creation_time,
         MAX(time) AS last_updation_time
  FROM monitoring_events
  WHERE user_id = '1110'
  GROUP BY batch_id
  HAVING SUM(CAST(finished AS integer)) = COUNT(DISTINCT(job_id))
  ORDER BY creation_time

Performance profile for Type III queries:

Query performance at 8x scale, Type III
Average query performance with increasing scale, Type III

Key observations:

  • Every single Elasticsearch query returns under half-a-second.
  • Postgres and Timestream are doing much better this time. This is likely due to the filters being applied, which probably narrows down the search space.

Query performance, Type IV

This fourth query is different in that, for each group, it picks values corresponding to the maximum of another dimension (the event time). This is another query we will run often, when viewing jobs within a batch — get the latest stage and progress for each job:

  SELECT job_id,
         MAX_BY(stage, time) AS stage,
         MAX_BY(stage_progress, time) AS stage_progress,
         MAX_BY(errored, time) AS errored,
         MAX(time) AS last_updation_time
  FROM monitoring_events
  WHERE batch_id = '16554252419__1661749456__1110'
  GROUP BY job_id
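
As an aside, vanilla Postgres has no MAX_BY aggregate, so the same intent there has to be expressed differently. One common option is DISTINCT ON, sketched below with placeholder connection details.

  import psycopg2

  # Sketch: the "latest event per job" intent on Postgres using DISTINCT ON,
  # which keeps the first row per job_id after sorting by time descending.
  LATEST_PER_JOB = """
      SELECT DISTINCT ON (job_id)
             job_id, stage, stage_progress, errored, time AS last_updation_time
      FROM monitoring_events
      WHERE batch_id = %s
      ORDER BY job_id, time DESC
  """

  with psycopg2.connect(host="localhost", dbname="monitoring") as conn:
      with conn.cursor() as cur:
          cur.execute(LATEST_PER_JOB, ("16554252419__1661749456__1110",))
          rows = cur.fetchall()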

Performance profile for Type IV queries:

Query performance at 8x scale, Type IV
Average query performance with increasing scale, Type IV

Key observations:

  • All queries, except those run on Cloudwatch Logs, perform extremely well. The search space here has been reduced drastically — to a single batch.
  • Cloudwatch Insights queries taking around 3 seconds is, at this point, a consistent trend.

Query performance, Type V

This final query is again not needed for our dashboard, but we want to test how each of the data stores performs for a common timeseries use case — get the number of jobs finished per repo, each minute, for a small set of users:

  SELECT repo_id,
         BIN(time, 1m) AS interval,
         COUNT(DISTINCT(job_id)) AS num_jobs
  FROM monitoring_events
  WHERE stage = 'Finished'
    AND user_id IN ('0011', '0111', '1111', '1110', '1100')
  GROUP BY repo_id, interval
  ORDER BY interval DESC
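
On Elasticsearch, the same per-minute bucketing maps naturally onto a date_histogram aggregation nested under a terms aggregation on repo_id. As before, the index name, field names and mappings in this sketch are assumptions.

  from elasticsearch import Elasticsearch

  # Sketch: Type V on Elasticsearch. A terms aggregation per repo, a one-minute
  # date_histogram inside it, and an approximate distinct count of jobs per
  # bucket. Index and field names are assumptions.
  es = Elasticsearch("http://localhost:9200")

  response = es.search(
      index="monitoring_events",
      body={
          "size": 0,
          "query": {
              "bool": {
                  "filter": [
                      {"term": {"stage": "Finished"}},
                      {"terms": {"user_id": ["0011", "0111", "1111", "1110", "1100"]}},
                  ]
              }
          },
          "aggs": {
              "per_repo": {
                  "terms": {"field": "repo_id"},
                  "aggs": {
                      "per_minute": {
                          "date_histogram": {"field": "time", "fixed_interval": "1m"},
                          "aggs": {"num_jobs": {"cardinality": {"field": "job_id"}}},
                      }
                  },
              }
          },
      },
  )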

Performance profile for Type V queries:

Query performance at 8x scale, Type V
Average query performance with increasing scale, Type V

Key observations:

  • Elasticsearch is once again excellent. Far superior to Timestream, which is supposed to be a timeseries database.
  • Postgres starts off being better than Timestream but with scale it eventually gets left behind.
  • Timestream’s performance, although not as good as expected, is quite stable even with increasing amounts of data.

Qualitative comparison

Good performance is definitely important but it is not everything. Depending on one’s use-cases and the amount of time one can afford, some of these data-stores will be more suitable than others. Let us compare our data stores based on some commonly desired qualities:

Feature matrix comparing the data stores

Based on the amount of green and yellow, Timestream seems to be the most featureful option among the lot.

Cost comparison

No benchmark is complete without a cost comparison. The pricing of a service should not be unjustifiable for the performance it provides. Predicting costs over some constant period of time would be difficult and prone to inaccuracy. So, here we will just attach the cost incurred by each instance of the service for the duration of the benchmark, which took place over a span of 4 days:

Service costs incurred throughout the benchmark

Cloudwatch wins here, since it is built to store vast numbers of messages and its per-query cost is very low. Elasticsearch costs the most, but it should be noted that those costs remain constant per day up to a certain (pretty high) threshold of data volume. Timestream also seems to be a very economical service for our purposes.

Caution!

Before we conclude this post, we would like to mention some possible biases that may make this benchmark “not true” or completely inapplicable to your work:

  • Scale of data — The amount of data we performed this benchmark with, may not have been large enough to notice some emergent scalability patterns.
  • Variety of queries — We started off by explaining why we were doing this exercise. The types of queries we ran were specific to our data model and use cases. They were not exhaustive, and your primary use case may not have been covered at all.
  • Minimal optimizations — While we did pay attention to modeling the data so as to support our queries well, some of the data stores (especially Postgres and Elasticsearch) could be further optimized to beat the competition on certain metrics. But the idea here is that the data store should be capable of achieving “good enough” performance without a whole lot of tinkering and optimization on the developer's part. We did spend some time trying to come up with more performant queries for each use case.

Conclusion

There is definitely a winner here for us. However, your requirements might be different enough that one of the others may be more attractive to you. We will end this post with some notes on the strengths and weaknesses of each:

We really like what Elasticsearch has to offer in terms of predictable performance and costs. It also checks most of the boxes in terms of features. Therefore, we are planning to replace Timestream with Elasticsearch in our job monitoring service, sometime in the near future.

With that, we come to the end of our benchmarks. If you have other data stores in mind that you want to compare against this lineup, you can do so by adding to the source code. The infra configuration, source, and analysis scripts are all available in this GitHub repository. In case you want to suggest changes or face issues with reproducibility, please raise an issue there.
