Building S4 — our own high-performance time-series data store

Fred Greer
Published in Life at Propeller · 10 min read · Dec 9, 2022

“Never write your own database” is a tough-to-dispute nugget of wisdom in the software engineering world. Database systems have to consider all sorts of subtle concurrency, consistency and durability issues, and the reward for any small mistake is lost data and a very bad day.

Despite that (and with great apprehension) we recently dived down the rabbit-hole and implemented our own time-series data store.

In this post I’d like to take you on the journey through the challenges that took us down this path, how we made the call to roll our own, the implementation process and how it turned out to be a great move after all.

The problem

Propeller’s DirtMate is a centimetre-accurate GPS tracking system that mounts on heavy machinery moving earth on worksites and continuously monitors the elevation of the ground underneath the wheels to update Propeller’s 3D drone survey. DirtMates stream a measurement payload back to Propeller every second containing positioning information for every visible GPS satellite, orientation and vibration data from the onboard IMU and various device metadata. Each second of data clocks in at around 2 kilobytes. That might not sound like much, but with 86,400 seconds per day and thousands of sensors in the field the numbers get very large very quickly: hundreds of gigabytes and tens of millions of rows per day to store and process.

The original implementation of our backend for storing and analysing DirtMate data was built on top of Amazon DynamoDB for raw sensor data and TimescaleDB for aggregations and complex queries. These are extremely powerful and capable systems, but as we scaled up the number of devices in the field and developed more features using the data we found ourselves repeatedly running into the same issues:

  • Most of our data analysis features (automated truck-cycle counting for example) rely on complex geometry operations which cannot be performed in a database query. Therefore large amounts (potentially hundreds of megabytes) of raw sensor data need to be loaded into application code as quickly as possible to compute results.
  • DirtMate traffic is inherently bursty, so both database backends needed large (and expensive) amounts of reserved capacity to handle the read spikes.
  • We were constantly needing to compromise how long we could keep sensor data due to the high cost of database storage — at least 10x the cost of S3 object storage.

Rather than working on new features, we found ourselves spending most of our time optimising queries and implementing complex caching and pre-computation strategies throughout the pipeline to ensure the product felt snappy to use while keeping costs sane.

After several months of this cycle we acknowledged that it was always going to be an uphill battle to meet our cost and performance targets with the current tools and went back to the drawing board.

So we cooked up a wishlist. With the benefit of hindsight it was clear that any replacement data storage layer would need:

  • Good single row insert performance, with the ability to ingest individual samples from 10,000–100,000 sensors every second.
  • Very low cost storage to make it practical to store long periods of historical data.
  • Very high read throughput for raw and downsampled / aggregated data.
  • Arbitrary horizontal scaling of reads and writes so performance could be increased by turning on more nodes rather than relying on the performance of single instances.

Features that turned out to be less important than we had originally anticipated were:

  • Complex analytical queries
  • Support for joins between different data sets

The anguish

So we set to work finding an off-the-shelf solution to our performance woes. “No problem” we thought, “time-series databases are the hot new thing, surely there will be an unreasonable number of options that fit our straightforward use case”.

Quickly we found that almost all databases we looked at fell into one of two camps:

Data lakes (think ClickHouse, Snowflake, Athena)

  • Can use low cost object storage (e.g. AWS S3)
  • Great at aggregate queries over large datasets
  • Optimised for batch inserts, not suitable for streaming workloads

Event / time-series databases (TimescaleDB, InfluxDB, AWS Timestream, Kafka)

  • Require instance attached storage (generally 10x cost of object storage)
  • Good streaming insert performance
  • Often limited to single node or require complex sharding / scaling configuration

Surprisingly, we struggled to find one system that met all the requirements on our shopping list. A few meetings deep, someone (half joking) suggested that we should just roll our own Frankenstein of AWS S3 for bulk storage and “throw Redis in front of it” to handle streaming inserts.

As we continued our search through the shiny web-scale-IoT-blockchain-cloud-hyperconvergence-at-scale™ brochures for an off-the-shelf system, the idea of building our own slowly became less of a joke. There was still collective skepticism about building an in-house database, and some wincing at the inevitable time we would spend hunting subtle consistency bugs, but it started to look like an attractive option as we struggled to tick all the boxes on our list.

The cardboard prototype

Propeller loves a low-fidelity mock-up to test a concept quickly, to the point where the original Propeller AeroPoint prototypes were crafted in an afternoon from yoga mats and duct tape.

Curiosity got the best of us and Jye, Propeller’s Capture team engineering manager, built a simple proof-of-concept prototype (nicknamed Baby Moon — I’m assured it’s a Dr Who reference) that included:

  • Streaming records into Redis, using a sorted set to model the time series (sketched in the code below)
  • A batch task to move time windows from Redis into “segment” files in AWS S3
  • A batch task for computing downsampling / aggregations of windows of data
  • A retrieval endpoint that loaded data from S3 and Redis and merged the results

Baby Moon prototype design architecture sketches
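To make the first item in that list concrete, here is a minimal sketch of the sorted-set ingestion step. The key layout (one sorted set per sensor per hour), the field names and the connection string are illustrative assumptions, not the actual Baby Moon schema.

```typescript
// Sketch: stream one sensor sample into a Redis sorted set, scored by its
// epoch-millisecond timestamp. Key layout is assumed for illustration.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

interface SensorSample {
  sensorId: string;
  timestamp: number; // epoch milliseconds
  payload: Record<string, unknown>;
}

async function ingestSample(sample: SensorSample): Promise<void> {
  // One sorted set per sensor per hour (assumed bucketing scheme).
  const hourBucket = Math.floor(sample.timestamp / 3_600_000);
  const key = `samples:${sample.sensorId}:${hourBucket}`;
  // Score = timestamp, member = serialised payload, so a time-range query
  // becomes a single ZRANGEBYSCORE call.
  await redis.zadd(key, sample.timestamp, JSON.stringify(sample.payload));
}
```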

Benchmarking

To understand the relative performance of a few shortlisted database options, including our Baby Moon prototype, we generated a synthetic dataset that modeled:

  • 300 sensors
  • 1 sample per second
  • 30 days of data retention

This came to around 800 million rows or 1.5 terabytes of source data.

We ingested this dataset into:

  • TimescaleDB Cloud
  • TimescaleDB running on an Amazon EC2 instance
  • TimescaleDB running on a MacBook Pro (2021 14” M1 Pro) with very high disk throughput (6,000 MB/s)
  • Elasticsearch running on EC2
  • Baby Moon running on EC2 with Amazon MemoryDB for Redis and Amazon S3 for object storage.

We ran TimescaleDB on the three kinds of hardware to understand whether the performance limits we were seeing in production were due to hardware bottlenecks. TimescaleDB and Elasticsearch were configured with best-practice indexes and data compression as recommended in their documentation. All EC2 / Timescale cloud instances were configured with 4 CPUs and 16 GB of RAM.

We then ran 9 queries that simulated our real world workloads against these systems using random time windows to mitigate the effects of caching and averaged the results of 10 runs for each.

We found that all the off-the-shelf database systems were constrained by disk speed when retrieving large amounts of data. Conversely, by fanning out queries to retrieve records across many files stored in S3, the Baby Moon prototype was predominantly bottlenecked by CPU and network limits, which are typically much higher in a cloud environment.
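As an illustration of that fan-out, a query can fetch many segment files from S3 concurrently and flatten the results. The bucket layout and the JSON segment encoding below are assumptions for the example, not the real on-disk format.

```typescript
// Illustrative fan-out read: fetch many segment files from S3 in parallel and
// flatten the decoded rows.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function fetchSegments(bucket: string, keys: string[]): Promise<unknown[]> {
  const bodies = await Promise.all(
    keys.map(async (key) => {
      const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
      // In the Node.js SDK v3 the response body exposes transformToString().
      return res.Body!.transformToString();
    })
  );
  // Segments are assumed to be JSON arrays of rows for this sketch.
  return bodies.flatMap((body) => JSON.parse(body) as unknown[]);
}
```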

Here is a representative result: Baby Moon performed 20x faster than TimescaleDB on the same EC2 hardware.

We also found that for our use case of retrieving large amounts of data across multiple devices / partitions the performance scaled close to linearly, with a 4x increase in CPU cores (4 to 16) decreasing query time by 3.2x.

Development

With our newfound confidence that the performance and cost gains of building our own data storage layer were probably worth the complexity of rolling our own, we set to work building the full-featured production implementation — codenamed “S4” (because y’know, it’s a little bit more on top of S3).

The S4 system is made up of several components:

Read / write API server

The API server handles:

  • Receiving sensor records from our IoT sensor data ingestion edge servers and inserting them into MemoryDB (Redis).
  • Handling queries by retrieving the relevant data from MemoryDB and S3 and filtering / merging the results (the merge step is sketched below).
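As a rough sketch of that merge step, rows from the two stores could be combined like this, assuming each row carries an epoch-millisecond timestamp and that MemoryDB wins on duplicate timestamps (the types and the tie-breaking rule are illustrative, not S4's actual behaviour):

```typescript
// Merge recent rows from MemoryDB with older rows from S3 segments, dedupe by
// timestamp and return them filtered to the requested window, in time order.
interface Row {
  timestamp: number;
  [field: string]: unknown;
}

function mergeRows(fromS3: Row[], fromMemoryDb: Row[], start: number, end: number): Row[] {
  const byTimestamp = new Map<number, Row>();
  // Insert S3 rows first so MemoryDB rows overwrite any duplicates.
  for (const row of fromS3) byTimestamp.set(row.timestamp, row);
  for (const row of fromMemoryDb) byTimestamp.set(row.timestamp, row);
  return [...byTimestamp.values()]
    .filter((row) => row.timestamp >= start && row.timestamp < end)
    .sort((a, b) => a.timestamp - b.timestamp);
}
```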

Amazon MemoryDB

Amazon MemoryDB is a managed Redis-compatible key-value store from Amazon. We chose it over ElastiCache or a self-managed Redis cluster because it guarantees durability by persisting every write to a multi-AZ transaction log, while still offering very fast write performance (typically <5 ms).

Amazon S3

Bulk data is stored in Amazon S3 as segment files, each typically holding around 1 hour of data for a specific sensor. We chose S3 due to its practically infinite storage capacity, low cost (around $0.02 / GB / month) and very high parallel read throughput due to its horizontal scaling architecture.

Segment worker

Once a segment is no longer being actively written to, the segment worker moves its data out of MemoryDB and into S3.
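A stripped-down sketch of that step for a single sensor-hour, reusing the sorted-set key layout assumed in the earlier ingestion sketch (the bucket name, key scheme, segment encoding and error handling are all simplified assumptions):

```typescript
// Sketch of the segment worker's core step: copy a closed sensor-hour from
// Redis to an S3 segment object, then remove it from Redis.
import Redis from "ioredis";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const s3 = new S3Client({});

async function flushSegment(sensorId: string, hourBucket: number): Promise<void> {
  const key = `samples:${sensorId}:${hourBucket}`;
  // Read every member of the closed hour along with its score (timestamp).
  const members = await redis.zrange(key, 0, -1, "WITHSCORES");
  if (members.length === 0) return;

  // Upload first, delete second: if the upload fails the data stays in Redis.
  await s3.send(
    new PutObjectCommand({
      Bucket: "s4-segments", // hypothetical bucket name
      Key: `segments/${sensorId}/${hourBucket}.json`,
      Body: JSON.stringify(members),
    })
  );
  await redis.del(key);
}
```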

Aggregation worker

As new data is inserted into S4, the aggregation worker computes aggregate values (e.g. min / max / average / first / last) for time windows and inserts them into a new table. This allows us to query large time ranges at a downsampled rate extremely quickly, to give a graph of battery level every hour over a month, for example.
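For a single numeric field, the per-window computation is conceptually simple; here is a sketch assuming a flat { timestamp, value } sample shape (the real records carry many fields):

```typescript
// Compute min / max / average / first / last for one time window of samples.
interface Sample {
  timestamp: number;
  value: number;
}

interface Aggregate {
  windowStart: number;
  min: number;
  max: number;
  average: number;
  first: number;
  last: number;
}

function aggregateWindow(samples: Sample[], windowStart: number): Aggregate {
  if (samples.length === 0) throw new Error("cannot aggregate an empty window");
  const sorted = [...samples].sort((a, b) => a.timestamp - b.timestamp);
  const values = sorted.map((s) => s.value);
  return {
    windowStart,
    min: Math.min(...values),
    max: Math.max(...values),
    average: values.reduce((sum, v) => sum + v, 0) / values.length,
    first: sorted[0].value,
    last: sorted[sorted.length - 1].value,
  };
}
```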

The API server and worker processes are written in TypeScript using NestJS and are deployed in isolated containers in our Kubernetes cluster so they can be scaled independently and automatically based on load.

Once it was up and running, we ran synthetic load tests to confirm that the system performed as expected and would automatically scale to handle data from up to 10,000 sensors.

Migration

The most time-consuming part of the S4 project was migrating our data from DynamoDB and TimescaleDB while ensuring the different systems contained exactly the same data, without incurring any downtime.

We did this with a staged approach:

Step 1: Start inserting into S4

As all our sensor data is ingested through a single edge server deployment, it was relatively straightforward to update it to start inserting new data into S4 while still writing to the legacy systems.

Step 2: Migrate historical data from TimescaleDB and DynamoDB

We then wrote scripts to read historical values from the legacy systems and insert them into S4. Fortunately, our sensor time-series data only ever includes a single, unchanging value per timestamp, so we could safely overwrite any values that already existed in S4 and run the migrations multiple times if needed.

This was conceptually simple, but due to the volume of data (tens of terabytes) we had to shard this work across many workers.

For TimescaleDB we used a simple SQL adapter to load the old records, while for DynamoDB we were able to leverage DynamoDB's S3 export feature. This allowed us to import the DynamoDB records from files it dumped into S3, which is much faster than querying DynamoDB directly and meant we did not have to pay for extra DynamoDB read capacity.

Step 3: Data verification

At this stage S4 should contain all new and historical data. We wrote some simple scripts to scan through TimescaleDB, DynamoDB and S4 and verify that they contained the same records.
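Conceptually the check boils down to comparing the sets of timestamps each store returns for the same sensor and time window; here is a sketch with placeholder fetch functions (the real adapters and record shapes are not shown here):

```typescript
// Compare the timestamps a legacy store and S4 return for the same window and
// report anything missing from, or unexpectedly present in, S4.
async function verifyWindow(
  fetchLegacy: () => Promise<number[]>, // timestamps from TimescaleDB or DynamoDB
  fetchS4: () => Promise<number[]> // timestamps from S4
): Promise<{ missingInS4: number[]; extraInS4: number[] }> {
  const [legacy, s4] = await Promise.all([fetchLegacy(), fetchS4()]);
  const legacySet = new Set(legacy);
  const s4Set = new Set(s4);
  return {
    missingInS4: legacy.filter((t) => !s4Set.has(t)),
    extraInS4: s4.filter((t) => !legacySet.has(t)),
  };
}
```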

Of course, during this process we discovered some edge cases that led to discrepancies, which meant fixing them and re-running the migrations a few times until we got the green tick of success.

Step 4: Updating the consumers

We updated all the downstream applications that use DirtMate data to read from S4 instead. We put these changes behind feature flags so we could gradually roll out the change and roll back if there were any issues in production.

Step 5: Cleaning up

Once all our production systems had been reading from S4 with no issues for a couple of weeks we set to work removing the TimescaleDB and Dynamo ingestion and querying codepaths and shutting down the legacy databases.

Results

After switching to S4 we were able to increase our data retention across our features from 30 days to at least a year. Our load times for DirtMate analysis features decreased by around 80% and our storage costs decreased by around 90% — a huge success.

We still think DynamoDB and TimescaleDB are excellent products. Building an ultra simplified system that optimises for our specific requirements like this was only possible because we’d been able to rapidly develop an initial version using these general purpose tools.
