It’s time for time series

Stefano Visconti
Arduino Engineering
8 min read · May 19, 2022

TL;DR the story of why we decided to move to a dedicated TSDB and how we did it

Credits: huge thanks to Maurizio Branca @zmoog and Giuseppe Lumia @glumia8 who did most of the development work behind this article.

Arduino Internet of Things (IoT) Cloud is a service that makes it easy to transform Arduino boards into connected IoT devices that can be controlled remotely via web or mobile and can interoperate with other smart devices. To learn more about Arduino Cloud, check out our site.

One of our core use cases in IoT Cloud is the ability to collect sensor data coming from a device, such as a temperature sensor. This data may be sent every second or every few seconds. Each measurement is a value with a timestamp, all associated with the same meaning and continuously collected over time. In short, a time series.

The temperature in my office, measured by a sensor and displayed in Arduino Cloud

Use cases and initial architecture

First off, let’s list the major use cases for this system:

  • UC1 Device Connectivity: receive/send data from/to a device; this data must be passed around in real time and (depending on user choices and configuration) sent to other devices and/or user interfaces. In our system, data coming from a device is called a “variable” or a “property”.
  • UC2 Last Status: store the last status of a device permanently on the server; this is important to ensure the status can be restored on the device after a reset. IoT devices are considered “ephemeral” (their data won’t survive a reboot), so the IoT Cloud must provide a safe copy of their status. This is one of the most important requirements of the Digital Twin concept, which in Arduino Cloud is called a “Thing”.
  • UC3 Historical data: store each variable value sent by the device (especially important in the case of sensors), so that users can visualise all collected data over time (for example, the last week or the last month).

In terms of the data model, the core of this domain is a Thing (identified by a ThingID) and its associated variables (internally called “properties”, each identified by a PropertyID).
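To make this concrete, here is a minimal, hypothetical sketch of how a Thing, its properties, and the collected samples could be modelled; the field names are illustrative and not the actual Arduino Cloud schema:

from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Property:
    # A variable attached to a Thing, e.g. the temperature channel of a sensor
    property_id: str
    name: str
    value_type: str  # e.g. "float"


@dataclass
class Sample:
    # One point of a time series for a given property
    thing_id: str
    property_id: str
    timestamp: datetime
    value: float


@dataclass
class Thing:
    # The digital twin of a physical device
    thing_id: str
    properties: List[Property] = field(default_factory=list)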

Based on these use cases, the initial architecture is depicted in this diagram:

Initial architecture

Starting from the left:

  • IoT devices send data to and receive data from an MQTT broker, solving UC1. The broker exposes topics to which other components can subscribe to send or receive data. For example, a user interface (UI in the diagram) can receive data coming from devices in real time.
  • Since the MQTT broker does not offer persistence, all received data is also sent to a datastream. Our app runs on Amazon Web Services, so we used a Kinesis data stream for this, but it could be any other decoupling mechanism that ensures the data-storing delay does not “block” the real-time flow.
  • Reading from the datastream, a lambda function picks up batches of received records and stores them in a Key-Value store (we used DynamoDB for this). In the KVstore we use a lastvalues table that contains only the last sample received (keyed by ThingID and PropertyID), and a historical table with an additional timestamp column to store all received variable values (see the sketch after this list).
  • A service lastvalues-api solves UC2; it retrieves the last status of each device by querying lastvalues in the KV store. The endpoint is used, for example, by the MQTT broker to send a device its last status after a reset.
  • A service historical-api is used by the user interface to answer user queries, for example about the last month of data (UC3); those queries go against the historical table in the KVstore.
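As an illustration of the write path described above, a consumer lambda along these lines could store each record in both DynamoDB tables; the payload fields and table names are assumptions for the sketch, not our production code:

import base64
import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
lastvalues = dynamodb.Table("lastvalues")  # last sample per (ThingID, PropertyID)
historical = dynamodb.Table("historical")  # every sample, with its timestamp


def handler(event, context):
    # Kinesis delivers batches of base64-encoded records to the lambda
    for record in event["Records"]:
        payload = json.loads(
            base64.b64decode(record["kinesis"]["data"]),
            parse_float=Decimal,  # DynamoDB numbers must be Decimal, not float
        )

        item = {
            "thing_id": payload["thing_id"],
            "property_id": payload["property_id"],
            "timestamp": payload["timestamp"],
            "value": payload["value"],
        }

        # lastvalues is keyed by (thing_id, property_id): this put overwrites
        # the previous sample, keeping only the latest value
        lastvalues.put_item(Item=item)

        # historical also has the timestamp in its key, so every sample is
        # stored as a new item
        historical.put_item(Item=item)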

Problems

The above architecture is fairly simple and robust. Most of the time, all use cases are working very well. However, it’s hiding a problem for the historical data use case.

What happens if the user issues a query for a large time range?
Let’s say a month. A device might send a variable every second; doing the math (one sample per second for 30 days), that means extracting roughly 2.5 million records from the KVstore, which implies a considerable fetch time. Plus, it does not make sense to show millions of records in a chart, whatever the form of the chart.

A possible idea: the client service could perform downsampling and aggregation to reduce the number of values that feed the chart.
However, it’s hard to decide beforehand which samples to retrieve, because a naive subset of the raw samples ends up being statistically insignificant for the measurement it represents.

On the other hand, extracting all values and computing the aggregation in the client service can be compute-intensive, and even memory-intensive for some non-linear aggregations. We implemented in-memory data aggregation, but large queries had serious performance issues (a query could last minutes).
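For illustration only (this is not our actual implementation), a naive in-memory binning over the raw samples looks roughly like the sketch below; with millions of rows, fetching and holding all of them dominates the cost before the aggregation loop even starts:

from collections import defaultdict
from statistics import mean


def downsample(samples, bin_seconds=300):
    """Average (timestamp, value) pairs into fixed-size time bins.

    All raw samples (potentially millions for a one-month query) must already
    be fetched from the store and kept in memory for this to run.
    """
    bins = defaultdict(list)
    for ts, value in samples:  # ts is a Unix timestamp in seconds
        bins[ts - ts % bin_seconds].append(value)
    return sorted((bin_start, mean(values)) for bin_start, values in bins.items())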

It’s true that users won’t issue queries for large time ranges all the time.
But when that happens, performance is not acceptable.

Of course, DynamoDB can be tweaked and tuned to provide better read performance, by increasing read capacity and even adding extra capacity on demand, but that has an impact on costs and doesn’t necessarily improve performance. In fact, a single query can’t trigger the on-demand capacity increase and will still be slow. Instead, increasing read capacity upfront means wasting money most of the time just to optimise the few large queries that might happen.

Some architectural patterns come to mind, like the Database per service concept in Microservices architecture, wherein

“Each service can use the type of database that is best suited to its needs”
(see Microservices.io/Database per service).

Another well-known pattern is the Lambda architecture in data processing, where two stores are created starting from the same data feed, one optimised for real-time usage and the other for historical/batch use.

So we started asking ourselves: do we need a specialised time series database (TSDB)? Could it be used to avoid issues with large queries?

TSDB to the rescue

Architecture with a TSDB

The diagram above shows the modified architecture with an additional component (TSDB). Essentially, any data in the datastream is stored both in the KVstore (where only the last value for each incoming variable is kept) and in the TSDB (where all samples are kept, with their timestamps).
The hist-api service can now extract data from the TSDB. One could argue that the TSDB could also be used by lastv-api to resolve UC2.

However, fully embracing the Database per Service pattern, we decided to keep them separate. At a small cost in duplicated writes and storage, the major advantage is specialisation and decoupling.

We didn’t spend much time on the “which TSDB?” question.

There are many TSDBs out there, and some of them are great technologies, like InfluxDB, TimescaleDB, or those in the Prometheus arena, such as VictoriaMetrics.

In the end, one of the promises of microservices architectures is that it will be relatively easy to select a different technology in the future, since we are keeping the hist-api service self-contained and independent.

We wanted a managed TSDB that we could rely on, without too much operational burden and with reasonable costs.
Being on AWS, we started playing with AWS Timestream.

Some of the benefits of AWS Timestream include:

  • it provides automated data storage tiering based on the age of the record, i.e. more recent data are stored in high throughput, expensive in-memory storage while older records are stored in slower and cheaper magnetic storage
  • it supports data encryption, which is key to maintaining proper security
  • it supports querying via SQL, which is easy and well known, and offers a rich set of aggregation functions while extracting data
  • it provides automated scaling to support increasing volumes of data

Timestream requires you to specify, for each inserted record (see the write sketch after this list):

  • a timestamp
  • a set of dimensions — in our case PropertyID is enough since all descriptive metadata of the time series are defined elsewhere
  • a measure, which consists of measure_name, measure_value, measure_type
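For example, a single-record write with boto3 could look roughly like this; the database name and the measure name are placeholders for the sketch, not necessarily what we use:

import time

import boto3

timestream = boto3.client("timestream-write")


def write_sample(property_id: str, value: float):
    # One Timestream record = timestamp + dimensions + a single measure
    record = {
        "Dimensions": [{"Name": "propertyID", "Value": property_id}],
        "MeasureName": "value",
        "MeasureValue": str(value),
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),  # current time in milliseconds
        "TimeUnit": "MILLISECONDS",
    }
    timestream.write_records(
        DatabaseName="iot",       # placeholder database name
        TableName="historical",   # the table queried later in the article
        Records=[record],
    )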

An important decision while creating a table in Timestream is the memory vs magnetic store retention.

In our case, we saw that the most common queries on historical data are about the last day or the last week, hence we decided on 1 week of memory retention, while the magnetic store retention matches the total retention offered by our service (1 year). This configuration has an impact on the total cost and on the performance of large queries, so it should be carefully selected. Once decided, it has to be defined during table creation in the form of retention properties:

retention_properties = {
    'MemoryStoreRetentionPeriodInHours': 168,
    'MagneticStoreRetentionPeriodInDays': 365
}
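With boto3, these retention properties are passed when the table is created; a minimal sketch (database and table names are placeholders) could be:

import boto3

timestream = boto3.client("timestream-write")

# 168 hours (1 week) in memory, 365 days total on magnetic storage
retention_properties = {
    'MemoryStoreRetentionPeriodInHours': 168,
    'MagneticStoreRetentionPeriodInDays': 365
}

timestream.create_table(
    DatabaseName="iot",        # placeholder database name
    TableName="historical",
    RetentionProperties=retention_properties,
)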

The immediate advantage offered by the TSDB is that you can run queries like this one:

select bin(time, 300s) as binned_time,
       avg(measure_value::double) as avg_value,
       approx_percentile(measure_value::double, 0.9) as p90_value
from "historical"
where propertyID = '{property_id}' and time > ago(24h)
group by propertyID, bin(time, 300s)
order by binned_time

This query automatically creates 5-minute aggregate records over all the data and computes averages or other, more sophisticated aggregates such as approximate percentiles. The important point is that the number of output records is always 288 per day, regardless of the original granularity of the incoming samples. This means a constant fetch time and easy charting of the results. If the time range is extended, larger bin sizes can be used to keep the number of resulting records, and the fetch time, contained (see the sketch below).
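As an illustrative sketch (not our hist-api code), the bin size can be derived from the requested range so that the output stays around a fixed number of points, and the query can then be run with the boto3 Timestream query client; the target of 288 points and the table name are assumptions taken from the example above:

import boto3

timestream_query = boto3.client("timestream-query")

TARGET_POINTS = 288  # same density as one day at 5-minute bins


def fetch_series(property_id: str, range_hours: int):
    # Pick a bin size (in seconds) so the result has roughly TARGET_POINTS rows
    bin_seconds = max(1, (range_hours * 3600) // TARGET_POINTS)

    query = f"""
        select bin(time, {bin_seconds}s) as binned_time,
               avg(measure_value::double) as avg_value
        from "historical"
        where propertyID = '{property_id}' and time > ago({range_hours}h)
        group by propertyID, bin(time, {bin_seconds}s)
        order by binned_time
    """

    # Timestream paginates results; follow NextToken until the query is drained
    rows, next_token = [], None
    while True:
        kwargs = {"QueryString": query}
        if next_token:
            kwargs["NextToken"] = next_token
        response = timestream_query.query(**kwargs)
        rows.extend(response["Rows"])
        next_token = response.get("NextToken")
        if not next_token:
            return rows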

We populated the TSDB with millions of records and hundreds of thousands of time series and measured query performance, generally observing execution times below 1s for queries within the memory data range and below 10s for queries within the magnetic store data range. This is acceptable performance for our use case (and much better than what we were getting by fetching every single record from DynamoDB in the first architecture). The promise is that the cloud provider will automatically scale the service to keep up the query performance as the data volume grows.

Conclusions

This experience demonstrates once more that, even when two use cases look very similar in terms of data model and data flow, it might be worth keeping them separated based on their query patterns.

The difference in our case was how the same data is extracted, and the volume of extracted data. By separating services based on query patterns, we were able to select the proper data store for each of them and to optimise performance and scalability.
