Aerospike Time Series API
Aerospike is a high-performance distributed database, particularly well suited to real-time transactional processing. It is aimed at institutions and use cases that need high throughput (100k+ TPS) with low latency (95% of operations completing in under 1ms), while managing large amounts of data (TB+) with 100% uptime, scalability and low cost.
Conceptually, Aerospike is most readily categorised as a key-value database. In reality, however, it has a number of bespoke features that make it capable of supporting a much wider set of use cases. A good example is our document API, which builds on our collection data types to provide JsonPath support for documents.
Another general use case we can consider is support for time series. The combination of buffered writes and efficient map operations allows us to optimise for both reads and writes of time series data. The Aerospike Time Series API leverages these features to provide a general-purpose interface for efficient reading and writing of time series data at scale. Also included is a benchmarking tool allowing performance to be measured.
Time Series Data
Time series data can be thought of as a sequence of observations associated with a given property of a single subject. An observation is a quantity comprising two elements — a timestamp and a value. A property is a measurable attribute such as speed, temperature, pressure or price. We can see then that examples of time series might be the speed of a given vehicle; temperature readings at a fixed location; pressures recorded by an industrial sensor or the price of a stock on a given exchange. In each case the series consists of the evolution of these properties over time.
A time series API in its most basic form needs to provide:
- A function allowing the writing of time series observations
- A function allowing the retrieval of time series observations
Additional conveniences might include:
- The ability to write data in bulk (batch writes)
- The ability to query the data, e.g. calculating the average, maximum or minimum value.
Aerospike Time Series API
The Aerospike Time Series API provides the above via the TimeSeriesClient object. The API is as follows.
A DataPoint is a simple object representing an observation and the time at which it was made. The Java Date timestamp allows times to be specified to millisecond accuracy.
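The class source is not reproduced here; as a rough sketch of the shape just described (hypothetical field and method names, not the shipped source), a DataPoint pairs a java.util.Date timestamp with a double value:

```java
import java.util.Date;

// Hypothetical sketch of a DataPoint: a timestamp (millisecond
// accuracy via java.util.Date) plus the observed value.
class DataPoint {
    private final Date timestamp;
    private final double value;

    DataPoint(Date timestamp, double value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    Date getTimestamp() { return timestamp; }
    double getValue() { return value; }
}
```

For example, `new DataPoint(new Date(), 11.5)` would represent a reading of 11.5 taken at the current instant.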
The code example below shows us inserting a series of 24 temperature readings, taken in Trafalgar Square, London, on the 14th February 2022. We give the time series a meaningful and precise name by concatenating subject, property and units.
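To illustrate the naming convention and the shape of such an insert, here is a self-contained sketch that builds the series name and the 24 hourly observation times. The name, its separator and the helper class are illustrative assumptions; the actual TimeSeriesClient write calls are omitted.

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

class TemperatureSeriesExample {
    // Series name built by concatenating subject, property and units.
    // The separator is an illustrative choice, not mandated by the API.
    static final String SERIES_NAME = "TrafalgarSquare-Temperature-Celsius";

    // Generate 24 hourly observation timestamps for 14th February 2022.
    static List<Date> hourlyTimestamps() {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("Europe/London"));
        cal.clear();
        cal.set(2022, Calendar.FEBRUARY, 14, 0, 0, 0);
        List<Date> timestamps = new ArrayList<>();
        for (int hour = 0; hour < 24; hour++) {
            timestamps.add(cal.getTime());
            cal.add(Calendar.HOUR_OF_DAY, 1);
        }
        return timestamps;
    }
}
```

Each timestamp would then be paired with its temperature reading in a DataPoint and written to the series via the client.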
As a diagnostic, we can get some basic information about the time series
which will give
Another diagnostic allows the time series to be printed to the command line
Finally we can run a basic query
Note we could alternatively have used the batch put operation, which ‘puts’ all the points in a single operation.
There are two key implementation concepts to grasp. Firstly, rather than storing each data point as a separate object, data points are inserted into Aerospike maps. This minimises network traffic at write time (we only ‘send’ the new point) and allows large numbers of points to be retrieved at read time, as they are encapsulated in a single object. It also helps minimise memory usage, as Aerospike has a fixed (64 byte) overhead for each object. Schematically, each time series object looks something like
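The block structure just described can be sketched in plain Java (hypothetical class and method names; in the real API the map lives in a bin of an Aerospike record and is manipulated server-side via map operations):

```java
import java.util.TreeMap;

class TimeSeriesBlockSketch {
    // One record holds many points in a single map:
    // timestamp (epoch millis) -> observed value.
    // A single put sends only the new entry over the network;
    // a single read retrieves the whole block.
    static TreeMap<Long, Double> block = new TreeMap<>();

    static void put(long timestampMillis, double value) {
        block.put(timestampMillis, value);
    }

    // Range read: all points between two timestamps, inclusive.
    static TreeMap<Long, Double> getRange(long from, long to) {
        return new TreeMap<>(block.subMap(from, true, to, true));
    }
}
```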
The maps must not be allowed to grow indefinitely, so the API ensures that each map will not exceed a specified maximum size. By default this limit is 1000 points, although it can be altered (see additional control). The README also discusses the sizing and performance considerations associated with this setting.
The second implementation point follows on from the first. As there is a limit to the number of points that can be stored in a block, we need a mechanism for creating new blocks and keeping track of existing blocks for each time series. This is done, on a per-series basis, by maintaining an index of all blocks created. Conceptually this looks something like the following
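As with the block structure, a plain-Java sketch of such an index may help (hypothetical names; the real API persists the index in Aerospike rather than in client memory):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class TimeSeriesIndexSketch {
    // Per-series index: series name -> sorted start timestamps
    // of the blocks created for that series.
    static Map<String, TreeSet<Long>> index = new HashMap<>();

    static void registerBlock(String seriesName, long blockStartMillis) {
        index.computeIfAbsent(seriesName, k -> new TreeSet<>()).add(blockStartMillis);
    }

    // Locate the block that would contain a given timestamp:
    // the latest block starting at or before it.
    static Long blockFor(String seriesName, long timestampMillis) {
        TreeSet<Long> blocks = index.get(seriesName);
        return blocks == null ? null : blocks.floor(timestampMillis);
    }
}
```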
The Time Series API ships with a benchmarking tool. Three modes of operation are provided — real time insert, batch insert and query. For details of how to download and run see the benchmarking section of the README.
Real Time Benchmarking
As a simple example, let’s insert 10 seconds of data for a single time series, with observations being made once per second.
We can make use of another utility, ./timeSeriesReader.sh, to see the output. This can be run for a named time series or, alternatively, will select a time series at random.
Here is sample output for our simple example
We can see that sample points were generated over a ten-second period, with the series given a random name.
The benchmarker can be run at greater scale using the -c (time series count) flag. You may also wish to make use of the -z (multi-thread) flag in order to achieve the required throughput. The benchmarker will warn you if the required throughput is not being achieved.
Another real time option is acceleration via the -a flag, which runs the simulation at an accelerated rate. For instance, if you wished to insert points every 30 seconds over a 1 hour period (120 points), you could shorten the run with ‘-a 30’. This ‘speeds up’ the simulation by a factor of 30, so it takes only 120 seconds. Higher factors are also possible. The benchmarker will report the actual update rates. For example
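The arithmetic behind the acceleration factor can be expressed as a small sketch (a hypothetical helper, not part of the API):

```java
class AccelerationSketch {
    // Accelerated simulation: wall-clock time = simulated time / acceleration.
    // E.g. 120 points at 30s intervals = 3600s of simulated time;
    // with -a 30 the run takes 3600 / 30 = 120 seconds.
    static long wallClockSeconds(long points, long intervalSeconds, long acceleration) {
        return points * intervalSeconds / acceleration;
    }
}
```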
A disadvantage of the ‘real time’ benchmarker is precisely that: the loading occurs in real time. You may wish to build your sample time series as quickly as possible; the batch insert mode is provided for this purpose.
In this mode, data points are loaded a block at a time, effectively as fast as the benchmarker will run. The invocation below, for example, will create 1000 sample series (-c flag) over a period of 1 year (-r flag), with 30 seconds between each observation.
Having two different methods for generating data now puts us in the position where we can consider query benchmarking. This is the third and final aspect of the benchmarking toolkit.
Query benchmarking can be invoked via the ‘query’ mode. We choose how long to run the benchmarker for (-d flag) and the number of threads to use (-z flag).
At runtime, the benchmarker scans the database to determine all available time series. Each iteration of the benchmarker selects a series at random and calculates its average value. This necessitates pulling all data points for the series to the client side and performing the calculation there, so it is a good test of the query capability. We can ensure the queries are consistent in terms of data point count by using the batch insert mode of the benchmarker, which ensures all series have the same number of data points.
Sample invocation and output
The Aerospike Time Series API contains a realistic simulator, which is made use of by the Benchmarker.
Many time series, over a short period at least, follow a Brownian motion. The TimeSeriesSimulator allows this to be simulated. The idea is that if we look at the relative change in our observed value, the expected mean change should be proportional to the time between observations, and the expected variance should similarly be proportional to the period in question. Formally, let X(τ) be the observation of the subject property X at time τ. After a time t, let the value of X be X(τ+t). The simulation distributes the relative change in X, (X(τ+t) − X(τ)) / X(τ), as a normal distribution with mean μt and variance σ²t.
(X(τ + t) − X(τ)) / X(τ) ~ N(μt, σ²t)
More detail is available at simulation but it is useful to see that the net effect of the above is to produce sample series such as the one shown below
We can see it looks very much like the sort of graph we might see for a stock price.
More complex time series, e.g. those seen for temperatures, might be simulated by concatenating several series with different drifts and volatilities, allowing values to trend both up and down. Mean-reverting series can be simulated by setting the drift to zero.
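The distribution above translates into a simple simulation step, sketched below (a hypothetical helper, not the TimeSeriesSimulator source; drift μ and volatility σ are expressed per unit of time):

```java
import java.util.Random;

class BrownianMotionSketch {
    // One step of the relative-change model:
    // (X(tau+t) - X(tau)) / X(tau) ~ N(mu*t, sigma^2*t),
    // so the relative change over dt is mu*dt plus a Gaussian
    // term scaled by sigma*sqrt(dt).
    static double nextValue(double current, double mu, double sigma,
                            double dt, Random rng) {
        double relativeChange = mu * dt + sigma * Math.sqrt(dt) * rng.nextGaussian();
        return current * (1 + relativeChange);
    }
}
```

Iterating nextValue with a fixed dt produces a sample path; with σ = 0 the series grows deterministically at the drift rate.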
Real Life Performance
As a test, performance was examined on an Aerospike cluster deployed on 3 i3en.2xlarge AWS instances. This instance type was selected as the ACT rating of the drives is 300k, making the arithmetic simple.
In simple terms, this cluster can then support 100k operations per second (see Performance Considerations) × 1.5KB × 3 (number of instances) = 450MB/s of throughput.
We know our average write is ~8KB. We assume replication factor two for resilience purposes. Sustainable updates per second is then 450MB/s ÷ 2 (replication factor) ÷ 8KB ≈ 28,000.
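That back-of-envelope calculation can be captured as a small sketch (a hypothetical helper, not part of the API):

```java
class ThroughputEstimate {
    // Per-instance device throughput * instance count, divided by
    // replication factor and average write size, gives the
    // sustainable update rate.
    static long sustainableUpdatesPerSecond(long bytesPerSecPerInstance,
                                            int instances,
                                            int replicationFactor,
                                            long avgWriteBytes) {
        return (bytesPerSecPerInstance * instances)
                / (replicationFactor * avgWriteBytes);
    }
}
```

With 150MB/s per instance (100k × 1.5KB), 3 instances, replication factor 2 and 8KB writes, this gives 28,125, i.e. roughly the 28,000 quoted above.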
In practice a 50k update rate was easily sustained using the real time benchmarker. The value is higher because larger writes do not necessarily incur a proportionally larger penalty than small writes. Also, the ACT rating guarantees that operations are sub-1ms in latency 95% of the time, a guarantee not necessarily needed for time series inserts.
The cost of such a cluster would be $23k per year using on-demand pricing ($0.90 / hour / instance) or $16k per year ($0.61 / hour / instance) using a reserved pricing plan.
Queries retrieving 1 million points per query (1 year of observations every 30 seconds) were able to run at a rate of two per second, with end-to-end latency of ~0.5 seconds, for a sustained period using the benchmarking tool.
At the time of writing, this is the initial release of the API. Further developments should be expected. Future iterations may include
- Data compression following the Gorilla approach which potentially allows data footprint to be reduced by 90%
- Labelling of data to support the easy retrieval of multiple properties for a subject. For example, several sensors may be attached to an industrial machine; it may be convenient to retrieve all these series simultaneously for analysis purposes.
- A REPL (read/eval/print/loop) capability to support interrogative analysis
The Time Series Client is available at Maven Central — aero-time-series-client. You can download directly or by adding the below to your pom.xml file.