An introduction to time series databases
Time series databases (TSDBs) are relatively new compared with RDBMS, NoSQL, and even NewSQL, but they are trending with the growth of system monitoring and the Internet of Things. The Wikipedia definition of time series data is a series of data points indexed (or listed or graphed) in time order. When it comes to time series databases, I think the best definition is: storing client history on a server for analysis. Time series data is history; it is immutable, unique, and sortable. For instance, a log record like "the CPU usage at 2017-09-03-21:24:44 is 10.02% for machine-01 in TR-1232211xxx host" won't change over time the way a bank account balance does. There is no update once it is generated; the CPU usage at the next second, or from a different machine, is a different data point. And the order in which data arrives at the server does not affect correctness, because you can remove duplicates and sort by client timestamp.

Clients of a time series database send their history to the server and remain functional when the server is down. This is because sending data to a TSDB is not critical for many clients: an HTTP server's main job is serving content, not reporting status codes to a TSDB. A relational database, by contrast, is treated as the single source of truth and affects the client's critical decision making. This leads to very different read and write patterns. For instance, a banking application needs to query the database for a user's balance before proceeding, reading and updating a single record. Most time series database clients, however, are either write-only (collectors) or read-only (dashboards and alerting systems). And when they read, they read in large batches: "show CPU usage for the last 1h" is used far more often than "show CPU usage at 2018-09-30-21:24:44", because a time series data point is not that useful without its context.
Time series data is so different that, with general-purpose database management systems, people are usually forced to use the database in unusual ways (for example, VividCortex with MySQL, or Timescale with Postgres). This approach brings its own headaches and performance concerns, so some people decided that a special problem needs a special solution. That's where time series databases come into the picture: many TSDBs (Graphite, InfluxDB, etc.) are written from scratch, without depending on existing databases.
Evolution of time series databases
There are too many time series databases to cover, so I only list the ones I personally consider milestones in the evolution of the field; feel free to comment on pieces I missed. I can't find the real initial release of many databases, so I just use the oldest release on GitHub:
- 1999/07/16 RRDTool First release
- 2009/12/30 Graphite 0.9.5
- 2011/12/23 OpenTSDB 1.0.0
- 2013/05/24 KairosDB 1.0.0-beta
- 2013/10/24 InfluxDB 0.0.1
- 2014/08/25 Heroic 0.3.0
- 2017/03/27 TimescaleDB 0.0.1-beta
RRDTool: RRDTool was created to graph network traffic. It ships with its own graphing tool, while modern TSDBs normally depend on Grafana for graphing.
Graphite: Graphite came later, written in Python instead of C like RRDTool. Its storage engine is called Whisper. It is much more powerful for data processing and querying, but it does not scale well.
OpenTSDB: OpenTSDB, created at StumbleUpon, solves the scaling problem by using HBase.
KairosDB and Heroic: KairosDB started as a fork of OpenTSDB that supported Cassandra as an alternative backend, but its developers found that staying compatible with HBase limited Cassandra's potential, so they dropped HBase and used Cassandra only. Ironically, a recent release of OpenTSDB added support for Cassandra. Heroic then came out because its creators were not satisfied with KairosDB's performance and direction.
InfluxDB: InfluxDB started fully open source, but later closed the source of its cluster version to keep the company running. There is an interesting talk on this called The Open Source Database Business Model is Under Siege from Percona Live, which features a time series session.
TimescaleDB: TimescaleDB is built on PostgreSQL as an extension, rather than a special schema.
Time series data model
Time series data can be split into two parts: series and data points. The series is the identifier, like CPU usage for machine-01 in xxxxx domain; the data points are an array of points, where each point is a timestamp and a value.
For series, the main goal is extensibility for post processing (searching, filtering, etc.). For example, if you want the CPU usage of all machines in the xxxxx domain, and the identifier of the series CPU usage for machine-01 in xxxxx domain is name=cpu.usage machine=machine-01 domain=xxxxx, then the query becomes name=cpu.usage machine=* domain=xxxxx.
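As a minimal sketch (not any particular TSDB's API), tag-based series identifiers and wildcard queries can be modeled as dictionaries, where `*` matches any tag value:

```python
def matches(series_tags, query):
    """Return True if every query tag matches the series, '*' being a wildcard."""
    return all(series_tags.get(k) == v or v == "*" for k, v in query.items())

series = {"name": "cpu.usage", "machine": "machine-01", "domain": "xxxxx"}

# "CPU usage of all machines in the xxxxx domain":
query = {"name": "cpu.usage", "machine": "*", "domain": "xxxxx"}
print(matches(series, query))  # True
```

A real database evaluates such queries against millions of series, which is why the index structures below matter.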
To deal with a large number of series and wildcard matching, an (inverted) index is needed. Some databases use an external search engine, like Heroic with Elasticsearch; some write their own, like InfluxDB and Prometheus.
For data points there are two models: an array of points, [{t: 2017-09-03-21:24:44, v: 0.1002}, {t: 2017-09-03-21:24:45, v: 0.1012}], or two arrays for timestamps and values respectively, [2017-09-03-21:24:44, 2017-09-03-21:24:45] and [0.1002, 0.1012]. The former is a row store, the latter a column store (not to be confused with a column family). When building a TSDB on top of an existing database (Cassandra, HBase, etc.), the former is used more; for a TSDB written from scratch, the latter is more popular.
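The two models above can be sketched directly in code (using integer epoch timestamps for brevity); converting between them is trivial, but the column form keeps values of the same type contiguous, which is what enables the compression discussed later:

```python
# Row store: one record per point.
rows = [
    {"t": 1504473884, "v": 0.1002},
    {"t": 1504473885, "v": 0.1012},
]

# Column store: parallel arrays for timestamps and values.
timestamps = [1504473884, 1504473885]
values = [0.1002, 0.1012]

def to_columns(rows):
    """Pivot a row-store point list into timestamp and value columns."""
    return [r["t"] for r in rows], [r["v"] for r in rows]
```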
TSDBs are actually a subset of OLAP, and the columnar format brings a higher compression ratio and faster queries.
Hot topics in Time series databases
Fast response
A time series database is used for analysis, and nobody wants to wait in front of a dashboard while the production system is failing and users' complaint calls are coming in, so fast response is a baseline requirement for any production-ready time series database.
The most straightforward way is to keep as much data in memory as possible. Facebook built Gorilla, now open sourced as Beringei; its main contribution is using time-series-specific compression to fit more data in memory.
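One of Gorilla's timestamp tricks is delta-of-delta encoding. The sketch below shows only the arithmetic, not Gorilla's actual variable-length bit packing: since monitoring points usually arrive at a fixed interval, the second-order deltas collapse to a run of zeros, which takes very few bits to store:

```python
def delta_of_delta(timestamps):
    """Encode timestamps as (first value, first delta, delta-of-deltas).
    Regularly spaced series yield all-zero delta-of-deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dod = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dod

# Points arriving every 10 seconds compress to a run of zeros:
print(delta_of_delta([1000, 1010, 1020, 1030]))  # (1000, 10, [0, 0])
```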
Another way to speed things up is pre-aggregation, also known as roll up; see the Akumuli and BtrDB projects on GitHub.
Because queries often cover a long time range at coarse granularity, like the average daily CPU usage from June 1 to August 1, those aggregations (average, min, max) can be computed when ingesting data. BtrDB and Akumuli store aggregates in upper-level tree nodes, so fine-grained data need not be loaded when the query is coarse grained.
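A hypothetical single-level roll-up (the tree structure of BtrDB and Akumuli generalizes this to multiple granularities) can maintain min/max/sum/count per coarse bucket as points arrive, so a coarse query never touches the raw points:

```python
class Rollup:
    """Toy ingest-time roll-up: one min/max/sum/count record per time bucket."""
    def __init__(self, bucket_seconds):
        self.bucket_seconds = bucket_seconds
        self.buckets = {}  # bucket start -> [min, max, sum, count]

    def ingest(self, ts, value):
        start = ts - ts % self.bucket_seconds
        b = self.buckets.setdefault(start, [value, value, 0.0, 0])
        b[0] = min(b[0], value)
        b[1] = max(b[1], value)
        b[2] += value
        b[3] += 1

    def average(self, start):
        b = self.buckets[start]
        return b[2] / b[3]

r = Rollup(3600)  # hourly buckets
for ts, v in [(0, 0.1), (60, 0.3), (3700, 0.5)]:
    r.ingest(ts, v)
print(r.average(0))  # → 0.2
```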
A proper ingest format can also reduce response time for both reads and writes. JSON is widely used, but a binary format is much better than a textual one when many numbers are involved; Protocol Buffers could be a good choice.
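A quick size comparison illustrates the point. Using a fixed binary layout here (an 8-byte integer timestamp plus an 8-byte double, via Python's struct module) as a stand-in for a real wire format like protobuf:

```python
import json
import struct

point = (1504473884, 0.1002)

text = json.dumps({"t": point[0], "v": point[1]}).encode()
binary = struct.pack("<qd", *point)  # little-endian int64 + float64

print(len(text), len(binary))  # 30 16
```

The binary form is roughly half the size here and, just as importantly, needs no number parsing on the server side.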
Retention
Not all time series data is useful all the time. If the system has been working well for the last two months, fine-grained data can be dropped and only coarse-grained data kept. This is the default behavior of RRDTool and Graphite, but not the case for many newer, scalable TSDBs.
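The idea can be sketched as a hypothetical retention pass (RRDTool's actual round-robin archives work differently): raw points inside the fine-grained window are kept, while older points are replaced by one averaged point per coarse bucket:

```python
def apply_retention(points, now, fine_window, coarse_step):
    """Keep raw (timestamp, value) points newer than fine_window;
    downsample older points to one average per coarse_step bucket."""
    recent = [(t, v) for t, v in points if now - t <= fine_window]
    old = [(t, v) for t, v in points if now - t > fine_window]
    buckets = {}
    for t, v in old:
        buckets.setdefault(t - t % coarse_step, []).append(v)
    coarse = [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]
    return coarse + recent
```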
Deleting a file on local disk is easy, but updating a large amount of data in a distributed environment requires more caution to keep the system up all the time: you don't want your monitoring system to fail before the system it is monitoring does.
Metadata indexing
The series identifier is, in general, the only metadata in a time series database. Databases like Heroic use Elasticsearch to store metadata: a query first goes to Elasticsearch to retrieve the series ids, then the data is loaded from Cassandra by id. A full search engine like Elasticsearch is certainly powerful, but the overhead of maintaining another system, and the time spent coordinating and communicating between the two, can't be ignored. Also, some TSDB-specific optimizations may not be available when you don't have full control over how the metadata index is built and stored. So InfluxDB and Prometheus wrote their own inverted indexes for metadata.
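The core of such an inverted index is small. This is a minimal in-process sketch, not InfluxDB's or Prometheus's actual structure: each tag pair maps to a posting set of series ids, and a query intersects the posting sets of its tag pairs:

```python
from collections import defaultdict

index = defaultdict(set)  # (tag key, tag value) -> set of series ids

def add_series(series_id, tags):
    """Register a series under every one of its tag pairs."""
    for k, v in tags.items():
        index[(k, v)].add(series_id)

def query(**tags):
    """Series matching all given tag pairs: intersect their posting sets."""
    sets = [index[(k, v)] for k, v in tags.items()]
    return set.intersection(*sets) if sets else set()

add_series(1, {"name": "cpu.usage", "machine": "machine-01", "domain": "xxxxx"})
add_series(2, {"name": "cpu.usage", "machine": "machine-02", "domain": "xxxxx"})
print(query(name="cpu.usage", domain="xxxxx"))  # {1, 2}
```

Writing this in-house lets a TSDB co-locate the index with the data and apply time-series-specific tricks, at the cost of reimplementing what a search engine provides for free.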