The Rise of Time Series Databases
I have been working with time series data for more than 15 years now (I prefer not to count the exact number of years) and have been really intrigued by the rise of time series databases in the last few years.
During my professional career I have worked with SQL databases and a couple of different data modeling strategies. I believe that for problems dealing with a fixed set of variables/metrics and time resolutions no finer than about 5 to 10 minutes, you'll be just fine using your well-known and reliable SQL database.
However, when you need to deal with different metrics and finer-grained time resolutions, things start getting complicated in the SQL world. Consider gathering 10 different metrics every second from 10 different sources. To accomplish this, you'd probably come up with a table with columns similar to this list:
- Date and time
- Data source
So what could be wrong with that? Just do the math. In one year of data (if it's not a leap year) you'd have 3,153,600,000 rows. Yes, that's it, ~3 billion rows for just one year. Now, imagine having to scan over these rows every time you need to perform data aggregation. With such a simple database schema your SQL database performance will degrade over time; table indexes and SSDs will alleviate the problem but won't solve it.
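The arithmetic is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one row per metric per source per sample, and the function name and parameters are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_year(metrics, sources, interval_seconds=1, days=365):
    """Rows written in one year, assuming one row per metric per source per sample."""
    samples_per_year = days * SECONDS_PER_DAY // interval_seconds
    return metrics * sources * samples_per_year


# 10 metrics from 10 sources, sampled every second
print(f"{rows_per_year(10, 10):,}")                        # 3,153,600,000

# The same setup sampled every 5 minutes instead
print(f"{rows_per_year(10, 10, interval_seconds=300):,}")  # 10,512,000
```

The second number is why a 5 to 10 minute resolution is usually comfortable for a regular SQL database: it is about 300 times fewer rows for the same metrics and sources.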
If you're lucky enough that you don't need to keep this data around for too long, you can probably get around it. I'm considering the cases where you need to keep data for a long time (in my case I cannot delete data) and medium to large software development projects dealing with hundreds to thousands of different data sources with more than just ten metrics. In those scenarios it won't take long before you are tweaking and squeezing your database schema and parameters to try to keep performance at acceptable levels.
At some point you'll start, like I did, playing around with partitioning/sharding and less obvious table schemas like RRD (Round Robin Databases), and you may ask yourself the questions I've been trying to answer lately: "Is a SQL database the right tool for high-frequency time series data?" and "Are there other reliable solutions available?"
I'm still working on this but, yes, it looks like there are good alternatives. Some interesting technological innovations are happening in this field to address this kind of problem, driven mostly by the "Internet of Things" thing. Many open source time series databases have been proposed lately and I'm evaluating them for my projects.
If you are interested in this subject, stay tuned for posts with evaluations of some of the promising databases for time series data, such as Cassandra, MongoDB and InfluxDB, as well as the less popular ones that I'm coming across during this research.