Time series analysis and beyond

Usama Ahmed
Aug 31, 2018 · 8 min read

Many of you may wonder what time series data is and want to learn about its applications. In this article we are going to cover a variety of topics related to time series (e.g. continuous queries).

timeseries dashboard

What is time-series data? (An imaginary story)

One day, I asked my friend Khaled to stand on the balcony and follow the change in temperature using an air thermometer. Actually, my friend did the job well, after getting a beanbag.

It was 12:00 PM at that time, and each time the temperature was measured, a data point was created, containing information about the measurement and the timestamp at which the measurement took place.


What is a data point?

+----------------------+--------+
| timestamp            | value  |  <-- metric
+----------------------+--------+
| 1998-04-13T15:48:16Z | 43 c   |  <-- 43 Celsius
| 1998-04-13T15:49:16Z | 42 c   |
+----------------------+--------+

A data point is a combination of a timestamp (e.g. 892482496000000000) and from zero to many metrics (e.g. value = 0.7).

  • Every row is a data point
  • A point is equivalent to a row in an RDBMS
  • A measurement is equivalent to a table in an RDBMS
  • Metric, column, and field all have the same meaning
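To make the terminology concrete, here is a minimal Python sketch of a data point and a measurement. The structure is illustrative only, not InfluxDB's actual storage format:

```python
from datetime import datetime, timezone

# Hypothetical in-memory shape of a data point: one timestamp
# plus zero-to-many metrics (fields).
point = {
    "timestamp": datetime(1998, 4, 13, 15, 48, 16, tzinfo=timezone.utc),
    "fields": {"temperature": 43},  # 43 Celsius
}

# A measurement is a named collection of such points,
# just like a table is a collection of rows in an RDBMS.
measurement = {"name": "air_temperature", "points": [point]}

# The epoch timestamp from the text (892482496000000000) is the
# same instant expressed in nanosecond precision:
ns = int(point["timestamp"].timestamp()) * 1_000_000_000
print(ns)  # 892482496000000000
```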

At the end of the day

By the end of the day, we had measured the temperature nearly 1,440 times, once every minute. This collection of data points is called a series.

+-----------+----------+----------+---------+----------+----------+
| timestamp | 12:00 PM | 03:00 PM | 9:00 PM | 12:00 AM | 04:00 AM |
+-----------+----------+----------+---------+----------+----------+
| metric    |    33    |    37    |    33   |    25    |    20    |
+-----------+----------+----------+---------+----------+----------+

Time series definition:

A sequence of data points, measuring the same thing over time, stored in time order.

Is any number of points called a series?

The answer is NO! Let's assume that we need to measure the water level in Santa Monica, so we set up a station for this purpose. Later, we are asked to measure the water level elsewhere in California too. Now we have data from different sources, and each data source has its own series:

  • Santa Monica series
  • California series

Points that have the same tag belong to the same series. Note that tags are equivalent to indexes in an RDBMS.

+----------------------+--------------+-------------+
| time                 | location     | water level |
+----------------------+--------------+-------------+
| 2015-08-18T00:00:00Z | santa_monica | 2.064       |
| 2015-08-18T00:00:00Z | coyote_creek | 8.12        |
| 2015-08-18T00:06:00Z | coyote_creek | 8.005       |
| 2015-08-18T00:06:00Z | santa_monica | 2.116       |
+----------------------+--------------+-------------+
- time is the timestamp field, which acts as the primary key
- location is a tag
- water level is a field
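Grouping points into series by tag can be sketched with a toy in-memory model that mirrors the table above:

```python
from collections import defaultdict

# The water-level points from the table; each has a timestamp,
# a tag (location), and a field (water level).
points = [
    {"time": "2015-08-18T00:00:00Z", "location": "santa_monica", "water_level": 2.064},
    {"time": "2015-08-18T00:00:00Z", "location": "coyote_creek", "water_level": 8.12},
    {"time": "2015-08-18T00:06:00Z", "location": "coyote_creek", "water_level": 8.005},
    {"time": "2015-08-18T00:06:00Z", "location": "santa_monica", "water_level": 2.116},
]

# Points that share the same tag value land in the same series.
series = defaultdict(list)
for p in points:
    series[p["location"]].append((p["time"], p["water_level"]))

print(sorted(series))  # ['coyote_creek', 'santa_monica']
```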

Also, if we measure the water level at two different places in the same city, this would affect the number of series that we have.

For example, the following series:

  • Santa Monica / Location A
  • Santa Monica / Location B
  • California / Location A
  • California / Location B

You must have noticed that we are measuring the same thing over time.

Does having a timestamp column mean that you have time series data?

In fact, no. In time series data, time is the primary axis; you must find it on the graph (the X or Y axis).

Time series are also used to record the prices of cryptocurrencies. Each cryptocurrency has its own series, each series consists of many points, and every single point has a timestamp of measurement, fields, and tags (to attach the point to its series).

If you look at this graph, you will find that I inspected one of the points in the Bitcoin series, which carries the following information: (Wednesday, Aug…) along with the price of Bitcoin at that timestamp.


Usage

  • Monitoring software systems: Virtual machines, containers, services, applications
  • Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies
  • Asset tracking applications: Vehicles, trucks, physical containers, pallets
  • Financial trading systems: Classic securities, newer cryptocurrencies
  • Eventing applications: Tracking user/customer interaction data
  • Business intelligence tools: Tracking key metrics and the overall health of the business
  • Google uses time series data in Google Trends, where you can write a specific keyword and get an analysis of that word throughout the year. Keywords can also be compared; think of it like each keyword having its own series. In a comparison you may find that when one keyword rises in trending, another goes down. That's it: you can always compare things.
This graph shows that Turki suddenly became a trend in search after he became president of the Al Ahli club.
The pound became a trend in Egypt after Mohamed Salah got a new contract with Liverpool FC.

Data growth

Let's assume that we no longer have a thermometer; instead, we have temperature sensors distributed throughout the country. Each sensor has its own series, so we can track each sensor's behavior individually. Day after day, the data size increases, based on many factors, among them: how many sensors do you have?

Data growth affects the system in terms of performance and hard disk capacity. A developer volunteered to solve the problem by scaling vertically; although it is a quick fix, it will not solve the problem. If you're running low on RAM, or you're exhausting your available CPU cycles, or you're running low on disk space, what's the easiest, most obvious solution to that problem?

Get a better processor and more RAM. :D

Good: get more RAM, a faster processor, more disk space, and just throw resources (or, equivalently, money) at the problem. Unfortunately, there is a catch here. There's a ceiling on what you can do. Why? Why is vertical scaling not necessarily a full solution?

At some point, you're going to exhaust either your financial resources or the state of the art in technology, because the world simply hasn't made a machine with as many resources as you need. So, you need to get a little smarter.

After making many attempts at scaling vertically…

Downsampling

another developer came along with the perfect solution to the problem.

If we have configured our sensors to provide us with one reading per minute, we end up with 30 points per sensor every half hour. We have to ask ourselves: does the temperature actually change every minute? Of course not; the change per minute is unnoticeable, and the temperature only changes over longer intervals. So what about calculating the average of every 30 minutes?

For example, if we had a table called temperature that holds the raw data collected by sensors, let's create another one called downsampled_temperature which contains the average of every 30 minutes: group each 30 minutes of data together, calculate their average, and persist the average in downsampled_temperature. You would surely notice the performance difference between 30 points per half hour and 1 point (the average). Also, if someone asks you about the temperature at a particular minute and you respond with the average of the 30-minute window that minute belongs to, it would be okay; the missing precision may be a decimal fraction like 0.2 or so, which is not a big deal.
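The 30-minute averaging can be sketched in plain Python; the readings here are fabricated for illustration:

```python
from datetime import datetime, timedelta
from statistics import mean

# One hour of fabricated per-minute readings in the raw table.
start = datetime(2018, 8, 31, 12, 0)
temperature = [(start + timedelta(minutes=i), 30 + 0.1 * i) for i in range(60)]

# Group readings into 30-minute buckets by truncating the timestamp.
buckets = {}
for ts, value in temperature:
    bucket = ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)
    buckets.setdefault(bucket, []).append(value)

# One averaged point replaces every 30 raw points.
downsampled_temperature = {b: mean(v) for b, v in buckets.items()}
print(downsampled_temperature)
```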

Read queries no longer have direct access to the temperature measurement; instead, queries are handled by downsampled_temperature. What about temperature? It is just a mailbox that keeps received data inside until someone takes it, computes the result, and moves the result on.

Who is that someone?

The continuous query, introduced in InfluxDB, looks like a cron job running somewhere, every time period, to execute some queries; it may include the queries you write to be responsible for the downsampling process.

CREATE CONTINUOUS QUERY "sensor_downsample_1h" ON "my-influx-db"
RESAMPLE FOR 2h
BEGIN
  SELECT
    mean(Temperature) AS "Temperature",
    mean(Humidity) AS "Humidity",
    min(Temperature) AS "TemperatureMin",
    min(Humidity) AS "HumidityMin",
    max(Temperature) AS "TemperatureMax",
    max(Humidity) AS "HumidityMax",
    count(Temperature) AS "SampleCount"
  INTO "my-influx-db"."autogen"."sensor.downsample.1h"
  FROM "my-influx-db"."autogen"."sensor"
  GROUP BY time(1h), * fill(none)
END

Let me clarify the code above. First…

You: “Excuse me! Can you create a continuous query for me?”

Influx: “Sure dude, what are you gonna call it?”

You: “I wanna call it sensor_downsample_1h (the name is just a description of what the query does), and I want it to run on the my-influx-db database. I also need it to run every hour.”

Influx: “Done. What kind of queries do you want?”

You: “Here comes the interesting part. I want you to do me a favor: grab the last hour of data (now() - 1h) and perform some operations on it. Calculate the average Temperature over the last hour as "Temperature", and do the same with Humidity. Also get the minimum temperature value as TemperatureMin, calculate TemperatureMax, then persist the result of this query in another measurement called sensor.downsample.1h.”
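The mailbox metaphor and the cron-like behavior can be put together in a toy Python model of what a continuous query does. The names are hypothetical, and this is nothing like InfluxDB's real implementation:

```python
import statistics

temperature = []              # the "mailbox" of raw points
downsampled_temperature = []  # the averaged points that readers query

def continuous_query():
    """Runs periodically (here: called by hand), like a cron job."""
    global temperature
    if not temperature:
        return
    # Drain the mailbox and persist one averaged point.
    downsampled_temperature.append(statistics.mean(temperature))
    temperature = []

temperature.extend([30, 32, 34])
continuous_query()
print(downsampled_temperature)  # [32]
```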


Retention policy

Time series data by nature begins to pile up pretty quickly and it can be helpful to discard old data after it’s no longer useful. Retention policies offer a simple and effective way to achieve this. It amounts to what is essentially an expiration date on your data. Once the data is “expired” it will automatically be dropped from the database, an action commonly referred to as retention policy enforcement. When it comes time to drop that data however, InfluxDB doesn’t just drop one data point at a time; it drops an entire shard group.

So it's okay to set the temperature retention policy to one hour, and the downsampled_temperature retention policy to one year.
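Retention-policy enforcement can be sketched as this toy point-by-point version (as noted above, InfluxDB really drops whole shard groups at once):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=1)

def enforce_retention(points, now):
    """Keep only the points younger than the retention duration."""
    cutoff = now - RETENTION
    return [(ts, v) for ts, v in points if ts >= cutoff]

now = datetime(2018, 8, 31, 12, 0, tzinfo=timezone.utc)
temperature = [
    (now - timedelta(minutes=90), 30.5),  # expired, will be dropped
    (now - timedelta(minutes=30), 31.0),  # still within the policy
]
print(enforce_retention(temperature, now))
```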

system dashboard
