Cloud-based stream analytics or Time-series analytics?

Mattia Nicolella (Nick1296)
May 21, 2019

What is a time series?

A time series is a data set in which order and time are central to the meaning of the data. Its elements usually have an internal structure, such as autocorrelation.
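To make the idea of autocorrelation concrete, here is a minimal sketch in plain Python that measures how strongly each value in a series is related to the value a few steps before it. The function name `autocorrelation` and the temperature readings are made up for illustration; a real analysis would more likely rely on a statistics library.

```python
from statistics import mean


def autocorrelation(values, lag=1):
    """Sample autocorrelation of `values` at the given lag."""
    if lag <= 0 or lag >= len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    mu = mean(values)
    # Denominator: total variation of the series around its mean.
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0:
        return 0.0  # a constant series carries no correlation information
    # Numerator: co-variation between each value and the one `lag` steps earlier.
    num = sum(
        (values[i] - mu) * (values[i - lag] - mu)
        for i in range(lag, len(values))
    )
    return num / denom


# Hypothetical hourly temperature readings: slowly rising, then falling.
temperatures = [15.0, 16.0, 17.5, 19.0, 20.0, 19.5, 18.0, 16.5]
lag1 = autocorrelation(temperatures, lag=1)
print(f"lag-1 autocorrelation: {lag1:.2f}")
```

A value close to 1 means that neighbouring measurements tend to move together, which is typical of slowly changing quantities such as temperature; a negative value means that they tend to alternate.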

In a time series, the time associated with each data point is the time at which the measurement was made, and it is fundamental to how the whole dataset is used.

Usually a time series dataset is a set of past observations that is used to make predictions or to extract statistics by leveraging the timestamps of the values.

When we refer to time series, we will be talking about the InfluxDB time series toolkit.

What is a data stream?

A data stream, by contrast, is any flow of information in which the timestamp of each measurement is not as important as in a time series: it is used to perform real-time analysis and gain insight into events that are still happening. Usually this data is unstructured and arrives as a continuous stream, which is analysed before being saved. Because the value of the analysis decreases with time, the data is analysed only once.

We will focus on the Microsoft Azure data stream analysis toolkit.

Applications

InfluxDB+Grafana for house monitoring

Time series analysis applications

With time series we can analyse processes and make predictions based on the data we have stored, so we could use a time series in a weather model or to predict financial trends over a long time span.

Data stream analysis applications

Data stream analysis has several applications and is often used where large quantities of data need to be analysed to perform real-time detection: we could search for business opportunities by analysing transactions, detect intruders with cameras and motion sensors, or analyse communication traffic to balance the load on telecommunications equipment or servers.

None of these applications requires data persistence or has any use for a second analysis of the same data.

Pros & Cons

Time series pros & cons

A time series toolkit usually leverages a database with an SQL-like query language, optimized for heavy write and query loads. It also supports continuous queries, which compute aggregated data as new data is written to the database. Series can be indexed by tags, making queries more efficient.

To keep queries and writes fast, deletions and updates have restricted functionality.

The tables use the timestamp as a key, so conflicts are simple to resolve and writes take less time. However, we cannot store duplicates.
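The effect of keying data on the timestamp can be sketched with a toy in-memory store. This is not how InfluxDB is implemented; the class `TinySeriesStore` and its methods are hypothetical, and the sketch assumes that a point is identified by its measurement name, its tags and its timestamp, so a second write with the same identity merges into the first instead of creating a duplicate.

```python
class TinySeriesStore:
    """Toy model of a store keyed on (measurement, tags, timestamp)."""

    def __init__(self):
        self._points = {}

    @staticmethod
    def _key(measurement, tags, timestamp):
        # Sort the tags so that their order does not change the identity.
        return (measurement, tuple(sorted(tags.items())), timestamp)

    def write(self, measurement, tags, timestamp, fields):
        key = self._key(measurement, tags, timestamp)
        # Same key: the new field values overwrite the old ones, no duplicate row.
        self._points.setdefault(key, {}).update(fields)

    def read(self, measurement, tags, timestamp):
        return dict(self._points[self._key(measurement, tags, timestamp)])

    def count(self):
        return len(self._points)


store = TinySeriesStore()
store.write("temperature", {"room": "kitchen"}, 1000, {"value": 21.0})
store.write("temperature", {"room": "kitchen"}, 1000, {"value": 21.5})  # same key
store.write("temperature", {"room": "bedroom"}, 1000, {"value": 19.0})  # new tag

print(store.count())                                         # 2 points stored
print(store.read("temperature", {"room": "kitchen"}, 1000))  # latest value wins
```

Conflict resolution is a single dictionary lookup, which is why it is cheap, and the price is that two distinct measurements with the same identity and timestamp cannot coexist.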

The interface of the database uses HTTP APIs and has several plugins for other data ingestion protocols.
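Data written through the HTTP API is expressed in InfluxDB's text-based line protocol, where each line holds a measurement name, optional tags, one or more fields and a timestamp. The sketch below only builds such a line and does not send it anywhere. The helper name `to_line_protocol` is hypothetical, and it covers a simplified subset of the protocol: floating point fields only, with escaping limited to commas, equals signs and spaces.

```python
def _escape(text):
    """Escape the characters that separate parts of a line."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    # In the measurement name only commas and spaces need escaping.
    head = measurement.replace(",", "\\,").replace(" ", "\\ ")
    # Tags are optional and are emitted in sorted order.
    tag_part = "".join(
        f",{_escape(key)}={_escape(value)}" for key, value in sorted(tags.items())
    )
    # Simplification: every field value is written as a float.
    field_part = ",".join(
        f"{_escape(key)}={float(value)!r}" for key, value in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical reading from a house-monitoring sensor.
line = to_line_protocol(
    "temperature",
    {"room": "living room"},
    {"value": 21.5},
    1558432800000000000,
)
print(line)
```

A line built this way would form the body of a write request; the database name, credentials and endpoint depend on the installation and are left out here.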

We can quickly process a large amount of data covering a wide time span (months), thanks to data aggregation and compression.

We can use continuous queries and triggers to react in a semi-real-time fashion when new data is inserted. Under heavy load, however, these tools struggle to keep up with the incoming data, so we can only react to an event with some delay.

Data stream pros & cons

Data stream analysis is used to react quickly to real-time events, improving the responsiveness of the system. To do so, however, a cluster is needed, which makes this approach expensive.

The availability of high computing power can be leveraged to perform computationally heavy operations over a stream, such as machine learning.

Through real-time analysis we can create aggregate data to be stored for later reuse (e.g. in a time series) when we perform some kind of detection.

We work in an SQL-like environment, so we use queries to analyse data in time windows and compute results that let us react to the event that generated the information.
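The idea of analysing a stream in time windows can be sketched in plain Python with a tumbling window, that is, fixed-size windows that do not overlap. This is only an illustration of the concept, not the Azure query language; the function name `tumbling_window_average` and the sample events are hypothetical, and events are assumed to arrive in timestamp order.

```python
def tumbling_window_average(events, window_seconds):
    """Yield (window_start, average) for every non-empty window.

    `events` is an iterable of (timestamp_seconds, value) pairs,
    assumed to be sorted by timestamp.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in events:
        # Each event belongs to exactly one window.
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is not None and window_start != current_start:
            # The previous window is closed: emit its result and reset.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = window_start
        total += value
        count += 1
    if count:
        yield current_start, total / count  # flush the last open window


# Hypothetical sensor events: (seconds since start, measured value).
events = [(0, 10.0), (3, 20.0), (10, 30.0), (14, 50.0), (25, 5.0)]
for start, average in tumbling_window_average(events, window_seconds=10):
    print(f"window starting at {start}s: average {average}")
```

Because the function is a generator, each event is looked at once and then discarded, which matches the single-pass nature of stream analysis.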

The service can run in the cloud or at the edge, with different latency. It is usually billed by the amount of data processed, and in the cloud it can be easily scaled.

While collecting and analysing data we can set up a trigger based on a condition derived from the data analysis.

With Azure, the input tools are provided by the cloud environment (IoT Hub, Event Hubs, etc.).
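A trigger of this kind can be sketched as a condition evaluated on a value derived from the stream, here the average of the most recent readings. The names `alerts`, `threshold` and `window_size` are hypothetical, and a real deployment would express the condition in the tools of the stream analytics service rather than in application code.

```python
from collections import deque


def alerts(readings, threshold, window_size=3):
    """Yield (timestamp, average) when the recent average exceeds the threshold."""
    recent = deque(maxlen=window_size)  # keeps only the latest readings
    for timestamp, value in readings:
        recent.append(value)
        if len(recent) == window_size:
            average = sum(recent) / window_size
            if average > threshold:
                # The condition derived from the analysis holds: fire the trigger.
                yield timestamp, average


# Hypothetical temperature readings: (timestamp, degrees).
readings = [(1, 20.0), (2, 21.0), (3, 22.0), (4, 40.0), (5, 45.0)]
fired = list(alerts(readings, threshold=30.0))
for timestamp, average in fired:
    print(f"alert at t={timestamp}: recent average {average:.1f}")
```

Averaging over a few readings before comparing against the threshold keeps a single noisy measurement from firing the trigger, at the cost of a slightly later reaction.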

Summary

Similarities

  • usage of timestamps to organize the data
  • visualization of the data often shows the results of the analysis in relation to time
  • data used in stream analysis can be stored to be reused in a time series analysis and vice versa
  • many techniques used in time series analysis are adapted to be used over data streams in real time
  • SQL-like environment
  • visualization and monitoring of data in dashboards
  • usage of HTTP and RESTful APIs
  • schema-free usage with persistence and durability of the database

Differences

  • time series is used to make predictions
  • data stream analysis is used to gain information on real time events quickly
  • time series relies on stored data
  • data stream analysis usefulness ends when the data is stored
  • time series uses structured data
  • data streams have unstructured data
  • time series may not use the most recent data
  • the InfluxDB implementation of a time series database supports more languages than Azure data stream
  • Azure data stream has more access methods than InfluxDB, which relies on JSON over UDP and HTTP
  • Azure uses JSON data types and offers some advanced features, such as secondary indexes, server-side scripts and triggers, which are not available in InfluxDB, which uses numerical data types
  • Azure is not open source and its implementation is cloud only, while InfluxDB has an open source implementation that can be downloaded and a closed source clustering solution for cloud storage
