Time Series: What Is All the Hype About?
Three weeks ago, Amazon launched a complete tool built solely for analyzing time-series data.
What drove such a tech giant to create a tool just for one kind of data?
Google did something similar a couple of months ago, in September 2018.
In February 2018, InfluxData, a startup that began building an open source toolkit for time-series processing in 2014, announced a $35M Series C funding round.
Today they have more than 120K customers using their open source pipeline and 400 enterprise customers.
What’s so special about this data?
It’s not that this data is rare or unusual. Quite the opposite! This kind of data is everywhere, and we have it in abundance. By definition, any data that has ‘time’ as one of its variables is time-series data. Since time acts purely as an independent variable, any dataset with time as a variable is naturally ordered by time.
That gives very neat properties to the other variables in the dataset.
Time series data is ordered by time.
Data arrives in only one direction: if time is plotted on the X-axis, new points only ever appear to the right. The past is fixed, and no new data modifies the old values.
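To make these two properties concrete, here is a minimal Python sketch of an append-only series. The class name `AppendOnlySeries` and its methods are hypothetical, chosen only for illustration, and the sketch assumes timestamps never arrive out of order.

```python
from bisect import bisect_left


class AppendOnlySeries:
    """Hypothetical container: points stay in time order and are never edited."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, timestamp, value):
        # New data may only arrive "to the right" of everything seen so far.
        if self._times and timestamp <= self._times[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self._times.append(timestamp)
        self._values.append(value)

    def between(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        # The data is already sorted by time, so a range query is just
        # two binary searches followed by a slice.
        lo = bisect_left(self._times, start)
        hi = bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


series = AppendOnlySeries()
for second, temperature in [(0, 21.0), (1, 21.2), (2, 21.1), (3, 21.5)]:
    series.append(second, temperature)

print(series.between(1, 3))
```

Because the ordering is guaranteed, looking up any time range needs no sorting and no scan of the whole history, which is one of the "neat properties" that time ordering buys.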
Time series data is becoming increasingly important because new data arrives in real time, and real-time processing has immense advantages: for example, predicting heart attacks before they occur, anticipating stock market changes, or detecting an impending machine breakdown early enough to reduce downtime and the resulting losses.
What is more interesting is that, with the growth of IoT, more and more fields are producing live data. Such huge amounts of data not only represent the current state of a system but also help identify future events. Typical sources include:
Meteorology Data (Temperature, Wind, Rainfall etc.)
Financial Data (Stock Market, GDP, Exchange Rate)
Medical Data (EEG, ECG, Temperature, Blood Pressure)
What information does a time series really hold?
From a machine learning perspective time series can be utilized for:
- Labelling the states of the system (supervised/unsupervised): describing the various states in which the series exists. For example, the stock market can be in a bull phase or a bear phase, an ECG can show a heart attack phase or a normal phase, and an EEG can show an epileptic fit phase or a normal phase.
- Analysis of features (time series analysis): the series can tell a lot about how the data is being generated. Statistical measures break the series apart into components from which we can infer a lot about the behaviour of the data (see the figure below).
Trend: whether the series increases or decreases continuously
Seasonality: the presence of any repeating patterns
Noise: additional random variations
- Understanding the mechanism that generates the series: note that a series might not have any of these components available for analysis. For statisticians, forecasting a series with non-deterministic behaviour is done through probabilistic analysis. After de-trending and de-seasonalizing a series, we look at the stochastic behaviour of the noise component. Statistical time-series analysis is concerned with evaluating the properties of the probability model that generated the observed time series.
- Forecast: we can label the various phases of a time series, and we can look at its trend and statistical behaviour in the past, but forecasting is increasingly seen as an important function of time series. Determining the future from past values holds immense importance, e.g. predicting earthquakes or sales trends. Amazon supposedly uses its own sales data to predict where future orders will come from and ships its products even before the orders are placed. That’s the power of forecasting.
- Creation of data: sometimes a forecast alone is not enough; we even generate future data and course correct based on such machine-created data.
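The trend, seasonality and noise split described above, and a forecast built from it, can be sketched in a few lines of Python. This is a minimal illustration of classical additive decomposition, not the method of any tool mentioned in this post. The function names, the period of 4 and the synthetic series are all assumptions made for the example. It uses only the standard library; in practice a library such as statsmodels offers a ready-made `seasonal_decompose`.

```python
import random
from statistics import mean


def centered_moving_average(values, period):
    """Estimate the trend with a moving average spanning one full period.

    Positions where the window does not fit are left as None.
    """
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        if period % 2:
            trend[i] = mean(window)
        else:
            # Even period: give the two end points half weight so the
            # window stays centered on i (a "2 x period" moving average).
            trend[i] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    return trend


def decompose_additive(values, period):
    """Split values into trend + seasonal + noise.

    Assumes the series covers at least two full periods.
    """
    trend = centered_moving_average(values, period)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]
    # Average the detrended values at each position within the cycle.
    shape = []
    for phase in range(period):
        bucket = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == phase]
        shape.append(mean(bucket))
    offset = mean(shape)  # center the seasonal shape around zero
    shape = [s - offset for s in shape]
    seasonal = [shape[i % period] for i in range(len(values))]
    noise = [v - t - s if t is not None else None
             for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, noise


def forecast_additive(values, period, steps):
    """Naive forecast: extend the last trend slope and re-add the seasonal shape."""
    trend, seasonal, _ = decompose_additive(values, period)
    known = [(i, t) for i, t in enumerate(trend) if t is not None]
    (i0, t0), (i1, t1) = known[-2], known[-1]
    slope = (t1 - t0) / (i1 - i0)
    n = len(values)
    return [t1 + slope * (n + k - i1) + seasonal[(n + k) % period]
            for k in range(steps)]


# Synthetic series (made up): upward trend + repeating pattern + random noise.
rng = random.Random(0)
pattern = [2.0, -1.0, -3.0, 2.0]
series = [0.5 * t + pattern[t % 4] + rng.gauss(0, 0.2) for t in range(24)]

trend, seasonal, noise = decompose_additive(series, period=4)
print("seasonal shape:", [round(s, 2) for s in seasonal[:4]])
print("next 4 values :", [round(f, 2) for f in forecast_additive(series, 4, 4)])
```

The recovered seasonal shape should sit close to the made-up `pattern`, and whatever is left in `noise` is the part a statistician would go on to model probabilistically.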
Machine Learning as a Tool to evaluate Time Series
By its very nature, time series data keeps accumulating. So any technical solution must do two things:
- Employ machine learning algorithms that analyse and forecast from the data
- Provide an architecture that handles a combination of streaming, real-time and historical data.
James Corcoran, SVP of products, solutions and innovation at Kx, said:
“Time-series data tends to be big, so performance and scalability are crucial. The key requirements for working with time-series data are the abilities to analyze and aggregate the data very, very quickly.”
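As a small illustration of what "aggregating" means here, the sketch below downsamples raw readings into fixed one-minute buckets and computes summary statistics for each. It is not how Kx or any other product does it. The function names and the sample readings are assumptions, and a production system would push this work into a time-series database or a columnar engine rather than a Python loop.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(timestamp, width):
    """Round a timestamp down to the start of its fixed-width bucket."""
    # timedelta // timedelta gives the whole number of buckets since the epoch.
    return EPOCH + ((timestamp - EPOCH) // width) * width


def aggregate(points, width):
    """Group (timestamp, value) points into buckets and summarize each one."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[bucket_start(timestamp, width)].append(value)
    return {
        start: {
            "count": len(values),
            "mean": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Made-up readings: one value every 10 seconds for 3 minutes.
start = datetime(2018, 12, 1, 12, 0, tzinfo=timezone.utc)
readings = [(start + timedelta(seconds=10 * i), float(i)) for i in range(18)]

per_minute = aggregate(readings, timedelta(minutes=1))
for bucket, stats in per_minute.items():
    print(bucket.isoformat(), stats)
```

Eighteen raw points collapse into three rows, which is the kind of reduction that keeps dashboards and models responsive as the raw history grows.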
The things to keep in mind
- Multivariate time series: time series data can be very complex. Quite often, a single variable or signal is not enough to describe a system; multiple signals that vary continuously in time are required to represent it accurately.
- Sampling rate: these variables or signals may not be sampled at the same rate, and even a single signal may not be sampled at a uniform rate. Some data may arrive in bursts, while other data may arrive continuously and be rich in high-frequency content [ref 1].
- Windows of time: a single snapshot in time may not be enough to deduce information about all the states in the signal. Long periods of data must be analysed, so the algorithms need to digest voluminous time-history traces.
- Feature learning: manual selection of features such as the mean, moving average or higher-order derivatives is infeasible and not very useful in most cases. ARIMA, RNNs, SAX and similar methods are useful for feature extraction and are regularly used, but nowadays companies like Google and Amazon prefer unsupervised extraction of features (figure below).
- Transforming raw data: often the feature extraction methods mentioned above cannot be applied to the raw signal directly. Transforms such as spectral analysis, resampling or window optimisation may need to be applied first.
- Combination of supervised + unsupervised labelling: labelling the states of the data may not be easy. Let’s say we want to label when a lathe machine is in a normal state, a vibration state or a not-working state. In real-world applications, historic data for faulty states is very rare compared to the normal state. An unsupervised labelling algorithm can sift through huge amounts of data and separate out these rare chunks of faulty data.
- Real-time processing: the need to process signals in real time poses a number of challenges. Some memory needs to be reserved to buffer the history of signals fed into the model. With multiple signals captured at varying sampling rates, synchronization becomes a key issue, and gaps in the data need to be managed.
- Insights: finally, all analysis is useless if there isn’t a way to visualize the insights drawn from the machine learning infrastructure. Actionable insights must be provided to organisations so they can make use of the live time-series data.
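Several of the points above (signals sampled at different rates, gaps, buffered windows, and picking out rare states without labels) can be seen together in one small sketch. Everything here is an assumption made for illustration: the lathe signals are made up, the function names are hypothetical, and a simple z-score threshold stands in for a real unsupervised labelling algorithm.

```python
from bisect import bisect_right
from collections import deque
from statistics import mean, pstdev


def forward_fill(samples, grid):
    """Resample sorted (time, value) samples onto the given grid times.

    Each grid time takes the last observation at or before it, or None
    if nothing has been observed yet (a gap the caller must handle).
    """
    times = [t for t, _ in samples]
    filled = []
    for g in grid:
        idx = bisect_right(times, g) - 1
        filled.append(samples[idx][1] if idx >= 0 else None)
    return filled


def sliding_windows(rows, size):
    """Yield overlapping windows using a fixed-size buffer of recent rows."""
    buffer = deque(maxlen=size)
    for row in rows:
        buffer.append(row)
        if len(buffer) == size:
            yield list(buffer)


def flag_unusual(features, threshold=3.0):
    """Mark features more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(features), pstdev(features)
    if sigma == 0:
        return [False] * len(features)
    return [abs(f - mu) / sigma > threshold for f in features]


# Hypothetical lathe signals: vibration every second (with a short burst),
# temperature every 5 seconds and starting 2 seconds late.
vibration = [(t, 9.0 if 40 <= t < 45 else 1.0) for t in range(60)]
temperature = [(t, 20.0 + 0.1 * t) for t in range(2, 60, 5)]

# Synchronize both signals on a common one-second grid.
grid = list(range(60))
aligned = zip(grid, forward_fill(vibration, grid), forward_fill(temperature, grid))
# Drop the rows before the first temperature reading (the gap).
rows = [(t, v, temp) for t, v, temp in aligned
        if v is not None and temp is not None]

# One feature per 5-second window: the mean vibration level.
windows = list(sliding_windows(rows, size=5))
features = [mean([v for _, v, _ in window]) for window in windows]
flags = flag_unusual(features)

suspect_starts = [window[0][0] for window, flagged in zip(windows, flags) if flagged]
print("windows flagged as unusual start at t =", suspect_starts)
```

The `deque` with a fixed `maxlen` is the buffered history mentioned under real-time processing: memory use stays constant no matter how long the stream runs. The flagged windows are only candidates for the rare faulty state; a person or a supervised model would still need to confirm what they actually are.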
Where to get started?
- A good place to get started is a book that covers the statistical aspects of time series and uses the R language. Buy it here.
This book gives you a step-by-step introduction to analysing time series using the open source software R. Each time…amzn.to
- There are many tutorials on the internet for performing various machine learning algorithms on TS data.
Deep Learning for Time Series Forecasting Crash Course. Bring Deep Learning methods to Your Time Series project in 7…machinelearningmastery.com
- A very interesting article by Vegard F points out the pitfalls of using ML with time series.
In my other posts, I have covered topics such as: How to combine machine learning and physics based modeling, and how…towardsdatascience.com
- The 10th chapter of the following book is a very good paper on Machine Learning Strategies for Time Series Forecasting.
To large organizations, business intelligence (BI) promises the capability of collecting and analyzing internal and…amzn.to
- Finally, time series forecasting is a well-studied but still growing field. A number of experts on Quora give detailed answers to questions related to time series. One of the good ones I found was this by Matthew Dancho.
X8 aims to organize and build a community for AI that is not only open source but also looks at its ethical and political aspects. More such experiment-driven, simplified AI concepts will follow. If you liked this or have feedback or follow-up questions, please clap and comment below.
Thanks for reading!