Cleaning and Understanding Multivariate Time Series Data

A beginner’s guide to the world of time series forecasting!

Indraneel Dutta Baruah
Analytics Vidhya
10 min read · Sep 4, 2020


No matter what kind of data science project one is assigned to, making sense of the dataset and cleaning it is always critical for success. The first step is to understand the data using exploratory data analysis (EDA), as it helps us create a logical approach for solving the business problem. It also allows us to identify issues, such as outliers, in our dataset.

It is necessary to clean up these issues before starting any analysis, because if our data is spewing garbage, so will our analysis. Moreover, the insights from such an analysis won’t tie up with the theoretical or business knowledge of our clients, and they may lose confidence in our work. And if the clients do end up making a decision based on such an analysis, the end result will turn out to be wrong and we will be in a lot of trouble! Thus, how well we clean and understand the data has a tremendous impact on the quality of the results.

Things get slightly more complicated when we deal with data that has hidden properties, as time series datasets do. A time series is a special type of data that is ordered chronologically and needs special attention for handling its intrinsic elements, like trend and seasonality.

For these reasons, we will focus on a step-by-step guide to the EDA and data cleaning process one can follow while working with multivariate time series data.

Index:

  • Understanding time series data — The Theory
  • EDA (inspection, data profiling, visualizations)
  • Data Cleaning (missing data, outlier detection and treatment)
  • Final words

Understanding time series data — The Theory

One of the best freely available sources to learn about time series analysis is the book ‘Forecasting: Principles and Practice’ by Rob J Hyndman and George Athanasopoulos. Both are professors at Monash University, Australia, and Rob was Editor-in-Chief of the International Journal of Forecasting and a Director of the International Institute of Forecasters from 2005 to 2018. I am going to summarize some of the basic elements of a time series dataset here; for further details, please refer to the book.

Elements of Time Series Data

Time-series data can be considered a list of numbers, along with information about when those numbers were recorded. Most commonly, a time series is a sequence taken at successive, equally spaced points in time. Time series data is typically composed of four elements: a trend, a seasonal component, a cyclic component and an irregular (error) component.

Decomposition techniques help us extract the trend, seasonality and error/irregular components of a time series dataset. There are multiple decomposition techniques, but in the EDA section of this blog we will focus on the additive method.

Stationarity

In the most intuitive sense, a stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times.

Why is this property important? Stationary processes are easier to model because the way they change is predictable and stable. For most models involving time series, we will find ourselves determining whether the data was generated by a stationary process, and if not, we may need to transform it so that it has the properties of data generated by such a process.

ACF and PACF

Autocorrelation (ACF) and partial autocorrelation (PACF) plots are heavily used to determine stationarity and time series model parameters. These plots graphically summarize the strength of the relationship between an observation in a time series and observations at prior time steps.

For ACF plots, we calculate the correlation of time-series observations with previous time steps, called lags. The PACF plot summarizes the relationship between an observation in a time series and observations at prior time steps, with the relationships of the intervening observations removed. For a stationary process, such plots drop to statistically insignificant values within the first few lags.
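
As a quick, synthetic illustration (not part of the original analysis), the ACF of a purely random, stationary series falls inside the confidence band almost immediately, while a series with a stochastic trend stays strongly autocorrelated over many lags:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(42)
white_noise = rng.normal(size=500)   # stationary series
random_walk = white_noise.cumsum()   # non-stationary series (stochastic trend)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(white_noise, lags=40, ax=axes[0], title="Stationary series")
plot_acf(random_walk, lags=40, ax=axes[1], title="Non-stationary series")
plt.show()
```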

EDA (inspection, data profiling, visualizations)

To share my understanding of the common concepts and techniques in EDA, we will work with a multivariate time series dataset on Hong Kong flat prices along with various macroeconomic variables. It is a daily dataset starting from 2nd January 2003 and running to 26th November 2019. The dataset is available on Kaggle.

To begin with, we imported the necessary Python libraries (for this example pandas, NumPy, matplotlib, etc.) and loaded the dataset, roughly as sketched below.
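
A minimal sketch of this setup step; the CSV file name and the date column label are assumptions, so adjust them to match the downloaded Kaggle file.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# File name and date column label are placeholders for the Kaggle download.
df = pd.read_csv("hong_kong_flat_prices.csv", parse_dates=["Date"])
df = df.sort_values("Date").reset_index(drop=True)
```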

Looking at the data and checking field types

  • To take a closer look at the data, we used the head function of the pandas library, which returns the first five observations of the data. Similarly, tail returns the last five observations of the dataset.
  • We found the total number of rows and columns in the dataset and the data type of each column using the info function. The dataset comprises 4233 observations and 14 columns. All the columns have the correct data type (the date is in DateTime format and the rest are float), and none of the columns have any null values. A sketch of these checks follows below.
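
Continuing from the loading sketch, these inspection steps are a few one-liners:

```python
print(df.head())   # first five observations
print(df.tail())   # last five observations
df.info()          # row/column counts, dtypes and non-null counts (4233 x 14 here)
```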

Get summary statistics

  • As you can notice, the mean value is greater than the median (the 50%, i.e. 50th percentile, row) for most columns.
  • There is a notably big difference between the 75th percentile and the max values of certain fields like “First hand sales quantity”, “First hand sales amount”, “Total completions”, etc.
  • Observations 1 and 2 thus suggest that there are extreme values (outliers) in our dataset. We reach the same conclusion once we look at the histograms of all the numeric fields (see the sketch below).
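
A rough sketch of the summary-statistics and histogram checks, continuing from the earlier snippets:

```python
# Compare the mean against the 50% (median) row, and the 75% row against the max, per column.
print(df.describe())

# Histograms of all numeric fields make the skew and extreme values visible.
df.hist(figsize=(15, 12), bins=50)
plt.tight_layout()
plt.show()
```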

Checking time-series properties of the target variable

  • The target variable/dependent variable (‘Private Domestic (Price Index)’) has a rising trend.
  • There is a seasonal dip in most years.
  • The variation in the final year of the data is extreme compared to the overall trend.
  • The target variable is not stationary.
  • We have used the additive model for decomposition, which assumes that the time series data is structured in the following manner:
    Time Series Data = Trend + Seasonal + Random
  • We can observe that the seasonal pattern is a regularly repeating pattern and that the trend is upward sloping, though it is not a smooth line.
  • The dataset is highly non-stationary, as can be seen from the ACF and PACF plots (a sketch of the decomposition and these plots follows this list).
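
A sketch of the decomposition and ACF/PACF checks, assuming statsmodels is available; the target column name is taken from the dataset, and the yearly seasonal period of 365 (for daily data) is an assumption:

```python
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

target = df.set_index("Date")["Private Domestic (Price Index)"]

# Additive decomposition: trend + seasonal + residual components.
decomposition = seasonal_decompose(target, model="additive", period=365)
decomposition.plot()
plt.show()

# A slowly decaying ACF far outside the confidence band points to non-stationarity.
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(target, lags=50, ax=axes[0])
plot_pacf(target, lags=50, ax=axes[1])
plt.show()
```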

Data Cleaning (missing data, outlier detection and treatment)

Data cleaning is the process of identifying and correcting inaccurate records from a dataset along with recognizing unreliable or irrelevant parts of the data. We will be focusing on handling missing data and outliers in this blog.

Missing Data

  • Our raw data starts on 2003-01-02 and ends on 2019-11-26. There are 6173 days between these two dates (inclusive), but the original data only has 4233 records, so a number of dates are missing.
  • We create a new dataset with all 6173 dates and join the original dataset with this new dataset. This leads to null values for all the dates not available in the original dataset.
  • We use linear interpolation to fill in these null values (see the sketch below).
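
A sketch of the reindex-and-interpolate step, continuing from the earlier snippets (the date column label is an assumption, as before):

```python
# Full daily calendar between the first and last observed dates (6173 days).
full_range = pd.DataFrame({"Date": pd.date_range("2003-01-02", "2019-11-26", freq="D")})

# Left-join the original data: dates missing from it become rows of NaNs.
df_full = full_range.merge(df, on="Date", how="left")

# Fill the gaps in the numeric columns with linear interpolation.
numeric_cols = df_full.columns.drop("Date")
df_full[numeric_cols] = df_full[numeric_cols].interpolate(method="linear")
```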

Outlier detection

The Wikipedia definition:

In statistics, an outlier is an observation point that is distant from other observations.

To ease the discovery of outliers, statistics offers plenty of methods, but we will only discuss a few basic techniques (interquartile range, standard deviation) here. I will focus on advanced methods in a separate blog.

  • The interquartile range (IQR) is calculated as the difference between the 75th and the 25th percentiles of the data. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The most common value for the factor k is 1.5 (which we have used here). A factor k of 3 or more can be used to identify values that are extreme outliers or “far outs”.
  • Fields like ‘Total completions’ have a lot of outliers, especially among the initial values of the series.
  • If we know that the distribution of values in the sample is Gaussian or Gaussian-like, we can use the standard deviation of the sample as a cut-off for identifying outliers. Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller samples of data, a value of 2 standard deviations (95%) can be used, and for larger samples, a value of 4 standard deviations (99.9%) can be used. A sketch of both rules follows this list.
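
A minimal sketch of both detection rules, continuing from the earlier snippets (the helper names and the exact column label are illustrative):

```python
def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def std_outliers(series, n_std=3):
    """Boolean mask of values more than n_std standard deviations from the mean."""
    return (series - series.mean()).abs() > n_std * series.std()

# Example: count the outliers flagged in one field (column label assumed).
print(iqr_outliers(df_full["Total completions"]).sum())
```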

Outlier treatment

  • All the identified outliers are replaced by nulls first.
  • Then the nulls are filled by linear interpolation, as sketched below. In a separate blog, a more robust approach to replacing outliers will be discussed.
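
Continuing the sketch, the treatment step can look like this:

```python
# Replace flagged outliers with NaN, column by column...
for col in numeric_cols:
    df_full.loc[iqr_outliers(df_full[col]), col] = np.nan

# ...then fill them back in by linear interpolation, as with the missing dates.
df_full[numeric_cols] = df_full[numeric_cols].interpolate(method="linear")
```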

Final Words

I hope this blog helps the readers make sense of their datasets and handle some of the issues that come with messy data. Apart from that, the readers should now be able to understand the basic elements of a time series dataset as well. The code for the entire analysis can be found here. But it is important to understand that each dataset comes with its own unique challenges and will need a customized approach to make the data usable.

These are some of the questions one should always ask while working with a new dataset:

  • How was the data collected, and under what conditions?
  • What does the data represent?
  • What are the issues in the dataset? Are there any outliers?
  • What methods should be used to clean the data, and why?

This is the first blog in a series focused on creating a robust forecasting engine based on multivariate time series data. Kindly do read the next blog, which focuses on the feature engineering and selection methods needed to optimize our forecasts.

Do you have any questions or suggestions about this blog? Please feel free to drop in a note.

Thank you for reading!

If you, like me, are passionate about AI, Data Science, or Economics, please feel free to add/follow me on LinkedIn, Github and Medium.
