Preprocessing and Data Exploration for Time Series — Handling Missing Values
In our series of articles, we have introduced time series analysis, covering the components of a time series and the steps needed to analyse one thoroughly. In this article, we focus on one part of time series preprocessing and data exploration: handling missing values in time series data.
We will look at why imputing missing values matters in time series data and explore several methods for doing it. The following table of contents outlines the key topics covered in this article:
Table of Contents
- Importing a Time Series dataset
- Finding Missing Values
- Forward-Filling Method
- Backward-Filling Method
- Linear Interpolation
- Trend and Seasonal Decomposition
Let’s start with importing a time series dataset.
Importing a Time Series Dataset
In this article, we download market data from Yahoo! Finance with yfinance, an open-source tool that uses Yahoo’s publicly available APIs. Using the following line of code, we can install it in our environment.
!pip install yfinance
After installing this module, we are ready to download the market data of any listed company. For this article, we will use one year (2022) of market data for the Reliance company. Let’s do this.
import yfinance as yf
data = yf.download("RELIANCE.NS", start="2022-01-01", end="2023-01-01")
print(data)
Output:
Here we can see an overview of this data. There are only 248 rows for a period of 365 days, which means some dates are missing from the data.
When we treat time series analysis as a process, we need to understand that missing values come in two forms: a data value can be missing at a timestamp that is present, and a timestamp itself can be missing from the time sequence. Both count as missing values in time series data. To learn about handling general missing values from data, we can refer to this article. In this article, we will learn how to handle missing values specifically in time series data. Let’s move to the next sections and apply different methods to do so.
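The two kinds of missingness described above can be illustrated on a small toy series. This is a minimal sketch with made-up numbers, not the Reliance data; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy daily series (hypothetical values): 2022-01-03 is absent from the
# index entirely, while 2022-01-05 is present but holds NaN.
idx = pd.to_datetime(
    ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-06"]
)
s = pd.Series([10.0, 11.0, 13.0, np.nan, 15.0], index=idx)

# Kind 1: the timestamp exists, but its value is missing.
missing_values = s[s.isna()].index

# Kind 2: the timestamp itself is missing from the expected daily sequence.
expected = pd.date_range(idx.min(), idx.max(), freq="D")
missing_timestamps = expected.difference(s.index)

# Reindexing onto the full calendar turns kind 2 into kind 1,
# so both can then be handled with the same imputation tools.
full = s.reindex(expected)

print("Missing values at:", list(missing_values))
print("Missing timestamps:", list(missing_timestamps))
print(full)
```

After reindexing, every gap shows up as NaN, which is the form the imputation methods in the following sections work on.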
Finding Missing Values
Before handling the missing values in time series data, we need to find the time values that are missing from the series. We can do this with Pandas library functions. Below is a way to store the missing time values in a DatetimeIndex object.
import pandas as pd
data.index = pd.to_datetime(data.index)
date_range = pd.date_range(start="2022-01-01", end="2023-01-01", freq="D")
missing_dates = date_range[~date_range.isin(data.index)]
print(missing_dates)
Output:
Here we get a DatetimeIndex object with a length of 118, which means 118 dates in the range are missing from our extracted data. Let’s visualise these dates using the Matplotlib library’s functions.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.bar(missing_dates, [1] * len(missing_dates))
plt.title("Missing Dates")
plt.xlabel("Date")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()
Output:
Here, we can see the missing time values in the graph. Now let’s look at the missing dates together with the Close variable of the data.
merged_data = data.reindex(date_range)
closing_prices = merged_data["Close"]
missing_dates_mask = closing_prices.isna()
# Plot the closing prices; NaN values on missing dates leave breaks in the line
plt.figure(figsize=(10, 6))
plt.plot(closing_prices.index, closing_prices)
plt.title("Reliance Closing Prices")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.grid(True)
# Mark every missing date with a dashed red vertical line
for missing_date in closing_prices[missing_dates_mask].index:
    plt.axvline(missing_date, color="red", linestyle="--")
plt.show()
Output:
Here in the graph, the red lines mark the missing dates and the blue line shows the closing price of the Reliance stock. Now that we know where the missing values are, we are ready to apply the missing value handling techniques. Let’s start with the forward-filling method of imputing missing values.
Forward-Filling Method
Using this method, we fill each missing value in a time series with the most recent preceding value. Note that forward filling does not consider any relationship between the data values: it assumes that the value stays constant until a new value is observed. It is useful when the time series exhibits a relatively stable trend or when missing values occur in consecutive intervals. The operation propagates the last observed value forward until it reaches the next available data point. Using the lines of code below, we can apply it to our extracted data.
data_reindexed = data.reindex(date_range)
data_filled_forward = data_reindexed.ffill()
Here, we have made the new index of the data using the above-defined date range and applied the ‘ffill’ method to fill the missing dates in the data.
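To see exactly what forward filling does, here is a minimal sketch on a toy series with made-up prices. The `limit` argument shown is an optional pandas parameter that the article's own code does not use; it is included only to show how long runs of carried-forward values can be capped.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with a three-day gap and a trailing gap.
idx = pd.date_range("2022-01-07", periods=6, freq="D")
prices = pd.Series([100.0, np.nan, np.nan, np.nan, 104.0, np.nan], index=idx)

# Plain forward fill: every NaN takes the last observed value before it.
filled = prices.ffill()

# Capped forward fill: carry a value forward for at most 2 consecutive days,
# leaving longer gaps partly unfilled so they stay visible.
capped = prices.ffill(limit=2)

print(pd.DataFrame({"original": prices, "ffill": filled, "ffill_limit_2": capped}))
```

Capping the fill is one way to avoid silently stretching a stale price across a long gap; whether that matters depends on the data.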
Now we can plot the Close variable and check how the complete series looks.
plt.figure(figsize=(10, 6))
plt.plot(data_filled_forward.index, data_filled_forward["Close"], label="Forward Filled")
plt.title("Reliance Closing Prices")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.legend()
plt.grid(True)
plt.show()
Output:
Here, we can see we have imputed the missing values using the forward-filling method. Now let’s move towards the next method of imputing missing values in time series.
Backward-Filling Method
As the name suggests, this method is the opposite of forward filling: we use the nearest succeeding value to impute the missing values in time series data. In other words, the next available value after a missing data point replaces it. The backward fill operation propagates the next observed value backwards until it reaches the previous available data point. Using the lines of code below, we can apply this method to our extracted data.
data_reindexed = data.reindex(date_range)
data_filled_backward = data_reindexed.bfill()
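The difference between the two directions is easiest to see side by side. The following is a small sketch on made-up values, chosen so that the series has a gap at each end as well as in the middle.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with a leading gap, an interior gap,
# and a trailing gap.
idx = pd.date_range("2022-01-06", periods=6, freq="D")
prices = pd.Series([np.nan, 100.0, np.nan, np.nan, 103.0, np.nan], index=idx)

forward = prices.ffill()   # uses the previous observed value
backward = prices.bfill()  # uses the next observed value

comparison = pd.DataFrame(
    {"original": prices, "ffill": forward, "bfill": backward}
)
print(comparison)
```

Forward filling cannot fill the leading gap because nothing precedes it, and backward filling cannot fill the trailing gap because nothing follows it. Backward filling also uses values from later dates, which is worth keeping in mind if the filled series is later used for forecasting.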
Let’s plot the Close variable over time after imputing with both forward and backward filling so that we can compare the two methods, as they are similar.
plt.figure(figsize=(10, 6))
plt.plot(data_filled_forward.index, data_filled_forward["Close"], label="Forward Filled")
plt.plot(data_filled_backward.index, data_filled_backward["Close"], label="Backward Filled")
plt.title("Reliance Closing Prices")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.legend()
plt.grid(True)
plt.show()
Output:
Here, we can see a slight difference between the two series: one uses the most recent preceding value, while the other uses the nearest succeeding value to impute the missing values in the data. With these two methods covered, let’s take a look at another method of handling missing values.
Linear Interpolation
Basically, linear interpolation is a method of estimating values between two known data points. In the context of time series data, we can use linear interpolation to fill in missing values or gaps in the data.
Looking more closely, this process works by assuming a straight line between two adjacent known data points and estimating the values at points along that line. In other words, estimating the missing values this way assumes a linear relationship between the known data points.
It is a simple and straightforward way to estimate missing values, especially when the data follows a relatively smooth trend. It is not advisable to use this method when the underlying relationship is nonlinear or when there are significant fluctuations or irregularities in the data. Like the methods above, it is simple to implement; let’s check the code below.
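The straight-line idea can be written out by hand. For a missing point at position x between known points (x0, y0) and (x1, y1), the estimate is y0 + (y1 - y0) * (x - x0) / (x1 - x0). The sketch below applies this formula to a toy series with made-up values and checks it against pandas; the helper name `lerp` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

def lerp(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Toy series (hypothetical values): known on day 0 and day 3, missing between.
idx = pd.date_range("2022-01-07", periods=4, freq="D")
s = pd.Series([100.0, np.nan, np.nan, 106.0], index=idx)

# Manual estimate for day 1 and day 2, using day positions as x.
manual = [lerp(0, 100.0, 3, 106.0, k) for k in (1, 2)]

# The same estimate from pandas.
auto = s.interpolate(method="linear")

print("manual:", manual)
print(auto)
```

On an evenly spaced daily index like this one, the hand calculation and `interpolate(method="linear")` give the same values.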
data_reindexed = data.reindex(date_range)
data_interpolated = data_reindexed.interpolate(method="linear")
Here, we have used the interpolate function of the pandas data frame with the linear method to impute the missing data in the time series. Let’s plot the Close variable after imputation and compare it with the data imputed using the forward-filling method.
plt.figure(figsize=(10, 6))
plt.plot(data.index, data["Close"], label="Original Data")
plt.plot(data_filled_forward.index, data_filled_forward["Close"], label="Forward Filled")
plt.plot(data_interpolated.index, data_interpolated["Close"], label="Interpolated Data")
plt.title("Reliance Closing Prices")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.legend()
plt.grid(True)
plt.show()
Output:
Here, we can see the difference between the results from both of the methods, and we can see how assuming a linear relationship between data points worked in imputing the missing value in the data.
Trend and Seasonal Decomposition
In the introduction article, we discussed that time series data is the result of several components, of which trend, seasonality, cycle, and residuals are the four main ones. By breaking a time series into these components, we can also impute its missing values.
Since the seasonal component captures the recurring patterns present in the data, imputing missing values with seasonal decomposition typically involves the following steps:
- Time series decomposition
- Missing value identification
- Impute Seasonal Component: Here, we take the seasonal patterns of the time series into account and use the average of the corresponding seasonal values from previous and subsequent periods to fill in the missing values.
- Impute Trend Component: If any values are still missing after imputing the seasonal component, we can fill them using techniques such as linear interpolation or regression-based imputation, which estimate the trend component and fill in the missing values accordingly.
- Reconstruct the Time Series.
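The steps above can be sketched end to end on synthetic data. This is a simplified illustration under stated assumptions, not the statsmodels procedure: the data is made up (a linear trend plus a weekly pattern), the trend is estimated with a centred 7-day rolling mean, and the seasonal component is the average detrended value per weekday. All variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical): linear trend + weekly pattern.
idx = pd.date_range("2022-01-03", periods=56, freq="D")
t = np.arange(56)
weekly = np.array([2.0, 1.0, 0.0, -1.0, -2.0, 0.5, -0.5])  # sums to zero
truth = pd.Series(100 + 0.5 * t + weekly[t % 7], index=idx)

# Missing value identification: remove a few points to act as gaps.
observed = truth.copy()
observed.iloc[[10, 11, 30, 45]] = np.nan
is_missing = observed.isna()

# Trend component: linearly interpolate a rough copy, then smooth it with
# a centred 7-day rolling mean so the weekly pattern averages out.
rough = observed.interpolate(method="linear")
trend = rough.rolling(window=7, center=True, min_periods=4).mean()

# Seasonal component: average detrended value for each weekday, computed
# from observed points only (NaNs are skipped by the mean).
detrended = observed - trend
seasonal = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Reconstruction: keep observed values, fill gaps with trend + seasonal.
imputed = observed.fillna(trend + seasonal)

errors = (imputed - truth)[is_missing].abs()
print(errors)
```

Because the gaps are filled with trend plus the weekday average, the imputed points follow the weekly pattern instead of cutting straight across it, which is the main reason to prefer this approach on strongly seasonal data.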
Let’s take a look at how we can perform this in code.
Decomposing the time series into its components.
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(data["Close"], model="additive", period=7)
Getting the trend component and filling its missing values with forward filling, followed by backward filling for any leading gaps.
trend = result.trend
trend_filled = trend.ffill().bfill()
Getting the seasonal component and filling its missing values with backward filling, followed by forward filling.
seasonal_filled = result.seasonal.bfill().ffill()
Adding components of time series
imputed_data = trend_filled + seasonal_filled + result.resid
Let’s plot the data filled by Interpolation and filled by Trend and Seasonal Decomposition
plt.figure(figsize=(10, 6))
plt.plot(data_interpolated.index, data_interpolated["Close"], label="Interpolated Data")
plt.plot(imputed_data.index, trend_filled, label="Imputed Data")
plt.title("Reliance Closing Prices")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.legend()
plt.grid(True)
plt.show()
Output:
Here, we can see that the series from this method, plotted here as the filled trend component, has fewer variations than the series imputed by interpolation. Now let’s compare the imputed time series from all the methods.
plt.figure(figsize=(10, 6))
plt.plot(data_filled_forward.index, data_filled_forward["Close"], label="Forward Filled")
plt.plot(data_filled_backward.index, data_filled_backward["Close"], label="Backward Filled")
plt.plot(data_interpolated.index, data_interpolated["Close"], label="Interpolated Data")
plt.plot(imputed_data.index, trend_filled, label="Imputed Data")
plt.title("Reliance Closing Prices")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.legend()
plt.grid(True)
plt.show()
Output:
Here also, we can see that applying trend and seasonal decomposition gives a smoother series than the other methods; note that the line plotted for this method is the filled trend component. This approach leverages the patterns and dependencies inherent in the data, which can lead to more meaningful imputations that preserve the seasonality of the time series.
Conclusion
In this article, we have discussed four methods of handling missing values in time series. Addressing missing values in time series data is a critical step in the data preprocessing and exploration phase. By employing suitable techniques such as forward filling, backward filling, linear interpolation, or seasonal/trend decomposition, we can ensure the integrity and completeness of the data, enabling more accurate and reliable time series analysis.
Preprocessing and exploring time series data involve several steps, and dealing with missing values is a critical component that should be prioritized. By addressing missing values early on, we ensure that subsequent processes can be carried out smoothly and accurately.