Feature Engineering for Time Series Problems

Rajat Pal
Published in AlmaBetter
3 min read · Jun 5, 2021

Time-series data is defined as a collection of data gathered at regular intervals of time, with each consecutive data point in the series dependent on the previous data point. The time gap could be expressed in years, months, days, minutes, or even seconds.

What is the best way to solve a time series problem?

One option is ARIMA, a univariate model; it requires only a time-ordered target variable.
Alternatively, the time series problem can be framed as a supervised machine learning task.
To better understand how supervised machine learning techniques can be applied to time series data, we will attempt to forecast the closing price of the Nifty50.
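As a minimal sketch of the supervised framing (the prices and column names below are made up, not taken from the actual Nifty50 dataset): shifting the series backwards by one step turns next-day forecasting into an ordinary regression problem.

```python
import pandas as pd

# Hypothetical closing prices; values are illustrative only.
prices = pd.DataFrame({'close': [100.0, 102.0, 101.0, 105.0, 107.0]})

# Supervised framing: the target for each row is the next day's close.
prices['target'] = prices['close'].shift(-1)

# The last row has no next-day value, so drop it before training.
supervised = prices.dropna()
print(supervised)
```

Any regression model can now be trained on `supervised`, with `close` (and any engineered features) as inputs and `target` as the label.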

Nifty50 dataset
  • Open: share price when the market opens
  • High: highest share price for the day
  • Low: lowest share price for the day
  • Close: share price when the market closes

If we train our model on these variables alone, it will have a significant bias, so we will use feature engineering to reduce the model’s bias.

Lag features

One of the characteristics of time series data is that it is highly correlated with data from previous days. A value from an earlier day is referred to as a lag: yesterday’s value is lag 1, the day before yesterday’s is lag 2, and so on.

# Function to create lag features
def create_lag_variable(feature, n):
    for i in range(n):
        final_df[feature + str(i + 1)] = final_df[feature].shift(i + 1)

(Figure: lag columns added for each feature in the data frame)

So, with a lag of 1, the values from 3 January are mapped onto the 4 January row, and as the lag increases, so does the gap: with a lag of 3, the 3 January values are mapped onto the 6 January row, and so on.
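This mapping can be seen on a toy data frame (the dates and prices are illustrative; the helper mirrors the create_lag_variable function above):

```python
import pandas as pd

# Toy series indexed by date, so the shift is easy to follow.
final_df = pd.DataFrame(
    {'close': [100.0, 102.0, 101.0, 105.0]},
    index=pd.date_range('2021-01-03', periods=4)
)

def create_lag_variable(feature, n):
    # shift(k) moves each value k rows forward, creating the lag-k column.
    for i in range(n):
        final_df[feature + str(i + 1)] = final_df[feature].shift(i + 1)

create_lag_variable('close', 2)
print(final_df)
```

With a lag of 1, the 3 January close (100.0) appears on the 4 January row; with a lag of 2 it appears on the 5 January row, and the first rows of each lag column are NaN because there is no earlier data.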

Rolling Window Features

We pick a window size and compute a statistic over all the data points in that window. As we move through the dataset, the data points inside the window change, which is why it is called a rolling window.

We also know that yesterday’s value is more closely related to today’s than a value from two or three days ago, so I added a feature that computes an exponentially weighted moving average (EWMA), which gives more weight to recent values.

# Function to create features based on moving average and EWMA
def moving_avg(df, col, day):
    var_name = col + str(day)
    df[var_name + '_ma'] = df[col].rolling(window=day, min_periods=1).mean().shift(1)
    df[var_name + '_ewma'] = df[col].ewm(com=day).mean()
    return df

(Figure: moving-average and EWMA columns in the data frame)

Looking at the dataset with windows of 3 and 5: the shift of one row means each row only uses past days, so the very first moving-average value is NaN, and the fourth row of the window-3 column holds the average of the first three values. The exponentially weighted mean, on the other hand, starts from the very first value.
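A rough sketch of what these two features look like on made-up prices (column names are illustrative, following the same rolling/ewm calls as the function above):

```python
import pandas as pd

# Hypothetical closing prices; values are illustrative only.
df = pd.DataFrame({'close': [100.0, 102.0, 101.0, 105.0, 107.0]})

# 3-day moving average, shifted by one so each row only sees past days.
df['close3_ma'] = df['close'].rolling(window=3, min_periods=1).mean().shift(1)

# Exponentially weighted moving average; recent days get more weight.
df['close3_ewma'] = df['close'].ewm(com=3).mean()

print(df)
```

Because of the shift, the first moving-average entry is NaN, while the EWMA column starts with the first close itself; min_periods=1 lets partial averages appear before a full window of history exists.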

Domain-Specific Features

We can also create domain-specific features: someone with deep domain expertise can devise features that boost accuracy, because such features tend to be highly correlated with the target.

One feature I created was volatility, computed as the difference between the high and low prices; it captures the price range for a given day.
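A minimal sketch of the volatility feature, using made-up OHLC values:

```python
import pandas as pd

# Hypothetical high/low prices for three days; values are illustrative only.
df = pd.DataFrame({
    'high': [105.0, 108.0, 103.0],
    'low':  [ 99.0, 101.0, 100.0],
})

# Volatility as the day's high-low range.
df['volatility'] = df['high'] - df['low']
print(df['volatility'].tolist())  # → [6.0, 7.0, 3.0]
```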

We can also bring in entirely new features, such as the closing share prices of the companies that make up the Nifty50, the index of the 50 largest companies listed on India’s National Stock Exchange.

These are a few of the features I created while working on this time series problem.

I hope you enjoyed this blog; please contact me if you have any questions or would like to share feedback.
