Demystifying Feature Engineering and Selection for Driver-Based Forecasting

Create and extract key features using third-party data and methods like random forest, lasso regression and recursive feature elimination!

Indraneel Dutta Baruah
Analytics Vidhya
11 min read · Sep 6, 2020


Welcome to the second part of my 3-blog series on creating a robust driver-based forecasting engine. The first part gave a brief introduction to time series analysis and the tools needed to make sense of time series datasets and clean them up (link here). We will now look at the next step in our analysis.

While working with one of the leading data analytics teams in India, I have realized that there are two key elements which lead to actionable insights for our clients: feature engineering and feature selection. Feature engineering refers to the process of creating new variables from existing ones which capture hidden business insights. Feature selection involves making the right choices about which variables to use in our forecasting models. Both of these skills are a combination of art and science and need some practice to perfect.

In this article, we will explore the different types of features which are commonly engineered during forecasting projects and the rationale for using them. We will also look at a comprehensive set of methods that we can use to select the best features, and a handy way to combine all these methods. To dig deeper into feature analysis, one can refer to the book “Feature Engineering and Selection: A Practical Approach for Predictive Models” by Max Kuhn and Kjell Johnson.

Index:

  • Feature Engineering (Lags, Periodic difference, Flags for events etc.)
  • Feature Selection Methods (Correlation, Lasso Regression, Recursive Feature Elimination, Random Forest, Beta Coefficients)
  • Combining Feature Selection Methods
  • Final words

Feature Engineering

It is common practice to use different ways to represent data fields in a model, and usually some of these representations are better than others. This is the basic idea behind feature engineering: the process of creating representations of data that increase the effectiveness of a model. We will be discussing some tried and tested features created from multivariate time series data and the rationale behind each of them. The dataset being used here is the same as in the previous blog in this series: a daily dataset on Hong Kong flat prices along with 12 macroeconomic variables.

Feature 1: Lags

Lag features are values at prior time steps. For example, lag 1 of a variable at time t is its value from the previous period, t-1. As the name suggests, the hypothesis here is that the features have a lagged impact on the target variable. The best way to find the optimal number of lags to choose for each field is to look at cross-correlation graphs. Cross-correlation graphs show the correlation between the target variable (here, the private domestic price index of flats) and various lags of the raw features (sales, money supply, etc.).

# Check optimal number of lags using cross-correlation plots
from matplotlib import pyplot

for i in range(3, 12):
    pyplot.figure(i, figsize=(15, 2))  # create a separate figure for each feature
    pyplot.xcorr(df.iloc[:, 2], df.iloc[:, i], maxlags=45, usevlines=True)
    pyplot.title('Cross Correlation Between Private Domestic (Price Index) and ' + df.columns[i])
    pyplot.show()
  • There is a high level of correlation for a large number of lags between the target variable and most of the features.
  • For this exercise we created 3 lags for each column with the ‘shift’ function, as shown below. Ideally, we should add lags up to the point where the cross correlation drops off drastically. (A sketch for combining the lagged frames into a single table follows the snippet.)
#Add Lags
predictors2 = predictors1.shift(1)
predictors2 = predictors2.add_suffix('_Lag1')
predictors3 = predictors1.shift(2)
predictors3 = predictors3.add_suffix('_Lag2')
predictors4 = predictors1.shift(3)
predictors4 = predictors4.add_suffix('_Lag3')
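
Finally, the raw and lagged frames need to be stitched back into a single predictor table. Below is a minimal sketch of that step, assuming predictors1 holds the raw features and predictors2 to predictors4 the lagged copies created above; the original notebook may combine the frames slightly differently.

# Combine raw and lagged features into one table (a sketch; any differenced
# features created later can be appended in the same way)
import pandas as pd

predictors = pd.concat([predictors1, predictors2, predictors3, predictors4], axis=1)
predictors = predictors.dropna()  # the first 3 rows are empty because of the shifts; align the target to the same rows
predictors.head(5)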

Feature 2: Periodic Difference

This feature is calculated as the difference between the current value of a variable and its previous value. The expectation is that the change in a variable has a stronger relationship with the target than the raw variable itself. We can calculate the periodic difference using the ‘diff’ function (as shown below).

#Add Periodic Difference
predictors5 = predictors1.diff()
predictors5 = predictors5.add_suffix('_Diff')

Similarly, one can also calculate month-on-month, year-on-year or quarter-on-quarter changes depending on the business problem.
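
As an illustration, here is a hedged sketch of such period-over-period features for a series resampled to monthly frequency. The resampling step and the frame names below are illustrative and not taken from the original notebook, and it assumes df['Date'] is already a datetime column.

# Month-on-month and year-on-year changes for a monthly series (illustrative)
import pandas as pd

monthly = df.set_index('Date').select_dtypes('number').resample('MS').mean()
mom_change = monthly.diff(1).add_suffix('_MoM')           # month-on-month change
yoy_change = monthly.diff(12).add_suffix('_YoY')          # year-on-year change
yoy_pct = monthly.pct_change(12).add_suffix('_YoY_pct')   # year-on-year % change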

Feature 3: Flags for events

Sometimes highlighting key events like holidays helps the model predict better. For example, if we are trying to predict garment sales, then flags for holidays like Christmas, Thanksgiving etc. usually help the model capture the sudden peaks during such periods. In our case, we created holiday flags using the pandas holiday ‘calendar’ (as shown below).

# Create holiday flag
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar  # holiday calendar assumed from the original notebook's imports

df['Date'] = pd.to_datetime(df['Date'])
cal = calendar()
holidays = cal.holidays(start=df['Date'].min(), end=df['Date'].max())
df['holidays'] = np.where(df['Date'].dt.date.isin(holidays), 1, 0)
df.head(5)
  • To create a flag for events, it is important to have a date column in the correct date-time format (example: “YYYY-MM-DD”).
  • We create a separate list of holiday dates, and every value in the dataset’s date column that matches an entry in this list is flagged as a holiday.

Apart from these common features, we should try to create additional features, guided by the business context, which have a strong relationship with the target variable. For example, we created a new feature ‘First Hand Price’ as the ratio between ‘First Hand sales amount’ and ‘First Hand quantity’.

#Create a new feature =  First hand price
df['First hand sales price'] = df['First hand sales amount']/df['First hand sales quantity']
predictors1 = df.drop(columns=['Date','Private Domestic (Price Index)','holidays','First hand sales amount','First hand sales quantity'],axis=1) # Add if applicable
predictors1.head(5)

Feature 4: Financial data

It is very easy to download financial data like stock prices and indexes in Python. There are many options out there (Quandl, Intrinio, AlphaVantage, Tiingo, IEX Cloud, etc.); however, Yahoo Finance is considered the most popular as it is the easiest to access (free and no registration required). Although we do not use these features in our exercise, I have added the relevant pieces of code needed to extract the following:

  1. Stock Price of a company:
#### Additional features: Stock Price
import pandas as pd
import yfinance as yf
from yahoofinancials import YahooFinancials

yahoo_financials = YahooFinancials('CRM')
stock_price = yahoo_financials.get_historical_price_data(start_date='2018-08-01',
                                                          end_date='2020-08-01',
                                                          time_interval='monthly')
stock_price_df = pd.DataFrame(stock_price['CRM']['prices'])
stock_price_df = stock_price_df.drop('date', axis=1).set_index('formatted_date')
stock_price_df = pd.DataFrame(stock_price_df['close'])
stock_price_df.columns = ['Stock_Price']
stock_price_df

2. Stock price index of the competitors of a company

# Additional features: Stock price index from multiple companies
# (this snippet sits inside a loop over competitor tickers; see the sketch below
# for one way the per-ticker frame 'stock_price_df_c' can be built)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# standardise the current competitor's closing prices
scaled_stock = scaler.fit_transform(stock_price_df_c)
column_name = competitor_stock[i]
stock_price_df_comp[column_name] = scaled_stock[:, 0]

# average the standardised competitor stocks into a single index
col = stock_price_df_comp.loc[:, "ADBE":"ORCL"]
stock_price_df_comp['Competitor_Stock_Index'] = col.mean(axis=1)
stock_price_df_comp = pd.DataFrame(stock_price_df_comp['Competitor_Stock_Index'])
stock_price_df_comp
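
The snippet above assumes stock_price_df_c (one competitor's closing prices) and a ticker list competitor_stock already exist; those pieces come from earlier cells of the original notebook that are not shown here. A hypothetical end-to-end version of the loop, reusing YahooFinancials from the previous example (the ticker list is purely illustrative), could look like this:

# Hypothetical download-and-standardise loop for competitor stocks
import pandas as pd
from sklearn.preprocessing import StandardScaler
from yahoofinancials import YahooFinancials

competitor_stock = ['ADBE', 'ORCL']  # illustrative tickers, not from the original analysis
stock_price_df_comp = pd.DataFrame()
scaler = StandardScaler()

for ticker in competitor_stock:
    raw = YahooFinancials(ticker).get_historical_price_data(start_date='2018-08-01',
                                                            end_date='2020-08-01',
                                                            time_interval='monthly')
    stock_price_df_c = pd.DataFrame(raw[ticker]['prices']).set_index('formatted_date')[['close']]
    stock_price_df_comp[ticker] = scaler.fit_transform(stock_price_df_c)[:, 0]

# average the standardised series into a single competitor index
stock_price_df_comp['Competitor_Stock_Index'] = stock_price_df_comp.mean(axis=1)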

3. S&P 500

# Additional features: S&P 500
import datetime
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like  # compatibility shim for older pandas_datareader versions
import pandas_datareader.data as web

start = datetime.datetime(2018, 8, 1)
end = datetime.datetime(2020, 8, 1)
SP500 = web.DataReader(['sp500'], 'fred', start, end)
SP500_df = SP500.resample('MS').mean()
SP500_df

Feature Selection Methods

Feature selection is a critical step in most data science projects as it enables models to train faster, reduces complexity and makes the results easier to interpret. It can also improve model performance and reduce the problem of overfitting if the optimal set of features is chosen. We will be discussing various methods and their respective rules for selecting the best features.

Method 1: Variable Importance from Random Forest

Random forests consist of multiple decision trees, each built on a random sample of the observations in the dataset and a random sample of the features. This random selection keeps the trees largely uncorrelated and therefore less susceptible to over-fitting. For forecasting exercises, we use the variable importance scores of the random forest, which measure how much accuracy decreases when a variable is excluded. To learn more about random forest and variable importance, please refer to this detailed blog by Niklas Donges.

# 1. Select the top n features based on feature importance from random forest
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

np.random.seed(10)
# define and fit the model
model = RandomForestRegressor(random_state=10)
model.fit(predictors, Target)
# get the importances and plot the top 30 features
feat_importances = pd.Series(model.feature_importances_, index=predictors.columns)
feat_importances.nlargest(30).plot(kind='barh')
  • We first fit the random forest model and then extract the variable importance.
#Final Features from Random Forest (Select Features with highest feature importance)
rf_top_features = pd.DataFrame(feat_importances.nlargest(7)).axes[0].tolist()
  • We check the feature importance plot for the top 30 variables. We can see that the 2-period lag of ‘M3 (HK$ million)’ is the most important feature, and the variable importance drops off after the 7th feature, so we chose 7 features for this analysis.
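
The importance scores used above are scikit-learn's impurity-based values. The "how much does accuracy drop when a variable is excluded" idea described earlier maps more directly onto permutation importance, which can be computed on the same fitted model. A minimal sketch, reusing model, predictors and Target from the snippet above:

# Permutation importance on the fitted random forest (a sketch)
import pandas as pd
from sklearn.inspection import permutation_importance

perm = permutation_importance(model, predictors, Target, n_repeats=10, random_state=10)
perm_importances = pd.Series(perm.importances_mean, index=predictors.columns)
perm_importances.nlargest(7)  # compare with the impurity-based ranking above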

We can optimise the random forest model by tuning its hyperparameters in case the features selected by the default model are not satisfactory (see the sketch below). This is an example of an embedded method, where feature selection comes as part of training the model itself rather than from a separate search over candidate feature subsets.
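
For example, a minimal tuning sketch with GridSearchCV might look like this; the parameter grid is illustrative and not from the original analysis.

# Tune the random forest before extracting importances (illustrative grid)
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
search = GridSearchCV(RandomForestRegressor(random_state=10), param_grid,
                      cv=5)  # for time series data, TimeSeriesSplit is often a better choice of cv
search.fit(predictors, Target)
tuned_importances = pd.Series(search.best_estimator_.feature_importances_,
                              index=predictors.columns)
tuned_importances.nlargest(7)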

Method 2: Pearson Correlation

We try to find the features most strongly related to the target using the absolute value of the Pearson correlation. This is an example of a filter method, where features are selected on the basis of their scores in various statistical tests.

#2.Select the top n features based on absolute correlation with target variable
corr_data1 = pd.concat([Target,predictors],axis = 1)
corr_data = corr_data1.corr()
corr_data = corr_data.iloc[: , [0]]
corr_data.columns.values[0] = "Correlation"
corr_data = corr_data.iloc[corr_data.Correlation.abs().argsort()]
corr_data = corr_data[corr_data['Correlation'].notna()]
corr_data = corr_data.loc[corr_data['Correlation'] != 1]
corr_data.tail(20)
  • We calculate the correlation of each feature with the target variable and sort the features by the absolute values of their correlation.
# Select Features with greater than 90% absolute correlation
corr_data2 = corr_data.loc[corr_data['Correlation'].abs() > .9]
corr_top_features = corr_data2.axes[0].tolist()
  • We select the 12 features with greater than 90% absolute correlation; the threshold can be adjusted based on the business context.

Method 3: L1 regularisation using Lasso regression

Lasso, or L1 regularisation, is based on its ability to shrink some of the coefficients in a linear regression to exactly zero. Such features can therefore be removed from the model. This is another example of an embedded method of feature selection.

# 3. Select the features identified by Lasso regression
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

np.random.seed(10)
estimator = LassoCV(cv=5, normalize=True)  # note: 'normalize' was removed in newer scikit-learn; standardise the data beforehand instead
sfm = SelectFromModel(estimator, prefit=False, norm_order=1, max_features=None)
sfm.fit(predictors, Target)
feature_idx = sfm.get_support()
Lasso_features = predictors.columns[feature_idx].tolist()
Lasso_features
  • We specify the Lasso regression model and then use the ‘SelectFromModel’ function, which selects the features whose coefficients are non-zero.

To learn more about feature selection using Lasso Regression, please refer to this paper by Valeria Fonti.

Method 4: Recursive Feature Elimination (RFE)

RFE is a greedy optimization algorithm which repeatedly builds models and sets aside the best or worst performing feature at each iteration. It then constructs the next model with the remaining features, until all the features have been exhausted, and finally ranks the features based on the order of their elimination.

# 4. Perform recursive feature elimination to rank the features
# (RFECV can be used instead to cross-validate the best number of features; see the sketch below)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=7, step=10, verbose=5)
rfe_selector.fit(predictors, Target)
rfe_support = rfe_selector.get_support()
rfe_feature = predictors.loc[:, rfe_support].columns.tolist()
rfe_feature

We use linear regression to perform the recursive feature elimination and select the top 7 ranked fields (a cross-validated variant is sketched below). This is a wrapper method of feature selection, which attempts to find the “optimal” feature subset by iteratively selecting features based on model performance.
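If we would rather let the data decide how many features to keep, scikit-learn's RFECV wraps the same idea with cross-validation. A minimal sketch, reusing predictors and Target:

# Cross-validated recursive feature elimination (a sketch)
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rfecv_selector = RFECV(estimator=LinearRegression(), step=1,
                       cv=5)  # TimeSeriesSplit may be preferable for time series data
rfecv_selector.fit(predictors, Target)
rfecv_features = predictors.columns[rfecv_selector.get_support()].tolist()
print(rfecv_selector.n_features_, rfecv_features)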

Method 5: Beta Coefficients

The absolute value of the coefficients of a standardized regression, also known as beta coefficients, can be considered a proxy for feature importance. This is a type of filter method of feature selection.

# 5. Select the top n features based on the absolute value of the beta coefficients
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# standardise the predictors and the target
scaler = StandardScaler()
scaled_predictors = scaler.fit_transform(predictors)
scaled_Target = scaler.fit_transform(Target)
# fit a standardised (beta) regression without an intercept
sr_reg = LinearRegression(fit_intercept=False).fit(scaled_predictors, scaled_Target)
coef_table = pd.DataFrame(list(predictors.columns)).copy()
coef_table.insert(len(coef_table.columns), "Coefs", sr_reg.coef_.transpose())
coef_table = coef_table.iloc[coef_table.Coefs.abs().argsort()]
sr_data2 = coef_table.tail(10)
sr_top_features = sr_data2.iloc[:, 0].tolist()
sr_top_features
  • We standardise the dataset and run a standardised regression with all the features included in it.
  • We select 10 features based on the highest absolute value of beta coefficients.

Combining Feature Selection Methods

Each of these methods is good at capturing a particular type of relationship between the features and the target variable. For example, beta coefficients are good at identifying linear relationships, while random forest is suited to spotting non-linear ones. Based on my experience, combining the results from multiple methods leads to more robust feature sets. We will look at one way to do so in this section.

# Combine the features selected by all five methods
combined_feature_list = sr_top_features + Lasso_features + rfe_feature + corr_top_features + rf_top_features
# count the number of methods in which each feature was selected
combined_feature = {x: combined_feature_list.count(x) for x in combined_feature_list}
combined_feature_data = pd.DataFrame.from_dict(combined_feature, orient='index')
combined_feature_data.rename(columns={combined_feature_data.columns[0]: "number_of_models"}, inplace=True)
combined_feature_data = combined_feature_data.sort_values(['number_of_models'], ascending=[False])
combined_feature_data.head(100)
  • We combine the lists of features from each method into one and then count how many times each feature appears in that list. This is the number of methods/models in which the feature was selected.
  • We select the features which were chosen by a majority of the methods. In our case we have 5 methods, so we kept the features selected in at least 3 out of 5 of them, which left us with 3 features.
# Final features: features which were selected by at least 3 methods
combined_feature_data = combined_feature_data.loc[combined_feature_data['number_of_models'] > 2]
final_features = combined_feature_data.axes[0].tolist()
final_features

The benefit of using this approach to combine feature selection methods is the flexibility to add or remove methods as per the business problem at hand. The Markov blanket from a Bayesian network, forward/backward stepwise regression and feature importance from XGBoost are some additional feature selection methods which can be considered (a short XGBoost sketch follows).
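
As an illustration of one of these additional methods, here is a minimal sketch of feature importance from XGBoost, which could be added as a sixth vote in the combination above. It assumes the xgboost package is installed, and the parameters are illustrative.

# Feature importance from XGBoost as an additional voting method (a sketch)
import pandas as pd
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=300, random_state=10)
xgb_model.fit(predictors, Target)
xgb_importances = pd.Series(xgb_model.feature_importances_, index=predictors.columns)
xgb_top_features = xgb_importances.nlargest(7).index.tolist()
# xgb_top_features could then be appended to combined_feature_list before the majority vote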

Final Words

With this, Part 2 of the 3-part blog series comes to an end. Readers should now be able to extract the kinds of features that are relevant for most forecasting exercises from their datasets. They can also add relevant 3rd-party datasets like holiday flags and financial indexes as new features to supplement the raw dataset. Apart from engineering these features, readers now have an extensive toolkit for selecting the best features based on multiple methods. I would like to highlight again that feature engineering and selection is a combination of art and science, and one becomes better at it with experience. The code for the entire analysis can be found here.

Do read the final part of this series where we create an extensive pipeline of tried and tested forecasting models by combining time series and machine learning techniques.

Do you have any questions or suggestions about this blog? Please feel free to drop in a note.

Thank you for reading!

If you, like me, are passionate about AI, Data Science, or Economics, please feel free to add/follow me on LinkedIn, Github and Medium.
