Feature Engineering: Transform Your Raw Data into Accurate Revenue Predictions

Learn how to boost your model’s accuracy with practical feature engineering strategies.

Asish Biswas
AnalyticSoul
4 min read · May 28, 2024


Welcome to the third lesson of chapter 2. In this chapter we’ll learn and practice feature engineering, one of the most important techniques in machine learning for creating meaningful features that boost model performance. Remember that not all features are useful for training a machine learning model: some improve performance, while others only add bias. In this lesson, we’ll apply our data wrangling and transformation techniques to extract valuable insights that will improve our linear regression predictions. Our goal is to build a linear regression model that can predict a customer’s future spending based on historical data.

Steps of building a machine learning system

Load the dataset

Before everything, let’s first import the important libraries and load the dataset.

import datetime as dt

import pandas as pd
import plotly.express as px
import seaborn as sns

df_retail = pd.read_csv('data/retail_transactions.csv')

Data wrangling

In the previous lesson, we learned exploratory data analysis techniques. Based on that knowledge, we will shape and prepare our data for our model.

# remove unnecessary columns
df_retail = df_retail.drop(columns=['StockCode', 'Description'])

# keep UK records only
df_retail = df_retail[df_retail['Country'] == 'United Kingdom']

# fix the data type and parse datetime
df_retail['CustomerID'] = df_retail['CustomerID'].astype(str)
df_retail['InvoiceDate'] = pd.to_datetime(df_retail['InvoiceDate']).dt.normalize()

# calculate revenue and transaction year
df_retail['Revenue'] = df_retail['UnitPrice'] * df_retail['Quantity']
df_retail['Year'] = df_retail['InvoiceDate'].dt.year

# take a look at the current dataset
df_retail.info()

First, we remove the StockCode and the Description fields, as they don't add value in predicting customer revenue.

Next, as we noticed in the last lesson, the majority of the transactions took place in the UK, so let’s keep only the records from the UK. That way we don’t introduce unnecessary bias.

We have also seen in the previous lesson that the CustomerID datatype was float, and the InvoiceDate was string. Let's fix that.
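As a small self-contained sketch of what these two conversions do (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# toy frame mimicking the raw columns: CustomerID loaded as float,
# InvoiceDate loaded as a plain string with a time component
toy = pd.DataFrame({
    'CustomerID': [17850.0, 13047.0],
    'InvoiceDate': ['2019-12-01 08:26:00', '2019-12-01 08:28:00'],
})

# float -> string, and string -> datetime with the time-of-day zeroed out
toy['CustomerID'] = toy['CustomerID'].astype(str)
toy['InvoiceDate'] = pd.to_datetime(toy['InvoiceDate']).dt.normalize()

print(toy.dtypes)
print(toy['InvoiceDate'].iloc[0])  # midnight timestamp: time is stripped
```

Note that `.dt.normalize()` keeps the datetime dtype but resets the clock to midnight, which makes day-level comparisons straightforward later on.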

Next, let’s calculate the Revenue by multiplying UnitPrice and Quantity.

Now, let’s obtain the Year from the InvoiceDate. This will become a vital element when we move to the model-building steps.

Finally, let’s take a look at the DataFrame.

DataFrame

Feature creation

We have the base dataset ready. Now let’s shift gears and summarize the total revenue for each invoice. To do that, we’ll use the groupby() method of the pandas DataFrame.

df_retail = df_retail.groupby('InvoiceNo').agg(
    {
        'Revenue': 'sum',
        'CustomerID': 'first',
        'InvoiceDate': 'first',
        'Year': 'first'
    }
)

df_retail.head()

We summarize our dataframe and eliminate the Country, UnitPrice, and Quantity columns. Don’t worry, since we are only focusing on transactions from the UK, we don’t need a column for the country, and the newly created column Revenue will retain the price and quantity-related information.
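To see how this aggregation behaves, here is a minimal sketch with made-up invoice lines (two lines on invoice A, one on B): sum collapses the revenue, while 'first' keeps the per-invoice values that are identical on every line.

```python
import pandas as pd

# toy invoice lines: each row is one line item on an invoice
lines = pd.DataFrame({
    'InvoiceNo':  ['A', 'A', 'B'],
    'Revenue':    [10.0, 5.0, 7.0],
    'CustomerID': ['c1', 'c1', 'c2'],
})

invoices = lines.groupby('InvoiceNo').agg({
    'Revenue': 'sum',       # total revenue per invoice
    'CustomerID': 'first',  # identical on every line of an invoice
})

print(invoices)  # invoice A totals 15.0, invoice B totals 7.0
```

Using 'first' is safe here precisely because CustomerID (like InvoiceDate and Year in the article's frame) never varies within a single invoice.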

More Feature Creation

Now we want to narrow our focus to the transactions of 2019 only, because based on the activities of 2019 we want to predict how much a customer will spend in 2020.

# slice out only 2019 transactions
df_retail_19 = df_retail[df_retail['Year'] == 2019].copy()

# days between each invoice and 31 Dec 2019 (uses `import datetime as dt`)
df_retail_19['DaysSince'] = (
    dt.datetime(year=2019, month=12, day=31) - df_retail_19['InvoiceDate']
).apply(lambda x: x.days)

# calculate revenue per customer
df_retail_19 = df_retail_19.groupby('CustomerID').agg({
    'Revenue': 'sum',
    'DaysSince': ['max', 'min', 'nunique']
})

# flatten DaysSince (max, min, nunique), and create the AvgOrderCost column
df_retail_19.columns = ['_'.join(col).strip() for col in df_retail_19.columns.values]
df_retail_19['AvgOrderCost'] = df_retail_19['Revenue_sum'] / df_retail_19['DaysSince_nunique']

revenue_2020 = df_retail[df_retail['Year'] == 2020].groupby('CustomerID')['Revenue'].sum()

print(df_retail_19.shape)
df_retail_19.head()

First, we copy the 2019 transactions into a new DataFrame.

Then, we introduce another new feature, DaysSince. This shows, as of 31st December 2019, how many days ago each transaction was made.
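A quick sketch of that date arithmetic, with two made-up invoice dates: subtracting datetimes yields Timedelta objects, and `.days` extracts the whole-day count.

```python
import datetime as dt
import pandas as pd

# made-up invoice dates: one near year-end, one on New Year's Day
dates = pd.Series(pd.to_datetime(['2019-12-29', '2019-01-01']))

# distance in whole days from the 31 Dec 2019 reference point
days_since = (dt.datetime(2019, 12, 31) - dates).apply(lambda x: x.days)

print(days_since.tolist())  # [2, 364]
```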

Then, we calculate each customer’s total spending in the year 2019. Using the DaysSince column, we also identify the earliest and latest interactions, along with the number of distinct days on which the customer made purchases.

Now, flatten the sub-columns of DaysSince (max, min, nunique).

Then we calculate each customer’s average order cost by dividing their total revenue by the number of distinct purchase days.
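A tiny end-to-end sketch of the flattening and the division, using made-up numbers: aggregating with a list of functions produces MultiIndex columns like ('DaysSince', 'max'), which the '_'.join trick collapses into flat names.

```python
import pandas as pd

# toy per-customer transactions: c1 bought on 2 distinct days, c2 on 1
df = pd.DataFrame({
    'CustomerID': ['c1', 'c1', 'c2'],
    'Revenue':    [10.0, 20.0, 7.0],
    'DaysSince':  [5, 30, 12],
})

agg = df.groupby('CustomerID').agg({
    'Revenue': 'sum',
    'DaysSince': ['max', 'min', 'nunique'],
})

# columns are tuples like ('DaysSince', 'max'); join them into flat names
agg.columns = ['_'.join(col).strip() for col in agg.columns.values]
agg['AvgOrderCost'] = agg['Revenue_sum'] / agg['DaysSince_nunique']

print(list(agg.columns))
print(agg.loc['c1', 'AvgOrderCost'])  # 30.0 revenue over 2 days -> 15.0
```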

Finally, we take each customer’s total revenue for 2020 and prepare the final dataset by joining it with the 2019 features. We’ll use this as the label, or dependent variable, in the model-training phase.
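The join itself can be sketched like this (with made-up customers and amounts): an inner concat along the columns keeps only the customers present in both years, so every remaining row has both features and a label.

```python
import pandas as pd

# made-up 2019 feature table and 2020 label series, indexed by customer
feats_19 = pd.DataFrame({'revenue_2019': [100.0, 50.0]}, index=['c1', 'c2'])
rev_20 = pd.Series([80.0], index=['c1'], name='Revenue')

# inner join: only customers active in BOTH years survive
combined = pd.concat([feats_19, rev_20], axis=1, join='inner')

print(combined)  # only 'c1' remains, with feature and label columns
```

Customers who churned after 2019 (here, 'c2') are dropped by the inner join, which is what we want when the 2020 spend is the training label.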

Data Wrangling

wrangled_df = pd.concat([df_retail_19, revenue_2020], axis=1, join="inner")
wrangled_df = wrangled_df.rename(columns={
    'Revenue_sum': 'revenue_2019',
    'DaysSince_max': 'customer_relation_days',
    'DaysSince_min': 'last_purchase_days',
    'DaysSince_nunique': 'number_of_purchases',
    'AvgOrderCost': 'avg_order_cost',
    'Revenue': 'revenue_2020'})
wrangled_df.index.name = 'customer_id'

print(wrangled_df.head())

This creates a combined dataset containing each customer’s activity for the year 2019 along with that customer’s total spending for 2020.

We’ve got the base dataset ready. In the next chapter, we’ll take a look at the final features and prepare for the model building.

Play around with the code in the Jupyter Notebook.

What’s Next?

Happy learning! See you on the other side :-)
