Predict Next Month Transaction with Linear Regression (Part 2)

Exploratory Data Analysis and Feature Engineering

Leah Nguyen
6 min readJun 18, 2022
Photo by Christopher Gower on Unsplash

In Part 1 of this blog, we worked on basic data analysis and understanding the transaction dataset following the CRISP-DM methodology. We found out that our problem is the supervised regression problem from looking at the data type of the target variable — monthly amount.

For the Part 1, please visit-

In this part, we will continue to work on the Exploratory Data Analysis of the data, which will help to uncover business insights for the later modelling stage as well as perform feature engineering for variables selection.

You can view all my code using for this project on GitHub.

Exploratory Data Analysis (Part 2) — The Business Insights

Transaction amount vs. transaction number trend over time

The number of transactions and the total amount of sales rose sharply throughout the years, from 2013 to 2017. The seasonal trend can be found in the total amount of sales while the up trend for the number of transactions is quite smooth.

We could see that there is a seasonal pattern, although the trend is not clear yet. To investigate more, we would make a yearly polar plot:

Seasonal Trend Over the Years

A closer examination reveals that the total volume of transaction amount increases significantly from January to October and subsequently decreases from November to the end of the year. This pattern can be ascribed to the fact that people trade less during the holidays, particularly during the month surrounding big holidays like Christmas and New Year.

This, however, might be based on a variety of different factors rather than on individual conclusions about each region or industry. As a result, additional information is required to substantiate these hypotheses.

Transaction amount by Location vs. Industry

When looking at the monthly amount by location and industry, it is not surprising that total sales of locations 1 and 2 increased significantly compared to other locations. Meanwhile, in terms of industry, industry 2, 3 and 1 shows rapid growth over the years while others’ progress is quite slow.

Data Preparation

Feature Engineering

When the data has been fully understood, data scientists generally need to go back to the data collection and data cleaning phases of the data science pipeline so as to transform the data set as per the expected business outcomes. To expand the information that is already at hand and better represent the information we have, the best practice is to perform Data Preparation or Feature Engineering, meaning the creation of new features from the ones already existing.

In this case study, the data will need to be modified as we will be applying a linear regression model later on.

# write a reusable function
aggregate_transactions <- function(df) {

# aggregate the data, grouping by date, industry and location,
# and calculating the mean monthly_amount
output = df %>%
group_by(date, industry, location) %>%
summarize(monthly_amount = mean(monthly_amount, na.rm = TRUE))

# create a column for the month number and another one for year number
output = output %>%
# create new column for month number
mutate(month_number = format(as.Date(date), "%m")) %>%
# create new column for month number
mutate(year_number = format(as.Date(date), "%Y"))

# Make sure the new columns are of the correct type
output$month_number = as.character(output$month_number)
output$year_number = as.character(output$year_number)

transform(output, month_number = as.integer(month_number), year_number = as.integer(year_number))
return(output)
}

# create a new variable that store new df with transformed features
aggregated_transactions <- aggregate_transactions(df)
# A tibble: 3,886 x 6
# Groups: date, industry [470]
# date industry location monthly_amount month_number year_number
# <date> <chr> <chr> <dbl> <chr> <chr>
# 1 2013-01-01 1 1 136081. 01 2013
# 2 2013-01-01 1 10 188735. 01 2013
# 3 2013-01-01 1 2 177840. 01 2013
# 4 2013-01-01 1 3 141632. 01 2013
# 5 2013-01-01 1 4 221058. 01 2013
# 6 2013-01-01 1 5 178138. 01 2013
# 7 2013-01-01 1 6 133400. 01 2013
# 8 2013-01-01 1 7 231599. 01 2013
# 9 2013-01-01 1 8 143778. 01 2013
# 10 2013-01-01 1 9 157416. 01 2013
# ... with 3,876 more rows


# turn the df into a Markdown table format
rmarkdown::paged_table(aggregated_transactions)
A snapshot of new feature engineering variables

An aggregated data set using the fields date, industry and location, with a mean of the monthly amount is created. There are a total of 3,886 rows with each row presenting a mean of a monthly amount ranging from 2013 to 2016.

Train-Test split

Now that we have a new adjusted data set, I’m going to split the data into train and the test set for the aim of building a prediction model. The train set includes three years of data from 2013 to 2016 while the test set includes one last year of data, 2016.

Additionally, we have 2 requirements for this assignment, which are:

  1. Basic Model Fitting: Developing a linear regression model with monthly_amount as the target for industry = 1 and location = 1.
  2. Advanced Model Fitting: Developing a linear regression model with monthly_amount as the target for all industries and locations.

I will generate an additional data set that filters only Industry 1 and Location 1 records. The train and split test for the Advanced Model Fitting section can be kept the same as there are no further adjustments needed.

As new dataset is created, I will also use it to create a line plot of the variable monthly_amount for industry = 1 and location = 1 with the purpose of gaining more insights from targeted areas.

It is clear from the graph that there is a seasonality trend observed from the mean transaction amount of Industry 1 & Location 1. More specifically, a downtrend at the end of the year followed by an up trend at the beginning of the year is presented with the months of December and January are low months for this industry and location, and the sales bounce back from March to June. This pattern of fluctuation is repeated during the year and in the time span of 3 years from 2013 to 2017. On average, the monthly mean amount of sales is increasing slowly over time.

However, it is worth mentioning that the year-end trend in 2016 was upward, which was the inverse of previous years. As a result, we will need to take a closer look at this occurrence by examining the amount of money moved by month for each year using the graphic below.

As can be seen, the anomalous increase towards the end of 2016 was previously noticed as a result of a lack of transaction data in December 2016. As a result, we discovered another insight based on facts observed from the trend chart above.

Code of the project and relevant files-

--

--