Predict Next Month Transaction with Linear Regression (Part 1)

Basic Exploration of the dataset

Leah Nguyen
7 min read · Jun 11, 2022

Introduction

This article analyses a monthly transaction data set to better understand customer transaction patterns. It also offers a study of the linear regression model, an essential concept in machine learning, and explains how this model can assist in the decision-making process of identifying trends in bank transactions within the years 2013–2016.

To capture this information well, the CRISP-DM process model is adopted to provide a structured planning approach to the data mining project through six high-level phases. These phases help companies comprehend the data mining process and serve as a road map for planning and executing a data mining project (Medeiros, 2021).

Cross-Industry Standard Process for Data Mining (CRISP-DM project, 2000)

This study explores each of the six phases and the tasks associated with each in the following orders:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modelling
  5. Evaluation
  6. Deployment

In the scope of this article, I will cover the first two phases of CRISP-DM: Business Understanding and Data Understanding (EDA — Part 1).

You can view all the code used for this project on GitHub.

Business Understanding

Business Understanding is the first step taken in the CRISP-DM methodology. The main task in this stage is to understand the purpose of the analysis and to provide a clear and crisp definition of the problem in terms of both the business objectives and the data mining objectives.

In our case study, the business objective, paraphrased from the sales manager’s request, is:

What is driving the trends and increasing total monthly revenue?

On the other hand, the data mining objective is to apply data visualization tools to identify any underlying patterns in the dataset.

Data Understanding

Following that, the Data Understanding phase focuses on understanding the collected data in order to support the Business Understanding phase and resolve the business challenge (Wijaya, 2021). Data preprocessing and data visualization techniques play an essential role here, so I will divide this section into two main components:

  1. Exploratory Data Analysis (Part 1) — The Dataset, including:
  • Stage 1: Basic Exploration
  • Stage 2: Univariate, Bivariate & Multivariate Analysis

  2. Exploratory Data Analysis (Part 2) — The Business Insights

The data was imported into the software package R to construct visualizations representing the findings of the analysis.

Exploratory Data Analysis (Part 1) — The Dataset

Stage 1: Basic Exploration

First, I will load the libraries needed for reading and manipulating the data and for building the graphs.

##----------------------------------------------------------------
## Load the Libraries --
##----------------------------------------------------------------
library(here) # access file paths
library(DataExplorer) # EDA visualizations
library(tidyverse) # data wrangling
library(kableExtra) # write table
library(bannerCommenter) # create comment banner
library(ggplot2) # data visualization
library(forecast) # time-series forecasting
library(ggradar) # plot seasonal trend
library(sqldf) # using SQL
library(dplyr) # data processing
library(ggpubr) # combine plots into single page
theme_set(theme_pubr())
library(reshape2) # transpose table
library(fmsb) # create radar chart
library(modelr) # computing regression model performance metrics
library(caret) # streamline the model training process
library(xts) # convert df to ts object

Once the libraries are loaded, we explore the data with the goal of understanding its dimensions, data types, and distribution of values. In this assignment, a time series data set of financial transactions is used as the major source of data. The attribute information is presented as follows:

Data Description

After getting a good idea of the data description, I want to understand what the data looks like in general. The DataExplorer package can retrieve that piece of information within a few lines of code:
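
A minimal sketch of that step (the file name transactions.csv and the data/ folder are placeholder assumptions, not from the original code):

transactions <- read_csv(here("data", "transactions.csv"))

introduce(transactions) # row/column counts, missing totals, memory size
glimpse(transactions)   # column names and data types at a glance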

Data preview

As apparent from the table, the data records 470,000+ observations across 5 columns, equivalent to 94,000+ bank transactions. The 5 features in this data set (date, customer_id, industry, location, monthly_amount) clearly indicate the total transaction amounts per customer each month, spanning a 3-year period over a range of industries and locations. The column names are therefore self-explanatory, and no further adjustment to them is needed.

Data columns inspection

It is also worth noting that the features come in multiple formats, including both numerical and time-series data. However, the output shows that the date column has the wrong data type and will need to be converted to a date format later.
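
As a hedged one-liner, assuming the column arrives as a character string in "YYYY-MM-DD" form (the actual stored format may differ):

transactions <- transactions %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

class(transactions$date) # should now report "Date"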

Additionally, I investigate the response field. Recall from the business question that we expect to use the monthly_amount column as the target field, since our goal is to predict the value of next month's transactions. Because the observations in this column are continuous, I can conclude that our problem is a supervised regression problem. Knowing this is essential for selecting the right machine learning model in the later stages of this report.
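
With the target field identified, the next check is completeness. DataExplorer's plot_missing() draws the profile shown below in a single call:

plot_missing(transactions) # share of missing rows per column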

Plot missing values

The plot shows that there are no missing values in any field of the data. Nevertheless, some data sets encode missing observations in categorical/character columns as a new category such as "NA" or "NULL", so there is a chance we are overlooking such observations, which could badly distort the real data distribution. Consequently, the missing values of our categorical columns need a further check to confirm this observation.

The code output below shows that no such disguised missing-value category exists in the categorical columns. Thus, we can confirm our hypothesis that there are no missing values in either the numerical or the categorical columns of this data set. Furthermore, it also reveals one row containing an odd value in the monthly_amount column, which will need to be resolved.
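
A sketch of both checks; the candidate missing-value strings and the definition of an "odd" amount (zero or negative) are my assumptions, not taken from the original code:

# Look for disguised missing values in the categorical columns
transactions %>%
  mutate(across(c(industry, location), as.character)) %>%
  summarise(across(c(industry, location),
                   ~ sum(.x %in% c("NA", "NULL", "N/A", ""))))

# Flag odd values in the target column
transactions %>%
  filter(monthly_amount <= 0)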

Stage 2: Univariate, Bivariate & Multivariate Analysis

To evaluate the impact of each feature on the target, univariate, bivariate, and multivariate analyses are performed on all features.

Univariate: Check the distribution of each field

Univariate analysis is the study of the distribution of the data. According to Sharma (2020), the distributions of the independent variables and the target variable are crucial components in building linear models. Therefore, understanding the skewness of the data helps us create better models.

Firstly, I will plot the frequency distribution of the industry and location columns to check which groups contribute the most transactions.
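
One plausible way to produce these frequency plots, treating the numeric industry and location codes as categories:

transactions %>%
  mutate(industry = factor(industry), location = factor(location)) %>%
  select(industry, location) %>%
  plot_bar() # DataExplorer bar charts of category frequencies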

Distribution histogram

As can be seen from the plot, locations 1 and 2 made the top contributions in the location column, while industries 1 and 2 occupied the highest frequencies in the industry column. These results imply that the model may perform better at predicting next month's total transaction amount for locations 1 and 2 and/or industries 1 and 2.
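
A sketch of how these boxplots could be drawn, combined onto a single page with ggpubr's ggarrange() (loaded earlier):

p_ind <- ggplot(transactions, aes(factor(industry), monthly_amount)) +
  geom_boxplot() +
  labs(x = "Industry", y = "Monthly amount")
p_loc <- ggplot(transactions, aes(factor(location), monthly_amount)) +
  geom_boxplot() +
  labs(x = "Location", y = "Monthly amount")
ggarrange(p_ind, p_loc, ncol = 2)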

Boxplot to check for outliers when plotting Monthly Amount against Location & Industry

Next, the boxplots of sale transactions by industry and location show high variance with a considerable number of outliers. The median amount of spending per customer is highest for industries 6 and 9, at over 500,000, while the lowest medians belong to industries 1 and 10, at less than 200,000. In terms of locations, most locations have a median spending amount of less than 500,000.

Bivariate Analysis: Relationship between each column and target field & Collinearity

Having established the distribution of our transaction dataset, it is essential to also check the correlation and collinearity assumptions between fields in the bivariate analysis. Some basic data-type transformations are performed beforehand for the sake of plotting the visualizations.
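
A sketch of the correlation check using DataExplorer's plot_correlation(), restricted to the encoded numeric fields (the exact transformations used in the original may differ):

transactions %>%
  select(industry, location, monthly_amount) %>%
  plot_correlation() # pairwise correlation heatmap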

Correlation plot

This information is essential for gaining a better understanding of the transaction data set and provides useful guidance for transforming the data in the later stages.

For Part 2, please visit:

References

  1. Medeiros, L. (2021, December 19). The CRISP-DM methodology — Lucas Medeiros. Medium. https://medium.com/@lucas.medeiross/the-crisp-dm-methodology-d1b1fc2dc653
  2. Sharma, A. (2020, December 23). What is Skewness in Statistics? | Statistics for Data Science. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/
  3. Wijaya, C. Y. (2021, December 19). CRISP-DM Methodology For Your First Data Science Project. Medium. https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-project-769f35e0346c
