Gain expertise in dataset exploration for reliable revenue forecasting.

Learn how to explore datasets to identify key trends and patterns that can drive actionable insights.

Asish Biswas
AnalyticSoul
4 min read · May 28, 2024

Welcome back! Now that you’ve grasped the fundamentals of linear regression, let’s apply this knowledge to analyze customer spending in a retail context. Specifically, we’ll predict the amount a customer is likely to spend next year. Throughout this exercise, we’ll walk through the entire process of building and deploying machine learning systems.

Steps of building a machine learning system

Here is some basic information about the dataset:

Dataset features

Let’s familiarize ourselves with the dataset through exploratory data analysis (EDA).

Data exploration

First, import the dataset (retail_transactions.csv). Right after importing any dataset, we should check its dimensions (shape) and the details of each column.

import pandas as pd

df_retail = pd.read_csv('data/retail_transactions.csv')

print('The dimensions of the dataset')
print(df_retail.shape)

print('\nDetailed description of the dataset')
df_retail.info()  # info() prints its report itself and returns None

print('\nNumeric details')
print(df_retail.describe())

The info() method lists each column’s data type and non-null count, and describe() summarizes the numeric columns. Overall, the dataset has 8 columns and 397,884 records. Each record represents the details of a transaction.
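Alongside shape, info(), and describe(), it can be useful to count missing values and duplicate rows explicitly. The snippet below is only a minimal sketch: df_toy is a made-up stand-in for df_retail, its values are invented, and it reuses just the three column names discussed in this article.

```python
import pandas as pd

# Hypothetical stand-in for df_retail; all values are invented for illustration.
df_toy = pd.DataFrame({
    'Quantity': [6, 6, 12, 12],
    'UnitPrice': [2.55, 2.55, None, 1.25],
    'Country': ['United Kingdom', 'United Kingdom', 'France', 'France'],
})

# Number of missing values in each column
missing_per_column = df_toy.isna().sum()
print(missing_per_column)

# Number of rows that are exact copies of an earlier row
n_duplicates = int(df_toy.duplicated().sum())
print('Duplicate rows:', n_duplicates)
```

On the real data, the same two calls would run on df_retail directly.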

Now, let’s observe some individual features carefully.

Quantity

Run the code below and you’ll see in the graph that there are a couple of unusually large quantities in the transaction dataset. They might be caused by poor data quality or some mistakes. We have to deal with them; otherwise, they will distort our model.

import matplotlib.pyplot as plt
import seaborn as sns

sns.rugplot(x=df_retail['Quantity'], height=1)
plt.xlabel('Quantity')
plt.show()

Quantity rug plot

Let’s take a look at the outliers that have unusually high values in Quantity (more than 70000). One way to deal with such outliers is to remove them.

# Quantity outlier inspection
quantity_outlier = df_retail[df_retail['Quantity'] > 70000]
quantity_outlier.head()

Quantity outliers
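Removing such rows, as suggested above, could look roughly like the sketch below. It runs on a small invented frame (df_toy is hypothetical); only the 70000 threshold comes from the text.

```python
import pandas as pd

# Hypothetical toy data; only the 70000 threshold comes from the article.
df_toy = pd.DataFrame({'Quantity': [1, 12, 6, 75000, 81000, 3]})

QUANTITY_THRESHOLD = 70000

# Boolean mask that is True for rows above the threshold
outlier_mask = df_toy['Quantity'] > QUANTITY_THRESHOLD

# Keep everything that is not an outlier and renumber the index
df_clean = df_toy[~outlier_mask].reset_index(drop=True)

print('Removed rows:', int(outlier_mask.sum()))
print(df_clean)
```

Building the mask first makes it easy to inspect or count the affected rows before dropping them.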

UnitPrice

We find similar outliers in the UnitPrice column. The describe() method showed that the third quartile of UnitPrice is 3.75, yet some records have a UnitPrice of more than a thousand. We have to treat these as well.

sns.rugplot(x=df_retail['UnitPrice'], height=1)
plt.xlabel('UnitPrice')
plt.savefig('output/output.png')  # the output/ directory must already exist

UnitPrice outliers
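Since the quartiles already hint at how extreme these prices are, one common rule of thumb is to flag anything above Q3 + 1.5 × IQR (Tukey’s fences). This is not necessarily how the accompanying notebook treats UnitPrice; it is a sketch of one possible approach, run on invented prices.

```python
import pandas as pd

# Hypothetical prices, invented for illustration (not taken from the dataset).
unit_price = pd.Series(
    [1.25, 1.65, 2.10, 2.95, 3.75, 4.95, 1500.0], name='UnitPrice'
)

# First and third quartiles, and the interquartile range between them
q1 = unit_price.quantile(0.25)
q3 = unit_price.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: values above Q3 + 1.5 * IQR are candidate outliers
upper_fence = q3 + 1.5 * iqr
flagged = unit_price[unit_price > upper_fence]

print('Upper fence:', upper_fence)
print(flagged)
```

Whether flagged rows should be dropped, capped, or kept depends on what they represent, so it is worth inspecting them before deciding.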

Country

Now, let’s move on to the categorical feature Country. We want to see the number of transactions recorded per country.

country_records = df_retail['Country'].value_counts()
print(country_records.head(10))
Transactions per country

We see in the output that transactions from the UK heavily dominate our dataset. For that reason, we might want to focus on records from the UK only, because customers in other countries may follow different purchasing patterns that add unwanted variation to the model.
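Restricting the data to the dominant country could be sketched as follows. The frame is a made-up example, and the label 'United Kingdom' is an assumption about how the UK appears in the Country column. To avoid hard-coding it, the sketch picks whichever label occurs most often.

```python
import pandas as pd

# Hypothetical toy data; the country labels are assumptions.
df_toy = pd.DataFrame({
    'Country': ['United Kingdom'] * 5 + ['Germany', 'France'],
    'Quantity': [6, 8, 2, 12, 4, 10, 24],
})

# Share of records per country, as fractions that sum to 1
country_share = df_toy['Country'].value_counts(normalize=True)
print(country_share)

# Most frequent country label, then only the rows that carry it
top_country = country_share.idxmax()
df_top = df_toy[df_toy['Country'] == top_country].copy()

print('Kept', len(df_top), 'of', len(df_toy), 'rows for', top_country)
```

Looking at the shares first shows how much data the filter discards.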

All these findings will help us build the dataset that will train our machine-learning model to predict customer revenue.

Please refer to the Jupyter Notebook for more detailed data exploration. Practice along to sharpen your understanding.

Join the community

Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!

Happy learning! See you on the other side :-)
