Gain expertise in dataset exploration for reliable revenue forecasting.
Learn how to explore datasets to identify key trends and patterns that can drive actionable insights.
Welcome back! Now that you’ve grasped the fundamentals of linear regression, let’s apply this knowledge to analyze customer spending in a retail context. Specifically, we’ll predict the amount a customer is likely to spend next year. Throughout this exercise, we’ll walk through the entire process of building and deploying machine learning systems.
Here is some basic information about the dataset:
Let’s familiarize ourselves with the dataset through exploratory data analysis (EDA).
Data exploration
First, import the dataset (retail_transactions.csv). Right after importing any dataset, we should always check its dimensions (shape) and the details of each column.
import pandas as pd
# Load the transactions into a dataframe
df_retail = pd.read_csv('data/retail_transactions.csv')
print('The dimensions of the dataset')
print(df_retail.shape)
print('\nDetailed description of the dataset')
# info() prints its report directly and returns None, so it is not wrapped in print()
df_retail.info()
print('\nNumeric details')
print(df_retail.describe())
We check further details of the dataset using the info() method of the dataframe, and we get summary statistics for the numeric columns using the describe() method. Overall, we have 8 columns and 397884 records in the dataset. Each record holds the details of a transaction.
Now, let’s observe some individual features carefully.
Quantity
Run the code below and you’ll see in the graph that there are a couple of unusually large quantities in the transaction dataset. These might be caused by poor data quality or data-entry mistakes. We have to deal with them; otherwise, they will distort our model.
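import matplotlib.pyplot as plt
import seaborn as sns
# Draw one tick per transaction along the Quantity axis
sns.rugplot(x=df_retail['Quantity'], height=1)
plt.xlabel('Quantity')
plt.show()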
Let’s take a look at the outliers that have unusually high values in Quantity (more than 70000). One way to deal with such outliers is to remove them.
# Quantity outlier inspection
quantity_outlier = df_retail[df_retail['Quantity'] > 70000]
print(quantity_outlier.head())
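Removing these records can be sketched as follows. This is a minimal illustration rather than the notebook's actual cleaning step: the toy dataframe and its values are made up to stand in for df_retail, and the names toy, quantity_cutoff, and toy_clean are hypothetical. The cutoff of 70000 is the same one used in the inspection above.

```python
import pandas as pd

# Toy stand-in for df_retail; the values are made up for illustration.
toy = pd.DataFrame({
    'Quantity': [6, 12, 74215, 3, 80995],
    'UnitPrice': [2.55, 1.25, 1.04, 3.75, 2.08],
})

# Same cutoff as the inspection above: keep only rows at or below it.
quantity_cutoff = 70000
toy_clean = toy[toy['Quantity'] <= quantity_cutoff].reset_index(drop=True)

print(f'Rows before: {len(toy)}, rows after: {len(toy_clean)}')
```

Comparing the row counts before and after filtering is a quick sanity check that only the handful of extreme records was dropped.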
UnitPrice
We find similar outliers in the UnitPrice column as well. The describe() method showed that the third quartile for UnitPrice was 3.75, but some records have a UnitPrice of more than a thousand. We have to treat them as well.
sns.rugplot(x=df_retail['UnitPrice'], height=1)
plt.xlabel('UnitPrice')
plt.savefig('output/output.png')
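One possible way to treat these records, mirroring the Quantity inspection earlier, is to flag everything above a fixed price and look at it before deciding what to drop. The sketch below makes some assumptions: the toy values are invented, the cutoff of 1000 is borrowed from the "more than a thousand" observation rather than from any rule in the lesson, and the names price_cutoff and is_price_outlier are hypothetical.

```python
import pandas as pd

# Toy stand-in for the UnitPrice column; the values are made up.
toy = pd.DataFrame({
    'UnitPrice': [1.25, 2.55, 3.75, 4.95, 1867.86, 8142.75],
})

# Assumed cutoff: the lesson only says some prices exceed a thousand.
price_cutoff = 1000
is_price_outlier = toy['UnitPrice'] > price_cutoff

# Inspect the flagged rows before deciding whether to remove them.
print(f'Flagged rows: {is_price_outlier.sum()}')
print(toy[is_price_outlier])

# Keep only the rows that were not flagged.
toy_clean = toy[~is_price_outlier]
```

Keeping the boolean mask in its own variable makes it easy to review the flagged rows first and reuse the same mask for the removal.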
Country
Now, let’s move on to the categorical feature Country. We want to see the number of transactions recorded per country.
country_records = df_retail['Country'].value_counts()
print(country_records.head(10))
We see in the output that transactions from the UK heavily dominate our dataset. For that reason, we might want to focus on records from the UK only: customers in other countries may follow different purchasing patterns, and with so few records per country, they could add noise to the model rather than useful signal.
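Restricting the data to UK records could look like the sketch below. It assumes the UK appears in the Country column as 'United Kingdom'; that label is a guess, so check the value_counts() output for the exact spelling in your data. The toy dataframe and the names uk_label, toy_uk, and uk_share are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_retail; the values are made up for illustration.
toy = pd.DataFrame({
    'Country': ['United Kingdom', 'France', 'United Kingdom',
                'Germany', 'United Kingdom'],
    'Quantity': [6, 12, 3, 8, 24],
})

# Assumed label; confirm the exact spelling in the value_counts() output.
uk_label = 'United Kingdom'
toy_uk = toy[toy['Country'] == uk_label]

# Share of records kept after filtering.
uk_share = len(toy_uk) / len(toy)
print(f'UK share of records: {uk_share:.0%}')
```

Printing the share of records kept shows how much data the filter discards, which helps judge whether focusing on one country is a reasonable trade-off.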
All these findings will help us build the dataset that will train our machine-learning model to predict customer revenue.
Please refer to the Jupyter Notebook for more detailed data exploration. Practice along to sharpen your understanding.
What’s Next?
- Lesson 2.3 — Feature Engineering: We’ll create new features from the existing data to improve our model’s performance.
- Lesson 2.4 — Relation Between Dependent and Independent Variables: We’ll analyze the relationships between revenue and other variables.
- Lesson 2.5 — Model Building and Evaluation: We’ll build our linear regression model and evaluate its performance.
Join the community
Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!
Happy learning! See you on the other side :-)