Module IV: Unveiling Insights through Cluster Analysis of Retail Data

Wadi Ahmed
INST414: Data Science Techniques
May 2, 2024

In today’s data-driven world, extracting meaningful insights from complex, real-world problems is key. A great example is the retail industry, which reported more than $7.24 trillion in sales by year-end 2023. The industry thrives and continues to grow by getting products from production to warehouse to consumers’ hands in the fastest, most efficient way possible. Consequently, data analysis is key to making sure that trends are capitalized on effectively.

One powerful data analysis technique for uncovering patterns is cluster analysis. In this post, we’ll dive into constructing and characterizing clusters in a dataset to answer pertinent questions about retail data, look at data collection and the selection of k-values, and see where our model succeeds and where it needs improvement.
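To make the idea of selecting a k-value concrete, here is a minimal sketch of k-means together with the inertia (within-cluster sum of squares) that the common "elbow" heuristic looks at. It is not the code from the linked repository: the toy points, the function names, and the deterministic farthest-point initialization are all assumptions chosen so the example runs with only the Python standard library.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption for this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def kmeans(points, k, max_iter=100):
    """Return (labels, centroids, inertia) for a basic k-means run."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Hypothetical (quantity, price) pairs with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 4)}
for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.2f}")
```

On this toy data the drop in inertia from k=1 to k=2 is far larger than the drop from k=2 to k=3, which is the kind of "elbow" that suggests k=2. Real transactional data is rarely this clean, so the elbow is usually a judgment call rather than a clear answer.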

Question to Stakeholder:

The retail industry is known for its small profit margins, so maximizing revenue streams is key for shareholders, management, and stores to operate effectively. Marketing managers at a retail company want to understand customer segments based on purchasing behavior so that their marketing strategies are as effective as possible. Answering this question will influence decisions on targeted advertising and inventory management for a retail store.

Data Description

The data comes from Kaggle, which hosts a sample dataset of more than 20,000 entries of transactional data. The dataset includes fields such as invoice number, stock code, description, quantity, invoice date, price, customerID, and the country where the item was purchased. This data is relevant because it provides insights into customer preferences, purchase frequency, and spending habits.

The features that will be used in this model are:
-Countries
-Prices
-Invoice Dates
-Quantity
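
Because these four features mix text, dates, and numbers, they need to be turned into numeric vectors before any distance-based method can use them. The sketch below shows one possible way to do that with only the standard library. The field names, the date format, the sample rows, and the choice of one-hot encoding plus z-scores are all assumptions for illustration, not the dataset's exact headers or the preprocessing used in the linked code.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical rows; field names and values are made up for illustration.
rows = [
    {"Country": "United Kingdom", "Price": 2.55,
     "InvoiceDate": "2010-12-01 08:26", "Quantity": 6},
    {"Country": "France", "Price": 3.75,
     "InvoiceDate": "2010-12-01 08:45", "Quantity": 24},
    {"Country": "United Kingdom", "Price": 1.25,
     "InvoiceDate": "2011-03-15 14:10", "Quantity": 12},
]

def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def build_features(rows):
    """Return (feature_vectors, column_names) for the four chosen features."""
    countries = sorted({r["Country"] for r in rows})
    prices = zscores([r["Price"] for r in rows])
    quantities = zscores([r["Quantity"] for r in rows])
    # Reduce the invoice date to its month (assumed date format).
    months = [datetime.strptime(r["InvoiceDate"], "%Y-%m-%d %H:%M").month
              for r in rows]
    features = []
    for r, price, qty, month in zip(rows, prices, quantities, months):
        # One-hot encode the country so no artificial ordering is implied.
        one_hot = [1.0 if r["Country"] == c else 0.0 for c in countries]
        features.append([price, qty, month / 12.0] + one_hot)
    return features, ["Price", "Quantity", "Month"] + countries

features, columns = build_features(rows)
for name, value in zip(columns, features[0]):
    print(f"{name}: {value:.2f}")
```

Scaling matters here because a raw quantity of 24 would otherwise dominate a price of 3.75 in any distance calculation. Reducing the invoice date to a month is only one option; day of week or days since first purchase could be equally reasonable.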

Classification or Regression?

For this analysis, we will be using a classification model. The reason for choosing a classification model over a regression one is that the outcome we are predicting is categorical in nature: specifically, the aim is to predict customer segments based on their purchasing behavior. Classification models are much better suited to predicting categorical outcomes, making them the appropriate choice for this scenario.

Application and Incorrect Predictions:

After applying the trained classification model, I identified samples where the model made incorrect predictions. These errors appeared mainly in how the model generated some of the clusters, which showed incorrect information. However, the output did suggest multiple ways to get the clusters to display more effectively.
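
As a rough illustration of how incorrect predictions can be pulled out for inspection, the sketch below compares predicted segment labels against reference labels and lists the mismatches. It assumes reference labels exist for the samples. The segment names, both label lists, and the helper `find_incorrect` are hypothetical and do not come from the dataset or the linked code.

```python
from collections import Counter

# Hypothetical segment labels; neither list comes from the actual dataset.
actual = ["bulk", "bulk", "occasional", "frequent", "occasional"]
predicted = ["bulk", "frequent", "occasional", "frequent", "bulk"]

def find_incorrect(actual, predicted):
    """Return (index, actual, predicted) for every sample the model got wrong."""
    return [(i, a, p)
            for i, (a, p) in enumerate(zip(actual, predicted))
            if a != p]

errors = find_incorrect(actual, predicted)
accuracy = 1 - len(errors) / len(actual)

# Count which (actual, predicted) pairs are confused most often.
confusions = Counter((a, p) for _, a, p in errors)

print(f"accuracy: {accuracy:.2f}")
for i, a, p in errors:
    print(f"sample {i}: expected {a!r}, got {p!r}")
```

Looking at the individual mismatches, rather than only the overall accuracy, is what makes it possible to spot whether the errors cluster around one particular segment.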

Conclusion:

Cluster analysis offers valuable insights into complex datasets such as these, enabling stakeholders to make better-informed decisions. By understanding the process of constructing and characterizing clusters, we unlock the potential of retail data to drive business strategies and enhance customer experiences.

GitHub Link for Code: https://github.com/CaptFalc/Assignment-4
