An Exploratory Data Analysis (EDA) on the SuperStore Dataset

Warisa Weechaneemart
Oct 29, 2023

Superstores have been at the center of a remarkable transformation of the United States retail environment over the past few decades. Superstores, as the name suggests, are large retail facilities that offer a wide variety of goods, ranging from food to clothing, electronics, and more, all under one roof. They have become a powerful force in the constantly evolving modern retail market, changing the way people shop, affecting local economies, and influencing customer behavior.

Introducing the Investigation

As economics students, we were curious about what American consumers choose to buy at superstores, given their large budgets. Do consumption habits differ from region to region? We wanted to answer several major questions through this exploratory analysis:

  • Which Superstore product category is the most preferred among Americans?
  • Are American consumption patterns consistent across regions?
  • Are American consumption patterns consistent across states in the same regions?
  • Are superstores in different regions making different profits?

We thought the best place to start answering these questions would be the Superstore dataset. It provides information on sales orders for an American superstore, including details about orders, customers, products, and sales transactions, such as order ID, product and customer ID, type of shipping, prices, product categories, and names.

https://www.kaggle.com/datasets/saadharoon27/superstore-dataset

Inspecting the Information

Let’s get a clearer view of our dataframe so we can check out the columns and start cleaning them up.
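A first look at the dataframe could be sketched as below. This is only an illustration: the rows in `SAMPLE_CSV` are made up so the snippet runs on its own, and the column subset is an assumption based on the fields named in this article. In practice the `io.StringIO` call would be replaced by the path to the CSV downloaded from Kaggle.

```python
import io

import pandas as pd

# Hypothetical sample rows, not taken from the real dataset.
SAMPLE_CSV = """Order ID,Region,State,Category,Sales,Quantity,Profit
A-0001,West,California,Office Supplies,14.62,2,6.87
A-0002,East,New York,Furniture,261.96,2,41.91
A-0003,South,Florida,Technology,907.15,6,90.72
A-0004,Central,Texas,Office Supplies,22.37,3,-5.10
"""

# Replace io.StringIO(SAMPLE_CSV) with the path to the downloaded file.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each column
print(df.head())        # first few rows
print(df.isna().sum())  # missing values per column
```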

Cleaning the Columns

During our initial assessment, we observed that the columns “Row ID+O6G3A1:R6”, “Order ID”, “Order Date”, “Ship Date”, “Ship Mode”, “Customer ID”, “Customer Name”, “Segment”, “Product ID”, “Product Name”, “Returns” and “Payment Mode” did not serve any purpose for our questions. Therefore, we removed them.

The fields that remain are “Country”, “City”, “State”, “Region”, “Category”, “Sub-Category”, “Sales”, “Quantity” and “Profit”.
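Dropping the unused columns might look like the following minimal sketch. The two-row dataframe is a made-up stand-in that contains only some of the unused columns, which is why `errors="ignore"` is passed to `drop`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the raw dataframe.
df = pd.DataFrame({
    "Order ID": ["A-0001", "A-0002"],
    "Ship Mode": ["Standard Class", "First Class"],
    "Customer Name": ["Customer A", "Customer B"],
    "Country": ["United States", "United States"],
    "City": ["Los Angeles", "New York City"],
    "State": ["California", "New York"],
    "Region": ["West", "East"],
    "Category": ["Office Supplies", "Furniture"],
    "Sub-Category": ["Paper", "Chairs"],
    "Sales": [14.62, 261.96],
    "Quantity": [2, 2],
    "Profit": [6.87, 41.91],
})

# Column names as listed in the article.
unused = [
    "Row ID+O6G3A1:R6", "Order ID", "Order Date", "Ship Date", "Ship Mode",
    "Customer ID", "Customer Name", "Segment", "Product ID", "Product Name",
    "Returns", "Payment Mode",
]

# errors="ignore" skips names that are absent from this small sample.
clean = df.drop(columns=unused, errors="ignore")
print(clean.columns.tolist())
```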

Now that our dataset is organized, we can use the data to answer our initial questions.

Data Visualization

Data visualization gives better insight into the relationships between variables, the distribution of individual variables, and their statistical properties.

We used distribution plots, box plots and bar plots to explore the relationships between variables. The interesting details to point out are the following:

The most preferred product category

According to the distribution, Office Supplies is the most popular product category among Americans, with a frequency of approximately 3600, meaning that customers purchased items in the Office Supplies category at superstores about 3600 times.
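The frequency behind such a distribution can be computed with `value_counts`. The sketch below uses a handful of made-up rows, so the counts are illustrative and not the article's figures.

```python
import pandas as pd

# Hypothetical order lines; only the Category column matters here.
orders = pd.DataFrame({
    "Category": [
        "Office Supplies", "Furniture", "Office Supplies",
        "Technology", "Office Supplies", "Furniture",
    ]
})

# Number of order lines per category, sorted from most to least frequent.
counts = orders["Category"].value_counts()
print(counts)

# The category with the highest frequency.
most_preferred = counts.idxmax()
print(most_preferred)
```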

Work and home office trends were most likely a contributing factor to the high demand for office supplies. Demand grows as more people work remotely or run home-based enterprises. It comes not only from typical office settings but also from home offices, which contributes to the category’s popularity. Another aspect influencing the popularity of Office Supplies products in superstores is daily necessity. Office supplies include a wide variety of everyday items such as pens, paper, notebooks, printers, and ink cartridges that are needed for many tasks at work, school, or home. These needs drive constant demand and usage among the general public.

American consumption patterns

According to the box plot, American consumption patterns are consistent across regions, with the most popular product categories being office supplies, furniture, and technology, in that order.
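One way to cross-check this reading numerically is a count table of region against category. This is a tabular alternative to the box plot, not the plot itself, and the rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical order lines for three regions.
orders = pd.DataFrame({
    "Region": ["East", "East", "East", "West", "West", "West",
               "South", "South", "South"],
    "Category": ["Office Supplies", "Office Supplies", "Furniture",
                 "Office Supplies", "Office Supplies", "Technology",
                 "Office Supplies", "Office Supplies", "Furniture"],
})

# Rows are regions, columns are categories, cells are order counts.
table = pd.crosstab(orders["Region"], orders["Category"])
print(table)

# Most frequent category in each region.
top = table.idxmax(axis=1)
print(top)

# True when every region shares the same top category.
consistent = top.nunique() == 1
print(consistent)
```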

Urbanization and comparable lifestyles are plausible reasons for the consistent American consumption pattern. Because of comparable modern living standards and lifestyles, urban areas frequently have similar consumption trends, and the metropolitan population tends to have more uniform wants and preferences.

Another factor could be national trends and culture. Certain trends or cultural influences shape consumer preferences on a national scale. For example, technical advancements or lifestyle changes might have a national impact on the adoption of specific products or services.

Comparing consumption patterns across states in the same regions

Now that we’ve briefly compared the consumption patterns in each region side-by-side, let’s analyze a single region in a bit more detail, by creating a set of more individualized dataframes, and charting each column in the table.

First, we eliminate the unnecessary columns “Row ID+O6G3A1:R6”, “Order ID”, “Order Date”, “Ship Date”, “Customer ID”, “Customer Name”, “Country”, “Product ID”, “Product Name”, “ind1” and “ind2”. Then, using a filter, we extract each region’s superstore orders one by one.

data of superstore in east region
data of superstore in west region
data of superstore in central region
data of superstore in south region
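The per-region filtering described above can be sketched as follows. The rows are hypothetical, and the names `east` and `regions` are illustrative rather than taken from the original notebook.

```python
import pandas as pd

# Hypothetical cleaned data with one row per order line.
df = pd.DataFrame({
    "Region": ["East", "West", "Central", "South", "East"],
    "State": ["Maine", "California", "Texas", "Florida", "New York"],
    "Category": ["Furniture", "Office Supplies", "Office Supplies",
                 "Technology", "Office Supplies"],
    "Profit": [12.5, 40.0, -3.2, 18.9, 7.4],
})

# Boolean mask: keep only the rows of one region.
east = df[df["Region"] == "East"].reset_index(drop=True)
print(east)

# Or build one dataframe per region in a single pass.
regions = {name: group.reset_index(drop=True)
           for name, group in df.groupby("Region")}
print(sorted(regions))
```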

Lastly, we create box plots using the product category variable, with state as the subgroup, to see American consumption patterns across states in the same region.

box plot of east region
box plot of west region
box plot of south region
box plot of central region
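A tabular counterpart to these state-level plots is the share of each category within each state of a region. The sketch below uses made-up rows for two East-region states, so the shares are illustrative only.

```python
import pandas as pd

# Hypothetical East-region order lines.
east = pd.DataFrame({
    "State": ["Maine", "Maine", "New York", "New York", "New York"],
    "Category": ["Furniture", "Technology", "Office Supplies",
                 "Office Supplies", "Furniture"],
})

# normalize="index" turns each row (state) into shares that sum to 1.
shares = pd.crosstab(east["State"], east["Category"], normalize="index")
print(shares.round(2))

# Top category per state; differing values indicate differing patterns.
print(shares.idxmax(axis=1))
```

Comparing rows of `shares` side by side shows whether states in one region favor the same categories.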

Although consumption patterns are consistent across regions, they are not consistent across states within the same region.

Cultural variations may be one of the reasons why American consumption habits are similar across regions but differ between states within those regions. While there may be regional parallels in consumption habits due to common national trends or influences, each state may have its own unique cultural preferences. Even within the same region, states may have different social and cultural norms that influence consumer behavior.

Another possible reason is economic inequality. Economic differences between states might result in disparities in purchasing patterns. Some states may have higher average incomes, which may result in different spending habits than in states with lower average incomes.

The final explanation could be local market dynamics. Each state may have its own set of local market characteristics, such as the presence of specialized stores, marketing methods, or regional brands catering to local tastes. This variation can affect which products are more popular or widely available in different states. For example, in the East, Maine mainly buys furniture and technology products from superstores. This could be due to local store competition and the retail environment in the state influencing offers and consumer preferences.

Profit clustering of superstores in different regions

t-SNE can be used to visualize and analyze clusters of data points. By first reducing the dimensionality of the data with t-SNE and then applying a traditional clustering algorithm such as K-means in the lower-dimensional space, we can identify and understand the underlying patterns and relationships within the data.

The K-means clustering algorithm groups the profit data into four clusters:

  • Central
  • East
  • South
  • West
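The t-SNE plus K-means pipeline described above might be sketched as follows with scikit-learn. This is an illustration under stated assumptions: the data is randomly generated, the choice of “Sales”, “Quantity” and “Profit” as input features is a guess (the article does not say which columns were used), and the `perplexity` and seed values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns (assumed features).
rng = np.random.default_rng(0)
n = 80
data = pd.DataFrame({
    "Sales": rng.gamma(2.0, 100.0, size=n),
    "Quantity": rng.integers(1, 10, size=n),
    "Profit": rng.normal(30.0, 50.0, size=n),
})

# Put the features on a common scale before measuring distances.
features = StandardScaler().fit_transform(data)

# Reduce to two dimensions; perplexity must be smaller than n.
embedding = TSNE(n_components=2, perplexity=15,
                 random_state=0).fit_transform(features)

# Cluster the 2-D embedding into four groups.
labels = KMeans(n_clusters=4, n_init=10,
                random_state=0).fit_predict(embedding)

# Attach the labels and summarize profit per cluster.
data["cluster"] = labels
print(data.groupby("cluster")["Profit"].agg(["count", "mean"]))
```

K-means labels are arbitrary integers, so relating clusters to regions would need an extra step, for example a cross-tabulation of the cluster label against the “Region” column.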

Superstores in the western United States, especially California, are well positioned to attract a large and diverse customer base, capitalize on high consumer spending, and manage their supply chains efficiently, all of which contribute to their ability to make substantial profits. On the other hand, the South includes rural areas where population density is lower, and some southern states have lower average income levels than other regions of the country, which can lead to lower sales and profit margins for superstores.

Conclusion

In this data analysis project, we explored a dataset from a superstore, examining various aspects related to store sales and profitability. We performed Exploratory Data Analysis (EDA) to gain insights into the data distribution, including visualizations of sales and profit across regions. This EDA of the Superstore dataset has taught us a great deal about how to extract information from large datasets. However, we also end this project with questions that deserve deeper study, such as whether American consumption patterns are truly consistent across regions and which product category is the most preferred.

These insights can guide strategic decision-making, resource allocation, and marketing efforts to optimize sales and increase profitability for the superstore.
