Mark Mulika
5 min readMay 8, 2023

--

Retail Store Data Cleaning and Analysis

Data cleaning is essential for enhancing the use of data to create actionable insights. Material misstatements are limited through data cleaning, which ensures the insights that are consequently generated can be used to benefit stakeholders.

Context

The dataset was obtained from Kaggle, where an organization selling multiple products to customers internationally. However, the data had multiple issues that prevented objective analysis. Thus, data cleaning was required to improve the reliability of any insights that would be generated from any analysis.

Data Cleaning

Multiple issues were noted in the data, which were fixed to improve its usability for understanding the business. Below are some of the steps taken to clan the data.

Issues Fixed

· Inconsistent formatting

· Missing Values

· Duplicate values

· Spelling Errors

· Irrelevant data

Data Extraction

The data was downloaded from Kaggle as a zip file, which was consequently loaded on Excel. Data exploration was to be done in excel using multiple utilities.

Data Exploration

The data had multiple characteristics which were critical for understanding its components and how some issues noted would be fixed. The records in the dataset included invoice number, stock code, product description, quantity, unit price, revenue, customer ID, invoice date and country. A quick check on the data revealed multiple issues like inconsistent formatting, missing values, duplicate entries and null entries.

Data Cleaning

The objective of data cleaning was to make the data usable, where the inconsistent issues were fixed to support a more objective analysis. Below are some of the operations undertaken to clean the data.

· Consistent formatting was undertaken, where Arial font 11 was adopted for improved clarity of the spreadsheet, which increases the chances of noticing errors and also more convenient readability.

· Changed product description formatting to uppercase for all entries for consistency and readability.

· The TRIM function was used to remove extra spaces in the data

· Remove Duplicates — 5,268 duplicate values were removed found and removed

· The Invoice

· Checked dates for consistent formatting and changed all for uniformity.

· Delete blank rows which reduced them from 536, 643 to 535,199.

· Items with blank description were labelled as “No Description”

· All blank country entries were replaced with “Unspecified”.

· The date was separated into an invoice date, month and time for a more objective analysis

The Filter Function was used for checking the numerous missing or inconsistent data entries, which was critical to avoid deleting any important data that would have been used for analysis. Each column was analyzed for outliers, blank entries, inconsistent formats and any issues which would skew data analysis. Date formats, unknown countries, missing values and inconsistent formats were fixed.

Filter Function for Data Cleaning

After cleaning the data, it is critical to establish insights for the business because any errors have been fixed effectively. Pivot tables were created to analyze various aspects from the dataset, which revealed multiple insights.

Below are some graphics genearted from the pivot tables:

Insights

· November had the highest monthly revenue, while April had the lowest revenue for the months analyzed.

· Sales were highest at 19:00 HRS while the lowest sales were recorded at 18:00 HRS.

· The UK had the highest sales volume at 84.01% while the rest account for 15.99 of total sales.

· Most sales had no customer IDs

Tableau Visualization

The data was bulky, which would be inefficient to visualize in excel, necessitating the use of Tableau features. Below are some of the visualizations generated from Tableau for a better review of the entity’s operations.

Top Customers

Revenue by Category

Revenue per State

Revenue per Month

Challenges

The data cleaning, analysis and presentation had some challenges which the organization should improve in its operations. Below are some of the challenges encountered.

· Lack of documentation about the cost of producing or procuring the various commodities, which would have been used to compute profitability. Profitability is a critical key performance indicator that would have revealed better insights about the organization’s operations.

· Poor record-keeping regarding country of origin and customer ID. The missing information is material and the organization should institute measures for more accurate collection of sales records for better audits and business analysis.

Suggestions for the Business

The business is performing excellently, based on the number of orders completed in multiple regions. However, there are some measures that can be taken to enhance effectiveness of the entity’s operations.

I. Diversify markets to avoid reliance on the UK territory for most of its sales. Significant risks in the UK market may compromise the organization’s operations because it relies on the region for 84% of its operations. Franchising, direct entry and joint ventures can be used to expand operations into other countries.

II. Improvement of efficiency to reduce the sales returns in the organization. The data reveals numerous sales returns (9923) were made over the one year period, which is a highlight of ineffectiveness in customer satisfaction. Improvements should be cultivated in the entity to satisfy customers more sustainably.

III. Marketing the various categories of commodities would be an excellent move by the business, which will increase the organization’s overall performance. Promotions for commodities with low sales volume will yield desirable results in terms of profitability, which will increase the organization’s growth.

Data Source

Dr. Daqing Chen, Course Director: MSc Data Science. chend ‘@’ lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK. https://www.kaggle.com/datasets/lakshmi25npathi/online-retail-dataset

--

--