Data Cleaning: Exploring Beyond Plots

Published in

The Startup

6 min readOct 3, 2020

Exploring the hidden information in Online Retail dataset.

In the data science world, data cleaning and EDA is a major component which is covered in every discussion and books but many times we take a shortcut by either dropping the clumsy data and move forward or don’t do much data diligence. I am writing this article because I found that this particular data set is handled at many places but not due diligence is given to understand data. In The Online Retail data, which is used for cohort and RFM analysis, the articles quickly switching either to cohort analysis or RFM analysis and that is completely fair and justified given the focus of the article is different. The purpose of this write up is to understand the data, populate missing data, drop whatever is not needed, then do whatever you need to do. So in this article, I am not covering cohort or RFM but only the data cleaning part. Don’t take it as one specific case but the exploration done to understand data can be utilized for any complex data set.

Let’s jump to our investigation:

import pandas as pd
import numpy as np

Read data file using pandas and convert InvoiceDate to datetime.

d2 = pd.read_excel('Online Retail.xlsx', 
                     dtype={'CustomerID': str,
                          'InvoiceID': str},
                   )
d2['InvoiceDate'] = pd.to_datetime(d2.InvoiceDate)

Missing Values

Check how many missing values we got in dataset:

d2.isna().sum()InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

We see that Description and CustomerID have missing values.

Zero & Negative Values

From basic shopping experience we know that Quantity and UnitPrice can’t be zero for any stock. Let’s check:

d2[d2.Quantity<=0]

d2[d2.UnitPrice <=0]

We see that Quantity is negative where orders are cancelled but UnitPrice zero has no explanation., Also, we see that for many of the zero UnitPrice stocks, Description is missing. Same information we can get from describing the data as well.

d2.describe().T

Here we observe that the UnitPrice is not only zero but it is negative as well. We know that negative value in Quantity implied order cancellation but what about negative value in UnitPrice ? Let’s check that:

We see that this is a bad debt adjustment and is more related to internal working of retail store than actual sales. So this can be dropped off. Moreover, is this the only non sales related entry in data set or could there be more ? We will see that shortly.

Get Description Right

To know what exactly a Stockcode is, we need to know the description. The current data has inconsistent description, few times partial or blank as well. A quick look at stock count and description count says that they don’t match-up.

print(‘StockCode count: ‘, len(dr.StockCode.unique()))
print(‘Description count: ‘, len(dr.Description.unique()))

Besides above, we have 1454 missing descriptions that we saw very fist thing. To bring all the descriptions in sync, we will pickup the most common description of a stock and replace all the description with the most common description. Below is a sample code for that.

from collections import Countermissing_des = d2[d2.Description.isna()] \[['StockCode']].drop_duplicates()all_desc = d2[~d2.Description.isna()][['StockCode', 'Description']]def get_desc(gr):
    c = Counter(gr.Description.values)
    return c.most_common(1)[0][0]new_desc = all_desc.groupby('StockCode').apply(get_desc)for stock in new_desc.index.values:
    d2.loc[d2.StockCode==stock, 'Description'] = new_desc.loc[stock]

Once we do that, we will see that we have updated most of the missing descriptions. This will help us in zeroing down to product in case we deep dive in product analysis.

Getting back to Zero UnitPrice

We found that UnitPrice is zero for many products that seems to be an incorrect data entry/collection as same products have different UnitPrice in other invoices. We will replace all the zero unit price values with available mean value of the same stock.

zero_unit_priced_stock = d2[d2.UnitPrice <= 0.0]\['StockCode'].drop_duplicates()stock_unit_price = d2[d2.UnitPrice != 0].groupby('StockCode').agg({'UnitPrice': [max, min, 'mean']})stock_unit_price.columns = ["_".join(x) for x in stock_unit_price.columns.ravel()]for stock in zero_unit_priced_stock.values:
    if stock in stock_unit_price.index.values:
        d2.loc[(d2.StockCode == stock) & (d2.UnitPrice == 0), 'UnitPrice'] = stock_unit_price.loc[stock, 'UnitPrice_mean'].round(2)

Are there special transactions ?

Once above is done, most of the zero UnitPrices will be gone. For the remaining ones we need to take a deeper look. Any value that is still zero, means all the instances of that stock had zero values. Since our description is consistent(we updated all the descriptions), let’s see if that can be of any help.

d2[d2.UnitPrice <= 0.0]['Description'].value_counts(dropna=False)

From above, we can see the majority is NaN and remaining others are not a sales(except 2 cases). The two highlighted cases, and NaN needs further analysis while rest all can be dropped off from dataset as they are not actual sales.

Possibly fraudulent transactions

Let’s analyze the NaN values now as rest all are explained.

d2[(d2.UnitPrice <= 0.0) & d2.Description.isna() & d2.CustomerID.isna()]

If we see closely, all of them have missing customer id and missing description. On a quick analysis we find that these all stocks are existing only once in the whole data set. So these seem to be some kind of special transaction and not actually a sales transaction. These could be:

Free-offers along with actual purchase which is recorded for the sake of recording
Free stuff to the employees of the store such as : gifts, novelties
Things that no one bought and were lying in good condition to be given to someone.
Use your imagination or contact the store 😉

Since these are not standard sales transaction, we can drop all of them from dataset.

d2.drop(nan_desc_stock.index, axis=’index’, inplace=True)
d2.drop(d2[d2.UnitPrice <= 0.0].index, axis='index', inplace=True)

Now, we don’t have any stock with zero UnitPrice and undefined description.

Ghost Customers

Now only undefined values are CustomerIDs. We will have a quick look at them to see whether they are one time customers or is it an special transaction which is actually not sales.

Our first effort would be populate them from same invoices number assuming that different stock under same invoice may have recorded CustomerID.

invoice_missing_cust = d2[d2.CustomerID.isna()]['InvoiceNo'].drop_duplicates()d2[d2.InvoiceNo.isin(invoice_missing_cust.values) & (~d2.CustomerID.isna() )]

Above code shows us that all stocks withing an invoice is missing CustomerId and missing CustomerID can’t be populated from existing data.

These missing customer records can neither be used for cohort analysis not RFM analysis as these can’t be correlated with each other in any way. But these can be useful for non-returning customer’s buying pattern analysis. We can compare the buying pattern of one time customer with a regular customer. Depending on that, we can decide product placement strategies as well. As for as cohort analysis or RFM is concerned, we can drop all the missing customer ids.

I hope this explains why do almost all articles drop the missing CustomerIDs right at starting, though without giving an explanation, but hopefully now you got a better understanding why.

Please do like and comment if you loved the EDA that has no plots to look at.