BITS OF PYTHON

Exploratory Data Analysis Using Python for Beginners

Sharon Regina
Published in CodeX
5 min read · Nov 13, 2022

Simple EDA of ABC’s Black Friday Sales to Determine their Highest Paying Customers

This time, I’m going to share my attempt at EDA using a Black Friday Sales dataset from Kaggle. For reference, the notebook containing the code can be found here.

The goal of the analysis is to help ABC increase its number of high-paying customers, so let’s dive into the dataset and see what we can find!

1. Draw up the info of the dataset to see the columns, null counts, and data types. Here we can see that two fields contain nulls: Product_Category_2 and Product_Category_3. This means every product has at least one product category, but not necessarily a second or third.
Dataset Info by Author
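
Since the screenshot is not reproduced here, this is a minimal sketch of what this step might look like in pandas. The rows are made up for illustration and only mimic the dataset's column names; the actual notebook would load the Kaggle CSV instead.

```python
import pandas as pd

# Made-up rows that mimic the dataset's columns (not real Kaggle data).
df = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000003],
    "Product_ID": ["P001", "P002", "P003", "P004"],
    "Product_Category_1": [3, 1, 8, 1],
    "Product_Category_2": [None, 6.0, None, 2.0],
    "Product_Category_3": [None, 14.0, None, None],
    "Purchase": [8370, 15200, 7969, 15227],
})

# Column names, non-null counts, and dtypes in one overview.
df.info()

# Nulls per column, keeping only the columns that actually have some.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])
```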

2. Look at the sample data to determine whether each column is numeric or categorical, then convert the column data types to object (categorical) or int (numeric).

Sample data by Author
Convert column data types by Author
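
A rough sketch of the conversion, assuming coded columns such as Occupation are treated as categorical and Purchase as numeric. Which columns the notebook actually converted is not visible here, so the lists below are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up sample rows; the column choice below is an assumption.
df = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003],
    "Occupation": [10, 16, 15],
    "Marital_Status": [0, 0, 1],
    "Purchase": [8370.0, 7969.0, 15227.0],
})

# Peek at a few rows to judge which columns are labels and which are amounts.
print(df.head())

# Codes like Occupation are labels, not quantities, so store them as object.
categorical_cols = ["User_ID", "Occupation", "Marital_Status"]
dtype_map = {col: "object" for col in categorical_cols}
dtype_map["Purchase"] = "int64"  # the amount stays numeric

df = df.astype(dtype_map)
print(df.dtypes)
```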

3. Look at the unique values of categorical fields. It seems that some field values are masked (Occupation, Product Category).

Categorical fields unique values by Author
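
One possible way to list the unique values per categorical column, shown on made-up rows:

```python
import pandas as pd

# Made-up rows using the dataset's categorical column names.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "26-35", "55+"],
    "City_Category": ["A", "C", "B", "A"],
    "Occupation": [10, 16, 16, 7],
})

# Collect the sorted unique values of every column.
unique_values = {col: sorted(df[col].unique().tolist()) for col in df.columns}

for col, values in unique_values.items():
    print(f"{col}: {len(values)} unique -> {values}")
```

Numeric codes with no label, like the Occupation values here, are what the text means by masked fields.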

4. Next, let’s take a look at the statistical summary of Purchase, the target variable. The min and max are far from the mean, but the mean and median are close to each other. Hmm, are those extreme values incorrect data points?

Purchase statistical summary by Author
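
The summary can be sketched with `describe()`. The values below are invented to show the same pattern (extreme min and max, mean close to the median), not taken from the dataset.

```python
import pandas as pd

# Invented purchase amounts, chosen only to illustrate the pattern.
purchase = pd.Series([12, 5800, 8000, 8100, 12000, 23900], name="Purchase")

# count, mean, std, min, quartiles, and max in one call.
summary = purchase.describe()
print(summary)

# How far apart are the mean and the median ("50%")?
mean_median_gap = abs(summary["mean"] - summary["50%"])
print(f"mean - median gap: {mean_median_gap:.0f}")
```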

5. Let’s check whether that Purchase value of 12 is a data-entry error. Probably not, since 101 rows have that value.

Check if there are incorrect data points by Author
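
A small sketch of this check, assuming 12 is the minimum of Purchase. The rows are made up, so the count here is 3 rather than the 101 found in the real data.

```python
import pandas as pd

# Made-up rows; in the real data the article reports 101 rows at 12.
df = pd.DataFrame({
    "Product_ID": ["P001", "P002", "P003", "P001", "P004"],
    "Purchase": [12, 12, 12, 8370, 15200],
})

min_purchase = df["Purchase"].min()
rows_at_min = df[df["Purchase"] == min_purchase]

# A value repeated across many rows and products is unlikely to be a typo.
print(f"{len(rows_at_min)} rows have Purchase == {min_purchase}")
print(f"{rows_at_min['Product_ID'].nunique()} distinct products among them")
```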

6. What’s the granularity of this dataset? Does each Product_ID have only one Purchase price? Nope! It seems that Purchase is the total purchase amount per product per User_ID. From here we know that the dataset is not unique per User_ID, and since there is no purchase time, it is not unique per purchase either.

Product_ID unique value count by Author
Dataset granularity check by Author
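
The granularity questions above can be sketched as three quick checks. The rows are made up, and the notebook's exact approach may differ.

```python
import pandas as pd

# Made-up rows: the same product appears with different Purchase values.
df = pd.DataFrame({
    "User_ID": [1, 2, 3, 1, 2],
    "Product_ID": ["P001", "P001", "P001", "P002", "P002"],
    "Purchase": [8370, 8012, 8370, 15200, 15200],
})

# Does each product have a single Purchase value? More than 1 means no.
prices_per_product = df.groupby("Product_ID")["Purchase"].nunique()
print(prices_per_product)

# Is the table unique per user? A count above 1 means it is not.
rows_per_user = df["User_ID"].value_counts()
print(rows_per_user)

# Is it unique per (user, product) pair? 0 duplicates means yes.
duplicate_pairs = int(df.duplicated(subset=["User_ID", "Product_ID"]).sum())
print(f"duplicate (User_ID, Product_ID) pairs: {duplicate_pairs}")
```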

7. Okay, now that we know the granularity of the dataset, we can decide whether to remove the outliers, that is, the data points far from the median. Since they are not incorrect data and we are only doing exploratory analysis, it is unnecessary to remove them.

Data distribution of Purchase by Author
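
To make "far from the median" concrete, here is a sketch that flags outliers with the common 1.5 × IQR rule without dropping them. The IQR rule is an assumption for illustration; the article does not say which criterion, if any, was used. The values are invented.

```python
import pandas as pd

# Invented purchase amounts.
purchase = pd.Series(
    [12, 5800, 6900, 8000, 8100, 9500, 12000, 23900], name="Purchase"
)

# 1.5 * IQR fences (a common convention, assumed here).
q1, q3 = purchase.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the rows instead of removing them, as the analysis keeps all data.
is_outlier = (purchase < lower) | (purchase > upper)
print(f"fences: [{lower:.0f}, {upper:.0f}]")
print(purchase[is_outlier])
```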

8. This analysis focuses on the store’s customers rather than its products. Since the data is not distinct per User_ID, making it distinct gives more reliable results when we count the population. To do this, the Product-related columns are dropped first, then Purchase is summed per User_ID.

Data manipulation to make it distinct per User ID by Author
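
A minimal sketch of this reshaping on made-up rows. It assumes each customer's attributes (Gender, Age, City_Category) are constant across their rows, so grouping by them alongside User_ID yields one row per customer.

```python
import pandas as pd

# Made-up purchase rows: several products per user.
df = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 3],
    "Product_ID": ["P001", "P002", "P001", "P003", "P004", "P005"],
    "Product_Category_1": [3, 1, 3, 8, 5, 1],
    "Gender": ["F", "F", "M", "M", "M", "M"],
    "Age": ["0-17", "0-17", "55+", "26-35", "26-35", "26-35"],
    "City_Category": ["A", "A", "C", "B", "B", "B"],
    "Purchase": [8370, 15200, 7969, 1000, 2000, 3000],
})

# Drop everything that describes the product rather than the customer.
product_cols = [col for col in df.columns if col.startswith("Product_")]
customer_cols = ["User_ID", "Gender", "Age", "City_Category"]

# Sum Purchase per customer and rename it to Sum_Purchase.
customers = (
    df.drop(columns=product_cols)
    .groupby(customer_cols, as_index=False)["Purchase"]
    .sum()
    .rename(columns={"Purchase": "Sum_Purchase"})
)
print(customers)
```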

9. Since the target variable Sum_Purchase is numeric and the rest are categorical, let’s try illustrating the relationship between Gender and Sum_Purchase using a histogram. We can see that most customers have a Sum_Purchase between 0 and 0.1e7 (1,000,000), and this is true for both Genders. There are also many more Male customers than Female customers.

Gender histogram by Author
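
A sketch of a per-gender histogram using plain matplotlib. The original plot may have been drawn with a different library, and the customer rows below are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up per-customer totals.
customers = pd.DataFrame({
    "Gender": ["M", "M", "M", "M", "F", "F"],
    "Sum_Purchase": [250_000, 800_000, 1_500_000, 4_000_000, 300_000, 900_000],
})

# One array of totals per gender (groupby sorts the keys: F, then M).
labels, values = [], []
for gender, group in customers.groupby("Gender"):
    labels.append(gender)
    values.append(group["Sum_Purchase"].to_numpy())

fig, ax = plt.subplots()
# Passing a list of arrays draws the groups side by side on shared bins.
counts, bin_edges, _ = ax.hist(values, bins=10, label=labels)
ax.set_xlabel("Sum_Purchase")
ax.set_ylabel("Number of customers")
ax.legend(title="Gender")
# In a script, call plt.show() here; a notebook displays the figure itself.
```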

10. Next, let’s draw a distribution plot to compare the probability density of Sum_Purchase for each Gender. Females have a higher density than Males for a Sum_Purchase of 0–530,000. For higher amounts (above 530,000), the two Genders differ noticeably, with Males having the higher density. Hence, the higher-paying customers are most likely Male.

Gender Distribution Plot by Author
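
The comparison can also be checked numerically. The sketch below uses a density-normalised histogram as a simple stand-in for the smoothed distribution plot, plus the share of each gender above the 530,000 mark mentioned in the text. The rows are made up, so the numbers only illustrate the method.

```python
import numpy as np
import pandas as pd

# Made-up per-customer totals.
customers = pd.DataFrame({
    "Gender": ["F"] * 4 + ["M"] * 6,
    "Sum_Purchase": [
        200_000, 350_000, 500_000, 900_000,
        150_000, 400_000, 700_000, 1_200_000, 2_500_000, 4_000_000,
    ],
})

# Shared bins so both genders are compared on the same scale.
bins = np.linspace(0, customers["Sum_Purchase"].max(), 9)

# density=True rescales each group so its bars integrate to 1,
# which removes the effect of one group simply being larger.
density = {
    gender: np.histogram(group["Sum_Purchase"], bins=bins, density=True)[0]
    for gender, group in customers.groupby("Gender")
}

# Share of each gender's customers above the threshold from the text.
threshold = 530_000
share_above = customers.groupby("Gender")["Sum_Purchase"].apply(
    lambda totals: (totals > threshold).mean()
)
print(share_above)
```

Normalising per group matters here because there are many more Male customers; raw counts alone would make Males look dominant in every range.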

11. Let’s do the same for City_Category. Here we can see that City Category B contributes the most customers. Customers in City Category C tend to have lower purchasing power, while those in City Category A tend to have higher purchasing power.

City category visualization by Author
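
As a numeric companion to the plots, a sketch that summarises customer count and median Sum_Purchase per city category. Using the median as a proxy for purchasing power is an assumption, and the rows are made up to mirror the pattern described.

```python
import pandas as pd

# Made-up per-customer totals.
customers = pd.DataFrame({
    "City_Category": ["A", "A", "B", "B", "B", "C", "C"],
    "Sum_Purchase": [
        900_000, 1_500_000, 400_000, 800_000, 1_000_000, 200_000, 300_000,
    ],
})

# Customer count and median total per city category.
city_summary = customers.groupby("City_Category")["Sum_Purchase"].agg(
    customers="count", median_purchase="median"
)
city_summary["customer_share"] = (
    city_summary["customers"] / city_summary["customers"].sum()
)
print(city_summary)
```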

12. Since Age has quite a lot of groups (‘0–17’ ‘55+’ ‘26–35’ ‘46–50’ ‘51–55’ ‘36–45’ ‘18–25’), I decided to draw up a pie chart first to see which groups have the most customers. From the pie chart, we can see that age groups 26–35, 36–45, and 18–25 make up 78% of the customers, which suggests the store’s target market is probably university students and working adults. Moreover, group 26–35 is less likely than the other two groups to have a purchase amount below 600,000.

Age pie chart by Author
Age distribution by Author
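
A sketch of the age pie chart with matplotlib, on made-up customers. The 80% top-three share below comes from the toy data, not from the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up customers, one row each.
customers = pd.DataFrame({
    "Age": ["26-35"] * 4 + ["36-45"] * 2 + ["18-25"] * 2 + ["46-50", "55+"],
})

# Share of customers per age group, largest first.
age_share = customers["Age"].value_counts(normalize=True)

fig, ax = plt.subplots()
ax.pie(age_share.to_numpy(), labels=age_share.index.tolist(), autopct="%1.0f%%")
ax.set_title("Customers by age group")

# Combined share of the three largest groups.
top3_share = age_share.head(3).sum()
print(f"Top 3 age groups: {top3_share:.0%} of customers")
```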

13. Next, let’s take a look at Occupation. Among occupations 4, 7, and 0, occupations 4 and 0 seem more likely to have higher-paying customers.

Occupation count plot by Author
Occupation distribution plot by Author
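
A sketch of one way to compare occupations. It assumes that occupations 4, 7, and 0 were singled out because they have the most customers (the text does not say so explicitly) and uses a hypothetical definition of "high-paying": a Sum_Purchase in the top quartile overall. The rows are made up.

```python
import pandas as pd

# Made-up per-customer totals.
customers = pd.DataFrame({
    "Occupation": [4, 4, 4, 4, 0, 0, 0, 7, 7, 7, 1, 12],
    "Sum_Purchase": [
        1_200_000, 900_000, 1_500_000, 300_000,
        1_100_000, 400_000, 1_300_000,
        200_000, 350_000, 500_000,
        250_000, 600_000,
    ],
})

# The three occupations with the most customers.
top_occupations = customers["Occupation"].value_counts().head(3).index.tolist()
subset = customers[customers["Occupation"].isin(top_occupations)]

# Hypothetical cutoff: top quartile of all customers counts as high-paying.
cutoff = customers["Sum_Purchase"].quantile(0.75)
high_share = (subset["Sum_Purchase"] >= cutoff).groupby(subset["Occupation"]).mean()
print(high_share)
```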

14. Lastly, even though most customers are not married (0), Marital Status doesn’t seem to make any significant difference to Sum_Purchase.

Marital status count plot by Author
Marital status distribution plot by Author
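
A quick numeric cross-check for this kind of claim is to compare the quartiles of Sum_Purchase per group. The rows below are made up to show two groups with similar distributions, and the `Marital_Status` column name is assumed.

```python
import pandas as pd

# Made-up per-customer totals; 0 = not married, 1 = married.
customers = pd.DataFrame({
    "Marital_Status": [0, 0, 0, 0, 1, 1, 1],
    "Sum_Purchase": [
        300_000, 700_000, 900_000, 1_500_000, 350_000, 800_000, 1_400_000,
    ],
})

# Count and quartiles side by side for both groups.
by_status = customers.groupby("Marital_Status")["Sum_Purchase"].describe()
by_status = by_status[["count", "25%", "50%", "75%"]]
print(by_status)
```

Similar quartiles across groups, despite different counts, are consistent with the attribute having little effect on spending.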

In conclusion, based on the store’s Black Friday sales data, if the store wants to increase its higher-paying customers, it should target 26–35-year-old Males from City Category A with Occupation 4 or 0.

This is my first attempt at EDA using Python and I wanted to share what I found here :D Any suggestions, comments, or constructive criticism are warmly welcome!

For any inquiries, I can be contacted via LinkedIn. Hope this article can serve as an introduction to your EDA practice!


Sharon Regina
CodeX

Business Intelligence enthusiast who wants to learn more by sharing my experiences and research. Views expressed here are solely my own & not my employer’s.