Exploratory data analysis on customer.csv file from Kaggle using Google Spreadsheets

Published in

CodeX

6 min readOct 19, 2022

--

Hi guys, in this post, I would like to do basic exploratory data analysis, usually called EDA. I hope from reading this post, you could know what is the possibility you could do when you face some raw data in the future, and you want to explore it.

As my mission is to humanize data analysis for people around me, I will use Google Spreadsheets, assuming it’s a more common tool compared to using R or Python. I will do a similar thing using another tool at another chance. For now, let’s proceed!

The first thing first, to get the data

We can get the data in Kaggle, specifically on this page. Don’t forget to login into your account before downloading it. Or, you can just go directly here and copy the file so you can try to do your own EDA.

The second thing second (bruh?), defines the data

Image 1. Screenshot of the top 10 rows of customer data.

We have to know what is the definition of each column in our data before starting to explore it.

customer_id Registered customer identifier, unique for each row. Data type: INTEGER.
home_store Store identifier where customer registered, possibly not unique for each row (because different customers could be registered in the same store). Type: INTEGER.
customer_fist-name Registered customer’s first name, possibly not unique for each row (because different customers could be having the same name). Data type: STRING.
customer_email Registered customer’s email address, unique for each row. Data type: STRING.
customer_since Customer’s registration date, possibly not unique for each row (because different customers could be registered on the same date). Data type: DATE.
loyalty_card_number Registered customer’s loyalty card number, unique for each row. Data type: STRING.
birthdate Registered customer’s birthdate, possibly not unique for each row (because different customers could be having the same birthdate). Data type: DATE.
gender Registered customer’s gender, possibly not unique for each row (because different customers could be having the same gender, in this file the possibility only M for Male, F for Female, and N for Not-available). Data type: STRING.
birthyear Registered customer’s birthyear, possibly not unique for each row (because different customers could be having the same birthyear). Data type: INTEGER.

After defining the data, we also have to define the problem we want to look into. Let’s say we would like to know the basic summary of this dataset.

Combining customer_id with another column could tell us:

How many registered customers do we have?
How about the distribution of the store, is there any favorite store based on the number of registered customers?
What is the most common email domain used by customers?
What is the growth rate from time to time for our number of registered customers?
What is the proportion of gender from our registered customers?

With a similar framework, we could create some questions for other variables. But in this case, we would like to make sure if is there any problem existing or not, based on our data. Whatever the problem we could find, would be great additional information for our team.

Data cleaning and manipulation

Some data cleaning and manipulation I did on this dataset would be:

`birth_year_interval`

By using birth_year we could easily create a new variable to group it by using the “IF” formula. In my case, I’m using a formula like this:

=IF(I2 < 1960, ”1950s” , IF(I2 < 1970, ”1960s”, IF(I2 < 1980, ”1970s”, IF(I2 < 1990, ”1980s”, IF(I2 < 2000, ”1990s”, ”2000s”)))))

`customer_since_year`

By using customer_since we could easily create a new variable to group it by using the “IF” formula. In my case, I’m using a formula like this:

=IF(YEAR(E2) = 2017, 2017, IF(YEAR(E2) = 2018, 2018, 2019))

`customer_email_domain, customer_email_domain_head, customer_email_domain_tail`

By using customer_email we could easily create a new variable to group it by using the “SPLIT” formula. In my case, I’m using a formula like this:

=SPLIT(D2, ”@”)
=SPLIT(M2, ”.”)

We could do more than this. Any data manipulation you think is needed to explore our dataset better?

Data exploration and visualization

Please check on the sheet with the name eda. The common thing to do is to create a pivot table. How to create a pivot table in Google Spreadsheets as simple as this:

Insert > Pivot Table > Data Range > New Sheet > Create

After pivoting the dataset, we can do some data aggregation, until we can get the data in here and then visualize it like this:

Image 2. Visualize the number of registered customers by home_store and gender.

As we could see, after doing a little bit of EDA and visualization, we already get an insight that could be useful for us, but we still have to look deeper than this, maybe we can find the cause or prove that the problem exists. Below we will use our birth_yeardata to get birthyear_interval so we can have a simpler grouping.

Image 3. Visualize the number of registered customers by birthday_interval and gender.

Image 4. Visualize the number of registered customers by brithday_interval and gender. Filtered by store 5.

Image 5. Visualize the number of registered customers by brithday_interval and gender. Filtered by store 8.

Image 6. Visualize the number of registered customers by brithday_interval and gender. Filtered by store 3.

Wow, we are doing it, we got another information to know what is the main cause. Currently what we know is the N gender data, dominantly contributed by users registered at stores 5 and 8, with criteria birthday_interval between the 1980s and 2000s. Let’s see from our dataset, which other variables we could use to maybe dive deeper to know the cause.

Image 7. Visualize the number of registered users by gender and year when registered. Filter: store 5 & 8 only.

Image 8. Visualize the number of registered users by gender and year when registered. Filter: store 3 only.

We could assume that stores 5 and 8 having this problem since the start, but store 3 which is a newer store not have this problem. It could be our “old” store’s registration procedure not following the newest procedure implemented at the new store, causing the customer’s gender not recorded properly.

But, we still have another question remain, which is:

Why only customers with birthyear intervals between the 1980s and 2000s?

Image 9. Visualize the number of registered customers, grouped by gender and customer email domain tail. Filter: store 5 and 8, birthyear_interval 1980s until 2000s.

Image 10. Visualize the number of registered customers, grouped by gender and customer email domain tail. Filter: store 3, birthyear_interval 1980s until 2000s.

As we can see, no specific pattern for customer email domain tail that could be preferred by our customers with birthyear between the 1980s and 2000s. This means we have to look for another variable that may be the main cause.

But we realize we still have other variables not checked yet, would you try to do some EDA to look for the main cause? If after checking we can’t still find it yet, in your opinion, what data do we need to collect to get more insights related to this matter?

Data presentation

We can use Google Slides to put all our findings into one big slide presentation before we share it with others. Please find my previous post here that may be useful for you to create a data presentation.

Closing

That’s all for my post right now. I hope you enjoy reading this until finish and learn something new from this basic and simple EDA. If you have any ideas on how could we do better EDA, give your thoughts! Until next time, see you when I see you, data analysis fellows! :D