Exploratory data analysis on customer.csv file from Kaggle using Google Spreadsheets
Hi guys, in this post, I would like to do basic exploratory data analysis, usually called EDA. I hope from reading this post, you could know what is the possibility you could do when you face some raw data in the future, and you want to explore it.
As my mission is to humanize data analysis for people around me, I will use Google Spreadsheets, assuming it’s a more common tool compared to using R or Python. I will do a similar thing using another tool at another chance. For now, let’s proceed!
The second thing second (bruh?), defines the data
We have to know what is the definition of each column in our data before starting to explore it.
customer_id
Registered customer identifier, unique for each row. Data type: INTEGER.home_store
Store identifier where customer registered, possibly not unique for each row (because different customers could be registered in the same store). Type: INTEGER.customer_fist-name
Registered customer’s first name, possibly not unique for each row (because different customers could be having the same name). Data type: STRING.customer_email
Registered customer’s email address, unique for each row. Data type: STRING.customer_since
Customer’s registration date, possibly not unique for each row (because different customers could be registered on the same date). Data type: DATE.loyalty_card_number
Registered customer’s loyalty card number, unique for each row. Data type: STRING.birthdate
Registered customer’s birthdate, possibly not unique for each row (because different customers could be having the same birthdate). Data type: DATE.gender
Registered customer’s gender, possibly not unique for each row (because different customers could be having the same gender, in this file the possibility only M for Male, F for Female, and N for Not-available). Data type: STRING.birthyear
Registered customer’s birthyear, possibly not unique for each row (because different customers could be having the same birthyear). Data type: INTEGER.
After defining the data, we also have to define the problem we want to look into. Let’s say we would like to know the basic summary of this dataset.
Combining customer_id
with another column could tell us:
- How many registered customers do we have?
- How about the distribution of the store, is there any favorite store based on the number of registered customers?
- What is the most common email domain used by customers?
- What is the growth rate from time to time for our number of registered customers?
- What is the proportion of gender from our registered customers?
With a similar framework, we could create some questions for other variables. But in this case, we would like to make sure if is there any problem existing or not, based on our data. Whatever the problem we could find, would be great additional information for our team.
Data cleaning and manipulation
Some data cleaning and manipulation I did on this dataset would be:
birth_year_interval
By using birth_year
we could easily create a new variable to group it by using the “IF” formula. In my case, I’m using a formula like this:
=IF(I2 < 1960, ”1950s” , IF(I2 < 1970, ”1960s”, IF(I2 < 1980, ”1970s”, IF(I2 < 1990, ”1980s”, IF(I2 < 2000, ”1990s”, ”2000s”)))))
customer_since_year
By using customer_since
we could easily create a new variable to group it by using the “IF” formula. In my case, I’m using a formula like this:
=IF(YEAR(E2) = 2017, 2017, IF(YEAR(E2) = 2018, 2018, 2019))
customer_email_domain, customer_email_domain_head, customer_email_domain_tail
By using customer_email
we could easily create a new variable to group it by using the “SPLIT” formula. In my case, I’m using a formula like this:
=SPLIT(D2, ”@”)
=SPLIT(M2, ”.”)
We could do more than this. Any data manipulation you think is needed to explore our dataset better?
Data exploration and visualization
Please check on the sheet with the name eda
. The common thing to do is to create a pivot table. How to create a pivot table in Google Spreadsheets as simple as this:
Insert > Pivot Table > Data Range > New Sheet > Create
After pivoting the dataset, we can do some data aggregation, until we can get the data in here and then visualize it like this:
As we could see, after doing a little bit of EDA and visualization, we already get an insight that could be useful for us, but we still have to look deeper than this, maybe we can find the cause or prove that the problem exists. Below we will use our birth_year
data to get birthyear_interval
so we can have a simpler grouping.
Wow, we are doing it, we got another information to know what is the main cause. Currently what we know is the N gender data, dominantly contributed by users registered at stores 5 and 8, with criteria birthday_interval
between the 1980s and 2000s. Let’s see from our dataset, which other variables we could use to maybe dive deeper to know the cause.
We could assume that stores 5 and 8 having this problem since the start, but store 3 which is a newer store not have this problem. It could be our “old” store’s registration procedure not following the newest procedure implemented at the new store, causing the customer’s gender not recorded properly.
But, we still have another question remain, which is:
Why only customers with birthyear intervals between the 1980s and 2000s?
As we can see, no specific pattern for customer email domain tail that could be preferred by our customers with birthyear between the 1980s and 2000s. This means we have to look for another variable that may be the main cause.
But we realize we still have other variables not checked yet, would you try to do some EDA to look for the main cause? If after checking we can’t still find it yet, in your opinion, what data do we need to collect to get more insights related to this matter?
Data presentation
We can use Google Slides to put all our findings into one big slide presentation before we share it with others. Please find my previous post here that may be useful for you to create a data presentation.
Closing
That’s all for my post right now. I hope you enjoy reading this until finish and learn something new from this basic and simple EDA. If you have any ideas on how could we do better EDA, give your thoughts! Until next time, see you when I see you, data analysis fellows! :D