WhatsApp Message Exploratory Data Analysis (EDA)

“Understand more about ourselves through analysis”

Introduction

After a 6-week course with Jovian.ml, Data Analysis with Python: Zero to Pandas, I am finally able to do my final course project, which is a WhatsApp chat analysis.

I chose WhatsApp for my analysis because I have always wanted to know more about my chat room behaviour. Reading some of the notebooks from Prajwal Prashanth at Jovian.ml inspired me to start this analysis as my final course project.

In this project, I will try to find out what I normally do in a group chat with my friends, such as the hours when we usually talk and the number of emoji we use in the chat. Let's get started!

The dataset in use:

In this analysis, we are using a personal dataset from a WhatsApp chat. Anyone can export their own dataset from a WhatsApp group:

Open the group's Options menu, tap More, then choose Export chat

Before I start my analysis, there are a few things you need to take note of. This dataset comes from a real-life chat that ran from 05/02/2020 to 21/09/2020. The main purpose of the group is to exchange knowledge and ideas during our university life. The group consists of three users: Ed, Rohit and Pei Yin.

You can view more code and explanation in my Jovian notebook for a better understanding: Project Notebook: my_notebook

1. Import Basic Libraries & Read the Dataset

Importing the needed libraries
Reading the text file

Here we start by importing the libraries we will use on our dataset.

Then we use a pandas reader such as pd.read_csv to read the text file.
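As a rough illustration of this step, the sketch below loads chat lines into pandas and splits each one into three unnamed columns (0, 1 and 2). The line layout "dd/mm/yyyy, hh:mm - user: message", the sample lines and the names sample_export, line_pattern and raw_df are assumptions for illustration; the notebook may read the file differently.

```python
import pandas as pd

# Hypothetical sample standing in for an exported chat file; for a real export,
# read the lines with open(path, encoding="utf-8").read().splitlines() instead.
sample_export = """05/02/2020, 10:15 - Ed: Hello everyone
05/02/2020, 10:16 - Rohit: Hi la
05/02/2020, 10:17 - Pei Yin: <Media omitted>
this line continues the previous message"""

# One raw chat line per row.
lines = pd.Series(sample_export.splitlines())

# Assumed layout: "dd/mm/yyyy, hh:mm - user: message".
line_pattern = r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.*)$"

# Unnamed capture groups become columns 0, 1 and 2; lines that do not match
# the pattern (such as continuation lines) become rows of NaN.
raw_df = lines.str.extract(line_pattern)
print(raw_df)
```

Rows that fail to match are kept as NaN rather than dropped, which is one way a frame can end up with missing values like the ones described in the next section.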

2. Data Preparation and Cleaning

Data understanding

Before we start any analysis, we need to understand both the business side and the data side:

Data Understanding:

  1. Here we can see that there are 3 columns in the dataset
  2. The dataset contains dates, text and NaN values
  3. Using info(), we can see that the row counts of the columns are not balanced: there are 21k messages, but some of the columns have only 23k and 700 entries
  4. Now that we know the dataset has unknown values and unbalanced row counts, we can clean the data
Dataframe
Data information

So now we understand that the column names need to be changed: instead of 0, 1 and 2, we use more meaningful names such as DateTime, user and messages, and we store the result as whatsapp_df. We also want every column to have the same number of rows. In this project, you will notice that I repeatedly copy whatsapp_df into other data frames. The diagram below shows how to convert the text file into a data frame.

Convert text into a data frame
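The renaming and balancing described above could look roughly like the sketch below. The sample rows are made up, and using dropna() to make the row counts equal is an assumption about how the cleaning was done, not a copy of the notebook code.

```python
import pandas as pd

# Hypothetical raw frame with default column names 0, 1 and 2 and one empty row.
raw_df = pd.DataFrame({
    0: ["05/02/2020, 10:15", "05/02/2020, 10:16", None],
    1: ["Ed", "Rohit", None],
    2: ["Hello everyone", "Hi la", None],
})

# Non-null count per column, the same figure that info() reports.
print(raw_df.notna().sum())

# Give the columns meaningful names, then drop incomplete rows so that
# every column ends up with the same number of entries.
whatsapp_df = (
    raw_df.rename(columns={0: "DateTime", 1: "user", 2: "messages"})
    .dropna()
    .reset_index(drop=True)
)

# Parse the day-first timestamp text into real datetime values.
whatsapp_df["DateTime"] = pd.to_datetime(
    whatsapp_df["DateTime"], format="%d/%m/%Y, %H:%M"
)
print(whatsapp_df)
```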

Cleaning the image data

After cleaning the column data, we must remove all the image/media entries, because our analysis questions are about the text rather than the images. Here we have 11k image entries. The diagram below shows how to drop them.

Check the number of images
Drop all rows that contain media
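A minimal sketch of counting and dropping media rows is shown below. It assumes the export marks attachments with the placeholder text "<Media omitted>"; check your own export for the exact wording. The sample rows and the names MEDIA_PLACEHOLDER, is_media and text_df are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned frame; two of the five rows are media placeholders.
whatsapp_df = pd.DataFrame({
    "user": ["Ed", "Rohit", "Pei Yin", "Rohit", "Ed"],
    "messages": ["Hello", "<Media omitted>", "Okay la", "<Media omitted>", "See you"],
})

# Assumed placeholder that the export writes instead of an attached file.
MEDIA_PLACEHOLDER = "<Media omitted>"

# Boolean mask marking the media rows, and how many there are.
is_media = whatsapp_df["messages"] == MEDIA_PLACEHOLDER
media_count = int(is_media.sum())
print("media rows:", media_count)

# Keep only the text messages for the rest of the analysis.
text_df = whatsapp_df[~is_media].reset_index(drop=True)
print(text_df)
```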

3. Let's Get Started on the Exploratory Data Analysis (EDA)

Question 1: Which user has the most chats/messages in the group?

In any WhatsApp analysis, we always want to know which user chats the most in the group. This helps us determine the most active person in the chat group.

Using pandas:

As you can see, we can use pandas to understand the data and even sort it in ascending order. Now we can see that the user with the most messages in the group is “Rohit”.

Using pandas to understand the data
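One way to get per-user message counts with pandas is value_counts(), sketched below on made-up rows. The data and variable names are illustrative only and do not reproduce the real chat.

```python
import pandas as pd

# Hypothetical chat rows; only the "user" column matters for this question.
whatsapp_df = pd.DataFrame({
    "user": ["Rohit", "Ed", "Rohit", "Pei Yin", "Rohit", "Ed"],
    "messages": ["hi", "hello", "lunch?", "ok la", "coming", "wait"],
})

# Messages per user; value_counts() sorts from most to fewest by default.
message_counts = whatsapp_df["user"].value_counts()

# The same counts in ascending order, as mentioned in the text.
ascending_counts = message_counts.sort_values()

# The most active user is the label with the largest count.
top_user = message_counts.idxmax()
print(ascending_counts)
print("most active:", top_user)
```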

Data Visualization:

We are going to use a plot and a bar chart for our data visualization. As you can see, the results show that the highest number of messages, around 10k, comes from the user called “Rohit”, which shows that “Rohit” is a very active member of the group.

Plot chart
Bar Chart

Want to know more about the code? Please visit my_notebook.

Question 2: Which emojis are used the most, and by which users?

Now we want to know which emoji each user uses most widely. From the analysis, we can assume that a user will most likely use the same emoji again in other chats.

Using pandas:

As a result, you can see that the most used emoji in the WhatsApp group is Face with Tears of Joy.

Using pandas to understand the data
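The sketch below shows one way to count emoji and compute each emoji's share of the total using only the Python standard library. The is_emoji() range check is a rough heuristic assumed for illustration (it splits multi-part emoji such as Woman Gesturing OK into separate symbols); the notebook may rely on a dedicated emoji library instead. The messages are made up.

```python
import unicodedata
from collections import Counter


def is_emoji(ch: str) -> bool:
    """Rough heuristic: treat common pictograph code point ranges as emoji."""
    code_point = ord(ch)
    return 0x1F300 <= code_point <= 0x1FAFF or 0x2600 <= code_point <= 0x27BF


# Hypothetical messages.
messages = ["Okay la 😂😂", "Nice 👍", "😂 done", "👏 well done"]

# Count every emoji character across all messages.
emoji_counts = Counter(ch for msg in messages for ch in msg if is_emoji(ch))
total = sum(emoji_counts.values())

# Print each emoji with its Unicode name and its share of all emoji used.
for symbol, count in emoji_counts.most_common():
    print(symbol, unicodedata.name(symbol).title(), f"{count / total:.1%}")

top_emoji, top_count = emoji_counts.most_common(1)[0]
```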

Data Visualization:

Before we look at each user to determine which emoji they use most, we need to look at the overall emoji use across the three users. As you can see in the results, the most widely used emoji among the three users is Face with Tears of Joy, at around 79.7% of the total. So we can agree that most of the time the users will use the Face with Tears of Joy emoji in this group chat.

Pie Chart of emoji

Here we can see which emoji each user uses the most.

User: Ed

  1. This user uses the Woman Gesturing OK emoji the most
  2. The second is the Tears of Joy emoji

User: Rohit

  1. This user uses a diverse range of emoji in the group chat, so we cannot determine which emoji he uses the most or will use in the future.

User: Pei Yin

  1. This user mostly uses two emoji: Thumbs Up and Clapping
Pie chart of the emoji each user uses the most
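A per-user breakdown like the one above can be sketched by keeping one counter per user. The users "A" and "B", their messages and the is_emoji() range check are all hypothetical, chosen so the example does not restate the real results.

```python
from collections import Counter, defaultdict


def is_emoji(ch: str) -> bool:
    """Rough heuristic: treat common pictograph code point ranges as emoji."""
    code_point = ord(ch)
    return 0x1F300 <= code_point <= 0x1FAFF or 0x2600 <= code_point <= 0x27BF


# Hypothetical (user, message) pairs.
chat = [
    ("A", "ok 👍"),
    ("B", "👍👏"),
    ("B", "great 👏"),
    ("A", "😂😂"),
]

# One emoji counter per user.
per_user = defaultdict(Counter)
for user, message in chat:
    per_user[user].update(ch for ch in message if is_emoji(ch))

# The most used emoji for each user.
top_per_user = {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}
print(top_per_user)
```

When a user's counts are nearly equal across many emoji, the top entry says little, which matches the observation that no single emoji stands out for Rohit.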

Want to know more about the code? Please visit my_notebook.

Question 3: The most active hour in WhatsApp

This analysis helps us understand the hours when the members are most active in WhatsApp. We depend on two variables: the number of messages and the hour. Then we will know which hours are the most active.

Using pandas:

In this data frame, the most active hour in WhatsApp is 1300hrs.

Most active hours in the data frame
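As a sketch of how the hourly counts might be derived, the snippet below extracts the hour from a DateTime column and counts messages per hour. The timestamps are made up, and filling unseen hours with zero is a choice made here so that quiet hours show up explicitly.

```python
import pandas as pd

# Hypothetical timestamps; the real frame has one row per message.
whatsapp_df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2020-07-01 13:05",
        "2020-07-01 13:40",
        "2020-07-02 00:30",
        "2020-07-02 13:15",
        "2020-07-03 21:10",
    ])
})

# Messages per hour of day, with hours that never occur filled in as 0.
hour_counts = (
    whatsapp_df["DateTime"].dt.hour.value_counts().reindex(range(24), fill_value=0)
)

# The busiest hour and the hours with no messages at all.
busiest_hour = int(hour_counts.idxmax())
quiet_hours = hour_counts[hour_counts == 0].index.tolist()
print("busiest hour:", busiest_hour)
print("quiet hours:", quiet_hours)
```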

Data Visualization:

In this analysis, we found that the most active hour in WhatsApp is 1300 hours, because that is usually our lunch break and we normally chat during that hour.

Surprisingly, we found that no user is active between 5 and 7 am, but there are still users active between 12 and 2 am over the past 8 months. So I assume most of the users are late sleepers.

Bar chart of the most active hours in WhatsApp

Want to know more about the code? Please visit my_notebook.

Question 4: Which month has the most messages and is the busiest?

The chat covers 05/02/2020 to 21/09/2020. Here we hope to find out which month was our busiest by looking at the number of messages generated.

Using pandas:

We can see each month and the number of messages generated during it. The most messages were generated in July (7).
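Monthly counts can be sketched the same way, by grouping on the month number. The dates below are invented and only mimic the shape of the real data (a gap in March and a peak in July); the variable names are illustrative.

```python
import pandas as pd

# Hypothetical message dates in day/month/year form.
dates = pd.to_datetime(
    [
        "05/02/2020", "10/04/2020", "03/05/2020", "15/06/2020",
        "01/07/2020", "12/07/2020", "20/07/2020", "08/08/2020", "21/09/2020",
    ],
    format="%d/%m/%Y",
)
whatsapp_df = pd.DataFrame({"DateTime": dates})

# Messages per calendar month, ordered by month number.
month_counts = whatsapp_df["DateTime"].dt.month.value_counts().sort_index()

# The busiest month, and months inside the chat period with no messages.
busiest_month = int(month_counts.idxmax())
silent_months = [m for m in range(2, 10) if m not in month_counts.index]
print(month_counts)
print("busiest:", busiest_month, "silent:", silent_months)
```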

Data Visualization:

In this analysis, we found that the busiest month is July (7), when the total number of messages reached around 7000. The reason is that in that month we were all busy with university assignments and the mid-term test. This shows that the users were very active during that month. In the following month, you can see a decrease in chat. This is most likely because the users were too busy with the mid-term and assignment due dates.

Moreover, you can see that there is no March (3). This is because during that period Malaysia went through a pandemic lockdown due to COVID-19, so the group was silent until the university resumed in e-learning mode. Because of e-learning, you can see an increase in April (4) and a drop in May (5) due to the university semester break.

Want to know more about the code? Please visit my_notebook.

Question 5: Which word or text did the users use the most?

Here we use a word cloud as a visual representation of the words in the chat to determine which word is most widely used by the users. The reason for this analysis is to understand user behaviour: if a word is used repeatedly, we can say that the user is more likely to use that word or text again in other chats.

Data Visualization:

As you can see, the word we use most is “la”. The word “la” is very well known and commonly used among Malaysians. We use it all the time, as in “Yes la”, “No la”, “Okay la”, “coming la” and so on. For those who do not know what “la” is, you can treat it as Malaysian slang; it is the way we Malaysians talk to others.
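A word cloud sizes each word by how often it appears, so the underlying numbers are plain word counts. The sketch below computes those counts with the standard library on made-up messages; it is not the word cloud code from the notebook, and the simple lowercase-letters tokenizer is an assumption.

```python
import re
from collections import Counter

# Hypothetical messages built from the phrases mentioned above.
messages = ["Yes la", "No la", "Okay la", "coming la", "yes okay"]

# Lowercase every message and split it into simple alphabetic words.
words = [word for msg in messages for word in re.findall(r"[a-z']+", msg.lower())]

# Frequency of each word; a word cloud would draw the largest counts biggest.
word_counts = Counter(words)
print(word_counts.most_common(3))
```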

Want to know more about the code? Please visit my_notebook.

Share and Like:

So this is all I have to share with you. I hope you enjoyed reading my post!

I hope you will share and like my write-up!

Remember to follow me on LinkedIn and Jovian.ml. See ya!
