WhatsApp Group Chat Analysis.

Cibhi Baskar
Published in Analytics Vidhya
9 min read · Jul 25, 2020

WhatsApp is a messaging app for smartphones founded in 2009 by two former Yahoo employees, Brian Acton and Jan Koum. With 2 billion users in 180 countries, it is the most popular messaging app in the world. WhatsApp has stayed focused on providing a clean, fast chat service that works reliably. Every feature it added, including voice messaging with a simple one-click implementation, was a straightforward extension of the core function of a text-chat app.

India is the biggest WhatsApp market in the world, with 340 million users. Roughly 65 billion messages are sent using WhatsApp every day, and an average of 29 million messages are sent per minute.

WhatsApp provides an option to export a chat, which we can then use for analysis. Exporting a group chat is recommended because it tends to be larger.

You will find the “Export chat” option in your WhatsApp chat (group/individual). Find the vertical ellipsis symbol > More > Export chat. Note: This feature isn’t supported in Germany.

Export chat option in WhatsApp

The “Export chat” option converts the WhatsApp group conversation into a text file (.txt). Once the conversion was complete, I shared the file to my email ID for the analysis.

Make sure you export the messages “without media”, as we won’t need the media files for this analysis. When exporting with media, you can send up to the latest 10,000 messages. Without media, you can send 40,000 messages.

Python Libraries

I use Jupyter Notebook, an open-source web application that lets you see intermediate outputs easily. These are the packages I used for this analysis.

  • Pandas module to manipulate the data by creating the DataFrame.
  • Matplotlib is a visualization library which is used to generate insights from the data through visual methods.
  • RegEx (re): a RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. It can be used to check whether a string contains the specified search pattern. The code in this article uses Python’s built-in re module, which is part of the standard library and needs no installation. The line below, run in the Anaconda Prompt, installs the separate third-party regex package; it is optional and not required for the code shown here.
conda install -c conda-forge regex
  • Emoji to deal with emojis in the messages sent by the users. As the Emoji module is not pre-installed with Jupyter Notebook, run the line below in the Anaconda Prompt to install the package.
conda install -c conda-forge emoji

The Data

I exported the chat of a WhatsApp group with 15 members that was created back in November 2017.

The text file from my email looks something like this.

01/02/2019, 06:01 - Siddharth: Good Morning friends!
01/02/2019, 09:53 - Adithya: Do we have class?
01/02/2019, 09:53 - Adithya: Who is the teacher?
01/02/2019, 09:53 - Emma Watson: Akila Ma'am
01/02/2019, 09:53 - Emma Watson: don't be late!
01/02/2019, 10:11 - Adithya: Thank you boy!
01/02/2019, 10:14 - Bala: Where are you guys?

The file consists of 40,000 messages dated from 01/02/2019 to 24/07/2020.

Looking at the file, we can see 4 attributes: Date, Time, Author and Message. These 4 attributes will be the columns of my Pandas DataFrame.

Note: Emmanuel is disguised as Emma Watson here. Though I could change the author value using replace() in Pandas, I just don’t want to lose the originality😛

Creating the DataFrame

This plain text file will have to be tokenized and parsed into the above-mentioned attributes in a meaningful manner to be stored in a Pandas DataFrame.

To detect the {Date} and {Time} tokens in a line of text, I use RegEx matching. This also tells us whether a line of text starts a new message or belongs to a multiline message.

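import re

def date(l):
    # A new message starts with 'DD/MM/YY(YY), HH:MM -'.
    # The raw string keeps the backslashes intact for the regex engine.
    pattern = r'^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)(\d{2}|\d{4}), ([0-9][0-9]):([0-9][0-9]) -'
    result = re.match(pattern, l)
    if result:
        return True
    return False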
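As a quick sanity check, the sketch below runs the same pattern against a few lines shaped like the sample export. The helper name starts_with_timestamp and the sample strings are my own illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as in date() above, written as a raw string.
TIMESTAMP = r'^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)(\d{2}|\d{4}), ([0-9][0-9]):([0-9][0-9]) -'

def starts_with_timestamp(line):
    """Return True when the line opens with 'DD/MM/YYYY, HH:MM -'."""
    return re.match(TIMESTAMP, line) is not None

samples = [
    "01/02/2019, 06:01 - Siddharth: Good Morning friends!",  # new message
    "see you at the gate",                                   # continuation line (made up)
    "24/07/2020, 23:59 - Bala: Where are you guys?",         # new message
]
results = [starts_with_timestamp(s) for s in samples]
print(results)
```

Only the lines that open with a date and time are flagged as new messages; anything else is treated as part of the previous message.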

Now that I have identified the lines that start new messages with Date and Time components, the next part to detect is the {Author} token. The author values depend on how you have saved the contacts in your phone. If a group member is not saved in your contacts, the author value will be their mobile number. With these constraints in mind, here are the RegEx patterns that detect the {Author} token in a line of text.

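def isauthor(l):
    pattern = [
        r'([\w]+):',                      # first name only
        r'([\w]+[\s]+[\w]+):',            # first name + last name
        r'([\w]+[\s]+[\w]+[\s]+[\w]+):',  # three-word name
        r'([+]\d{2} \d{5} \d{5}):'        # mobile number of an unsaved contact
    ]
    # re.match only matches at the start of the string, so every alternative is anchored there
    patterns = '^' + '|'.join(pattern)
    result = re.match(patterns, l)
    if result:
        return True
    return False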
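To see which kinds of text these patterns accept, here is a small self-contained sketch. The names AUTHOR_PATTERNS and has_author, the phone number and the O'Brien example are made up for illustration.

```python
import re

# Same four alternatives as in isauthor() above, as raw strings.
AUTHOR_PATTERNS = [
    r'([\w]+):',                      # first name only
    r'([\w]+[\s]+[\w]+):',            # first name + last name
    r'([\w]+[\s]+[\w]+[\s]+[\w]+):',  # three-word name
    r'([+]\d{2} \d{5} \d{5}):',       # unsaved contact shown as a number
]
AUTHOR_RE = re.compile('|'.join(AUTHOR_PATTERNS))

def has_author(text):
    """Return True when the text opens with 'Name:' in one of the shapes above."""
    return AUTHOR_RE.match(text) is not None

examples = [
    "Siddharth: Good Morning friends!",
    "Emma Watson: Akila Ma'am",
    "+91 98765 43210: hello",  # made-up number
    "Adithya left",            # system message, no author
    "O'Brien: hi",             # made-up name with an apostrophe
]
flags = [has_author(e) for e in examples]
print(flags)
```

The last example shows a limitation worth knowing: a saved name that contains characters outside \w (such as an apostrophe) or has more than three words is not recognised, so its messages would end up with a null author.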

Now that the Date, Time and Author tokens are detected, the remaining portion of the string is the Message token.

Now it is time to split each line on the separator tokens, such as commas (,), hyphens (-), colons (:) and spaces ( ), so that the required tokens can be extracted and stored in a DataFrame.

def DataPoint(line):
    # Split only on the first ' - ' so a message that contains ' - ' stays intact
    SplitLine = line.split(' - ', 1)
    DT = SplitLine[0]
    DateTime = DT.split(', ')
    Date = DateTime[0]
    Time = DateTime[1]
    Message = ' '.join(SplitLine[1:])

    if isauthor(Message):
        # Split only on the first ': ' so a message that contains ': ' stays intact
        authormes = Message.split(': ', 1)
        Author = authormes[0]
        Message = ' '.join(authormes[1:])
    else:
        Author = None
    return Date, Time, Author, Message
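As an alternative to splitting step by step, the four tokens can also be captured with one regular expression that uses named groups. This is only a sketch of that idea, not the approach used in the notebook: LINE_RE and parse_line are hypothetical names, and the pattern is looser than isauthor() because it treats any text before the first ": " as the author.

```python
import re

# One pattern for the whole line; the author part is optional so that
# system messages (no "Name: " prefix) come back with author None.
LINE_RE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{2,4}), '
    r'(?P<time>\d{2}:\d{2}) - '
    r'(?:(?P<author>[^:]+): )?'
    r'(?P<message>.*)$'
)

def parse_line(line):
    """Return (date, time, author, message), or None for a continuation line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group('date'), m.group('time'), m.group('author'), m.group('message')

print(parse_line("01/02/2019, 09:53 - Emma Watson: don't be late!"))
print(parse_line("01/02/2019, 10:00 - Bala left"))   # made-up system message
print(parse_line("Who is the teacher?"))             # continuation line
```

The trade-off is that a system message which happens to contain ": " would be misread as having an author, so the stricter isauthor() check is the safer choice when that matters.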

The last step is parsing the entire file line by line while handling multiline texts. The code below checks whether a line starts with a date; if it does not, the line is treated as a continuation of a multiline text. The tokens are extracted using the functions defined above and stored in a list.

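parsedData = []
FilePath = 'WhatsApp Chat.txt'
# The export contains emojis, so read it explicitly as UTF-8
with open(FilePath, encoding='utf-8') as fp:

    messageBuffer = []
    Date, Time, Author = None, None, None

    while True:
        line = fp.readline()
        if not line:
            break
        line = line.strip()
        if date(line):
            # A new message begins: store the previous one first
            if len(messageBuffer) > 0:
                parsedData.append([Date, Time, Author, ' '.join(messageBuffer)])
            messageBuffer.clear()
            Date, Time, Author, Message = DataPoint(line)
            messageBuffer.append(Message)
        else:
            # Continuation of a multiline message
            messageBuffer.append(line)

    # Store the final message, which no later line would otherwise flush
    if len(messageBuffer) > 0:
        parsedData.append([Date, Time, Author, ' '.join(messageBuffer)])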
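The buffering idea in the loop above can be tried on a tiny in-memory sample without touching a real export. The sketch below is a simplified stand-in: group_messages and NEW_MESSAGE are hypothetical names, the date check is a shortened version of the pattern used earlier, and the three sample lines are made up.

```python
import io
import re

# Simplified "does this line start a new message?" check.
NEW_MESSAGE = re.compile(r'^\d{2}/\d{2}/\d{2,4}, \d{2}:\d{2} -')

def group_messages(lines):
    """Merge continuation lines into the message that precedes them."""
    messages, buffer = [], []
    for raw in lines:
        line = raw.strip()
        if NEW_MESSAGE.match(line):
            if buffer:                      # store the previous message
                messages.append(' '.join(buffer))
            buffer = [line]
        elif buffer:                        # continuation of the current message
            buffer.append(line)
    if buffer:                              # do not forget the last message
        messages.append(' '.join(buffer))
    return messages

sample = io.StringIO(
    "01/02/2019, 06:01 - Siddharth: Good Morning friends!\n"
    "01/02/2019, 09:53 - Adithya: Do we have class?\n"
    "Who is the teacher?\n"
)
grouped = group_messages(sample)
print(len(grouped))
print(grouped[1])
```

The three input lines come out as two messages, with the continuation line joined onto the second one. The final flush after the loop is what keeps the last message of the file from being lost.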

Now it’s time to create a DataFrame using the Pandas module. The list parsedData consists of all 40,000 messages from the exported WhatsApp file, parsed and ready to be stored in the Pandas DataFrame.

import pandas as pd

df = pd.DataFrame(parsedData, columns=['Date', 'Time', 'Author', 'Message'])

Dropping Messages With No Authors

When exporting the chat, WhatsApp also exports system messages about security changes, members leaving or joining, changes to the group name and so on. These messages are exported with no author (null value). It is necessary to drop all the null-author data points before proceeding with the analysis.

NoneValues = df[df['Author'].isnull()]
NoneValues
df = df.drop(NoneValues.index)

This drops all rows of the DataFrame that contain messages with null authors.
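The same clean-up can also be written in one step with dropna. The sketch below uses a made-up three-row DataFrame named demo (an assumption for illustration) rather than the real chat data.

```python
import pandas as pd

# Tiny made-up sample: the middle row stands in for a system message with no author.
demo = pd.DataFrame(
    [
        ['01/02/2019', '06:01', 'Siddharth', 'Good Morning friends!'],
        ['01/02/2019', '06:05', None, 'Bala left'],
        ['01/02/2019', '09:53', 'Adithya', 'Do we have class?'],
    ],
    columns=['Date', 'Time', 'Author', 'Message'],
)

# Keep only the rows where Author is present.
cleaned = demo.dropna(subset=['Author'])
print(cleaned['Author'].tolist())
```

Restricting dropna to the Author column matters: without subset, a row with a missing value in any other column would be dropped as well.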

Analysing Number of Messages Sent by the Group Members

Toppers = df['Author'].value_counts()
Toppers.plot(kind='bar')

Emma Watson certainly is the most active person in this group. 😮

Analysing the Number of Media Messages Sent by the Group Members

Although media is not included when exporting the chat from WhatsApp, each media item still appears as a message with the text <Media omitted>. With this information, we can analyse the number of media messages sent by the group members.

MediaValues = df[df['Message'] == '<Media omitted>']
MediaValues
MediaTopper = MediaValues['Author'].value_counts()
MediaTopper.plot(kind='bar', color='m')

Emma Watson (aka Emmanuel) tops all the charts. 😆

Top 5 Frequently used Emojis by the Group Members

It’s time to use the Emoji module to see the top 5 most frequently used emojis in the group. This module helps us pick out the emojis in each text message so that we can append them to a list.

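import emoji

emojis = []
for i in df['Message']:
    my_str = str(i)
    for j in my_str:
        # UNICODE_EMOJI is the lookup table in the emoji release used here;
        # emoji 2.x replaced it with emoji.EMOJI_DATA
        if j in emoji.UNICODE_EMOJI:
            emojis.append(j)
emo = pd.Series(emojis)
TopEmoji = emo.value_counts().head(5)
TopEmoji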

Total Message Count — This Year vs Last Year

There is a big difference between last year (2019) and this year (2020). In 2019, we met each other every day at college and there were no unnecessary chats in this WhatsApp group. But 2020 is very different: we are in lockdown because of the coronavirus pandemic, and this WA group is the best medium for all of us to stay in touch like we used to.

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Extracting Year from the Date
YearData = df['Date'].dt.year
TopYear = YearData.value_counts()
TopYear.plot(kind='bar', color='purple', width=0.09)

The above graph shows that we used this WA group 3 times more in 2020 than in the year 2019.

We are still in the 7th month of the year 2020. 😮
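A note on dayfirst=True in the code above: the export writes dates as DD/MM/YYYY, and a date such as 01/02/2019 is ambiguous unless the parser is told which part comes first. The sketch below illustrates the difference with the standard library's datetime; the variable names are mine.

```python
from datetime import datetime

raw = '01/02/2019'  # first date in the exported file

day_first = datetime.strptime(raw, '%d/%m/%Y')    # read as 1 February 2019
month_first = datetime.strptime(raw, '%m/%d/%Y')  # read as 2 January 2019

print(day_first.date(), month_first.date())
```

Reading the dates month-first would quietly move messages into the wrong month, which would distort the monthly chart in the next section.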

Total Message Count — Last 18 Months

The exported file contains the last 18 months of chats, from Feb 2019 to July 2020.

# Extracting Month and Year from Date
df['Month_year'] = df['Date'].dt.to_period('M')
TopMonth = df['Month_year'].value_counts()
TopMonth = TopMonth.sort_index()
TopMonth.plot(kind='bar', color='salmon')

The sudden increase in message count from March 2020 is clearly due to the lockdown.
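The monthly count works in two steps: value_counts tallies the messages per month but orders them by count, and sort_index puts the months back in chronological order. A rough standard-library equivalent, using a handful of made-up dates, might look like this:

```python
from collections import Counter
from datetime import datetime

# Made-up dates in the export's DD/MM/YYYY format.
dates = ['03/03/2020', '01/02/2019', '24/07/2020', '15/02/2019', '25/07/2020']

# Step 1: count messages per 'YYYY-MM' bucket.
per_month = Counter(
    datetime.strptime(d, '%d/%m/%Y').strftime('%Y-%m') for d in dates
)

# Step 2: sort by the month key so the bars appear in time order.
ordered = sorted(per_month.items())
print(ordered)
```

Because the keys are written year first, plain string sorting already gives chronological order.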

Top 25 Active Days of the WhatsApp Group

Although the group has existed since 2017, all of the top 25 active days fall in 2020.

TopDate = df['Date'].value_counts().head(25)
TopDate.plot(kind='bar', color='firebrick')

Looks like we are using this WhatsApp group too much in this lockdown! But, why not? 😛

Active hours of the WhatsApp Group

df['Hour'] = df['Time'].apply(lambda a : a.split(':')[0])
TopHours = df['Hour'].value_counts()
TopHours = TopHours.sort_index()
TopHours.plot(kind='bar', color='orange')

We are open 24/7. 🙌

Finding the Letter and Word Count from Each Message

Going deeper into the analysis, I extracted the number of letters and words used by each group member. We won’t need the media messages for this analysis, so I dropped them from the DataFrame.

df = df.drop(MediaValues.index)
df['Letters'] = df['Message'].apply(lambda s : len(s))
df['WordCount'] = df['Message'].apply(lambda s : len(s.split(" ")))
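One detail of the word count is worth knowing: split(" ") cuts on every single space, so consecutive spaces produce empty strings that are counted as words, while split() with no argument cuts on any run of whitespace. The message below is a made-up example with a double space.

```python
message = "Where  are you guys?"  # note the two spaces after "Where"

by_single_space = message.split(" ")  # an empty string appears between the two spaces
by_whitespace = message.split()       # runs of whitespace count as one separator

print(len(by_single_space), len(by_whitespace))
```

For messages typed with single spaces the two give the same count, so the numbers above are unaffected in the common case; split() is simply the more forgiving choice when spacing is irregular.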

Analysing Which Group Member Has Highest Letter and Word Count

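# Sum only the two numeric columns; Date, Time and Message cannot be summed meaningfully
GroupedData = df.groupby(['Author'])[['Letters', 'WordCount']].sum()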
LetterGroupedData = GroupedData.sort_values(by=['Letters'])
WordGroupedData = GroupedData.sort_values(by=['WordCount'])
LetterGroupedData['Letters'].plot(kind='bar', color='hotpink')
WordGroupedData['WordCount'].plot(kind='bar', color='teal')

Thank you for reading and getting to know my WhatsApp group in depth.

Head to my GitHub Repo to view the Jupyter Notebook: https://github.com/CibhiBaskar/WhatsApp-Chat-Analysis

Connect with me on LinkedIn: www.linkedin.com/in/cibhi

Note: All the Data Visualization code lines are just for understanding. View my Jupyter Notebook to see the full code.
