WhatsApp group chat analysis with Python

Luis Rafael Arce
MCD-UNISON
Nov 2, 2020 · 8 min read

In the world of social media, WhatsApp group chats are one of the most popular ways to stay in contact with family, friends and coworkers. These groups contain a lot of information and data that can be analyzed in a fun and interesting way; therefore, inspired by Saiteja Kura’s “WhatsApp Group Chat Analysis using Python and Plotly”, I decided to analyze one of my close friends’ group chats to see what interesting results I could find.

WhatsApp conversation data.

The first step of this process is to obtain the conversation in TXT format. The easiest way to export an entire chat history from your cell phone, not including videos or photos, is to use the built-in “Export Chat” feature, following these steps.

1. Open an individual or a group chat.
2. Tap the Menu button (the three dots in the top-right corner).
3. Tap the More button.
4. Select Export Chat.
5. Tap Without Media from the options that are given.
6. Select an option to share the TXT file.

Importing libraries.

The following analysis uses a series of tools that make it easy to generate charts and graphics. The imported libraries were:

import re
import regex
import pandas as pd
import numpy as np
import emoji
import plotly.express as px
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
%matplotlib inline

Creating a DataFrame from the TXT file.

Regular expressions are fundamental for classifying the information in a chat; hence, we will use a few helper functions based on them.
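
A minimal sketch of those helpers follows. The exact export format is an assumption here: Android exports usually start each message with something like “2/11/20 21:30 - Author: text”, so adjust the date pattern to your own locale.

# Helper functions to parse each line of the exported TXT file.
# The date pattern below is an assumption; tweak it to match your export.

def starts_with_date(line):
    """Return True if the line starts a new message (date and time prefix)."""
    pattern = r'^\d{1,2}/\d{1,2}/\d{2,4},? \d{1,2}:\d{2}'
    return re.match(pattern, line) is not None

def has_author(text):
    """Return True if the part after the date contains an 'Author: ' prefix."""
    return re.match(r'^[^:]+:\s', text) is not None

def split_line(line):
    """Split a raw line into (date_time, author, message)."""
    date_time, text = line.split(' - ', 1)
    if has_author(text):
        author, message = text.split(': ', 1)
    else:
        author, message = None, text  # system notifications have no author
    return date_time, author, message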

With the following code, plus the support of the previous functions, we will read the TXT file of the conversation and build each of the initial columns of the DataFrame we will work with, which we will call “chat”.
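
A minimal sketch of that step, using the helper functions above; the file name “chat_history.txt” is only a placeholder for your own export.

# Build the initial "chat" DataFrame from the exported file.
parsed = []
with open('chat_history.txt', encoding='utf-8') as f:
    for raw in f:
        line = raw.strip()
        if not line:
            continue
        if starts_with_date(line):
            date_time, author, message = split_line(line)
            parsed.append([date_time, author, message])
        elif parsed:
            # messages spanning several lines are appended to the previous row
            parsed[-1][2] += ' ' + line

chat = pd.DataFrame(parsed, columns=['DateTime', 'Author', 'Message'])
chat = chat.dropna(subset=['Author']).reset_index(drop=True)  # drop system notifications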

Exploring “chat” DataFrame.

It is time to get to know the new DataFrame. With the help of pandas’ head() method, we can see its columns and a few records.

chat.head()

Using an alias for each author.

In some cases, when doing this type of analysis it is better to be careful with the identity of the group members. For that, it is possible to assign an alias to each participant, both in the Author column and in every message where they are mentioned.

ducktales_group = list(chat.Author.unique())
ducktales_group
aliases = ['Pluto', 'Donald', 'Mickey', 'Goofy']
chat['Author'].replace(ducktales_group, aliases, inplace=True)
chat.head()

Using the aliases in all messages that mention a member:

for (name, alias) in zip(ducktales_group, aliases):
    chat.Message = chat.Message.str.replace(name, alias)

Making adjustments and adding new columns.

Using another pandas method called info(), we can get a lot of information about the DataFrame’s columns, such as their dtype, the number of NaN values and the count of records per column. This way we found that the DateTime column is stored as a string, so we will transform it into a proper date and time format to be able to analyze these aspects.

chat["DateTime"] = pd.to_datetime(chat["DateTime"])chat.info()

From the DateTime column, we will derive 4 new columns: weekday, month_sent, date and hour.

#new column weekday
chat['weekday'] = chat['DateTime'].apply(lambda x: x.day_name())
# new column month_sent
chat['month_sent'] = chat['DateTime'].apply(lambda x: x.month_name())
#column date
chat['date'] = [d.date() for d in chat['DateTime']]
#column hour
chat['hour'] = [d.time().hour for d in chat['DateTime']]

From the Message column, we can derive 4 new columns: emoji, urlcount, Letter_Count and Word_Count.

#column urlcount
URLPATTERN = r'(https?://\S+)'
chat['urlcount'] = chat.Message.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
#column Letter_Count
chat['Letter_Count'] = chat['Message'].apply(lambda s : len(s))
#column Word_Count
chat['Word_Count'] = chat['Message'].apply(lambda s : len(s.split(' ')))

Function to create the emoji column.
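
A minimal sketch of these helpers, using the regex library’s grapheme matching together with the emoji package; in emoji >= 2.0 the lookup table is EMOJI_DATA, while older releases expose UNICODE_EMOJI. The remove_emoji function is reused later when preparing the word cloud.

# Emoji helpers: extract_emojis builds the new emoji column.
EMOJI_SET = set(emoji.EMOJI_DATA)  # for emoji < 2.0, build this set from emoji.UNICODE_EMOJI instead

def extract_emojis(text):
    """Return a list with every emoji found in a message."""
    return [g for g in regex.findall(r'\X', text)
            if any(ch in EMOJI_SET for ch in g)]

def remove_emoji(text):
    """Return the message with all emojis stripped out."""
    return ''.join(g for g in regex.findall(r'\X', text)
                   if not any(ch in EMOJI_SET for ch in g))

# new column emoji
chat['emoji'] = chat['Message'].apply(extract_emojis)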

Finally, we have a DataFrame ready to analyze our group chat.

chat.tail()

Analyzing Data

This chart shows the number of messages per day

date_grouped = chat.groupby('date')['Message'].count().plot(kind='line', figsize=(20,10), color='#A26360')

In a chart that shows the number of messages per day by author, we can see that usually, when one of them starts a conversation, there is a response from the rest of the members.
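
One possible way to build that chart is to count messages per date and author and plot one line per member; a sketch, reusing the columns created earlier:

# messages per day, one line per author
daily_by_author = (chat.groupby(['date', 'Author'])['Message']
                       .count()
                       .unstack(fill_value=0))
daily_by_author.plot(kind='line', figsize=(20, 10))
plt.ylabel('Messages')
plt.show()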

Fridays are the favorite days to chat with friends
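
That observation can be checked by counting messages per day of the week, for example:

# messages per day of the week
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts = chat['weekday'].value_counts().reindex(weekday_order)
weekday_counts.plot(kind='bar', figsize=(12, 6), color='#BDD1C5')
plt.ylabel('Messages')
plt.show()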

What time of day is it most common to send messages in this group?

hour_grouped_msg = (chat.set_index('hour')['Message']
                        .groupby(level=0)
                        .value_counts()
                        .groupby(level=0)
                        .sum()
                        .reset_index(name='count'))
fig = px.bar(hour_grouped_msg, x='hour', y='count',
             labels={'hour': '24 Hour Period'},
             height=400)
fig.update_traces(marker_color='#EDCC8B', marker_line_color='#D4A29C',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Total Messages by Hour of the Day')
fig.show()

If we want to know which day of the week of each month had the greatest number of messages sent, we need to:

1. Group the data by month and day of the week, counting the messages sent.
2. Build a pivot_table from the DataFrame obtained in the previous step, with the days of the week as columns, the months of the year as rows and the message count as the values.
3. Using Plotly, draw a heatmap with the imshow function.

Let’s code…

grouped_by_month_and_day = chat.groupby(['month_sent', 'weekday'])['Message'].value_counts().reset_index(name='count')
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# aggfunc='sum' gives the total number of messages per month/weekday cell
pt = grouped_by_month_and_day.pivot_table(index='month_sent', columns='weekday', values='count', aggfunc='sum').reindex(index=months, columns=days)
fig = px.imshow(pt,
                labels=dict(x="Day of Week", y="Months", color="Count"),
                x=days,
                y=months
                )
fig.update_layout(
    width=700, height=700)
fig.show()

Using some pandas methods, it is possible to know that the DataFrame has 9679 rows, 3764 of which are multimedia messages; the average number of words per message is 3.23, the average number of letters per message is 20.92 and the average number of daily messages is 23.67.

total_messages = chat.shape[0]
media_messages = chat[chat['Message'] == '<Multimedia omitido>'].shape[0]
average_message_words = chat['Word_Count'].mean()
average_message_letters = chat['Letter_Count'].mean()
average_message_day = chat.groupby('date')['Message'].count().mean()
print('Total Messages ',total_messages)
print('Media Message', media_messages)
print('Average Words by Messages', round(average_message_words, 2))
print('Average Letters by Messages', round(average_message_letters, 2))
print('Average Message Per Day', round(average_message_day, 2))

We can learn more about the behavior of the authors by analyzing data such as the number of messages sent per author:

qty_message_author = chat['Author'].value_counts()
qty_message_author.plot(kind='barh',figsize=(20,10), color=['#D4A29C', '#E8B298', '#EDCC8B', '#BDD1C5', '#9DAAA2'])
qty_message_author

Average number of messages sent daily per author

Number of multimedia messages sent by author

Number of words sent per message from each author
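
The statistics behind those three charts can be computed with groupings like the following sketch; the charts themselves follow the same bar-plot pattern used above.

# per-author statistics
media_mask = chat['Message'] == '<Multimedia omitido>'

# average number of messages sent daily per author
daily_avg_by_author = (chat.groupby(['Author', 'date'])['Message']
                           .count()
                           .groupby('Author')
                           .mean())

# number of multimedia messages sent by author
media_by_author = chat[media_mask]['Author'].value_counts()

# average number of words per message from each author (text messages only)
words_by_author = chat[~media_mask].groupby('Author')['Word_Count'].mean()

print(daily_avg_by_author.round(2))
print(media_by_author)
print(words_by_author.round(2))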

It’s possible to know the most used words in the chat by following these steps (a sketch of the whole pipeline comes after the list):

1. Use the nltk stopwords to remove common words from our data.
2. Generate a new DataFrame by copying the chat DataFrame, keeping only the Author and Message columns.
3. Separate each word of each message, making a row for each of them.
4. Use the "remove_emoji" function, so emojis are not counted as common words.
5. Remove empty or NaN rows.
6. Unify every laughing variant into a single "jaja".
7. Group the most common words and count their repetitions.
8. Using Plotly, make a nice chart with the top 10 common words.
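
A sketch that follows those steps; the stopword language is an assumption (this chat is in Spanish), and a plain split is used here instead of nltk’s word_tokenize to keep it short.

# most common words in the chat
stop_words = set(stopwords.words('spanish'))  # assumption: Spanish chat; change to your language

# copy only Author and Message, skipping media placeholders
words_df = chat[chat['Message'] != '<Multimedia omitido>'][['Author', 'Message']].copy()
words_df['Message'] = words_df['Message'].apply(remove_emoji)   # do not count emojis as words
words_df['word'] = words_df['Message'].str.lower().str.split()  # one list of words per message
words_df = words_df.explode('word').dropna(subset=['word'])     # one row per word, drop empty rows
# unify every laughing variant ("jajaja", "ajaja", ...) into "jaja"
words_df['word'] = words_df['word'].str.replace(r'^(a|j)?(ja)+(a|j)?$', 'jaja', regex=True)
words_df = words_df[~words_df['word'].isin(stop_words)]         # drop stopwords

# count repetitions and chart the top 10
top_words = words_df['word'].value_counts().head(10).reset_index()
top_words.columns = ['word', 'count']
fig = px.bar(top_words, x='word', y='count', title='Top 10 most common words')
fig.show()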

What about most common words by author?
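
Reusing words_df from the previous sketch, the per-author version only needs one extra grouping:

# top 10 words per member
top_by_author = (words_df.groupby('Author')['word']
                         .value_counts()
                         .groupby('Author')
                         .head(10))
print(top_by_author)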

WordCloud.

It’s time to create a word cloud with all the words in “chat”; for this we will use the WordCloud library.

We will use a couple of new functions: one to plot our chart and one to eliminate links to web sites that could be found in the messages.

# function to display wordcloud
def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(40, 30))
    # Display image
    plt.imshow(wordcloud)
    # No axis details
    plt.axis("off");

# function to remove urls from text
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

Preparing text to plot WordCloud

chat_word_cloud = chat[['Message']].copy()
chat_word_cloud['Message'] = chat_word_cloud['Message'].apply(remove_emoji)
chat_word_cloud['Message'] = chat_word_cloud['Message'].apply(remove_urls)
chat_word_cloud['Message'] = chat_word_cloud['Message'].replace('nan', np.nan)
chat_word_cloud['Message'] = chat_word_cloud['Message'].replace('', np.nan)
chat_word_cloud['Message'] = chat_word_cloud.Message.str.replace(r"(a|j)?(ja)+(a|j)?", "jaja", regex=True)
text = " ".join(review for review in chat_word_cloud.Message.dropna())
wordcloud = WordCloud(width=3000, height=2000, random_state=1,
                      background_color='black', colormap='Set2', collocations=False,
                      stopwords=stop_words).generate(text)  # stop_words: the nltk stopword set defined earlier
# Plot
plot_cloud(wordcloud)
# Plot
plot_cloud(wordcloud)

Emojis.

What data can we get from emojis in group chat?

A sum of all different used emojis.

total_emojis_list = list(set([a for b in chat.emoji for a in b]))
total_emojis = len(total_emojis_list)
print('Sum of all used Emojis', total_emojis)

A new DataFrame shows the top ten emojis ordered from highest to lowest based on the number of repetitions.

total_emojis_list = list([a for b in chat.emoji for a in b])
emoji_dict = dict(Counter(total_emojis_list))
emoji_dict = sorted(emoji_dict.items(), key=lambda x: x[1], reverse=True)
emoji_df = pd.DataFrame(emoji_dict, columns=['emoji', 'count'])
emoji_df.head(10)

A Plotly TreeMap chart shows variation in the amounts of use of each emoji.

fig = px.treemap(emoji_df, path=['emoji'],
                 values=emoji_df['count'].tolist(),
                 )
fig.show()

What if we analyze the number of emojis sent by each author, with their respective Plotly TreeMap?
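
One possible loop for those four charts, reusing the aliases list and the emoji column; a sketch, not necessarily the exact code behind the figures below.

# one treemap per member
for author in aliases:
    author_emojis = [e for row in chat[chat['Author'] == author]['emoji'] for e in row]
    author_df = pd.DataFrame(Counter(author_emojis).most_common(), columns=['emoji', 'count'])
    fig = px.treemap(author_df, path=['emoji'], values='count',
                     title=f'Emojis Distribution by {author}')
    fig.show()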

Emojis Distribution by Pluto

Emojis Distribution by Donald

Emojis Distribution by Mickey

Emojis Distribution by Goofy

Conclusions.

Doing this analysis was fun and a great learning experience for me. It is curious to realize that you can obtain a lot of interesting information from the ordinary things of daily life, like, in this case, a group conversation from a social network.

Luis Rafael Arce
MCD-UNISON

Industrial Engineer, full-stack web developer, master’s student in data science