WhatsApp Chats Analysis with Python

Abdulraqib Omotosho
8 min read · Apr 10, 2023
Photo by Alexander Shatov on Unsplash

WhatsApp chats have become a staple in our uni life!

📱🎓 Ever wondered what insights are hiding in those messages? With Python and some data analysis, you can uncover the most active chatters, hot topics, popular discussions, peak activity times, and more! In this article, I’ll show how to analyze WhatsApp chats and unlock valuable information from the conversation history.

Getting the data

I used my university’s class WhatsApp group as a case study to provide interesting insights into how my classmates interact and communicate with each other outside of the classroom. To export the chat history (excluding photos and videos), I followed the steps below.

  1. Open the WhatsApp chat that you want to export.
  2. Tap on the three dots on the top right corner of the chat screen.
  3. Select “More” and then “Export Chat” from the drop-down menu.
  4. Choose “Without Media” to exclude media files such as photos and videos from the export.
  5. Choose the preferred method of sharing the exported chat file, such as sending it to your email or even just saving it to your device.
  6. Once the file is saved, you can access it and use it for further analysis using Python.
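The exported file is plain text, one timestamped line per message. The exact layout depends on your phone’s locale; the hypothetical sample below shows the Android format (dd/mm/yyyy, 24-hour time) that the parsing code later in this article assumes, along with the regular expression that picks it apart:

```python
import re

# Hypothetical sample lines in the Android export format assumed by this
# article (dd/mm/yyyy, 24-hour time); iOS exports and other locales differ.
sample = [
    "16/11/2022, 20:15 - Raqib: Hello everyone!",
    "16/11/2022, 20:16 - Messages and calls are end-to-end encrypted.",
]

# Same timestamp pattern the parsing code below relies on
message_regex = re.compile(r'^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.+)$')

m = message_regex.match(sample[0])
print(m.group(1))  # 16/11/2022, 20:15
print(m.group(2))  # Raqib
print(m.group(3))  # Hello everyone!
```

The second sample line has no "sender: " part, so `message_regex` does not match it; that is how system notifications get separated from real messages.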

Importing libraries and packages

I’ll be utilizing various libraries and packages, such as Pandas, Matplotlib, and NLTK, which can help with tasks such as data manipulation, visualization, and natural language processing. Therefore, before diving into the analysis, it’s important to make sure these libraries are installed and imported correctly.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white')
import plotly.express as px

import re
from collections import Counter
import emoji
import collections
import datetime
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

import warnings
warnings.simplefilter(action='ignore', category=UserWarning)

Creating the dataframe from the text file

Once we have the text file containing the exported WhatsApp chat data, we can create a Pandas DataFrame in Python to organize the data in a structured format. This will allow us to easily manipulate and analyze the chat data using various Python libraries and techniques.

# Read chat data from text file
with open('WhatsApp Chat with CPE 300L.txt', 'r', encoding='utf-8') as f:
    chat_data = f.readlines()

# Define regular expressions to extract data
message_regex = re.compile(r'^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.+)$')
system_message_regex = re.compile(r'^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - (.+)$')
media_message_regex = re.compile(r'^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+) attached (\S+) \(.*\)$')

# Initialize data lists
dates = []
times = []
members = []
messages = []
message_types = []
message_lengths = []
reaction_counts = []
word_counts = []
hashtags = []
mentions = []
emojis = []

# Loop through chat data and extract required information.
# Order matters: the media pattern is checked before the catch-all
# system pattern, which would otherwise swallow every media line.
for line in chat_data:
    # Check if line contains a regular text message
    match = message_regex.match(line)
    if match:
        dates.append(match.group(1)[:10])
        times.append(match.group(1)[12:])
        member = emoji.demojize(match.group(2)).strip()
        members.append(member)
        messages.append(match.group(3))
        message_types.append('text')
        message_lengths.append(len(match.group(3)))
        reaction_counts.append(0)
        word_counts.append(len(match.group(3).split()))
        hashtags.append(re.findall(r'#(\w+)', match.group(3)))
        mentions.append(re.findall(r'@(\w+)', match.group(3)))
        emojis.append(re.findall(r'[\U0001F600-\U0001F650]', match.group(3)))
        continue

    # Check if line contains a media message
    match = media_message_regex.match(line)
    if match:
        dates.append(match.group(1)[:10])
        times.append(match.group(1)[12:])
        member = emoji.demojize(match.group(2)).strip()
        members.append(member)
        messages.append(match.group(3))
        message_types.append('media')
        message_lengths.append(0)
        reaction_counts.append(0)
        word_counts.append(0)
        hashtags.append([])
        mentions.append([])
        emojis.append([])
        continue

    # Otherwise, check if line is a system message (group changes, etc.)
    match = system_message_regex.match(line)
    if match:
        dates.append(match.group(1)[:10])
        times.append(match.group(1)[12:])
        members.append('System')
        messages.append(match.group(2))
        message_types.append('system')
        message_lengths.append(len(match.group(2)))
        reaction_counts.append(0)
        word_counts.append(len(match.group(2).split()))
        hashtags.append([])
        mentions.append([])
        emojis.append([])

# Create pandas dataframe from extracted data
df = pd.DataFrame({
    'date': dates,
    'time': times,
    'member': members,
    'message': messages,
    'message_type': message_types,
    'message_length': message_lengths,
    'reaction_count': reaction_counts,
    'word_count': word_counts,
    'hashtags': hashtags,
    'mentions': mentions,
    'emojis': emojis
})
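At this point the date and time columns are still plain strings. Later steps (weekday names, sorting by date) need real datetimes, so here is a hedged conversion sketch on a hypothetical mini-frame standing in for `df`, assuming the dd/mm/yyyy format produced above:

```python
import pandas as pd

# Hypothetical stand-in for the parsed chat dataframe
df = pd.DataFrame({'date': ['16/11/2022', '17/11/2022'],
                   'time': ['20:15', '09:05']})

# Parse dd/mm/yyyy explicitly so 01/02 is not read as January 2nd
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df['hour'] = df['time'].str.split(':').str[0].astype(int)

print(df['date'].dt.day_name().tolist())  # ['Wednesday', 'Thursday']
print(df['hour'].tolist())                # [20, 9]
```

The same two lines applied to the real `df` make the weekday and hour analyses further down work without string tricks.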

Further exploration

Dimensions of the data (number of rows, number of columns): the dataset has 24,318 rows and 11 columns.

With the pandas head() method, we can preview some records from the data.

The first few records from the dataset.

I’m going to drop the first few rows as they are irrelevant to the analysis.

df.drop([0, 1, 2, 3, 4], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

Next, I identify all the unique members or participants of the WhatsApp group chat using the pandas unique() method.

All the members of the group.

Some of the members’ names are incorrect, so I will fix them.

# Rename incorrect member names
replace_dict = {
    'System': 'System messages',
    'Queen Lizzy :princess:': 'Queen Lizzy',
    'Ahmed Abubakar:face_with_medical_mask:': 'Ahmed',
    'Aishort': 'Aisha',
    'ViNe:keycap_6::keycap_9:': 'ViNe 69',
    'Real John :soccer_ball:': 'John',
    '+234 816 440 8811': 'Abdulmalik',
    '+234 708 310 7624': 'Inioluwa',
    'Khaleesi:crown::princess:': 'Khaleesi',
    'Abdulrahman :glasses:': 'Abdulrahman',
    'Ayotunde Martins': 'Martins',
    'Augustine :soccer_ball:': 'Augustine',
    'Kenny Habeeb': 'Kenny',
    'Abdulraqib': 'Raqib',
    'Peace :face_with_medical_mask:': 'Peace',
    'Nelson Isralia': 'Nelson',
    'Dami': 'Dammy',
    'Mayowa Solomon': 'Mayowa',
    'Blurryface:smiling_face_with_sunglasses:': 'Blurryface',
    'Femi Clinton': 'Femi',
    '+234 816 892 3626': 'Abdulbasit',
    'Scholar Mukhtar': 'Mukhtar',
    '+234 701 897 1552': 'Maryam',
    'Nerry Kylen': 'Nerry'
}
df['member'] = df['member'].replace(replace_dict)

Now let's check them again.

All members of the group.

Quick Stats.

def split_count(text):
    # Collect every emoji character that appears in the text
    distinct = emoji.distinct_emoji_list(text)
    return [char for char in text if char in distinct]

total_messages = df.shape[0]
avg_message_length = df['message_length'].mean()
media_messages = df[df['message'] == '<Media omitted>'].shape[0]
df['emoji'] = df['message'].apply(split_count)
emojis = sum(df['emoji'].str.len())
URLPATTERN = r'(https?://\S+)'
df['urlcount'] = df.message.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
links = df['urlcount'].sum()

print("Stats")
print("Messages:", total_messages)
print('Average message length:', avg_message_length)
print("Media messages:", media_messages)
print("Emojis:", emojis)
print("Links:", links)

Messages sent per member, words per message, emojis per message, links per message, and other per-member stats.

# Calculate messages sent per member
messages_sent = df.groupby('member')['message'].count()

# Calculate words per message
df['words_per_message'] = df['message'].apply(lambda x: len(x.split()) if isinstance(x, str) else 0)
words_per_message = df.groupby('member')['words_per_message'].mean()

# Calculate emojis sent per message (the 'emojis' column holds lists)
df['emojis_per_message'] = df['emojis'].apply(lambda x: len(x) if isinstance(x, list) else 0)
emojis_per_message = df.groupby('member')['emojis_per_message'].mean()

# Calculate links sent per message
df['links_per_message'] = df['message'].apply(lambda x: len(re.findall(r'http\S+', x)) if isinstance(x, str) else 0)
links_per_message = df.groupby('member')['links_per_message'].mean()

# Combine stats into a single dataframe
stats_df = pd.concat([messages_sent, words_per_message, emojis_per_message, links_per_message], axis=1)
stats_df.columns = ['messages_sent', 'words_per_message', 'emojis_per_message', 'links_per_message']

# Sort by messages sent in descending order
stats_df = stats_df.sort_values('messages_sent', ascending=False)

# Print results
stats_df
Image by Author
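The same per-member table can also be built in a single pass with pandas named aggregation (available since pandas 0.25), instead of four separate groupbys plus a concat. A hedged sketch on a hypothetical mini-frame standing in for the chat data:

```python
import pandas as pd

# Hypothetical stand-in for the chat dataframe
df = pd.DataFrame({
    'member': ['A', 'A', 'B'],
    'message': ['hi there', 'ok', 'see http://x.com'],
    'words_per_message': [2, 1, 2],
    'links_per_message': [0, 0, 1],
})

# One groupby, one named-aggregation call per output column
stats_df = (df.groupby('member')
              .agg(messages_sent=('message', 'count'),
                   words_per_message=('words_per_message', 'mean'),
                   links_per_message=('links_per_message', 'mean'))
              .sort_values('messages_sent', ascending=False))

print(stats_df.loc['A', 'messages_sent'])  # 2
```

Named aggregation keeps the column names explicit in one place, which makes the stats table easier to extend later.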

Frequently used words.

Most commonly used words in the group.

# Filter out messages that contain media files
non_media = df[~df['message'].str.contains('<Media omitted>')]

# Extract all messages from the DataFrame and join them into a single string
all_messages = ' '.join(non_media['message'].astype(str).tolist())

# Convert the string into a list of words
all_words = all_messages.split()

# Count the frequency of each word using Python's Counter object
word_freq = Counter(all_words)

# Select the top 10 most frequent words and their counts
top_words = word_freq.most_common(10)

# Extract the words and counts into separate lists
words = [word[0] for word in top_words]
counts = [count[1] for count in top_words]

# Create a bar chart of the top 10 most frequent words
plt.bar(words, counts)
plt.title('Top 10 Most Frequent Words (Excluding Media)')
plt.xlabel('Words')
plt.ylabel('Counts')
plt.show()
Top 10 most frequently used words
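Raw counts like these are usually dominated by filler words ("the", "to", "a"). A hedged sketch of filtering a stop-list before counting, using a tiny inline set and hypothetical messages for illustration (nltk.corpus.stopwords.words('english') gives a fuller list after a one-time nltk.download('stopwords')):

```python
from collections import Counter

# Tiny illustrative stop-list; swap in NLTK's English stopwords in practice
stop_words = {'the', 'to', 'a', 'and', 'is', 'of', 'in', 'i', 'on'}

# Hypothetical messages standing in for the chat text
messages = ["the exam is on friday",
            "friday exam moved to monday",
            "a new exam date"]

# Keep only words that are not in the stop-list, then count
words = [w for msg in messages for w in msg.split() if w not in stop_words]
top = Counter(words).most_common(2)
print(top)  # [('exam', 3), ('friday', 2)]
```

With stopwords removed, the bar chart surfaces topic words ("exam", "friday") instead of grammar glue.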

Emojis.

In addition to words, I included emojis to gain insights into the most frequently used emojis and their corresponding meanings in the WhatsApp chat.

Frequency of emojis

total_emojis_list = [a for b in df.emojis for a in b]
emoji_dict = dict(collections.Counter(total_emojis_list))
emoji_dict = sorted(emoji_dict.items(), key=lambda x: x[1], reverse=True)

emoji_df = pd.DataFrame(emoji_dict, columns=['Emoji', 'Frequency'])
The laughing emoji is the most commonly used emoji

Distribution of emojis

# Define custom colors for the pie chart
colors = ['#ffd700', '#ff69b4', '#1e90ff', '#ff8c00', '#00ced1']

# Create a pie chart of the emoji frequencies with custom colors
fig = px.pie(emoji_df, values='Frequency', names='Emoji', title='Overall emoji distribution', color_discrete_sequence=colors)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(width=800, height=500, showlegend=True)

# Show the plot
fig.show();
Overall emoji distribution.

Word Cloud.

Creating a word cloud is a fun and informative way to visualize the most commonly used words in the chat, giving a better understanding of the overall tone and topics of conversation.

# Remove messages containing "Media omitted"
x = df[~df['message'].str.contains('Media omitted')]

# Concatenate all cleaned messages into a single string
all_messages = x['message'].dropna().str.cat(sep=' ')

# Generate word cloud
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=STOPWORDS,
                      min_font_size=10).generate(all_messages)

# Plot word cloud
plt.figure(figsize=(8,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Word cloud of messages

Most Active dates.

plt.figure(figsize=[10, 8])
sns.barplot(x='count', y='date', data=df['date'].value_counts().head(20).reset_index().rename(columns={'index':'date', 'date':'count'}), color='#F0D653')
plt.title('Top 20 days with the highest number of Messages')
plt.xlabel('Number of Messages')
plt.ylabel('Date')
plt.show()
The group was most active on 16/11/2022.
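The reset_index/rename dance in the plot above is sensitive to the pandas version (pandas 2.0 changed what value_counts().reset_index() returns). A hedged alternative that works the same across versions, shown on hypothetical dates standing in for df['date']:

```python
import pandas as pd

# Hypothetical dates standing in for df['date']
dates = pd.Series(['16/11/2022', '16/11/2022', '17/11/2022'])

# Name the index first, then name the values column on reset
counts = dates.value_counts().rename_axis('date').reset_index(name='count')
print(counts['count'].tolist())  # [2, 1]
```

The resulting frame always has columns ['date', 'count'], so the sns.barplot calls need no rename mapping at all.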

Most Active times of the day.

plt.figure(figsize=[10, 8])
sns.barplot(x='count', y='time', data=df['time'].value_counts().head(20).reset_index().rename(columns={'index':'time', 'time':'count'}), color='#F0D653')
plt.title('Most Active Times')
plt.xlabel('Number of messages')
plt.ylabel('Time')
plt.show();
The group is most active at night.

Most Active hours.

# Extract the hour from the time column
df['hour'] = df['time'].str.split(':', expand=True)[0]

# Plot the most active hours
plt.figure(figsize=[10, 8])
sns.barplot(x='count', y='hour', data=df['hour'].value_counts().head(20).reset_index().rename(columns={'index':'hour', 'hour':'count'}), color='#F0D653')
plt.title('Most Active hours')
plt.xlabel('Number of Messages')
plt.ylabel('Hour')
plt.show()
Hour 20 of the day is the most active.

Most Active days of the week.

# Convert the date strings to datetimes, then extract the weekday name
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df['weekday'] = df['date'].dt.day_name()

# Plot the most active days of the week
plt.figure(figsize=[10, 8])
sns.barplot(x='count', y='weekday', data=df['weekday'].value_counts().reset_index().rename(columns={'index':'weekday', 'weekday':'count'}), color='#F0D653')
plt.title('Most Active days of the week')
plt.xlabel('Number of Messages')
plt.ylabel('Weekday')
plt.show()
Friday is the most active day.

Most active and least active group participants.

# most active
df['member'].value_counts().sort_values(ascending=True).plot(kind='barh', figsize=(15, 15), color='#F0D653')
plt.title('The most active group members')
plt.ylabel('Participant')
plt.xlabel('No of Messages');

# least active
df['member'].value_counts().tail(20).sort_values(ascending=True).plot(kind='barh',figsize=(15,15), color='#F0D653')
plt.title('The least active group members')
plt.ylabel('Participant')
plt.xlabel('No of Messages');
Most active participants.
Least active participants

Messaging progression

Taking a closer look at how messaging has evolved and what it means for the daily interactions.

# Sum only the numeric column we plot; summing every column would choke
# on non-numeric columns like 'message'
date_df = df.groupby("date", as_index=False)["word_count"].sum()
fig = px.line(date_df, x="date", y="word_count", title='Messaging progression')
fig.update_xaxes(nticks=20)
fig.show()
Messaging evolution from March 2022 to November 2022.
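Daily totals like these are noisy; spikes on a few busy days can hide the overall trend. A hedged smoothing sketch using a 7-day rolling mean, on hypothetical daily word counts standing in for date_df:

```python
import pandas as pd

# Hypothetical daily word counts standing in for date_df
date_df = pd.DataFrame({
    'date': pd.date_range('2022-03-01', periods=10, freq='D'),
    'word_count': [10, 12, 8, 30, 5, 7, 9, 40, 11, 6],
})

# 7-day rolling mean; min_periods=1 keeps the first days instead of NaN
date_df['smooth'] = date_df['word_count'].rolling(window=7, min_periods=1).mean()
print(round(date_df['smooth'].iloc[0], 2))  # 10.0
```

Plotting the 'smooth' column alongside the raw counts (e.g. as a second px.line trace) makes the long-run trend much easier to read.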

Analyzing this WhatsApp group chat revealed how messaging in the group has evolved and what it says about our communication. We gained insights into communication patterns, from the most active chatters to popular topics and peak activity times, a fascinating glimpse into how my classmates interact. It is also a reminder of how much data analysis can reveal about our daily conversations, and of how those insights can make future communication more meaningful and effective.

Thanks for reading🤓. You can check out my GitHub repo for the full code, documentation, and other resources related to the project. If you found this article insightful, kindly drop a clap and follow me for more articles like this. Cheers!


Abdulraqib Omotosho

Passionate Data Enthusiast & Computer Engineering student. Skilled in data analysis, modeling, and programming. Sharing insights on Medium.