Exploring a WhatsApp chat group with Python.

Manuel Valenzuela
MCD-UNISON
Dec 12, 2020

Based on the article: “Whatsapp Group Chat Analysis using Python and Plotly” by Saiteja Kura

Whatsapp by Mark Knol is licensed under CC BY-NC-SA 2.0

Who sends the most messages? Which words and emojis do we use the most? This and more interesting information about a WhatsApp chat group will be revealed by following the simple steps shown in this article.

I am a newbie to Python, so this class assignment was an excellent introduction to data wrangling: I got to practice different techniques while stripping down a WhatsApp group chat to see what is going on in there.

Getting the data

First things first: we collect the data from WhatsApp. Open the chat group you want to analyze, tap the settings menu (the three dots in the upper right corner), select the "More" option, then "Export chat", and choose "Without media". That's it: this exports the conversation to a .txt file that will look something like this:

Example of a .txt file containing a WhatsApp conversation

We will process this file and load it into a data frame using the Pandas library.
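Before writing any parsing code, it helps to peek at the raw lines and confirm the format. A minimal sketch (the sample content and the `sample_chat.txt` filename are made up for illustration):

```python
# Write a tiny sample in the exported format (content is illustrative)
sample = (
    "23/6/2014 4:48 p. m. - Manuel A.: No quiere venir\n"
    "23/6/2014 4:50 p. m. - Ana: <Multimedia omitido>\n"
)
with open('sample_chat.txt', 'w', encoding='utf-8') as fp:
    fp.write(sample)

# Peek at the raw lines before parsing
with open('sample_chat.txt', encoding='utf-8') as fp:
    for line in fp:
        print(line.rstrip())
```

In a real run you would open your own exported file instead of writing a sample first.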

Extracting the information from the chat

The idea in this part is to read the file line by line and break each message into 4 tokens. Each line looks something like this (except when a message is long enough to span more than one line):

23/6/2014 4:48 p. m. — Manuel A.: No quiere venir

And we want to separate it into this form:

{Date} {Time} — {Author}: {Message}

So we need code that detects whether each line begins with a date; if it does, we extract the date, time, author, and message, split the line into these 4 tokens, and build a data frame from this information.

# libraries to be used
import pandas as pd
import numpy as np
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import emoji
import regex

# Checks whether a line begins with a date and time
def startsWithDateAndTime(s):
    patron = r'^[0-9]+/[0-9]+/[0-9]+\s[0-9]+:[0-9]+\s[a-zA-Z].\s[a-zA-Z].\s-'
    result = re.match(patron, s)
    return result is not None

# Try the different shapes a username can take
def FindAuthor(s):
    patterns = [
        # First name
        r'([\w]+):',
        # First name + last name
        r'([\w]+[\s]+[\w]+):',
        # First name + middle initial
        r'([\w]+[\s]+[\w]+).:',
        # First name + middle name + last name
        r'([\w]+[\s]+[\w]+[\s]+[\w]+):',
        # First name + middle name + paternal and maternal last names
        r'([\w]+[\s]+[\w]+[\s]+[\w]+[\s]+[\w]+):',
        # Phone number
        r'([+]\d{2} \d{3} \d{3} \d{4}):',
        # Name + emoji
        r'([\w]+)[\u263a-\U0001f999]+:',
    ]
    pattern = '^' + '|'.join(patterns)
    result = re.match(pattern, s)
    return result is not None

def getDataPoint(line):
    # Split the text: the date and time on one side,
    # the author and the message on the other
    splitLine = line.split(' - ')
    dateTime = splitLine[0]

    # Extract the date
    splitLine2 = dateTime.split()
    date = splitLine2[0]
    # Extract the time
    x = re.search(r'[0-9]+:[0-9]+\s[a-zA-Z].\sm.$', dateTime)
    time = x.group(0)

    # Extract the author (when present) and the message
    message = ' '.join(splitLine[1:])
    if FindAuthor(message):
        splitMessage = message.split(': ')
        author = splitMessage[0]
        message = ' '.join(splitMessage[1:])
    else:
        author = None
    return date, time, author, message
parsedData = []
# Name of the .txt file containing the chat
archivoChat = 'ChatPayasitos.txt'
# Open the file and call it fp
with open(archivoChat, encoding="utf-8") as fp:
    messageBuffer = []
    date = None
    time = None
    author = None

    # Loop to read the whole file
    while True:
        line = fp.readline()
        if not line:
            break
        line = line.strip()
        # Check whether the line starts with a date
        if startsWithDateAndTime(line):
            if len(messageBuffer) > 0:
                parsedData.append([date, time, author, ' '.join(messageBuffer)])
            messageBuffer.clear()
            date, time, author, message = getDataPoint(line)
            messageBuffer.append(message)
        else:
            # Continuation of a message that spans several lines
            messageBuffer.append(line)
    # Flush the last buffered message
    if len(messageBuffer) > 0:
        parsedData.append([date, time, author, ' '.join(messageBuffer)])

df = pd.DataFrame(parsedData, columns=['Date', 'Time', 'Author', 'Message'])
# Dates in the export are day/month/year
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

This gives us a data frame like this:

Data frame format

Now we will eliminate all the messages where the author is “None”

df = df.dropna()

Now we can list the participants of the chat group:

df.Author.unique()
Authors in the chat group

NOTE: WhatsApp's exported files may differ across regions and operating systems, so the regular expressions above may need adjusting for your own file.
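For example, some exports wrap the timestamp in brackets instead of using the plain "date - author:" layout. A hedged sketch of an alternative matcher for that variant (the exact bracketed format shown is an assumption; check what your own export produces):

```python
import re

# Matches bracketed timestamps like "[23/06/20, 4:48:12 p.m.]",
# a variant seen in some exports; the seconds and the a.m./p.m.
# punctuation are optional here.
BRACKETED = re.compile(
    r'^\[\d{1,2}/\d{1,2}/\d{2,4},? \d{1,2}:\d{2}(:\d{2})?\s?[ap]\.?\s?m\.?\]',
    re.IGNORECASE,
)

def starts_with_bracketed_date(s):
    # Returns True when the line opens with a bracketed timestamp
    return bool(BRACKETED.match(s))

print(starts_with_bracketed_date('[23/06/20, 4:48:12 p.m.] Ana: hola'))  # True
print(starts_with_bracketed_date('hola sin fecha'))                      # False
```

Swapping this predicate in for `startsWithDateAndTime` (and adapting `getDataPoint` accordingly) would let the same parsing loop handle that format.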

Analyzing the Data

Since we exported the chat without media, every media message is replaced by a placeholder ("<Media omitted>", or "<Multimedia omitido>" in a Spanish-language export), so we can count the total number of media messages shared in the group.

For finding the total emojis used we will use the emoji library. We will also create a separate column, emoji, holding only the emojis of each particular message.

For finding the total number of links shared we will write a regex pattern and use the re library in python to identify URLs in a given message. We will also create a separate column urlcount that consists of the count of URLs in a particular message.

def split_count(text):
    # Collect every emoji found in a message
    # (\X matches full grapheme clusters, so multi-codepoint emojis stay whole)
    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        # Note: emoji library versions >= 2.0 renamed UNICODE_EMOJI to EMOJI_DATA
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)
    return emoji_list

total_messages = df.shape[0]
media_messages = df[df['Message'] == '<Multimedia omitido>'].shape[0]
df["emoji"] = df["Message"].apply(split_count)
emojis = sum(df['emoji'].str.len())
URLPATTERN = r'(https?://\S+)'
df['urlcount'] = df.Message.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
links = np.sum(df.urlcount)
print("Group read from file", archivoChat)
print("Text messages", total_messages)
print("Media messages", media_messages)
print("Emojis", emojis)
print("Links", links)
Output

Now let's separate the media messages from the text messages into two different data frames:

# The Spanish-language export marks media as '<Multimedia omitido>'
media_messages_df = df[df['Message'] == '<Multimedia omitido>']
messages_df = df.drop(media_messages_df.index)

About the Authors

Now let's see what each author has been sending to the group. To do that we loop over the list of authors and compute per-author stats:

# Word count per message (needed for the words-per-message stat)
messages_df['Word_Count'] = messages_df['Message'].apply(lambda s: len(s.split()))

# Print the stats for every author
l = messages_df.Author.unique()
for i in range(len(l)):
    req_df = messages_df[messages_df["Author"] == l[i]]
    print(f'Stats of {l[i]} -')
    print('Messages Sent', req_df.shape[0])
    words_per_message = np.sum(req_df['Word_Count']) / req_df.shape[0]
    print('Words per message', words_per_message)
    media = media_messages_df[media_messages_df['Author'] == l[i]].shape[0]
    print('Media Messages Sent', media)
    emojis = sum(req_df['emoji'].str.len())
    print('Emojis Sent', emojis)
    links = sum(req_df["urlcount"])
    print('Links Sent', links)
    print()
Output

Let’s look at the emojis!

In the previous code we calculated the total number of emojis used; let's now find the number of unique emojis. To accomplish this, we combine all the lists in the emoji column and use the set data type to count the distinct ones.

total_emojis_list = list(set([a for b in messages_df.emoji for a in b]))
total_emojis = len(total_emojis_list)
print("Distinct emojis sent:", total_emojis)
# Distinct emojis sent: 289

Most used emojis in the group. To find out which emojis are used the most, we use the following code:

import collections

total_emojis_list = list([a for b in messages_df.emoji for a in b])
emoji_dict = dict(collections.Counter(total_emojis_list))
emoji_dict = sorted(emoji_dict.items(), key=lambda x: x[1], reverse=True)
emoji_df = pd.DataFrame(emoji_dict, columns=['emoji', 'count'])
emoji_df.head(20)
Output

Now we are going to use the Plotly library to plot a chart showing the emoji distribution by author:

import plotly.express as px

l = messages_df.Author.unique()
for i in range(len(l)):
    dummy_df = messages_df[messages_df['Author'] == l[i]]
    total_emojis_list = list([a for b in dummy_df.emoji for a in b])
    emoji_dict = dict(collections.Counter(total_emojis_list))
    emoji_dict = sorted(emoji_dict.items(), key=lambda x: x[1], reverse=True)
    print('Emoji Distribution for', l[i])
    author_emoji_df = pd.DataFrame(emoji_dict, columns=['emoji', 'count'])
    fig = px.pie(author_emoji_df, values='count', names='emoji')
    fig.update_traces(textposition='inside', textinfo='percent+label')
    fig.show()
Distribution Author 1
Distribution Author 2
Distribution Author 3
Distribution Author 4
Distribution Author 5

Word Cloud

A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance. Let's create a word cloud from the messages sent in this group.
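Under the hood, those word sizes are driven by plain frequency counts. A minimal sketch of the idea with collections.Counter (the sample text is made up):

```python
from collections import Counter

# A toy chat fragment; in the real analysis `text` is built from messages_df
text = "hola hola que tal hola que"

# Count word frequencies; these counts are what scale each word in the cloud
freq = Counter(text.split())
print(freq.most_common(2))  # [('hola', 3), ('que', 2)]
```

The WordCloud library does this counting internally (plus stop-word filtering and layout), which is why we only need to hand it the raw text.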

First, let's see how much text we have in the chat. Note that len(text) counts characters, not words:

text = " ".join(review for review in messages_df.Message)
print("There are {} characters in all the messages.".format(len(text)))

There are 657318 characters in all the messages.

Stop words. Stopwords are natural-language words that carry very little meaning, such as "and", "the", "a", and "an". For the word cloud we will use the Word Cloud library in Python, which ships with a predefined list of stop words; here we also pull the Spanish and English stop word lists from NLTK.

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
from wordcloud import WordCloud
from nltk.tokenize import word_tokenize

# Combine the Spanish and English stop word lists
stopwords = set(stopwords.words('spanish') + stopwords.words('english'))
# Chat-specific noise words
stopwords.update(["Omitido", "q", "http", "www", "jajaja", "si", "https", "com"])
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(40, 30))
    # Display image
    plt.imshow(wordcloud)
    # No axis details
    plt.axis("off")

# Generate word cloud
wordcloud = WordCloud(width = 3000, height = 2000, random_state=1, background_color='black', colormap='RdBu', collocations=False, stopwords = stopwords).generate(text)
#wordcloud = WordCloud(width = 3000, height = 2000, random_state=1, background_color='black', colormap='Pastel1', collocations=False, stopwords = stopwords).generate(text)
# Plot
plot_cloud(wordcloud)

More Stats

Messages through time. Now let's look at how many messages were sent each day.

# One counter per message, so the groupby sum gives messages per day
messages_df['MessageCount'] = 1
date_df = messages_df.groupby("Date")['MessageCount'].sum().reset_index()
fig = px.line(date_df, x="Date", y="MessageCount", title='Number of messages over time')
fig.update_xaxes(nticks=20)
fig.show()
Output

Message distribution by day. We will use radar charts to understand the day-wise distribution.

def dayofweek(i):
    # Weekday names, Monday through Sunday
    l = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    return l[i]

day_df = pd.DataFrame(messages_df["Message"])
day_df['day_of_date'] = messages_df['Date'].dt.weekday
day_df['day_of_date'] = day_df["day_of_date"].apply(dayofweek)
day_df["messagecount"] = 1
# Sum the per-message counter by weekday
day = day_df.groupby("day_of_date", as_index=False)["messagecount"].sum()

fig = px.line_polar(day, r='messagecount', theta='day_of_date', line_close=True)
fig.update_traces(fill='toself')
fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 4200]
        )),
    showlegend=False
)
fig.show()

Time of day with the most messages. At what time of day is the group most active?

messages_df['Time'].value_counts().head(10).plot.barh()
plt.xlabel('Messages')
plt.ylabel('Time')
Output


Manuel Valenzuela
MCD-UNISON

Academic Technician of the Computer Science program, University of Sonora.