WhatsApp Group Chat Analysis with Python

Published in MCD-UNISON, Oct 30, 2020

This analysis is based on the article by Saiteja Kura: Whatsapp Group Chat Analysis using Python and Plotly.

Lots of people use WhatsApp, and most WhatsApp users belong to at least one group chat (toxic or not). Some of us ask ourselves questions like “Why is this dude in the group if he never sends a message?” or think “I’m sure nobody will send a message today”.

The next steps follow Saiteja Kura’s work, with some modifications for the phone’s OS and for convenience. The detailed process can be found in the original article.

Getting the data and creating the data frame

First we export the group chat using the export option in the group settings on the phone. The chat is exported as a txt file, which we save in the data directory.

The saved data must be parsed before it can be stored in a data frame. The parsing function checks whether a line starts with a date, which indicates the start of a new message; if it does, the line is split into Date, Time, Author and Message and stored in a pandas DataFrame. In this case I needed to change some parts of the code because the txt file is different on iOS.
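As an illustration of the parsing step, here is a minimal sketch. It is not the original code: it assumes that an iOS export line looks like "[30/10/20, 21:15:02] Author: message" (the exact layout can vary with the phone's locale), and the names MESSAGE_RE, STARTS_WITH_DATE_RE and parse_chat are made up for this example.

```python
import re

# Assumed iOS-style line: "[30/10/20, 21:15:02] Author: message text"
MESSAGE_RE = re.compile(
    r"^\[(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}), "
    r"(?P<Time>\d{1,2}:\d{2}(?::\d{2})?)\] "
    r"(?P<Author>[^:]+): (?P<Message>.*)$"
)
STARTS_WITH_DATE_RE = re.compile(r"^\[\d{1,2}/\d{1,2}/\d{2,4}, ")


def parse_chat(lines):
    """Turn raw export lines into a list of Date/Time/Author/Message dicts."""
    rows = []
    for raw in lines:
        # Some exports prefix lines with an invisible left-to-right mark
        line = raw.strip().lstrip("\u200e")
        match = MESSAGE_RE.match(line)
        if match:
            rows.append(match.groupdict())
        elif STARTS_WITH_DATE_RE.match(line):
            continue  # dated line without "Author: ", e.g. a system notice
        elif rows and line:
            # No date at the start: continuation of the previous message
            rows[-1]["Message"] += "\n" + line
    return rows


sample = [
    "[30/10/20, 21:00:00] Aviso del sistema sin autor",
    "[30/10/20, 21:15:02] Ana: Hola a todos",
    "segunda linea del mismo mensaje",
    "[30/10/20, 21:16:40] Luis: https://example.com",
]
rows = parse_chat(sample)
print(rows)
```

A list of dicts like this can be handed to pandas.DataFrame(rows) to get the data frame used in the rest of the analysis.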

Anonymizing the data

To protect the privacy of the chat members we need to anonymize the names, replacing each of them with a LOTR character.

Here is some relevant data from the resulting data frame, the output of df.info and df.head. This small set consists of 914 records, which corresponds to roughly the same number of messages.
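One possible way to do the replacement is to build a mapping from each real name to an alias. This is only a sketch: the alias list, the sample names and the helper build_alias_map are assumptions for this example, not the original code.

```python
# Hypothetical alias pool; extend it if the group has more members
LOTR_NAMES = [
    "Frodo", "Sam", "Gandalf", "Aragorn", "Legolas",
    "Gimli", "Merry", "Pippin", "Boromir",
]


def build_alias_map(authors, aliases=LOTR_NAMES):
    """Assign one alias per distinct author, in order of first appearance."""
    mapping = {}
    for author in authors:
        if author not in mapping:
            if len(mapping) >= len(aliases):
                raise ValueError("not enough aliases for all authors")
            mapping[author] = aliases[len(mapping)]
    return mapping


authors = ["Ana", "Luis", "Ana", "Marta"]  # made-up names
alias_map = build_alias_map(authors)
anonymized = [alias_map[name] for name in authors]
print(anonymized)
```

With a pandas data frame, the same dictionary could be applied to the Author column through Series.map.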

Group Wise Stats

Let’s look at some stats from the data frame. First, we can count the media messages by counting the occurrences of “<Media omitted>”. Using the emoji library we can find the total number of emojis; by creating an emoji column we get the emojis of each particular message. Additionally, we create a column that counts the URLs in each message using the re library.

The group has 914 messages in total; the number is small because the chat had only recently been started when it was exported. One interesting fact is that, on average, there is about one emoji for every two messages.

Next, we separate text messages from media messages into two data frames; messages_df contains only text messages.
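The three counts can be sketched with the standard library alone. Note the swap: instead of the emoji library, this example uses a rough regular expression over common emoji code-point ranges, which is an approximation and will miss some emojis. The URL pattern and the helper message_stats are also assumptions for this example.

```python
import re

MEDIA_MARKER = "<Media omitted>"
URL_RE = re.compile(r"https?://\S+")
# Rough approximation of emoji code points; the emoji library is more complete
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def message_stats(messages):
    """Count media placeholders, emojis and URLs over a list of messages."""
    stats = {"messages": len(messages), "media": 0, "emojis": 0, "urls": 0}
    for text in messages:
        if text.strip() == MEDIA_MARKER:
            stats["media"] += 1
            continue
        stats["emojis"] += len(EMOJI_RE.findall(text))
        stats["urls"] += len(URL_RE.findall(text))
    return stats


sample = ["Hola 🙏🙏", "<Media omitted>", "mira https://example.com 😘", "ok"]
stats = message_stats(sample)
print(stats)
```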
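The split itself is a simple filter on the media placeholder. A minimal sketch on plain dicts, with made-up rows and the hypothetical names media_rows and text_rows:

```python
MEDIA_MARKER = "<Media omitted>"

rows = [
    {"Author": "Frodo", "Message": "Hola"},
    {"Author": "Sam", "Message": "<Media omitted>"},
    {"Author": "Frodo", "Message": "Nos vemos"},
]

# Media messages carry only the placeholder; everything else is text
media_rows = [r for r in rows if r["Message"].strip() == MEDIA_MARKER]
text_rows = [r for r in rows if r["Message"].strip() != MEDIA_MARKER]
print(len(media_rows), len(text_rows))
```

On a pandas data frame, the equivalent would be a boolean mask on the Message column.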

More Stats

Let’s see more stats: for each message we show the letter count and the word count.

More stats come from the emojis. We already counted the total number of emojis and stored them per message; now let’s count the occurrences of each distinct emoji. With that information we can find the most used emoji in the group.

With 68 occurrences, “🙏” is the most used emoji, followed by “😘” with 60 occurrences. Next we show the emoji distribution in a pie chart.

With the same library, Plotly, we can also show this chart for each member individually.
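A small sketch of these two counts follows. Here "letter count" is taken to be the length of the message string, spaces included, which is an assumption; the column names Letter_Count and Word_Count and the helper add_counts are also chosen for this example.

```python
def add_counts(rows):
    """Add per-message character and word counts to each row."""
    for row in rows:
        message = row["Message"]
        row["Letter_Count"] = len(message)        # all characters, spaces included
        row["Word_Count"] = len(message.split())  # whitespace-separated words
    return rows


rows = add_counts([
    {"Author": "Frodo", "Message": "Hola a todos"},
    {"Author": "Sam", "Message": "ok"},
])
print(rows)
```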
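Counting each distinct emoji is a natural fit for collections.Counter. The sketch below again uses a rough code-point regex in place of the emoji library, so it is an approximation, and the sample messages are invented.

```python
import re
from collections import Counter

# Rough approximation of emoji code points; the emoji library is more complete
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

messages = ["Gracias 🙏🙏", "😘", "de nada 🙏"]

# One Counter entry per distinct emoji across all messages
emoji_counts = Counter(
    symbol for text in messages for symbol in EMOJI_RE.findall(text)
)
top_emojis = emoji_counts.most_common(2)
print(top_emojis)
```

Counts like these could then feed a Plotly pie chart, for example through plotly.express.pie.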

Word Cloud

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud. We created a word cloud for all messages.

First, join all the messages into one large text.

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We use the wordcloud library, including the stop-word list that ships with it. We also add a set of Spanish stop words, because Spanish is the native language of the group, plus a few more words to omit some frequent occurrences.
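To make the stop-word step concrete, here is a sketch that joins the messages, drops stop words and counts what remains. It does not draw the cloud. The tiny Spanish list, the extra word "jaja" and the helper word_frequencies are all assumptions for this example; real stop-word lists are much longer.

```python
import re
from collections import Counter

# Tiny illustrative Spanish stop-word list (assumption, far from complete)
STOPWORDS = {"de", "la", "que", "el", "en", "y", "a", "los", "se",
             "un", "una", "no", "es"}
EXTRA_STOPWORDS = {"jaja"}  # hypothetical group-specific filler to omit


def word_frequencies(messages, stopwords):
    """Join all messages and count the words that are not stop words."""
    text = " ".join(messages).lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, accents included
    return Counter(word for word in words if word not in stopwords)


messages = ["La reunión es en la casa", "jaja no llego a la reunión"]
freqs = word_frequencies(messages, STOPWORDS | EXTRA_STOPWORDS)
print(freqs.most_common(3))
```

A frequency mapping like this can be passed to WordCloud.generate_from_frequencies in the wordcloud library; alternatively, the joined text and the stop-word set can be given to WordCloud directly through its stopwords parameter.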

And More Stats

We can look at how message activity behaves over time by grouping the messages by date.
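Grouping by date boils down to counting messages per calendar day. A minimal sketch, assuming the Date field is a day/month/two-digit-year string such as "30/10/20"; the helper messages_per_day and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime


def messages_per_day(rows, date_format="%d/%m/%y"):
    """Return (date, message_count) pairs sorted chronologically."""
    counts = Counter(
        datetime.strptime(row["Date"], date_format).date() for row in rows
    )
    return sorted(counts.items())


rows = [
    {"Date": "31/10/20", "Message": "Buenos dias"},
    {"Date": "30/10/20", "Message": "Hola"},
    {"Date": "30/10/20", "Message": "Que tal"},
]
per_day = messages_per_day(rows)
print(per_day)
```

With pandas, a groupby on the Date column would give the same series, ready to plot as a line chart.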

Day Distribution

Using a radar chart, we show the distribution by day of week.
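The values behind such a chart are just message counts per weekday. A sketch under the same date-format assumption as before ("30/10/20" meaning day/month/year); the helper day_distribution and the sample dates are invented.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def day_distribution(dates, date_format="%d/%m/%y"):
    """Count messages per weekday, keeping days with zero messages."""
    counts = Counter(
        datetime.strptime(d, date_format).weekday() for d in dates
    )
    return {day: counts.get(index, 0) for index, day in enumerate(DAYS)}


distribution = day_distribution(["30/10/20", "31/10/20", "30/10/20"])
print(distribution)
```

Keeping the zero-count days matters for a radar chart, since every axis needs a value. The resulting pairs could be plotted with Plotly, for instance with plotly.express.line_polar.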

Conclusion

There’s a lot of information in our day-to-day activities. We assume several things about that information, but it is great when you can test those assumptions using tools like Plotly and Python. In my personal experience it was great to discover these tools and to see all those trends, charts and info, and everything we saw here is just one star in the sky of data science.

Thanks to Saiteja Kura for the original post; go there to see more interesting articles.