Telegram group chat analysis

Loosely based on Whatsapp Group Chat Analysis using Python and Plotly by Saiteja Kura

Luis Durazo
MCD-UNISON
Nov 7, 2020

--

Today, we will practice Python and basic data science skills with a fun project: analyzing the data in a group chat of friends and colleagues. We have chosen Telegram for two reasons:

  1. It is vastly superior to WhatsApp, especially when it comes to exporting data: if you follow a WhatsApp guide on analyzing chat data, the export is plain text, while the Telegram export supports richer formats such as JSON.
  2. Most of my communication with friends happens through Telegram anyway.

Data preparation

We will not dive deep into how to export the data from Telegram; for that, please refer to this guide.

I chose to download the chat history in JSON format and then manually removed some of the metadata to be left with an array of messages. See a sample (with anonymized data) here:

This is all you need to get started. I kept the export's default file name, result.json. Now, let's load this data into a Jupyter notebook or any other Python interface of your preference.
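Loading the export can be sketched roughly as follows. The inline sample below is a stand-in for the real result.json; the field names (`id`, `date`, `from`, `text`, `media_type`) follow the Telegram JSON export, but the exact columns in your file may vary:

```python
import json
import pandas as pd

# A tiny stand-in for result.json -- the real export is a (much larger)
# array of message objects with fields like these.
sample = '''[
  {"id": 1, "date": "2016-05-01T10:15:00", "from": "Paul", "text": "hello there"},
  {"id": 2, "date": "2016-05-01T10:16:00", "from": "Charles", "text": "hi! how are you?"},
  {"id": 3, "date": "2016-05-01T10:17:00", "from": "Paul", "media_type": "voice_message", "text": ""}
]'''

messages = json.loads(sample)  # with the real file: json.load(open("result.json", encoding="utf-8"))
df = pd.DataFrame(messages)
print(df.shape)   # (3, 5) -- one row per message, one column per field
```

Missing fields (here, `media_type` for text messages) simply become NaN in the DataFrame.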

In the code below I also added a section that randomizes all names for the purposes of this post; it is entirely optional for non-public use.
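One minimal way to do that anonymization (the alias pool here is arbitrary, not the one used in the post) is to map each unique sender to a random fake name:

```python
import random
import pandas as pd

# Stand-in DataFrame with a "from" column, as in the Telegram export.
df = pd.DataFrame({"from": ["Paul", "Charles", "Paul", "Larry"]})

# Map each real name to a random alias so the post can be shared publicly.
real_names = df["from"].dropna().unique().tolist()
aliases = ["Alice", "Bob", "Carol", "Dave", "Erin"]  # any pool of fake names
random.seed(42)                                       # reproducible mapping
random.shuffle(aliases)
alias_map = dict(zip(real_names, aliases))
df["from"] = df["from"].map(alias_map)
```

Seeding the generator keeps the mapping stable between notebook runs, so the same person keeps the same alias across all charts.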

Let's start with the basics: who has sent the most messages, and what types of messages are most common?

Let's draw some charts with Plotly. Some folks in the group like to record audio rather than type, so it's time to find out who wins the no-hands contest:
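The counting itself is a couple of `value_counts` calls; a sketch on stand-in data (the Telegram export marks audio with `media_type == "voice_message"`):

```python
import pandas as pd

# Stand-in data: "media_type" is NaN/None for plain text messages.
df = pd.DataFrame({
    "from": ["Paul", "Charles", "Paul", "Paul", "Charles"],
    "media_type": [None, "voice_message", None, "voice_message", None],
})

# Messages per person, and voice messages per person (the no-hands contest).
msg_counts = df["from"].value_counts()
voice_counts = df[df["media_type"] == "voice_message"]["from"].value_counts()
print(msg_counts.to_dict())    # {'Paul': 3, 'Charles': 2}
print(voice_counts.to_dict())  # {'Charles': 1, 'Paul': 1}

# With Plotly installed, the bar chart is one line:
# import plotly.express as px
# px.bar(msg_counts, title="Messages per person").show()
```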

Results:

In our next analysis, we will get a breakdown of how emojis and words are distributed and counted. The code below adds that data to the DataFrame while we query some general statistics about the data:
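A minimal sketch of adding those columns. Instead of an emoji library, this version uses a rough Unicode-range check (the main emoji and symbol blocks), which is an assumption on my part, not necessarily the author's method:

```python
import pandas as pd

df = pd.DataFrame({"text": ["jaja 😂😂", "good morning ☀️", "ok"]})

def count_emojis(text):
    # Rough heuristic: count characters in the main emoji/symbol Unicode blocks.
    return sum(1 for ch in str(text)
               if 0x1F000 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF)

df["word_count"] = df["text"].astype(str).str.split().str.len()
df["emoji_count"] = df["text"].map(count_emojis)
print(df[["word_count", "emoji_count"]].to_dict("list"))
```

With these two columns in place, per-person totals and averages are simple groupbys.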

For each of our friends, let's just print some insights:
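The per-friend summary (message count, total words, average words per message) boils down to a grouped aggregation; a sketch assuming a `word_count` column like the one built above:

```python
import pandas as pd

df = pd.DataFrame({
    "from": ["Paul", "Paul", "Charles"],
    "word_count": [10, 6, 12],
})

# Per person: number of messages, total words, and average words per message.
stats = df.groupby("from")["word_count"].agg(["count", "sum", "mean"])
print(stats)
```

This is exactly the view that makes the observations below easy to read off: who sends the most words in total versus who writes the longest messages on average.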

Note: I manually truncated some of the data above for the purposes of this post.

From that we can conclude a few things:

  1. Larry seems to be an outlier: likely a short-lived account, a bot, or a Telegram service.
  2. Charles has the highest average of words per message, although he is not the one with the most words sent.
  3. Paul seems to be the one that shares the most words, and his average isn't low either.

Let's now see how the emoji distribution is looking:
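Tallying emojis across the whole group can be sketched with a `Counter`, reusing the same Unicode-range heuristic as before (an assumption, not the author's exact code):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({"text": ["😂😂👍", "😂 ok", "nice 👍"]})

def extract_emojis(text):
    # Same rough Unicode-range heuristic as in the counting step.
    return [ch for ch in str(text)
            if 0x1F000 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF]

all_emojis = Counter(e for t in df["text"] for e in extract_emojis(t))
print(all_emojis.most_common(3))   # [('😂', 3), ('👍', 2)]
```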

Now, let's repeat the exercise but per user:

The friend list is too big, so I'll just share a couple of them:
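The per-user version is the same tally grouped by sender; a sketch:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({"from": ["Paul", "Paul", "Charles"],
                   "text": ["😂😂", "👍", "😂"]})

def extract_emojis(text):
    return [ch for ch in str(text) if 0x1F000 <= ord(ch) <= 0x1FAFF]

# One emoji Counter per person.
per_user = (df.groupby("from")["text"]
              .apply(lambda texts: Counter(e for t in texts
                                           for e in extract_emojis(t))))
print(per_user["Paul"])      # Counter({'😂': 2, '👍': 1})
```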

Emoji and word counts are not the only things that matter; it is also important to know which words we use as a group. I chose a specific stop-word list and added some custom entries that do not appear in the code below, most of them hiding real names.
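A minimal sketch of the word tally. The stop-word set here is a tiny illustrative placeholder, not the real list the post used (which was a full list plus custom entries):

```python
from collections import Counter
import re
import pandas as pd

df = pd.DataFrame({"text": ["jaja que risa", "jaja jaja ok", "que dia"]})

# Illustrative placeholder; the real analysis used a full stop-word list
# plus custom entries (mostly real names).
stop_words = {"que", "ok"}

words = Counter(
    w
    for t in df["text"]
    for w in re.findall(r"[a-záéíóúñ]+", str(t).lower())
    if w not in stop_words
)
print(words.most_common(2))
```

Feeding a counter like this into a word cloud or a Plotly bar chart is then straightforward.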

As you can see, we mostly laugh in the group, with the occasional swear word.

How many words have we shared over time? Let's take a look.

Notice that I'm creating a new datetime field instead of using the date field that already exists. In contrast to the WhatsApp guide this post is based on, where the author has to parse text, I simply derive a new column that lets me easily resample by the resulting datetime index.
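That datetime-and-resample step can be sketched like this, assuming the export's ISO-8601 `date` strings and a `word_count` column like the one built earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2016-05-01T10:15:00", "2016-05-02T11:00:00", "2016-06-03T09:30:00"],
    "word_count": [5, 3, 7],
})

# Build a proper datetime column from the export's ISO-8601 strings,
# set it as the index, and resample words per month ("MS" = month start).
df["datetime"] = pd.to_datetime(df["date"])
monthly = df.set_index("datetime")["word_count"].resample("MS").sum()
print(monthly)
```

Because `to_datetime` parses the ISO strings directly, there is no manual text parsing, and the resulting DatetimeIndex unlocks `resample` for any frequency (monthly, weekly, daily).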

Results:

Over time, we have shared fewer and fewer messages, although the group remains active.

I'm interested in the first few spikes: 2016 through 2017 were our most active years of conversation. What were our top 10 days? What was the hour distribution of all messages?
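Both questions are one-liners once the datetime column exists; a sketch on stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2017-07-28T10:00:00", "2017-07-28T22:30:00", "2017-07-29T10:05:00"],
    "word_count": [40, 25, 10],
})
df["datetime"] = pd.to_datetime(df["date"])

# Top days by total words shared, and message counts per hour of day.
top_days = df.groupby(df["datetime"].dt.date)["word_count"].sum().nlargest(10)
by_hour = df["datetime"].dt.hour.value_counts().sort_index()
print(top_days.head(1))
print(by_hour.to_dict())   # {10: 2, 22: 1}
```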

July 28, 2017 was our most active day, with almost 5,000 words shared. On the right is a pretty lean distribution of messages by hour of the day.

And last but not least, here is how active we were on each day of the week, with a surprisingly low count on Sundays!
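The day-of-week count is one more `.dt` accessor away; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2017-07-28T10:00:00",    # a Friday
                            "2017-07-29T11:00:00",    # a Saturday
                            "2017-07-30T12:00:00"]})  # a Sunday
dt = pd.to_datetime(df["date"])

# Count messages per day of the week; each sample day appears once here.
by_day = dt.dt.day_name().value_counts()
print(by_day.to_dict())
```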

We hang out together very frequently on weekends, which may help explain why we don't message each other then: we are, after all, talking face to face, and that's great!

Takeaways

I really had fun with this project, and it's an easy and useful introduction to some of the basic tools a data scientist has available.
