Telegram group chat analysis

Loosely based on Whatsapp Group Chat Analysis using Python and Plotly by Saiteja Kura

Luis Durazo
MCD-UNISON
Nov 7, 2020

--

Today, we will practice Python and basic data science skills with a fun project: analyzing the data in a group chat of friends and colleagues. We have chosen Telegram for two reasons:

  1. It is vastly superior to WhatsApp, especially when it comes to exporting data: if you follow a WhatsApp guide on analyzing chat data, the export is plain text, while the Telegram export supports richer formats such as JSON.
  2. Most of my communication with friends happens through Telegram anyway.

Data preparation

We will not dive deep into how to export the data from Telegram; for that, please refer to this guide.

I chose to download the chat history in JSON format and then manually removed some of the metadata to be left with an array of messages. See a sample (with anonymized data) here:

This is all you need to get started. I kept the export's default file name, result.json. Now, let's load this data into a Jupyter notebook or any other Python interface of your preference.
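Loading the export can be sketched roughly as follows. The inline sample below is a stand-in for the real result.json; the field names (`id`, `date`, `from`, `text`, `media_type`) follow the Telegram JSON export, but the exact columns in your file may vary:

```python
import json
import pandas as pd

# A tiny stand-in for result.json -- the real export is a (much larger)
# array of message objects with fields like these.
sample = '''[
  {"id": 1, "date": "2016-05-01T10:15:00", "from": "Paul", "text": "hello there"},
  {"id": 2, "date": "2016-05-01T10:16:00", "from": "Charles", "text": "hi! how are you?"},
  {"id": 3, "date": "2016-05-01T10:17:00", "from": "Paul", "media_type": "voice_message", "text": ""}
]'''

messages = json.loads(sample)  # with the real file: json.load(open("result.json", encoding="utf-8"))
df = pd.DataFrame(messages)
print(df.shape)   # (3, 5) -- one row per message, one column per field
```

Missing fields (here, `media_type` for text messages) simply become NaN in the DataFrame.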

In the code below I also added a section that randomizes all names for the purposes of this post; it is entirely optional for non-public use.
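One minimal way to do that anonymization (the alias pool here is arbitrary, not the one used in the post) is to map each unique sender to a random fake name:

```python
import random
import pandas as pd

# Stand-in DataFrame with a "from" column, as in the Telegram export.
df = pd.DataFrame({"from": ["Paul", "Charles", "Paul", "Larry"]})

# Map each real name to a random alias so the post can be shared publicly.
real_names = df["from"].dropna().unique().tolist()
aliases = ["Alice", "Bob", "Carol", "Dave", "Erin"]  # any pool of fake names
random.seed(42)                                       # reproducible mapping
random.shuffle(aliases)
alias_map = dict(zip(real_names, aliases))
df["from"] = df["from"].map(alias_map)
```

Seeding the generator keeps the mapping stable between notebook runs, so the same person keeps the same alias across all charts.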

Let's start with the basics: who has sent the most messages, and what types of messages are most common?

Let's draw some charts with Plotly. Some folks in the group like to record audio rather than type, so it's time to find out who wins the no-hands contest:
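The counting itself is a couple of `value_counts` calls; a sketch on stand-in data (the Telegram export marks audio with `media_type == "voice_message"`):

```python
import pandas as pd

# Stand-in data: "media_type" is NaN/None for plain text messages.
df = pd.DataFrame({
    "from": ["Paul", "Charles", "Paul", "Paul", "Charles"],
    "media_type": [None, "voice_message", None, "voice_message", None],
})

# Messages per person, and voice messages per person (the no-hands contest).
msg_counts = df["from"].value_counts()
voice_counts = df[df["media_type"] == "voice_message"]["from"].value_counts()
print(msg_counts.to_dict())    # {'Paul': 3, 'Charles': 2}
print(voice_counts.to_dict())  # {'Charles': 1, 'Paul': 1}

# With Plotly installed, the bar chart is one line:
# import plotly.express as px
# px.bar(msg_counts, title="Messages per person").show()
```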

Results:

In our next analysis, we will get a breakdown of how emojis and words are distributed and counted. The code below adds that data to the DataFrame while we query some general statistics about the data:
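A minimal sketch of adding those columns. Instead of an emoji library, this version uses a rough Unicode-range check (the main emoji and symbol blocks), which is an assumption on my part, not necessarily the author's method:

```python
import pandas as pd

df = pd.DataFrame({"text": ["jaja 😂😂", "good morning ☀️", "ok"]})

def count_emojis(text):
    # Rough heuristic: count characters in the main emoji/symbol Unicode blocks.
    return sum(1 for ch in str(text)
               if 0x1F000 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF)

df["word_count"] = df["text"].astype(str).str.split().str.len()
df["emoji_count"] = df["text"].map(count_emojis)
print(df[["word_count", "emoji_count"]].to_dict("list"))
```

With these two columns in place, per-person totals and averages are simple groupbys.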

For each of our friends, let's just print some insights:
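The per-friend summary (message count, total words, average words per message) boils down to a grouped aggregation; a sketch assuming a `word_count` column like the one built above:

```python
import pandas as pd

df = pd.DataFrame({
    "from": ["Paul", "Paul", "Charles"],
    "word_count": [10, 6, 12],
})

# Per person: number of messages, total words, and average words per message.
stats = df.groupby("from")["word_count"].agg(["count", "sum", "mean"])
print(stats)
```

This is exactly the view that makes the observations below easy to read off: who sends the most words in total versus who writes the longest messages on average.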

Note: I manually truncated some of the data above for the purposes of this post.

From that we can conclude a few things:

  1. Larry seems to be an outlier: likely a short-lived account, a bot, or a Telegram service.
  2. Charles has the highest average of words per message, although he is not the one with the most words sent.
  3. Paul seems to be the one that shares the most words, and his average isn't low either.

Let's now see how the emoji distribution is looking:
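Tallying emojis across the whole group can be sketched with a `Counter`, reusing the same Unicode-range heuristic as before (an assumption, not the author's exact code):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({"text": ["😂😂👍", "😂 ok", "nice 👍"]})

def extract_emojis(text):
    # Same rough Unicode-range heuristic as in the counting step.
    return [ch for ch in str(text)
            if 0x1F000 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF]

all_emojis = Counter(e for t in df["text"] for e in extract_emojis(t))
print(all_emojis.most_common(3))   # [('😂', 3), ('👍', 2)]
```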

Now, let's repeat the exercise but per user:

The friend list is too big, so I'll just share a couple of them:
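The per-user version is the same tally grouped by sender; a sketch:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({"from": ["Paul", "Paul", "Charles"],
                   "text": ["😂😂", "👍", "😂"]})

def extract_emojis(text):
    return [ch for ch in str(text) if 0x1F000 <= ord(ch) <= 0x1FAFF]

# One emoji Counter per person.
per_user = (df.groupby("from")["text"]
              .apply(lambda texts: Counter(e for t in texts
                                           for e in extract_emojis(t))))
print(per_user["Paul"])      # Counter({'😂': 2, '👍': 1})
```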

Emoji and word counts are not the only things that matter; it is also important to know which words we use as a group. I chose a specific stop-word list and added some custom entries that do not appear in the code below, most of them hiding real names.
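A minimal sketch of the word tally. The stop-word set here is a tiny illustrative placeholder, not the real list the post used (which was a full list plus custom entries):

```python
from collections import Counter
import re
import pandas as pd

df = pd.DataFrame({"text": ["jaja que risa", "jaja jaja ok", "que dia"]})

# Illustrative placeholder; the real analysis used a full stop-word list
# plus custom entries (mostly real names).
stop_words = {"que", "ok"}

words = Counter(
    w
    for t in df["text"]
    for w in re.findall(r"[a-záéíóúñ]+", str(t).lower())
    if w not in stop_words
)
print(words.most_common(2))
```

Feeding a counter like this into a word cloud or a Plotly bar chart is then straightforward.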

As you can see, we mostly laugh in the group, with the occasional swear word.

How many words have we shared over time? Let's take a look.

Notice that I'm creating a new datetime field instead of using the date field that already exists. In contrast to the WhatsApp guide this post is based on, where the author has to parse text, I simply derive a new column that lets me easily resample by the resulting datetime index.
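That datetime-and-resample step can be sketched like this, assuming the export's ISO-8601 `date` strings and a `word_count` column like the one built earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2016-05-01T10:15:00", "2016-05-02T11:00:00", "2016-06-03T09:30:00"],
    "word_count": [5, 3, 7],
})

# Build a proper datetime column from the export's ISO-8601 strings,
# set it as the index, and resample words per month ("MS" = month start).
df["datetime"] = pd.to_datetime(df["date"])
monthly = df.set_index("datetime")["word_count"].resample("MS").sum()
print(monthly)
```

Because `to_datetime` parses the ISO strings directly, there is no manual text parsing, and the resulting DatetimeIndex unlocks `resample` for any frequency (monthly, weekly, daily).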

Results:

Over time, we have shared fewer and fewer messages, although the group remains active.

I'm interested in the first few spikes: 2016 through 2017 were our most active years of conversation. What were our top 10 days? What was the hour distribution of all messages?
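Both questions are one-liners once the datetime column exists; a sketch on stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2017-07-28T10:00:00", "2017-07-28T22:30:00", "2017-07-29T10:05:00"],
    "word_count": [40, 25, 10],
})
df["datetime"] = pd.to_datetime(df["date"])

# Top days by total words shared, and message counts per hour of day.
top_days = df.groupby(df["datetime"].dt.date)["word_count"].sum().nlargest(10)
by_hour = df["datetime"].dt.hour.value_counts().sort_index()
print(top_days.head(1))
print(by_hour.to_dict())   # {10: 2, 22: 1}
```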

July 28, 2017 was our most active day, with almost 5,000 words shared. On the right is a pretty lean distribution of messages by hour of the day.

And last but not least, here is how active we were on each day of the week, with a surprisingly low count on Sundays!
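The day-of-week count is one more `.dt` accessor away; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2017-07-28T10:00:00",    # a Friday
                            "2017-07-29T11:00:00",    # a Saturday
                            "2017-07-30T12:00:00"]})  # a Sunday
dt = pd.to_datetime(df["date"])

# Count messages per day of the week; each sample day appears once here.
by_day = dt.dt.day_name().value_counts()
print(by_day.to_dict())
```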

We hang out together very frequently on weekends, which may help explain why we don't message each other then: we are, after all, talking face to face, and that's great!

Takeaways

I really had fun with this project, and it's an easy and useful introduction to some of the basic tools a data scientist has available.
