Automate data collection from public channels and groups on Telegram using Telegram’s API and Python.

Ishita Gopal
11 min read · Mar 23, 2023


This tutorial illustrates how to use the Telethon library in Python to collect messages from any public channel or group chat on Telegram. It introduces the fundamental concepts of concurrent and asynchronous programming that are required to use the Telethon package effectively. The tutorial also shows how to extract and save information from the retrieved responses, and uses k-means clustering to analyze the topics discussed in the posts.

In the tutorial, data is gathered from the New York Times channel on Telegram, which was originally established to report on the Russia-Ukraine war. Amidst the conflict and limited information resources, Telegram channels became crucial sources of information in Ukraine.

About Telegram:

Telegram is a popular messaging application with social media-like features, including public channels and groups that can accommodate hundreds of thousands of subscribers/participants. The platform offers rich data for studying diverse topics such as communication dynamics, information diffusion, user behavior and cross-platform analysis. The research possibilities are abundant and varied, and the data is easy and free to collect with Telegram’s API!

But first, you will need a Telegram account linked to your mobile number.

Start by:

  1. Downloading the app.
  2. Registering your application to get your API id and API hash.
  3. Storing the credentials (id/hash) safely. You can store them in a .env file:
TELEGRAM_API_ID = "987298"
TELEGRAM_API_HASH = "o898dnjdu23801kmcloewij"
PHONE_NUM = "+19810023456"
  4. Installing the Telethon library. This is an asyncio-based Python 3 library used to interact with Telegram’s API. You can find some examples of using asyncio here.
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade telethon
  5. Updating IPython to 7.0+ and ipykernel to 5.0+ so that async is available in notebooks. (More details about using async in notebooks can be found here). Note that asyncio code can cause some issues in Jupyter Notebooks, and the syntax differs slightly between a notebook and a .py script. Also, according to Telethon’s documentation: “When using Telethon with such interpreters, you are also more likely to get “sqlite3.OperationalError: database is locked” with them. If they cause too much trouble, just write your code in a .py file and run that, or use the normal python interpreter.”
python3 -m pip install --upgrade ipython ipykernel
  6. Checking that asyncio is available. This library allows you to do concurrent asynchronous programming in Python. It is part of the Python 3 standard library, so it does not need to be installed with pip:
python3 -c "import asyncio"

Let’s get started!

from telethon import TelegramClient
import pandas as pd
import json
from pathlib import Path
import os

from config import Config  # import the API id and hash from here
# You can test the code by inputting your credentials and uncommenting the lines below.
# Ideally, don't hardcode them here.
# api_id = Config.api_id
# api_hash = Config.api_hash
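
The config module imported above is not shown in this tutorial. One possible way to write it is sketched below, assuming the credentials are exposed as environment variables under the names used in the .env file. The Settings class, the load_settings helper, and the TELEGRAM_SESSION and OUTPUT_DIR variable names are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_id: int
    api_hash: str
    session_name: str
    output_dir: str


def load_settings(env=None):
    """Build the settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        api_id=int(env["TELEGRAM_API_ID"]),  # Telethon expects an integer id
        api_hash=env["TELEGRAM_API_HASH"],
        # Hypothetical optional variables with illustrative defaults
        session_name=env.get("TELEGRAM_SESSION", "telegram_session"),
        output_dir=env.get("OUTPUT_DIR", "json_data"),
    )


# Demo with the placeholder values from the .env example;
# in real use, call load_settings() with no argument to read os.environ.
Config = load_settings({
    "TELEGRAM_API_ID": "987298",
    "TELEGRAM_API_HASH": "o898dnjdu23801kmcloewij",
})
print(Config.api_id, Config.output_dir)
```

Note that a .env file is not read into the environment automatically. A package such as python-dotenv can load it before the settings are built.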

Some notes on concurrent programming & asyncio:

What does asynchronous mean?

A typical Python program executes code sequentially (synchronously): each task or unit of work must complete before the program moves forward. For example, a function must execute and return a result before you can move on to the next task in the program.

This is not required in asynchronous programming. Asynchronous programming is a form of concurrent programming in which you do not need to wait for a unit of work to complete before proceeding. For example, you can submit an HTTP request and, while waiting for the response, do other work that is waiting in a queue.

Note: This is not multithreading. Instead, we have a single thread and switch between tasks to use time more efficiently, rather than sitting idle until each task completes.

Building blocks to do this:

Coroutines?

A coroutine is a function that can pause and resume its execution (at the points where it awaits something). To make such functions:

  1. Simply put the async keyword at the beginning of your function definition:
async def my_coroutine():
    ...  # do something
    return
  2. When you call such a function/coroutine, it returns a coroutine object. Unlike a normal function, its body has not been “executed” yet.
  3. These coroutine objects are executed by other means (an event loop, explained below).
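
The difference between calling a coroutine and running it can be seen in a minimal sketch like the one below. It is meant to be run as a .py script; the return value 42 is just a placeholder.

```python
import asyncio
import inspect


async def my_coroutine():
    return 42


coro = my_coroutine()  # calling the function only creates a coroutine object
is_coroutine = inspect.iscoroutine(coro)

# asyncio.run starts an event loop and runs the coroutine to completion
result = asyncio.run(coro)
print(is_coroutine, result)
```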

Await keyword?

This keyword is used to pause the execution of a coroutine. Simply add the keyword await in front of the object you want to wait for. When the coroutine reaches await, it stops there, the awaitable object runs, and once the awaitable completes, the rest of the coroutine proceeds.

import asyncio

async def my_coroutine():
    print("do something")
    await asyncio.sleep(1)  # pause here until the awaitable completes
    print("do something else")  # then proceed with the rest of the function
    return

When there is more than one coroutine, await provides a context-switching point, or “exit gate”, through which the event loop can switch between coroutines. It allows another coroutine to run during the delay created by await.

Awaitable objects can be coroutines, Tasks, or Futures. Tasks let you schedule coroutines: you can wrap a coroutine in a task (asyncio.create_task()) so that it runs concurrently. A Future represents the eventual result of an asynchronous operation.
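
To see the switching in action, here is a small sketch that schedules two coroutines as Tasks on one thread. The worker function, its names and the delays are made up for illustration, with asyncio.sleep standing in for a slow network call.

```python
import asyncio

log = []


async def worker(name, delay):
    log.append(f"{name} started")
    await asyncio.sleep(delay)  # hands control back to the event loop
    log.append(f"{name} finished")
    return name


async def run_both():
    # create_task schedules both coroutines to run concurrently
    task_a = asyncio.create_task(worker("a", 0.05))
    task_b = asyncio.create_task(worker("b", 0.15))
    return [await task_a, await task_b]


finished = asyncio.run(run_both())
print(log)
```

Both workers start before either one finishes: while "a" is waiting, the event loop runs "b".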

Executing coroutines

We need an event loop, which executes all the tasks passed to it. It generally runs in the main thread.

asyncio.run(my_coroutine())

Interpreters like Jupyter run the asyncio event loop implicitly, which is why we don’t create it below. In a notebook, all async functions must be scheduled on that existing event loop (for example with await) instead of attempting to create a new loop in the same thread. Again, this is why code in a notebook differs slightly from code written in a script (.py file).
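
The sketch below imitates the notebook situation from a plain script: inside an already running event loop, asyncio.run refuses to start a second loop, while await works. The inner and outer function names are hypothetical.

```python
import asyncio


async def inner():
    return "ok"


async def outer():
    # We are inside a running event loop here, as in a Jupyter cell
    coro = inner()
    try:
        asyncio.run(coro)  # what a .py script would do at top level
    except RuntimeError as err:
        message = str(err)
        value = await coro  # what works instead when a loop is already running
    return message, value


message, value = asyncio.run(outer())
print(message)
print(value)
```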

Let’s run the code!

Set your API credentials

# Get credentials from the config.py file
api_id = Config.api_id
api_hash = Config.api_hash

# Name of the session file that will be created and stored in the working directory.
# Session files store the state of the client so that it can be resumed later,
# and you don't have to log in every time you run the code.
session_name = Config.session_name

Create the function to collect messages

# Make a coroutine main()
# This function/coroutine takes in
# 1. the name of the chat we want to collect
# 2. the number of messages to collect
# There are many other arguments you can pass (https://docs.telethon.dev/en/stable/modules/client.html?

async def main(chat_name, limit):
    # "async with" creates asynchronous context managers
    # It is an extension of the “with” expression for use only in coroutines within asyncio programs
    async with TelegramClient(session_name, api_id, api_hash) as client:

        # Get chat info
        chat_info = await client.get_entity(chat_name)

        # Get all the messages, given the limit
        # It will return the latest 5 messages if limit is 5
        messages = await client.get_messages(entity=chat_info, limit=limit)

        # Return the results in a dictionary
        return {"messages": messages, "channel": chat_info}

Use the above function to collect all the messages from The New York Times Telegram channel (@nytimes)

The first time you run it, this will open an input box asking for your phone number, followed by the login code sent to your Telegram app.

# limit=None will collect all the messages from the nytimes Telegram channel (https://t.me/nytimes)
# On first use, this opens an input box and asks for your phone number.
# Top-level await works in Jupyter; in a .py script, use asyncio.run(main(...)) instead.
chat_input = "nytimes"
results = await main(chat_name=chat_input, limit=None)

Returns a dictionary with results

results.keys()
dict_keys(['messages', 'channel'])

Let’s look at the returned channel information

We can see, for example, that:

  1. It’s a channel
  2. Its title is The New York Times
  3. It’s a “verified” channel
results["channel"].to_dict()
{'_': 'Channel',
'id': 1606432449,
'title': 'The New York Times',
'photo': {'_': 'ChatPhoto',
'photo_id': 4996942669779413476,
'dc_id': 1,
'has_video': False,
'stripped_thumb': b'\x01\x08\x08\xce>O\xd8\x866\xf9\xb9\xe7\xaeh\xa2\x8a\x00'},
'date': datetime.datetime(2022, 3, 9, 19, 25, 13, tzinfo=datetime.timezone.utc),
'version': 0,
'creator': False,
'left': True,
'broadcast': True,
'verified': True,
'megagroup': False,
'restricted': False,
'signatures': False,
'min': False,
'scam': False,
'has_link': False,
'has_geo': False,
'slowmode_enabled': False,
'call_active': False,
'call_not_empty': False,
'fake': False,
'gigagroup': False,
'access_hash': -1839593494233108316,
'username': 'nytimes',
'restriction_reason': [],
'admin_rights': None,
'banned_rights': None,
'default_banned_rights': None,
'participants_count': None}

There are 2262 messages posted to date

len(results["messages"])
2262

Messages are stored as objects in a list

results["messages"][:10]
[<telethon.tl.patched.Message at 0x132938a60>,
<telethon.tl.patched.Message at 0x13296c670>,
<telethon.tl.patched.Message at 0x13296cc10>,
<telethon.tl.patched.Message at 0x13296f160>,
<telethon.tl.patched.Message at 0x13296f6a0>,
<telethon.tl.patched.Message at 0x13296f820>,
<telethon.tl.patched.Message at 0x132970160>,
<telethon.tl.patched.Message at 0x132970640>,
<telethon.tl.patched.Message at 0x132970df0>,
<telethon.tl.patched.Message at 0x1329723a0>]

Let’s look at an example message and see what information is returned.

We can see that many fields are returned, such as the message id, the date the message was posted, its text, and the associated media.

results["messages"][0].to_dict()
{'_': 'Message',
'id': 2320,
'peer_id': {'_': 'PeerChannel', 'channel_id': 1606432449},
'date': datetime.datetime(2023, 3, 23, 18, 56, 52, tzinfo=datetime.timezone.utc),
'message': 'Here are some of the stories we’re covering from around the world:\n\nIn Israel, Another Divisive Law on Another Day of Mass Protest\n\nIsrael’s Parliament passed legislation early Thursday that would make it more difficult to declare prime ministers incapacitated and remove them from office, a move that critics said was aimed at protecting the country’s leader, Benjamin Netanyahu, who is on trial for corruption.\n\n‘Give Me an Abrams!’ Ukrainian Tank Commanders Grow Impatient.\n\nUkraine’s military, equipped with Soviet-era tanks and relying on decades-old training, is holding its own against Russia’s attacks. But commanders long for Western weapons, and are growing impatient.\n\nLeader of Indian Party Opposing Modi Is Sentenced in Defamation Case\n\nRahul Gandhi, the leader of the main party opposing Prime Minister Narendra Modi of India, was convicted of defamation and sentenced to prison on Thursday, the latest blow to the beleaguered opposition party just a year before national elections.\n\n@nytimes',
'out': False,
'mentioned': False,
'media_unread': False,
'silent': False,
'post': True,
'from_scheduled': False,
'legacy': True,
'edit_hide': True,
'pinned': False,
'from_id': None,
'fwd_from': None,
'via_bot_id': None,
'reply_to': None,
'media': {'_': 'MessageMediaDocument',
'document': {'_': 'Document',
'id': 5829103990356313086,
'access_hash': -2750025792869668095,
'file_reference': b"\x02_\xc06\xc1\x00\x00\t\x10d\x1c\xc2\x9ca~u\xaaS\xdcqA'Z\x06}r \x9a\xb0",
'date': datetime.datetime(2023, 3, 23, 18, 50, 18, tzinfo=datetime.timezone.utc),
'mime_type': 'video/mp4',
'size': 2127021,
'dc_id': 4,
'attributes': [{'_': 'DocumentAttributeVideo',
'duration': 40,
'w': 426,
'h': 240,
'round_message': False,
'supports_streaming': True},
{'_': 'DocumentAttributeFilename',
'file_name': '107074_1_23vid-Israel-Protests_wg_240p.mp4'}],
'thumbs': [{'_': 'PhotoStrippedSize',
'type': 'i',
'bytes': b'\x01\x16(\xb7\xb0z\x8ezRyx\xaa\xcf9\xc78\x18\xefP\xa5\xcc\xa8q$\x99\x1dA4\xc0\xd0\xf2\xcf\xa5\x1e]V]G\xb1\x03?J\x93\xfbB#\x8c\xe6\x90\x120T\x19c\x81EQ\x9a\xed\x1d\xc9\x07\xaf\xf8QL\nSM\xe6\x11\xc6\x00\xf7\xa3;\xd5A\xfe\x11\x8a(\xa4\x02\x1eq\x8e\xa7\xbd#\x0cw\xa2\x8a\x00h\xc6N\x7fJ(\xa2\x80?'},
{'_': 'PhotoSize', 'type': 'm', 'w': 320, 'h': 180, 'size': 10320}],
'video_thumbs': []},
'ttl_seconds': None},
'reply_markup': None,
'entities': [{'_': 'MessageEntityTextUrl',
'offset': 68,
'length': 62,
'url': 'https://nyti.ms/42u4hAl'},
{'_': 'MessageEntityTextUrl',
'offset': 414,
'length': 62,
'url': 'https://nyti.ms/3lFvgrZ'},
{'_': 'MessageEntityTextUrl',
'offset': 680,
'length': 68,
'url': 'https://nyti.ms/3z1e1V7'},
{'_': 'MessageEntityMention', 'offset': 998, 'length': 8}],
'views': 4559,
'forwards': 3,
'replies': None,
'edit_date': datetime.datetime(2023, 3, 23, 18, 58, 1, tzinfo=datetime.timezone.utc),
'post_author': None,
'grouped_id': None,
'restriction_reason': [],
'ttl_period': None}
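
If only a few of these fields are needed, each message dictionary can be reduced to a flat record. The sketch below is one way to do that, assuming the field names shown in the output above; the flatten_message helper is hypothetical and the sample is a trimmed copy of the example message.

```python
from datetime import datetime, timezone


def flatten_message(msg):
    """Keep a handful of fields from a message dict produced by to_dict()."""
    media = msg.get("media")
    entities = msg.get("entities") or []
    return {
        "id": msg["id"],
        "date": msg["date"],
        "text": msg.get("message") or "",
        "views": msg.get("views"),
        "forwards": msg.get("forwards"),
        # 'MessageMediaDocument' -> 'Document'; None when there is no media
        "media_type": media["_"].replace("MessageMedia", "") if media else None,
        # Links embedded in the text
        "urls": [e["url"] for e in entities if e.get("_") == "MessageEntityTextUrl"],
    }


# Trimmed copy of the example message above
sample = {
    "_": "Message",
    "id": 2320,
    "date": datetime(2023, 3, 23, 18, 56, 52, tzinfo=timezone.utc),
    "message": "Here are some of the stories we’re covering from around the world:",
    "media": {"_": "MessageMediaDocument"},
    "entities": [
        {"_": "MessageEntityTextUrl", "offset": 68, "length": 62,
         "url": "https://nyti.ms/42u4hAl"},
        {"_": "MessageEntityMention", "offset": 998, "length": 8},
    ],
    "views": 4559,
    "forwards": 3,
}

row = flatten_message(sample)
print(row["media_type"], row["urls"])
```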

Let’s get the text of the message

results["messages"][0].text
'Here are some of the stories we’re covering from around the world:\n\n[In Israel, Another Divisive Law on Another Day of Mass Protest](https://nyti.ms/42u4hAl)\n\nIsrael’s Parliament passed legislation early Thursday that would make it more difficult to declare prime ministers incapacitated and remove them from office, a move that critics said was aimed at protecting the country’s leader, Benjamin Netanyahu, who is on trial for corruption.\n\n[‘Give Me an Abrams!’ Ukrainian Tank Commanders Grow Impatient.](https://nyti.ms/3lFvgrZ)\n\nUkraine’s military, equipped with Soviet-era tanks and relying on decades-old training, is holding its own against Russia’s attacks. But commanders long for Western weapons, and are growing impatient.\n\n[Leader of Indian Party Opposing Modi Is Sentenced in Defamation Case](https://nyti.ms/3z1e1V7)\n\nRahul Gandhi, the leader of the main party opposing Prime Minister Narendra Modi of India, was convicted of defamation and sentenced to prison on Thursday, the latest blow to the beleaguered opposition party just a year before national elections.\n\n@nytimes'

Save the json results

msg_list = [msg.to_dict() for msg in results["messages"]]

Save the json to the file: ‘json_data/nytimes.json’

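# Save results
# Create the output directory if it does not exist yet
Path(Config.output_dir).mkdir(parents=True, exist_ok=True)
out_path = os.path.join(Config.output_dir, f"{chat_input}.json")
with open(out_path, "w", encoding="utf-8") as f:
    # default=str converts values that json cannot serialize (e.g. datetime, bytes) to strings
    json.dump(msg_list, f, default=str, ensure_ascii=False)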
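
The default=str argument matters because the message dictionaries contain datetime and bytes values, which the json module cannot serialize on its own. A small sketch with a made-up record shows the effect:

```python
import json
from datetime import datetime, timezone

# Made-up record with the same kinds of values as a message dict
record = {
    "id": 2320,
    "date": datetime(2023, 3, 23, 18, 56, 52, tzinfo=timezone.utc),
    "thumb": b"\x01\x08",
}

try:
    json.dumps(record)
    failed_without_default = False
except TypeError:
    failed_without_default = True  # datetime and bytes are not JSON serializable

# default=str is called on every value json cannot handle
serialized = json.dumps(record, default=str, ensure_ascii=False)
restored = json.loads(serialized)
print(restored["date"])
```

The trade-off is that dates come back as plain strings and bytes as their printed representation, so they need to be parsed again after loading.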

Read in the json file and convert it to a pandas data frame

out_path = os.path.join(Config.output_dir, f"{chat_input}.json")
nytimes_df = pd.read_json(out_path)
nytimes_df.head()
[Image: first rows of nytimes_df]
nytimes_df["just_date"] = pd.to_datetime(nytimes_df.date).dt.date
# Date range
f"{nytimes_df.just_date.min()} - {nytimes_df.just_date.max()}"
'2022-03-09 - 2023-03-23'
# Daily frequency of messages
daily_count = nytimes_df.groupby("just_date").size().reset_index(name="freq")
daily_count.head()
[Image: first rows of daily_count]
# Plot frequency of messages
daily_count.plot(x="just_date", y="freq", xlabel="Date", figsize=(20, 10))
<Axes: xlabel='Date'>
[Figure: line plot of daily message frequency (freq) by date]
# They posted ~6 messages per day on average over the past year
daily_count.freq.mean()
6.180327868852459

On average, NYT posted ~6 messages per day over the past year (averaged over the days with at least one post).
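
Because the daily counts come from a groupby on the posting date, days without any posts are not part of the average. The toy example below, with made-up dates, shows how the two ways of averaging differ:

```python
from collections import Counter
from datetime import date

# Made-up posting dates: 3 posts, then 1 post, then a silent day, then 2 posts
post_dates = (
    [date(2023, 3, 20)] * 3
    + [date(2023, 3, 21)]
    + [date(2023, 3, 23)] * 2
)

per_day = Counter(post_dates)  # like groupby("just_date").size()
active_days = len(per_day)
calendar_days = (max(post_dates) - min(post_dates)).days + 1

mean_per_active_day = len(post_dates) / active_days
mean_per_calendar_day = len(post_dates) / calendar_days
print(mean_per_active_day, mean_per_calendar_day)
```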

# Almost all of the messages have associated media
nytimes_df.media.isna().value_counts()
False 2256
True 6
Name: media, dtype: int64
# Let's see what media is shared
import re
nytimes_df["media_type"] = nytimes_df.media.apply(lambda a: re.sub("MessageMedia", "", a["_"]) if pd.notnull(a) else a)
# Create a plot
nytimes_df["media_type"].value_counts(normalize=True, dropna=False).plot(
    kind="bar",
    ylabel="Prop of messages",
    xlabel="Media Type",
    figsize=(20, 10))
<Axes: xlabel='Media Type', ylabel='Prop of messages'>
[Figure: bar chart of the proportion of messages by media type]

Almost all of the messages have associated media, mostly photos.

import matplotlib.pyplot as plt
# How long are the messages?
num_words = nytimes_df.message.str.split().str.len()
num_words = num_words[num_words > 0]

num_words.hist(bins=50, figsize=(20, 10))
plt.axvline(x=num_words.median(), color="red")

# Most are up to 200 words long, with a median of ~80
<matplotlib.lines.Line2D at 0x138cc4760>
[Figure: histogram of message length in words, with the median marked by a red vertical line]

What are they posting about?

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
# nltk.download("stopwords")  # uncomment on first use to download the stop word list
STOP = stopwords.words('english')
text = nytimes_df["message"].dropna()
text = text[text.str.len() > 0]
text
0 Here are some of the stories we’re covering fr...
1 Russia's deputy foreign minister said the risk...
2 Here are some of the stories we’re covering fr...
3 Audio recordings obtained by The New York Time...
4 Here are some of the stories we’re covering fr...
...
2255 Street battles hit a Kyiv suburb, some of the ...
2256 U.S. Officials Say Superyacht Could be Putin’s...
2257 After a night of shelling, Ukrainians assess t...
2259 The low rumble of heavy artillery fire echoed ...
2260 Welcome to the Telegram channel from The New Y...
Name: message, Length: 1574, dtype: object

Get the top 30 words

The top words are related to the war in Ukraine. This makes sense given that NYT started their Telegram channel in response to the Russian invasion, specifically to provide information to Ukrainians.

# Get Tf-idf
vectorizer = TfidfVectorizer(stop_words=STOP)  # remove stop words
dtm = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# vectorizer.vocabulary_
# Top 30 words by summed tf-idf weight
sum_words = dtm.sum(axis=0)  # a 1x9211 matrix
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

words_freq[0:30]
[('ukraine', 74.75422446047203),
('russian', 63.95892182328106),
('russia', 63.615283734213754),
('ukrainian', 50.37482809696707),
('war', 43.82904992818235),
('said', 43.54295266526063),
('nytimes', 41.77337597405641),
('read', 41.634014616441455),
('city', 36.600083866380245),
('forces', 35.650601441088384),
('military', 33.39564497543622),
('putin', 32.55728695802786),
('president', 31.170918680973834),
('officials', 28.024244521553427),
('zelensky', 25.96130773755016),
('moscow', 25.571666056988455),
('kyiv', 25.183398106660228),
('country', 23.390986444553423),
('nuclear', 21.360840499464178),
('eastern', 21.318021456366925),
('new', 21.113055132212214),
('mr', 20.028885709969316),
('people', 19.792430866276085),
('one', 19.47443238113837),
('region', 18.96873905462279),
('would', 18.7245990014243),
('invasion', 18.659302727248615),
('plant', 18.55612245735065),
('kherson', 18.256492944730066),
('tuesday', 17.982956968670173)]

Let’s find topics they are talking about using KMeans clustering

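In the output below, we see topics related to attacks and specific incidents (e.g. clusters 0, 1, 5 and 7), likely topics about international responses such as energy, grain and military aid (clusters 4 and 9), the Zaporizhzhia nuclear plant (cluster 3), Brittney Griner (cluster 2), Zelensky (cluster 6) and Putin (cluster 8). Note that KMeans is randomly initialized, so cluster numbers can change between runs unless random_state is set.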

# What are the different topics discussed?
from sklearn.cluster import KMeans
K = 10
kmeans_10 = KMeans(n_clusters=K)
kmeans_10.fit(dtm)
KMeans(n_clusters=10)
# Sort each centroid's terms by weight, highest first
order_centroids = kmeans_10.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(K):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
Cluster 0:
missiles
drones
least
strikes
ukraine
air
russia
kyiv
people
across
Cluster 1:
kherson
forces
russian
ukrainian
city
region
ukraine
russia
military
south
Cluster 2:
griner
brittney
star
basketball
drug
court
charges
american
lawyers
trial
Cluster 3:
nuclear
plant
zaporizhzhia
power
watchdog
shelling
ukraine
agency
inspectors
facility
Cluster 4:
russia
grain
ukraine
european
oil
nato
gas
war
energy
nytimes
Cluster 5:
bakhmut
city
wagner
eastern
group
ukrainian
russian
forces
soledar
ukraine
Cluster 6:
zelensky
president
volodymyr
ukraine
mr
russia
said
visit
russian
war
Cluster 7:
russian
ukrainian
city
mariupol
said
civilians
ukraine
kyiv
people
killed
Cluster 8:
putin
russia
war
ukraine
russian
president
vladimir
read
nytimes
said
Cluster 9:
ukraine
billion
aid
weapons
tanks
send
defense
states
military
united
