Telegram Group/Channel Data Extraction (User’s information, chats, and specific messages), and Data Processing
Extracting Telegram data with the Telethon package and processing it into a dataframe of the required information.
Telegram is a cloud-based instant messaging and voice over IP service developed by Telegram Messenger LLP, a privately held company registered in London, United Kingdom.
Many blockchain and cryptocurrency companies use Telegram to communicate with their customers and supporters worldwide because of its unique features.
Telegram group data, such as users' information and the chats of specific channels, is analyzed to get insights into channels, to collect airdrop participants' info, and so on.
In this tutorial, I will explain step by step how to get users' information, chats, and messages that contain a keyword.
Step 1. Prerequisites.
First of all, you need a Telegram account in order to extract information. Follow the steps below to get your credentials (API ID and API hash).
- Register on Telegram with your mobile number.
- Create a Telegram app here (Telegram's developer site is my.telegram.org, under "API development tools").
- Note down the app's api_id and api_hash.
- Join the Telegram group whose information you need. You can join via the group's share link (example link: https://t.me/c0ban_global).
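The group name used later in this tutorial is the last part of the share link. As a minimal sketch, assuming a public link of the form https://t.me/<name>, it could be pulled out with the standard library. The helper name `group_name_from_link` is my own for illustration and is not part of Telethon; private invite links are not handled.

```python
from urllib.parse import urlparse


def group_name_from_link(link: str) -> str:
    """Return the public group name from a t.me share link (hypothetical helper)."""
    path = urlparse(link).path            # e.g. "/c0ban_global"
    name = path.strip("/").split("/")[0]  # first path segment
    if not name:
        raise ValueError(f"no group name found in link: {link!r}")
    return name


print(group_name_from_link("https://t.me/c0ban_global"))  # c0ban_global
```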
Step 2. Telethon installation.
Now install the Telethon Python package on your system with the terminal command pip install telethon.
Step 3. Telegram client creation.
Create the Telegram client.
# Importing sync lets the client methods be called without async/await
from telethon import TelegramClient, sync
import pandas as pd

api_id = 123456  # placeholder: replace with your own API ID (an integer)
api_hash = 'Your API Hash'
group_username = 'your group name'
# The group name can be found in the group link
# (example group link: https://t.me/c0ban_global, group name = 'c0ban_global')

client = TelegramClient('session_name', api_id, api_hash).start()
# You will be asked to enter your mobile number; enter it with the country code
# Then enter the OTP (check your Telegram inbox for it)
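If you prefer not to keep the credentials in the script itself, one option is to read them from environment variables. This is only a sketch: the variable names TELEGRAM_API_ID and TELEGRAM_API_HASH and the helper `load_credentials` are assumptions made for this example, not something Telethon defines.

```python
import os


def load_credentials(env=os.environ):
    """Read Telegram API credentials from environment variables.

    TELEGRAM_API_ID and TELEGRAM_API_HASH are names chosen for this sketch;
    Telethon itself does not look for them.
    """
    api_id = int(env["TELEGRAM_API_ID"])  # api_id must be an integer
    api_hash = env["TELEGRAM_API_HASH"]   # api_hash is a string
    return api_id, api_hash
```

The returned values can then be passed to TelegramClient in place of the hard-coded api_id and api_hash.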
Step 4. Getting Users' Information.
Extract the users' information as a list using client.get_participants. The participants list has the format below.
User(id=357343635, is_self=False, contact=False, mutual_contact=False, deleted=False, bot=False, bot_chat_history=False, bot_nochats=False, verified=False, restricted=False, min=False, bot_inline_geo=False, access_hash=-7182373398681298465, first_name='DC', last_name='Aichara', username='dcaichara', phone=None, photo=UserProfilePhoto(photo_id=1534779226214999983, photo_small=FileLocation(dc_id=2, volume_id=238230043, local_id=76130, secret=-6996523002233662137, file_reference=b'\x00\\\x90s\xa46\x89\xa0/Z0\xc0K]\x8a!:\x15\x8f\x07\x90'), photo_big=FileLocation(dc_id=2, volume_id=238230043, local_id=76132, secret=-2346907879263725163, file_reference=b'\x00\\\x90s\xa4w\xc0\xa5K"l+\xe0\x94\x9b\xb4\xa9\xf2\x07\xbe\xe8')), status=UserStatusOffline(was_online=datetime.datetime(2019, 3, 19, 0, 42, 32, tzinfo=datetime.timezone.utc)), bot_info_version=None, restriction_reason=None, bot_inline_placeholder=None, lang_code=None)
You can extract the required information from the participants list and process it further to get a dataframe of the intended information.
participants = client.get_participants(group_username)
# This code can be used to extract details of up to 10k users
# Let's get first name, last name and username
firstname = []
lastname = []
username = []
if len(participants):
    for x in participants:
        firstname.append(x.first_name)
        lastname.append(x.last_name)
        username.append(x.username)

# list to data frame conversion
data = {'first_name': firstname, 'last_name': lastname, 'user_name': username}
userdetails = pd.DataFrame(data)
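Many users have no last name or username, so those fields come back as None. A small sketch of one way to normalize them before building the dataframe is shown here. The function `participants_to_rows` and the stand-in objects are hypothetical and only for illustration; real input would be the participants list returned by client.get_participants.

```python
from types import SimpleNamespace


def participants_to_rows(participants):
    """Turn user objects into plain dicts, replacing missing fields with ''."""
    rows = []
    for user in participants:
        rows.append({
            "first_name": user.first_name or "",
            "last_name": user.last_name or "",
            "user_name": user.username or "",
        })
    return rows


# Stand-in objects for illustration only (the second one is made up)
sample = [
    SimpleNamespace(first_name="DC", last_name="Aichara", username="dcaichara"),
    SimpleNamespace(first_name="Ann", last_name=None, username=None),
]
rows = participants_to_rows(sample)
print(rows[1])  # {'first_name': 'Ann', 'last_name': '', 'user_name': ''}
```

A list of dicts like this can be passed directly to pd.DataFrame, which gives the same three columns as the code above.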
Step 5. Getting Chats.
We use client.get_messages to get the chat history. See the chat history format below.
Message(id=105406, to_id=PeerChannel(channel_id=1050637540), date=datetime.datetime(2019, 3, 18, 21, 46, 40, tzinfo=datetime.timezone.utc), message='Hello everyone ?', out=False, mentioned=False, media_unread=False, silent=False, post=False, from_scheduled=False, from_id=667966548, fwd_from=None, via_bot_id=None, reply_to_msg_id=None, media=None, reply_markup=None, entities=[], views=None, edit_date=None, post_author=None, grouped_id=None)
n = 100  # number of messages to be extracted (example value)
chats = client.get_messages(group_username, limit=n)

# Get message id, message, sender id, reply to message id, and timestamp
message_id = []
message = []
sender = []
reply_to = []
time = []
if len(chats):
    for chat in chats:
        message_id.append(chat.id)
        message.append(chat.message)
        # In newer Telethon releases from_id may be a Peer object;
        # chat.sender_id gives the integer ID there
        sender.append(chat.from_id)
        reply_to.append(chat.reply_to_msg_id)
        time.append(chat.date)

data = {'message_id': message_id, 'message': message, 'sender_ID': sender, 'reply_to_msg_id': reply_to, 'time': time}
df = pd.DataFrame(data)
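Some entries in the chat history carry no text (for example media-only posts), so their message field is empty or None. As a rough sketch, such entries could be skipped before building the dataframe. The helper `text_messages_to_rows` and the stand-in objects are assumptions for illustration; the first sample reuses values from the example message above, and the second one is made up.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def text_messages_to_rows(chats):
    """Keep only messages that carry text and return them as dicts."""
    rows = []
    for chat in chats:
        text = getattr(chat, "message", None)
        if not text:  # None or empty string: nothing to analyze
            continue
        rows.append({
            "message_id": chat.id,
            "message": text,
            "reply_to_msg_id": chat.reply_to_msg_id,
            "time": chat.date.isoformat(),
        })
    return rows


# Stand-in objects for illustration only
sample_chats = [
    SimpleNamespace(id=105406, message="Hello everyone ?", reply_to_msg_id=None,
                    date=datetime(2019, 3, 18, 21, 46, 40, tzinfo=timezone.utc)),
    SimpleNamespace(id=105407, message=None, reply_to_msg_id=None,
                    date=datetime(2019, 3, 18, 21, 47, 0, tzinfo=timezone.utc)),
]
message_rows = text_messages_to_rows(sample_chats)
print(len(message_rows))         # 1
print(message_rows[0]["time"])   # 2019-03-18T21:46:40+00:00
```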
Step 6. Extracting Specific Messages.
Use client.iter_messages to get the messages that contain a specific keyword. Below is an example message for the keyword c0ban in the chat history of the Telegram channel c0ban Global Community.
Message(id=29, to_id=PeerChannel(channel_id=1272398905), date=datetime.datetime(2018, 11, 16, 8, 17, 9, tzinfo=datetime.timezone.utc), message='Welcome to c0ban global community 👋', out=False, mentioned=False, media_unread=False, silent=False, post=False, from_scheduled=False, from_id=656819292, fwd_from=None, via_bot_id=None, reply_to_msg_id=None, media=None, reply_markup=None, entities=[], views=None, edit_date=None, post_author=None, grouped_id=None)
messages = []
time = []
for message in client.iter_messages(group_username, search='keyword'):
    messages.append(message.message)  # get messages
    time.append(message.date)  # get timestamp

data = {'time': time, 'message': messages}
df = pd.DataFrame(data)
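The search above is done by Telegram on the server side. If you have already downloaded the chat history in Step 5, a local filter is another option. The sketch below is a plain case-insensitive substring match written for this article; `filter_by_keyword` is a hypothetical helper, and its results are not guaranteed to match what the server-side search returns.

```python
def filter_by_keyword(texts, keyword):
    """Return the texts that contain `keyword`, ignoring case."""
    needle = keyword.casefold()
    # Skip None or empty entries (messages without text)
    return [t for t in texts if t and needle in t.casefold()]


texts = ["Welcome to c0ban global community", "Hello everyone", None]
print(filter_by_keyword(texts, "C0BAN"))  # ['Welcome to c0ban global community']
```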
I hope you liked this article. The complete code is available on GitHub. Reach out to me on LinkedIn or Twitter if you have any questions.
Reference: https://media.readthedocs.org/pdf/telethon/stable/telethon.pdf
Note: To extract a private channel's data, you must have admin privileges.