Analyzing emojis😀

Abbas Makhzomi
5 min read · Jul 9, 2022

In this article, I'm going to show you how I used the Apriori algorithm to analyze emojis🤨.

Analyzing data is practically a requirement these days; every programmer should know at least the basics of it. So I set out to learn the basics of data mining, ran a fun little experiment along the way, and decided to share it🤓.

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. As you may know, it is often used for market basket analysis, but with a bit of creativity it can serve other fun purposes too. I used Apriori to analyze the emojis that people often use together😎.

Full code is available at: https://github.com/llabbasmkhll/emoji-apriori

So let us begin🤝.

First of all, we need data for the analysis. I'm going to use Python to extract messages from the Telegram messenger.

# export.py
from telethon import TelegramClient
import asyncio
import json
import sys

sys.setrecursionlimit(10000)

# paste your api credentials from my.telegram.org
api_id = 123456  # paste your api id here
api_hash = 'paste your api hash here'
# set the number of messages to extract from each group
limit = 3000

async def main():
    async with TelegramClient('session', api_id, api_hash) as client:
        dialogs = await client.get_dialogs()
        dialogs = [d for d in dialogs if d.is_group]
        result = []
        for dialog in dialogs:
            entity = await client.get_entity(dialog.entity)
            async for message in client.iter_messages(entity, limit=limit):
                try:
                    print('{:>14}: {} : {} : {}'.format(message.id,
                        message.sender.first_name, message.text, message.date))
                    message_dict = message.to_dict()
                    message_dict['sender'] = message.sender.to_dict()
                    result.append(message_dict)
                except Exception:
                    # skip messages without a sender or other malformed entries
                    pass
        json_object = json.dumps(result, default=str)
        with open("groups-messages-" + str(limit) + ".json", "w") as outfile:
            outfile.write(json_object)

if __name__ == '__main__':
    asyncio.run(main())

After replacing the API credentials from http://my.telegram.org, run the code to get a groups-messages-<limit>.json file. This file contains the messages from your Telegram groups.

If you can't get the code running, I have provided my extraction in this Google Drive folder; just download it and you can follow along. Keep in mind that my groups' language was Persian.

The next step is to create a new Jupyter notebook and add a code block.

import pandas as pd
import numpy as np

As always, import NumPy and pandas. NumPy helps with the numerical work and pandas helps us handle the dataset.

Let's load the data and take a quick look at the column types with this code block. That is the first step of our data mining process; I'll call it getting to know your data.

messages = pd.read_json("groups-messages-1000.json")
messages.info()

This outputs general information about our dataset. As you can see, the dataset has 31 columns, and it's obvious that we don't need all of them.

We only need the one that holds the body of the message: the message column. So messages['message'] is the series that we want to analyze.

# prepare dataset for analysis
# convert empty messages to null and remove them
messages['message'].replace('', np.nan, inplace=True)
messages.dropna(subset=['message'], inplace=True)
messages.head()

I'll call this the preparing step. After running this code, we'll have this output:

So, basically, Apriori takes every customer's transaction and, based on the items that the customer bought together, derives a set of rules.
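For intuition, the two numbers behind every rule are support (how often the items appear together) and confidence (how often the consequent shows up when the antecedent does). A hand-rolled sketch on made-up transactions:

```python
# made-up transactions: the set of emojis used in each message
transactions = [
    {'😂', '🤣'},
    {'😂', '🤣', '❤️'},
    {'😂'},
    {'❤️'},
]
n = len(transactions)

# support of the rule 😂 => 🤣: fraction of transactions containing both emojis
support_joint = sum(1 for t in transactions if {'😂', '🤣'} <= t) / n
# confidence: among transactions with 😂, the fraction that also contain 🤣
support_ant = sum(1 for t in transactions if '😂' in t) / n
confidence = support_joint / support_ant

print(support_joint)  # 0.5
print(confidence)     # ≈ 0.667
```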

In our situation the items are emojis, so it makes sense to delete messages that don't contain any emoji. We can do that with this code block.

import emoji

# note: emoji.get_emoji_regexp() was removed in emoji 2.x, so pin emoji<2.0
messages.drop(messages[messages['message'].str.match(emoji.get_emoji_regexp()) == False].index, inplace=True)
messages.head()

That leaves us with a dataset of messages that contain emojis:

Afterward, we need to reshape the dataset into the schema that Apriori accepts and understands. I'll call this stage the transforming step.

For Apriori, we need a column for each item (product, so to speak). To do this, run this code:

# create a new operation dataframe to transform the dataset into a new format
emojis = pd.DataFrame(columns=emoji.EMOJI_DATA)
emojis['message'] = messages['message']
emojis.head()

This gives us a dataset with 4704 columns 🤯: a column for each emoji, plus one for the message itself.

As you can tell, all the rows are NaN. We can fill them in according to the message body with this code block.

for column in emojis:
    if column == 'message':
        continue
    try:
        # regex=False treats the emoji as a literal string, not a pattern
        emojis[column] = emojis.message.str.contains(column, regex=False)
    except Exception:
        pass
emojis.fillna(False, inplace=True)
emojis.drop(columns=['message'], inplace=True)
emojis.tail()

This leaves us with the following dataset:


To mine the rules, we first have to extract frequent itemsets with the Apriori algorithm. Thankfully, Python makes this super easy. We'll do it by running this code:

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
frequent_itemsets = apriori(emojis, min_support=0.009, use_colnames=True)
frequent_itemsets

Finally, we feed them to the association rule function:

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=10)
rules.sort_values('confidence', ascending=False, inplace=True)
rules

With that done, we have our rules.

How you choose good rules is up to you.

I chose them by considering lift and support and simply inspecting each rule. The best rules that I found are the following:

  • 😂=>🤣
  • 👩‍🦯=>🦯
  • 😍=>❤️
  • 😒=>😤
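To shortlist rules programmatically instead of eyeballing them, one option is to filter the rules DataFrame on support and lift and then sort. A sketch on a hand-built stand-in with the same columns as the mlxtend association_rules output (the values and thresholds here are made up and would need tuning on real data):

```python
import pandas as pd

# stand-in for the `rules` DataFrame produced above (same column names; made-up values)
rules = pd.DataFrame({
    'antecedents': [frozenset(['😂']), frozenset(['😍']), frozenset(['🙃'])],
    'consequents': [frozenset(['🤣']), frozenset(['❤️']), frozenset(['🥶'])],
    'support':     [0.020, 0.012, 0.001],
    'confidence':  [0.62, 0.48, 0.90],
    'lift':        [35.1, 18.7, 300.0],
})

# drop rare rules first (low support), then demand a strong association (high lift);
# this filters out rules like the third one, whose huge lift rests on almost no data
good = rules[(rules['support'] >= 0.01) & (rules['lift'] >= 15)]
good = good.sort_values('lift', ascending=False)
print(good[['antecedents', 'consequents', 'lift']])
```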

So that's it for this journey. I hope you enjoyed it and learned something, because I did.

A clap would be awesome😌❤️


Abbas Makhzomi

A web artisan, eager to solve problems and learn more.