Emoji Prediction

Oyku Akkoyun · Published in Turkcell · Dec 6, 2021 · 7 min read

Social media occupies a large and important place in our lives. Given this high level of usage, it is clear how important it is to understand it. In particular, Natural Language Processing studies have found a rich field of work in social media data.

Social media, which people use to communicate and to express their likes and opinions, offers many areas that can be analyzed and contribute to the literature. One of them is emoji prediction, which also contributes to sentiment analysis. Emojis now appear in almost every sentence, and they can often replace words, since they convey emotions and information at least as effectively as words do. With their growing usage, they have in fact created a universal language of their own.

Fig1. Emoji Usage [1]

Considering the large data set that comes with this high usage rate, emojis are an obvious candidate for sentiment analysis. In this work, I tried to predict emojis from Twitter data. A corpus suitable for the prediction task was created from a tweet data set. The two models used for emoji prediction are Support Vector Machine and Multinomial Naive Bayes.

  • DATA

The data was pulled from Kaggle. The data set contains 70,000 tweets in total, each labeled with an emoji. There are 20 emoji labels, mostly hearts, along with smiling face with sunglasses, winking face, christmas tree, etc. First, the distribution of tweets across these emoji labels was examined. The distribution is far from uniform, and some emoji classes have far fewer tweets than others. Emoji labels with similar meanings were therefore merged, reducing the number of classes to four: Joy, Vacation, Sadness (dark heart), Love.
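As an illustration, here is a minimal sketch of how the fine-grained emoji labels could be merged into the four broader classes. The file name, column names, and the concrete label names in the mapping are assumptions for illustration, not the exact code from the original notebook:

```python
import pandas as pd

# Hypothetical file and column names; the Kaggle dataset's actual schema may differ.
df = pd.read_csv("tweets_with_emojis.csv")  # columns assumed: TEXT, LABEL

# Example mapping from fine-grained emoji labels to the 4 merged classes.
# The emoji names here are illustrative, not the full original list of 20.
label_map = {
    "red_heart": "Love", "two_hearts": "Love", "heart_eyes": "Love",
    "face_with_tears_of_joy": "Joy", "smiling_face_with_sunglasses": "Joy",
    "winking_face": "Joy",
    "christmas_tree": "Vacation", "sun": "Vacation", "camera": "Vacation",
    "black_heart": "Sadness",
}
df["CLASS"] = df["LABEL"].map(label_map)

# Inspect the (imbalanced) class distribution.
print(df["CLASS"].value_counts())
```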

Fig2. Data Distribution
  • LEMMATIZATION PROCESS

Preprocessing continues by removing stop words while building the TF-IDF model. The WordNet corpus was used while creating the word vectors, and each word was tagged as a noun, verb, adjective, or adverb. After the lemmatization process, the data set looks as in the table below. For each sentence, the TEXT_FINAL column contains the word vectors before and after lemmatization. A minimal sketch of this step is shown below.
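The sketch uses NLTK's WordNet lemmatizer; the POS-tag mapping and the column names are assumptions made for illustration, not necessarily what the original notebook does:

```python
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Resources needed by the tokenizer, tagger, and lemmatizer.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the WordNet POS tags the lemmatizer expects.
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_tweet(text):
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)
    return " ".join(
        lemmatizer.lemmatize(word, to_wordnet_pos(tag))
        for word, tag in tagged
        if word not in stop_words
    )

df["TEXT_FINAL"] = df["TEXT"].apply(lemmatize_tweet)
```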

Fig3. Before Lemmatization Process

Punctuation marks and other meaningless characters or expressions were not removed in the preprocessing step; there was no need for such a step, since only root words remain after lemmatization.

Fig4. After Lemmatization Process

After this step, the data set was split into 80% training and 20% test data. The class labels were then converted from text to numeric values with LabelEncoder.
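A minimal sketch of the split and the label encoding with scikit-learn; the variable names follow the hypothetical data frame above:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = df["TEXT_FINAL"]  # lemmatized tweets
y = df["CLASS"]       # one of: Joy, Vacation, Sadness, Love

# 80% train / 20% test; stratify keeps the class ratios similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Encode the textual class labels as integers (0..3).
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
y_test_enc = encoder.transform(y_test)
```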

‘TF-IDF is a statistical measure that evaluates how relevant each word in a document collection is to a document.’ [2] The method multiplies two values: how often a word appears in a document (term frequency, TF) and the inverse of how many documents in the collection contain it (inverse document frequency, IDF). [3]

TF Formula: TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
IDF Formula: IDF(t) = log(N / number of documents containing t), where N is the total number of documents

The IDF value remains constant unless new documents are added, but the TF value differs for each word in each sentence; therefore, TF is calculated separately for each sentence.
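To make the two formulas concrete, here is a tiny worked example (not from the original post) that computes TF, IDF, and their product by hand on a made-up three-document corpus:

```python
import math

docs = [
    "i love this sunny vacation",
    "i love you",
    "sunny sunny day",
]

def tf(term, doc):
    # Term count divided by the total number of terms in the document.
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Log of (number of documents / number of documents containing the term).
    n_containing = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / n_containing)

# TF differs per document, IDF is shared across the whole corpus.
print(tf("sunny", docs[2]))                        # 2/3 ≈ 0.67
print(idf("sunny", docs))                          # log(3/2) ≈ 0.41
print(tf("sunny", docs[2]) * idf("sunny", docs))   # ≈ 0.27
```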

TF-IDF Vectorizer and Count Vectorizer were used together, combined through a pipeline architecture. The pipeline made the process very fast: normally Naive Bayes trains quickly while SVM takes much longer, but with the pipeline both training runs finished in roughly the same time.
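A minimal sketch of such a pipeline in scikit-learn, assuming CountVectorizer followed by TfidfTransformer and a classifier; the step names and parameters are assumptions, and the original notebook's exact configuration may differ:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naive Bayes pipeline: raw token counts -> TF-IDF weights -> classifier.
nb_pipeline = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# The same feature steps can be reused with an SVM as the final step.
svm_pipeline = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])

nb_pipeline.fit(X_train, y_train_enc)
svm_pipeline.fit(X_train, y_train_enc)
```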

  • TRAINING & TESTING
  1. Multinomial Naive Bayes

Naive Bayes is a probabilistic method used in classification, especially in NLP. In this classifier, the presence or absence of one feature does not affect any other feature. The algorithm predicts the label of a given text and is based on Bayes' theorem: the probability of each tag is computed over the sample set, and the tag with the highest probability is chosen.

Bayes’ theorem looks at the conditions associated with an event and calculates the probability of that event occurring based on prior knowledge of those conditions.

Bayes’ Theorem: P(A|B) = P(B|A) · P(A) / P(B)

Multinomial Naive Bayes uses the term frequency, i.e. how many times a particular term appears in a document. The term frequency is normalized by dividing by the document length. After this normalization, term frequencies can be used to compute maximum likelihood estimates of the conditional probabilities from the training data. [4]
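As a concrete illustration (not code from the original post), the smoothed estimate of P(term | class) that Multinomial Naive Bayes relies on can be written as a small helper; the counts below are made up:

```python
def conditional_prob(term_count, total_terms_in_class, vocab_size, alpha=1.0):
    """P(term | class) with Laplace/Lidstone smoothing, as MultinomialNB estimates it."""
    return (term_count + alpha) / (total_terms_in_class + alpha * vocab_size)

# e.g. the term "love" appears 120 times in the Love class, which has
# 10,000 total term occurrences over a 5,000-word vocabulary (made-up numbers):
print(conditional_prob(120, 10_000, 5_000))  # ≈ 0.0081
```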

Naive Bayes variants can handle both discrete and continuous data. The method is easy to implement and highly scalable, so it copes well with large data sets. Despite these advantages, its prediction accuracy can be lower than that of other probabilistic algorithms. Although it is not suitable for regression, it is a very good method for text classification.

The emoji prediction results after MNB training are as follows:

Multinomial Naive Bayes (MNB) Confusion Matrix
MNB Accuracy Scores

Looking at the F1 score, the prediction success of Multinomial Naive Bayes is 67% on this data set.
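A minimal sketch of how such a confusion matrix and the per-class scores can be produced with scikit-learn, assuming the pipeline and encoded labels defined in the earlier sketches:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_pred = nb_pipeline.predict(X_test)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test_enc, y_pred))

# Per-class precision, recall, and F1, plus weighted averages.
print(classification_report(y_test_enc, y_pred, target_names=encoder.classes_))
```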

2. Support Vector Machines

SVM is a supervised learning method often used for classification problems. It classifies by separating data points in a plane: a line is drawn, with a different class on each side of it, and the goal is to place this line at the maximum distance from both classes.

Fig5. Support Vector Machine [5]

Hyperplanes are used to classify data points: points that fall on different sides of the hyperplane belong to different classes. The dimension of the hyperplane is related to the number of features in the data.

There are two classes in Figure 5, blue and red. The main question in classification is which class a new data point should be assigned to. The ±1 band around the line separating the two classes is called the margin; the larger the margin, the more accurate the classification will be, whether there are two classes or more. [5]
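For completeness, a sketch of using the SVM pipeline defined earlier; decision_function returns the signed distance of each sample to the per-class separating hyperplanes (one-vs-rest for the four classes), which is exactly the margin idea described above. Variable names are the assumed ones from the previous sketches:

```python
svm_pred = svm_pipeline.predict(X_test)

# Signed distances to the per-class hyperplanes; a larger positive value means
# the sample lies farther from the hyperplane on that class's side.
margins = svm_pipeline.decision_function(X_test)
print(margins.shape)     # (n_test_samples, 4)
print(svm_pred[:10])     # first ten predicted class indices
```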

The emoji prediction results after SVM training are as follows:

SVM Confusion Matrix

The confusion matrix shows how the true labels match the predictions after Support Vector Machine training.

SVM Accuracy Scores

Looking at the F1 score, the prediction success of the Support Vector Machine is 71% on this data set.

CONCLUSION

Neither prediction can be said to fail. One factor affecting the results is that the data set does not consist entirely of English tweets; the sentences also contain Italian, French, and Chinese words. Another factor is sarcasm, which causes problems in every sentiment analysis task: a sentence labeled as joy may actually be sarcastic, which lowers the success rate.

Beyond all these factors, the biggest problem is the distribution of emojis in the data set. Reducing the number of classes partially addressed this, but since classes were merged, it cannot be said that a fully correct data set was obtained.

You can download the Jupyter notebook and CSV file from here.

References

  1. https://www.statista.com/chart/17275/number-of-emojis-from-1995-bis-2019/
  2. https://www.semanticscholar.org/paper/Using-Neural-Networks-to-Predict-Emoji-Usage-from-Zhao-Zeng/453769b9e338a6ebf6026225515df8fb012a11e3
  3. https://towardsdatascience.com/word2vec-research-paper-explained-205cb7eecc30
  4. https://towardsdatascience.com/mse-is-cross-entropy-at-heart-maximum-likelihood-estimation-explained-181a29450a0b
  5. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
