Emoji2vec

Where did emojis come from?

Caroline Vanacore
6 min read · May 24, 2019

The first widely used set of emojis was created in 1999 by Japanese interface designer Shigetaka Kurita while he was working for the mobile carrier NTT DoCoMo. The set consisted of 176 images, each 12×12 pixels, inspired by Chinese characters and by symbols and expressions he observed around the city, such as street signs, weather symbols, and expressions of emotion.

In 2010, emojis started being incorporated into Unicode, the international standard for encoding text characters, which allowed them to become popular outside of Japan. In 2011, Apple added the official emoji keyboard to iOS, and Android followed suit in 2013.

Each year, the Unicode Consortium, a non-profit organization founded to “develop, extend, and promote the use of the Unicode Standard and related globalization standards,” reviews new emoji proposals through a formal submission and approval process and adds the accepted ones to the standard.

Why do we care?

Since Apple added the official emoji keyboard, emoji use has become increasingly prevalent. In 2015, emoji usage increased by over 800%, and Oxford Dictionaries named the “Face with Tears of Joy” emoji its Word of the Year. In 2017, Facebook reported that more than 60 million emojis are used on Facebook each day, and more than 5 billion are sent on Messenger each day.

Source: https://en.oxforddictionaries.com/word-of-the-year/word-of-the-year-2015

As emoji use online grows, companies have been incorporating emojis into their digital marketing strategies as a way to connect with users. Business Insider reported that the use of emojis in marketing grew by 775% in 2016.

Given that emoji usage is so prevalent and that most emojis carry inherent emotional meaning, they can offer a lot of insight into the sentiment of online text.

Emojis as vectors

Last week, while doing natural language processing on tweets, I copied and pasted a couple of emojis into a Jupyter notebook. I noticed that most of them showed up normally, but a few showed up as a combination of multiple other emojis. When I tried to paste the female scientist, it showed up as two separate emojis: a woman and a microscope.

This reminded me of the analogies in word2vec.

Some research revealed that what was happening in the Jupyter notebook had to do with Unicode representations and a Unicode character called the zero-width joiner, or ZWJ (sometimes pronounced “zwidge”). Most emojis are represented by a single Unicode character, but many are represented by combinations of Unicode characters joined by a ZWJ. When the ZWJ is placed between two characters, they are rendered in their connected form, if one exists and is supported by the operating system in use. Otherwise, the fallback is rendered: the emojis printed separately.

My Jupyter notebook doesn't have support for some of the newer emoji combinations, so the two emojis were printed separately.
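To see the mechanics concretely, here is a minimal Python sketch. The code point values come from the Unicode standard; the woman-scientist sequence is one documented ZWJ combination:

```python
# The zero-width joiner is U+200D. "Woman scientist" is the sequence
# WOMAN (U+1F469) + ZWJ + MICROSCOPE (U+1F52C).
woman = "\U0001F469"       # 👩
zwj = "\u200D"             # zero-width joiner (invisible on its own)
microscope = "\U0001F52C"  # 🔬

# On platforms that support the sequence, this renders as a single
# woman-scientist glyph; elsewhere it falls back to 👩 and 🔬 side by side.
woman_scientist = woman + zwj + microscope
print(woman_scientist)

# The single visible glyph still contains three code points.
for ch in woman_scientist:
    print(f"U+{ord(ch):04X}")
```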

So it turned out vectorization wasn't the reason the emojis were being broken in two in the Jupyter notebook, but the search led me to something called emoji2vec.

Emoji2vec #1: The skip-gram method

I read about two different approaches to vectorizing emojis based on their meaning, in the same way word2vec vectorizes words.

The first approach trained emoji embeddings on a dataset of 100 million English tweets. The authors used the skip-gram method, a generalization of n-grams in which the words don't need to be consecutive; they just need to fall within a fixed context window.

The model uses pairs of a target word and a context word to estimate, for every word in the vocabulary, the probability that it appears near the target. Words found in similar contexts therefore end up with similar vectors and land in the same region of the vector space. Emojis that appear around similar words are judged to be more similar to each other.
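As a rough illustration (not the paper's actual code), here is how skip-gram embeddings could be trained with gensim 4 on tokenized tweets, treating emojis as ordinary tokens. The toy corpus is made up:

```python
from gensim.models import Word2Vec

# Toy corpus: tokenized tweets with emojis kept as tokens.
tweets = [
    ["happy", "birthday", "🎉", "🎂"],
    ["so", "sad", "today", "😢"],
    ["love", "this", "song", "❤️", "🎶"],
]

model = Word2Vec(
    tweets,
    vector_size=100,  # embedding dimensionality
    window=5,         # tokens up to 5 positions away count as context
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    min_count=1,      # keep every token in this tiny corpus
)

# Emojis that appear around similar words end up with similar vectors.
print(model.wv.most_similar("🎉", topn=3))
```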

They measured the effectiveness of their model by comparing cosine similarities (how similar the vectors are to each other) against a manually labeled dataset of emoji pairs, in which annotators were asked to score each pair for “similarity.”
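The metric itself is simple. A self-contained numpy sketch, with random stand-ins for two learned emoji vectors, looks like this:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random stand-ins for two learned emoji vectors.
rng = np.random.default_rng(0)
vec_party, vec_cake = rng.standard_normal(100), rng.standard_normal(100)
print(f"{cosine_similarity(vec_party, vec_cake):.3f}")
```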

Emoji2vec #2: Learning Emoji Representations from their Description

Another group took a different approach to emoji vectorization. Instead of looking at online text containing emojis, such as tweets or captions, they used the official Unicode descriptions and keywords associated with each emoji: 6,088 descriptions covering 1,661 emoji symbols.

Then, for each word in an emoji's description and keywords, they looked up the corresponding Google News word2vec embedding and summed the word vectors to create a new vector representing that emoji.
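In code, the idea might look like the following sketch, which loads the standard pre-trained Google News release (`GoogleNews-vectors-negative300.bin`) and sums description words. The example description string is illustrative, not the official Unicode text:

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained 300-dimensional Google News word2vec vectors (downloaded locally).
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def embed_emoji(description, kv):
    """Sum the vectors of all in-vocabulary words in an emoji's description."""
    words = [w for w in description.lower().split() if w in kv]
    return np.sum([kv[w] for w in words], axis=0)

fire_vector = embed_emoji("fire hot flame", word_vectors)  # e.g. for 🔥
print(fire_vector.shape)  # (300,)
```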

They found that augmenting sentiment analysis with their emoji2vec embeddings improved the performance of models using only word2vec when classifying a dataset of 67,000 English tweets manually labeled positive, neutral, or negative. They also found that it outperformed the skip-gram method described above.
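One plausible way to set up that kind of augmented classifier (a toy sketch with random stand-in embeddings, not the paper's pipeline) is to sum a tweet's word vectors and emoji vectors into one feature vector and train a standard classifier on it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins for real word2vec / emoji2vec lookup tables.
rng = np.random.default_rng(0)
word_kv = {w: rng.standard_normal(300) for w in ["love", "hate", "this", "ok"]}
emoji_kv = {e: rng.standard_normal(300) for e in ["❤️", "😡"]}

def featurize(tokens):
    """Sum word vectors and emoji vectors into one tweet representation."""
    vecs = [word_kv[t] for t in tokens if t in word_kv]
    vecs += [emoji_kv[t] for t in tokens if t in emoji_kv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(300)

tweets = [["love", "this", "❤️"], ["hate", "this", "😡"], ["ok"]]
labels = [2, 0, 1]  # 2 = positive, 0 = negative, 1 = neutral

X = np.stack([featurize(t) for t in tweets])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(featurize(["love", "❤️"]).reshape(1, -1)))
```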

They projected the learned emoji embeddings from 300-dimensional space into 2-dimensional space using t-SNE (t-Distributed Stochastic Neighbor Embedding), a method that attempts to preserve relative distances in lower-dimensional space.
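With scikit-learn, that projection takes a few lines; here a random array stands in for the real 300-dimensional emoji embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the (n_emojis, 300) matrix of learned embeddings.
emoji_vectors = np.random.default_rng(0).standard_normal((50, 300))

points_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    emoji_vectors
)
print(points_2d.shape)  # (50, 2): one x/y point per emoji, ready to scatter-plot
```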

They also performed an analogy task, a common evaluation for word embeddings. Though there are fewer obvious analogies among emojis than in language, they came up with some examples and observed that the correct answer was often among the five closest emojis.
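The analogy test is the familiar vector arithmetic from word2vec: “a is to b as c is to ?” becomes b - a + c. Here is a gensim sketch with random stand-in vectors; the man/king/woman analogy is the classic word2vec example transplanted to emoji, not necessarily one from the paper:

```python
import numpy as np
from gensim.models import KeyedVectors

# Tiny stand-in KeyedVectors; real emoji2vec embeddings would be loaded instead.
rng = np.random.default_rng(0)
emojis = ["👨", "👩", "👑", "🤴", "👸", "🔬"]
kv = KeyedVectors(vector_size=300)
kv.add_vectors(emojis, rng.standard_normal((len(emojis), 300)).astype(np.float32))

# "👨 is to 🤴 as 👩 is to ?": with real embeddings, 👸 should rank near the
# top; with random stand-ins the output is meaningless.
print(kv.most_similar(positive=["🤴", "👩"], negative=["👨"], topn=5))
```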

Overall, both methods created reasonably successful emoji embeddings, and both improved the performance of sentiment analysis over models using text alone. Both groups have released their embeddings online, where they can be imported as dictionaries of vectors, as sketched below. As emoji use continues to grow online, it will be interesting to see how emoji embedding advances.
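For example, the pre-trained emoji2vec embeddings are distributed in word2vec binary format and, once downloaded (the authors' repository ships a file named `emoji2vec.bin`), can be loaded and queried with gensim:

```python
from gensim.models import KeyedVectors

# Load the released emoji embeddings (word2vec binary format, downloaded locally).
e2v = KeyedVectors.load_word2vec_format("emoji2vec.bin", binary=True)

print(e2v["🔥"][:5])                   # first 5 dimensions of the 🔥 vector
print(e2v.most_similar("🔥", topn=3))  # nearest emojis in the embedding space
```

From there, the emoji vectors drop straight into any pipeline that already consumes word2vec vectors.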
