The right hashtags can be the difference between a successful marketing campaign and talking to yourself online, but you can’t read 300 million tweets just to know what’s trending. That’s why Hootsuite’s machine learning team developed a service to suggest hashtags while drafting a social media post.
We represent hashtags and words as vectors. Think of a vector as a location. When we take in text, we convert each word into a vector and average those vectors to get a single vector that represents the message. Our suggested hashtags are the hashtags whose embedding vector is closest to the message vector. For example, if a message says “Stanford Berkeley” the algorithm would find the average coordinates (vector) of those two locations and suggest the city (hashtag) closest to it. In this case, #SanFrancisco.
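The averaging-and-nearest-neighbor idea can be sketched in a few lines. The 2-D “coordinates” below are made-up toy values standing in for real embedding vectors, chosen so the Stanford/Berkeley example works out:

```python
import math

# Hypothetical 2-D "coordinates" standing in for real embedding vectors.
word_vectors = {
    "stanford": [2.0, 1.0],
    "berkeley": [4.0, 3.0],
}
hashtag_vectors = {
    "#SanFrancisco": [3.0, 2.0],
    "#NewYork": [-5.0, 4.0],
}

def message_vector(words):
    """Average the embedding vectors of the words in the message."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def closest_hashtag(words):
    """Suggest the hashtag whose vector is nearest the message vector."""
    m = message_vector(words)
    return min(hashtag_vectors,
               key=lambda h: math.dist(m, hashtag_vectors[h]))

print(closest_hashtag(["stanford", "berkeley"]))  # prints "#SanFrancisco"
```

The average of the two word vectors lands exactly on #SanFrancisco’s “location,” so it wins the distance comparison.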
What’s an embedding vector anyway?
Embedding vectors represent words in a format computers can process. Intuitively, each dimension in an embedding vector represents a feature of the word. Our team used the BlazingText algorithm to produce these embeddings. For each word in the training set, BlazingText tries to predict the words that appear near it, so words that often appear together end up with similar embeddings. For this project we calculate embedding vectors from which hashtags occur with other hashtags and which words occur with other words. The vocabulary of hashtags and words comes from an anonymized data set of all posts made by Hootsuite customers in the past three months.
How do you use embedding vectors to suggest hashtags?
We simply return the hashtags whose embedding vectors are closest to an input embedding vector, where closeness is measured by cosine similarity. Our service reuses this logic for both models; only the embedding vectors change.
The first suggestion algorithm extracts hashtags the user already wrote in their message and suggests similar hashtags. The embeddings are calculated with BlazingText as described earlier. Hashtags without an embedding vector are ignored.
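A sketch of this first model’s lookup, with toy three-dimensional hashtag vectors (the real embeddings come from BlazingText): average the vectors of the hashtags the user already wrote, skip any without an embedding, and rank the rest by cosine similarity.

```python
import math

# Toy hashtag embeddings; real ones come from BlazingText.
hashtag_vectors = {
    "#marketing": [0.9, 0.1, 0.3],
    "#socialmedia": [0.8, 0.2, 0.4],
    "#coffee": [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def suggest_similar(message_hashtags, k=2):
    """Average the vectors of hashtags already in the message (ignoring
    those without an embedding) and rank the rest by cosine similarity."""
    known = [h for h in message_hashtags if h in hashtag_vectors]
    query = [sum(dim) / len(known)
             for dim in zip(*(hashtag_vectors[h] for h in known))]
    candidates = [h for h in hashtag_vectors if h not in message_hashtags]
    return sorted(candidates,
                  key=lambda h: cosine_similarity(query, hashtag_vectors[h]),
                  reverse=True)[:k]

print(suggest_similar(["#marketing", "#nosuchtag"], k=1))
```

Here “#nosuchtag” has no embedding and is dropped, so the query is just #marketing’s vector, and #socialmedia ranks above #coffee.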
The second suggestion algorithm identifies all the words that occur alongside each hashtag, counting repeated words. We calculate an embedding for each hashtag by averaging the BlazingText embeddings of the words that occurred with it. This approach lets us suggest hashtags for any message, even if the user hasn’t written a hashtag yet.
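The construction of those second-model embeddings can be sketched as follows, with hypothetical word embeddings and a toy two-post corpus. Note that a word repeated in a post is counted twice when averaging:

```python
from collections import defaultdict

# Hypothetical word embeddings; real ones come from BlazingText.
word_vectors = {
    "latte": [0.9, 0.1],
    "espresso": [0.8, 0.2],
    "deadline": [0.1, 0.9],
}

# Toy corpus: (words in post, hashtags in post).
posts = [
    (["latte", "espresso", "latte"], ["#coffee"]),  # repeated words count
    (["deadline"], ["#work"]),
]

def hashtag_embeddings(posts):
    """Average the embeddings of every word that co-occurred with each
    hashtag, counting repeats, to get one vector per hashtag."""
    cooccurring = defaultdict(list)
    for words, hashtags in posts:
        for tag in hashtags:
            cooccurring[tag].extend(w for w in words if w in word_vectors)
    return {
        tag: [sum(dim) / len(ws)
              for dim in zip(*(word_vectors[w] for w in ws))]
        for tag, ws in cooccurring.items()
    }

print(hashtag_embeddings(posts))
```

Once these per-hashtag vectors exist, the same cosine-similarity lookup used by the first model can rank them against any message vector, hashtag or no hashtag.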
What tools did you use to process all your data?
We used AWS SageMaker for data exploration, endpoint deployment, hyperparameter tuning, and some embedding calculation. Our data preprocessing itself was done with Spark and Scala. Working on the cloud was critical for our team, as our pipeline uses 160 GB of RAM at some stages. “Big Data” tends to be used as a buzzword, but there’s no denying we were working with a lot of bits!
Is this really all there is to it?
Not quite… we can’t tell you all our secrets, after all! There are lots of little details, like removing stop words and expanding contractions, that we won’t cover here. Sorry! If you are interested in reading more about hashtag suggestion, we recommend the EmTaggeR research paper, which inspired us.