Word Embedding Explained, a comparison and code tutorial

Duncan Cam-Stei
8 min read · Feb 12, 2019


When to use word embeddings from the popular FastText dictionary and when to stick with TF-IDF vector representations, explained with code examples.

TF-IDF and word embedding are two of the most common methods in Natural Language Processing (NLP) for converting sentences into machine-readable vectors. In this article we will cover:

  • What TF-IDF vectors and word embedding vectors are
  • How to apply both methods to a spam classification task
  • When it is better to use word embedding

Word embedding converts a word to an n-dimensional vector. Related words such as ‘house’ and ‘home’ map to similar n-dimensional vectors, while dissimilar words such as ‘house’ and ‘airplane’ map to dissimilar vectors. In this way the ‘meaning’ of a word is reflected in its embedding, and a model can use this information to learn the relationships between words. The benefit of this approach is that a model trained on the word ‘house’ will be able to react to the word ‘home’ even if it never saw that word in training.

We will be using the FastText word embedding dictionary, developed by Facebook AI Research (FAIR). The model is trained by attempting to guess a missing word given the other words in a sentence. Like all word embeddings, FastText is trained on an extremely large text corpus, in this case Wikipedia and a news corpus.

Luckily for us, we can apply the results without redoing all that training: simply download the wiki-news-300d-1M.vec lookup dictionary from the FastText website, which contains the 300-dimensional mappings of 1 million unique words.

Here is how to load the data into a Jupyter notebook. Note that, to save memory, we will keep only the 100,000 most common words and ignore the rest.

# Loading the embedding file from a local download
path_fastText = 'wiki-news-300d-1M.vec'
dictionary = open(path_fastText, 'r', encoding='utf-8',
                  newline='\n', errors='ignore')
dictionary.readline()  # skip the header line (word count and dimension)

embeds = {}
for line in dictionary:
    tokens = line.rstrip().split(' ')
    embeds[tokens[0]] = [float(x) for x in tokens[1:]]
    # Keep only the 100,000 most common words to save memory
    if len(embeds) == 100000:
        break

print(embeds['car'])
>> [-0.016, -0.0003, -0.1684, 0.0899, -0.02, -0.0093, 0.0482, -0.0308, -0.0451, 0.0006, 0.168 ... ]
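To see the idea that related words map to similar vectors in action, we can compare a few dictionary entries with cosine similarity. This is a quick optional check that is not part of the original post, and it assumes the embeds dictionary loaded above:

import numpy as np

# Cosine similarity: expected to be higher for related words
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeds['house'], embeds['home']))      # expected to be relatively high
print(cosine_similarity(embeds['house'], embeds['airplane']))  # expected to be lower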

Term Frequency–Inverse Document Frequency (TF-IDF) is another, more common tool in NLP for converting a list of text documents to a matrix representation. Each document is converted to a row of the TF-IDF matrix and each word corresponds to a column. The size of the vocabulary (i.e. the number of columns) is a parameter that must be specified; a vocabulary of the top 5,000–10,000 most common words is often sufficient. TF-IDF rows are sparse vectors in which the number of non-zero values equals the number of unique vocabulary words in the document. So if a document contains the word ‘house’, the ‘house’ column will hold a non-zero value in that document’s row.

During fitting, the TF-IDF vectorizer discovers the most common words in the corpus and saves them to the vocabulary. A document is transformed by counting the number of times each vocabulary word appears in it, so a TF-IDF matrix has the shape [number_of_documents, size_of_vocabulary]. The weight of each word is scaled down by how frequently it appears across the corpus, so a word that appears in only 10% of all documents is assigned a higher value (and thus treated as more important) than one that appears in, say, 90% of documents.
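As a small illustration of that weighting (a toy example that is not part of the original post), we can fit a TfidfVectorizer on three short documents; the rare word ends up with a larger weight than words that appear in every document. Note that on older scikit-learn versions get_feature_names_out is called get_feature_names.

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ['the house is big',
              'the house is small',
              'the airplane is fast']
vec = TfidfVectorizer()
toy_matrix = vec.fit_transform(toy_corpus)  # shape: (3 documents, vocabulary size)

print(vec.get_feature_names_out())      # the learned vocabulary
print(toy_matrix.toarray().round(2))    # 'airplane' gets a larger weight than 'the' or 'is'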

In summary:

Comparison of Word Embedding and TF-IDF

It can be seen from the above discussion that a word embedding clearly carries much more information than a TF-IDF column, but this comes at the cost of being more memory intensive and more difficult to apply. Indeed, the question remains: how do you apply word embedding to a full sentence and not just a single word?

Applied Code Example

Let’s answer that question with an example: say we want to classify text messages into two categories, spam and not spam. How would we use each of the two methods to do this?

To compare the two methods fairly it is necessary to use the same model and dataset. The dataset is a spam detection dataset: a set of 5,572 short SMS texts labeled as ‘ham’ or ‘spam’. The model is scikit-learn’s Linear Support Vector Classifier (LinearSVC).

import pandas as pd
import numpy as np

path_to_text = r'..\Datasets\SMS Spam Detection (Kaggle)\spam.csv'
data = pd.read_csv(path_to_text, encoding='latin-1')[['v1', 'v2']]

# Creating the feature set and label set
text = data['v2']    # the SMS message text
label = data['v1']   # 'ham' or 'spam'
print(data[10:14])
Sample of SMS Spam Detection Dataset
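Before building features it can help to glance at the class balance. A minimal check (not in the original post), assuming the data loaded above:

# Number of 'ham' vs 'spam' messages in the dataset
print(label.value_counts())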

Creating the feature tables

Word Embedding Method:

In order to train our model on full sentences and not just single words, we must find a way to pass multiple words to the model simultaneously. The solution is to concatenate the word vectors together and pass the combined vector. Concatenating 20 words, where each word is a 300-dimensional embedding, yields a 6,000-dimensional vector. But wait! What do we do when a text is not exactly 20 words long? For texts with fewer than 20 words we pad the end of the vector with zeros; the model will learn not to assign any meaning to these values. For texts longer than 20 words we keep only the first 20 words and drop the rest.

As a useful trick we will use the text_to_word_sequence function from the Keras preprocessing library. This function automatically converts a string to a list of word tokens, cleaning the data by removing punctuation and lower-casing the text along the way.
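For example, here is roughly what the function returns for a short string; this is a quick check that is not part of the original post, and the exact tokens depend on the Keras version and the function's default filters:

from keras.preprocessing.text import text_to_word_sequence

# Punctuation is stripped and everything is lower-cased by default
print(text_to_word_sequence('Free entry!! Text WIN to 80082 now.'))
>> ['free', 'entry', 'text', 'win', 'to', '80082', 'now']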

from keras.preprocessing.text import text_to_word_sequence

array_length = 20 * 300
rows = []
for document in text:
    # Saving the first 20 words of the document as a sequence
    words = text_to_word_sequence(document)[0:20]

    # Retrieving the vector representation of each word and
    # appending it to the feature vector
    feature_vector = []
    for word in words:
        try:
            feature_vector = np.append(feature_vector,
                                       np.array(embeds[word]))
        except KeyError:
            # If a word is not in our dictionary, skip it
            pass

    # If the text has fewer than 20 words, fill the remaining
    # vector with zeros
    zeroes_to_add = array_length - len(feature_vector)
    feature_vector = np.append(feature_vector,
                               np.zeros(zeroes_to_add))

    # Append the document feature vector to the feature table
    rows.append(feature_vector)

embedding_features = pd.DataFrame(np.vstack(rows))
print(embedding_features.shape)
>> (5572, 6000)

We have now converted our data into a machine-readable table! As expected, the table size is [number of documents, length of feature vector].

TF-IDF method:

Creating the TF-IDF feature table is very simple using the sklearn TfidfVectorizer. We define the number of words we want to keep (in this case 6,000, so both feature tables have the same width) and fit the vectorizer on the full corpus of text data. The vectorizer then transforms the text data into its TF-IDF representation.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = list(text)
tfidf = TfidfVectorizer(max_features=6000)
tfidf.fit(corpus)
tfidf_features = tfidf.transform(corpus)
print(tfidf_features.shape)
>> (5572, 6000)
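To connect this back to the weighting described earlier, we can inspect the fitted vectorizer's idf_ attribute. This is an optional check that is not part of the original post; on older scikit-learn versions get_feature_names_out is called get_feature_names.

# Words that appear in many messages get the smallest IDF values,
# words that appear in few messages get the largest
vocab = tfidf.get_feature_names_out()
order = np.argsort(tfidf.idf_)
print('most common words:', [vocab[i] for i in order[:5]])
print('rarest kept words:', [vocab[i] for i in order[-5:]])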

Training and testing the model

We will train on 70% of the data and use the remaining 30% for testing. Identical LinearSVC classifiers will be trained on each feature table, and the results will be displayed in a table.

First, the labels must be converted from strings to binary values using the sklearn LabelEncoder.

from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support
# Converting the labels from strings to binary
le = LabelEncoder()
le.fit(label)
label = le.transform(label)

Next, we train both models and predict on the test set.

# Taking a 70/30 train test split
train_percent = 0.7
train_cutoff = int(np.floor(train_percent * len(text)))

# Word Embedding
embeded_model = LinearSVC()
embeded_model.fit(embedding_features[0 : train_cutoff],
                  label[0 : train_cutoff])
embeded_prediction = embeded_model.predict(
    embedding_features[train_cutoff : len(text)])

# TF-IDF table
tfidf_model = LinearSVC()
tfidf_model.fit(tfidf_features[0 : train_cutoff],
                label[0 : train_cutoff])
tfidf_prediction = tfidf_model.predict(
    tfidf_features[train_cutoff : len(text)])

Finally, we compare the results.

results = pd.DataFrame(index=['Word Embedding', 'TF-IDF'],
                       columns=['Precision', 'Recall', 'F1 score', 'Support'])

results.loc['Word Embedding'] = precision_recall_fscore_support(
    label[train_cutoff : len(text)],
    embeded_prediction,
    average='binary')

results.loc['TF-IDF'] = precision_recall_fscore_support(
    label[train_cutoff : len(text)],
    tfidf_prediction,
    average='binary')

print(results)
Results of SVM model using both feature sets

It can be seen that the word embedding and TF-IDF features achieved F1 scores of 90.5% and 93.1% respectively. Perhaps surprisingly, the better result comes from the more generic TF-IDF method, by a margin of roughly 3 percentage points. This result shows that choosing a more complex method will not always achieve better results in machine learning.

There are a few reasons that help explain why TF-IDF was superior:

  1. The word embedding method made use of only the first 20 words, while the TF-IDF method made use of all available words. The TF-IDF method therefore gained more information from longer documents than the embedding method did (7% of all documents are longer than 20 words; a quick way to check this is sketched after this list).
  2. The word embedding method could have overfit the data. With its larger effective vocabulary, the embedding method is likely to assign rules to words that are only rarely seen in training. Conversely, the TF-IDF method had a smaller vocabulary, so rules could only be formed on words that appeared in many training examples.
  3. The word embedding features carry a much ‘noisier’ signal than TF-IDF. A word embedding is a much more complex word representation and carries much more hidden information. In our case most of that information is unnecessary and creates false patterns in the model.
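A minimal sketch of the check mentioned in point 1, reusing the text series and text_to_word_sequence from earlier (the exact percentage depends on the tokenizer settings):

from keras.preprocessing.text import text_to_word_sequence

# Fraction of SMS messages that exceed the 20-word cutoff
n_long = sum(len(text_to_word_sequence(doc)) > 20 for doc in text)
print(round(100 * n_long / len(text), 1), '% of documents are longer than 20 words')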

Conclusion and Future Work

The article has:

  • Explained the difference between using word embeddings and TF-IDF matrices
  • Demonstrated how to make use of the FastText word embedding for your own projects
  • Demonstrated how to convert text to a machine readable format for use in machine learning using two distinct methods
  • Compared the effectiveness of Word Embedding and TF-IDF on a classification task

Future Work:

  • Changes to the model design or vocabulary size could reduce over-fitting and improve performance
  • Identical vocabularies could be used when building both feature tables, allowing a more direct comparison of the role embeddings play in NLP projects
