Natural Language Understanding With SVMs

Bouwe Ceunen
Published in Axons · Oct 5, 2019

Our most powerful way of communicating is also our most complicated one. Without any visual information, only fluctuations of air pressure, we can build stories in our minds, feel emotions and convey our thoughts. The debate over whether a computer will ever give meaning to allegedly empty words is still raging on. How can a computer, which just crunches numbers, feel a certain way about the word ‘happiness’ or visualize the things that come to mind when you say the word ‘tree’? Natural language gives us a way of communicating with so much more than just words.

Language Modelling

If you try to process language, you end up with very sparse data: chances are that not all possible word combinations appear in the training data. A solution to this problem is the first-order Markov assumption, which states that only the previous word is relevant. A very popular way of modeling language under this assumption is the probabilistic N-gram model.

Markov assumption The chance of a word w_i coming after a sequence of words w_1, w_2, …, w_(i-1) can be approximated by using the n-th-order Markov assumption, which only conditions on the n previous words. If n=1 this is called the first-order Markov assumption, if n=2 the second-order Markov assumption, and so on.

Markov assumption: P(w_i | w_1, …, w_(i-1)) ≈ P(w_i | w_(i-n), …, w_(i-1))

So for example, in the sentence “the cat wears socks on a sunny day”, the word ‘day’ only depends on the n previous words, such as ‘sunny’, ‘a’, and so on. The relevance goes down the further you move away from the word, so the closest words are the most relevant. This reasoning helps us define language and study the correlation between certain words and/or sentences. N-grams help us define the scope of relevance of the words that come before or after a certain word, as explained below.

n-grams Language modeling has been dominated by probabilistic models such as N-grams. A unigram model estimates the probability of each word on its own, without looking at any previous words. A bigram model approximates the probability of a word given the previous word, so for our example the chance that the word ‘day’ comes after the word ‘sunny’ is P(‘day’|’sunny’). A trigram model approximates the probability of a word given the two previous words, so P(‘day’|’a’,’sunny’). N-grams thus look N-1 words into the past. The actual calculation of these probabilities is done with the maximum likelihood estimation (MLE).

maximum likelihood estimation The maximum likelihood estimation is obtained by getting the counts from a corpus and normalizing the counts so that they are between 0 and 1.

maximum likelihood estimation (bigram): P(w_i | w_(i-1)) = count(w_(i-1) w_i) / count(w_(i-1))

So for example, if a word occurs 1000 times in a corpus of a million words, its unigram probability is estimated as 1000/1,000,000 = 1/1000. Bigram probabilities are estimated the same way, with the count of a word pair normalized by the count of its first word.
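
As a minimal sketch, the bigram MLE above can be computed directly from word and word-pair counts. The toy corpus and helper below are made up purely for illustration.

from collections import Counter

corpus = "the cat wears socks on a sunny day".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(previous_word, word):
    # P(word | previous_word) = count(previous_word word) / count(previous_word)
    return bigram_counts[(previous_word, word)] / unigram_counts[previous_word]

print(bigram_mle("sunny", "day"))  # 1.0 in this tiny corpus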

Word Representation

Word embeddings, also known as word vector representations, are a way of converting words into numerical vectors. There are several ways to do this, but the first thing you need is a vocabulary. It is not feasible to take every English word into account: this would result in too many dimensions and an enormous amount of training time without a supercomputer. From this vocabulary, the words can be represented as high-dimensional vectors in several ways.

As an example, take the vocabulary {woman, child, king, queen, man, royalty, masculinity, femininity, age}. Each word is then represented by a vector with the same length as the vocabulary.

one-hot representation One-hot representation, or 1-of-N representation, is the most straightforward way to turn a given word into a vector. With the vocabulary above, the word ‘queen’ would be represented by the vector [0,0,0,1,0,0,0,0,0].

one-hot representation (source)

distributed representation This representation is not as straightforward as the one-hot representation. Here, the word ‘queen’ could be represented by a vector such as [0.99,0.03,0.82,0.99,0.99,0.05,0.94,0.54]. Each word is represented by means of its neighbours, and a lot of information can be retrieved from how it relates to them. These distributions are usually constructed by neural networks, and the exact meaning of each number in such a representation is often a mystery to us. So instead of mapping each word one-to-one onto a single position, as with the one-hot representation, each word is spread across all elements of the vector. This representation carries more information about the word: with distributed vector representations, vector manipulations like King-Man+Woman=Queen become possible, making several kinds of relationships visible.

distributed representation (source)
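
The King-Man+Woman=Queen manipulation can be illustrated with a few made-up vectors; real distributed representations would come from a trained model, but the arithmetic is the same.

import numpy as np

# hypothetical 3-dimensional embeddings, chosen by hand for illustration only
embeddings = {
    "king":  np.array([0.95, 0.90, 0.10]),
    "queen": np.array([0.95, 0.05, 0.10]),
    "man":   np.array([0.10, 0.92, 0.15]),
    "woman": np.array([0.10, 0.07, 0.15]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
closest = max(embeddings, key=lambda w: cosine(embeddings[w], target))
print(closest)  # 'queen' is the nearest vector to king - man + woman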

Learning Distributed Word Vectors

There are two ways of learning distributed word vectors: the continuous Bag-of-Words (CBOW) model and the continuous Skip-gram model. The two are each other’s opposite: the Bag-of-Words model starts from the context words (predicting the focus word from its context), while the continuous Skip-gram model starts from the focus word (predicting the context from the focus word). Consider the following sentence: “the CBOW model is totally different from the Skip-gram model”. If you slide over the sentence word per word, the context of each word consists of the N words surrounding it. So the context of the focus word ‘different’ includes, for example, the words ‘is totally’ and ‘from the’.

continuous Bag-of-Words model With the continuous Bag-of-Words model, the context words form the input layer and each word is encoded via the one-hot representation. This means that if the vocabulary size is V there will be V-dimensional vectors with only one position set to 1 and the others to 0. By knowing which output you would like for each input, the weight matrix W of the hidden layer can be adjusted to improve the probability of the right word. Note that the hidden layer uses a linear activation function and simply passes the weighted input vector on to the output layer. From the hidden layer to the output layer, a second weight matrix W2 is used to compute probabilities for every word in the vocabulary given the input context.

continuous Skip-gram model The Skip-gram model is also used to learn distributed word representations. With this model, the focus word is taken as the input vector and the context words are retrieved as the output vectors, the opposite of the continuous Bag-of-Words model. The input vector gets weighted by the weight matrix W between the input layer and the hidden layer, as with the CBOW model. The hidden layer again uses a linear activation function, passing the weighted vector on to the output layer, where it is weighted by the weight matrix W2 before each of the C context word distributions is produced at the output layer.
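
As a rough sketch of the CBOW forward pass described above, with made-up sizes and random weights (training is omitted): the context one-hot vectors are averaged, passed through W with a linear activation, then through W2 and a softmax over the vocabulary. The Skip-gram model simply swaps what is input and what is predicted.

import numpy as np

V, H = 10, 4                      # vocabulary size, hidden (embedding) size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, H))       # input -> hidden weights, the word embeddings
W2 = rng.normal(size=(H, V))      # hidden -> output weights

def one_hot(index, size=V):
    vector = np.zeros(size)
    vector[index] = 1.0
    return vector

def cbow_forward(context_indices):
    context = np.mean([one_hot(i) for i in context_indices], axis=0)
    hidden = context @ W          # linear activation, no non-linearity
    scores = hidden @ W2
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()        # probability of each vocabulary word being the focus word

probs = cbow_forward([2, 3, 5, 6])   # indices of the context words
print(probs.argmax(), probs.sum())   # predicted focus word index, probabilities sum to 1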

Learning Sentence Vectors

There are several ways to learn sentence vectors. The first is summing all the word vectors and dividing by the number of words, so that the average word vector represents the sentence. The second approach is taking the word vectors, multiplying them by their TFIDF values and then taking the average. The TFIDF values are the Term Frequency-Inverse Document Frequency values, which denote the importance of a word in a corpus. A third approach is to use unique identifiers for each portion of the sentence and then proceed with standard word embedding models.

averaging The simplest way to do this is to sum the word vectors and divide by the number of word vectors. This results in an average sentence vector, as defined by the following equation, where n is the total number of word vectors and V is the collection of word vectors.

averaging sentence representation: s = (1/n) * Σ_(v ∈ V) v

TFIDF Another approach is to first multiply each word vector by its TFIDF (Term Frequency-Inverse Document Frequency) score. The TFIDF score indicates how important a word is in a document: it lowers the weight of words that appear in many documents and raises the weight of words that are rare and distinctive. After multiplying the word vectors by their TFIDF scores, they are summed and averaged just like in the first approach, as defined by the following equation, where n is the total number of word vectors, V is the collection of word vectors and TFIDF is the TFIDF score.

TFIDF sentence representation: s = (1/n) * Σ_(v ∈ V) TFIDF(v) * v

The TFIDF score is obtained by multiplying the term frequency with the inverse document frequency. The term frequency is the relative frequency of a word in a document and the inverse document frequency indicates how rare a word is across the collection of documents. A term is denoted by t, a document by d, the total number of documents in the collection by N, the term frequency by tf and the document frequency by df.

TFIDF score: tfidf(t, d) = tf(t, d) * idf(t) = tf(t, d) * log(N / df(t))
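
A minimal sketch of this computation on a made-up three-document collection, assuming the plain tf * log(N/df) formulation above (libraries such as scikit-learn add extra smoothing and normalization):

import math

documents = [
    "lights on in the kitchen",
    "lights off in the bathroom",
    "open the garage door",
]

def tf(term, document):
    words = document.split()
    return words.count(term) / len(words)

def idf(term, documents):
    df = sum(1 for d in documents if term in d.split())
    return math.log(len(documents) / df)

def tfidf(term, document, documents):
    return tf(term, document) * idf(term, documents)

print(tfidf("kitchen", documents[0], documents))  # rare word, weight around 0.22
print(tfidf("the", documents[0], documents))      # appears in every document, weight 0.0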

Pointwise Mutual Information

Pointwise mutual information is an information-theoretic approach to finding collocations. A collocation is a specific way of saying something and defines which words are likely to co-occur in a sentence. The PMI is defined over two words according to the following equation. It measures how much information we gain about one word from the other. The PMI is calculated by taking the logarithm of the probability that x and y co-occur, divided by the product of the individual probabilities of the words. PMI can come in useful when judging the correctness of sentences, by taking the PMI of the bigrams of a sentence and retrieving the set of bigrams that yields the highest total.

pointwise mutual information: PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )
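
A minimal sketch of PMI over the bigrams of a made-up corpus, following the equation above:

import math
from collections import Counter

corpus = "the cat wears socks on a sunny day the cat sleeps on a sunny day".split()
word_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
total_words = len(corpus)
total_bigrams = len(corpus) - 1

def pmi(x, y):
    p_xy = bigram_counts[(x, y)] / total_bigrams
    p_x = word_counts[x] / total_words
    p_y = word_counts[y] / total_words
    return math.log(p_xy / (p_x * p_y))

print(pmi("sunny", "day"))  # clearly positive, 'sunny day' is a collocation in this corpus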

Lemmatising

Lemmatising is the technique of reducing each word of a sentence to its lemma. This can be done with the NLTK toolkit. By doing this, words that are written in different forms, such as ‘going’ and ‘go’, are reduced to the same lemma in order to better grasp the meaning of a sentence. Machine learning techniques would otherwise consider these as different words with different meanings, which could be troublesome.
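
A minimal sketch with NLTK’s WordNetLemmatizer; the WordNet data has to be downloaded once, and the part-of-speech tag matters, since ‘going’ only reduces to ‘go’ when lemmatised as a verb:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)           # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("going", pos="v"))  # 'go'
print(lemmatizer.lemmatize("socks"))           # 'sock' (default part of speech is noun)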

Support Vector Machines

Support vector machines are maximum-margin discriminative classifiers: they rely on support vectors to build their decision boundaries. Support Vector Machines use hyperplanes to separate instances of different classes and try to maximize the space between the classes by maximizing the margin.

Support Vectors

Support vectors are the points closest to the decision boundary and exert the largest force on the decision surface. That is why they are called ‘support’ vectors: they support the decision surface. The space between the support vectors is called the margin.

support vectors (source)

Hyperplanes

A hyperplane is a subspace of one dimension lower than the original space. In 1 dimension this is a point, in 2 dimensions a line, in 3 dimensions a plane, and so forth.

hyperplane definition: w · x + b = 0

In this equation, the vector w is perpendicular to the hyperplane and b is what is called the bias. Without the bias term, your classifier would always pass through the origin and would not separate the data in the most efficient way.

input and feature space (source)
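
As a small illustration of the hyperplane equation above, the sign of w · x + b tells on which side of the decision boundary a point lies; w and b here are made up for the example.

import numpy as np

w = np.array([2.0, -1.0])   # vector perpendicular to the hyperplane
b = -1.0                    # bias term

def side(x):
    return np.sign(w @ x + b)

print(side(np.array([3.0, 1.0])))   #  1.0, one class
print(side(np.array([0.0, 2.0])))   # -1.0, the other class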

maximum margin hyperplane A maximum margin hyperplane is a hyperplane whose distance to the nearest points of either class is equal, and the margin between the hyperplane and the clusters of points is at a maximum. This principle of maximum margin hyperplanes is used within Support Vector Machines.

Parameter Tuning

There are several parameters to specify in order to control how the training instances will be separated by the decision boundaries.

c parameter The c parameter controls the trade-off between a smooth decision boundary that generalizes better and a jagged decision boundary that fits the training data more closely. A high c parameter produces a jagged decision boundary that overfits more easily, while a low c parameter produces a smoother decision boundary that generalizes more but learns the training data less closely.

gamma parameter The gamma parameter defines how far the influence of a single training instance reaches. With a high gamma parameter, only the instances closest to the separating hyperplane are taken into consideration. With a low gamma parameter, instances further away from the decision boundary are also taken into consideration, which results in a straighter decision boundary. A trade-off has to be made: setting the gamma parameter too high results in overfitting, while setting it too low means the data is not learned properly.
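
Both parameters are usually tuned together. Below is a minimal sketch with scikit-learn’s GridSearchCV; the synthetic dataset and the grid values are made up for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 0.5, 1],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)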

Kernel Functions

There are different types of kernels to choose from. If you know that your data is linearly separable, you should go for a linear kernel because it will train faster than, say, a polynomial kernel. A kernel is a transformation of your input data that lets the SVM separate it more easily. It maps the feature vectors into another dimensional space and separates them there.

kernel function: K(x, y) = φ(x) · φ(y), the dot product of the feature vectors mapped into another space
  • linear function

A linear function as kernel uses the plain dot product of the feature vectors and tries to separate the classes with a linear decision boundary in the original input space.

  • radial basis function

A radial basis function as kernel uses a Gaussian function to try to separate the classes with a decision boundary in a higher-dimensional space.
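
A minimal sketch comparing both kernels on the same data with scikit-learn; the synthetic dataset is made up for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=1)

for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5)
    print(kernel, scores.mean())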

Multi-class Classification

Multi-class classification is the classification of more than two classes, whereas binary classification distinguishes only two. For multi-class classification two approaches are available: one vs all and one vs one.

one vs all classification For one vs all classification you need N classifiers if you have N classes, each trained with one class against all the others. Say there are 10 classes: it will first learn a classifier for class 1 against classes 2,3,4,5,6,7,8,9,10, then a classifier for class 2 against classes 1,3,4,5,6,7,8,9,10, and so forth.

one vs one classification For one vs one classification you need a classifier for each pair of classes, which results in N(N-1)/2 classifiers if you have N classes. This is much more computationally expensive than the one vs all approach but should yield better results. If there are 10 classes, you need a classifier for class 1 and class 2, class 1 and class 3, …, class 1 and class 10. Then you need a classifier for class 2 and class 3, class 2 and class 4, etc.
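
Both strategies are available in scikit-learn as wrappers around any binary classifier. A minimal sketch on a made-up 10-class dataset:

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=10, random_state=0)

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # one classifier per class
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # one classifier per class pair
print(len(ovr.estimators_), len(ovo.estimators_))               # 10 and 10*9/2 = 45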

Hinge Loss Function

The hinge loss function to optimize is formulated in the following equation, with w and x as defined in the section on hyperplanes, t the expected output and y the actual output. This loss function is used with maximum margin classifiers such as Support Vector Machines.

Hinge loss function: loss = max(0, 1 - t * y), with y = w · x + b
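
A minimal sketch of the hinge loss for a few made-up classifier outputs, with t the expected label in {-1, +1} and y the raw output w · x + b:

import numpy as np

def hinge_loss(t, y):
    return np.maximum(0.0, 1.0 - t * y)

print(hinge_loss(+1, 2.3))   # 0.0, correct side and outside the margin
print(hinge_loss(+1, 0.4))   # 0.6, correct side but inside the margin
print(hinge_loss(-1, 0.4))   # 1.4, wrong side, penalized harder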

Hard Margin vs Soft Margin

Hard and soft margins are defined by how strictly the decision boundaries are enforced by the Support Vector Machine. With a high c parameter, the cost of misclassifying an instance is penalized very heavily and a jagged decision boundary is formed. With such a hard margin, the risk of overfitting and not generalizing enough on unseen data increases. With a soft margin, and thus a low c parameter, the penalty for misclassifying an instance is smaller and you end up with a much smoother decision boundary. Setting the cost parameter too low increases the risk of underfitting, which also hurts the ability to generalize well on unseen instances. There is thus always a trade-off in how hard or soft your margin, decision boundary and cost should be.

Setup & Experiments

Experiments were done with Python and scikit-learn. The corpus used is self-composed, to maintain the hands-on experience of building something like this from the ground up. Due to the limited amount of training data I disregarded the distributed word representation for obvious reasons and continued the following experiments with the one-hot word representation.

Constructing Dataset

The dataset used is handcrafted and consists of labeled sentences so that the machine learning techniques can learn the meaning of the sentences. Each line in the dataset consists of different annotations that provide information about that sentence. An example entry looks like “lights,+,kitchen,on,now, Light on in kitchen.”, with the format “category,question,location,action,time,sentence”. Each piece of information is separated by a comma, with the actual sentence at the end. The first piece of information is the category; the different categories are [’lights’, ’camera’, ’openapp’, ’direction’, ’information’, ’other’, ’weather’, ’heating’, ’shutters’, ’toaster’, ’garagedoor’, ’coffeemachine’, ’shower’, ’time’, ’conversation’]. The second piece indicates whether the sentence is a question or not. The third piece of information is the location of the action and the fourth is the action occurring at that location. The fifth piece of information is the time the action has to be executed and the final piece is the sentence itself.
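
A minimal sketch of parsing one such entry into its six fields, assuming the comma-separated format above; only the first five commas split fields, so the sentence itself stays intact.

def parse_entry(line):
    category, question, location, action, time, sentence = line.split(",", 5)
    return category, question, location, action, time, sentence.strip()

print(parse_entry("lights,+,kitchen,on,now, Light on in kitchen."))
# ('lights', '+', 'kitchen', 'on', 'now', 'Light on in kitchen.')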

Word & Sentence Representations

The word and sentence representations are constructed with a one-hot encoding and an averaged and TFIDF sentence representation.

from sklearn.feature_extraction.text import TfidfVectorizer

def one_hot_vector(word, vocabulary):
    vector = [0 for i in range(len(vocabulary))]
    vector[vocabulary.index(word)] = 1
    return vector

def sentence_vector_averaged(sentence, vocabulary):
    # sentence is a list of words; average the one-hot vectors of its words
    sentence_vector = [0 for i in range(len(vocabulary))]
    for word in sentence:
        sentence_vector = [x + y for x, y in zip(sentence_vector, one_hot_vector(word, vocabulary))]
    return [el / len(sentence) for el in sentence_vector]

def tfidfs(sentences):
    vectorizer = TfidfVectorizer(min_df=1)
    vectorizer.fit_transform(sentences)
    idf = vectorizer.idf_
    # newer scikit-learn versions use get_feature_names_out() instead
    return dict(zip(vectorizer.get_feature_names(), idf))

def sentence_vector_tfidf(sentence, vocabulary, tfidfs):
    # weight each one-hot vector by the word's TFIDF score before averaging
    sentence_vector = [0 for i in range(len(vocabulary))]
    for word in sentence:
        sentence_vector = [x + (y * tfidfs[word]) if word in tfidfs else x + y
                           for x, y in zip(sentence_vector, one_hot_vector(word, vocabulary))]
    return [el / len(sentence) for el in sentence_vector]

SVM Model

The decision function shape is chosen to be ‘one-vs-one’ because it is much less sensitive to the problem of imbalanced datasets. The class_vectors are generated the same way as the word vectors, with a one-hot encoding. The sentence_vectors are generated with one of the sentence representations explained above, and the gamma, cost and kernel parameters take variable values, as you will see in the test results below.

from sklearn.svm import SVC

model = SVC(C=cost, gamma=gamma, kernel=kernel, decision_function_shape='ovo')
model.fit(sentence_vectors, [cv.index(1) for cv in class_vectors])
predicted_classes = model.predict(testing_sentence_vectors)

Accuracy Results

The following results were obtained by running the SVM with different values for the c and gamma parameters each time. A linear function kernel is tested against a radial basis function kernel, and an averaged sentence representation against a TFIDF sentence representation.

Averaged sentence representation & linear function kernel
TFIDF sentence representation & linear function kernel
Averaged sentence representation & radial basis function kernel
TFIDF sentence representation & radial basis function kernel

Analysis

Sentence Representation Methods

The sentence representation methods yielded interesting results: the TFIDF sentence representation was more stable and reached good results more quickly, but the averaged sentence representation always ended up with slightly higher or equal accuracy, never less.

averaged sentence representation The averaged sentence representation performed generally well with all the machine learning techniques. This sentence representation had a much greater variance than the TFIDF sentence representation. It starts out with fairly low results for the early tested parameters but picks up as those parameters approach their optimal configuration. Because of the averaging of the sentence vector, all instances end up closer together in the higher-dimensional space and are thus more difficult to separate with less optimally tuned parameters. With optimal parameters, this is not so much of an issue anymore.

TFIDF sentence representation The TFIDF sentence representation also performed generally well with all the machine learning methods. The results were less erratic than with the averaged sentence representation; even with less optimally tuned parameters, it performed decently. Although this representation performed well, it ended up with slightly lower accuracy than the averaged sentence representation. Because of the multiplication with the TFIDF score, these sentence vectors are more scattered around in the higher-dimensional space and thus easier to separate with less optimally tuned parameters. With optimal parameters, this is not so much of an issue anymore.

SVM Kernels

With a linear kernel function, the Support Vector Machine had more difficulty getting good, steady results with the averaged sentence representation than with the TFIDF sentence representation, but the averaged representation still reached the highest accuracy of the two, which was 0.81. With the radial basis function kernel the story is the same: the TFIDF sentence representation gets better results more quickly and overall, but the highest accuracy was reached with the averaged sentence representation, which was 0.82.

The accuracy of an averaged sentence representation versus a TFIDF sentence representation differs considerably across the configurations, for both the linear and the radial basis function kernel.

linear function kernel With a linear function kernel and a misclassification cost of 0.1, the averaged sentence approach reaches a highest accuracy of only 0.01, while the TFIDF sentence approach reaches a highest accuracy of 0.51. With less well tuned parameters, it is harder to separate the instances in the higher-dimensional space.

radial basis function kernel There is also quite a difference between the linear function kernel and the radial basis function kernel. It is clearly visible that with the radial basis function kernel, the averaged sentence representation keeps getting decent results as the gamma parameter goes up. This is not the case with the TFIDF sentence representation: there, the accuracy drops after the gamma value reaches a threshold around 0.1/0.5. This is probably because the gamma parameter becomes too strict; since the instances lie further apart in the higher-dimensional space than with the averaged sentence representation, it starts overfitting quite soon.

Conclusion

linear function > radial basis function The advantages of a linear function kernel outweigh those of a radial basis function kernel, mainly because of computational cost. A linear kernel is faster to train than a radial basis function kernel, and it has one parameter less to tune, namely the gamma parameter. Text classification is a problem that can be handled in a linearly separable way because of the already high-dimensional space in which text resides; it does not really help performance to transform the data into an even higher-dimensional space.

References

[1] Daniel Jurafsky & James H. Martin. Speech and language processing n-grams. https://lagunita.stanford.edu/c4x/Engineering/CS224N/asset/slp4.pdf, 2014.

[2] Adrian Colyer. Word representations. https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/, 2016.

[3] Neel. Sentence vector representations. http://stackoverflow.com/questions/29760935/how-to-get-vector-for-a-sentence-from-theword2vec-of-tokens-in-sentence, 2015.

[4] Cambridge University Press. Inverse document frequency. http://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html, 2015.

[5] Riki Saito. Levenshtein distance. https://justrthings.com/2016/11/03/record-linkage-approximate-string-matching-with-stringdist/, 2016.

[6] Julia Silge and David Robinson. Term frequency inverse document frequency. https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html, 2016.

[7] Daniel Jurafsky & James H. Martin. Pointwise mutual information. https://web.stanford.edu/~jurafsky/slp3/15.pdf, 2016.

[8] Shubhendu Trivedi. Support vectors. https://onionesquereality.wordpress.com/2009/03/22/why-are-support-vectors-machines-calledso/, 2009.

[10] Gnattuha. One vs all and one vs one in SVM. https://stats.stackexchange.com/questions/91091/one-vs-all-and-one-vs-one-in-svm, 2014.

[11] Urun Dogan. Optimal margin classifier. http://www.jmlr.org/papers/volume17/11–229/11–229.pdf, 2016.
