NLP engineers often deal with corpora of documents or texts. Raw text cannot be fed directly into machine learning algorithms, so it is important to represent documents in a way computers and algorithms understand, i.e. as vectors of numbers. These methods are also called feature extraction or feature encoding methods. In this blog we will learn and implement 3 very important feature extraction methods.
1. Bag of Words (BoW)
This is the most flexible, intuitive and easiest of the feature extraction methods. The text/sentence is represented as a list of counts of unique words; for this reason the method is also referred to as count vectorisation. To vectorize our documents, all we have to do is count how many times each word appears.
Since the bag-of-words model weighs words purely by how often they occur, very common words like "is", "the" and "and" dominate the counts while adding no value in practice. Stop words (introduced earlier in this series) are therefore removed prior to count vectorisation.
The vocabulary is the set of unique words across all the documents.
Vocabulary: ['dog', 'a', 'live', 'in', 'home', 'hut', 'the', 'is']
The model is only concerned with whether known words occur in the document, not where in the document. Obviously there is significant information loss in representing an entire document as a single vector, since the order and structure of the words are discarded, but this is sufficient for many computational linguistics applications. It is computationally simpler and actively used when positional or contextual information isn't relevant.
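The counting described above can be sketched in plain Python with the standard library's `collections.Counter`. This is a minimal illustration (the function and variable names are my own), using the three example documents that appear later in this post:

```python
from collections import Counter

# Example corpus from this post; stop words are kept for simplicity.
corpus = ["a dog live in home", "a dog live in the hut", "hut is dog home"]

# Vocabulary: the set of unique words across all documents (sorted for a stable order).
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Represent a document as a vector of per-word counts over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in corpus]
```

Each document becomes a fixed-length vector with one slot per vocabulary word; words absent from a document simply get a count of zero.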
2. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a method that gives rarer words greater weight. We will look at the two components of TF-IDF separately for better understanding.
Term Frequency : tf(t,d)
This measures how frequently a given word appears within a document.
There are 2 popular methods to represent this.
1. Term frequency adjusted for document length: tf(t, d) = (number of times term t appears in document d) ÷ (number of words in d)
2. Logarithmically scaled frequency: tf(t, d) = log(1 + number of times term t appears in document d)
doc1 = ‘a dog live in home’
tf(dog, doc1) = 1/5 (according to method 1)
tf(dog, doc1) = log(1 + 1) = log 2 (according to method 2)
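Both definitions above translate directly into a few lines of Python. A minimal sketch (function names are illustrative, not from any library):

```python
import math

def tf_length_adjusted(term, doc):
    """Method 1: count of the term divided by the document length."""
    words = doc.split()
    return words.count(term) / len(words)

def tf_log_scaled(term, doc):
    """Method 2: log(1 + count of the term in the document)."""
    return math.log(1 + doc.split().count(term))

doc1 = "a dog live in home"
tf_length_adjusted("dog", doc1)  # 1/5 = 0.2
tf_log_scaled("dog", doc1)       # log(1 + 1) = log 2
```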
Inverse Document Frequency: idf
IDF is a measure of term importance. It is the logarithmically scaled ratio of the total number of documents to the number of documents that contain term t.
Numerator: total number of documents
Denominator: number of documents containing term t
D = [ ‘a dog live in home’, ‘a dog live in the hut’, ‘hut is dog home’ ]
D is the corpus
idf(dog, D) = log( total number of documents (3) / total number of documents with term “dog” (3) ) = log(3/3) = log(1) = 0
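The idf calculation above can be sketched as follows (a plain-Python illustration; it assumes the term occurs in at least one document, otherwise the division by zero would need handling):

```python
import math

def idf(term, corpus):
    """log( total number of documents / number of documents containing the term )."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.split())
    return math.log(n_docs / doc_freq)

D = ["a dog live in home", "a dog live in the hut", "hut is dog home"]
idf("dog", D)  # log(3/3) = 0 -- "dog" is in every document
idf("is", D)   # log(3/1) -- "is" appears in only one document
```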
TF-IDF: tf × idf
We can now compute the TF-IDF score for each term in a document. The score reflects the importance of the word within that document.
As the example above shows, if the term "dog" appears in all the documents, its inverse document frequency is zero, and thus its TF-IDF score is zero. What this basically implies is that a word present in every document has no discriminative value.
TF-IDF makes feature extraction more robust than simply counting the occurrences of a term in a document, as in the bag-of-words model. But it doesn't solve the major drawback of the BoW model: the order and structure of words in the document are still discarded.
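Putting the two components together, a full TF-IDF score is just their product. A minimal sketch using the length-adjusted tf (method 1) and the idf defined above; note that production libraries such as scikit-learn apply smoothed variants, so their exact scores will differ from this textbook formula:

```python
import math

def tf(term, doc):
    # Length-adjusted term frequency (method 1).
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # log( total documents / documents containing the term ).
    doc_freq = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

D = ["a dog live in home", "a dog live in the hut", "hut is dog home"]
tf_idf("dog", D[0], D)   # 0.2 * log(1) = 0 -- "dog" is in every document
tf_idf("home", D[0], D)  # 0.2 * log(3/2) -- "home" is in only two documents
```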
Sparsity: As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them). NLP practitioners usually apply principal component analysis (PCA) to reduce the dimensionality.
Naive Bayes Models: despite their over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification with BoW or TF-IDF features.