NLP Series: Distributional Semantics | Occurrence Matrix

Amit Sehgal
3 min read · Jul 11, 2020

I would like to share some notes on Distributional Semantics. It all starts with the saying, “A word can be identified by the company it keeps.”

Let’s start by looking at an example:

“Every day I go to work in a bingo. There are two options for me: if I miss the 08:30 AM one, I can board the 08:45 AM bingo. The 08:30 one is a fast bingo and takes only 30 mins, while the other takes 40 mins to reach my destination.”

One can easily make out from the paragraph above that “bingo” represents some form of a train. Hence one can derive semantics from the accompanying context, and that context could be words, a document (tweet/post), or a sentence in a document.

Let’s now take a look at two approaches to Distributional Semantics:

  • Occurrence Matrix
  • Co-occurrence Matrix

An Occurrence Matrix has contexts on one axis (document, sentence, tweet, post) and terms (the unique words after stop-word removal, i.e. the vocabulary) on the other axis.

Now, the value in the matrix could be 0/1 based on the existence of the term in that context/document. Or it could be the frequency of the term in that context. Or it could be the more sophisticated tf-idf value of that term in that context.
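As a minimal sketch of these three variants (my own example; the article doesn’t prescribe a library), scikit-learn’s CountVectorizer and TfidfVectorizer can build each kind of matrix from a handful of toy documents:

```python
# Three variants of an occurrence matrix over toy documents.
# Rows = contexts (documents), columns = vocabulary terms.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the wizard cast a magic spell",
    "beer and fun at the party",
    "fear the dark wizard",
]

# Variant 1: 0/1 based on existence of the term in the document
binary = CountVectorizer(binary=True, stop_words="english")
print(binary.fit_transform(docs).toarray())

# Variant 2: raw frequency of the term in the document
counts = CountVectorizer(stop_words="english")
print(counts.fit_transform(docs).toarray())

# Variant 3: tf-idf weighted value of the term in the document
tfidf = TfidfVectorizer(stop_words="english")
print(tfidf.fit_transform(docs).toarray().round(2))
```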

Consider four documents, each of which is a paragraph taken from a movie. Assume that your vocabulary has only the following words: fear, beer, fun, magic, wizard.

The table below summarizes the term-document matrix, each entry representing the frequency of a term used in a movie (only the two rows used in the worked example later are recoverable here):

| Movie | fear | beer | fun | magic | wizard |
| --- | --- | --- | --- | --- | --- |
| Harry Potter and the Sorcerer’s Stone | 10 | 0 | 6 | 18 | 20 |
| The Prestige | 8 | 0 | 5 | 25 | 8 |

This is the term frequency matrix: each cell represents the frequency of occurrence of the term in the context.

Application of Occurrence Matrix

Now, notice that a single document (which could be a tweet) can be represented as a vector of features whose length equals the number of terms.

e.g. Vector(‘The Prestige’) = 8*‘fear’ + 0*‘beer’ + 5*‘fun’ + 25*‘magic’ + 8*‘wizard’

In ML terms, the entire document for a given movie can now be represented as a numerical vector. Hence this is one approach to text vectorization.
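A tiny sketch of that idea, using the numbers from the example (the vocabulary order and variable names are my own scaffolding):

```python
# 'The Prestige' as a numerical vector over a fixed vocabulary order.
vocabulary = ["fear", "beer", "fun", "magic", "wizard"]
the_prestige = {"fear": 8, "beer": 0, "fun": 5, "magic": 25, "wizard": 8}

vector = [the_prestige[term] for term in vocabulary]
print(vector)  # [8, 0, 5, 25, 8]
```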

This now enables us to perform mathematics on text data and apply some data science too, e.g. you can now work out mathematically which two movies are potentially of a similar genre. Wondering how?

If you remember the dot product of vectors: the higher the dot product, the more similar the vectors are. We can use that property here.

The two movies Harry Potter and the Sorcerer’s Stone and The Prestige have the largest dot product among all pairs, so these two are the most similar movies in the given sample.

The vector for Harry Potter and the Sorcerer’s Stone is (10, 0, 6, 18, 20) and the vector for The Prestige is (8, 0, 5, 25, 8). The dot product is (10, 0, 6, 18, 20) · (8, 0, 5, 25, 8) = 10·8 + 0·0 + 6·5 + 18·25 + 20·8 = 80 + 0 + 30 + 450 + 160 = 720.
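The same computation as a quick Python check (a sketch; the vectors are the ones given above):

```python
# Dot product of the two movie vectors, in vocabulary order:
# [fear, beer, fun, magic, wizard].
harry_potter = [10, 0, 6, 18, 20]  # Harry Potter and the Sorcerer's Stone
the_prestige = [8, 0, 5, 25, 8]    # The Prestige

similarity = sum(a * b for a, b in zip(harry_potter, the_prestige))
print(similarity)  # 720
```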

Appendix

What are TF-IDF, Term Frequency, and Inverse Document Frequency?

The Term Frequency (TF) depicts how often a term occurs in a given document relative to the total number of terms in that document:

TF = frequency of term ‘t’ in document ‘d’ / total terms in document ‘d’

The Inverse Document Frequency (IDF) depicts the uniqueness of the word, i.e. the inverse of how many documents contain that term. A higher IDF raises the importance of the term for that particular document:

IDF = log(No. of documents / No. of documents containing term ‘t’)

Why log? To “scale down” the impact of a large number of documents, e.g. if a term appears in 100 out of 1 billion documents, then IDF = log10(1,000,000,000 / 100) = 7. The log keeps the IDF portion from growing so large that it nullifies the impact of TF.

TF-IDF is the product of the two: TF-IDF = TF * IDF.
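Putting the three formulas together, here is a small from-scratch sketch (my own illustration, assuming whitespace tokenization and that the term appears in at least one document):

```python
import math

def tf(term, doc):
    # TF = frequency of term 't' in document 'd' / total terms in 'd'
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # IDF = log(No. of documents / No. of documents containing term 't');
    # assumes 'term' occurs in at least one document.
    containing = sum(1 for d in docs if term in d.split())
    return math.log10(len(docs) / containing)

def tf_idf(term, doc, docs):
    # TF-IDF = TF * IDF
    return tf(term, doc) * idf(term, docs)

docs = [
    "the wizard cast a magic spell",
    "beer and fun at the party",
    "fear the dark wizard",
]
print(round(tf_idf("wizard", docs[0], docs), 3))  # ~0.029
```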
