Simplified Vector Space Model in Information Retrieval

Hafidz Jazuli
Aug 28, 2017 · 3 min read

Thirty-seven years ago, Gerard Salton introduced a new way to represent words in a dimensional space [1]. Since then, interest in information retrieval has grown again after a period of stagnation. Why is this model so interesting?

I first came across Salton’s model when I read Grossman’s book, Information Retrieval: Algorithms and Heuristics. Although the book is not entirely comfortable for beginners, Grossman’s fast-paced explanations will at least challenge your intuition, guiding you through the author’s view of many significant and recent publications. You will find the vector space model in the first subchapter of the second chapter. Grossman explains the vector space model much as Salton explained it in his paper. “The intuition…”, he said, “…is that if we can represent words as trajectories in a space, then we can calculate the similarity between them.”

Like a fundamental concept of classical physics, a vector represents a physical object by its magnitude in space. In the context of language, however, we use a weight rather than a magnitude. The intuition behind a weight is how important the words in a document are that match the search need (the query). For example, suppose we need to search for “vector space” and have three documents in the collection: D1, D2, and D3. They contain the matched words “… vector … model …”, “… language … model …”, and “… vector … space …” respectively. Since the query consists of two words, we have a two-dimensional space (see the image below). By looking at our drawing of the vectors, we conclude that vector D3 matches our ‘information need’. This picture still works if we have thousands or even millions of documents, but it becomes difficult to draw even a four-dimensional space defined by only four terms. So, if we represent a query and the documents as vectors in a space, then we can assume that the closer a document’s vector is to the query’s vector, the higher its similarity to the query.
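The two-term example above can be sketched in a few lines of code. This is a minimal illustration, not Salton’s or Grossman’s implementation: it assumes the simplest possible weighting, a plain term count, and the document contents are the ones from the example.

```python
# Minimal sketch of the "vector space" example: each document becomes a
# vector over the query's term axes, using raw term counts as weights.

query_terms = ["vector", "space"]

# Matched words from the example documents D1, D2, and D3.
documents = {
    "D1": ["vector", "model"],
    "D2": ["language", "model"],
    "D3": ["vector", "space"],
}

def to_vector(words, axes):
    """Represent a bag of words as a vector of term counts over the given axes."""
    return [words.count(term) for term in axes]

query_vec = to_vector(query_terms, query_terms)
doc_vecs = {name: to_vector(words, query_terms) for name, words in documents.items()}
# D1 becomes [1, 0], D2 becomes [0, 0], D3 becomes [1, 1];
# D3 coincides with the query vector [1, 1], matching the drawing.
```

With only two axes the vectors are easy to plot by hand, which is exactly why the example in the text uses a two-word query.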

If we continue by feeding many possible queries into the space, then, naively, we can create indexes in which each document is indexed by the terms along which its vector has some magnitude. We can further extend this model to measure how similar a document is to the query (relevance ranking) by calculating the distance between each document’s vector and the query’s vector. You may use the dot product, or the angle between the vectors, to measure this. Another case where we can still use this model is document clustering: simply apply the same similarity concept between vectors to find a centroid and a maximum distance for each group.
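The angle-based ranking described above is usually computed as cosine similarity: the smaller the angle between a document vector and the query vector, the higher the score. Here is a hedged sketch using the vectors from the earlier example; the plain term-count weights are an assumption for brevity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [1, 1]  # "vector space"
docs = {"D1": [1, 0], "D2": [0, 0], "D3": [1, 1]}

# Rank documents by descending similarity to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
# D3 ranks first (similarity 1.0), then D1 (about 0.707), then D2 (0.0).
```

The same similarity function can drive the clustering idea mentioned above: assign each document to the nearest centroid, where "nearest" means highest cosine similarity.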

Even though the vector space model has been good enough to visualize similarity, it is rarely used on its own in modern information retrieval. The angle calculation alone is not good enough to serve as proper term weighting. Once again, more exhaustive statistical calculation has proved a better way to find proper term weights, as applications of Natural Language Processing (NLP) such as Information Retrieval (IR) demand.

:)
