Restructuring Sparse High Dimensional Data for Effective Retrieval

Kirils Mensikovs
Published in I Am The Bot
Jan 29, 2017

The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined as the dot product between a document vector and a query vector, a measure of the number of terms they have in common. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a technique for extracting underlying structure from the document collection. In some domains (such as vision), dimensionality reduction is used to reduce computational complexity. In text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse representations constructed from sets of highly related words. Documents and queries are represented by their distance to these sets, and relevance is measured by the number of common clusters. This technique significantly improves retrieval performance, is efficient to compute, and shares properties with the optimal linear projection operator and the independent components of documents.
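The classical baseline described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and two made-up documents; the helper names `count_vector` and `relevance` are illustrative and do not come from the original work.

```python
from collections import Counter


def count_vector(text):
    """Return a sparse word-count vector (word -> count) for a text.

    Assumption: lowercase, whitespace-separated tokens are good enough
    for illustration; a real system would use a proper tokenizer.
    """
    return Counter(text.lower().split())


def relevance(doc_vec, query_vec):
    """Dot product of two sparse count vectors.

    Only words present in the query can contribute, so we iterate over
    the query. A Counter returns 0 for words it has not seen.
    """
    return sum(count * doc_vec[word] for word, count in query_vec.items())


# Hypothetical toy collection, not data from the original work.
docs = {
    "d1": "the car dealer sold the car",
    "d2": "the automobile dealer closed",
}

query = count_vector("car dealer")
scores = {name: relevance(count_vector(text), query) for name, text in docs.items()}
print(scores)  # {'d1': 3, 'd2': 1}
```

In this toy collection, the query "automobile" scores 0 against `d1` even though `d1` is about cars. That is the difficulty noted above: the presence or absence of a word alone does not determine relevance.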
