Information Retrieval- A Brief Overview

Soumya Shukla
5 min readApr 20, 2019

--

Information retrieval is the activity of finding material (usually documents) of an unstructured nature (usually text) from a large collection of documents, that satisfies an information need of the user. Any system designed to perform this kind of retrieval is called an Information Retrieval System.

Information retrieval can be performed on the web, within emails, on a standalone machine (laptop) or in a company’s knowledge base. In the mid-90s, the advent of the web and later social media gave rise to a lot of unstructured information being shared and uploaded on a daily basis on the internet. This created a need to have a good Information Retrieval system (Search Engine).

Examples: Google, Yahoo, Bing are general search engines and specific ones are the web of science, DBLP, Google Scholar, etc.

Information Retrieval Architecture

Document and Query Representation:

Representation is used to make certain features/attributes explicit for a system. It is also used to push irrelevant things to the background in order to reduce processing time.

The documents and query (which form the input unit of the IR system), are mapped from an external representation to an internal representation called Indexing, which is then stored in a system. Real world example for a need of good representation is Plato’s cave which can be pursued in different ways by different viewers. This also helps answer a fundamental question that, “Why is a system not able to perform 100% for all users?”, which is because the information the system contains is never complete for all types of users.

The mapping of external representation to internal form comes with a cost of some information loss. Nevertheless, this enhances the retrieval time.

For example, Externally a document or query may be represented in an unstructured text format initially which is then converted to a token or group of words with a category or structure to it. They could further also be represented in binary format which is easy for a computer to store and retrieve.

Indexing can be of two types; 1. Manual or 2. Automatic

Methods to convert a document into a list of terms; 1. Using language or semantic based methods in NLP and Using Statistical Approach.

Statistical Approach is frequency-based method and makes certain assumptions for creating a Model. Some of the assumptions for document content are as below;

a. A quantitative approach with first order approximation assumes that, if a document contains a word, then it is about the word. In simple terms, this means the document will be relevant to a query if it contains one or more words mentioned in the query.

b. The number of appearances of a word is related to its importance. This means if a document contains a highly frequent word, this word is more important. Note: This assumption is not always true. Stop Words are an exception to it.

After Indexing, we have a Ranked List of words along with their frequencies (number of occurrences in a document). A graph of frequency versus rank is given below;

The hyperbolic plot of the frequency of words Vs words by rank order specifies 2 cut-offs. Upper cut-off consists of common/stop words and lower cut-off contains rear words.

While Stemming and Term class are methods to increase the frequency of words, Phrase construction is for decreasing frequency.

Retrieval/Matching function

Retrieval is a process of matching the documents with the user’s query and returning back the desired result. This is done based on a retrieval function which returns a list of documents ordered as per their relevance with respect to the query. Retrieval function can be different in different IR models.

Relevance Feedback

The idea of relevance feedback is to learn and modify existing knowledge, which can be achieved by learning from examples or correcting errors as per user feedback.

Query Reformulation

Query reformulation is done to refine the query with respect to the user’s feedback. Probabilistic retrieval model is used for reformulation which considers that document representation is given, and a query is not given but is estimated from the sample.

Document Re-indexing

Document re-indexing is the process of reorganizing the documents with respect to the user’s feedback. Probabilistic indexing model is used for this, which considers the given query (set of queries) and the document is not given. The model tries to learn a document by looking at the given queries.

System Evaluation

Evaluation is used to decide how good a system is with respect to the user and within a given set of systems. For evaluation of system performance one of the two measures can be used; Distance-based measure (precision, recall, fallout) and Normalized performance measures.

IR Models

Putting the above concepts in context, the below section elaborates on various IR models available in the literature with their practical implications. Different categories of retrieval model include Boolean, Vector space, probability distribution model, and Probabilistic Models.

For any IR Model three things are studied;

a. Document Representation

b. Query representation

c. Retrieval/Matching function

Boolean Model — This model is based on Boolean expressions and Boolean queries. Boolean queries are systems for specifying information needs which gives an exact match if the Boolean expression is satisfied. The user can ask for a specific term or a complex expression which is a Boolean combination of multiple terms. To form this combination of terms, Boolean operations such as AND, OR, NOT are used.

Vector Space Model — In this model, the documents and query are represented as a vector and based on their similarity, documents are retrieved. The vectors can be either binary in Boolean VSM or weighted in Non-binary VSM which can be used for ranking.

Probability Distribution Model — In probability distribution model, a document is represented as a distribution of terms and based on query representation the matching is done as similarity using entropy or by calculating the expected utility of document. The distribution model can be based on 2 types; Similarity-based and Expected utility based

Probabilistic Models — This model is based on the principle of probability ranking. The idea is to rank documents according to their probability of being relevant to a query, given all pieces of evidence available Pr(R|x).

Retrieval Model is used in query reformulation, Indexing Model for re-indexing and

Unified Model attempts to form a generalized framework which is capable of using all the available information for different matching problems.

--

--