Information Extraction

When we index content for search, it’s natural to think in terms of documents. But searchers aren’t necessarily looking for documents. They’re looking for information. That information may be a small part of a document or may be distributed across multiple documents.

There’s no single approach to reliably extract all of the useful information from a collection of documents. Moreover, what will be useful to a searcher depends on that searcher’s needs, not all of which can be anticipated.

Still, there are ways to extract information from a document collection and make it more findable. Let’s explore some of them.

As discussed in an earlier post on content annotation, entities are members of a controlled vocabulary. A controlled vocabulary can be associated with a particular entity type (e.g., people, organizations, locations) or it can be a collection of untyped entities (e.g., technical terms).

The most valuable information in documents often relates to entities, so it makes sense for indexing to pay special attention to sentences or paragraphs that mention entities. The rest of the document may not be about that entity — that’s more of a content classification question — but that portion of the document could contain useful information for someone interested in the entity it mentions.
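As a rough illustration of paying special attention to the sentences that mention entities, here is a minimal Python sketch. The vocabulary, the naive sentence splitter, and the function names are all assumptions made for the example; a real system would use a proper entity recognizer and a much larger controlled vocabulary.

```python
import re
from collections import defaultdict

# Hypothetical controlled vocabulary: surface form -> entity type.
VOCABULARY = {
    "Joe Biden": "person",
    "United States": "location",
    "Acme Corp": "organization",
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def index_entity_mentions(doc_id, text):
    """Map each vocabulary entity to the (doc_id, sentence) pairs mentioning it."""
    mentions = defaultdict(list)
    for sentence in split_sentences(text):
        for entity in VOCABULARY:
            # Whole-phrase match so "Acme Corp" does not match "Acme Corporate".
            if re.search(r"\b" + re.escape(entity) + r"\b", sentence):
                mentions[entity].append((doc_id, sentence))
    return dict(mentions)

text = ("Joe Biden visited Acme Corp. The weather was mild. "
        "Acme Corp is based in the United States.")
mentions = index_entity_mentions("d1", text)
```

Only the first and third sentences end up in the mention index; the sentence about the weather mentions no entity and is left to ordinary document indexing.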

Recognizing that a document mentions an entity is useful, but it’s even more useful to understand the context of that mention. For example, a sentence might relate a person to a location or organization, or it might tell us a person’s age. Indexing this sort of information requires us to take a step beyond entity extraction to relationship extraction.

One approach to relationship extraction is to recognize patterns in a document, whether through rules or machine learning. The relationships can be stored in a triplestore or semantic network, a knowledge base that represents information in the form of subject–predicate–object triples, the predicate indicating the relationship between the subject and object entities. For example, a triplestore might represent the information that Joe Biden is the president of the United States as the triple (Joe Biden, president of, United States), and that he was born in 1942 as (Joe Biden, born in, 1942).
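To make the rule-based variant concrete, here is a small sketch that turns two hand-written patterns into triples and queries them from an in-memory list. The patterns, predicate names, and helper functions are hypothetical; a production system would more likely use learned extractors and a real triplestore.

```python
import re

# Crude pattern for a capitalized name such as "Joe Biden" (an assumption).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical rules: predicate label plus a pattern with subj/obj groups.
PATTERNS = [
    ("president_of",
     re.compile(rf"(?P<subj>{NAME}) is the president of (?:the )?(?P<obj>{NAME})")),
    ("born_in",
     re.compile(rf"(?P<subj>{NAME}) was born in (?P<obj>\d{{4}})")),
]

def extract_triples(text):
    """Return (subject, predicate, object) triples found by the rules."""
    triples = []
    for predicate, pattern in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match["subj"], predicate, match["obj"]))
    return triples

def query(triples, subject=None, predicate=None):
    """Filter triples by subject and/or predicate; None matches anything."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)]

text = ("Joe Biden is the president of the United States. "
        "Joe Biden was born in 1942.")
triples = extract_triples(text)
born = query(triples, subject="Joe Biden", predicate="born_in")
```

Rules like these are precise but brittle: they miss paraphrases such as "the 1942-born president", which is where machine-learned extractors tend to help.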

It’s also possible to extract relationships from documents at query time using a question answering approach. For this to work well, optimize the index for passage retrieval and not just document retrieval, so that sentences or paragraphs are effectively indexed as their own documents. Implementing passage retrieval efficiently depends on the architecture of the underlying search platform, particularly how it handles nested document structure.
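The idea of indexing passages as their own documents can be sketched in a few lines of Python. This toy version splits on blank lines, keeps a pointer back to the parent document, and ranks passages by raw term overlap; the field names and scoring are assumptions for illustration, not how any particular search platform implements nested documents.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_passage_index(documents):
    """Index every non-empty paragraph as its own unit, linked to its parent."""
    passages = []
    for doc_id, text in documents.items():
        paragraphs = (p for p in text.split("\n\n") if p.strip())
        for position, paragraph in enumerate(paragraphs):
            passages.append({
                "doc_id": doc_id,          # parent document
                "position": position,      # paragraph order within the parent
                "text": paragraph.strip(),
                "terms": Counter(tokenize(paragraph)),
            })
    return passages

def search_passages(passages, query, top_k=3):
    """Rank passages by how often they contain the query terms."""
    query_terms = set(tokenize(query))
    scored = []
    for passage in passages:
        score = sum(passage["terms"][term] for term in query_terms)
        if score > 0:
            scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]

documents = {
    "bio": "Joe Biden was born in 1942.\n\nHe later served as a senator.",
    "geo": "The United States has fifty states.",
}
index = build_passage_index(documents)
results = search_passages(index, "when was biden born")
```

Because each hit carries its `doc_id` and `position`, a question answering component can read just the matching paragraph while the application can still link back to the full document.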

Sometimes we’re interested in information that is subjective rather than objective. For example, we might want to know whether people like or dislike a product, as well as the main reasons for their likes or dislikes.

Extracting this kind of affective or subjective information is known as sentiment analysis or opinion mining. In general, sentiment analysis is a form of content classification that can be applied at the document or passage level. As such, it is especially amenable to neural text classification methods, such as the many sentiment models available on Hugging Face.

Indexing the results of sentiment analysis allows searchers to ask subjective questions about the information contained in the documents.
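The sketch below shows the shape of that indexing step: classify each passage, then store the label as a field that queries can filter on. To keep it self-contained, it uses a tiny keyword lexicon as a stand-in for a neural classifier; the word lists and function names are hypothetical, and a real application would swap in a trained model.

```python
import re

# Hypothetical toy lexicons standing in for a trained sentiment model.
POSITIVE = {"love", "great", "excellent", "good", "comfortable"}
NEGATIVE = {"hate", "bad", "poor", "broke", "uncomfortable"}

def classify_sentiment(text):
    """Label text by counting positive versus negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def annotate(reviews):
    """Attach a sentiment field to each review so it can be indexed."""
    return [{"text": r, "sentiment": classify_sentiment(r)} for r in reviews]

reviews = [
    "I love these shoes, very comfortable.",
    "The strap broke after a week.",
    "It arrived on Tuesday.",
]
indexed = annotate(reviews)
negative_reviews = [r["text"] for r in indexed if r["sentiment"] == "negative"]
```

With the label stored alongside the text, a question like "what do people dislike about this product?" becomes a filter on `sentiment` followed by ordinary retrieval.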

As discussed in an earlier post on content structure, content understanding should start with representing what a document is about. A summary not only serves as an efficient representation of the document, but may also make the information in the document more salient.

Broadly speaking, there are two approaches for document summarization.

Extractive summarization extracts the tokens, phrases, or sentences from the document that (hopefully) best summarize it. Methods for identifying the text to extract range from simple statistical approaches (like tf-idf) to more sophisticated keyword extraction methods. And, of course, cutting-edge summarization approaches rely on deep learning.

Abstractive summarization is more ambitious, generating new sentences rather than simply extracting content from the original document. Abstractive summarization is a hot area of deep learning — specifically sequence-to-sequence (seq2seq) learning — since it can be framed as mapping a sequence of tokens (the original document) to a summary.
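For the extractive approach, the simple statistical end of the spectrum can be sketched as tf-idf sentence scoring: weight each term by how frequent it is in the document and how rare it is across the collection, then keep the sentences with the highest average weight. The smoothing formula, the sentence splitter, and the function names here are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(document, corpus, num_sentences=1):
    """Return the top-scoring sentences of `document`, in original order."""
    # Document frequency: in how many corpus texts does each term appear?
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    n_docs = len(corpus)
    term_freq = Counter(tokenize(document))

    def weight(term):
        # tf times a smoothed idf; terms found in every text get weight 0.
        return term_freq[term] * math.log((1 + n_docs) / (1 + doc_freq[term]))

    sentences = split_sentences(document)
    scores = []
    for sentence in sentences:
        terms = tokenize(sentence)
        # Average weight, so long sentences are not favored for length alone.
        scores.append(sum(weight(t) for t in terms) / len(terms) if terms else 0.0)

    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]

doc = "Solar panels convert sunlight into electricity. This is what it is."
corpus = [doc, "This is what it is.", "What is this? It is what it is."]
summary = summarize(doc, corpus)
```

The second sentence consists entirely of words that appear in every text in the collection, so it scores zero and the sentence about solar panels is selected.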

As mentioned in the introduction, a searcher may be searching for information that is distributed across multiple documents. How can a search engine make this information findable?

In general, synthesizing information across multiple documents is harder than extracting or summarizing information from a single document.

But one way that a search engine can synthesize information across multiple documents is aggregation. For example, aggregation can tell us how many people reviewed a product and what fraction of reviews were favorable. A more sophisticated use of aggregation might return a histogram, or could compare the price of a product to other products like it, perhaps expressing it as a percentile.
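These aggregations are straightforward to sketch. The example below assumes each review already carries a `sentiment` label and a numeric `rating`, both hypothetical field names, and defines the percentile as the share of comparable products priced at or below the given price.

```python
from collections import Counter

def review_summary(reviews):
    """Count reviews and compute the fraction labeled positive."""
    count = len(reviews)
    favorable = sum(1 for r in reviews if r["sentiment"] == "positive")
    return {
        "count": count,
        "favorable_fraction": favorable / count if count else 0.0,
    }

def rating_histogram(reviews):
    """Number of reviews per rating value, ordered by rating."""
    return dict(sorted(Counter(r["rating"] for r in reviews).items()))

def price_percentile(price, comparable_prices):
    """Percentage of comparable products priced at or below `price`."""
    if not comparable_prices:
        return None
    at_or_below = sum(1 for p in comparable_prices if p <= price)
    return 100.0 * at_or_below / len(comparable_prices)

reviews = [
    {"rating": 5, "sentiment": "positive"},
    {"rating": 4, "sentiment": "positive"},
    {"rating": 5, "sentiment": "positive"},
    {"rating": 2, "sentiment": "negative"},
]
summary = review_summary(reviews)
histogram = rating_histogram(reviews)
percentile = price_percentile(30, [10, 20, 30, 40])
```

Most search platforms expose similar operations as built-in aggregations or facets, so in practice this logic usually lives in the query rather than in application code.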

In general, aggregation, whether performed offline as part of indexing or applied online to search results, can be a powerful way to distill a signal that is distributed across the document collection.

All of these techniques serve as a reminder that the authors or processes that create documents do not always organize information in the same way that searchers want to access it. Part of our job as search application developers is to deliver an information architecture with the flexibility to accommodate a diversity of information needs. Remember: most searchers want information, not documents.



