One of the challenges facing search applications is that users aren’t necessarily searching for documents. To put it another way, the unit of content that a user is looking for doesn’t necessarily correspond to a unit of content in the search index.
Given this potential mismatch, it’s useful to understand content in terms of its structure. By representing content structure in the index and making the query processing more aware of that structure, a search engine can more effectively and efficiently connect users to the information they seek.
Content understanding should start with representing what a document is about. A natural and efficient way to do so is with a document summary.
Documents sometimes include natural summaries. The title of a document often serves as a summary, albeit a short one. Unfortunately, not all document titles are summaries: some are generic, while others are more focused on luring readers than informing them about the document content. Still, titles tend to be useful. And more formal documents usually include an abstract that is intended to serve as a summary — the original “tl;dr”.
But we can’t always rely on natural summaries. Titles, even if they are focused on document content, can only represent so much in a few words. Informal documents rarely include abstracts. Moreover, these elements are usually designed for human consumption rather than search indexing — let alone for retrieval and ranking using a traditional inverted index. Hence, there’s often value in automatically generating a document summary that is suitable for search indexing, to support retrieval and ranking.
Broadly speaking, there are two approaches for document summarization:
- Extractive summarization extracts the tokens, phrases, or sentences from the document that (hopefully) best summarize it. Methods for identifying the text to extract range from simple statistical approaches (like tf-idf) to more sophisticated keyword extraction methods. And, of course, cutting-edge summarization approaches rely on deep learning.
- Abstractive summarization is more ambitious, generating new sentences rather than simply extracting content from the original document. Abstractive summarization is a hot area of deep learning — specifically sequence-to-sequence (seq2seq) learning — since it can be framed as mapping a sequence of tokens (the original document) to a summary.
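To make the extractive approach concrete, here is a minimal sketch of a tf-idf extractive summarizer. It scores each sentence by the mean tf-idf weight of its terms and keeps the top scorers in document order. The naive regex sentence splitter, the smoothed idf, and the mean-based scoring are all simplifying assumptions, not a production recipe:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def extractive_summary(doc: str, corpus: list[str], num_sentences: int = 2) -> str:
    """Score each sentence by the mean tf-idf weight of its terms,
    then return the top-scoring sentences in document order."""
    # Document frequency of each term across the corpus.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n_docs = len(corpus)

    # Term frequency within the document being summarized.
    tf = Counter(tokenize(doc))

    def sentence_score(sentence: str) -> float:
        terms = tokenize(sentence)
        if not terms:
            return 0.0
        # Smoothed idf avoids division by zero for terms unseen in the corpus.
        return sum(
            tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in terms
        ) / len(terms)

    # Naive sentence splitting on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    ranked = sorted(
        range(len(sentences)), key=lambda i: sentence_score(sentences[i]), reverse=True
    )
    keep = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Real systems replace nearly every piece of this — better tokenization, sentence detection, and redundancy handling — but the core idea is the same: score candidate spans and extract the best ones.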
Regardless of the approach, content summarization is a valuable tool to obtain a document representation that promotes its signal and trims out the noise. Such a representation is especially useful for topic-oriented searches.
But, as we mentioned at the outset, sometimes the user isn’t looking for a document. More specifically, the information that the user is looking for doesn’t necessarily correspond to an entire document. In that case, the user might be better served by part of a document.
As with summaries, documents often provide a natural segmentation. Documents generally break up text into sentences and paragraphs. Longer documents are often split into logical sections, sometimes even providing the reader with a table of contents as a guide.
To the extent that a document’s length reflects how much information it contains, it’s likely that a searcher will be better served by part of a long document than by the whole document. For example, the user might only be interested in a fact expressed in a single sentence. In general, long documents are likely to address a broad variety of information needs, and it’s painful for searchers to have to pore over thousands of words, hunting for the one sentence or paragraph that addresses their particular need.
The question is how a search engine can best direct searchers to the relevant document excerpts. There are a few approaches:
- Heuristic or rules-based document segmentation, such as splitting documents based on formatting elements. This approach can be effective if document formatting is consistent across large portions of the index, but it will struggle with unusual formatting, or if the formatting conventions vary significantly across the index. Still, it’s a good place to start.
- Machine learning-based document segmentation that learns from a labeled collection of training data that includes boundary markers. Like other sequence labeling problems, it was historically addressed with approaches like hidden Markov models (HMM) and conditional random fields (CRF); but a more modern approach would be to use deep learning — specifically, a long short-term memory (LSTM) model.
- Search result snippets, also known as query-biased summaries, generate query-dependent summaries that generally extract a span of text from each result containing the query keywords. More sophisticated approaches synthesize multiple such spans. The dynamic nature of this approach makes it flexible, but it comes at a cost: much of the computation has to take place at query time, which imposes strict latency requirements.
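As an illustration of the first, rules-based approach, here is a minimal sketch that segments a Markdown-style document at heading lines, keeping each heading as the segment’s title. The heading regex and the dict-based segment shape are assumptions chosen for brevity:

```python
import re

def segment_by_headings(doc: str) -> list[dict]:
    """Split a Markdown-style document into sections at heading lines.
    Each segment keeps its heading as a title plus the body text,
    so segments can be indexed as individual retrieval units."""
    segments = []
    title, body = "", []  # text before the first heading gets an empty title
    for line in doc.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            # Close out the previous segment before starting a new one.
            if any(s.strip() for s in body):
                segments.append({"title": title, "text": "\n".join(body).strip()})
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if any(s.strip() for s in body):
        segments.append({"title": title, "text": "\n".join(body).strip()})
    return segments
```

This is exactly the kind of heuristic that works well when formatting is consistent across the index and degrades when it isn’t — which is why it’s a good starting point rather than a complete solution.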
Whether content segmentation is an explicit part of indexing or an implicit part of query processing, its goal should be to establish units of content that best map to the needs expressed by search queries. And it’s important to preserve how these segments are related to one another, especially so that parts of documents inherit appropriate attributes — like authorship — from the documents they belong to.
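The query-biased snippet approach described above can also be sketched simply: slide a fixed-size window over the document’s tokens and pick the window that covers the most distinct query terms. The window size, punctuation stripping, and tie-breaking rule here are illustrative assumptions; a real snippet generator would also weight term proximity, sentence boundaries, and term importance:

```python
import re

def query_biased_snippet(doc: str, query: str, window: int = 10) -> str:
    """Return the window of `window` consecutive tokens that covers the
    most distinct query terms; ties go to the earliest window."""
    tokens = re.findall(r"\S+", doc)
    if not tokens:
        return ""
    query_terms = {t.lower().strip(".,;:!?") for t in query.split()}

    def coverage(start: int) -> int:
        span = tokens[start : start + window]
        return len({t.lower().strip(".,;:!?") for t in span} & query_terms)

    # max() keeps the first of several equally good windows.
    best = max(range(max(1, len(tokens) - window + 1)), key=coverage)
    snippet = " ".join(tokens[best : best + window])
    prefix = "… " if best > 0 else ""
    suffix = " …" if best + window < len(tokens) else ""
    return prefix + snippet + suffix
```

Even this toy version makes the latency trade-off visible: the scoring loop runs over every result at query time, which is why production snippet generators lean heavily on precomputed positions and early termination.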
Content summarization and segmentation are important but challenging problems. Done well, they establish what a document is about, as well as breaking it up into more useful units. In a world where users increasingly expect search engines to be “answer engines”, it’s important to invest in understanding content structure and directing users as precisely as possible to the information they’re looking for.