What is Content Understanding?
There’s a lot of writing about search — especially about ranking and relevance. And recently there’s been an increased focus on the particular challenges of query understanding. But, surprisingly, there hasn’t been much discussion of content understanding. This post will be the first of many to address this gap.
Content understanding is the foundation of the search process.
Let’s first place content understanding in the context of the search process. At a high level, search works as follows:
- Content understanding represents each piece of content in the index.
- Query understanding represents each search query as a search intent.
- Relevance of content is a function of query and content understanding.
- Ranking orders the relevant, retrieved content by its desirability.
As we can see, content understanding is the first step — and thus the foundation — of the search process. Without indexed content, we don’t have search. And without content understanding, we can’t have robust search.
So now we know where content understanding fits into the search process and why it’s so important. But what does content understanding entail?
Content understanding is what makes content findable.
Indexing text documents by their words (aka tokens) in an inverted index (a mapping from each token to its posting list of documents) is perhaps the most naive form of content understanding. Even this naive process typically requires some of the same steps used in query understanding, such as case folding (more generally, character filtering) and stemming. Indeed, it’s critical to align the text processing of query understanding with that of content understanding, which is usually accomplished by sharing the same text analyzer.
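To make the point about a shared analyzer concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names (`analyze`, `stem`, `build_index`, `search`), the toy stemmer that only strips a trailing "s", and the three sample documents. A real search engine would use a proper analyzer and stemmer, but the shape is the same: one `analyze` function runs at both index time and query time.

```python
import re
from collections import defaultdict


def stem(token):
    # Toy stemmer (an assumption, not a real algorithm such as Porter):
    # strip one trailing "s" from tokens longer than 3 characters.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    # Shared analyzer: case-fold, split on non-alphanumerics, then stem.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens]


def build_index(docs):
    # Inverted index: token -> set of document ids (its posting list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    # Analyze the query with the SAME analyzer used for the content,
    # then return the documents that contain every query token.
    tokens = analyze(query)
    if not tokens:
        return set()
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings)


# Hypothetical sample content.
docs = {
    1: "Maine Coon cats",
    2: "Siamese cat photos",
    3: "Dog training tips",
}
index = build_index(docs)

print(search(index, "Cats"))       # documents 1 and 2
print(search(index, "cat PHOTO"))  # document 2
```

Because the query "Cats" and the content "cat" pass through the same case folding and stemming, they meet at the same token. If the query side skipped either step, the lookup would miss.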
But these simple string processing steps are only the beginning of content understanding. Much as query understanding analyzes queries through a combination of holistic and reductionist techniques to categorize queries and recognize entities, robust content understanding uses similar methods to transform raw content into a more useful representation. Unlike queries, however, content tends to be extremely varied in both size (e.g., long-form articles) and format (e.g., images).
Content and query understanding establish a virtuous cycle.
It’s possible to implement content understanding in a vacuum, much as it’s possible to implement query understanding in a vacuum. Many techniques for document classification and annotation can be useful for enriching content.
But content and query understanding work best when they work together. Given a mapping of queries to content, e.g., through engagement data, it is possible to implement query understanding by aggregating the content understanding of the associated content. Conversely, we can infer content understanding from the query understanding of associated queries. While we have to be careful to avoid circular feedback loops, we should aim to take advantage of the mapping between content and queries as much as possible.
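As a rough illustration of the first direction, the sketch below infers a category distribution for each query by aggregating the categories of the content that searchers engaged with. The data shapes are assumptions: `clicks` is a hypothetical list of `(query, content_id, count)` tuples, and `content_categories` is a hypothetical mapping produced by some content classifier. The function name `infer_query_categories` is likewise made up for this example.

```python
from collections import Counter, defaultdict


def infer_query_categories(clicks, content_categories):
    # Aggregate engagement per query by the category of the engaged content.
    totals = defaultdict(Counter)
    for query, content_id, count in clicks:
        category = content_categories.get(content_id)
        if category is not None:  # skip content we have no category for
            totals[query][category] += count

    # Normalize counts into a per-query distribution over categories.
    result = {}
    for query, counter in totals.items():
        total = sum(counter.values())
        result[query] = {cat: n / total for cat, n in counter.items()}
    return result


# Hypothetical content understanding: content id -> category.
content_categories = {"p1": "cats", "p2": "cats", "p3": "dogs"}

# Hypothetical engagement data: (query, content id, click count).
clicks = [
    ("maine coon", "p1", 8),
    ("maine coon", "p2", 1),
    ("maine coon", "p3", 1),
    ("puppy", "p3", 5),
]

query_categories = infer_query_categories(clicks, content_categories)
print(query_categories["maine coon"])  # mostly "cats", a little "dogs"
```

The same aggregation can run in the other direction, from query categories to content. To limit the circularity mentioned above, one option is to make sure the labels being aggregated come from an independent signal (here, a content-side classifier) rather than from labels that were themselves inferred from the same clicks.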
As with query understanding, the devil is in the details.
Content understanding is fundamental to search — even more fundamental than query understanding. But being fundamental doesn’t make it easy.
In particular, it can be challenging to apply generic content understanding techniques to your particular content. Classifiers and annotators, whether for text, images, or any other kinds of content, need models that are not only trained for your content but also operate at the right level of granularity. For example, a classifier trained to recognize cats and dogs may not be able to distinguish between a Maine Coon and a Siamese. That might not matter in some contexts, but it may be quite important to someone searching for cat photos.
Content understanding is a challenge, but it’s a challenge at the heart of delivering great search. This publication will strive to offer practical tips for addressing this underserved challenge.
Next: Content Classification