Content classification and annotation offer useful approaches for content understanding, recognizing whether a piece of content is about a particular topic or mentions a particular entity. But most content exists in a space that is too rich to reduce to classification and annotation. A document is more than a category and a bag of entities. For content understanding to be worthy of the name, it needs to embrace the richness of the content it represents.
A more granular approach to content understanding focuses on the similarity between documents. A document is trivially identical to itself — that is, on a scale of 0 to 1, its similarity with itself is a 1. The less similar two documents are to one another, the closer their similarity is — or should be — to 0. Of course, the question is how we model and measure this similarity.
Naive Content Representations
A simple way to represent a document is as a bag of words or tokens. That translates naturally into a vector in a space where each word has its own dimension: a document is assigned a 1 in the dimension of each word it contains, and a 0 in all other dimensions. Using this vector representation, the dot product of two document vectors equals the number of words the documents have in common, and the cosine of the angle between the vectors serves as a naive similarity measure.
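As a minimal sketch of the binary bag-of-words representation described above, the following Python builds one vector per document and compares them. Whitespace tokenization and the helper names (`bag_of_words`, `to_vector`, `cosine`) are assumptions made for illustration.

```python
import math

def bag_of_words(text):
    """Return the set of lowercase tokens in a text (a binary bag of words)."""
    return set(text.lower().split())

def to_vector(tokens, vocabulary):
    """Binary vector: 1 if the document contains the word, else 0."""
    return [1 if word in tokens else 0 for word in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

doc_a = bag_of_words("clear wrapping paper")
doc_b = bag_of_words("transparent wrapping paper")
vocabulary = sorted(doc_a | doc_b)     # one dimension per unique word

vec_a = to_vector(doc_a, vocabulary)
vec_b = to_vector(doc_b, vocabulary)

shared_words = dot(vec_a, vec_b)       # 2: "wrapping" and "paper"
similarity = cosine(vec_a, vec_b)      # 2 / (sqrt(3) * sqrt(3)), about 0.67
```

The two documents differ only in one word, yet their similarity is about 0.67 rather than 1, because each distinct word gets its own unrelated dimension.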
This representation is about as naive as it gets, but there are incremental ways to improve on it. Stemming or lemmatizing the tokens can significantly reduce the dimensionality of the vector space without much loss of fidelity. Assigning weights to tokens using tf-idf treats tokens as contributing more to the meaning of a document if they are repeated within the document (high tf) and don’t occur in many other documents (high idf).
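The tf-idf weighting can be sketched as follows. This version assumes raw term counts for tf and log(N / df) for idf; many variants exist (smoothing, length normalization), and the function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {token: weight} dict per document.

    Assumes tf = raw count in the document and idf = log(N / df),
    where df is the number of documents containing the token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: count each token once per document.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in term_freq.items()
        })
    return weights

# Toy corpus, made up for illustration.
corpus = [
    "the coach praised the team",
    "the coach arrived late",
    "the team won",
]
weights = tf_idf(corpus)
```

In this toy corpus, "the" occurs in every document, so its idf is log(1) = 0 and it contributes nothing, while "praised" occurs in only one document and receives the largest weight in the first document.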
Thinking Outside the Bag
Even with these incremental improvements, a key weakness of the bag-of-words approach is that it assigns a dimension to each unique word or token, at most performing a minimal amount of normalization like stemming. But multiple words can mean the same thing (synonymy), and one word can mean multiple things (polysemy). So mapping each token to its own dimension is a poor way to represent meaning. After all, a choice between synonyms (e.g., “clear wrapping paper” vs. “transparent wrapping paper”) shouldn’t drastically change a document representation. Conversely, if two documents use the same word in different ways (e.g., a word with multiple meanings like “coach”), the document representations should reflect the different meanings.
So, how do we move beyond the bag-of-words model to look more holistically at the tokens in a document?
Computer scientists working with language have been aware of this challenge for decades, going back to work on factor analysis for document classification in the 1960s. But the breakthrough that opened the floodgates of vector representations of text came in the 1980s with latent semantic indexing (LSI).
Latent Semantic Indexing
LSI starts with a document-token matrix that assigns a non-zero value (such as the tf-idf weight) for each token that occurs in a document. This matrix can be enormous, but LSI uses a linear algebra technique called singular value decomposition to identify the most significant factors of the matrix, which then makes it possible to reduce the matrix to a much lower-dimensional approximation. This process reduces the vector space of tokens to a much lower-dimensional space, with one dimension for each “concept”.
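The reduction step can be sketched with NumPy's `numpy.linalg.svd`. The tiny document-token matrix below is made up for illustration and uses raw counts rather than tf-idf weights; the helper name `lsi` and the choice of k = 2 are likewise assumptions.

```python
import numpy as np

# Toy document-token matrix (rows: documents, columns: tokens).
# Column order: clear, transparent, paper, coach, team.
X = np.array([
    [1, 0, 1, 0, 0],   # "clear paper"
    [0, 1, 1, 0, 0],   # "transparent paper"
    [0, 0, 0, 1, 1],   # "coach team"
    [0, 0, 0, 2, 1],   # "coach coach team"
], dtype=float)

def lsi(matrix, k):
    """Project each document onto the top-k singular directions."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Singular values come back in descending order, so the first k
    # columns correspond to the k most significant factors.
    return U[:, :k] * S[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_vectors = lsi(X, k=2)                               # shape (4, 2)
paper_sim = cosine(doc_vectors[0], doc_vectors[1])      # the two "paper" docs
cross_sim = cosine(doc_vectors[0], doc_vectors[2])      # "paper" vs. "coach"
```

In the original token space the first two rows have a cosine of 0.5, since they share only "paper". In the reduced space they should come out nearly identical, because "clear" and "transparent" both co-occur with "paper" and fold into the same factor, while the "paper" and "coach" documents remain essentially orthogonal.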
LSI is theoretically interesting, but it’s challenging to apply in practice. It’s computationally expensive, even using approximation methods. Moreover, the dimensions are difficult to interpret and not necessarily aligned with intuitive concepts. The mathematics of LSI offers a beautiful simplicity, but its generative language model is unrealistic.
An advance on LSI that emerged in the early 2000s is latent Dirichlet allocation (LDA). LDA models each document as a mixture of a small number of topics, and each topic as a probability distribution over tokens, so every token in a document is attributed to one of the document’s topics. As a model, LDA is more complex than LSI; in fact, it generalizes probabilistic LSI (pLSI), a probabilistic reformulation of LSI. But that complexity pays off: LDA has proved far more practical than LSI for classification and other text applications. Indeed, LDA and variants of it are still used for topic modeling.
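To make the model concrete, here is a rough sketch of the generative story LDA assumes: draw a topic mixture for the document from a Dirichlet distribution, then for each token pick a topic from that mixture and a word from that topic. It shows only how documents are assumed to be generated, not how topics are inferred from data. The two topics, their word probabilities, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate tokens by following LDA's assumed generative process."""
    topic_mix = sample_dirichlet(alpha, rng)    # per-document topic weights
    names = list(topics)
    tokens = []
    for _ in range(length):
        # Pick a topic for this token according to the document's mixture.
        topic = rng.choices(names, weights=topic_mix)[0]
        # Pick a word according to that topic's word distribution.
        words = list(topics[topic])
        probs = list(topics[topic].values())
        tokens.append(rng.choices(words, weights=probs)[0])
    return topic_mix, tokens

# Hypothetical topics: each is a probability distribution over words.
topics = {
    "sports": {"coach": 0.5, "team": 0.4, "paper": 0.1},
    "crafts": {"paper": 0.5, "wrapping": 0.4, "coach": 0.1},
}
rng = random.Random(0)
topic_mix, tokens = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting LDA runs this story in reverse: given only the tokens, it estimates the topics and each document's mixture, which is typically done with a library rather than by hand.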
These days, LSI and even LDA feel like ancient history. Living in the age of AI, we tend to use word embeddings to map documents to vector spaces.
Like the previous methods, word embeddings map text to a vector space — a geometry where the cosine between vectors corresponds to semantic similarity. Word embeddings emerged from the linguistics field of distributional semantics. The intuition behind embeddings is that “a word is characterized by the company it keeps”, an idea popularized in the 1950s by linguist John Rupert Firth.
Modern word embedding approaches stem from word2vec, which Tomas Mikolov and his colleagues at Google created in 2013. The idea was that you could not only recognize similar words by the similarity of their vectors but also perform arithmetic such as addition and subtraction on those vectors, e.g., “king” + “woman” - “man” ≈ “queen”.
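The vector arithmetic can be illustrated with a toy example. The 3-dimensional vectors below are hand-made so that the analogy works; they are not output from word2vec or any trained model, whose embeddings are learned from large corpora and typically have hundreds of dimensions. The function name `analogy` is an assumption.

```python
import math

# Hand-made vectors, purely illustrative. Rough meaning of the
# dimensions: [royalty, gender, food].
vectors = {
    "king":   [0.9,  0.8, 0.0],
    "queen":  [0.9, -0.8, 0.1],
    "man":    [0.1,  0.9, 0.0],
    "woman":  [0.1, -0.9, 0.0],
    "prince": [0.8,  0.7, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to a + b - c.

    The three input words are excluded from the candidates.
    """
    target = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "woman", "man", vectors)   # "queen" for these toy vectors
```

The result is the nearest neighbor of the computed vector, not an exact match, which is why the relationship is better read as approximate.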
Word embeddings have advanced at a frenetic pace in the last several years, with models like GloVe, fastText, ELMo, and BERT finding their way into production applications. The availability of pretrained models is especially appealing for developers who lack the data or computational resources to train a robust model from scratch. While there may not be a pretrained model trained on content that is representative of a target application, fine-tuning a pretrained model is still easier than training a model from scratch.
Opportunities and Challenges
All of the above reflects an evolution of techniques to model and measure the similarity between two documents, motivated by the desire to represent documents as more than just categories and bags of words or entities.
But what can we do with a content similarity measure? And what pitfalls do we have to watch out for?
Here are some of the opportunities to apply content similarity measures:
- Recommendations. A content-based similarity measure provides a strong foundation for recommendations, either on its own or in combination with collaborative filtering.
- Document clustering. Given a similarity measure, it’s possible to construct a graph connecting each document to its nearest neighbors, and then to perform clustering on that graph. The resulting clusters can be used for search, recommendations, analytics, and other applications.
- Diversification. Sometimes it’s best to not show documents that are too similar to one another, particularly in the context of search results and recommendations. Any method for diversifying results can use a content-based similarity measure as a way to measure diversity or lack thereof.
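Two of these applications can be sketched given a precomputed similarity matrix. The example below builds a nearest-neighbor graph and clusters it by connected components, then diversifies a ranked list by skipping near-duplicates. Connected components and threshold-based filtering are simple stand-ins chosen for brevity, not the only or best methods; the similarity values, the threshold, and all function names are made up for illustration.

```python
def knn_graph(sim, k):
    """Connect each document to its k most similar other documents."""
    n = len(sim)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbors = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in neighbors:
            graph[i].add(j)      # treat edges as undirected
            graph[j].add(i)
    return graph

def connected_components(graph):
    """Cluster the graph by connected components (one simple choice)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters

def diversify(ranked, sim, threshold):
    """Keep results in rank order, skipping any document whose similarity
    to an already kept document reaches the threshold."""
    kept = []
    for doc in ranked:
        if all(sim[doc][other] < threshold for other in kept):
            kept.append(doc)
    return kept

# Toy symmetric similarity matrix for five documents.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.6],
    [0.2, 0.1, 0.1, 0.6, 1.0],
]
clusters = connected_components(knn_graph(sim, k=1))        # [[0, 1, 2], [3, 4]]
diverse = diversify([0, 1, 3, 2, 4], sim, threshold=0.75)   # [0, 3, 4]
```

Here documents 1 and 2 are dropped from the diversified list because each is too similar to document 0, which ranks above them.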
But there are also challenges with applying content similarity measures:
- Oversimplification. Reducing similarity to a single number is convenient, but it can’t account for the rich variety of ways in which documents can be similar. A single number is at best a lossy representation of similarity.
- Noise and bias. No similarity measure is perfect. In the best case, the similarity measure will be noisy, and similarity values need to be taken with a grain of salt. More realistically, every measure has its own biases.
- Heterogeneity. Documents tend to vary in length, format, and other attributes. A content similarity measure is likely to struggle with this heterogeneity, conflating structural differences with semantic ones.
Measuring content similarity offers a more granular approach to content understanding than classification and annotation. A similarity score between 0 and 1 is useful for a variety of applications, such as recommendations, clustering, and diversification. But a single number is a lossy representation of similarity, and any measure will have to contend with noise, bias, and content heterogeneity. Still, using embeddings to model content similarity is a critical component in a comprehensive content understanding suite.