The most fundamental form of holistic content understanding is classification. Content classification maps a piece of content — that is, an entry in the search index — to one or more elements of a predefined set of categories. The categories can be product types, document topics, image colors, or any other set of enumerated values that describes the content.
Classifying content makes it more findable, since the classifications can be used for retrieval and ranking. The categories for a classifier can be organized as a flat list, or they can be arranged in a hierarchical taxonomy (aka a tree).
Broadly speaking, there are two ways to perform content classification: rules and machine learning. Rules-based approaches are simple but brittle, while machine-learning approaches are more complex but more robust.
Rules-Based Content Classification
Rules-based classifiers use explicit rules to classify content. The rules typically involve matching strings or regular expressions. For example, we can create a rule that assigns the product type “Cell Phones” to any product whose title contains the substring “phone”. We can make the rule case-insensitive, so that it also matches product titles containing the string “iPhone”.
Rules-based classifiers are simple but brittle. Continuing with our above example, many products with titles containing the substring “phone” are cell phones; but many others are cell phone cases. We can improve the rule’s precision by excluding products whose titles contain the substring “case”. We can then improve recall by matching the brand names of popular cell phones, such as “samsung galaxy” and “pixel”. But each of these changes introduces its own false positives, and no rule will catch everything.
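A toy version of this evolving rule might look like the following sketch (a single-category illustration, not a production classifier):

```python
import re
from typing import Optional

def classify_product_type(title: str) -> Optional[str]:
    """Toy rules-based classifier for a single category, "Cell Phones"."""
    t = title.lower()
    # Exclude accessories first, to improve precision.
    if "case" in t:
        return None
    # Match the generic substring, plus popular brand names, to improve recall.
    if "phone" in t or re.search(r"samsung galaxy|pixel", t):
        return "Cell Phones"
    return None
```

Note how the brittleness shows through even here: the rule correctly labels “Google Pixel 7” but still mislabels “Pixel 7 Screen Protector” as a cell phone.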
Perfecting a rule to increase its precision and recall achieves diminishing returns at the cost of complexity. And creating complex rules undermines the simplicity and explainability that makes rules attractive in the first place. In general, it’s best to keep rules simple and accept the limits of their accuracy.
Machine Learning for Content Classification
Building a machine learning model for content classification is more complex than creating rules, but it tends to be much more robust. Building a machine learning model requires a collection of training data: examples that associate pieces of content with their categories. For example, if we’re building a classifier to map product titles to product categories, then our training data would be pairs of the form (title: “Apple iPhone 13”, category: “Cell Phones”), (title: “Canon Pixma MG3620”, category: “Printers”), etc. Labeled training data is the lifeblood of machine learning in general, and classification in particular.
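To make the idea of learning from labeled pairs concrete, here is a minimal naive Bayes classifier over title tokens, trained on pairs like those above. A real system would use a machine learning library rather than hand-rolled code; this sketch just shows how labeled examples turn into a model:

```python
import math
from collections import Counter, defaultdict

training_data = [
    ("Apple iPhone 13", "Cell Phones"),
    ("Samsung Galaxy S22", "Cell Phones"),
    ("Canon Pixma MG3620", "Printers"),
    ("HP DeskJet 2755e", "Printers"),
]

def train(pairs):
    """Fit a tiny multinomial naive Bayes model over title tokens."""
    word_counts = defaultdict(Counter)  # category -> token counts
    cat_counts = Counter()              # category -> example counts
    for title, cat in pairs:
        cat_counts[cat] += 1
        word_counts[cat].update(title.lower().split())
    return word_counts, cat_counts

def classify(title, word_counts, cat_counts):
    """Score each category by log prior plus Laplace-smoothed log likelihoods."""
    tokens = title.lower().split()
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_score = None, -math.inf
    for cat, n in cat_counts.items():
        total = sum(word_counts[cat].values())
        score = math.log(n / sum(cat_counts.values()))
        for tok in tokens:
            score += math.log((word_counts[cat][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best
```

With even this toy model, a previously unseen title like “Apple iPhone 12” gets classified correctly, because the model generalizes from token statistics rather than exact string matches.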
When it comes to training data, both quantity and quality matter. More data improves model accuracy, as does label quality — that is, accurate labels in the training data. But when we use human judgments to generate labels, both quantity and quality come at a cost, since we have to pay for each judgment — and even more if we use redundant judgments to ensure quality. As with all things, we have to manage trade-offs. We can sometimes avoid this expense by using behavioral data to collect implicit human judgments, but doing so creates its own risks around both quality and bias. There’s no free lunch.
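One common way to use redundant judgments is majority voting, keeping the agreement rate so that low-agreement items can be flagged for re-review. A minimal sketch (the function name and return shape are illustrative assumptions, not a standard API):

```python
from collections import Counter

def aggregate_judgments(judgments):
    """Majority-vote label aggregation across redundant human judgments.

    Returns (label, agreement); a low agreement rate signals an item
    that may need another round of review.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)
```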
But quantity and quality aren’t the whole story. Training data must be representative of the content to which the resulting model will be applied. Unrepresentative training data introduces bias, which in turn leads the model to produce systematic errors. For example, if only 10% of products are cell phones, training data in which 50% of products are cell phones will produce a model that over-labels products as cell phones. This example is innocuous, but models trained with unrepresentative data produce real harm when their bias affects people’s lives and livelihoods. Collect your training data carefully.
Choosing a Machine Learning Model for Content Classification
Classification is the most studied problem in machine learning, so there are lots of approaches you can use for it. Models based on decision trees, such as random forests and gradient-boosted decision trees, can be useful if each piece of content is associated with categorical, ordinal, or numerical data.
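In practice you would use a library like scikit-learn or XGBoost for tree models. But the smallest member of the family, a single decision stump, is easy to sketch, and it shows how a tree splits on structured features (the feature names and values below are made up for illustration):

```python
def fit_stump(rows, labels):
    """Learn a one-split decision stump: (feature, threshold, left_label, right_label).

    Exhaustively tries every (feature, threshold) split and keeps the one
    that classifies the most training rows correctly.
    """
    best, best_correct = None, -1
    for feature in rows[0].keys():
        for t in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > t]
            for ll in set(labels):
                for rl in set(labels):
                    correct = left.count(ll) + right.count(rl)
                    if correct > best_correct:
                        best_correct = correct
                        best = (feature, t, ll, rl)
    return best

def predict_stump(stump, row):
    feature, t, left_label, right_label = stump
    return left_label if row[feature] <= t else right_label
```

A random forest or gradient-boosted ensemble is, loosely, many such splits layered and combined; the libraries handle that machinery for you.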
But for text and images, the most natural approach to building a classifier is to use embeddings that represent the content as real-valued vectors in a high-dimensional vector space. There are many pretrained embeddings freely available for text and images. You can often use a pretrained model as-is, but you may benefit from fine-tuning the model for your particular application. Either way, you can get an enormous head start from these public resources.
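A common lightweight pattern on top of embeddings is a nearest-centroid classifier: average each category’s example embeddings into a centroid, then assign new content to the most similar centroid. In the sketch below, a toy token-hashing function stands in for a real pretrained embedding model (which is where the public resources mentioned above would plug in):

```python
import math
from collections import defaultdict

DIM = 64

def embed(text):
    """Stand-in for a pretrained text embedding: hashes each token into a
    fixed-size vector and L2-normalizes. A real system would call a
    pretrained model here instead of this toy character-sum hash."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def train_nearest_centroid(pairs):
    """Average each category's example embeddings into a centroid."""
    sums = defaultdict(lambda: [0.0] * DIM)
    counts = defaultdict(int)
    for title, cat in pairs:
        counts[cat] += 1
        for i, v in enumerate(embed(title)):
            sums[cat][i] += v
    centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}

    def classify(title):
        q = embed(title)
        # Pick the category whose centroid has the highest dot product.
        return max(centroids, key=lambda c: sum(a * b for a, b in zip(q, centroids[c])))

    return classify
```

Swapping the toy `embed` for a fine-tuned pretrained model is exactly the head start that public embedding resources provide.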
And, as a general principle, keep things as simple as possible. The quantity, quality, and representativeness of your training data is more critical to your success than the sophistication of your machine learning model. You also want to avoid premature optimization, instead learning from rapid iterations. You’re better off learning quickly and often from smaller collections of training data, and then perfecting the model once you’ve settled on an approach. Take an agile perspective: iterating quickly will optimize for the speed of *your* learning.
Content Classification Can Only Be as Good as the Categories
Regardless of how you build a content classifier, remember that your classifier can only be as good as the categories to which it assigns content. Ideally, the categories should be coherent, distinctive, and exhaustive.
Coherent categories embody clear patterns that a rule or machine-learning model can recognize. Distinctive categories are cleanly separated from one another: after all, if it’s hard to distinguish two categories from each other, then how is a classifier supposed to be able to decide between them? Finally, exhaustive categories cover the whole universe of content.
No category set is perfect. But the quality of the category set is often the bottleneck for classification. If you’re struggling to build a robust content classifier, check to make sure the category set isn’t the root cause.
Classification is the most fundamental form of content understanding. You can use simple rules-based classifiers, or you can invest in robust machine-learning approaches. For text and images, take advantage of pre-trained embeddings. Remember that the quantity, quality, and representativeness of your training data matters more than the sophistication of your machine learning model. And strive for categories that are coherent, distinctive, and exhaustive: a classifier can only be as good as its categories.