Machine learning has been applied quite successfully to text data across a wide range of applications, from mining sentiment in customer reviews and detecting trending topics on social networks to extracting meaningful concepts from product data. For classification tasks, supervised approaches perform much better than unsupervised ones. However, supervised models require training data, and the amount of training data required depends on the complexity of the task at hand.

Generating training data is usually a manual process and hence requires significant human involvement, mostly from domain experts. Generating training data, particularly in large volumes, is therefore time-consuming, expensive and error-prone due to fatigue. To reduce these drawbacks, it is desirable to explore alternative approaches that can work with less data. But is this possible? In this article, we look at some approaches that can be useful.

Background on Labelling

There are two approaches to data labelling: active and passive. Passive approaches do not distinguish between which unlabelled points to label next. Active approaches, on the other hand, intelligently select the next data point that is expected to most improve the performance of the model. However, active labelling requires multiple iterations of labelling and modelling (each iteration identifies a new data point to be labelled, labels it, and rebuilds the model). Hence the active labelling approach has a longer time to market.

Passive labelling can be leveraged for a better time to market, and it can work wonders when used in combination with an ensemble approach. The ensemble approach combines multiple weak labellers to yield a strong labeller. (Strong labellers label each data point correctly, while weak labellers may sometimes make mistakes. For example, a medical professional can be a strong labeller for a disease given a patient's symptoms, while a quack or a chemist would be a weak labeller.)

The idea of using an ensemble of weak labellers has been around for a long time; in fact, AdaBoost was developed two decades ago with this intention. More recently, Snorkel [1] was designed as a system in which labels from multiple weak labellers are combined, via a graphical model, to yield a label that can be more accurate than any individual weak labeller.

Unfortunately, Snorkel cannot handle multi-label cases, where a data instance can have multiple labels (for example, a shirt can be both black and red). An obvious question is whether Snorkel can be extended to handle multi-label cases. It turns out this is difficult because of the way Snorkel is designed. However, it is possible to build alternative approaches, and we have tried two with good results. The first is based on weighted majority voting of the weak labellers; the second trains another level of classifier whose input is the labels from the weak labellers and whose output is the actual label for the instance. Below is a snapshot of the weak labellers and the ensemble approach that leverages them to yield better final labels.
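As an illustration, a minimal sketch of the weighted-majority variant for the multi-label case might look like the following (the labeller votes, weights and threshold are hypothetical values for illustration):

```python
from collections import defaultdict

def ensemble_label(weak_votes, weights, threshold=0.5):
    """Combine multi-label votes from weak labellers by weighted majority.

    weak_votes: list of label sets, one per weak labeller
    weights:    trust placed in each labeller
    A label is kept when its weighted vote share exceeds `threshold`.
    """
    scores = defaultdict(float)
    total = sum(weights)
    for labels, weight in zip(weak_votes, weights):
        for label in labels:
            scores[label] += weight
    return {label for label, s in scores.items() if s / total > threshold}

# Three weak labellers vote on the colours of a shirt.
votes = [{"black", "red"}, {"black"}, {"black", "blue"}]
weights = [0.5, 0.3, 0.2]
print(ensemble_label(votes, weights))  # {'black'}
```

Only "black" clears the 0.5 vote-share threshold here; "red" receives exactly half the weight and is dropped, which shows why the threshold (and the labeller weights) need tuning per task.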

Ensemble Approach

Reducing the number of parameters

Text input may contain a large number of terms from the vocabulary, but not all of them are relevant for the classification task at hand. Using the word2vec representation paradigm, each label can be represented as a vector. It is then plausible to concentrate only on those terms that have good similarity with the labels. This has the desirable effect of significantly reducing the number of parameters to be learnt.

Reducing number of terms

In the above example, similar concepts are placed closer together when mapped into the embedding space (for simplicity, we show two dimensions here).

Consider the vocabulary shown above. If we are interested in classifying food types, the terms enclosed in circles are more meaningful than the others, yet they constitute only a fraction of the whole vocabulary. Therefore, if we eliminate the majority of terms based on their distance from the concepts of interest, the resulting model has far fewer parameters to learn. Here we use a notion of similarity between the output labels and the input, and build the model on a pruned set of inputs; the result is a "similarity-based approach".
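A minimal sketch of this pruning step, assuming term and label embeddings are already available (the toy two-dimensional vectors and the 0.7 similarity cutoff below are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prune_vocabulary(term_vecs, label_vecs, min_sim=0.7):
    """Keep only terms whose embedding is close to at least one label."""
    return {term for term, vec in term_vecs.items()
            if any(cosine(vec, lv) >= min_sim for lv in label_vecs.values())}

# Toy 2-D embeddings for illustration.
terms = {"kibble": (0.9, 0.1), "biscuit": (0.8, 0.3), "invoice": (0.0, 1.0)}
labels = {"dog food": (1.0, 0.0)}
print(prune_vocabulary(terms, labels))
```

"kibble" and "biscuit" survive because they point in roughly the same direction as the "dog food" label vector, while "invoice" is orthogonal to it and is dropped.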

Using distributional concepts

Suppose we are interested in classifying pet food into a number of types: dog food, cat food, and so on. We can think of dog food as a distribution over the vocabulary in which terms relating to dog food have higher weight than the rest. Similarly, we can construct a distribution for cat food. Given a food description, we would like to classify it into one of the two food types. Intuitively, it makes sense to classify it as dog food if the input instance is closer to the dog food distribution than to the cat food distribution.

For this approach, we need two concepts: a distribution, and a distance between two distributions. For each word/term, we can compute its frequency within each class from the labelled examples (as below) to obtain the distributions. How far a distribution P is from a distribution Q is measured by the KL divergence. Unfortunately, KL divergence is not symmetric, i.e., KL(P, Q) is not the same as KL(Q, P). Therefore, we use a modified form of KL divergence, which we call KLD or KL distance, that makes the measure symmetric.

KL distance
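One common way to symmetrize is to average the two directed divergences; a minimal sketch (the exact symmetrization and the smoothing constant used to avoid division by zero are assumptions here):

```python
import math

def kl(p, q, eps=1e-9):
    """Directed KL divergence KL(P || Q) over aligned probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def kld(p, q):
    """Symmetric KL distance: KLD(P, Q) == KLD(Q, P)."""
    return 0.5 * (kl(p, q) + kl(q, p))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
assert abs(kld(p, q) - kld(q, p)) < 1e-12  # symmetric by construction
```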

At training time, the task is to learn the distribution for each label. Given an input instance, we predict the label whose distribution is closest to the instance's distribution in terms of KLD.
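Putting training and prediction together, a minimal end-to-end sketch might look like this (the vocabulary, per-label term counts and smoothing constant are hypothetical):

```python
import math
from collections import Counter

def kld(p, q, eps=1e-9):
    """Symmetric KL distance between two aligned probability lists."""
    kl = lambda a, b: sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def to_dist(counts, vocab):
    """Normalize raw term counts into a distribution over the vocabulary."""
    total = sum(counts.get(t, 0) for t in vocab) or 1
    return [counts.get(t, 0) / total for t in vocab]

# Hypothetical term counts per label, learnt from labelled examples.
vocab = ["kibble", "bone", "catnip", "tuna"]
label_counts = {
    "dog food": Counter({"kibble": 8, "bone": 6}),
    "cat food": Counter({"catnip": 7, "tuna": 9}),
}

def classify(text):
    """Predict the label whose distribution is closest to the instance in KLD."""
    instance = to_dist(Counter(text.split()), vocab)
    return min(label_counts,
               key=lambda lb: kld(instance, to_dist(label_counts[lb], vocab)))

print(classify("kibble with bone flavour"))  # 'dog food' under these counts
```

Note that terms outside the vocabulary ("with", "flavour") simply fall out of the instance distribution, which is exactly the scarce-data behaviour discussed next.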

Why should a distribution-based approach be suitable for scarce data? Conceptually, we are guided by the difference in distributions between labels. Terms that are not relevant to food type will have zero or low support in the label distributions. Hence classification will depend mainly on terms with non-trivial support in the label distributions.

An example of KL distance

Use classifier in the intrinsic dimensions of the manifold

Data may come from a manifold, and hence the classifier should be guided by the principle that points close together on the manifold should have the same labels. In other words, points that are close in Euclidean distance need not have the same labels.

An example of data on manifold (source: https://www.semanticscholar.org/paper/Algorithms-for-manifold-learning-Cayton/100dcf6aa83ac559c83518c8a41676b1a3a55fc0)

Research on manifold learning [2] showed that functions from an RKHS (reproducing kernel Hilbert space) have the capability to capture the data distribution. Even more encouraging, a linear classifier based on a kernel in this space can be a good classifier. Put simply, the predicted value is a simple function of a linear combination of kernels computed between the training instances and the current instance. Such models are easy to learn too.
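The prediction side of such a model is just the representer-theorem form f(x) = Σᵢ αᵢ K(xᵢ, x). A minimal sketch with a Gaussian kernel (the coefficients below are hypothetical; in practice they would be learnt by a manifold-regularized solver such as the ones in [2]):

```python
import math

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def predict(x, train_points, alphas, gamma=1.0):
    """Kernel expansion f(x) = sum_i alpha_i * K(x_i, x), thresholded at 0."""
    score = sum(a * rbf(xi, x, gamma) for xi, a in zip(train_points, alphas))
    return 1 if score >= 0 else -1

train = [(0.0, 0.0), (1.0, 1.0)]
alphas = [1.0, -1.0]  # hypothetical: positive class near (0,0), negative near (1,1)
print(predict((0.1, 0.0), train, alphas))  # +1, closest to the first training point
```

Because the kernel decays with distance, each prediction is dominated by nearby training points, which is what makes the classifier respect local structure.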

A bag of unsupervised tricks when there is no labelled data

What do we do when there is almost no labelled data? We can leverage a set of unsupervised techniques. One simple option is to use regular expressions and word embeddings to look for words or terms similar to the labels. For this, we can use existing techniques from natural language processing (e.g., lemmatization, stemming, word2vec mapping). More sophisticated processing is possible through heuristic rules, which can be gleaned from a domain expert in the form of normalization rules. Alternatively, heuristics can be created automatically through automated parsing.
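A minimal sketch of such a regex-based weak labeller (the rules below are hypothetical stand-ins for what a domain expert might provide):

```python
import re

# Hypothetical expert-provided rules: each label maps to a regex
# matched against the lower-cased product description.
RULES = {
    "dog food": re.compile(r"\b(dog|puppy|kibble)s?\b"),
    "cat food": re.compile(r"\b(cat|kitten|tuna)s?\b"),
}

def heuristic_labels(description):
    """Return every label whose rule matches the description."""
    text = description.lower()
    return {label for label, rule in RULES.items() if rule.search(text)}

print(heuristic_labels("Crunchy kibble for senior dogs"))  # {'dog food'}
```

Rules like these are individually noisy, which is precisely why they are best treated as weak labellers and combined through the ensemble approach described earlier.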

Finally, do these heuristics work in practice? We have experimented with real-life examples for the extraction of product attributes, and have seen strong performance (96% accuracy) with labelled data on the order of 1,000 examples across a number of attributes. Another relevant question is whether such an approach works for image-based classification. Most image models have a large number of parameters and, as expected, require a large dataset (on the order of 10,000 or more) even for simple binary classification. We hope to revisit this in a future article.

Acknowledgement: This was part of collaborative work with Tuhin Bhattacharya, Abhinav Rai, Gursirat Singh Sodhi, Arpit Gupta, Ashish Gupta, Himanshu Rai, Diksha Manchanda, Rahul Bansal and Abhijit Mondal.

Reference:

  1. Ratner, A., Bach, S.H., Ehrenberg, H. et al. Snorkel: rapid training data creation with weak supervision. The VLDB Journal 29, 709–730 (2020). https://doi.org/10.1007/s00778-019-00552-1
  2. Mikhail Belkin, Partha Niyogi, Vikas Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. J. Mach. Learn. Res. 7, 2399–2434 (2006)


Sakib Mondal
Walmart Global Tech Blog

Sakib is a Distinguished Data Scientist at Walmart. He has a keen interest in applying ML and optimization-based techniques in practice.