Self-training in Machine Learning is the Future

Gagandeep Singh · Published in School of ML
4 min read · Aug 6, 2020

Practical tips for training with little or no data


Data-wrangling of various sorts takes up about 70% of the time consumed in a typical AI project.

Getting labelled data is hard, and extracting quality data from it is even harder. Companies spend fortunes on labelling, because training a machine-learning system requires large numbers of carefully labelled examples, and those labels usually have to be applied by humans.

Today we are going to explore how to approach any NLP problem with little or no labelled data.

We are briefly going to talk about two things:

1. Self-training

Train a classifier on the labelled data and use it to make predictions on the unlabelled data. Then retrain the classifier on the labelled data plus the pseudo-labelled data, keeping only the examples that were predicted with high confidence.
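A minimal sketch of that loop, using scikit-learn (the TF-IDF features, logistic regression model and 0.9 confidence threshold are illustrative choices, not prescriptions):

```python
# Minimal self-training loop: train on labelled data, pseudo-label the
# confident predictions on unlabelled data, and retrain. Vectorizer,
# classifier and threshold are illustrative choices.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(labelled_texts, labels, unlabelled_texts, threshold=0.9, rounds=3):
    vectorizer = TfidfVectorizer()
    X_all = vectorizer.fit_transform(list(labelled_texts) + list(unlabelled_texts))
    X_lab, X_unlab = X_all[:len(labelled_texts)], X_all[len(labelled_texts):]
    y_lab = np.asarray(labels)

    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_lab, y_lab)
        if X_unlab.shape[0] == 0:
            break
        probs = clf.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold   # trust only high-confidence predictions
        if not confident.any():
            break
        pseudo_y = clf.classes_[probs.argmax(axis=1)][confident]
        X_lab = vstack([X_lab, X_unlab[confident]])  # grow the training set
        y_lab = np.concatenate([y_lab, pseudo_y])
        X_unlab = X_unlab[~confident]                # drop what was just pseudo-labelled
    return clf, vectorizer
```

scikit-learn also ships a SelfTrainingClassifier (in sklearn.semi_supervised) that wraps this same pattern around any probabilistic base estimator.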

Pros

1. The cost of performing such an experiment is fairly low.

2. You might get good results if your task is relatively easy.

Cons

1. The main downside of self-training is that the model is unable to correct its own mistakes. If the model’s predictions on unlabelled data are confident but wrong, the erroneous data is nevertheless incorporated into training.

One question that might come to your mind is: how do I collect the data? You can use web scraping to collect tweets from Twitter or posts from any other social media platform, depending on your requirement. One thing to make sure of is the quality of the data; check it by reading a few of the reviews or comments.
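Purely as an illustration, a tweet-collection sketch with the tweepy library might look like this (the bearer token and query are placeholders, and API access rules change over time, so treat it only as a starting point):

```python
# Rough sketch of collecting raw text with tweepy (Twitter API v2).
# Bearer token and query are placeholders; adapt to your own data source.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

response = client.search_recent_tweets(
    query="your product -is:retweet lang:en",  # example query
    max_results=100,
)

texts = [tweet.text for tweet in (response.data or [])]

# Always eyeball a few samples to sanity-check data quality before labelling.
for sample in texts[:5]:
    print(sample)
```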

2. Practical Tips and Tricks

1. Use data augmentation on the data that you have labelled. On its own it will not make a huge difference, because most of the words in each sentence stay the same, but the model does become more robust to synonyms.

There is an amazing library, nlpaug, that does all of this for you. Use it as a preprocessing step in your NLP pipeline.
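As an example, a synonym-replacement augmenter could look like the sketch below (parameters and return types vary slightly between nlpaug versions):

```python
# Synonym-based augmentation with nlpaug, using WordNet as the source.
# You may need to download the WordNet corpus via nltk first.
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src='wordnet')

text = "The delivery was quick and the product quality is great."
augmented = aug.augment(text)  # newer versions return a list of strings
print(augmented)
```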

2. Use the stacking technique: train multiple classifiers on the same data and combine the predictions of all the models. It is recommended to weight each model's predictions according to its accuracy.

You can learn more about the stacking technique here
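A minimal sketch with scikit-learn's StackingClassifier (the base models and vectorizer are just examples; the meta-classifier effectively learns how much weight to give each base model):

```python
# Stacking sketch: base classifiers trained on the same TF-IDF features,
# combined by a meta-classifier trained on their cross-validated predictions.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

estimators = [
    ("nb", MultinomialNB()),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
]

stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(max_iter=1000),  # learns the weighting
        cv=5,
    ),
)

# stack.fit(train_texts, train_labels)   # your own labelled data
# predictions = stack.predict(test_texts)
```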

3. Text similarity is one of my favourite techniques. The main idea is to cluster similar sentences together and take a few samples out of each group. In this way, you get much greater diversity in the data you label.

You can use TF-IDF, word2vec, GloVe, BERT, etc., to find similar sentences. My takeaway is that, given enough resources, you should always go with BERT-based models, because they have dynamic embeddings, unlike word2vec, which has static embeddings.
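Here is a sketch of that idea, assuming the sentence-transformers library (the model name, cluster count and example sentences are arbitrary choices):

```python
# Cluster sentences by embedding similarity and sample a few per cluster
# to get a small but diverse set to label.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The battery drains far too fast.",
    "Battery life is disappointing.",
    "Great camera in low light.",
    "The photos look amazing at night.",
    "Delivery took almost two weeks.",
    "Shipping was extremely slow.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

n_clusters = 3
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# Take a sample or two from each cluster for labelling.
samples_per_cluster = 1
to_label = []
for c in range(n_clusters):
    members = [s for s, cid in zip(sentences, cluster_ids) if cid == c]
    to_label.extend(members[:samples_per_cluster])

print(to_label)
```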

Ques: What are dynamic embeddings and static embeddings?
Ans: You are given a sentence:
The bank is located near the bank of the river.

TF-IDF, word2vec and GloVe will produce the same embedding for both occurrences of "bank" in that sentence, even though we know the meanings differ. Static embeddings are predefined and do not change with the surrounding words (you can retrain them, but each occurrence of a word still gets the same vector). BERT embeddings, however, are dynamic and adapt to the surrounding words.
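A quick way to see the difference, assuming the Hugging Face transformers library (the model name is just an example):

```python
# Contextual ("dynamic") embeddings give different vectors for the two
# occurrences of "bank"; a static embedding would give exactly one vector.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank is located near the bank of the river."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
bank_positions = [i for i, tok in enumerate(tokens) if tok == "bank"]

first, second = hidden[bank_positions[0]], hidden[bank_positions[1]]
similarity = torch.nn.functional.cosine_similarity(first, second, dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
# With word2vec or GloVe the similarity would be exactly 1.0 by construction.
```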

4. Language translation. This technique is also great, as it can almost double the amount of labelled data that you have.

The main idea behind it is back-translation: use two translation services (or models) to translate each labelled sentence into another language and then back into the original language, which gives you a paraphrased copy that keeps the same label.

[Figure: text conversion through two translation services]
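A sketch of back-translation with nlpaug (the translation model names are examples; any translation pair or external service will do):

```python
# Back-translation sketch: English -> German -> English produces a
# paraphrase that keeps the original label.
import nlpaug.augmenter.word as naw

back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)

text = "The delivery was quick and the product quality is great."
paraphrase = back_translation.augment(text)
print(paraphrase)
```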

Conclusion

Self-training, or more broadly semi-supervised learning, is the future. We have all seen the power of pre-trained variants of BERT. These models were trained on huge corpora of text, with the task of predicting missing or next words, and they became so good at understanding language that they surpassed existing models on a wide range of tasks.

References

  1. https://github.com/makcedward/nlpaug
  2. https://ruder.io/semi-supervised/
  3. https://medium.com/analytics-vidhya/semantic-similarity-in-sentences-and-bert-e8d34f5a4677
  4. https://www.economist.com/technology-quarterly/2020/06/11/for-ai-data-are-harder-to-come-by-than-you-think
