SaaS LLM
Software-as-a-Service based on Large Language Models
Mar 22, 2023
Although pretrained models offer out-of-the-box functionality for many NLP (Natural Language Processing) tasks, they only go so far. Sooner or later, a need arises to train on your own data (a custom corpus). The way custom training is done has changed over time; you can see this progression below:
- Own data only: Frequency-based models such as bag-of-words and tf-idf (term frequency-inverse document frequency) are trained from scratch, fully on the custom corpus; there is no pretrained model as such. Use them as vectorizers in NLP tasks such as classification, via machine learning or deep learning neural networks (see the tf-idf sketch after this list).
- Re-training: A model like word2vec comes in a pretrained flavor from Google, and you can use it as a vectorizer in any NLP task, as mentioned above. But the Python library Gensim also allows you to build your own word2vec (and doc2vec) model from scratch, fully based on your own corpus. This gives you your own word-vectors to use in any NLP task (see the Gensim sketch below).
- Fine-tuning: Models like BERT (Bidirectional Encoder Representations from Transformers) come in a pretrained flavor to be used as a vectorizer and also as a base neural network. For a downstream NLP task such as classification, one needs to add a last layer to that base neural network and…
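To make the first stage concrete, here is a minimal sketch of the "own data only" approach: a tf-idf vectorizer fit entirely on a custom corpus and used for classification. The corpus, labels, and scikit-learn pipeline are illustrative assumptions, not from the article.

```python
# Minimal sketch: tf-idf trained from scratch on a custom corpus.
# Corpus and labels below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

corpus = [
    "invoice payment overdue",
    "meeting scheduled for monday",
    "payment received thank you",
    "agenda for the weekly meeting",
]
labels = ["finance", "calendar", "finance", "calendar"]

# The vectorizer is fit from scratch on the custom corpus;
# no pretrained model is involved at any point.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(corpus, labels)

print(model.predict(["overdue invoice reminder"]))  # expected: ['finance']
```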
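For the re-training stage, the article names Gensim directly. A minimal sketch of training word2vec from scratch on your own corpus follows; the sentences and hyperparameters (vector_size, window, epochs) are illustrative, and a real model would need far more text.

```python
# Minimal sketch: word2vec trained from scratch with Gensim (4.x API).
from gensim.models import Word2Vec

sentences = [
    ["customer", "filed", "a", "support", "ticket"],
    ["support", "agent", "resolved", "the", "ticket"],
    ["customer", "renewed", "the", "subscription"],
]

# Train entirely on the custom corpus; hyperparameters are illustrative.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

# The resulting word-vectors can serve as features in downstream NLP tasks.
vector = model.wv["ticket"]  # 50-dimensional vector for "ticket"
print(model.wv.most_similar("ticket", topn=2))
```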
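For the fine-tuning stage, a common way to add that task-specific last layer on top of pretrained BERT is via Hugging Face Transformers, which the article does not name; this sketch, with its two-label setup and toy examples, is an assumption of one possible implementation rather than the author's method.

```python
# Minimal sketch: fine-tuning pretrained BERT for classification.
# AutoModelForSequenceClassification attaches a fresh classification
# head (the "last layer") on top of the pretrained base network.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # num_labels depends on your task
)

# Hypothetical labelled examples; a real fine-tune uses a full dataset
# and a training loop (or the Trainer API).
inputs = tokenizer(
    ["great product", "terrible support"], padding=True, return_tensors="pt"
)
labels = torch.tensor([1, 0])

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # gradients flow into the new head and the base model
```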