Google Developer Experts

Experts on various Google products talking tech.


SaaS LLM

Software-as-a-Service based on Large Language Models

Yogesh Haribhau Kulkarni (PhD)
3 min read · Mar 22, 2023


(Source: Pixabay)

Although pretrained models offer out-of-the-box functionality for many NLP (Natural Language Processing) tasks, they go only so far. Soon, a need arises to train on your own data (a custom corpus). The ways in which custom training is done have changed over time; the progression is outlined below:

  • Own data only: Frequency-based models such as bag-of-words and tf-idf (term frequency-inverse document frequency) are typically trained from scratch, entirely on the custom corpus. There are no pretrained models as such. Use them as vectorizers in NLP tasks such as classification, via classical machine learning or deep neural networks.
  • Re-training: Models like word2vec come in a pretrained flavor from Google. You can use them as vectorizers in any NLP task, as mentioned above. However, the Python library Gensim also allows you to build your own word2vec (and doc2vec) model from scratch, based entirely on your own corpus. This gives you your own word vectors for use in any NLP task, as mentioned above.
  • Fine-tuning: Models like BERT (Bidirectional Encoder Representations from Transformers) come in a pretrained flavor, to be used as a vectorizer and also as a base neural network. Depending on the downstream NLP task, such as classification, one needs to add a last layer to that base neural network and…
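
To make the first item in the list above concrete, here is a minimal sketch of a tf-idf vectorizer fitted only on a custom corpus, with no pretrained model involved. It uses only the Python standard library. The three-document corpus, the function names `fit_tfidf` and `transform`, the whitespace tokenization, and the smoothed IDF formula are all assumptions chosen for illustration; real libraries offer more options and may weight terms differently.

```python
import math
from collections import Counter


def fit_tfidf(corpus):
    """Learn a vocabulary and IDF weights from the custom corpus only."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF (one common variant, assumed here): rarer terms weigh more.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Turn one document into a tf-idf vector aligned with the vocabulary."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    # Tokens not seen during fitting are simply ignored.
    return [counts[tok] / total * idf[tok] for tok in vocab]


# Hypothetical custom corpus, for illustration only.
corpus = [
    "the invoice is overdue",
    "the payment was received",
    "invoice payment reminder",
]

vocab, idf = fit_tfidf(corpus)
vectors = [transform(doc, vocab, idf) for doc in corpus]
```

The resulting `vectors` can be passed as features to any classifier. Because the vocabulary comes entirely from the corpus, a document made up only of unseen words maps to an all-zero vector, which is one practical limitation of the own-data-only approach.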



Written by Yogesh Haribhau Kulkarni (PhD)

PhD in Geometric Modeling | Google Developer Expert (Machine Learning) | Top Writer 3x (Medium) | More at https://www.linkedin.com/in/yogeshkulkarni/
