Hands-On Quickstart to Training Large Language Models
Quickly Build a ChatGPT-style model for your domain of expertise
2 min read · May 20, 2023
The publication of pretrained models on HuggingFace and the advent of LoRA and PEFT allow everyone to build useful models. So let's do that!
Start here
- Open this Colab notebook [1]. I have added comments and text to guide you. Good luck!
A few things to keep in mind
- I highly recommend one of the HuggingFace courses. They have sections that explain most aspects of training.
- Decide what kind of NLP task you want to do: Causal_LM, Seq2Seq_LM, SEQ_CLS, or TOKEN_CLS (there is a quick mapping sketch after this list):
- Causal_LM: ChatGPT-like; forms sentences by predicting the next word from the previous words.
- Seq2Seq_LM: Transforms a sentence (or sentences) into other sentences. Translation is the classic example; summarization and explanation also fit here.
- SEQ_CLS: Sequence classification. Examples: sentiment analysis (Is this review positive? What's the tone of this tweet?), intent recognition.
- TOKEN_CLS: Token classification. Example use: named entity recognition. Is the "Bing" in "Chandler Bing" a last name or a sound?
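If you are using PEFT, these task types correspond directly to its TaskType enum. Here is a quick sketch of the mapping (the use-case labels are just my own shorthand, not official PEFT terminology):

```python
# Rough mapping from use case to PEFT's TaskType enum.
from peft import TaskType

TASKS = {
    "chat / next-word generation (ChatGPT-like)": TaskType.CAUSAL_LM,
    "translation, summarization, explanation":    TaskType.SEQ_2_SEQ_LM,
    "sentiment or intent (one label per text)":   TaskType.SEQ_CLS,
    "named entity recognition (label per token)": TaskType.TOKEN_CLS,
}

for use_case, task_type in TASKS.items():
    print(f"{use_case:45s} -> {task_type}")
```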
Basic Glossary
- Tokenization — converting words into numbers so that models can reason about them
- Pretrained — models that someone else has built and trained (usually at great expense — be sure to say thank you!) that you can then improve upon (fine-tune)
- Fine-tuning — "adding" to the pretrained model's knowledge base. You'll want to do this if you want the model to answer domain-specific questions, e.g. questions about your company's codebase, HR and hiring, meeting summaries, etc. A minimal code sketch follows this glossary.
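To make the glossary concrete, here is a minimal sketch of the same flow the Colab notebook walks through: tokenize some text, load a pretrained model, and attach a small LoRA adapter to fine-tune it. The checkpoint name and LoRA hyperparameters below are illustrative placeholders, not the notebook's exact settings.

```python
# Minimal sketch: tokenization, a pretrained model, and a LoRA fine-tuning setup.
# "gpt2" and the LoRA hyperparameters are illustrative, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

checkpoint = "gpt2"  # any causal-LM checkpoint from the HuggingFace Hub

# Tokenization: turn words into numbers the model can reason about
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer("Chandler Bing works at a big company.")["input_ids"])

# Pretrained: someone else already paid for this training run
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Fine-tuning with LoRA: train a small adapter instead of all the weights,
# so it fits on a free Colab GPU
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters train
```

From here, the PEFT-wrapped model can be trained with the usual transformers Trainer on your own domain data.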
References
[1] https://colab.research.google.com/drive/17VpyJc40y7Oy1d437Rjy7PFcNDZISynF?usp=sharing. I found the original version of this Colab notebook in a YouTube video, but I lost the link. Sorry.