Hands-On Quickstart to Training Large Language Models

Quickly Build a ChatGPT-style model for your domain of expertise

Ariel Lubonja · 2 min read · May 20, 2023

The publication of pre-trained models on HuggingFace and the advent of LoRA and PEFT allow everyone to build useful models. So let’s do that!
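Why does LoRA make fine-tuning affordable? Instead of updating a full d×d weight matrix, it trains two small matrices of rank r and adds their product to the frozen weights, which shrinks the trainable parameter count enormously. A back-of-the-envelope sketch (the dimensions below are illustrative, not tied to any particular model):

```python
# LoRA parameter-count sketch: instead of updating a frozen W (d x d),
# train B (d x r) and A (r x d) and add the product B @ A to W.
def lora_trainable_params(d: int, r: int) -> int:
    """Parameters trained by a rank-r LoRA adapter on one d x d matrix."""
    return 2 * d * r  # B has d*r entries, A has r*d entries

d = 4096  # hidden size in the ballpark of a 7B-parameter model (illustrative)
r = 8     # LoRA rank (a common default)

full = d * d                        # full fine-tuning of one matrix
lora = lora_trainable_params(d, r)  # LoRA adapter on the same matrix
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x fewer")
```

At rank 8 on a 4096-wide matrix that is a 256x reduction per adapted matrix, which is why LoRA fine-tuning fits on a free Colab GPU while full fine-tuning does not.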

Start here

  1. Open this Colab notebook [1]. I have added comments and text to guide you. Good luck!

A few things to keep in mind

  1. I highly recommend one of the HuggingFace courses. They have sections that explain most aspects of training.
  2. Decide what kind of NLP task you want to do: Causal_LM, Seq2Seq_LM, SEQ_CLS, TOKEN_CLS:
  • Causal_LM: ChatGPT-like; forms sentences by predicting the next word from the previous words.
  • Seq2Seq_LM: Transforms a sentence (or sentences) into other sentences. Translation is the classic example; summarization and explanation are others.
  • SEQ_CLS: Sequence classification. Examples: sentiment analysis (Is this review positive? What’s the tone of this Tweet?), intent recognition.
  • TOKEN_CLS: Token classification. Example use: Named Entity Recognition. Is the “Bing” in “Chandler Bing” a last name, or a sound?
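To make the Causal_LM idea concrete, here is a toy next-word predictor: a pure-Python bigram counter (nothing to do with any real transformer) that greedily picks the word that most often followed the previous word in its training text — the same loop, scaled up enormously, is what a causal LM runs.

```python
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """Count which word follows which in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows: dict, word: str) -> str:
    """Greedy causal-LM step: return the most frequent successor of `word`."""
    return follows[word.lower()].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat the cat ran")
print(predict_next(model, "the"))  # -> "cat" ("cat" follows "the" twice, "mat" once)
```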

Basic Glossary

  1. Tokenization — converting words into numbers so that models can reason about them
  2. Pretrained — models that someone else has built and trained (usually at great expense — be sure to say thank you!) that you can then improve upon (fine-tune)
  3. Fine-tuning — “adding” to a pretrained model’s knowledge base. You want to do this if you want the model to answer domain-specific questions, e.g. questions about your company’s codebase, HR and hiring, meeting summaries, etc.
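Tokenization from item 1 can be sketched in a few lines. This is a toy word-level tokenizer with a made-up vocabulary; real models use subword schemes like BPE, but the words-to-ids idea is the same:

```python
def build_vocab(corpus):
    """Assign each unique word an integer id (id 0 reserved for unknown words)."""
    vocab = {"<unk>": 0}
    for word in " ".join(corpus).lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(vocab, sentence):
    """Map words to ids so a model can operate on numbers."""
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.lower().split()]

vocab = build_vocab(["hello world", "hello model"])
print(tokenize(vocab, "hello new world"))  # -> [1, 0, 2] ("new" is unknown)
```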
Mandatory Cool Photo. Credit: https://youtu.be/-cTlH6P_n-E?t=229


[1] https://colab.research.google.com/drive/17VpyJc40y7Oy1d437Rjy7PFcNDZISynF?usp=sharing. I found the original version of this Colab notebook in a YouTube video, but I lost the link. Sorry.



Ariel Lubonja

I am a PhD student in Computer Science at Johns Hopkins University. Area: High Performance Computing, Graph Machine Learning