Large language models

Technocrat
Published in CoderHack.com
10 min read · Sep 15, 2023

A language model is a machine learning model that predicts the probability of a sequence of words. It does this by learning patterns in a large corpus of text. Language models power many natural language technologies we use every day like predictive text, autocorrect, smart assistants and more.

Some well-known language models are OpenAI’s GPT-3, Google’s BERT, and XLNet. GPT-3 is an autoregressive language model, meaning it generates text one token at a time. It has 175 billion parameters and was trained on a large corpus of internet text. BERT is a bidirectional encoder model used for tasks like question answering and sentiment analysis.


Building your own custom language model allows you to train on domain-specific data that these large pre-trained models have not seen. You have more control over how the model is optimized and can better align its behavior with your desired use cases. Here is an example of training a simple character-level language model on Shakespeare’s works:

import numpy as np
import tensorflow as tf

# Load text data
text = open('shakespeare.txt').read()

# Build vocab index and map characters to integer indices
vocab = sorted(set(text))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = {idx: char for idx, char in enumerate(vocab)}
text_encoded = np.array([char2idx[char] for char in text])

# Model parameters
char_dim = len(vocab)
lstm_units = 128
seq_length = 40

# Slice the text into fixed-length input windows; the target for each
# window is the character that immediately follows it
inputs = np.array([text_encoded[i:i + seq_length]
                   for i in range(len(text_encoded) - seq_length)])
targets = text_encoded[seq_length:]

# Build LSTM model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(char_dim, lstm_units))
model.add(tf.keras.layers.LSTM(lstm_units, return_sequences=True))
model.add(tf.keras.layers.LSTM(lstm_units))
model.add(tf.keras.layers.Dense(char_dim, activation='softmax'))

# Compile and train the model (sparse loss since targets are integer indices)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(inputs, targets, batch_size=32, epochs=10)

# Generate new text by repeatedly sampling the next character
seed = text_encoded[:seq_length].tolist()
generated = []
for _ in range(50):
    probs = model.predict(np.array([seed[-seq_length:]]), verbose=0)[0]
    probs = probs / probs.sum()  # renormalize against float rounding
    next_idx = np.random.choice(char_dim, p=probs)
    generated.append(next_idx)
    seed.append(next_idx)
print(''.join(idx2char[idx] for idx in generated))
# Sample output: henry ears t twenty thee one a preack beason gresius

This model learns the patterns in Shakespeare’s writing by predicting the next character in sequences from his works. We can then use it to generate completely new text in Shakespeare’s style! Building and training custom models like this allows us to tap into the power of deep learning for our own use cases.

II. Data preparation

To build a language model, we first need to gather and prepare the data that we will train the model on.

A. Gathering data

There are a few options for obtaining data:

  • Web scraping: We can scrape text data such as articles, books, and forum posts from websites. The Newspaper3k library is useful for scraping news articles (see the sketch after this list).
  • Existing datasets: Many datasets exist with large corpora of text that we can leverage. Some options are:
      • Project Gutenberg — Over 60,000 free eBooks
      • Common Crawl — Web crawl data
      • Wikipedia dump — The full text of Wikipedia
      • BookCorpus — A collection of over 11,000 free books
  • Purchasing data: As a last resort, we can purchase datasets from companies that aggregate and sell data.
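
For example, here is a minimal sketch of pulling down a single article with Newspaper3k (the URL is a placeholder):

from newspaper import Article

# Download and parse one news article (placeholder URL)
article = Article('https://example.com/news/some-article')
article.download()
article.parse()

print(article.title)
print(article.text[:500])  # first 500 characters of the body text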

B. Cleaning and preprocessing the data

Once we have our raw data, we need to preprocess it to get it into a format that we can train our model on. This includes:

  • Converting to plaintext by stripping HTML tags and other formatting
  • Removing or replacing stray characters and unusual Unicode
  • Normalizing punctuation, casing, and whitespace
  • Filtering out overly short or overly long sequences
  • Shuffling the data to remove any ordering effects

We can use libraries like BeautifulSoup and NLTK to help with preprocessing.
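
Here is a minimal preprocessing sketch along these lines (assuming raw_docs holds the scraped HTML documents; the names are placeholders):

import random
import re
from bs4 import BeautifulSoup

def clean_document(html):
    # Strip HTML tags, keeping only the visible text
    text = BeautifulSoup(html, 'html.parser').get_text()
    # Lowercase and normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip().lower()
    # Remove characters outside a basic allowed set
    return re.sub(r"[^a-z0-9 .,!?'-]", '', text)

docs = [clean_document(d) for d in raw_docs]

# Filter out documents that are too short or too long
docs = [d for d in docs if 20 <= len(d.split()) <= 1000]

# Shuffle to remove any ordering in the corpus
random.shuffle(docs)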

C. Defining a vocabulary and mapping words to indices

We need to define a vocabulary of words to use in our model and assign each word an index. This can be done by:

  • Taking the top N most frequent words
  • Using a frequency cutoff to only include words that appear at least m times
  • Applying a heuristic based on word statistics, such as average inverse document frequency (IDF)

We map each word in the vocabulary to an index, so that our model only needs to work with integer indices rather than strings. Out-of-vocabulary words can be mapped to a single <UNK> token index.

from collections import Counter

# data: a list of article strings from the preprocessing step
words = [word.lower() for article in data for word in article.split(' ')]
counts = Counter(words)

# Keep words that appear more than 5 times; reserve index 0 for <UNK>
vocabulary = ['<UNK>'] + [word for word, count in counts.most_common() if count > 5]
vocab_size = len(vocabulary)

word_to_idx = {word: i for i, word in enumerate(vocabulary)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

This gives us our final preprocessed and vectorized training data that we can use to train the language model.

III. Model architecture

There are several model architectures that can be used for building language models. The most common are:

A. Recurrent neural networks (RNNs)

RNNs are neural networks that operate on sequences by iterating over the sequence elements in order. They maintain an internal state that is updated as the network processes each element. A common type of RNN used for language modeling is the Long Short-Term Memory (LSTM) network. LSTMs can learn long-term dependencies and mitigate the vanishing gradient problem that basic RNNs suffer from. Here is a code example of an LSTM layer in Keras:

from tensorflow import keras

# An LSTM layer with 100 hidden units
lstm = keras.layers.LSTM(100)
# input_sequence: a tensor of shape (batch, timesteps, features)
output = lstm(input_sequence)

B. Transformers (self-attention layers)

Transformers have become popular for language modeling in recent years. They use self-attention layers instead of recurrence to model relationships between sequence elements, learning contextual relationships dynamically. Some well-known transformer models are BERT, GPT-2, and GPT-3. Here is a code example of a simplified Transformer layer in TensorFlow:

import tensorflow as tf

class TransformerLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model, dff):
        super().__init__()
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model)
        # Position-wise feed-forward network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, inputs):
        # Self-attention with a residual connection and layer normalization
        attention_output = self.attention(inputs, inputs)
        x = self.norm1(inputs + attention_output)
        # Feed-forward sublayer, also with a residual connection
        ffn_output = self.ffn(x)
        return self.norm2(x + ffn_output)

C. Contextualized word embeddings (ELMo, BERT)

Pre-trained contextualized word embedding models like ELMo and BERT model complex relationships between words by conditioning each word’s embedding on the entire context in which it appears. These models can then be fine-tuned for a target task, reducing the amount of task-specific training data required. Here is example code to fine-tune BERT for a classification task, using the Hugging Face Transformers library:

import tensorflow as tf
from transformers import TFBertModel

# Load pre-trained BERT weights from the Hugging Face hub
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# x_train contains token IDs of shape (num_examples, max_seq_length)
x_input = tf.keras.layers.Input(shape=(x_train.shape[1],), dtype=tf.int32)
x = bert_model(x_input).last_hidden_state

x = tf.keras.layers.GlobalMaxPool1D()(x)
output = tf.keras.layers.Dense(6, activation='sigmoid')(x)

model = tf.keras.Model(x_input, output)
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x_train, y_train, validation_data=(x_val, y_val))

IV. Training the model

Training a neural network language model requires careful selection of hyperparameters and optimization of resources.

A. Setting hyperparameters

Some key hyperparameters to consider are:

- **Learning rate** — The step size used when updating model weights. A value that is too high causes training to diverge, while one that is too low takes a long time to converge. Typically in the range of 0.0001 to 0.01.

- **Batch size** — The number of samples trained on at once. A larger batch size is more efficient but may not fit in memory. Usually a power of 2, between 16 and 512.

- **Epochs** — The number of full passes through the training data. As epochs increase, training improves, but returns diminish and overfitting becomes more likely. Typically between 10 and 50 epochs.

- **Dropout** — The fraction of units dropped from the network during training to prevent overfitting. A good default is 0.2 to 0.5.

- **Embedding size** — For language models, the dimensionality of the word embeddings. Typically in the range of 100 to 300.

- **Hidden size** — The number of units in the hidden layers. Depends on dataset size and structure, and can vary from 100 to 1500.

Some sample code for defining a model in Keras with these hyperparameters:

from tensorflow import keras

model = keras.Sequential()
model.add(keras.layers.Embedding(input_dim=vocab_size, output_dim=300))
model.add(keras.layers.LSTM(512, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.Dense(vocab_size, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=['accuracy'])

B. Using GPUs and distributed training

For large datasets, training a neural network can be very computationally expensive. Using GPUs and distributed training can speed this up considerably.

GPUs contain thousands of cores designed for the parallel numerical operations at the heart of machine learning. Libraries like TensorFlow make it straightforward to place training on a GPU. Distributed training goes further, spreading the work across multiple GPUs or machines that work together. This can shorten training time for huge datasets from days or weeks down to hours or minutes.
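
For example, TensorFlow can list the GPUs it sees, and tf.device can pin work to one of them (a minimal sketch):

import tensorflow as tf

# Show the GPUs TensorFlow can use
print(tf.config.list_physical_devices('GPU'))

# Pin model construction (and subsequent training) to the first GPU
with tf.device('/GPU:0'):
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])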

An example of distributed training in TensorFlow:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Define and compile the model inside the strategy scope
    model = ...

model.fit(train_dataset, epochs=10, steps_per_epoch=70)

C. Monitoring training and validating the model

During training, it’s important to monitor the model’s performance on both the training data and held-out validation data.

For the training data, we look at the loss and accuracy over each epoch. For the validation data, we check accuracy as well as other metrics like perplexity.

If the model’s performance on the validation data starts to degrade, overfitting is occurring. We can then stop training early or adjust hyperparameters, for example by adding more dropout.

An example of training a model in Keras with validation:

model.fit(train_dataset,
          epochs=10,
          validation_data=val_dataset,
          callbacks=[keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                                   patience=3)])

This will stop training if the validation accuracy does not improve for 3 epochs, preventing overfitting.

V. Evaluating and analyzing the model

Once we have trained our language model, it is important to evaluate how well it performs and analyze what it has learned. There are a few main ways to evaluate a language model:

A. Perplexity

Perplexity is a measurement of how well a model predicts a sample of text. It is defined as the inverse probability of the text, normalized by the number of words. A lower perplexity indicates the model is more confident in its predictions and fits the data better. We can calculate perplexity on a held-out test set to evaluate how well our model generalizes.

For example, say we have the text “The quick brown fox jumped over the lazy dogs.” and our model assigns this sequence a probability of 0.0001. Since there are 9 words, the perplexity would be:

Perplexity = (1/0.0001)^(1/9) ≈ 2.78

As a rough rule of thumb, a good word-level language model achieves a perplexity well under 100 on its test set.
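
As a quick sketch, perplexity can be computed from the per-word probabilities a model assigns to a test text (the probabilities below are hypothetical):

import numpy as np

def perplexity(word_probs):
    # word_probs: the probability the model assigned to each word in the text
    n = len(word_probs)
    log_prob = np.sum(np.log(word_probs))  # log P(text)
    # Inverse probability of the text, normalized by its length
    return np.exp(-log_prob / n)

# The worked example above: P(text) = 0.0001 spread over 9 words
print(perplexity([0.0001 ** (1 / 9)] * 9))  # ~2.78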

B. Benchmarking

We can also evaluate our model by benchmarking its performance on certain tasks like question answering, text generation, or classification. The exact metrics will depend on the task, but may include accuracy, F1 score, BLEU score for generation, etc. By benchmarking on established datasets for these tasks, we can see how well our model’s performance compares to other models.

C. Analyzing model representations

To understand what our model has learned, we can analyze its internal representations. For example, we can look at the contextualized embeddings to see which words have similar meanings. We can also do dimensionality reduction on the hidden states of the model to visualize its “concepts”. By analyzing how these internal representations change with more data and training, we can get a sense of how the model’s understanding develops.
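
Here is a minimal sketch of that kind of analysis, assuming the model and word_to_idx mapping from earlier and an embedding layer named 'embedding' (the layer name and word list are assumptions):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Pull the learned word embeddings out of the model
embeddings = model.get_layer('embedding').get_weights()[0]

# Reduce to 2D so the embedding space can be plotted
points = PCA(n_components=2).fit_transform(embeddings)

# Plot a few words to see which ones cluster together
for word in ['king', 'queen', 'man', 'woman']:
    x, y = points[word_to_idx[word]]
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()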

With evaluation, analysis, and benchmarking, we can gain valuable insights into the strengths and weaknesses of our language model. We can then use these insights to guide further improvements.

VI. Deploying and using the model

Now that we have trained our language model, we want to actually use it in applications. The first step is exporting the model from our training framework (TensorFlow, PyTorch, etc.) into a format that can be loaded for inference. For example, in TensorFlow we can export a SavedModel, which packages the model architecture, weights, and training configuration. In PyTorch, we can export an ONNX (Open Neural Network Exchange) model.
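
For instance, exporting a trained Keras model as a SavedModel and loading it back (a minimal sketch):

import tensorflow as tf

# Export the trained model as a SavedModel directory
tf.saved_model.save(model, "model/")

# Later, load it back for inference
restored = tf.saved_model.load("model/")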

Once we have an exported model, we can load it in our application and use it for various NLP tasks. For example, to use it for question answering we can:

  1. Take in a question string from the user
  2. Tokenize the question into word IDs based on our vocabulary
  3. Pad or truncate the question to the maximum sequence length
  4. Feed the question through our model to get a context vector
  5. Calculate the similarity between this context vector and those of all possible answers
  6. Return the answer with the highest similarity

Here is some sample code to perform question answering with a Transformer model in TensorFlow (this assumes a tokenizer that matches the model’s vocabulary and a max_seq_length defined elsewhere):

import tensorflow as tf

# Load the exported model
model = tf.saved_model.load("model/")

# Tokenize the question
question = "What is the capital of France?"
question_ids = tokenizer.encode(question)

# Pad the question and get a context vector
question_ids = tf.keras.preprocessing.sequence.pad_sequences(
    [question_ids], maxlen=max_seq_length, truncating="post", padding="post")
context_vector = model(question_ids)

# Tokenize and pad the candidate answers, then embed them the same way
answers = ["London", "Paris", "Berlin"]
answer_ids = tf.keras.preprocessing.sequence.pad_sequences(
    [tokenizer.encode(a) for a in answers],
    maxlen=max_seq_length, truncating="post", padding="post")
answer_embeddings = model(answer_ids)

# Calculate similarity and pick the best match
similarity = tf.matmul(context_vector, answer_embeddings, transpose_b=True)
best_match = int(tf.argmax(similarity[0]))

print(answers[best_match])  # Prints "Paris"

To continue improving our model, we can retrain it on our entire dataset or on new data that becomes available. Retraining from scratch is often not feasible for large models, so we use the parameters from our exported model as a starting point, and perform fine-tuning on the new data. This allows us to leverage what the model has already learned and only update its knowledge incrementally.
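
A minimal sketch of this kind of incremental fine-tuning (assuming a saved Keras checkpoint model.h5 and a new_dataset of fresh examples; both names are placeholders):

import tensorflow as tf

# Start from the previously trained weights rather than from scratch
model = tf.keras.models.load_model("model.h5")

# Fine-tune with a small learning rate so existing knowledge is preserved
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5))
model.fit(new_dataset, epochs=2)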

With the capability to export, deploy, and retrain powerful language models, we open up many possibilities for intelligent systems that can understand and generate natural language.

VII. Conclusion

In this article, we have gone through the major steps required to build your own custom language model:

  1. Gathering and preparing your data. This includes scraping data from the web, cleaning and preprocessing the data, and defining a vocabulary.
  2. Choosing a model architecture. We discussed RNNs like LSTMs, transformer models with self-attention, and contextualized word embeddings.
  3. Training the model with appropriate hyperparameters like learning rate, batch size, and epochs. Using GPUs and distributed training for large datasets.
  4. Evaluating the model with metrics such as perplexity and performance on downstream tasks like question answering. Analyzing what the model has learned.
  5. Deploying and using the model by exporting it and integrating it into applications where it can provide inference and predictions. Continually retraining the model with new data.

Language models have enabled significant advances in artificial intelligence and natural language processing. However, they also have some limitations. Language models may reflect biases in their training data and cannot perfectly capture all aspects of language. They also require massive amounts of data to train which can be difficult to obtain for some domains or languages.

The future of language models is promising. As we gather more data and computing power, language models will become more capable and nuanced. They could power intelligent assistants, improve machine translation quality, and enable other innovative NLP applications. Overall, building and deploying custom language models allows us to tap into their huge potential. I hope this article has been helpful! If you found it useful, please support me by 1) clapping and 2) sharing the story with your network. Let me know if you have any questions about the content covered.

Feel free to contact me at coderhack.com(at)gmail.com
