Training/Fine-tuning a Masked Language Model (MLM)

A denoised, teacher-forced training example and explanation

Alex Punnen
Better ML
4 min read · Jul 10, 2023


Image originally from https://creazilla.com/nodes/38523-teacher-and-student-clipart

Skim the code below for now and keep reading; we will come back to it.

Full code here: https://colab.research.google.com/drive/13WwOBS8b_VfpWcS-R2ytpoFCAKPZI27w?usp=sharing

import numpy as np

def get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt):
    # Tokenize the prompt without truncation or padding
    encoded = tokenizer(prompt, truncation=False, padding=False, return_tensors="pt")
    batch_size = 1
    input_length = encoded.input_ids.shape[1]
    # noise density 0.55, mean noise span length 1.5
    denoiser = FlaxDataCollatorForT5MLM(tokenizer, 0.55, 1.5)
    # Sample a boolean mask of the token spans to corrupt
    mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
    labels_mask = ~mask_indices
    # Replace each masked span with a sentinel token (<extra_id_*>) in the inputs,
    # and keep the complementary spans (prefixed by sentinels) as the labels
    input_ids_sentinel = denoiser.create_sentinel_ids(mask_indices.astype(np.int8))
    labels_sentinel = denoiser.create_sentinel_ids(labels_mask.astype(np.int8))
    input_ids = denoiser.filter_input_ids(encoded.input_ids, input_ids_sentinel)
    labels = denoiser.filter_input_ids(encoded.input_ids, labels_sentinel)
    return labels, input_ids

Let’s use the t5-base model.

To understand denoised training, look at what happens below.

# This is what happens inside denoised training
prompt = "The <extra_id_0> walks in <extra_id_1> park"
encoded_prompt = tokenizer(prompt, truncation=False, padding=False, return_tensors="pt").input_ids
print(f"encoded_prompt ={encoded_prompt}")
labels = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"
encoded_labels = tokenizer(labels, truncation=False, padding=False, return_tensors="pt").input_ids
print(f"encoded_labels ={encoded_labels}")
print(f"encoded_prompt.shape=encoded_labels.shape {encoded_prompt.shape} ={encoded_labels.shape}")

Output

encoded_prompt =tensor([[   37, 32099, 10681,    16, 32098,  2447,     1]])
encoded_labels =tensor([[32099, 5295, 1782, 32098, 8, 32097, 1]])
encoded_prompt.shape=encoded_labels.shape torch.Size([1, 7]) =torch.Size([1, 7])
prompt = "The cute dog walks in the green park"
labels, input_ids = get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt)
print(f"denoised input_ids decoded = {tokenizer.decode(*input_ids,skip_special_tokens=False)}")
print(f"denoised labels decoded = {tokenizer.decode(*labels,skip_special_tokens=False)}")
print(f"input_ids.shape {input_ids.shape} labels.shape {labels.shape}") # todo should this be equal

Output

denoised input_ids decoded = The cute<extra_id_0> in<extra_id_1> green<extra_id_2></s>
denoised labels decoded = <extra_id_0> dog walks<extra_id_1> the<extra_id_2> park</s></s>
input_ids.shape (1, 8) labels.shape (1, 9)

So the input “The cute dog walks in the green park” gets transformed into “The cute<extra_id_0> in<extra_id_1> green<extra_id_2></s>”, and the target (ground truth) becomes “<extra_id_0> dog walks<extra_id_1> the<extra_id_2> park</s></s>”.

The <extra_id_0>, <extra_id_1>, … and </s> (end of sequence) are special sentinel tokens that the tokenizer reserves for exactly this purpose.
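To make that concrete, the T5 tokenizer reserves 100 such sentinel tokens, <extra_id_0> through <extra_id_99>, at the top of its vocabulary. A quick self-contained check with the same t5-base tokenizer:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-base")
print(len(tok))                                   # 32100 tokens in total
print(tok.convert_tokens_to_ids("<extra_id_0>"))  # 32099, matches the encoded prompt above
print(tok.convert_tokens_to_ids("<extra_id_1>"))  # 32098
print(tok.eos_token, tok.eos_token_id)            # </s> 1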

Now let’s train with these inputs

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 't5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)  # or T5Tokenizer
len_tokenizer = len(tokenizer)  # 32100, to get the sentinel ids
print(f"len_tokenizer={len_tokenizer}")

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # the learning rate here is illustrative

prompt = "The cute dog walks in the green park"
labels, input_ids = get_denoised(FlaxDataCollatorForT5MLM, tokenizer, prompt)
print(f"denoised input_ids decoded = {tokenizer.decode(*input_ids,skip_special_tokens=False)}")
print(f"denoised labels decoded = {tokenizer.decode(*labels,skip_special_tokens=False)}")
print(f"input_ids.shape {input_ids.shape} labels.shape {labels.shape}") # todo should this be equal

# Convert the numpy arrays from the denoiser to torch tensors
denoised_input_ids = torch.from_numpy(input_ids)
denoised_labels = torch.from_numpy(labels)
denoised_attention_mask = torch.ones(input_ids.shape)

model.train()
for epoch in range(100):
    # Teacher-forced forward pass: the model shifts the labels right internally
    # to build the decoder inputs and returns the cross-entropy loss
    outputs = model(input_ids=denoised_input_ids,
                    attention_mask=denoised_attention_mask,
                    labels=denoised_labels)
    loss = outputs.loss
    if epoch % 20 == 0:
        print(f"Epoch {epoch} Loss {loss}")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"Epoch {epoch} Loss {loss}")

Output

denoised input_ids decoded = The<extra_id_0> dog walks<extra_id_1> the<extra_id_2></s>
denoised labels decoded = <extra_id_0> cute<extra_id_1> in<extra_id_2> green park</s></s>
input_ids.shape (1, 8) labels.shape (1, 9)
Epoch 0 Loss 3.533181667327881
Epoch 20 Loss 1.1342705488204956
Epoch 40 Loss 0.5043283700942993
Epoch 60 Loss 0.11743383854627609
Epoch 80 Loss 0.38232171535491943
Epoch 99 Loss 0.01801321841776371
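A side note on that loss: when labels are passed, the Hugging Face T5 model computes the cross-entropy loss internally over the decoder logits (label positions set to -100 are ignored). A minimal sketch, reusing the tensors from the training snippet above, to reproduce the number the model reports:

import torch

outputs = model(input_ids=denoised_input_ids,
                attention_mask=denoised_attention_mask,
                labels=denoised_labels)
# Cross-entropy over the vocabulary at every target position; -100 labels are ignored
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100)
manual_loss = loss_fct(outputs.logits.view(-1, outputs.logits.size(-1)),
                       denoised_labels.view(-1).long())
print(outputs.loss.item(), manual_loss.item())  # the two values should match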

Let’s test this model

# After training
model.eval()
test_prompt = "The <extra_id_0> dog walks in the <extra_id_2>"
encoded = tokenizer(test_prompt, truncation=False, padding=False, return_tensors="pt")
test_output = model.generate(input_ids = encoded.input_ids,num_return_sequences=1,max_length=125)
test_answer = tokenizer.decode(test_output[0], skip_special_tokens=True)
print(f"After Training:'{test_prompt}'-->'{test_answer}'")

Output

After Training:'The  <extra_id_0> dog  walks in the <extra_id_2>'-->'cute green park'

Not bad!
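To see which predicted span fills which sentinel, decode the generated ids again without dropping the special tokens (a small variation on the test above):

# Keep <pad>, the sentinel tokens and </s> in the decoded output
raw_answer = tokenizer.decode(test_output[0], skip_special_tokens=False)
print(f"Raw generation: '{raw_answer}'")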

Some theory now — straight off https://huggingface.co/docs/transformers/model_doc/t5#training

T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that for training, we always need an input sequence and a corresponding target sequence. The input sequence is fed to the model using input_ids.

The target sequence is shifted to the right, i.e., prepended by a start-sequence token and fed to the decoder using the decoder_input_ids.

In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the labels.

(For Hugging Face models, only the input_ids and labels need to be given; the decoder_input_ids are calculated by the model.)
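To see that shift in code: the T5 model in transformers exposes a helper that builds the decoder inputs from the labels (assuming a transformers version recent enough to provide prepare_decoder_input_ids_from_labels). It shifts the label ids one position to the right and prepends the decoder start token, which for T5 is the pad token id 0:

# Decoder input = labels shifted right, starting with the decoder start token (pad id 0 for T5)
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels=denoised_labels)
print(denoised_labels[0])     # <extra_id_0> ... </s>
print(decoder_input_ids[0])   # 0 (start), then the label ids shifted right by one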

So, in case you are confused: what we have done here is teacher-forced, denoised training.

In the case of teacher forcing, take the training example

“The cute dog walks in the green park”

If the model predicts “The cute cat” after it reaches “cute”, training does not continue from what the model predicted but from what the labels/targets expect: “The cute dog” <predict next>.
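In transformers terms, that is exactly what happens when labels are passed to the model: the decoder is fed the (shifted) ground-truth sequence at every step, and all target positions are scored in one forward pass. Only at inference time, with generate(), does the model condition on its own previous predictions. A small sketch reusing the tensors from the training snippet above:

# Teacher forcing (training): the decoder sees the gold prefix at every step,
# so even if it would have predicted "cat" after "The cute", the next position
# is still conditioned on the ground-truth "dog"
tf_outputs = model(input_ids=denoised_input_ids,
                   attention_mask=denoised_attention_mask,
                   labels=denoised_labels)

# Free running (inference): generate() feeds the model's own previous
# predictions back in, one token at a time
generated = model.generate(input_ids=denoised_input_ids, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=False))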

And if you are confused about the difference between a Masked LM and a Causal LM, here is what my smart assistant (ChatGPT) says.

Me: What is the difference between a masked language model and a causal language model?

ChatGPT

A Masked Language Model (MLM) and a Causal Language Model (CLM) are two different types of language models that are used in natural language processing. They differ in how they are trained and the tasks they are typically used for.

  1. Masked Language Model (MLM): MLMs, like BERT, are trained using a method where some percentage of input tokens are masked at random, and the model’s objective is to predict the original value of the masked words, based on the context provided by the non-masked words. This allows the model to focus on the understanding of bidirectional context, meaning it learns to understand the meaning of a word based on all of its surrounding contexts — to the left and right of the word. This is particularly useful for tasks like question answering, named entity recognition, and other tasks where understanding the full context of input is important.
  2. Causal Language Model (CLM): CLMs, like GPT, are trained to predict the next word in a sentence given all the previous words (but not the future words), which is why it’s also called an autoregressive model. This means it learns to understand the context in a unidirectional way (from left to right). This type of model is particularly useful for tasks like text generation, where we want to generate a sequence of words one word at a time.

In summary, the main difference lies in the way these models are trained and use context. MLMs use bidirectional context and are trained to fill in gaps in a text, while CLMs use unidirectional context and are trained to predict the next word in a sequence.
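As a concrete illustration (a sketch that is not part of the notebook above), the same Hugging Face data collator can build either objective from the same sentence; the bert-base-uncased tokenizer and the 0.15 masking probability here are just the usual defaults:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
examples = [tok("The cute dog walks in the green park")]

# MLM: random tokens are replaced by [MASK]; labels are -100 everywhere except
# the masked positions, so the loss only covers predicting the masked words
mlm_collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)
print(mlm_collator(examples))

# CLM: no masking; labels are a copy of input_ids, and the shift by one position
# (predict the next token from the left context) happens inside the model
clm_collator = DataCollatorForLanguageModeling(tok, mlm=False)
print(clm_collator(examples))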
