Masked Language Modeling with Hugging Face Transformers: A Beginner’s Guide

Ganesh Lokare
8 min read · Feb 6, 2023


What We Will Cover in This Blog

  1. A review of what Masked Language Modeling is and where it is used.
  2. How to use Masked Language Modeling with Hugging Face Transformers (just a few lines of code).

The main focus of this blog is a very high-level interface for Transformers: the Hugging Face pipeline. Using this interface, you will see that we can generate predictions for masked words in a given text with just one or two lines of code.

What is Masked Language Modeling?

Masked Language Modeling (MLM) is a popular deep learning technique used in Natural Language Processing (NLP) tasks, particularly in the pre-training of Transformer models such as BERT and RoBERTa.

In MLM, a portion of the input text is “masked” or randomly replaced with a special token (usually [MASK]) and the model is trained to predict the original token based on the context surrounding it. The idea behind this is to train the model to understand the context of words and their relationships with other words in a sentence.

MLM is a self-supervised learning technique, meaning that the model learns to generate text without the need for explicit annotations or labels, but instead using the input text itself as supervision. This makes it a versatile and powerful tool for a wide range of NLP tasks, including text classification, question answering, and text generation.

How does Masked Language Modeling work?

Masked Language Modeling (MLM) is a pre-training technique for deep learning models in NLP. It works by randomly masking a portion of the input tokens in a sentence and asking the model to predict the masked tokens. The model is trained on large amounts of text data, so that it can learn to understand the context of words and predict masked tokens based on their surrounding context.

Here is a simple example: Given a sentence, “The cat [MASK] on the roof”, the model would predict the word “sat” as the masked token.

During the training process, the model is updated based on the difference between its predictions and the actual words in the sentence. This pre-training stage helps the model to learn useful contextual representations of words, which can then be fine-tuned for specific NLP tasks. The idea behind MLM is to leverage the large amounts of text data available to learn a general-purpose language model that can be applied to different NLP problems.
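To make the masking step concrete, here is a minimal sketch (not from the original post) of how tokens can be masked in practice, using Hugging Face's DataCollatorForLanguageModeling. The checkpoint name and the 15% masking probability are illustrative defaults, and the exact tokens that get masked differ on every run because masking is random.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# the tokenizer choice is just an example; any MLM-style checkpoint works
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # mask ~15% of tokens
)

encoding = tokenizer("The cat sat on the roof", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])

# positions selected for masking keep their original token id in "labels";
# every other position gets -100 so it is ignored by the training loss
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])

During training, the model's predictions at the masked positions are compared with the original tokens (the labels above), and the resulting loss is used to update the model's weights.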

Use of Masked Language Modeling

Masked Language Modeling (MLM) has several applications in the field of Natural Language Processing (NLP). Some of the most common applications include:

  1. Question answering: MLM can be used to pre-train the model for question answering tasks, where the model must identify the answer to a question given a context.
  2. Named entity recognition: MLM can be used to pre-train the model for named entity recognition tasks, where the model must identify and categorize named entities in a text, such as people, organizations, and locations.
  3. Text generation: MLM can be used to pre-train the model for text generation tasks, where the model must generate text based on a prompt or seed text.
  4. Machine translation: MLM can be used to pre-train the model for machine translation tasks, where the model must translate text from one language to another.
  5. Article spinning: Article spinning is a technique used to create new variations of existing articles by changing words, phrases, or sentences to create new content that is similar in meaning to the original. This technique is often used for SEO purposes, with the goal of creating multiple versions of an article that can be used to target different keywords or phrases.

Overall, MLM has proven to be a powerful technique for improving the performance of NLP models on a wide range of tasks. By pre-training the model on large amounts of text data, MLM can help the model to learn useful contextual representations of words, which can then be fine-tuned for specific NLP tasks.

Why use Transformers for Masked Language Modeling

Transformers are well-suited for Masked Language Modeling (MLM) pre-training because they have the ability to effectively capture the context of words in a sentence. The Transformer architecture, introduced in 2017, is based on self-attention mechanisms that allow the model to weight the importance of different tokens in a sentence. This makes it possible for the Transformer to capture long-range dependencies between words, which is crucial for understanding the context of a sentence.

When used for MLM pre-training, the Transformer can be trained to predict the masked tokens in a sentence based on the surrounding context. This way, the model can learn to understand the relationships between words in a sentence, which can be useful for a wide range of NLP tasks.

In addition to their ability to capture context, Transformers are also highly parallelizable, making it possible to train large models on vast amounts of text data. This matters for MLM pre-training, as the model needs to be trained on a large corpus of text in order to learn useful representations of language.

Overall, the combination of the Transformer architecture and MLM pre-training has proven to be very effective for NLP tasks, and has become a standard pre-training technique in the field.

Enough theory!!! Let’s code…

# install transformers
!pip install transformers

The command pip install transformers installs the transformers package, which provides access to state-of-the-art Transformer-based models for NLP tasks, including masked language modeling.

Once the transformers package is installed, you can import and use Transformer-based models in your own projects.

from transformers import pipeline
# create pipeline for MLM
mlm = pipeline('fill-mask')

The line “from transformers import pipeline” imports the pipeline function from the Transformers library, which provides an easy-to-use interface for common NLP tasks, including MLM. As we have not explicitly supplied a model, the pipeline will select distilroberta-base by default.

The line “mlm = pipeline(‘fill-mask’)” creates an instance of the MLM pipeline, which can be used to generate predictions for masked tokens in a sentence. The “fill-mask” argument specifies that the pipeline should be created for the task of MLM.
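If you want to use a specific checkpoint instead of the default, you can pass a model name explicitly. The checkpoint below is just an example; note that BERT-style models use [MASK] as the mask token, while RoBERTa-style models (including the default distilroberta-base) use <mask>.

# optional: choose a specific model instead of the default
mlm_bert = pipeline('fill-mask', model='bert-base-uncased')
mlm_bert("The cat [MASK] on the roof")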

Once the MLM pipeline has been created, you can use it to generate predictions for masked tokens in a sentence.

mlm("The cat <mask> on the roof")

The above code is calling the instance of the Masked Language Modeling (MLM) pipeline created in the previous code snippet. The argument to the function, “The cat <mask> on the roof”, is a sentence with a masked token, indicated by the “<mask>” placeholder.

When this code is executed, the MLM pipeline will generate predictions for the masked token based on the surrounding context of the sentence. The pipeline will use the pre-trained language model to predict the most likely word to fill the mask based on the context of the sentence.

The function will return a list of predicted tokens and their probabilities, sorted by the model’s confidence in each prediction. The output will look something like this:
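(The exact values depend on the model version; each prediction is a dictionary with a score, the predicted token, and the completed sequence, roughly in this shape. Only the top score below comes from my run; the rest are left out.)

[{'score': 0.2022, 'token_str': 'sleeping', 'sequence': 'The cat sleeping on the roof', ...},
 {'score': ..., 'token_str': 'sleeps', ...},
 ...]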

In this example, the model predicts that the most likely word to fill the mask is “sleeping”, with a score of 0.2022. The other predictions, such as “sleeps”, “perched”, “sits”, and “sitting”, have lower scores, indicating lower confidence from the model.

Let’s Try It on a Custom Dataset

!wget -nc https://www.dropbox.com/s/7hb8bwbtjmxovlc/bbc_text_cls.csv?dl=0

This command downloads the dataset file from the given Dropbox URL; the -nc (no-clobber) flag skips the download if the file already exists locally.

# import required libraries
import textwrap
import numpy as np
import pandas as pd
from pprint import pprint

Load the data using the pandas library:

# wget saved the file under the name "bbc_text_cls.csv?dl=0", query string included
df = pd.read_csv('bbc_text_cls.csv?dl=0')

Print the first few rows of the data:

df.head()

Check the unique labels:

labels = set(df['labels'])
labels

{'business', 'entertainment', 'politics', 'sport', 'tech'}

As you can see, we have 5 unique labels. For our task, we will keep only the articles with the ‘business’ label.

texts = df[df['labels'] == 'business']['text']
texts.head()

Select a random article:

i = np.random.choice(texts.shape[0])
doc = texts.iloc[i]

The code uses numpy's random.choice method to select a random index between 0 and the number of rows in the texts Series. The resulting index is stored in the variable i. The code then selects the corresponding article using .iloc[i] and stores it in the variable doc for further use.

Print the chosen article, which is stored in the doc variable:

print(textwrap.fill(doc, replace_whitespace = False, fix_sentence_endings = True))

The next step is to test our model. First, let’s mask the last word of an article title and check whether the model predicts it correctly:

mlm("Trade gap narrows as exports <mask>")
Predictions

As you can see, this returns multiple substitutions, and all of the suggestions are close in meaning to the original word.

Next, we will try the first line of the chosen article, “BBC poll indicates economic gloom”:

text = "Citizens in a majority of nations surveyed in a <mask> World Service poll believe the" + \
"world economy is worsening"
mlm(text)
Predictions

Nice!! The model has correctly predicted the word ‘BBC’ with a score of 64.49%, and the other suggestions are also related to the original word.

Next, you can try other articles or other sentences. You can also write a function that automatically masks and replaces words in a whole document (see the sketch below); you might choose which words to replace based on some statistic, e.g. TF-IDF.
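As a rough sketch (not from the original post), such a helper could look like the one below. It uses the pipeline's own mask token and naively replaces each selected word with the model's top prediction; the word-level splitting, the replacement fraction, and the fallback behaviour are all assumptions you would want to refine, and long documents would need to be chunked to fit the model's 512-token limit.

import numpy as np

def spin_document(doc, mlm, fraction=0.05):
    # naive word-level "article spinner": mask a fraction of the words and
    # replace each one with the model's top prediction for that position
    words = doc.split()
    n_replace = max(1, int(len(words) * fraction))
    idx_to_replace = np.random.choice(len(words), size=n_replace, replace=False)
    for i in idx_to_replace:
        original = words[i]
        words[i] = mlm.tokenizer.mask_token        # '<mask>' for distilroberta-base
        top_prediction = mlm(" ".join(words))[0]   # best guess for this position
        words[i] = top_prediction["token_str"].strip() or original
    return " ".join(words)

# example usage on the chosen article:
# print(spin_document(doc, mlm))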

Next I plan to write additional blogs on Transformers where I will demonstrate how to fine-tune transformer models on custom datasets, as well as how to create transformer models from scratch and train them on custom datasets.

If you enjoyed this post, be sure to check out some of my other blog posts for more insights and information. I’m constantly exploring new topics and writing about my findings, so there’s always something new and interesting to discover.
