Masked Language Model: All You Need to Know

Wiem Souai · UBIAI NLP · Mar 19, 2024

Join us on an exploration of the Masked Language Model (MLM) and its implementation using TensorFlow, Google's open-source machine learning framework. This article covers the basics of MLM and its applications in natural language processing (NLP), and provides a detailed guide to building and training MLM models with TensorFlow. Along the way we will look at understanding Masked Language Models, setting up the environment, loading a dataset (the Quora dataset), preprocessing the data, training, and performing inference. Get ready to immerse yourself in the world of MLM!

Masked Language Model

Masked Language Modeling (MLM) is a pivotal deep learning technique in Natural Language Processing (NLP), used notably to pre-train Transformer encoders such as BERT, DistilBERT, and RoBERTa. In MLM, segments of the input text are hidden or replaced with a designated token ([MASK]), and the model is asked to predict the original token from the surrounding context. This teaches the model sentence context and word relationships. Because MLM is self-supervised, it learns directly from raw text, making the resulting models adaptable to a wide range of tasks, including text classification, question answering, and text generation.

Environment Preparation

An optional step is to log in to your Hugging Face account so that you can later push the fine-tuned model to the Hugging Face Hub.
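A minimal sketch of that login step, assuming a notebook environment and the `huggingface_hub` library:

```python
# Optional: authenticate with the Hugging Face Hub so the fine-tuned model
# can be pushed there later. Requires a (free) Hugging Face access token.
from huggingface_hub import notebook_login

notebook_login()
```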

Dataset Loading (Quora Dataset)

Next, we’ll proceed to load a subset of the Quora dataset by employing the `load_dataset` function from the `datasets` library. This subset will encompass the initial 500 examples from the training split. Upon loading, the dataset will be stored within the `quora` variable.
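A sketch of that loading step, assuming the `quora` dataset hosted on the Hugging Face Hub (depending on your `datasets` version, you may also need to pass `trust_remote_code=True`):

```python
from datasets import load_dataset

# Load only the first 500 examples of the Quora training split.
quora = load_dataset("quora", split="train[:500]")
```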

We will partition the Quora dataset into training and testing sets utilizing the `train_test_split` method. To ensure a balanced distribution, 20% of the data will be allocated to the test set, indicated by `test_size=0.2`.
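A minimal sketch of the split:

```python
# Hold out 20% of the examples for evaluation.
quora = quora.train_test_split(test_size=0.2)
```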

Data Preprocessing

In this step, we’re employing the `AutoTokenizer` class from the `transformers` library to initialize a tokenizer specifically designed for the DistilRoBERTa model. This tokenizer possesses the capability to convert text into tokens suitable for ingestion by the model.
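A sketch, assuming the `distilroberta-base` checkpoint:

```python
from transformers import AutoTokenizer

# DistilRoBERTa tokenizer; it converts raw text into the token IDs the model expects.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
```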

Subsequently, we flatten the Quora dataset so that nested fields (such as the questions column) become top-level columns. This makes the text easier to access in the preprocessing steps that follow.
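A minimal sketch of the flattening step:

```python
# Flatten nested fields: the "questions" column (a dict with "id" and "text")
# becomes the top-level columns "questions.id" and "questions.text".
quora = quora.flatten()
```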

Here, we define a `preprocess_function` that tokenizes the text with the previously initialized tokenizer. The function operates on a batch of examples, concatenating the text within each example into a single string and then tokenizing those strings. The `map` method applies this function to the Quora dataset in batches, with parallel processing (`num_proc=4`). The `remove_columns` parameter is set so that the original text columns are dropped from the processed dataset.
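A sketch of that preprocessing step, assuming the flattened `questions.text` column and naming the result `lm_dataset`, which is how the training section refers to it later; `truncation=True` is an extra safeguard not mentioned in the text:

```python
def preprocess_function(examples):
    # Each Quora example holds a pair of questions; join them into one string, then tokenize.
    return tokenizer(
        [" ".join(texts) for texts in examples["questions.text"]],
        truncation=True,
    )

lm_dataset = quora.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=quora["train"].column_names,
)
```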

In the following code snippet, we establish a DataCollatorForLanguageModeling object tailored for training a language model via the Hugging Face Transformers library. By setting mlm_probability=0.15, we configure it to randomly mask 15% of the tokens within each input sequence during the training process.
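A sketch of the collator setup; `return_tensors="tf"` is assumed here because the rest of the tutorial uses TensorFlow:

```python
from transformers import DataCollatorForLanguageModeling

# Randomly replace 15% of the tokens with the mask token when batches are built.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15,
    return_tensors="tf",
)
```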

For example, if we tokenize this input text:

We get the following output:

But if we apply masking using the previous code, we get the following output:

Token 2129 has therefore been replaced with token 103, which encodes the [MASK] token.

Another approach to annotation is to use the UBIAI annotation tool. Start by importing the document you want to annotate from the documents section.

Then go to settings and set up a new entity. For this example, the entity will represent the masked token.

Now all you have to do is navigate to the annotation section, open the document, and start annotating it.

UBIAI provides a visual representation of the annotation process, so all annotated words are easy to distinguish.

Now all you have to do is export your annotated data, and voilà!

The UBIAI annotation tool offers a user-friendly interface and a rich set of features that let users conveniently select the tokens to mask. This flexibility lets annotators focus on what matters, streamlining the annotation process.

Training

First, we initialize an AdamW optimizer with a learning rate of 2e-5 and a weight decay rate of 0.01 to train the model. Then, we proceed to initialize a masked language model (MLM) using the TFAutoModelForMaskedLM class from the transformers library, loading the pre-trained weights of the DistilRoBERTa base model.
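A sketch of those two steps; `AdamWeightDecay` from `transformers` stands in here as the TensorFlow counterpart of AdamW:

```python
from transformers import AdamWeightDecay, TFAutoModelForMaskedLM

# Adam with decoupled weight decay, matching the hyperparameters in the text.
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

# Masked-language-model head on top of the pre-trained DistilRoBERTa base.
model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")
```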

To prepare TensorFlow datasets (tf_train_set and tf_test_set) for training and evaluation, we use the processed lm_dataset. For training, we shuffle the data (shuffle=True), while for testing we keep the order unchanged (shuffle=False). Each batch comprises 16 examples, and the data_collator function is used to collate batches.
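A sketch using `model.prepare_tf_dataset`, assuming `lm_dataset` is the processed dataset from the preprocessing step:

```python
# Build tf.data pipelines; the collator handles padding and dynamic masking per batch.
tf_train_set = model.prepare_tf_dataset(
    lm_dataset["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    lm_dataset["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
```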

After compiling the model with the specified optimizer (optimizer), we train it using the fit method. The training data (tf_train_set) is used for training and the validation data (tf_test_set) for validation, over 3 epochs. Additionally, the PushToHubCallback callback saves the trained model and tokenizer to the Hugging Face Hub after training.
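A sketch of compilation and training; the output directory name `my_awesome_quora_mlm_model` matches the model name used in the inference section:

```python
from transformers.keras_callbacks import PushToHubCallback

# Transformers TF models pick an appropriate loss automatically when none is given.
model.compile(optimizer=optimizer)

# Save and push the model and tokenizer to the Hugging Face Hub during/after training.
callback = PushToHubCallback(
    output_dir="my_awesome_quora_mlm_model",
    tokenizer=tokenizer,
)

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
```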

Inference

To perform inference, we use the pipeline functionality from the Transformers library to create a fill-mask pipeline, which fills in masked tokens within a given text. We initialize the mask_filler pipeline with the previously trained MLM model (my_awesome_quora_mlm_model). Calling mask_filler with the text and top_k=3 predicts the 3 most probable tokens to fill the <mask> placeholder in the text.
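A sketch of the fill-mask pipeline; the model identifier is an assumption, so point it at your own Hub repository or local output directory:

```python
from transformers import pipeline

text = "UBIAI is a <mask> company."

# Replace with "<your-username>/my_awesome_quora_mlm_model" if the model was pushed
# to the Hub, or with the local output directory otherwise.
mask_filler = pipeline("fill-mask", model="my_awesome_quora_mlm_model")
mask_filler(text, top_k=3)
```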

To initialize a tokenizer for the previously trained MLM model (my_awesome_quora_mlm_model), tokenize the input text, and convert it into TensorFlow tensors (inputs), while also identifying the index of the masked token (<mask>) in the tokenized input (mask_token_index), we execute the following steps:
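A sketch of those steps, again assuming the model identifier from above:

```python
import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_quora_mlm_model")

# Tokenize the text and return TensorFlow tensors.
inputs = tokenizer(text, return_tensors="tf")

# Position of the <mask> token in the input IDs.
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
```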

To compute the logits for the masked token in the input text using the model, extract the top 3 most probable tokens to fill the mask from the logits, and replace the mask token with each of the top 3 tokens, we perform the following steps:
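A sketch of the final scoring step:

```python
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained("my_awesome_quora_mlm_model")

# Logits over the vocabulary at the masked position.
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

# Indices of the 3 most probable tokens for the mask.
top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()

for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
```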

So for the text "UBIAI is a <mask> company." we get the following output:

Conclusion

In conclusion, this article has offered an in-depth exploration of Masked Language Modeling (MLM) leveraging TensorFlow and the Transformers library. We’ve delved into the intricacies of MLM, showcasing its efficacy in training language models to predict missing words within sentences, thereby advancing various natural language processing tasks. Through practical demonstrations and code snippets, we’ve illustrated the process of implementing and training an MLM model using TensorFlow, fine-tuning it with datasets such as Quora, and leveraging its capabilities for tasks involving text generation and prediction. By equipping readers with practical insights and examples, this article serves as a valuable resource for those seeking to harness the power of MLM in their NLP endeavors.
