Microsoft’s BioGPT: A GPT-based Language Model for Biomedical Text Processing

Tavleen Bajwa · Published in Geek Culture · 8 min read · Mar 26, 2023

It has been a while since I last put pen to paper, so I thought I would cover BioGPT, a new tool I have been reading about for quite some time.

Microsoft has created a language model called BioGPT that is based on the GPT architecture and is designed specifically for processing biomedical text and data.

According to Microsoft, their BioGPT-Large model reaches 81.0% accuracy on the PubMedQA biomedical question-answering benchmark.


Is Microsoft’s BioGPT built on the same architecture as OpenAI’s GPT models, including ChatGPT and the new GPT-4?

Yes. BioGPT uses the same transformer-based neural network architecture as OpenAI’s GPT models, which are used for natural language processing tasks such as text generation, information retrieval, and language translation. Microsoft built its own version of the GPT model and trained it on a large corpus of biomedical literature, so BioGPT is specifically tuned to process biomedical language and data.

Despite sharing the same architecture as OpenAI’s GPT models, BioGPT was trained using a separate collection of data and is tailored for a particular set of tasks specific to the biomedical domain.

Transformer-Based?

A transformer-based neural network is a deep learning model that is often used for natural language processing tasks such as language translation, text summarization, and language understanding.

The transformer design was first described in the 2017 Google research paper “Attention Is All You Need.” In contrast to recurrent neural networks, which process information sequentially, the transformer architecture uses a mechanism known as “self-attention” to process the whole input sequence at once.

Self-attention enables the model to recognise significant relationships between words in the input sequence and use this information to produce more accurate outputs. OpenAI’s GPT models and Google’s BERT model are just two examples of language models built on the transformer architecture.
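To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain PyTorch. It illustrates the mechanism only; the single attention head and tensor sizes are assumptions chosen for brevity, not BioGPT’s actual implementation.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x.
    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q = x @ w_q                              # queries
    k = x @ w_k                              # keys
    v = x @ w_v                              # values
    scores = q @ k.T / k.shape[-1] ** 0.5    # similarity between every pair of positions
    weights = F.softmax(scores, dim=-1)      # attention weights sum to 1 per position
    return weights @ v                       # each output mixes all value vectors

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings
torch.manual_seed(0)
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])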

The Hugging Face BioGPT Interface

The demo combines Whisper, M2M100, and BioGPT (Generative Pre-trained Transformer for Biomedical Text Generation and Mining).

Built on top of the PyTorch deep learning framework, the Hugging Face BioGPT interface is an open-source software package intended to be simple to use and adaptable to a range of natural language processing tasks.

Several pre-trained models fine-tuned for particular tasks, such as named entity recognition, sentence classification, and text generation, are available through the interface. It also includes pre-processing tools for handling biomedical text data, such as entity tagging and tokenization. Let me show how it looks. You can see my input, “COVID-19 is”, highlighted in red in the image below. When we click the “Predict” button, the result indicated by the green arrow appears.

Hugging Face BioGPT interface for text generation and mining

Whisper and M2M100

Whisper (from OpenAI) and M2M100 (from Facebook AI, now Meta AI) are both transformer-based pre-trained models used in the demo alongside BioGPT.

Whisper is a speech recognition and transcription model that uses artificial intelligence (AI) to turn spoken language into written text. By removing the need to manually transcribe spoken content, it saves time and boosts productivity for individuals and enterprises alike. It is an encoder-decoder transformer trained on a large corpus of multilingual audio paired with transcripts, so it can transcribe speech in many languages and also translate speech into English text.

As seen in the image above, the interface accepts audio input in addition to text; the audio option is what Whisper is used for.
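For completeness, here is a minimal sketch of running Whisper through the same Hugging Face pipeline API used later for BioGPT. The checkpoint size and the audio file name are assumptions chosen for illustration.

from transformers import pipeline

#Load a Whisper checkpoint for automatic speech recognition
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

#"sample.wav" is a hypothetical local audio recording to transcribe
result = asr("sample.wav")
print(result["text"])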

M2M100, on the other hand, is a large multilingual machine translation model that can translate directly between any pair of 100 languages without using English as an intermediary. It is likewise built on the transformer architecture and was trained on a large corpus of parallel data, and it has been shown to outperform previous state-of-the-art machine translation models on a number of benchmarks.
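As a quick illustration of direct translation without pivoting through English, here is a short sketch using the publicly available facebook/m2m100_418M checkpoint; the French example sentence and the language pair are my own assumptions.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

#Load the smallest public M2M100 checkpoint
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

#Translate French -> German directly, without English as a pivot
tokenizer.src_lang = "fr"
encoded = tokenizer("La vie est belle.", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))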

Loading a pretrained model into Jupyter Notebook

  1. Installing the Hugging Face Transformers library using pip:
!pip install transformers

2. Importing Python modules from the Transformers library

#Provides an easy-to-use API for using pre-trained models for various NLP tasks
from transformers import pipeline, set_seed

#Specific classes for working with the BioGPT model
""" BioGptTokenizer class is used to tokenize text inputs in a format that can be processed by the BioGPT model.
The BioGptForCausalLM class is used to create an instance of the BioGPT model that can be used for language modeling
tasks such as generating text or completing prompts
"""
from transformers import BioGptTokenizer, BioGptForCausalLM

3. Installing torch & sacremoses

!pip install torch
!pip install sacremoses   

"""
Sacremos is used internally for BioGPTTokenizer (It is used in Tokenizing and Normalizing text strings)
It includes Tokenization, Lowercasing, Deaccenting, Unicode Normalization(Unique code pointing to each character)
"""

4. Code Example of text generation using the BioGPT model

#Text generation 
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt") #Loads a pre-trained BioGPT model from the Microsoft model hub

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

set_seed(42) #Sets the random seed to ensure that the generated text is reproducible

generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)
Output:

[{'generated_text': 'COVID-19 is a disease that spreads worldwide and is currently found in a growing proportion of the population'},
{'generated_text': 'COVID-19 is one of the largest viral epidemics in the world.'},
{'generated_text': 'COVID-19 is a common condition affecting an estimated 1.1 million people in the United States alone.'},
{'generated_text': 'COVID-19 is a pandemic, the incidence has been increased in a manner similar to that in other'},
{'generated_text': 'COVID-19 is transmitted via droplets, air-borne, or airborne transmission.'}]

Line 1: A pre-trained BioGPT model is loaded from the microsoft organisation on the Hugging Face model hub using BioGptForCausalLM.from_pretrained("microsoft/biogpt").

Line 2: Loads the matching BioGPT tokenizer from the same checkpoint.

Line 3: The pipeline function creates a pipeline object for text generation, which uses the loaded BioGPT model and tokenizer.

Line 5: Generates text that begins with the prompt “COVID-19 is” and has a maximum length of 20 tokens. The num_return_sequences argument returns 5 distinct generated sequences, and setting do_sample=True makes the model sample from the output distribution rather than always picking the most likely token, which can produce more interesting and diverse results.

5. Code example that shows how to encode input text using the BioGPT tokenizer and feed it to the BioGPT model for processing.

#Use model to get features of a given text in Pytorch 

"""
Description: Method call that passes the encoded input to the pre-trained language model for processing.
The input is encoded in a format that the model can understand, which typically involves tokenizing the
input text and converting it to a numerical format that the model can process

"""
from transformers import BioGptTokenizer, BioGptForCausalLM
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you like."
encoded_input = tokenizer(text, return_tensors='pt') #tokenizes the input text and returns it in a PyTorch tensor format that the model can process.

"""
The BioGPT model receives the encoded input via the model(**encoded_input) call, which produces an output tensor.
The ** syntax unpacks the encoded_input dictionary into separate keyword arguments for the model.

"""
output = model(**encoded_input)
#Inspecting encoded_input shows the token IDs and attention mask produced by the tokenizer
encoded_input

#The 7-word sentence was encoded into 10 token IDs (including special tokens and punctuation)

Output: {'input_ids': tensor([[ 2, 1719, 2018, 14815, 23, 420, 4210, 11980, 423, 4]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Depending on the goal, the output tensor can be further processed to generate text, predict sentiment, or extract named entities.
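For example, here is a minimal sketch (continuing from the code above) of pulling the next-token logits and the last hidden states out of the model output; the variable names are my own.

#The causal-LM output carries next-token logits
logits = output.logits                    #shape: (batch, sequence_length, vocab_size)
print(logits.shape)

#To use BioGPT as a feature extractor, request the hidden states explicitly
features = model(**encoded_input, output_hidden_states=True)
last_hidden = features.hidden_states[-1]  #shape: (batch, sequence_length, hidden_size)
print(last_hidden.shape)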

6. Code Example to generate text with Beam Search Decoding using BioGPT model

#Beam search decoding 

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")

set_seed(42)

"""
Description:
torch.no_grad() is a context manager that temporarily disables gradient calculation during model inference,
which can help save memory and speed up processing
"""
with torch.no_grad():
    beam_output = model.generate(**inputs, #Unpacks the inputs dictionary & passes it as keyword arguments to the generate() method
                                 min_length=100, #Minimum number of tokens to generate
                                 max_length=1024, #Maximum number of tokens to generate
                                 num_beams=5, #Number of beams used in beam search; the model tracks multiple likely sequences and keeps the one with the highest overall score
                                 early_stopping=True #Stops the search as soon as enough finished candidate sequences are found
                                 )

tokenizer.decode(beam_output[0], skip_special_tokens=True) #Excludes special tokens that were added during encoding or generation
Output:
'COVID-19 is a global pandemic caused by severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019
(COVID-19), which has spread to more than 200 countries and territories,
including the United States (US), Canada, Australia, New Zealand, the United
Kingdom (UK), and the United States of America (USA), as of March 11, 2020,
with more than 800,000 confirmed cases and more than 800,000 deaths.'

Beam Search Decoding

Beam search decoding is an algorithm frequently used in natural language processing to produce likely token sequences from input text or prompts. It is often paired with generative language models such as GPT-2, GPT-3, and BioGPT to produce coherent and grammatically sound phrases or paragraphs of text.

The algorithm works by keeping track of a number of candidate token sequences (the “beams”) at each stage of the generation process. After the model predicts the probability distribution over the next token, beam search keeps only the top-k most likely candidate sequences and extends them at the next step. This is repeated until a predicted end-of-sequence token or the sequence’s maximum length is reached.

The number of candidate sequences the algorithm keeps at each step is controlled by the parameter k, usually called the “beam size” (num_beams in the code above). A higher k typically improves the quality of the output text at the cost of more computation and longer generation times, so the parameter can be adjusted to the needs of the text generation task.
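To make the procedure concrete, here is a toy beam-search sketch over an assumed next-token probability function; it is a simplified illustration of the idea, not the optimized implementation inside generate().

import math

def beam_search(start, next_token_probs, beam_size=3, max_len=5, eos="<eos>"):
    """Toy beam search: keep the beam_size highest-scoring partial sequences at each step.
    next_token_probs(sequence) -> dict mapping candidate next tokens to probabilities.
    Scores are summed log-probabilities.
    """
    beams = [(0.0, [start])]                      #(log-probability, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:                    #finished sequences are carried over unchanged
                candidates.append((score, seq))
                continue
            for token, prob in next_token_probs(seq).items():
                candidates.append((score + math.log(prob), seq + [token]))
        #Keep only the beam_size best candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]

#Hypothetical next-token distribution, for illustration only
def toy_probs(seq):
    table = {
        "COVID-19": {"is": 0.9, "was": 0.1},
        "is": {"a": 0.6, "transmitted": 0.4},
        "a": {"pandemic": 0.7, "disease": 0.3},
        "transmitted": {"via": 1.0},
    }
    return table.get(seq[-1], {"<eos>": 1.0})

print(beam_search("COVID-19", toy_probs))  #['COVID-19', 'is', 'a', 'pandemic', '<eos>']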

Conclusion

In conclusion, BioGPT is a state-of-the-art generative language model created specifically for the biomedical domain. Thanks to extensive pre-training on a large body of biomedical literature, BioGPT has shown strong performance on a variety of downstream tasks, including text classification, named entity recognition, and text generation.

Because BioGPT is open source, researchers and developers can quickly incorporate it into their natural language processing pipelines. The Hugging Face Transformers library provides a user-friendly interface through which BioGPT can be accessed and used in a variety of ways.

Like any machine learning model, however, BioGPT has limitations. Future work could explore additional biomedical data sources and more efficient pre-training methods to improve its effectiveness.

BioGPT is an important development in the field of natural language processing overall, with potential applications in drug discovery, clinical decision support systems, and other biomedical domains.

Resources:

  1. https://huggingface.co/spaces/kadirnar/BioGpt
  2. https://paperswithcode.com/paper/biogpt-generative-pre-trained-transformer-for


Tavleen Bajwa

Bioinformatics Data Analyst @MedGenome. Love learning and writing articles on tech related topics in my free time