Unleashing the Power of FastAPI, GPT-2, and Hugging Face: A Guide to Efficient Natural Language Processing
In recent years, natural language processing (NLP) has witnessed remarkable advancements, enabling us to create intelligent applications that understand and generate human-like text. This progress has been made possible by the synergy between powerful tools such as FastAPI, GPT-2, and Hugging Face. In this blog post, we’ll delve into the inner workings of these cutting-edge technologies and explore how they can be combined to create high-performance NLP applications. Strap in as we embark on an exciting journey into the world of FastAPI, GPT-2, and Hugging Face.
Here’s a step-by-step tutorial to write the function for fine-tuning GPT-2 using Hugging Face:
Step 1: Import the necessary libraries
import os
import re
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
Step 2: Define the pre_process_data function
def pre_process_data(text):
    """Perform necessary pre-processing steps on the text"""
    # Flatten newlines into spaces
    text = text.replace("\n", " ")
    # Remove every character that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text
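To see what this cleaning step actually does, here is a small illustrative check. The sample sentence is made up for the example. Note that the regular expression removes every punctuation character, so the model never sees commas or full stops during training; whether that is desirable depends on your use case.

```python
import re

def pre_process_data(text):
    """Same cleaning as above: flatten newlines, then drop punctuation."""
    text = text.replace("\n", " ")
    text = re.sub(r'[^\w\s]', '', text)
    return text

sample = "Winter is coming.\nAre you ready, Jon?"
cleaned = pre_process_data(sample)
print(cleaned)  # Winter is coming Are you ready Jon
```

Underscores survive because \w treats them as word characters.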
Step 3: Define the fine_tune_model function
def fine_tune_model(training_text, model_name='gpt2', output_dir='fine_tuned_model'):
    """Fine-tune the GPT-2 model on a single piece of text"""
    # Load the pre-trained GPT-2 model and its tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    # Tokenize and encode the training text
    input_ids = tokenizer.encode(training_text, add_special_tokens=True, return_tensors='pt')
    # Fine-tune the model
    model.train()
    model.config.pad_token_id = tokenizer.eos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    # Keep the embedding matrix (and config.vocab_size) in sync with the tokenizer
    model.resize_token_embeddings(len(tokenizer))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(1):  # one gradient step per call
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
Step 4: Set the OUTPUT_DIRECTORY for chunked files
OUTPUT_DIRECTORY = 'data_chunks'
Step 5: Get the list of chunked files and process each file
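chunk_files = [filename for filename in os.listdir(OUTPUT_DIRECTORY) if filename.endswith('.txt')]
MODEL_DIR = 'fine_tuned_model'
COUNTER = 0
for chunk_file in chunk_files:
    chunk_file_path = os.path.join(OUTPUT_DIRECTORY, chunk_file)
    COUNTER += 1
    with open(chunk_file_path, 'r', encoding='utf-8') as file:
        chunk_text = file.read()
    preprocessed_text = pre_process_data(chunk_text)
    # Resume from the saved checkpoint if there is one; otherwise every chunk
    # would restart from the base 'gpt2' weights and overwrite the previous result
    model_name = MODEL_DIR if os.path.isdir(MODEL_DIR) else 'gpt2'
    fine_tune_model(preprocessed_text, model_name=model_name, output_dir=MODEL_DIR)
    print(COUNTER)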
This section guides you through defining the pre_process_data and fine_tune_model functions. It also demonstrates how to iterate over the chunked files in OUTPUT_DIRECTORY, read the text from each file, preprocess it, and then fine-tune the GPT-2 model on the preprocessed text.
Remember to adjust the input and output directories, as well as any other parameters, to match your specific requirements.
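One detail worth knowing about the loop over the chunked files: os.listdir returns names in arbitrary order, and a plain alphabetical sort would put chunk_10.txt before chunk_2.txt. If you want the chunks processed in their original order, a small optional helper like the one below could be used. The name sort_chunk_files is hypothetical and not part of the original code; it assumes the chunk_N.txt naming produced by the data preparation script described later in this post.

```python
import re

def sort_chunk_files(filenames):
    """Order names like chunk_0.txt, chunk_1.txt, ... by their numeric index."""
    def chunk_index(name):
        match = re.search(r'(\d+)', name)
        # Names without a number sort first
        return int(match.group(1)) if match else -1
    return sorted(filenames, key=chunk_index)

names = ['chunk_10.txt', 'chunk_2.txt', 'chunk_0.txt']
print(sort_chunk_files(names))  # ['chunk_0.txt', 'chunk_2.txt', 'chunk_10.txt']
```

You could wrap the list of chunk files in this helper before looping over it.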
Save all of this code in a single file called models.py:
import os
import re
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
def pre_process_data(text):
    """Pre-process the raw text"""
    # Flatten newlines and remove punctuation
    text = text.replace("\n", " ")
    text = re.sub(r'[^\w\s]', '', text)
    # Return the pre-processed text
    return text

def fine_tune_model(training_text, model_name='gpt2', output_dir='fine_tuned_model'):
    """Fine-tune a pre-trained GPT-2 model on a single piece of text"""
    # Load pre-trained GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    # Tokenize and encode the training text
    input_ids = tokenizer.encode(training_text, add_special_tokens=True, return_tensors='pt')
    # Fine-tune the model
    model.train()
    model.config.pad_token_id = tokenizer.eos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    # Keep the embedding matrix (and config.vocab_size) in sync with the tokenizer
    model.resize_token_embeddings(len(tokenizer))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(1):  # one gradient step per call
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

OUTPUT_DIRECTORY = 'data_chunks'
MODEL_DIR = 'fine_tuned_model'
# Get the list of chunked files
chunk_files = [filename for filename in os.listdir(OUTPUT_DIRECTORY) if filename.endswith('.txt')]
# Process each chunked file
COUNTER = 0
for chunk_file in chunk_files:
    chunk_file_path = os.path.join(OUTPUT_DIRECTORY, chunk_file)
    COUNTER += 1
    # Read the chunked file
    with open(chunk_file_path, 'r', encoding='utf-8') as file:
        chunk_text = file.read()
    # Preprocess the chunked text
    preprocessed_text = pre_process_data(chunk_text)
    # Resume from the saved checkpoint if there is one; otherwise every chunk
    # would restart from the base 'gpt2' weights and overwrite the previous result
    model_name = MODEL_DIR if os.path.isdir(MODEL_DIR) else 'gpt2'
    # Fine-tune the model
    fine_tune_model(preprocessed_text, model_name=model_name, output_dir=MODEL_DIR)
    print(COUNTER)
Before running the code above, we need to split our data into chunks of 1024 characters each. Preparing your text data is a crucial step before fine-tuning GPT-2 with Hugging Face. Splitting the data into small chunks keeps each training example short enough for the model to handle and keeps memory use manageable during training. The steps below walk through the splitting script.
Importing the necessary libraries:
import os
Setting up the input and output directories:
INPUT_FILE = 'got.txt'
OUTPUT_DIRECTORY = 'data_chunks'
CHUNK_SIZE = 1024
Creating the output directory if it doesn’t exist:
os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)
Reading the input file:
with open(INPUT_FILE, 'r', encoding='utf-8') as file:
    text = file.read()
Calculating the number of chunks required:
num_chunks = len(text) // CHUNK_SIZE
if len(text) % CHUNK_SIZE != 0:
    num_chunks += 1
Splitting the text into chunks and saving them as individual files:
for i in range(num_chunks):
    start = i * CHUNK_SIZE
    end = (i + 1) * CHUNK_SIZE
    chunk = text[start:end]
    # Save the chunk to a file
    chunk_filename = os.path.join(OUTPUT_DIRECTORY, f'chunk_{i}.txt')
    with open(chunk_filename, 'w', encoding='utf-8') as file:
        file.write(chunk)
Printing the success message indicating the number of chunks created:
print(f'Successfully split the file into {num_chunks} chunks.')
Put this code in a file called prepare_data.py.
In summary, this code takes an input file (got.txt) and divides its content into smaller chunks of 1024 characters each. It then saves each chunk as an individual file in the data_chunks directory. This approach is useful when dealing with large text files that need to be processed in smaller parts or when working with limited computational resources.
Make sure to replace got.txt with the path to your own input file and adjust CHUNK_SIZE and OUTPUT_DIRECTORY according to your requirements.
By executing this code, you will successfully split your input file into multiple chunks, each saved as a separate file in the designated output directory.
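The chunk-count arithmetic in the script is a ceiling division: any leftover characters get one extra, shorter chunk. The sketch below isolates that logic in a function so it can be checked without touching the file system. The name split_into_chunks is hypothetical and not part of prepare_data.py. Keep in mind that the size is measured in characters, not in GPT-2 tokens.

```python
import math

def split_into_chunks(text, chunk_size):
    """Return consecutive slices of at most chunk_size characters."""
    num_chunks = math.ceil(len(text) / chunk_size)
    return [text[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]

text = "a" * 2500
chunks = split_into_chunks(text, 1024)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Joining the chunks back together reproduces the original text, so nothing is lost at the boundaries, although words can be cut in half where a boundary falls.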
After that, you can run models.py, and the fine-tuned model will be saved in the fine_tuned_model folder.
Next, we are going to create the consumer of our model, which we call runner.py.
Importing the necessary libraries:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
Defining the generate function:
async def generate(prefix, max_length=800, top_k=5, model_dir='fine_tuned_model'):
    """Generate text based on the input prefix"""
    # Load the fine-tuned model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    # Encode the input prefix
    input_ids = tokenizer.encode(prefix, add_special_tokens=True, return_tensors='pt')
    # Generate text based on the prefix
    model.eval()
    model.config.pad_token_id = tokenizer.eos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    # Keep the embedding matrix (and config.vocab_size) in sync with the tokenizer
    model.resize_token_embeddings(len(tokenizer))
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=1,
            top_k=top_k,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    # Decode and return the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text
In summary, this code defines an asynchronous generate function that takes a prefix as input. It loads a fine-tuned GPT-2 model and tokenizer, encodes the input prefix, and generates text based on the prefix using the GPT-2 model. The generated text is then decoded and returned as the output.
To use this code, make sure the fine-tuned GPT-2 model is saved in the fine_tuned_model directory. Adjust the max_length and top_k parameters to suit your text generation needs. Note that max_length counts tokens and includes the tokens of the prefix.
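As written, generate loads the model and tokenizer from disk on every call. One possible refinement, sketched here under assumptions, is to cache the loaded objects so the expensive load happens once per model directory. The load_model helper and the LOAD_CALLS list are hypothetical; the stand-in returns a plain dict so the example runs without the model files, and in runner.py it would return the result of the two from_pretrained calls instead.

```python
from functools import lru_cache

LOAD_CALLS = []  # records each real load, for demonstration only

@lru_cache(maxsize=None)
def load_model(model_dir):
    """Stand-in for the expensive from_pretrained calls (hypothetical helper)."""
    LOAD_CALLS.append(model_dir)
    # In runner.py this would return something like:
    # (GPT2LMHeadModel.from_pretrained(model_dir),
    #  GPT2Tokenizer.from_pretrained(model_dir))
    return {"model_dir": model_dir}

first = load_model('fine_tuned_model')
second = load_model('fine_tuned_model')
print(first is second)   # True
print(len(LOAD_CALLS))   # 1
```

The trade-off is that the model stays in memory for the lifetime of the process, which is usually what you want for an API server.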
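By calling the generate function with an appropriate prefix, you can generate text using the fine-tuned GPT-2 model.
Because generate is declared with async def, callers must await it, as the FastAPI endpoint below does. Keep in mind that loading the model and generating text are blocking operations, so the event loop cannot serve other requests while they run.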
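One common way to keep an async application responsive while slow, blocking work runs is to hand that work to a worker thread. The sketch below shows the pattern with a stand-in function instead of the real model call; blocking_generate and generate_in_thread are hypothetical names, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time

def blocking_generate(prefix):
    """Stand-in for the blocking model call (hypothetical, not the real generate)."""
    time.sleep(0.1)  # simulates slow inference
    return prefix + " ..."

async def generate_in_thread(prefix):
    # Run the blocking function in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_generate, prefix)

async def main():
    # Both calls overlap instead of running one after the other
    return await asyncio.gather(
        generate_in_thread("Once upon a time"),
        generate_in_thread("In a galaxy"),
    )

results = asyncio.run(main())
print(results)  # ['Once upon a time ...', 'In a galaxy ...']
```

Whether this helps with real model inference depends on your hardware and on how much of the work releases the Python interpreter lock, so treat it as a starting point and measure.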
Next, we are going to use the generate function in our API.
Create main.py:
from fastapi import FastAPI
from pydantic import BaseModel
from runner import generate

app = FastAPI()

class GenerateRequest(BaseModel):
    prefix: str
    max_length: int = 800
    top_k: int = 5

@app.post("/generate")
async def generate_text(request: GenerateRequest):
    """
    Generate text based on the prefix asynchronously
    """
    generated_sentence = await generate(request.prefix, request.max_length, request.top_k)
    return {"generated_sentence": generated_sentence}
In summary, this code sets up a FastAPI application and creates an endpoint at /generate that accepts a POST request. The request body should include the prefix string and optional parameters such as max_length and top_k. The generate_text function asynchronously calls the generate function we explained earlier, passing the provided request parameters. The generated text is then returned as the API response.
To use this code, make sure you have FastAPI, Pydantic, and the generate function properly configured and imported. By sending a POST request to /generate with the required parameters, you can obtain the generated text as the API response.
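As an illustration of what such a request looks like, the sketch below builds the POST request with the standard library only. It assumes the app is served locally at http://127.0.0.1:8000 (for example with uvicorn main:app, which is an assumption about your setup), and build_generate_request is a hypothetical helper name. The request is only constructed here; the commented lines show how it could be sent once the server is running.

```python
import json
import urllib.request

def build_generate_request(prefix, max_length=800, top_k=5,
                           base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"prefix": prefix, "max_length": max_length, "top_k": top_k}
    return urllib.request.Request(
        base_url + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_generate_request("Winter is coming", max_length=100)
print(request.get_method(), request.full_url)
# To send it once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["generated_sentence"])
```

The JSON keys match the fields of GenerateRequest, and the response key matches what generate_text returns.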
Feel free to customize the code, such as adding input validation or error handling, based on your specific requirements and use case.
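As one example of input validation, here is a plain-Python sketch of the kind of bounds you might enforce before calling the model. The function name and the exact limits are assumptions chosen for illustration; the upper bound of 1024 reflects GPT-2's maximum context length in tokens. In a FastAPI app, similar limits could instead be declared on the GenerateRequest fields with Pydantic's Field constraints.

```python
def validate_generate_params(prefix, max_length=800, top_k=5):
    """Raise ValueError if the parameters fall outside the illustrative bounds."""
    if not prefix.strip():
        raise ValueError("prefix must not be empty")
    if not 1 <= max_length <= 1024:
        raise ValueError("max_length must be between 1 and 1024")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return prefix, max_length, top_k

validated = validate_generate_params("Winter is coming", 100, 5)
try:
    validate_generate_params("", 100, 5)
except ValueError as error:
    print(error)  # prefix must not be empty
```

In the endpoint, a ValueError like this would typically be translated into a 4xx HTTP response so the client knows the request was rejected.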
I hope this explanation helps you understand the functionality of the code for creating an API endpoint to generate text!
All the code in this post is available in my GitHub repository; please take a look at the following link.