Unleashing the Power of FastAPI, GPT-2, and Hugging Face: A Guide to Efficient Natural Language Processing

Rino Alfian
6 min read · Jul 7, 2023

In recent years, natural language processing (NLP) has seen remarkable advances, making it possible to build applications that understand and generate human-like text. Much of this is now within easy reach thanks to models like GPT-2 and the Hugging Face Transformers library, which pair naturally with a web framework like FastAPI for serving. In this blog post, we'll walk through how these tools fit together to build a practical NLP application: chunking a text corpus, fine-tuning GPT-2 on it, and exposing text generation through a FastAPI endpoint. Strap in as we embark on an exciting journey into the world of FastAPI, GPT-2, and Hugging Face.

Here’s a step-by-step tutorial to write the function for fine-tuning GPT-2 using Hugging Face:

Step 1: Import the necessary libraries

import os
import re
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

Step 2: Define the pre_process_data function

def pre_process_data(text):
    """Perform necessary pre-processing steps on the text"""
    text = text.replace("\n", " ")
    text = re.sub(r'[^\w\s]', '', text)
    return text

Step 3: Define the fine_tune_model function

def fine_tune_model(training_text, model_name='gpt2', output_dir='fine_tuned_model'):
    """Fine-tune the GPT-2 model"""
    # Load the pre-trained GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Tokenize and encode the training text
    input_ids = tokenizer.encode(training_text, add_special_tokens=True, return_tensors='pt')

    # Fine-tune the model
    model.train()
    model.config.pad_token_id = tokenizer.eos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.vocab_size = model.config.vocab_size + len(tokenizer.get_added_vocab())
    model.resize_token_embeddings(len(tokenizer))

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for _ in range(1):
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

Step 4: Set the OUTPUT_DIRECTORY for chunked files

OUTPUT_DIRECTORY = 'data_chunks'

Step 5: Get the list of chunked files and process each file

chunk_files = [filename for filename in os.listdir(OUTPUT_DIRECTORY) if filename.endswith('.txt')]
COUNTER = 0

for chunk_file in chunk_files:
    chunk_file_path = os.path.join(OUTPUT_DIRECTORY, chunk_file)
    COUNTER += 1

    with open(chunk_file_path, 'r', encoding='utf-8') as file:
        chunk_text = file.read()

    preprocessed_text = pre_process_data(chunk_text)
    # Continue from the previously saved checkpoint after the first chunk,
    # so fine-tuning accumulates across chunks instead of restarting from 'gpt2'
    source_model = 'gpt2' if COUNTER == 1 else 'fine_tuned_model'
    fine_tune_model(preprocessed_text, model_name=source_model)

    print(COUNTER)

This section guides you through defining the pre_process_data and fine_tune_model functions. It also demonstrates how to iterate over the chunked files in OUTPUT_DIRECTORY, read the text from each file, preprocess it, and fine-tune the GPT-2 model on the preprocessed text, continuing from the previously saved checkpoint after the first chunk so that training accumulates across all chunks.

Remember to adjust the input and output directories, as well as any other parameters, to match your specific requirements.
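
For example (an illustrative call only, not part of models.py; the input file, checkpoint, and output directory below are placeholders), you could fine-tune a different base checkpoint on your own text and write the result somewhere else:

# Illustrative only: 'my_corpus.txt', 'gpt2-medium', and 'my_model' are placeholder names
with open('my_corpus.txt', 'r', encoding='utf-8') as file:
    my_text = pre_process_data(file.read())

fine_tune_model(
    my_text,
    model_name='gpt2-medium',  # any GPT-2 checkpoint from the Hugging Face Hub
    output_dir='my_model'      # where the fine-tuned weights will be saved
)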

Now save the code from Steps 1 through 5 into a single file called models.py:

import os
import re
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def pre_process_data(text):
    """Perform necessary pre-processing steps on the text"""
    # Flatten newlines and strip punctuation from the raw text
    text = text.replace("\n", " ")
    text = re.sub(r'[^\w\s]', '', text)
    # Return the pre-processed text
    return text

def fine_tune_model(training_text, model_name='gpt2', output_dir='fine_tuned_model'):
    """Fine-tune the pre-trained GPT-2 model"""
    # Load the pre-trained GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    # Tokenize and encode the training text
    input_ids = tokenizer.encode(training_text, add_special_tokens=True, return_tensors='pt')
    # Fine-tune the model
    model.train()
    model.config.pad_token_id = tokenizer.eos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.vocab_size = model.config.vocab_size + len(tokenizer.get_added_vocab())
    model.resize_token_embeddings(len(tokenizer))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(1):
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

OUTPUT_DIRECTORY = 'data_chunks'

# Get the list of chunked files
chunk_files = [filename for filename in os.listdir(OUTPUT_DIRECTORY) if filename.endswith('.txt')]
# Process each chunked file
COUNTER = 0
for chunk_file in chunk_files:
    chunk_file_path = os.path.join(OUTPUT_DIRECTORY, chunk_file)
    COUNTER += 1
    # Read the chunked file
    with open(chunk_file_path, 'r', encoding='utf-8') as file:
        chunk_text = file.read()
    # Preprocess the chunked text
    preprocessed_text = pre_process_data(chunk_text)
    # Fine-tune the model, continuing from the previously saved checkpoint
    # after the first chunk so that training accumulates across chunks
    source_model = 'gpt2' if COUNTER == 1 else 'fine_tuned_model'
    fine_tune_model(preprocessed_text, model_name=source_model)
    print(COUNTER)

Before running the code above, we need to prepare our data by splitting it into chunks of 1,024 characters each. Preprocessing your text data is a crucial step before fine-tuning GPT-2 with Hugging Face. By splitting the data into chunks and applying appropriate preprocessing, you keep each training example comfortably within GPT-2's 1,024-token context window and make the training process efficient and predictable. The step-by-step guide below walks you through it.

Importing the necessary libraries:

import os

Setting up the input and output directories:

INPUT_FILE = 'got.txt'
OUTPUT_DIRECTORY = 'data_chunks'
CHUNK_SIZE = 1024

Creating the output directory if it doesn’t exist:

if not os.path.exists(OUTPUT_DIRECTORY):
    os.makedirs(OUTPUT_DIRECTORY)

Reading the input file:

with open(INPUT_FILE, 'r', encoding='utf-8') as file:
    text = file.read()

Calculating the number of chunks required:

num_chunks = len(text) // CHUNK_SIZE
if len(text) % CHUNK_SIZE != 0:
    num_chunks += 1

Splitting the text into chunks and saving them as individual files:

for i in range(num_chunks):
    start = i * CHUNK_SIZE
    end = (i + 1) * CHUNK_SIZE
    chunk = text[start:end]
    # Save the chunk to a file
    chunk_filename = os.path.join(OUTPUT_DIRECTORY, f'chunk_{i}.txt')
    with open(chunk_filename, 'w', encoding='utf-8') as file:
        file.write(chunk)

Printing the success message indicating the number of chunks created:

print(f'Successfully split the file into {num_chunks} chunks.')

Save this code in a file called prepare_data.py.

In summary, this code takes an input file (got.txt) and divides its content into smaller chunks of size 1024 characters. It then saves each chunk as an individual file in the data_chunks directory. This approach is useful when dealing with large text files that need to be processed in smaller parts or when working with limited computational resources.

Make sure to replace got.txt with the path to your own input file and adjust the CHUNK_SIZE and OUTPUT_DIRECTORY according to your requirements.

By executing this code, you will successfully split your input file into multiple chunks, each saved as a separate file in the designated output directory.
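
If you want to sanity-check the result, a minimal sketch like the one below (not part of prepare_data.py) counts the generated files and confirms that no chunk exceeds CHUNK_SIZE:

import os

OUTPUT_DIRECTORY = 'data_chunks'
CHUNK_SIZE = 1024

chunk_files = [f for f in os.listdir(OUTPUT_DIRECTORY) if f.endswith('.txt')]
print(f'Found {len(chunk_files)} chunk files.')

for name in chunk_files:
    with open(os.path.join(OUTPUT_DIRECTORY, name), 'r', encoding='utf-8') as file:
        length = len(file.read())
    # Every chunk should be at most CHUNK_SIZE characters long
    assert length <= CHUNK_SIZE, f'{name} has {length} characters'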

After that, you can run models.py; the fine-tuned model will be saved inside the fine_tuned_model folder.
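
As a quick optional check (assuming the training run finished without errors), you can confirm that the saved checkpoint loads back cleanly:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned weights and tokenizer from the output directory
model = GPT2LMHeadModel.from_pretrained('fine_tuned_model')
tokenizer = GPT2Tokenizer.from_pretrained('fine_tuned_model')
print(f'Loaded model with {model.num_parameters():,} parameters.')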

Next, we are going to create the model consumer, a file we'll call runner.py.

Importing the necessary libraries:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

Defining the generate function:

async def generate(prefix, max_length=800, top_k=5, model_dir='fine_tuned_model'):
    """Generate text based on the input prefix"""
    # Load the fine-tuned model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)

    # Encode the input prefix
    input_ids = tokenizer.encode(prefix, add_special_tokens=True, return_tensors='pt')

    # Generate text based on the prefix
    model.eval()
    model.config.pad_token_id = tokenizer.eos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.vocab_size = model.config.vocab_size + len(tokenizer.get_added_vocab())
    model.resize_token_embeddings(len(tokenizer))

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=1,
            top_k=top_k,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode and return the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

In summary, this code defines an asynchronous generate function that takes a prefix as input. It loads a fine-tuned GPT-2 model and tokenizer, encodes the input prefix, and generates text based on the prefix using the GPT-2 model. The generated text is then decoded and returned as the output.

To use this code, make sure you have the fine-tuned GPT-2 model saved in the fine_tuned_model directory. Adjust the max_length and top_k parameters according to your desired text generation requirements.

By calling the generate function with an appropriate prefix, you can generate text using the fine-tuned GPT-2 model.

Remember to handle the asynchronous nature of the function appropriately, following FastAPI's async/await conventions.
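
If you want to try the function outside FastAPI first, a minimal sketch like this one (assuming runner.py and the fine_tuned_model directory are in place) drives it with asyncio:

import asyncio

from runner import generate

async def main():
    # 'Winter is coming' is just an example prefix
    text = await generate('Winter is coming', max_length=100, top_k=5)
    print(text)

asyncio.run(main())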

Next, we are going to use the generate function in our API.

Create main.py

from fastapi import FastAPI
from pydantic import BaseModel
from runner import generate

app = FastAPI()

class GenerateRequest(BaseModel):
    prefix: str
    max_length: int = 800
    top_k: int = 5

@app.post("/generate")
async def generate_text(request: GenerateRequest):
    """
    Generate text based on the prefix asynchronously
    """
    generated_sentence = await generate(request.prefix, request.max_length, request.top_k)
    return {"generated_sentence": generated_sentence}

In summary, this code sets up a FastAPI application and creates an endpoint at /generate that accepts a POST request. The request body should include the prefix string and optional parameters such as max_length and top_k. The generate_text function asynchronously calls the generate function we explained earlier, passing the provided request parameters. The generated text is then returned as the API response.

To use this code, make sure you have FastAPI, Pydantic, and the generate function properly configured and imported. By sending a POST request to /generate with the required parameters, you can obtain the generated text as the API response.
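
For example, here is a minimal sketch of a client call using FastAPI's TestClient (assuming main.py is importable and the fine-tuned model exists; in production you would instead serve the app with an ASGI server such as uvicorn main:app):

from fastapi.testclient import TestClient

from main import app

client = TestClient(app)

# Send a POST request to the /generate endpoint with an example payload
response = client.post(
    "/generate",
    json={"prefix": "Winter is coming", "max_length": 100, "top_k": 5}
)
print(response.status_code)
print(response.json()["generated_sentence"])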

Feel free to customize the code, such as adding input validation or error handling, based on your specific requirements and use case.
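
As one possible refinement (a sketch, not part of the code above), Pydantic's Field constraints can keep the request parameters within sensible bounds:

from pydantic import BaseModel, Field

class GenerateRequest(BaseModel):
    prefix: str = Field(..., min_length=1)
    # GPT-2's context window is 1024 tokens, so cap max_length accordingly
    max_length: int = Field(800, ge=1, le=1024)
    top_k: int = Field(5, ge=1, le=100)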

I hope this explanation helps you understand the functionality of the code for creating an API endpoint to generate text!

All of the code in this post is available in my GitHub repository; please take a look at the following link.
