Building a Custom Language Model (LLM) for Chatbots: A Practical Guide

Gautam V.

Introduction

In today’s AI-driven world, generic chatbots are a dime a dozen. But what if you could create a chatbot that truly stands out, one that can dispense the timeless wisdom of Chanakya Neeti? This article is your blueprint for building a custom Language Model (LLM) tailored to this specific domain.

I’ll guide you through the entire process, from data collection to deployment. Whether you’re a seasoned developer or just starting your AI journey, this guide will equip you with the practical knowledge to craft a chatbot that resonates with your audience.

About me: I am Gautam, an AI engineer and architect with a passion for natural language processing. Through this article, my goal is to guide you in creating your own custom LLM that can provide insightful answers based on the wisdom of Chanakya.

“Blending ancient wisdom with AI development, where Python learns Chanakya’s sayings. It’s a witty convergence of timeless knowledge and cutting-edge tech. Join the laughter-filled journey as we unlock the secrets of both worlds!”

Understanding the Goal

Our focus will include:

  • Building the Custom LLM: Discover the fundamentals of creating a language model, covering data preparation, architectural design, and training techniques.
  • Enhancing with Custom Memory: Implement a memory system that allows your chatbot to retain and utilize past interactions, enriching conversations and providing a more personalized user experience (a minimal sketch of this idea appears at the end of this section).
  • Personalizing User Interactions: Explore strategies to tailor responses based on user preferences and behaviours, making your chatbot a valuable tool and a companion in thoughtful dialogue.

By the end of this journey, you’ll have a functional chatbot that can provide valuable insights and advice drawn from the wisdom of Chanakya. Let’s set the stage for a chatbot that doesn’t just answer questions but engages users in profound ways!
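
To make the custom-memory idea concrete before we begin, here is a minimal sketch of one common approach: keep a rolling window of past exchanges and fold them into each new prompt. The class name, window size, and prompt format below are illustrative choices, not part of any library.

class ConversationMemory:
    """Keeps the last few (user, bot) turns and folds them into each new prompt."""

    def __init__(self, max_turns=5):
        self.max_turns = max_turns  # how many past exchanges to remember
        self.turns = []

    def add_turn(self, user_message, bot_response):
        self.turns.append((user_message, bot_response))
        self.turns = self.turns[-self.max_turns:]  # keep only the most recent turns

    def build_prompt(self, new_message):
        history = "\n".join(f"User: {u}\nBot: {b}" for u, b in self.turns)
        prefix = history + "\n" if history else ""
        return f"{prefix}User: {new_message}\nBot:"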

Install Required Libraries

To build your custom LLM, you’ll need a few essential libraries. These tools will enable you to leverage the power of natural language processing effectively. Here’s what you’ll need to install:

  1. Transformers Library: Developed by Hugging Face, this library provides a high-level API for natural language processing tasks, complete with pre-trained models and tokenizers. Install it by running:

pip install transformers

  2. TensorFlow Library: A popular open-source machine learning framework that offers flexibility and efficiency for building and training models. Install TensorFlow using the command:

pip install tensorflow

Data Collection and Preprocessing: Preparing Chanakya Neeti for Training

To build our LLM steeped in Chanakya Neeti’s wisdom, we’ll gather relevant text data from PDFs and refine it for machine learning. This ensures we capture the teachings accurately.

Data Collection from PDFs

Here’s a Python function that retrieves text from a PDF using the pdfplumber library.

import pdfplumber
import requests

def fetch_text_from_pdf(pdf_link):
    try:
        # Download the PDF to a temporary file
        response = requests.get(pdf_link)
        response.raise_for_status()
        with open('temp.pdf', 'wb') as file:
            file.write(response.content)
        # Extract text page by page (extract_text() may return None for image-only pages)
        with pdfplumber.open('temp.pdf') as pdf:
            text = ''
            for page in pdf.pages:
                text += page.extract_text() or ''
        return text
    except Exception as e:
        print(f"Error fetching {pdf_link}: {e}")
        return ""

Text Preprocessing

After collecting our data, we need to preprocess it to transform raw text into a polished format for training. This includes removing numbers, punctuation, and stopwords to ensure the focus remains on the teachings of Chanakya Neeti.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    # Drop non-ASCII characters, lowercase, and strip digits, extra whitespace, and punctuation
    text = text.encode('ascii', 'ignore').decode('ascii')
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)

    # Remove English stopwords plus domain-specific boilerplate words
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    custom_stopwords = {'chanakya', 'neeti', 'chapter'}
    stop_words.update(custom_stopwords)

    tokens = [token for token in tokens if token not in stop_words]

    return " ".join(tokens)

# Test preprocessing on a sample text
sample_text = """
Chanakya Neeti, Chapter 1:
A person should not be too honest.
"""
processed_text = preprocess_text(sample_text)
print(processed_text)

Collecting and Preprocessing Data from Multiple Sources

Now, let’s automate the gathering and preprocessing of multiple PDF files, culminating in a polished corpus ready for model training.

corpus = []
books = [
    {"title": "Chanakya Neeti", "pdf_link": "<URL>"},
    {"title": "Chanakya Neeti Darpan", "pdf_link": "<URL>"},
]

for book in books:
    pdf_link = book["pdf_link"]
    text = fetch_text_from_pdf(pdf_link)
    if text:
        processed_text = preprocess_text(text)
        corpus.append(processed_text)

# Preview the first few processed samples
for i, sample in enumerate(corpus[:3], 1):
    print(f"Sample {i}: {sample[:500]}...\n")

By collecting and preprocessing text data from various PDFs, we’ve built a robust pipeline for generating a clean corpus. This disciplined approach ensures our data is normalized, tokenized, and free from extraneous noise, paving the way for effective model training.

Training the Custom LLM

Now that you have a clean, preprocessed dataset, it’s time to train your LLM using TensorFlow. Here’s a step-by-step guide:

1. Prepare Training Data

Ensure your preprocessed text data is formatted correctly for TensorFlow. Each piece of text should be tokenized and properly aligned with its target outputs.
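
To make “properly aligned with its target outputs” concrete, here is a tiny illustrative sketch (the sentence and variable names are hypothetical): for next-word prediction, each growing prefix of a text is paired with the word that follows it.

tokens = "a person should not be too honest".split()
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# pairs[0] -> (['a'], 'person')
# pairs[1] -> (['a', 'person'], 'should')

The training code in step 2 builds exactly these kinds of (context, next word) pairs across the whole corpus.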

2. Define LLM Architecture

We’ll use TensorFlow’s Keras API to build a recurrent neural network (RNN) architecture featuring an LSTM layer, ideal for sequence prediction tasks. This setup will allow the model to capture temporal dependencies in the input data. Here’s an example implementation:

import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np

# Assuming corpus is your list of preprocessed texts
tokenized = [text.split() for text in corpus]

# Convert words to integer indices (index 0 is reserved for padding)
word_index = {word: idx + 1 for idx, word in enumerate({w for text in tokenized for w in text})}
vocab_size = len(word_index) + 1  # plus one for the padding index

# Set hyperparameters
embedding_dim = 128
max_seq_length = 50
lstm_units = 256

# Build (context, next word) training pairs from each text
X, y = [], []
for text in tokenized:
    ids = [word_index[w] for w in text]
    for i in range(1, len(ids)):
        X.append(ids[max(0, i - max_seq_length):i])  # context window
        y.append(ids[i])                              # next word to predict

# Pad sequences to ensure uniform input size
X = pad_sequences(X, maxlen=max_seq_length)
y = np.array(y)

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model: embedding -> LSTM -> softmax over the vocabulary
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_seq_length),
    LSTM(lstm_units),
    Dense(vocab_size, activation='softmax')
])

# Compile and train the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

# Save the trained model
model.save('custom_llm_model.h5')

3. Customize and Experiment

Feel free to modify the provided code to explore different hyperparameters — such as embedding dimensions and LSTM units — as well as model architectures, like adding additional layers or adjusting the activation functions. Additionally, you can tweak the training settings, including the number of epochs and batch sizes. Just remember to adjust X_train, y_train, X_val, and y_val accordingly.
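
As one example of such an experiment, here is a sketch of a deeper variant that reuses the vocab_size and max_seq_length defined above (the layer sizes and dropout rate are illustrative, not tuned):

from tensorflow.keras.layers import Dropout

# A deeper variant: the first LSTM must return full sequences for the second to consume
deeper_model = Sequential([
    Embedding(vocab_size, 256, input_length=max_seq_length),
    LSTM(256, return_sequences=True),
    Dropout(0.2),  # regularization between the recurrent layers
    LSTM(128),
    Dense(vocab_size, activation='softmax')
])
deeper_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])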

Evaluation and Testing

After training our custom LLM, it’s essential to evaluate its performance to ensure it generates accurate and coherent responses. This section outlines the key metrics and methods for effective assessment.

Evaluation Metrics

  • Perplexity: This metric measures how well the model’s predicted probability distribution matches the actual distribution of words. A lower perplexity indicates better performance (a short sketch of the computation follows this list).
  • BLEU Score: The BLEU (Bilingual Evaluation Understudy) score evaluates the quality of generated text by comparing it to reference texts. Higher scores suggest closer alignment to the reference output. Note that BLEU has limitations and may not capture the full nuance of human-generated responses.
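
As a quick illustration of the perplexity computation (a standalone sketch with made-up numbers, independent of any model): perplexity is the exponential of the average negative log-probability the model assigned to the true next tokens.

import numpy as np

# Hypothetical probabilities the model assigned to each true next token
token_probs = np.array([0.2, 0.05, 0.4, 0.1])
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)  # about 7.07; lower means the model was less "surprised"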

Testing the Model

  1. Generate Text Samples: Use the trained custom LLM to generate text based on a given input prompt.
  2. Manual Review: Assess generated responses for accuracy, coherence, style, and alignment with Chanakya Neeti’s wisdom. While BLEU and perplexity provide quantitative insights, manual review is vital for capturing subtleties.
  3. Iterate and Fine-Tune: Based on evaluation results, refine the model to enhance performance.
  4. Additional Testing Strategies: Consider user studies or A/B testing to evaluate user satisfaction and the chatbot’s effectiveness in practical contexts.

The script below ties these steps together: it generates samples from the fine-tuned model, scores them with BLEU against a reference, and computes perplexity on the reference text.

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
from nltk.translate.bleu_score import sentence_bleu
import numpy as np
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)

# Function to evaluate the model
def evaluate_model(input_text, reference_text, max_length=50, num_return_sequences=5):
    try:
        # Load the fine-tuned LLM and its tokenizer
        model = TFGPT2LMHeadModel.from_pretrained("./fine_tuned_llm")
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

        # Generate text samples (sampling is required to get multiple distinct sequences)
        input_ids = tokenizer.encode(input_text, return_tensors="tf")
        output_ids = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            do_sample=True,
        )

        # Decode generated text
        generated_text = [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
        logging.info("Generated Text: %s", generated_text)

        # Calculate a BLEU score for each generated sample against the reference
        bleu_scores = [sentence_bleu([reference_text[0].split()], gen.split()) for gen in generated_text]

        # Calculate perplexity on the reference text (labels are shifted one step ahead of the logits)
        ref_ids = tokenizer.encode(reference_text[0], return_tensors="tf")
        logits = model(ref_ids).logits
        cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        perplexity = np.exp(cross_entropy(ref_ids[:, 1:], logits[:, :-1, :]).numpy())

        return generated_text, bleu_scores, perplexity

    except Exception as e:
        logging.error("An error occurred during evaluation: %s", e)
        return None, None, None

# Example usage
input_text = "What is the meaning of life?"
reference_text = ["The meaning of life is to be happy."]
generated_text, bleu_scores, perplexity = evaluate_model(input_text, reference_text)

if generated_text is not None:
    print("Generated Text:")
    for text in generated_text:
        print(text)
    print("BLEU Scores:", bleu_scores)
    print("Perplexity:", perplexity)

By following this evaluation process and integrating user feedback, you can refine your custom LLM to better resonate with the teachings of Chanakya.

Fine-tuning and Optimization

To fine-tune our custom Large Language Model (LLM), we load a pre-trained model like GPT-2 and unfreeze the last few layers for training. This process allows the model to adapt while retaining its foundational knowledge.

Steps for Fine-tuning

  1. Load the Pre-trained Model: Use the transformers library to load GPT-2 as the base model.
  2. Unfreeze Layers: Unfreeze the last few layers to enable fine-tuning on the custom dataset.
  3. Define the Optimizer: Choose the Adam optimizer with a learning rate of 1e-5. Note that this may need adjustment based on experimentation and dataset characteristics.
  4. Compile the Model: Compile with sparse categorical cross-entropy as the loss function.
  5. Train the Model: Train the model on the prepared datasets.
  6. Save the Fine-tuned Model: Save the model for future use.

Here is the corresponding sketch (it assumes train_dataset and val_dataset were prepared earlier as batched datasets of tokenized inputs and labels):

import tensorflow as tf
from transformers import TFGPT2LMHeadModel

# Load the pre-trained GPT-2 model
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Freeze the transformer, then unfreeze the last few blocks for fine-tuning
model.transformer.trainable = False
for block in model.transformer.h[-6:]:
    block.trainable = True

# Define the optimizer and learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # learning rate may require tuning

# Compile the model
model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Train the model on the custom dataset
model.fit(train_dataset, epochs=10, validation_data=val_dataset)

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_llm")

While Adam is commonly used, consider exploring alternative optimizers like SGD with momentum based on the specific task and dataset. By following these steps, we ensure our custom LLM is optimized to generate responses aligned with the insights of Chanakya Neeti.
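
Swapping optimizers is a one-line change. For example (the learning rate and momentum values here are illustrative):

# Alternative optimizer (sketch): SGD with momentum instead of Adam
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)
model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))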

Deployment and User Interface

This section will explore methods for deploying our fine-tuned LLM and creating a user interface to interact with it. We’ll utilize Next.js, TypeScript, and Google Material UI for the front end, and Python with Flask for the back end.

Set Up the Backend Server

# Install Flask and flask-cors (the frontend dev server runs on a different port, so we enable CORS)
pip install flask flask-cors

Backend (Python and Flask) — app.py:

from flask import Flask, request, jsonify
from flask_cors import CORS
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

app = Flask(__name__)
CORS(app)  # allow requests from the Next.js dev server on port 3000

# Load the fine-tuned model saved earlier, with the base GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("./fine_tuned_llm")

@app.route("/chatbot", methods=["POST"])
def chatbot():
    message = request.json["message"]
    input_ids = tokenizer.encode(message, return_tensors="tf")
    output_ids = model.generate(input_ids, max_length=100, num_return_sequences=1)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return jsonify({"response": response})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Frontend (Next.js, Material UI, and TypeScript) — pages/index.tsx:

import { useState, ChangeEvent } from "react";
import axios from "axios";
import { Box, TextField, Button } from "@mui/material";

interface ChatMessage {
  content: string;
  sender: "user" | "bot";
}

const Chatbot = () => {
  const [inputMessage, setInputMessage] = useState("");
  const [chatMessages, setChatMessages] = useState<ChatMessage[]>([]);

  const handleInputChange = (event: ChangeEvent<HTMLInputElement | HTMLTextAreaElement>) => {
    setInputMessage(event.target.value);
  };

  const sendMessage = async () => {
    if (inputMessage.trim() !== "") {
      setChatMessages((prevMessages) => [
        ...prevMessages,
        { content: inputMessage, sender: "user" },
      ]);
      setInputMessage("");

      try {
        // The Flask backend runs on port 5000 (CORS is enabled there)
        const response = await axios.post("http://localhost:5000/chatbot", {
          message: inputMessage,
        });
        const botResponse = response.data.response;

        setChatMessages((prevMessages) => [
          ...prevMessages,
          { content: botResponse, sender: "bot" },
        ]);
      } catch (error) {
        console.error("Error:", error);
      }
    }
  };

  return (
    <Box>
      <Box>
        {chatMessages.map((message, index) => (
          <Box
            key={index}
            display="flex"
            justifyContent={message.sender === "user" ? "flex-end" : "flex-start"}
          >
            <Box
              bgcolor={message.sender === "user" ? "#f0f0f0" : "#d3d3d3"}
              borderRadius="10px"
              p={1}
              m={1}
              maxWidth="70%"
            >
              {message.content}
            </Box>
          </Box>
        ))}
      </Box>
      <Box display="flex" alignItems="center">
        <TextField
          variant="outlined"
          label="Type your message"
          value={inputMessage}
          onChange={handleInputChange}
        />
        <Button variant="contained" color="primary" onClick={sendMessage}>
          Send
        </Button>
      </Box>
    </Box>
  );
};

export default Chatbot;

To run the application:

  1. Save the backend code in app.py and the frontend code in pages/index.tsx.
  2. Install the required npm dependencies: npm install next react react-dom @mui/material @emotion/react @emotion/styled axios.
  3. Start the Flask backend server by running python app.py in the terminal.
  4. Start the Next.js development server by running npm run dev in another terminal.
  5. Access the application by visiting http://localhost:3000 in your browser.

With this code, you’ll have a working application whose UI lets you enter a message, generate text with the fine-tuned LLM, and view the response.

Sample Conversation Between User and Our Model

User: Can you share a valuable lesson from Chanakya Neeti?
LLM: “A person should not be too honest. Honest people are often taken advantage of.”

User: What advice can you provide on building solid relationships?
LLM: “Never share your secrets with anybody. It can lead to your downfall.”

User: Can you share a quote on the importance of Education?
LLM: “Education is the best friend. It brings respect and success.”

User: Can you offer guidance on making wise decisions?
LLM: “Learn from the mistakes of others to avoid making them yourself.”

User: Can you share a lesson on leadership?
LLM: “Appear strong and confident, even if you don’t possess absolute power.”

User: What advice do you have on managing wealth?
LLM: “Be cautious and smart in financial matters to protect your wealth.”

User: Thank you for these insights from Chanakya Neeti.
LLM: You’re welcome! Feel free to ask if you have more questions.

Possible Next Steps

As AI technology evolves, your chatbot application can explore exciting new developments. Consider these next steps:

  1. Advanced NLP Techniques: Incorporate sentiment analysis, named entity recognition, and language translation to deepen understanding and response accuracy (see the short sketch after this list).
  2. Voice Interaction: Implement voice recognition and synthesis for accessible, user-friendly voice-based interactions.
  3. Multi-Platform Deployment: Expand reach by deploying on mobile devices, smart speakers, and messaging apps.
  4. Machine Learning Enhancements: Continuously train and fine-tune your chatbot using user feedback and real-time data to boost performance and adaptability.
  5. User Personalization: Create user profiles and personalized recommendations for a tailored chatbot experience.
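
As a taste of the first item, the transformers pipeline API makes it straightforward to bolt sentiment analysis onto incoming messages (this sketch uses the library’s default sentiment model; how you act on the result is up to your application):

from transformers import pipeline

# Downloads a default sentiment-analysis model on first use
sentiment_analyzer = pipeline("sentiment-analysis")

result = sentiment_analyzer("I found this advice incredibly helpful!")[0]
print(result["label"], result["score"])  # e.g. POSITIVE with a high confidence score

# A chatbot could use the label to adapt its tone before responding.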

Ethical Considerations

As you develop your chatbot, it’s crucial to prioritize ethical considerations, particularly regarding user data and response generation:

  • Bias Mitigation: Continuously monitor and address potential biases in your chatbot’s responses to ensure fairness and inclusivity.
  • Data Privacy: Prioritize data protection and comply with relevant regulations to maintain user trust.

By pursuing these steps and addressing ethical concerns, you’ll stay ahead in AI advancements, crafting a chatbot that offers valuable assistance and a seamless user experience.

Important Note: Always consider legal and copyright aspects when using external materials. Ensure you have the necessary permissions, comply with data protection laws, and provide proper attribution.

Happy coding!
