Building an Interactive Streaming Chatbot with LangChain, Transformers, and Gradio
Introduction:
In this article, we will focus on creating a simple streaming chatbot using LangChain, Transformers, and Gradio. We’ll break down the process into four steps:
- Loading a large language model
- Implementing a LangChain integration
- Adding a Gradio interface and testing without streaming
- Adding the streaming capability
Loading a large language model:
Transformers — provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
To load the BLOOM model (bigscience/bloom-1b7) from the Hugging Face Hub using the Transformers AutoModelForCausalLM class, we can write a reusable function:
from transformers import AutoModelForCausalLM, AutoTokenizer

def initialize_model_and_tokenizer(model_name="bigscience/bloom-1b7"):
    # Download (or load from the local cache) the model weights and the matching tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
We can use this method to load the model and tokenizer:
model, tokenizer = initialize_model_and_tokenizer()
Implementing a LangChain integration:
langchain — LangChain is a framework for developing applications powered by language models
To create a LangChain LLM (Large Language Model), we can subclass LangChain’s LLM base class and implement its _call method. Here the subclass is named CustomLLM:
from langchain.llms.base import LLM

class CustomLLM(LLM):
    def _call(self, prompt, stop=None, run_manager=None) -> str:
        # Tokenize the prompt and generate up to 20 new tokens
        inputs = tokenizer(prompt, return_tensors="pt")
        result = model.generate(input_ids=inputs.input_ids, max_new_tokens=20)
        # Decode the full output sequence (prompt plus completion) back to text
        result = tokenizer.decode(result[0])
        return result

    @property
    def _llm_type(self) -> str:
        return "custom"

llm = CustomLLM()
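The _call method above receives a stop argument but does not use it, and decoding result[0] typically returns the prompt followed by the completion. One possible way to post-process the decoded text is sketched below. The helper name clean_completion is hypothetical (it is not part of LangChain or Transformers), and the sketch assumes the decoded text begins with the prompt verbatim, which may not hold for every tokenizer.

```python
from typing import List, Optional


def clean_completion(decoded: str, prompt: str, stop: Optional[List[str]] = None) -> str:
    """Hypothetical helper: drop the echoed prompt and cut at the first stop sequence."""
    # Strip the prompt only if it is echoed back at the start of the decoded text.
    text = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Truncate at the earliest occurrence of any stop sequence.
    cut = len(text)
    for s in stop or []:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


prompt_text = "Question: What is 2 + 2?\nAnswer:"
decoded = "Question: What is 2 + 2?\nAnswer: 4\nQuestion: What is 3 + 3?"
print(clean_completion(decoded, prompt_text, stop=["\nQuestion:"]))
```

Inside _call, such a helper could be applied to the decoded string before returning it, so that the chain only sees the newly generated answer.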
To create an LLM chain, we need the LLM and the prompt template. Let’s set a simple QA prompt and create a prompt template object:
from langchain import PromptTemplate
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
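To see what the model will actually receive, the template can be rendered by hand. The sketch below applies Python's built-in str.format to the same template string, which is assumed here to match what PromptTemplate produces for a simple single-variable template like this one. The sample question is made up for illustration.

```python
template = """Question: {question}
Answer: Let's think step by step."""

# Substitute the {question} placeholder with a concrete value.
rendered = template.format(question="What is the capital of France?")
print(rendered)
```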
Now, we can create an llm_chain object with the LLM and prompt template:
from langchain import LLMChain
llm_chain = LLMChain(prompt=prompt, llm=llm)
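Conceptually, a chain of this kind fills the prompt template and passes the result to the LLM. The sketch below is a simplified model of that flow, not LangChain's implementation; run_chain and echo_llm are hypothetical names, and echo_llm stands in for a real model so the example runs without downloading anything.

```python
from typing import Callable


def run_chain(llm: Callable[[str], str], template: str, **variables) -> str:
    # Step 1: fill the template. Step 2: pass the resulting prompt to the model.
    prompt_text = template.format(**variables)
    return llm(prompt_text)


def echo_llm(prompt_text: str) -> str:
    # Hypothetical stand-in for a real model: echoes back the prompt it was given.
    return "[model saw] " + prompt_text


qa_template = "Question: {question}\nAnswer: Let's think step by step."
print(run_chain(echo_llm, qa_template, question="Why is the sky blue?"))
```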
Adding a Gradio interface:
gradio — Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!
Let's create a simple Gradio chatbot interface with a chat window, a text box, and a Clear button. Users type a query in the text box and press Enter to submit it. On submit, we'll invoke the llm_chain and render the output in the UI:
import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    # llm_chain is the LLMChain object created in the previous step

    def user(user_message, history):
        # Clear the text box and append the new message with a pending bot reply
        return "", history + [[user_message, None]]

    def bot(history):
        print("Question: ", history[-1][0])
        bot_message = llm_chain.run(question=history[-1][0])
        print("Response: ", bot_message)
        history[-1][1] = bot_message
        return history

    # First record the user message, then let the bot fill in its reply
    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()
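The two callbacks can be exercised without launching Gradio, which makes the hand-off between them easier to see. In this sketch, fake_chain_run is a hypothetical stand-in for llm_chain.run, so no model is loaded. The history format (a list of [user, bot] pairs) follows the code above.

```python
def fake_chain_run(question: str) -> str:
    # Hypothetical stand-in for llm_chain.run(question=...); returns a canned reply.
    return f"You asked: {question}"


def user(user_message, history):
    # Returns the new text box value ("") and the history with a pending bot reply.
    return "", history + [[user_message, None]]


def bot(history):
    # Fill in the pending reply for the most recent user message.
    history[-1][1] = fake_chain_run(history[-1][0])
    return history


textbox_value, pending_history = user("Hello", [])
print(pending_history)  # the bot slot of the last pair is still None here
final_history = bot([list(pair) for pair in pending_history])  # work on a copy
print(final_history)
```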
Adding the streaming capability:
To keep the user engaged and show the response as it is generated, let’s add streaming support. We’ll use the TextIteratorStreamer from the Transformers library to achieve this:
TextIteratorStreamer — Streamer that stores print-ready text in a queue, to be used by a downstream application as an iterator.
from threading import Thread
from typing import Optional
from transformers import TextIteratorStreamer

class CustomLLM(LLM):
    streamer: Optional[TextIteratorStreamer] = None

    def _call(self, prompt, stop=None, run_manager=None) -> str:
        # Create a fresh streamer per call; timeout=5 bounds the wait for each new chunk
        self.streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, timeout=5)
        inputs = tokenizer(prompt, return_tensors="pt")
        kwargs = dict(input_ids=inputs["input_ids"], streamer=self.streamer, max_new_tokens=20)
        # Run generation in a background thread so this call returns immediately
        thread = Thread(target=model.generate, kwargs=kwargs)
        thread.start()
        # The generated text is read from self.streamer, not from the return value
        return ""

    @property
    def _llm_type(self) -> str:
        return "custom"
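The pattern behind this class is a producer thread feeding a queue that the caller drains as an iterator. The sketch below reproduces that idea with only the standard library so it can be run without a model. QueueStreamer and fake_generate are hypothetical names for illustration; this is not the Transformers implementation.

```python
import queue
import threading


class QueueStreamer:
    """Hypothetical stand-in for a text streamer: a queue exposed as an iterator."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self, timeout=None):
        self._queue = queue.Queue()
        self._timeout = timeout

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until the producer delivers the next chunk; raises queue.Empty on timeout.
        item = self._queue.get(timeout=self._timeout)
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(streamer, chunks):
    # Stand-in for model.generate(..., streamer=streamer): push chunks, then signal the end.
    for chunk in chunks:
        streamer.put(chunk)
    streamer.end()


streamer = QueueStreamer(timeout=5)
thread = threading.Thread(target=fake_generate, args=(streamer, ["Hello", ", ", "world"]))
thread.start()

received = [chunk for chunk in streamer]  # consumed in the caller's thread
thread.join()
print("".join(received))
```

The timeout plays the same role as in the class above: if the producer stalls, the consumer raises an exception instead of waiting forever.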
Next, let’s update the UI to read from the stream and update the interface in real time:
import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    # Rebuild the LLM and the chain so they use the streaming CustomLLM defined above
    llm = CustomLLM()
    llm_chain = LLMChain(prompt=prompt, llm=llm)

    def user(user_message, history):
        return "", history + [[user_message, None]]

    def bot(history):
        print("Question: ", history[-1][0])
        # Starts generation in the background; the text arrives through llm.streamer
        llm_chain.run(question=history[-1][0])
        history[-1][1] = ""
        for new_text in llm.streamer:
            print(new_text)
            history[-1][1] += new_text
            # Each yield sends the partial reply to the chat window
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()
Demo:
As soon as the model generates a new chunk of text, it appears in the UI.
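Because bot is a generator, each yield hands Gradio an updated history to render. The sketch below shows the same accumulation without Gradio or a model; stream_reply and the list of chunks are hypothetical stand-ins for bot and llm.streamer.

```python
def stream_reply(history, chunks):
    # Start with an empty reply and grow it one chunk at a time.
    history[-1][1] = ""
    for new_text in chunks:
        history[-1][1] += new_text
        yield history


history = [["Hi", None]]
# Record the partial reply that would be visible after each yield.
snapshots = [h[-1][1] for h in stream_reply(history, ["Hel", "lo", "!"])]
print(snapshots)
```

Note that the same history list is mutated and yielded each time, so the sketch records the reply string at every step rather than keeping references to the list itself.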
Conclusion:
By following these steps, we have built a streaming chatbot using LangChain, Transformers, and Gradio. The chatbot displays its responses as they are generated, making the conversation more engaging and interactive.
References:
- Custom LLM — https://python.langchain.com/docs/modules/model_io/models/llms/how_to/custom_llm
- Gradio streaming chatbot — https://www.gradio.app/guides/creating-a-chatbot#add-streaming-to-your-chatbot
- TextIteratorStreamer — https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextIteratorStreamer
Demo Scripts:
- QA with streaming — https://colab.research.google.com/drive/1MWtLsTQXOKJgm86zI5sBOgbKeY_WWzCq?usp=sharing
- QA without streaming — https://colab.research.google.com/drive/1Lv3YyFn4JHTa2FgafnVPm3Cjw2Tg4sr2?usp=sharing