Building an Interactive Streaming Chatbot with LangChain, Transformers, and Gradio

Introduction:

Shrinath Suresh · Jul 12, 2023

In this article, we will focus on creating a simple streaming chatbot using LangChain, Transformers, and Gradio. We’ll break down the process into four steps:

  1. Loading a large language model
  2. Implementing a Langchain integration
  3. Adding a Gradio interface and testing without streaming
  4. Adding the streaming capability

Loading a large language model:

Transformers — a library that provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

To load the BLOOM model from the Hugging Face Hub using the Transformers AutoModelForCausalLM class, we can write a reusable method:

from transformers import AutoModelForCausalLM, AutoTokenizer

def initialize_model_and_tokenizer(model_name="bigscience/bloom-1b7"):
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

We can use this method to load the model and tokenizer:

model, tokenizer = initialize_model_and_tokenizer()
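
Before wrapping anything in LangChain, it’s worth confirming that the model actually generates text. A minimal check (the prompt here is just an arbitrary example):

inputs = tokenizer("The capital of France is", return_tensors="pt")
# Generate a short continuation and decode it back to text.
outputs = model.generate(input_ids=inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))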

Implementing a LangChain integration

LangChain — a framework for developing applications powered by language models.

To create a LangChain LLM wrapper for our model, we can subclass LangChain's LLM base class:

from langchain.llms.base import LLM

class CustomLLM(LLM):
    def _call(self, prompt, stop=None, run_manager=None) -> str:
        inputs = tokenizer(prompt, return_tensors="pt")
        result = model.generate(input_ids=inputs.input_ids, max_new_tokens=20)
        return tokenizer.decode(result[0])

    @property
    def _llm_type(self) -> str:
        return "custom"

llm = CustomLLM()
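
We can quickly test the wrapper by calling it with a raw prompt string, which invokes _call under the hood (the question is just a placeholder):

# Calling the LLM object directly exercises _call.
print(llm("What is the capital of France?"))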

To create an LLM chain, we need the LLM and the prompt template. Let’s set a simple QA prompt and create a prompt template object:

from langchain import PromptTemplate

template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
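
To see the exact text the chain will send to the model, we can render the template with a sample question:

print(prompt.format(question="What is the capital of France?"))
# Question: What is the capital of France?
# Answer: Let's think step by step.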

Now, we can create an llm_chain object with the LLM and prompt template:

from langchain import LLMChain

llm_chain = LLMChain(prompt=prompt, llm=llm)
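
The Gradio snippets below call an init_chain helper that bundles these steps. It isn't defined in the article, so here is a minimal sketch of what it might look like; note that CustomLLM reads model and tokenizer from the enclosing scope:

def init_chain(model, tokenizer):
    # Hypothetical helper: build the custom LLM wrapper and its chain.
    # The arguments are accepted for readability, but CustomLLM picks up
    # model and tokenizer as globals inside _call.
    llm = CustomLLM()
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    return llm_chain, llm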

Adding a Gradio interface

Gradio — the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!

Let's create a simple Gradio chatbot interface with a text box for queries and a clear button. Users can type a question into the text box and press Enter to submit it; on submit, we'll invoke the llm_chain and render the output in the chat window:

import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")
    llm_chain, llm = init_chain(model, tokenizer)

    def user(user_message, history):
        # Append the user's message; the bot's reply slot is filled in later.
        return "", history + [[user_message, None]]

    def bot(history):
        print("Question: ", history[-1][0])
        bot_message = llm_chain.run(question=history[-1][0])
        print("Response: ", bot_message)
        history[-1][1] = bot_message
        return history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()

Demo

Adding the streaming capability:

To keep the user engaged and show immediate feedback, let’s add streaming support. We’ll use the TextIteratorStreamer class from the Transformers library to achieve this:

TextIteratorStreamer — Streamer that stores print-ready text in a queue, to be used by a downstream application as an iterator.
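
The pattern is to run generate in a background thread with a streamer attached, then iterate over the streamer on the main thread as decoded chunks arrive. A minimal standalone sketch (the prompt is again just an example):

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer("The capital of France is", return_tensors="pt")
# generate() pushes decoded text into the streamer as tokens are produced.
thread = Thread(target=model.generate, kwargs=dict(input_ids=inputs["input_ids"], streamer=streamer, max_new_tokens=20))
thread.start()
for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()

With that pattern in mind, we can rework CustomLLM so that _call starts generation in a background thread and exposes the streamer: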

from threading import Thread
from typing import Optional
from transformers import TextIteratorStreamer

class CustomLLM(LLM):
    streamer: Optional[TextIteratorStreamer] = None

    def _call(self, prompt, stop=None, run_manager=None) -> str:
        # Create a fresh streamer for each call; the caller iterates over it.
        self.streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, timeout=5)
        inputs = tokenizer(prompt, return_tensors="pt")
        kwargs = dict(input_ids=inputs["input_ids"], streamer=self.streamer, max_new_tokens=20)
        # Run generation in the background so _call can return immediately.
        thread = Thread(target=model.generate, kwargs=kwargs)
        thread.start()
        return ""

    @property
    def _llm_type(self) -> str:
        return "custom"

Next, let’s update the UI to read from the stream and update the interface in real-time:

import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")
    llm_chain, llm = init_chain(model, tokenizer)

    def user(user_message, history):
        return "", history + [[user_message, None]]

    def bot(history):
        print("Question: ", history[-1][0])
        llm_chain.run(question=history[-1][0])
        history[-1][1] = ""
        # Append each streamed chunk to the last reply and yield the
        # updated history so Gradio refreshes the UI incrementally.
        for chunk in llm.streamer:
            print(chunk)
            history[-1][1] += chunk
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue()
demo.launch()

Demo:

As soon as the model generates a word, it appears in the UI.

Conclusion:

By following these steps, we have built a streaming chatbot using LangChain, Transformers, and Gradio. The chatbot streams its responses to user queries in real time, making the conversation more engaging and interactive.

References:

  1. Custom LLM — https://python.langchain.com/docs/modules/model_io/models/llms/how_to/custom_llm
  2. Gradio streaming chatbot — https://www.gradio.app/guides/creating-a-chatbot#add-streaming-to-your-chatbot
  3. TextIteratorStreamer — https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextIteratorStreamer

Demo Scripts:

  1. QA with streaming — https://colab.research.google.com/drive/1MWtLsTQXOKJgm86zI5sBOgbKeY_WWzCq?usp=sharing
  2. QA without streaming — https://colab.research.google.com/drive/1Lv3YyFn4JHTa2FgafnVPm3Cjw2Tg4sr2?usp=sharing
