How to stream data output from Large Language Models (LLMs): A Comprehensive Guide

Sivateja
2 min read · Aug 8, 2023


AI is advancing at a whirlwind pace. In the span of just four months, the conversation shifted from worries about another AI winter to concerns about AI becoming an influential force in every facet of our lives.

Every day brings a new AI application that pushes the limits of what's achievable. While we were still coming to terms with ChatGPT, LangChain arrived and took automation to new heights.

Large Language Models (LLMs), especially when deployed on-premises or run over custom datasets, can take a long time to produce a response; tasks such as QA retrieval are particularly slow. But what if we could make them feel fast to the user? The answer is real-time data streaming: surface the output at the front end as it is generated, instead of making the user stare at a blank screen.

In this blog you will find a code snippet that shows how to use LangChain's StreamingStdOutCallbackHandler to stream output from OpenAI's LLM in real time.

Some LLMs offer a streaming response. With this approach there is no need to wait for the complete response before you start processing; you can begin as soon as the first tokens are available. This is useful when you want to show the response to the user as it is being generated, or when you need to process the response concurrently with its generation.

from langchain.llms import OpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# streaming=True makes the OpenAI wrapper emit tokens as they are generated;
# StreamingStdOutCallbackHandler prints each token to stdout as it arrives.
# The OPENAI_API_KEY environment variable must be set.
llm = OpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
resp = llm("How do I make Pancakes?")

The above code calls OpenAI's GPT-3.5 model and streams its output through the StreamingStdOutCallbackHandler callback handler provided by LangChain.
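
If you need to do more than print to the console, for example pushing each token to a frontend, you can write your own callback handler. Below is a minimal sketch based on LangChain's BaseCallbackHandler from the same library version as above; the class name MyTokenHandler and the print logic are illustrative placeholders, not part of the original post.

from langchain.llms import OpenAI
from langchain.callbacks.base import BaseCallbackHandler

# Illustrative custom handler: on_llm_new_token is invoked by LangChain
# for every token the LLM streams back.
class MyTokenHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Replace this with your own logic, e.g. send the token to the UI.
        print(token, end="", flush=True)

llm = OpenAI(streaming=True, callbacks=[MyTokenHandler()], temperature=0)
resp = llm("How do I make Pancakes?")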

Right now, LangChain supports streaming for a broad range of LLM implementations, including but not limited to OpenAI, ChatOpenAI, ChatAnthropic, Hugging Face Text Generation Inference, and Replicate.
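
For example, a chat model can be streamed in the same way. The sketch below assumes ChatOpenAI from the same LangChain version used above; the prompt is illustrative.

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# The chat model takes a list of messages; tokens are streamed to stdout
# through the same callback handler used earlier.
chat = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
resp = chat([HumanMessage(content="How do I make Pancakes?")])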

The LLM completes the user's request and streams the response back.

Summary

The field of generative AI is still in its early stages and continuously evolving. This post covered the essentials of streaming LLM output and the LangChain components that make it work.

The next post will show how to connect vector databases such as ChromaDB and Pinecone to OpenAI and retrieve data from your custom datasets.

All of the above is based on the LangChain and OpenAI documentation. You can find more here.
