Streaming and longer context lengths for LLMs on Workers AI

Cloudflare
2 min read · Nov 14, 2023

Workers AI is our serverless GPU-powered inference platform running on top of Cloudflare’s global network. It provides a growing catalog of off-the-shelf models that run seamlessly with Workers and enable developers to build powerful and scalable AI applications in minutes. We’ve already seen developers doing amazing things with Workers AI, and we can’t wait to see what they do as we continue to expand the platform. To that end, today we’re excited to announce some of our most-requested new features: streaming responses for all Large Language Models (LLMs) on Workers AI, larger context and sequence windows, and a full-precision Llama-2 model variant.

If you’ve used ChatGPT before, then you’re familiar with the benefits of response streaming, where the response flows in, token by token. LLMs work internally by generating responses sequentially through a process of repeated inference: the full output of an LLM is essentially a sequence of hundreds or thousands of individual prediction tasks. For this reason, while it only takes a few milliseconds to generate a single token, generating the full response takes longer, on the order of seconds. The good news is that we can start displaying the response as soon as the first tokens are generated, and append each additional token until the response is complete. This yields a much better experience for the end user: displaying text incrementally as it’s generated not only provides instant responsiveness, but also gives the end user time to read and interpret the text.
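On the client side, "append each additional token" boils down to reading the response stream and updating the page as chunks arrive. The sketch below is illustrative rather than taken from the post: it assumes a hypothetical `/api/chat` endpoint that proxies a streaming LLM response as server-sent events with JSON payloads shaped like `{"response": "<token>"}`.

```ts
// Illustrative browser-side sketch: read a token stream and append text as it arrives.
// The /api/chat endpoint and the {"response": "<token>"} payload shape are assumptions.
async function streamCompletion(prompt: string, onToken: (t: string) => void) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Each SSE event is a "data: ..." line; keep any partial line for the next chunk.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      const data = line.replace(/^data:\s*/, "").trim();
      if (!data || data === "[DONE]") continue; // skip blanks and the end-of-stream marker
      onToken(JSON.parse(data).response ?? "");
    }
  }
}

// Usage: append each token to the page as soon as it arrives, e.g.
// streamCompletion("Tell me about Workers AI", (t) => { outputElement.textContent += t; });
```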

As of today, you can use response streaming for any LLM in our catalog, including the very popular Llama-2 model. Here’s how it works…
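As a rough sketch of the Worker side, assuming a Workers AI binding named `AI` is configured in `wrangler.toml`: passing `stream: true` to the model returns a `ReadableStream` of server-sent events that can be handed straight back to the client. The model name and prompt here are illustrative.

```ts
// Minimal Worker sketch: request a streamed LLM response and forward it to the client.
export interface Env {
  AI: any; // Workers AI binding configured in wrangler.toml (illustrative typing)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      prompt: "Tell me about Workers AI",
      stream: true, // ask for a streamed response instead of a buffered one
    });

    // Return the event stream directly so tokens reach the browser as they are generated.
    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```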

Read the complete post on our blog.

Originally published at https://blog.cloudflare.com on November 14, 2023.
