AI in its Native Habitat: WebLLM, WebSD, and MLC-LLM

Team Octo · OctoAI · May 8, 2023

The future is hybrid edge-cloud AI

As of this week, if you are one of the 2.6 billion active Chrome users and are up to date (version 113), you can now run the latest generative AI models, Web Stable Diffusion and WebLLM, with nothing more than your laptop and web browser.
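Before trying to load a model locally, a web page can feature-detect WebGPU, the new browser standard these demos rely on (shipped by default in Chrome 113). The snippet below is a minimal sketch, not taken from the original projects; `navigator.gpu` and `requestAdapter()` are standard WebGPU APIs.

```typescript
// Minimal WebGPU feature detection in the browser.
// With TypeScript, the @webgpu/types package provides full typings; the narrow
// inline type below just keeps this sketch self-contained.
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as { gpu?: { requestAdapter(): Promise<unknown | null> } }).gpu;
  if (!gpu) return false;                      // browser has no WebGPU support at all
  const adapter = await gpu.requestAdapter();  // may still be null on unsupported hardware
  return adapter !== null;
}

hasWebGPU().then((ok) =>
  console.log(ok ? "WebGPU available: run the model locally" : "No WebGPU: fall back to the cloud")
);
```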

Web Stable Diffusion — image generation in airplane mode

Enabling anyone with a web browser to run bleeding-edge models like Stable Diffusion, LLaMA, and Vicuna on the edge is a game-changer for ML deployment, as it allows these workloads to run on a far wider variety of hardware. Whereas Web Stable Diffusion and WebLLM execute in the browser, MLC-LLM features universal deployment, in which models run natively on the device hardware. This includes iOS devices, Android phones, laptops, and other consumer devices.

WebLLM in action — no Internet connection required

This groundbreaking work is the result of a collaboration between the Catalyst research group at Carnegie Mellon University (CMU), OctoML, and the wider Apache TVM open source community. Together, this work enables modern generative AI models to run inside web browsers at near-native speeds using the new WebGPU standard, and to run natively on consumer devices, including iOS and Android.

MLC-LLM running on iPhone. Also available on Android.

The magic is made possible by a technology near and dear to us: Apache TVM. TVM is an open-source deep learning compiler framework that empowers engineers to optimize and run computations efficiently on any hardware backend. TVM was originally created by our co-founders, Octonauts remain heavy contributors to the project, and it powers many of the OctoML platform’s most unique features.

What does it take to run large models locally?

There are several key challenges to solve in order to achieve this goal. First, we need to support a diverse set of hardware and GPU devices. Second, we need a set of optimizations that let these models fit into the limited GPU memory budget of consumer devices. Both can be accomplished using TVM Unity, an exciting development from the Apache TVM community:

  • We first leverage the common TVM Unity compilation flow to build and optimize the model computation, applying techniques like kernel fusion and quantization (fp16 and int4) to save memory bandwidth and reduce user download times.
  • The TVM Unity compiler then generates GPU kernels for different backends: WebGPU Shading Language (WGSL) for web browsers, Metal shaders for iPhone, and Vulkan for Linux/Windows devices. The compilation flow lets us target each backend without extra per-backend engineering.
  • A TypeScript version of the TVM Unity runtime then invokes these kernels in the appropriate order inside the browser. For native devices, the native TVM runtime can be invoked from Swift, Python, C++, Rust, Java or whatever the application requires.
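To make that last step concrete, here is a minimal sketch of driving a compiled model from a web page through the WebLLM TypeScript package. It assumes the early chat-style API of @mlc-ai/web-llm; the exact class name, method signatures, and model ID shown are assumptions and may differ between versions.

```typescript
// Sketch only: assumes the @mlc-ai/web-llm package's early chat-style API;
// class names, method signatures, and the model ID may differ between versions.
import { ChatModule } from "@mlc-ai/web-llm";

async function runInBrowser(): Promise<void> {
  const chat = new ChatModule();
  // Downloads the int4-quantized weights and the WGSL kernels produced by the
  // TVM Unity compilation flow, then initializes them on the WebGPU device.
  await chat.reload("vicuna-v1-7b-q4f32_0"); // assumed model ID
  const reply = await chat.generate("Explain WebGPU in one sentence.");
  console.log(reply);
}

runInBrowser();
```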

Why local models?

This breakthrough is incredibly exciting for the future of ML applications. There will often be instances where data privacy is paramount, latency requirements are stringent, or available computing resources are idle. In these situations, this technology can help reduce costs and enable novel experiences that were previously unattainable.

But there are still many benefits to running workloads in the cloud, including:

  • Always-on availability
  • Access to powerful hardware
  • Control over hardware selection
  • Centralization of large shared assets which simplifies storage and access control

What does hybrid edge-cloud AI look like?

So how can we marry the benefits of cloud and edge ML computing to get the best of both worlds?

We have a few ideas that we’re working on and wanted to share an early glimpse for feedback here:

Cost mitigation — by splitting a workload into portions for edge and cloud, we can reduce the compute burden of running everything in the cloud while not over-burdening the edge device. Examples of this include:

  • A cascading classification system where a simple classifier runs on the edge to detect the easy and common cases but passes more difficult inputs on to a larger, more powerful cloud model. This form of ensemble learning provides better accuracy while reducing bandwidth and cloud compute consumption.
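As a rough sketch of that cascade, the code below keeps confident predictions on the device and escalates only hard inputs; the local classifier stub, confidence threshold, and cloud endpoint are all hypothetical placeholders, not part of any OctoML API.

```typescript
// Hypothetical cascade: a small on-device classifier handles confident cases,
// everything else is escalated to a larger cloud model.
type Prediction = { label: string; confidence: number };

// Stand-in for a small model running locally (e.g. compiled to WebGPU);
// stubbed here so the sketch is self-contained.
async function runLocalClassifier(input: Float32Array): Promise<Prediction> {
  return { label: "cat", confidence: input.length > 0 ? 0.95 : 0.0 };
}

const CONFIDENCE_THRESHOLD = 0.9; // tune per application

async function classify(input: Float32Array): Promise<Prediction> {
  const local = await runLocalClassifier(input);
  if (local.confidence >= CONFIDENCE_THRESHOLD) {
    return local; // easy/common case: no bandwidth or cloud compute used
  }
  // Hard case: escalate to a hypothetical cloud endpoint.
  const response = await fetch("https://example.com/api/classify", {
    method: "POST",
    headers: { "Content-Type": "application/octet-stream" },
    body: input.buffer,
  });
  return (await response.json()) as Prediction;
}
```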

More robust user experiences — if the same model can be run locally as well as remotely in the cloud, then whenever a user goes offline, they can continue to interact with the application’s intelligence using the local GPU and battery.
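One possible shape of that fallback, sketched with a hypothetical local generation stub and a placeholder cloud URL:

```typescript
// Prefer the cloud model when online; fall back to a local model when offline.
// generateLocally() is a hypothetical stand-in for a model compiled to run on
// the device GPU; the cloud endpoint is a placeholder.
async function generateLocally(prompt: string): Promise<string> {
  return `local model output for: ${prompt}`; // stub
}

async function generate(prompt: string): Promise<string> {
  if (navigator.onLine) {
    try {
      const res = await fetch("https://example.com/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
      });
      if (res.ok) return (await res.json()).text as string;
    } catch {
      // Network dropped mid-request: fall through to the local path.
    }
  }
  return generateLocally(prompt);
}
```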

Privacy and customization — by running some of the workload locally, users gain more privacy and customization than they would by uploading everything to the cloud. One example is running fine-tuning locally on your personal/private data, then sharing only the checkpoints or LoRA weights with the remote service so it can adapt to your use case without ever having access to the underlying data.
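A hedged sketch of that idea follows; the adapter structure and upload endpoint are hypothetical, and the point is simply that only the small adapter leaves the device, never the raw data.

```typescript
// Hypothetical: upload only the LoRA adapter produced by local fine-tuning.
// The personal data used for fine-tuning never leaves the device.
interface LoraAdapter {
  baseModel: string;   // which base model the adapter applies to
  rank: number;        // LoRA rank used during fine-tuning
  weights: ArrayBuffer; // a few MB of adapter weights, not the dataset
}

async function shareAdapter(adapter: LoraAdapter): Promise<void> {
  // Placeholder endpoint; a real service would also authenticate the user.
  await fetch("https://example.com/api/adapters", {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream",
      "X-Base-Model": adapter.baseModel,
      "X-Lora-Rank": String(adapter.rank),
    },
    body: adapter.weights,
  });
}
```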

OctoML makes AI accessible and sustainable. That means flexibility to run the models you want on the hardware you want, at a cost that enables you to build a thriving business. Efficient compilation and advanced acceleration techniques are what make this vision possible. The OctoML compute service makes fast, cost-efficient compute available to developers working with generative AI in the cloud thanks to the magic of acceleration. Sign up for early access to try it today.

We’re excited about the potential for integrating local models with our cloud service and want to hear from you about any hybrid edge/cloud AI needs you have. Please reach out directly, or catch us on May 23rd, when we’re hosting a Text/Image developer meetup at our Seattle HQ.
