Look ma, no WiFi! Gemma on Android and iPhone and more local LLM updates from MLC

Team Octo · Published in OctoAI · Feb 23, 2024

Now you can run Gemma 2B on your phone

iPhone and Android users can try out Google Gemma 2B on mobile devices, courtesy of MLC-LLM. The lightweight, 2B-parameter version of Gemma outputs 20 tokens/sec. Here is a demo of the 4-bit quantized Gemma 2B model running on a Samsung S23.

Your Android can now pen an epic poem about Pittsburgh, no WiFi required.
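
The same 4-bit build also runs on laptops and desktops through MLC's Python API, which is handy for prototyping before deploying to a phone. A minimal sketch, assuming the prebuilt quantized Gemma weights published under MLC's usual naming convention and the mlc_chat package available at the time of writing:

```python
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# The model id is an assumption based on MLC's prebuilt-weights naming;
# "q4f16_1" denotes the 4-bit weight / fp16 activation quantization scheme.
cm = ChatModule(model="HF://mlc-ai/gemma-2b-it-q4f16_1-MLC")

# Stream tokens to stdout as they are generated.
cm.generate(
    prompt="Pen an epic poem about Pittsburgh.",
    progress_callback=StreamToStdout(callback_interval=2),
)

# Print runtime statistics such as prefill and decode tokens/sec.
print(cm.stats())
```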

When Google released Gemma two days ago, the MLC community got excited to try it out. MLC is an open-source project supported by Carnegie Mellon University, OctoAI, and a broad community of contributors.

The MLC community had Gemma 2B running locally in under a day, thanks in part to the new MLC SLM compilation flow. SLM is a Pythonic compiler stack built on Apache TVM, led by compiler lead Junru Shao and collaborators at OctoAI, that features PyTorch-style model definition, quantization, and graph/operator-level optimization. SLM compiles models to run across a wide range of platforms and hardware backends: consumer- and server-class GPUs from all vendors, iOS and Android phones, web browsers, and more.
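
To give a flavor of the PyTorch-style model definition, here is a minimal sketch of a feed-forward block written against the relax nn frontend that ships with Apache TVM, which SLM builds on. The module and exporter names reflect our reading of the TVM APIs of the time and are not an excerpt from the actual Gemma port:

```python
from tvm.relax.frontend import nn

class MLP(nn.Module):
    """A tiny feed-forward block, defined torch-style."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x: nn.Tensor) -> nn.Tensor:
        return self.down(nn.op.silu(self.up(x)))

# Export the definition to a TVM IRModule, which the compiler can then
# quantize and optimize for a target backend (GPU, phone, WebGPU, ...).
mod, params = MLP(2048, 8192).export_tvm(
    spec={"forward": {"x": nn.spec.Tensor((1, 2048), "float32")}}
)
```

Because the definition lowers to TVM IR rather than executing eagerly, the same model code can target CUDA, Metal, Vulkan, or WebGPU without hand-written per-backend kernels.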

Keep an eye out for upcoming demos showcasing its capabilities!

Your iPhone can now bring you to tears with an ode to ML

Why LLMs at the edge?

Running models at the edge is fascinating in its own right, and particularly so in a hybrid setup where local models work in conjunction with larger models in the cloud. This appeals to business users for several reasons:

  1. It addresses latency and connectivity issues (since the speed of light remains a fundamental limitation)
  2. It can help alleviate some of the compute pressure on the cloud
  3. Running sensitive components locally can help mitigate privacy and governance complexities

MLC-LLM is one of the core technologies underpinning OctoAI’s tech stack. As CEO Luis Ceze puts it: “Hardware portability for LLMs not only enables more efficient use of cloud resources, it also enables new use cases. Chain local Phi-2 or Gemma 2B with large LLMs running on OctoAI cloud and you have a magic model cocktail!”
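
As a concrete illustration of that cocktail, here is a minimal sketch of one hybrid pattern: short or sensitive prompts stay on the local Gemma 2B, while heavier requests go to a large model behind OctoAI's OpenAI-compatible text endpoint. The routing heuristic, endpoint URL, and model names here are illustrative assumptions, not a prescribed recipe:

```python
import os
import requests
from mlc_chat import ChatModule

# Local: 4-bit Gemma 2B runs entirely on-device via MLC.
local = ChatModule(model="HF://mlc-ai/gemma-2b-it-q4f16_1-MLC")

def answer(prompt: str) -> str:
    # Naive routing: keep short prompts local, send long ones to the cloud.
    if len(prompt) < 200:
        return local.generate(prompt=prompt)
    # Cloud: a large model served by OctoAI (endpoint and model name assumed).
    resp = requests.post(
        "https://text.octoai.run/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"},
        json={
            "model": "mixtral-8x7b-instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

A real deployment would route on intent or confidence rather than prompt length, but the shape is the same: the small model handles the common case at zero marginal cost, and the cloud handles the long tail.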

What’s next for local LLMs?

MLC will continue to onboard new open-source models as quickly as we can. In recent months we’ve added Phi-2, Mixtral, and now Gemma, with more coming soon. We’re excited about the prospect of running fine-tunes to create hyper-fast, hyper-local, hyper-custom text-gen experiences. Fine-tuned OSS model variants are outperforming even the largest proprietary models in some human quality evaluations, making them attractive, low-cost options for many use cases.

The MLC chat beta for iPhone features a number of other models, including Mixtral, Phi-2, and Llama 2. Try it out here: https://testflight.apple.com/join/57zd7oxa

Or check out our repo: http://github.com/mlc-ai/mlc-llm

Try OctoAI for free today at octoai.cloud
