Accelerating Llama on Lima, with WASI-NN RPC

Akihiro Suda
Published in nttlabs
Jun 19, 2024

WasmEdge v0.14 was released last month, including our contribution that exposes WASI-NN (the WebAssembly System Interface API for Neural Networks) over gRPC.

The WASI-NN RPC is useful for accelerating LLM workloads (e.g., Llama) on virtual machines (e.g., Lima) that do not support virtualizing GPUs.

On an Apple M2 Pro, Llama 2 runs 22.3 times faster (0.66 tokens/s → 14.73 tokens/s).

Note: “Lima” in this context refers to <https://lima-vm.io> (VM), not to <https://gitlab.freedesktop.org/lima> (Mali GPU driver).

Problem: GPUs are inaccessible from VMs

Lima is a tool that creates Linux virtual machines with a simple command-line interface. Lima was originally made for running containerd, including nerdctl (contaiNERD CTL), on macOS, but it has gained popularity for other use cases as well.

For macOS hosts, Lima supports two backends: QEMU and Virtualization.framework. The lack of GPU support in these backends has been a huge burden for users who want to run AI workloads such as Llama efficiently inside Lima.

Solution: WASI-NN as the high-level RPC for neural networks on GPUs

Implementing GPU passthrough in these VM backends is not a straightforward task. Instead, we chose to implement an RPC subsystem that delegates neural network computations to a host process (WASI-NN RPC Server) with direct access to the host GPUs.

The RPC is built on top of gRPC and directly mapped to the WITX specification of the WASI-NN API.

// gRPC (excerpt; the Tensor message and the remaining definitions are omitted)
syntax = "proto3";

import "google/protobuf/empty.proto";

message SetInputRequest {
  uint32 resource_handle = 1;
  uint32 index = 2;
  Tensor tensor = 3;
}

message ComputeRequest {
  uint32 resource_handle = 1;
}

message GetOutputRequest {
  uint32 resource_handle = 1;
  uint32 index = 2;
}

message GetOutputResult {
  bytes data = 1;
}

service GraphExecutionContextResource {
  rpc SetInput(SetInputRequest) returns (google.protobuf.Empty) {};
  rpc Compute(ComputeRequest) returns (google.protobuf.Empty) {};
  rpc GetOutput(GetOutputRequest) returns (GetOutputResult) {};
}
;; WITX
(@interface func (export "set_input")
  (param $context $graph_execution_context)
  (param $index u32)
  (param $tensor $tensor)
  (result $error (expected (error $nn_errno)))
)

(@interface func (export "compute")
  (param $context $graph_execution_context)
  (result $error (expected (error $nn_errno)))
)

(@interface func (export "get_output")
  (param $context $graph_execution_context)
  (param $index u32)
  (param $out_buffer (@witx pointer u8))
  (param $out_buffer_max_size $buffer_size)
  (result $error (expected $buffer_size (error $nn_errno)))
)

The RPC client is implemented in WasmEdge. The RPC itself is agnostic to WASM and can be implemented by non-WASM applications too.
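
Because the wire protocol is plain gRPC, a host-native client can drive the same server directly. The following is a minimal sketch in Rust using tonic; the generated module name (wasi_ephemeral_nn), the Cargo dependencies, and the resource-handle setup (which corresponds to the graph-loading RPCs not shown in the excerpt above) are assumptions for illustration, not the exact names used by WasmEdge.

// Rust (native gRPC client, sketch)
// Assumed Cargo dependencies: tonic = "0.11", prost = "0.12",
// tokio = { version = "1", features = ["full"] }, tower = "0.4",
// plus client stubs generated from the WASI-NN RPC proto with tonic-build.
use tokio::net::UnixStream;
use tonic::transport::{Endpoint, Uri};
use tower::service_fn;
// Hypothetical name of the generated module; the real one comes from the
// `package` declaration in the proto shipped with WasmEdge.
use wasi_ephemeral_nn::{
    graph_execution_context_resource_client::GraphExecutionContextResourceClient,
    ComputeRequest, GetOutputRequest, SetInputRequest, Tensor,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The URI is required by tonic but ignored; the connector dials the
    // same UNIX socket that wasi_nn_rpcserver listens on ($HOME/nn.sock).
    let channel = Endpoint::try_from("http://[::]:50051")?
        .connect_with_connector(service_fn(|_: Uri| {
            UnixStream::connect(format!("{}/nn.sock", std::env::var("HOME").unwrap()))
        }))
        .await?;
    let mut client = GraphExecutionContextResourceClient::new(channel);

    // The handle would be obtained from the graph-loading RPCs (omitted here).
    let resource_handle = 0u32;

    // set_input -> compute -> get_output, mirroring the WITX functions.
    client
        .set_input(SetInputRequest {
            resource_handle,
            index: 0,
            // Fill in the tensor fields (dimensions, type, data) as defined in the proto.
            tensor: Some(Tensor::default()),
        })
        .await?;
    client.compute(ComputeRequest { resource_handle }).await?;
    let output = client
        .get_output(GetOutputRequest { resource_handle, index: 0 })
        .await?
        .into_inner();
    println!("received {} bytes", output.data.len());
    Ok(())
}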

Why does WASM matter here?

Actually, it really doesn’t. WASM appears here simply because:

  • the WASI-NN API provides quite a simple abstraction for neural networks (see the sketch after this list)
  • the WasmEdge implementation of WASI-NN already covers several backends such as PyTorch and GGML, with support for Apple Metal.
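
To give a feel for that abstraction, here is a minimal guest-side sketch in Rust, assuming the wasmedge-wasi-nn crate as used by the WasmEdge-WASINN-examples repository; the prompt template handling and the chat loop of the real example are omitted.

// Rust (WASM guest, sketch; build with --target wasm32-wasi)
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // "default" refers to the model registered via
    // --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // set_input / compute / get_output map 1:1 to the WITX functions shown above.
    let prompt = b"What is the capital city of Peru?".to_vec();
    ctx.set_input(0, TensorType::U8, &[1], &prompt).unwrap();
    ctx.compute().unwrap();

    let mut out = vec![0u8; 1 << 20];
    let n = ctx.get_output(0, &mut out).unwrap();
    println!("{}", String::from_utf8_lossy(&out[..n]));
}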

Alternatively, Dawn Wire (RPC for WebGPU) could be adopted instead of WASM and WASI-NN, but it would incur a higher implementation cost due to the difference in abstraction levels.

Demo: 22 times faster

Launching Lima

An instance of a Lima virtual machine can be created as follows:

# Host (macOS)
brew install lima
limactl start --vm-type=vz
lima

As of this writing, the brew command installs Lima v0.22, with Ubuntu 24.04 as the default VM template.

The --vm-type=vz flag in the limactl start command specifies Virtualization.framework (vz) as the VM driver. This flag is optional, but recommended for better performance and stability.

Installing WasmEdge onto the Lima guest

After running the lima command to open a shell for the VM, run the following commands to install WasmEdge inside the guest:

# Guest (Linux)
sudo apt-get install -y cmake libgrpc++-dev liblld-dev libopenblas-dev libopenblas64-dev llvm ninja-build pkg-config protobuf-compiler-grpc

git clone https://github.com/WasmEdge/WasmEdge.git
cd WasmEdge
git checkout 0.14.0

cmake -S. -B ./build -GNinja \
  -DCMAKE_BUILD_TYPE=Release \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=ON \
  -DWASMEDGE_BUILD_WASI_NN_RPC=ON
cmake --build ./build
sudo cmake --install ./build

Running Llama on Lima, without the acceleration

Inside the Lima VM, Llama can be executed with WasmEdge as follows:

# Guest (Linux)
curl -OSL https://github.com/second-state/WasmEdge-WASINN-examples/raw/da18b35c3c911a40a5d2784947ce78610ce51daf/wasmedge-ggml/nnrpc/wasmedge-ggml-nnrpc.wasm
curl -OSL https://huggingface.co/wasmedge/llama2/resolve/23de599453ce999ab1dc650bd01f6298af38eb18/llama-2-7b-chat-q5_k_m.gguf

wasmedge \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  --env enable_log=true \
  wasmedge-ggml-nnrpc.wasm default

The license and acceptable use policy for the llama-2-7b-chat-q5_k_m.gguf file can be found at <https://huggingface.co/wasmedge/llama2/tree/23de599>.
Llama was chosen to run inside Lima as a pun; other GGUF-formatted models can be used as well.

In the terminal, you can chat with the model, but it is quite slow (0.66 tokens per second on Apple M2 Pro) due to the lack of access to the host GPUs:

USER:
What is the capital city of Peru?

[...]
eval time = 13535.83 ms / 9 runs ( 1503.98 ms per token, 0.66 tokens per second)
[...]
ASSISTANT:
The capital city of Peru is Lima.</s>

It may even appear to hang, as the model’s output is not printed until text generation is complete. This issue is being addressed in <https://github.com/WasmEdge/WasmEdge/pull/3386> by implementing the WASI-NN Streaming Extension.

Installing WASI-NN RPC server onto the macOS host

The next step is to install WasmEdge, along with the WASI-NN RPC server, onto the macOS host, so that the guest can delegate LLM inference to the host, which has direct access to the GPUs.

# Host (macOS)
brew install cmake grpc llvm@16 ninja pkg-config

git clone https://github.com/WasmEdge/WasmEdge.git
cd WasmEdge
git checkout 0.14.0

export LLVM_DIR="${HOMEBREW_PREFIX}/opt/llvm@16/lib/cmake"
export CC="${HOMEBREW_PREFIX}/opt/llvm@16/bin/clang"
export CXX="${HOMEBREW_PREFIX}/opt/llvm@16/bin/clang++"
cmake -S. -B ./build -GNinja \
  -DCMAKE_BUILD_TYPE=Release \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_METAL=ON \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=OFF \
  -DWASMEDGE_BUILD_WASI_NN_RPC=ON
cmake --build ./build
sudo cmake --install ./build

The WASI-NN RPC server listens on a UNIX domain socket on the host. The socket can be forwarded to the guest with ssh -R <GUESTPATH>:<HOSTPATH>:

# Host (macOS)
curl -OSL https://huggingface.co/wasmedge/llama2/resolve/23de599453ce999ab1dc650bd01f6298af38eb18/llama-2-7b-chat-q5_k_m.gguf

wasi_nn_rpcserver \
  --nn-rpc-uri unix://$HOME/nn.sock \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf

ssh -F $HOME/.lima/default/ssh.config -R /home/${USER}.linux/nn.sock:$HOME/nn.sock lima-default

Running Llama on Lima, with the acceleration

WasmEdge running inside the Lima instance can now connect to the WASI-NN RPC server socket with the --nn-rpc-uri flag:

# Guest (Linux)
wasmedge \
  --nn-rpc-uri unix://$HOME/nn.sock \
  --env enable_log=true \
  wasmedge-ggml-nnrpc.wasm default

# Before
eval time = 13535.83 ms / 9 runs ( 1503.98 ms per token, 0.66 tokens per second)

# After
eval time = 611.14 ms / 9 runs ( 67.90 ms per token, 14.73 tokens per second)

On an Apple M2 Pro, the throughput improves from 0.66 tokens per second to 14.73 tokens per second (22.3 times faster).

Future: wRPC

In the future, WASI-NN RPC may be replaced by wRPC. wRPC is a fairly new Bytecode Alliance project that aims to define a standard distributed communication model for WASM components. wRPC could potentially be useful for exposing other host resources, such as biometric authenticators, to Lima as well.

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities in the fields of containers, WASM, LLM, etc. Visit <https://www.rd.ntt/e/sic/recruit/> to see how to join us.

We at NTT are looking for colleagues who will work with us in open source communities in the fields of containers, WASM, LLM, and more. Please visit our recruiting page (in Japanese): <https://www.rd.ntt/sic/recruit/>
