Getting started with Candle 🕯️

Cursor · May 16, 2024

A Rust-based machine learning framework that makes serverless inference possible.


Introduction

Candle is a new Rust-based machine learning framework. In this post we will go through Candle's GitHub repo and see how you can use it in your next project.

First of all, you need to clone the candle repo to use it for inference:

git clone https://github.com/huggingface/candle.git
cd candle

The GitHub repo has many awesome resources to get you started with Candle as your next ML framework.
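Before we jump to the command line, here is a rough idea of what Candle looks like as a library. This is a minimal sketch along the lines of the snippet in the repo's README, assuming you have added candle-core as a dependency; the exact API can shift between versions.

```rust
use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Run on the CPU; swap in a CUDA/Metal device if you built with acceleration.
    let device = Device::Cpu;

    // Two random matrices and a matrix multiplication, PyTorch-style.
    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 4), &device)?;
    let c = a.matmul(&b)?;

    println!("{c}");
    Ok(())
}
```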

Running our First Model in Candle

For now, the only part of the repo we need is candle-examples.
Inside its examples folder you will find all the models available for inference.

This folder holds all the necessary code to run LLMs from the command line. Choose the one you want to try; each model has its own readme file that explains the use case and the commands to run it.
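For example, you can list the available example folders like this (at the time of writing they live under candle-examples/examples, though the layout may change):

ls candle-examples/examples/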

Now let's run our very first model with Candle from the command line.

cargo run --example mistral --release --features cuda \
-- --prompt 'Write hello world code in Rust' --sample-len 150 --quantized

Note: if you don't have CUDA installed, simply remove the
--features flag, or specify mkl for Intel and accelerate/mps for Apple chips.

After you run this command it will take some time to fetch all the necessary dependencies for this model and compile the binary. Once it's done you can rerun the command with different prompts in an instant; Rust only takes its sweet time compiling the first time around.

You can pass more arguments to this command to alter the model's generation.

Let's understand the command first.
If you are new to Rust: cargo run is the Rust command used to compile and run a binary, and by default it builds in debug mode, which is too slow for inference.

--release tells Cargo we want an optimized release build instead.
--example mistral tells Cargo we want to run the binary from the mistral example folder. You can specify any other model's directory name and it will work just as well.
--features cuda specifies the hardware acceleration we want to use to make inference run faster. If you don't use this flag, the model runs on the CPU device with no hardware acceleration.

For Apple M-series chips you can specify either mps or accelerate (specifying both will choose mps); for Intel chips you can use mkl.
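For example, on an Apple machine the same run could look like this (using the accelerate feature mentioned above; which feature to pick depends on your chip and Candle version):

cargo run --example mistral --release --features accelerate \
-- --prompt 'Write hello world code in Rust' --sample-len 150 --quantized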

Until now we were specifying args for Cargo and the candle build itself; next come the binary-specific args. To pass binary-specific args we add a bare double hyphen -- after the features arg.

--prompt passes the prompt used to generate text.
--sample-len sets the maximum number of tokens to generate.
--quantized runs the quantized model weights, which use the GGUF format.

Now let's look at some arguments we didn't use above but which are still helpful (see the example right after this list).
--model-id takes a Hugging Face model card name and downloads that specific model for inference.
--weight-files complements the command above: if the model weights are already downloaded locally, you can point to the files directly.
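For instance, to pull a specific model card from the Hub you could run something like this (the model name here is only an illustration; substitute whichever card you want to try):

cargo run --example mistral --release --features cuda \
-- --model-id mistralai/Mistral-7B-Instruct-v0.1 --prompt 'Write hello world code in Rust'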

To see all the available arguments for an example, you can run the command below; it will list them all.

cargo run --example mistral --release -- --help

And this is our very first generation with Candle 🥳.

Write hello world code in Rust

```rust
fn main() {
println!("Hello, world!");
}
```

## 1. Variables and Data Types

### 1.1. Variables

Variables are used to store data values. They can be declared using the
`let` keyword followed by a name and an optional type annotation.
The type of the variable is inferred from its initial value if no explicit
type is specified.

```rust
let x = 5; // implicitly typed as i32 (integer)
let y: f64 = 3.14; // explicitly typed as floating-point number
let z = true; // implicitly typed
```

150 tokens generated (6.71 token/s)

You can run other models in the same manner by specifying the model's folder name after the --example arg. Give it a try by running the LLaMA model on your own and see how it works out for you.
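For example, assuming the example folder is simply named llama, the run would look like this (check its readme for the exact flags it supports):

cargo run --example llama --release --features cuda \
-- --prompt 'Write hello world code in Rust'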

There's more!

There is an awesome binary called tensor-tools which lets you quantize your own model with ease.
It can be used to quantize models and to inspect layer names and weight dtypes.

This command lists the layer names in your weight files along with their dtypes.

cargo run --release --bin tensor-tools -- ls model-weights-path

To quantize your model, use the command below.

cargo run --release --bin tensor-tools -- quantize --quantization q4_0  \
model.safetensors --out-file model_q4_0.gguf

To see more options, you can check out the main.rs of tensor-tools.

In the above command:
--quantization q4_0 specifies the quantization scheme, followed by the path to the model weights that are to be quantized.

--out-file model_q4_0.gguf specifies where to save the quantized model weights in the .gguf file format.
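Once the quantized file is written, you can point the ls subcommand from earlier at it to confirm that the layers now report the quantized dtype:

cargo run --release --bin tensor-tools -- ls model_q4_0.gguf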

In this post we saw how you can get started with Candle, the new machine learning framework written in Rust.

Rust is gaining industry traction because of its safety and speed, and Candle builds on both to provide a simple yet powerful way to deploy LLMs.

If you have any queries regarding Candle, or you want to contribute new models, you can join the Hugging Face Discord channel.
