Install a Quantized Poro LLM on Google Cloud (GCP) with GPUs: Step-by-Step Guide

Timo Laine
4 min read · Feb 21, 2024


Poro LLM is a 34B-parameter decoder-only transformer pretrained on Finnish, English, and code. The latest fully trained checkpoint was recently published on Hugging Face. In this article, we show how to run a quantized version of Poro on GCP Vertex AI with GPUs. The quantized model used here is TheBloke's quantization of the 700B-token checkpoint; when a quantized version of the 1000B-token checkpoint becomes available, switching to it is just a parameter change in the code.

GPU Quota

We assume that GCP billing and an active GCP project have already been set up. Next we set up Vertex AI and the instance. The prerequisite for using GPUs in Vertex AI is that the user must request a GPU quota increase from Google. The quantized model requires about 40 GB of GPU memory, which means, for example, 2 NVIDIA L4 or T4 GPUs. Initially, the user has a quota for only 1 GPU. Quota increases are requested on the GCP "Quotas and System limits" page, for example for "NVIDIA L4 GPUs". Choose a region that is close to you or has free capacity. Quota request processing can take 1–2 business days.

We now assume that the quota request has been approved.

Vertex AI Instance

Navigate to the Vertex AI home page and open the Workbench tab. Create a new User-Managed Notebook.

Choose a region where you have a quota for 2 GPUs. The required environment is Debian 11 and Python 3 (Intel MKL and CUDA 11.8). You need around 40 GB of GPU memory, which means, for example, 2 NVIDIA L4 GPUs and a machine type with around 40 GB of RAM. Some details and screenshots can be found on GitHub.

The instance has now been created and is running.

Python environment

A common problem when setting up a Python environment is finding the right, mutually compatible Python libraries, including ones that support CUDA and the GPU. The approach here is to clone the code and the Python environment from GitHub, create a new Python kernel based on this environment, and then run a Jupyter notebook on that kernel.

The Python environment requires, for example, the following Python libraries:

pip install torch transformers optimum auto-gptq accelerate jupyter ipykernel

Open the Jupyter notebook interface and open a Terminal.

Clone the repository, create the Conda environment, install the Python libraries, and create the Poro kernel. Details can be found in the instructions on GitHub.

Testing the model

Now we are ready to execute the cells in the Poro LLM notebook. First, we can check that torch and CUDA are installed.
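A minimal check, assuming the Poro kernel is selected, could look like this:

# Verify that PyTorch sees CUDA and both GPUs before loading the model.
import torch

print(torch.__version__)
print(torch.cuda.is_available())     # expected: True
print(torch.cuda.device_count())     # expected: 2 for a 2 x L4 setup
print(torch.cuda.get_device_name(0))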

The model revision that is used is “gptq-4bit-32g-actorder_True”.
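A loading sketch using the transformers library could look like the following; the repository id TheBloke/Poro-34B-GPTQ is an assumption here, so check the exact model id on Hugging Face.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Poro-34B-GPTQ"          # assumed repository id; verify on Hugging Face
revision = "gptq-4bit-32g-actorder_True"     # the quantization branch used in this article

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",   # spreads the ~40 GB model across the available GPUs
)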

We define the prompt and the prompt template.
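For example (the exact prompt in the notebook may differ; as a base model, Poro typically needs no special instruction format, so the template can be the raw prompt):

prompt = "What is the capital of Finland?"   # illustrative prompt
prompt_template = f"{prompt}"                # base model: plain prompt, no chat template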

The input, i.e. the prompt, is tokenized and passed to model.generate.
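A sketch of the tokenization and generation step, reusing the model and tokenizer loaded above; the sampling parameters are illustrative:

# Tokenize the prompt and move the input ids to the model's device.
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))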

The decoded completion is shown in the notebook output.

An alternative way is to use the transformers pipeline function.
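A sketch of the same generation with the pipeline API, reusing the model and tokenizer objects from above:

from transformers import pipeline

# The generation parameters are forwarded to model.generate.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
)
print(pipe(prompt_template)[0]["generated_text"])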

Poro LLM is specifically trained on the Finnish language, so we try the same prompt in Finnish.
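For example (an illustrative Finnish prompt, not necessarily the one used in the notebook):

finnish_prompt = "Mikä on Suomen pääkaupunki?"   # "What is the capital of Finland?"
input_ids = tokenizer(finnish_prompt, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))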

Both model.generate and the pipeline produce a Finnish completion; the outputs are shown in the notebook.

One can try different prompts and vary the temperature, top_p, and top_k parameters. With this configuration, it takes a few seconds to generate a completion.

Conclusions

When testing, one may notice that the results are not always consistent. The main reason is that the quantized model used in this example is based on the 700B-token checkpoint (training 70% complete), so the results should be interpreted with caution. When the model is 100% trained, we expect to get much better results. Overall, we find the Poro model very interesting: it works well on GCP, and we expect it to generate a lot of interest in the market.
