Running the Falcon-40B-Instruct model on Azure Kubernetes Service

Saverio Proto
Published in Microsoft Azure
Jun 27, 2023 · 23 min read

In today’s world, it has become remarkably easy to develop applications that use large language models by calling a REST API, thanks to the availability of services like Azure OpenAI or openai.com.

But is calling a model offered by a third party via a REST API the only way to go? How challenging is it to run an alternative open-source model, like the Falcon-40B-Instruct model, on a Kubernetes cluster with local GPUs?

In this article, I will show how to run the Falcon-40B-Instruct model on Azure Kubernetes Service (AKS) and test whether it can really be used as an alternative to the gpt-35-turbo or gpt-4 models.

Create a cluster with GPUs

First, create an AKS cluster. You are free to create a new cluster or work with an existing one. What really matters is creating a new node pool with a VM SKU that has a local GPU. I used the NC A100 v4-series virtual machines.

To run my experiment I used the AKS GPU image (in preview as of June 2023).

#!/bin/bash

# Enable AKS preview feature,
# Detailed instructions at:
# https://learn.microsoft.com/en-us/azure/aks/gpu-cluster#update-your-cluster-to-use-the-aks-gpu-image-preview

az extension add --name aks-preview
az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
az provider register --namespace Microsoft.ContainerService

# Create a Nodepool where each VM has 2 Nvidia A100 GPUs
# More SKUs options here:
# https://learn.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series

az aks nodepool add \
--resource-group <group> \
--cluster-name <name> \
--name gpunp \
--node-count 1 \
--node-vm-size Standard_NC48ads_A100_v4 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true

To run the Falcon-40B-Instruct model you need at least the Standard_NC48ads_A100_v4 SKU, with a total of 160 GB of GPU memory (2 × 80 GB).
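As a rough sanity check on that sizing: assuming the weights are loaded in 16-bit precision (2 bytes per parameter), the 40B parameters alone occupy about 80 GB before counting activations and the KV cache. That is why a single 80 GB A100 is not enough and the model must be sharded across two GPUs:

```python
# Back-of-the-envelope GPU memory estimate for the Falcon-40B weights,
# assuming 16-bit (bf16/fp16) precision: 2 bytes per parameter.
PARAMS = 40e9
BYTES_PER_PARAM = 2

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # weights alone, in GB
per_gpu_gb = weights_gb / 2                  # per GPU when sharded on 2 GPUs

print(weights_gb, per_gpu_gb)  # 80.0 40.0
```

The ~40 GB of weights per GPU is consistent with the ~44 GiB per GPU reported by nvidia-smi later in this article, with the extra memory going to activations and buffers.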

Run a Large Language Model in a Kubernetes Pod

The Falcon-40B-Instruct model is available on the huggingface.co hub, a platform that provides a centralized repository for pretrained models and datasets, enabling seamless sharing, collaboration, and accessibility.

To run the model I used the HuggingFace Text Generation Inference container. It packages a Rust, Python and gRPC server for text generation inference that can download models at runtime from the huggingface.co hub and exposes a REST API to interact with the model.

Here are the necessary Kubernetes YAML definitions:

---
apiVersion: v1
kind: Pod
metadata:
  name: text-generation-inference
  labels:
    run: text-generation-inference
spec:
  containers:
  - name: text-generation-inference
    image: ghcr.io/huggingface/text-generation-inference:0.8.2
    env:
    - name: RUST_BACKTRACE
      value: "1"
    command:
    - "text-generation-launcher"
    - "--model-id"
    - "tiiuae/falcon-40b-instruct"
    - "--num-shard"
    - "2"
    ports:
    - containerPort: 80
      name: http
    volumeMounts:
    - name: falcon-40b-instruct
      mountPath: /data
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: falcon-40b-instruct
    persistentVolumeClaim:
      claimName: falcon-40b-instruct
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
  nodeSelector:
    agentpool: gpunp
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule
  restartPolicy: Never
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: falcon-40b-instruct
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-csi-premium
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: text-generation-inference
  type: ClusterIP

You can see the Pod is named text-generation-inference and runs the container image ghcr.io/huggingface/text-generation-inference:0.8.2. The Pod has a nodeSelector and tolerations so that it is scheduled on our node pool named gpunp. I attached a 500 GB disk to store the Falcon-40B-Instruct files, which are downloaded the first time the Pod starts. The --num-shard parameter is necessary to use both GPUs on the VM. The ClusterIP Service makes the REST API available to the other Pods in the cluster at the URL http://text-generation-inference.

This is what the boot looks like in the Pod logs:

2023-06-23T11:49:26.129901Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(2), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-23T11:49:26.129930Z INFO text_generation_launcher: Sharding model on 2 processes
2023-06-23T11:49:26.130005Z INFO text_generation_launcher: Starting download process.
2023-06-23T11:49:27.969537Z WARN download: text_generation_launcher: No safetensors weights found for model tiiuae/falcon-40b-instruct at revision None. Downloading PyTorch weights.
2023-06-23T11:49:27.991221Z INFO download: text_generation_launcher: Download file: pytorch_model-00001-of-00009.bin
2023-06-23T11:49:35.672440Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00001-of-00009.bin in 0:00:07.
2023-06-23T11:49:35.672515Z INFO download: text_generation_launcher: Download: [1/9] -- ETA: 0:00:56
2023-06-23T11:49:35.672741Z INFO download: text_generation_launcher: Download file: pytorch_model-00002-of-00009.bin
2023-06-23T11:49:43.702957Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00002-of-00009.bin in 0:00:08.
2023-06-23T11:49:43.703027Z INFO download: text_generation_launcher: Download: [2/9] -- ETA: 0:00:52.500000
2023-06-23T11:49:43.703242Z INFO download: text_generation_launcher: Download file: pytorch_model-00003-of-00009.bin
2023-06-23T11:49:51.480340Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00003-of-00009.bin in 0:00:07.
2023-06-23T11:49:51.480447Z INFO download: text_generation_launcher: Download: [3/9] -- ETA: 0:00:46.000002
2023-06-23T11:49:51.480648Z INFO download: text_generation_launcher: Download file: pytorch_model-00004-of-00009.bin
2023-06-23T11:49:59.122129Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00004-of-00009.bin in 0:00:07.
2023-06-23T11:49:59.122210Z INFO download: text_generation_launcher: Download: [4/9] -- ETA: 0:00:38.750000
2023-06-23T11:49:59.122443Z INFO download: text_generation_launcher: Download file: pytorch_model-00005-of-00009.bin
2023-06-23T11:50:08.714867Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00005-of-00009.bin in 0:00:09.
2023-06-23T11:50:08.714954Z INFO download: text_generation_launcher: Download: [5/9] -- ETA: 0:00:32
2023-06-23T11:50:08.715220Z INFO download: text_generation_launcher: Download file: pytorch_model-00006-of-00009.bin
2023-06-23T11:50:16.546692Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00006-of-00009.bin in 0:00:07.
2023-06-23T11:50:16.546824Z INFO download: text_generation_launcher: Download: [6/9] -- ETA: 0:00:24
2023-06-23T11:50:16.547217Z INFO download: text_generation_launcher: Download file: pytorch_model-00007-of-00009.bin
2023-06-23T11:50:24.402959Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00007-of-00009.bin in 0:00:07.
2023-06-23T11:50:24.403060Z INFO download: text_generation_launcher: Download: [7/9] -- ETA: 0:00:16
2023-06-23T11:50:24.403370Z INFO download: text_generation_launcher: Download file: pytorch_model-00008-of-00009.bin
2023-06-23T11:50:32.654766Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00008-of-00009.bin in 0:00:08.
2023-06-23T11:50:32.654854Z INFO download: text_generation_launcher: Download: [8/9] -- ETA: 0:00:08
2023-06-23T11:50:32.655187Z INFO download: text_generation_launcher: Download file: pytorch_model-00009-of-00009.bin
2023-06-23T11:50:40.012098Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00009-of-00009.bin in 0:00:07.
2023-06-23T11:50:40.012184Z INFO download: text_generation_launcher: Download: [9/9] -- ETA: 0
2023-06-23T11:50:40.012354Z WARN download: text_generation_launcher: No safetensors weights found for model tiiuae/falcon-40b-instruct at revision None. Converting PyTorch weights to safetensors.
2023-06-23T11:50:40.012619Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00001-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00001-of-00009.safetensors.
2023-06-23T11:51:15.059204Z INFO download: text_generation_launcher: Convert: [1/9] -- Took: 0:00:35.046275
2023-06-23T11:51:15.060440Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00002-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00002-of-00009.safetensors.
2023-06-23T11:52:11.114788Z INFO download: text_generation_launcher: Convert: [2/9] -- Took: 0:00:56.054288
2023-06-23T11:52:11.115912Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00003-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00003-of-00009.safetensors.
2023-06-23T11:53:06.377099Z INFO download: text_generation_launcher: Convert: [3/9] -- Took: 0:00:55.261088
2023-06-23T11:53:06.377724Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00004-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00004-of-00009.safetensors.
2023-06-23T11:54:01.757724Z INFO download: text_generation_launcher: Convert: [4/9] -- Took: 0:00:55.379758
2023-06-23T11:54:01.758025Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00005-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00005-of-00009.safetensors.
2023-06-23T11:54:57.161272Z INFO download: text_generation_launcher: Convert: [5/9] -- Took: 0:00:55.403175
2023-06-23T11:54:57.162452Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00006-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00006-of-00009.safetensors.
2023-06-23T11:55:52.770926Z INFO download: text_generation_launcher: Convert: [6/9] -- Took: 0:00:55.608191
2023-06-23T11:55:52.771190Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00007-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00007-of-00009.safetensors.
2023-06-23T11:56:48.227538Z INFO download: text_generation_launcher: Convert: [7/9] -- Took: 0:00:55.456234
2023-06-23T11:56:48.228125Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00008-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00008-of-00009.safetensors.
2023-06-23T11:57:43.721701Z INFO download: text_generation_launcher: Convert: [8/9] -- Took: 0:00:55.493290
2023-06-23T11:57:43.721794Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00009-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00009-of-00009.safetensors.
2023-06-23T11:58:23.846104Z INFO download: text_generation_launcher: Convert: [9/9] -- Took: 0:00:40.124101
2023-06-23T11:58:24.349803Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-23T11:58:24.349980Z INFO text_generation_launcher: Starting shard 0
2023-06-23T11:58:24.350299Z INFO text_generation_launcher: Starting shard 1
2023-06-23T11:58:34.363221Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T11:58:34.363426Z INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T11:58:39.349892Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
rank=1
2023-06-23T11:58:39.368956Z INFO text_generation_launcher: Shard 1 ready in 15.017869656s
2023-06-23T11:58:39.402001Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
rank=0
2023-06-23T11:58:39.467973Z INFO text_generation_launcher: Shard 0 ready in 15.116886716s
2023-06-23T11:58:39.565405Z INFO text_generation_launcher: Starting Webserver
2023-06-23T11:58:40.090837Z INFO text_generation_router: router/src/main.rs:178: Connected

Once the model is running, you should see the memory usage equally distributed between the two GPUs. Run kubectl exec -ti text-generation-inference -- /bin/bash to obtain a shell in the Pod, then run the nvidia-smi utility to inspect the GPUs:

root@text-generation-inference:/usr/src# nvidia-smi
Tue Jun 27 07:38:12 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000001:00:00.0 Off | 0 |
| N/A 33C P0 74W / 300W | 44419MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000002:00:00.0 Off | 0 |
| N/A 32C P0 70W / 300W | 44419MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

Simple testing with curl

The simplest test is to connect to the model API directly.

Create a Pod in the cluster; you can use any image that has curl, plus jq if you want nicely formatted JSON output:

kubectl run -ti --rm --image=nicolaka/netshoot shell -- /bin/bash

Now connect to the Kubernetes Service text-generation-inference:

shell:~# curl -s http://text-generation-inference/generate \
-X POST \
-d '{"inputs":"Can you give me step to step instructions to prepare Tiramisu?","parameters":{"max_new_tokens":1000}}' \
-H 'Content-Type: application/json' | jq -r .generated_text

Sure! Here are the steps to prepare Tiramisu:

Ingredients:
- 3 eggs
- 1/2 cup sugar
- 1/2 cup mascarpone cheese
- 1/2 cup heavy cream
- 1/4 cup espresso
- 1/4 cup rum
- 1/2 cup ladyfingers
- 1/4 cup cocoa powder

Instructions:
1. Separate the eggs and beat the yolks with sugar until light and creamy.
2. In a separate bowl, beat the egg whites until stiff peaks form.
3. In another bowl, mix the mascarpone cheese and heavy cream until smooth.
4. Add the egg yolk mixture to the cheese mixture and mix well.
5. In a shallow dish, mix the espresso and rum.
6. Dip the ladyfingers in the espresso mixture and arrange them in a 9x9 inch baking dish.
7. Spread half of the cheese mixture over the ladyfingers.
8. Repeat steps 6 and 7 with the remaining ladyfingers and cheese mixture.
9. Sprinkle cocoa powder over the top.
10. Cover and refrigerate for at least 4 hours.
11. Serve chilled and enjoy!
shell:~#

The Falcon-40B-instruct model is up and running correctly.
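The same request can of course be sent from application code instead of curl. Here is a minimal Python sketch using only the standard library; the URL and payload mirror the curl example above, and the request must be sent from a Pod inside the cluster for the Service name to resolve:

```python
import json
import urllib.request

def build_generate_request(prompt,
                           max_new_tokens=1000,
                           url="http://text-generation-inference/generate"):
    """Build the same POST request the curl example sends to the
    TGI /generate endpoint."""
    payload = {"inputs": prompt,
               "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# From inside the cluster you would then run:
# with urllib.request.urlopen(build_generate_request("Hello!")) as resp:
#     print(json.loads(resp.read())["generated_text"])
```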

Advanced testing with the Cheshire Cat AI

I want to test the local model with more advanced use cases, like the use of LangChain tools. I published the kube-cheshire-cat project on GitHub to make it easy to install the Cheshire Cat AI on Kubernetes; the Cheshire Cat is a LangChain-based framework to build custom AIs on top of any language model. I patched the framework to add support for models exposed with the HuggingFace Text Generation Inference container.

Using the latest version of the Cheshire Cat AI, I can configure my local language model in the local web interface:

We can repeat our previous test using the chat window in the admin UI to make sure everything is configured correctly:

It works, but the answer is a bit shorter than the one obtained with the direct curl request. This is because the prompt is not identical: the Cheshire Cat takes our input and builds a more complex prompt, because it uses a LangChain Agent, as you can read in the core container logs:

INFO:     connection open
[2023-06-27 07:45:51.014] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => {'prompt_settings': {'prefix': '',
[2023-06-27 07:45:51.014] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_declarative_memory': True,
[2023-06-27 07:45:51.014] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_episodic_memory': True,
[2023-06-27 07:45:51.015] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_procedural_memory': True},
[2023-06-27 07:45:51.015] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'text': 'Can you give me step to step instructions to prepare Tiramisu?'}
[2023-06-27 07:45:51.029] DEBUG cat.looking_glass.py 193 (CheshireCat.recall_relevant_memories_to_working_memory) => 'Recall query: "Can you give me step to step instructions to prepare Tiramisu?"'
[2023-06-27 07:45:51.057] INFO cat.looking_glass.py 71 (AgentManager.get_agent_executor) => 'Sending prompt'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ('You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => "You are curious, funny and talk like the Cheshire Cat from Alice's "
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'adventures in wonderland.'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'You answer Human using tools and context.'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Tools:'
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'To use a tool, use the following format:'
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? Yes'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action: the action to take /* should be one of [] */'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action Input: the input to the action'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Observation: the result of the action'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'When you have a response to say to the Human, or if you do not need to use a '
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'tool, you MUST use the format:'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? No'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'AI: [your response here]'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Context'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' '
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{episodic_memory}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{declarative_memory}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '## Conversation until now:{chat_history}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' - Human: {input}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.077] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# What would the AI reply?'
[2023-06-27 07:45:51.077] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.077] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{agent_scratchpad}')
Error in on_chain_start callback: 'name'
Error in on_chain_start callback: 'name'
Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:



To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: Can you give me step to step instructions to prepare Tiramisu?

# What would the AI reply?



> Finished chain.
[2023-06-27 07:45:56.531] ERROR cat.looking_glass.py 385 (CheshireCat.__call__) => 'LLM does not respect prompt instructions'
[2023-06-27 07:45:56.544] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => ('Could not parse LLM output: `'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => 'Sure! Here are the steps to prepare Tiramisu:'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => ''
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '1. Brew espresso and let it cool.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '2. Beat egg yolks with sugar until pale and creamy.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '3. Add mascarpone cheese and beat until smooth.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '4. In a separate bowl, beat egg whites until stiff peaks form.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '5. Fold egg whites into the mascarpone mixture.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '6. Dip ladyfingers in espresso and arrange them in a dish.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '7. Pour half of the mascarpone mixture over the ladyfingers.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '8. Repeat steps 6 and 7.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '9. Dust with cocoa powder and refrigerate for at least 4 hours.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '10. Enjoy!`')
[2023-06-27 07:45:56.559] DEBUG cat.looking_glass.py 397 (CheshireCat.__call__) => 'cat_message:'
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => {'input': 'Can you give me step to step instructions to prepare Tiramisu?',
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'intermediate_steps': [],
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'output': ''
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'Sure! Here are the steps to prepare Tiramisu:'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => ''
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '1. Brew espresso and let it cool.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '2. Beat egg yolks with sugar until pale and creamy.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '3. Add mascarpone cheese and beat until smooth.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '4. In a separate bowl, beat egg whites until stiff peaks form.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '5. Fold egg whites into the mascarpone mixture.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '6. Dip ladyfingers in espresso and arrange them in a dish.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '7. Pour half of the mascarpone mixture over the ladyfingers.'
[2023-06-27 07:45:56.571] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '8. Repeat steps 6 and 7.'
[2023-06-27 07:45:56.571] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '9. Dust with cocoa powder and refrigerate for at least 4 hours.'
[2023-06-27 07:45:56.571] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '10. Enjoy!'}

The logs above show the first sign of trouble: the LLM is not respecting the prompt instructions, and LangChain throws an OutputParserException because the LangChain agent cannot parse the expected output. The Cheshire Cat is able to handle this situation and move forward, but the agent functionality might be impacted.
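To see why parsing fails, here is a minimal sketch of the kind of output parsing a ReAct-style agent performs (a simplified illustration, not LangChain's actual parser): the agent expects either an Action block or an "AI:" final answer, and Falcon's raw recipe text contains neither marker.

```python
# Simplified sketch of ReAct-style agent output parsing
# (an illustration of the idea, not LangChain's real parser).
def parse_agent_output(text: str):
    if "AI:" in text:
        # Final answer addressed to the human.
        return ("final", text.split("AI:", 1)[1].strip())
    if "Action:" in text and "Action Input:" in text:
        # The model wants to call a tool.
        action = text.split("Action:", 1)[1].splitlines()[0].strip()
        return ("tool", action)
    raise ValueError(f"Could not parse LLM output: `{text[:40]}`")

# A compliant completion parses cleanly:
parse_agent_output("Thought: Do I need to use a tool? No\nAI: It is tea time!")

# Falcon's raw answer ("Sure! Here are the steps to prepare Tiramisu: ...")
# contains neither marker, so a parser like this raises an exception,
# matching the error seen in the logs above.
```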

Let’s now check whether the Falcon-40B-Instruct model can use LangChain tools. I will ask the question “What time is it?”, which the model can answer only by using a tool, because the current time is unknown to the model.

To better explain how tools should work, let’s first see what happens when using Azure OpenAI gpt-35-turbo:

The model gpt-35-turbo is able to return the current time using a tool
In the sidebar we can confirm the get_the_time tool was used to answer the question

The model can return the current time using the get_the_time tool:

from datetime import datetime

@tool
def get_the_time(tool_input, cat):
    """Retrieves current time and clock. Input is always None."""

    return str(datetime.now())

The key idea is to send the model a prompt with the list of tools that the Agent can call in case additional information is required to answer the human’s question.
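This mechanism can be sketched as follows: the tools list is simply rendered as text into the prompt, mirroring the "# Tools:" block visible in the agent logs (a simplified illustration, not the framework's actual code):

```python
def render_tools_section(tools):
    """Render a '# Tools:' prompt block from (name, description) pairs,
    mimicking the format seen in the Cheshire Cat agent logs."""
    lines = ["# Tools:", ""]
    for name, description in tools:
        lines.append(f"> {name}: {name}(tool_input) - {description}")
    lines.append("")
    # The allowed action names are spelled out in the instructions block:
    names = ", ".join(name for name, _ in tools)
    lines.append(f"Action: the action to take /* should be one of [{names}] */")
    return "\n".join(lines)

print(render_tools_section([
    ("get_the_time", "Retrieves current time and clock. Input is always None."),
]))
```

With an empty tools list the allowed-actions brackets render as "[]", which is exactly what the earlier Falcon test prompt showed.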

Here is the complete log that demonstrates how the gpt-35-turbo model was able to use the offered tool:

INFO:     connection open
[2023-06-27 07:53:05.927] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => {'prompt_settings': {'prefix': '',
[2023-06-27 07:53:05.927] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_declarative_memory': True,
[2023-06-27 07:53:05.928] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_episodic_memory': True,
[2023-06-27 07:53:05.928] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_procedural_memory': True},
[2023-06-27 07:53:05.928] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'text': 'What time is it ?'}
[2023-06-27 07:53:05.942] DEBUG cat.looking_glass.py 193 (CheshireCat.recall_relevant_memories_to_working_memory) => 'Recall query: "What time is it ?"'
[2023-06-27 07:53:06.240] INFO cat.looking_glass.py 71 (AgentManager.get_agent_executor) => 'Sending prompt'
[2023-06-27 07:53:06.253] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ('You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => "You are curious, funny and talk like the Cheshire Cat from Alice's "
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'adventures in wonderland.'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'You answer Human using tools and context.'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Tools:'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '> get_the_time: get_the_time(tool_input) - Retrieves current time and clock. '
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Input is always None.'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'To use a tool, use the following format:'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? Yes'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action: the action to take /* should be one of [get_the_time] */'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action Input: the input to the action'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Observation: the result of the action'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'When you have a response to say to the Human, or if you do not need to use a '
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'tool, you MUST use the format:'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? No'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'AI: [your response here]'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Context'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' '
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{episodic_memory}'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{declarative_memory}'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '## Conversation until now:{chat_history}'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' - Human: {input}'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# What would the AI reply?'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{agent_scratchpad}')
Error in on_chain_start callback: 'name'
Error in on_chain_start callback: 'name'
Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:

> get_the_time: get_the_time(tool_input) - Retrieves current time and clock. Input is always None.

To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [get_the_time] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: What time is it ?

# What would the AI reply?



> Finished chain.
Thought: Do I need to use a tool? Yes
Action: get_the_time
Action Input: None
Observation: 2023-06-27 07:53:07.714140
Error in on_chain_start callback: 'name'
Thought:Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:

> get_the_time: get_the_time(tool_input) - Retrieves current time and clock. Input is always None.

To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [get_the_time] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: What time is it ?

# What would the AI reply?

Thought: Do I need to use a tool? Yes
Action: get_the_time
Action Input: None
Observation: 2023-06-27 07:53:07.714140
Thought:

> Finished chain.
Do I need to use a tool? No
AI: It is currently 7:53 AM.

> Finished chain.
[2023-06-27 07:53:08.753] DEBUG cat.looking_glass.py 397 (CheshireCat.__call__) => 'cat_message:'
[2023-06-27 07:53:08.763] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => {'ai_prefix': 'AI',
[2023-06-27 07:53:08.763] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'chat_history': '',
[2023-06-27 07:53:08.763] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'declarative_memory': '',
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'episodic_memory': '',
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'input': 'What time is it ?',
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'intermediate_steps': [(AgentAction(tool='get_the_time', tool_input='None', log='Thought: Do I need to use a tool? YesAction: get_the_timeAction Input: None'),
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '2023-06-27 07:53:07.714140')],
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'output': 'It is currently 7:53 AM.'}

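Before moving on, it helps to see what the agent executor is actually working with. The sketch below is a standard-library-only illustration of the `get_the_time` tool from the log above; the names and structure are illustrative, not the actual Cheshire Cat or LangChain API (in the real framework the tool is registered via a plugin decorator and its description is injected into the `# Tools:` section of the prompt).

```python
from datetime import datetime

# Hypothetical stand-in for the registered tool: a callable the agent
# executor invokes when the LLM emits "Action: get_the_time".
def get_the_time(tool_input=None):
    """Retrieves current time and clock. Input is always None."""
    return str(datetime.now())

# The executor keeps a registry mapping tool names to callables...
TOOLS = {"get_the_time": get_the_time}

# ...and renders each tool's description into the prompt, exactly as
# seen in the "# Tools:" section of the log above.
TOOL_DESCRIPTIONS = (
    "> get_the_time: get_the_time(tool_input) - "
    "Retrieves current time and clock. Input is always None."
)
```

When the LLM replies with an `Action:` line naming a registered tool, the executor looks the name up in the registry, calls it, and feeds the return value back into the prompt as the `Observation:` line.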
Now that we have a basic understanding of how a model should use the LangChain tools, let's go through the same steps with the Falcon-40B-instruct model:

[Screenshot: the Falcon-40B-instruct model is not able to use LangChain tools — no tools were used]

We see from the answer that the Falcon-40B-instruct model was not able to understand the agent prompt asking it to use a local tool, so the final answer is wrong. Here are the full logs for this interaction:

INFO:     connection open
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => {'prompt_settings': {'prefix': '',
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_declarative_memory': True,
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_episodic_memory': True,
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_procedural_memory': True},
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'text': 'What time is it ?'}
[2023-06-27 08:01:04.633] DEBUG cat.looking_glass.py 193 (CheshireCat.recall_relevant_memories_to_working_memory) => 'Recall query: "What time is it ?"'
[2023-06-27 08:01:04.658] INFO cat.looking_glass.py 71 (AgentManager.get_agent_executor) => 'Sending prompt'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ('You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => "You are curious, funny and talk like the Cheshire Cat from Alice's "
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'adventures in wonderland.'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'You answer Human using tools and context.'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Tools:'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'To use a tool, use the following format:'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? Yes'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action: the action to take /* should be one of [] */'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action Input: the input to the action'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Observation: the result of the action'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'When you have a response to say to the Human, or if you do not need to use a '
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'tool, you MUST use the format:'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? No'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'AI: [your response here]'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Context'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' '
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{episodic_memory}'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{declarative_memory}'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '## Conversation until now:{chat_history}'
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' - Human: {input}'
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# What would the AI reply?'
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{agent_scratchpad}')
Error in on_chain_start callback: 'name'
Error in on_chain_start callback: 'name'
Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:



To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: What time is it ?

# What would the AI reply?



> Finished chain.
[2023-06-27 08:01:05.289] ERROR cat.looking_glass.py 385 (CheshireCat.__call__) => 'LLM does not respect prompt instructions'
[2023-06-27 08:01:05.304] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => 'Could not parse LLM output: `<p>The AI would reply with the current time.</p>`'
[2023-06-27 08:01:05.318] DEBUG cat.looking_glass.py 397 (CheshireCat.__call__) => 'cat_message:'
[2023-06-27 08:01:05.327] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => {'input': 'What time is it ?',
[2023-06-27 08:01:05.328] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'intermediate_steps': [],
[2023-06-27 08:01:05.328] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'output': '<p>The AI would reply with the current time.</p>'}

We see from the logs that the model was not able to follow the instructions to request the invocation of a tool; it simply tried to complete the prompt with the text <p>The AI would reply with the current time.</p>.
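The "Could not parse LLM output" error in the log comes from the agent's output parser, which only accepts replies in the Thought/Action or Thought/AI format dictated by the prompt. The simplified, hypothetical parser below shows why gpt-35-turbo's replies are accepted while Falcon's HTML-flavored completion is rejected; LangChain's real parser differs in detail.

```python
import re

# Hypothetical sketch of a ReAct-style output parser; LangChain's
# actual implementation is more involved.
ACTION_RE = re.compile(
    r"Action:\s*(.+?)\s*$.*?Action Input:\s*(.+?)\s*$",
    re.MULTILINE | re.DOTALL,
)
FINAL_RE = re.compile(r"AI:\s*(.+)", re.DOTALL)

def parse(llm_output):
    # A final answer ("AI: ...") ends the chain.
    if match := FINAL_RE.search(llm_output):
        return ("final", match.group(1).strip())
    # A tool request ("Action: ... / Action Input: ...") triggers a tool call.
    if match := ACTION_RE.search(llm_output):
        return ("action", match.group(1).strip(), match.group(2).strip())
    # Anything else is the error we see in the Falcon log.
    raise ValueError(f"Could not parse LLM output: `{llm_output}`")

# gpt-35-turbo's reply follows the format and parses into a tool call:
parse("Thought: Do I need to use a tool? Yes\n"
      "Action: get_the_time\nAction Input: None")

# Falcon-40B-instruct's reply matches neither pattern, so the agent raises:
# parse("<p>The AI would reply with the current time.</p>")  # ValueError
```

Because the agent cannot map Falcon's free-form completion onto either pattern, it has no tool call to execute and surfaces the raw text as the output, which is exactly what the log shows.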

Conclusion

It is possible, and easy, to run local LLM models on AKS. With Azure you can evaluate models that require GPU hardware at an affordable price, paying only for the actual hours of usage.

The Falcon-40B-instruct model was not able to use LangChain Agents and Tools, limiting the use cases where this model can actually be used to develop an application. Building applications that connect to the REST API of the gpt-35-turbo or gpt-4 models offered by Azure OpenAI seems the most effective way forward for most use cases.


Customer Experience Engineer @ Microsoft — opinions and observations expressed in this blog post are my own.