Running the Falcon-40B-Instruct model on Azure Kubernetes Service

Saverio Proto
Published in Microsoft Azure
Jun 27, 2023 · 23 min read

In today’s world, it has become remarkably easy to develop applications that use large language models by calling a REST API, thanks to the availability of services like Azure OpenAI or openai.com.

But is calling a model offered by a third party via a REST API the only way to go? How challenging is it to run an alternative open-source model, like the Falcon-40B-Instruct model, on a Kubernetes cluster with local GPUs?

In this article, I will show how to run the Falcon-40B-Instruct model on Azure Kubernetes Service (AKS) and test whether it can really be used as an alternative to the gpt-35-turbo or gpt-4 models.

Create a cluster with GPUs

First, create an AKS cluster. You are free to create a new cluster or work with an existing one. What really matters is creating a new node pool with a VM SKU that has a local GPU. I used the NC A100 v4-series virtual machines.

To run my experiment I used the AKS GPU image (in preview as of June 2023).

#!/bin/bash

# Enable AKS preview feature,
# Detailed instructions at:
# https://learn.microsoft.com/en-us/azure/aks/gpu-cluster#update-your-cluster-to-use-the-aks-gpu-image-preview

az extension add --name aks-preview
az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
az provider register --namespace Microsoft.ContainerService

# Create a Nodepool where each VM has 2 Nvidia A100 GPUs
# More SKUs options here:
# https://learn.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series

az aks nodepool add \
--resource-group <group> \
--cluster-name <name> \
--name gpunp \
--node-count 1 \
--node-vm-size Standard_NC48ads_A100_v4 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true

To run the Falcon-40B-Instruct model you need at least the Standard_NC48ads_A100_v4 SKU, with a total of 160 GB of GPU memory (2 × 80 GB).
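As a rough sanity check on that sizing: assuming the weights are loaded in 16-bit precision (2 bytes per parameter), the 40B parameters alone occupy about 80 GB before counting activations and the KV cache. That is why a single 80 GB A100 is not enough and the model must be sharded across two GPUs:

```python
# Back-of-the-envelope GPU memory estimate for the Falcon-40B weights,
# assuming 16-bit (bf16/fp16) precision: 2 bytes per parameter.
PARAMS = 40e9
BYTES_PER_PARAM = 2

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # weights alone, in GB
per_gpu_gb = weights_gb / 2                  # per GPU when sharded on 2 GPUs

print(weights_gb, per_gpu_gb)  # 80.0 40.0
```

The ~40 GB of weights per GPU is consistent with the ~44 GiB per GPU reported by nvidia-smi later in this article, with the extra memory going to activations and buffers.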

Run a Large Language Model in a Kubernetes Pod

The Falcon-40B-Instruct model is available on the huggingface.co hub, a platform that provides a centralized repository for pretrained models and datasets, enabling seamless sharing, collaboration, and accessibility.

To run the model I used the HuggingFace Text Generation Inference container. It packages a Rust, Python and gRPC server for text generation inference that can download models at runtime from the huggingface.co hub and exposes a REST API to interact with the model.

Here are the necessary Kubernetes YAML definitions:

---
apiVersion: v1
kind: Pod
metadata:
  name: text-generation-inference
  labels:
    run: text-generation-inference
spec:
  containers:
  - name: text-generation-inference
    image: ghcr.io/huggingface/text-generation-inference:0.8.2
    env:
    - name: RUST_BACKTRACE
      value: "1"
    command:
    - "text-generation-launcher"
    - "--model-id"
    - "tiiuae/falcon-40b-instruct"
    - "--num-shard"
    - "2"
    ports:
    - containerPort: 80
      name: http
    volumeMounts:
    - name: falcon-40b-instruct
      mountPath: /data
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: falcon-40b-instruct
    persistentVolumeClaim:
      claimName: falcon-40b-instruct
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
  nodeSelector:
    agentpool: gpunp
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule
  restartPolicy: Never
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: falcon-40b-instruct
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-csi-premium
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: text-generation-inference
  type: ClusterIP

You can see the Pod is named text-generation-inference and runs the container image ghcr.io/huggingface/text-generation-inference:0.8.2. The Pod has a nodeSelector and tolerations so that it is scheduled on our node pool named gpunp. I attached a 500 GB disk to store the Falcon-40B-Instruct files, which are downloaded the first time the Pod starts. The --num-shard parameter is necessary to use both GPUs on the VM. The ClusterIP Service makes the REST API available to the other Pods in the cluster at the URL http://text-generation-inference.

This is what the boot looks like in the Pod logs:

2023-06-23T11:49:26.129901Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: Some(2), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-23T11:49:26.129930Z INFO text_generation_launcher: Sharding model on 2 processes
2023-06-23T11:49:26.130005Z INFO text_generation_launcher: Starting download process.
2023-06-23T11:49:27.969537Z WARN download: text_generation_launcher: No safetensors weights found for model tiiuae/falcon-40b-instruct at revision None. Downloading PyTorch weights.
2023-06-23T11:49:27.991221Z INFO download: text_generation_launcher: Download file: pytorch_model-00001-of-00009.bin
2023-06-23T11:49:35.672440Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00001-of-00009.bin in 0:00:07.
2023-06-23T11:49:35.672515Z INFO download: text_generation_launcher: Download: [1/9] -- ETA: 0:00:56
2023-06-23T11:49:35.672741Z INFO download: text_generation_launcher: Download file: pytorch_model-00002-of-00009.bin
2023-06-23T11:49:43.702957Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00002-of-00009.bin in 0:00:08.
2023-06-23T11:49:43.703027Z INFO download: text_generation_launcher: Download: [2/9] -- ETA: 0:00:52.500000
2023-06-23T11:49:43.703242Z INFO download: text_generation_launcher: Download file: pytorch_model-00003-of-00009.bin
2023-06-23T11:49:51.480340Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00003-of-00009.bin in 0:00:07.
2023-06-23T11:49:51.480447Z INFO download: text_generation_launcher: Download: [3/9] -- ETA: 0:00:46.000002
2023-06-23T11:49:51.480648Z INFO download: text_generation_launcher: Download file: pytorch_model-00004-of-00009.bin
2023-06-23T11:49:59.122129Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00004-of-00009.bin in 0:00:07.
2023-06-23T11:49:59.122210Z INFO download: text_generation_launcher: Download: [4/9] -- ETA: 0:00:38.750000
2023-06-23T11:49:59.122443Z INFO download: text_generation_launcher: Download file: pytorch_model-00005-of-00009.bin
2023-06-23T11:50:08.714867Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00005-of-00009.bin in 0:00:09.
2023-06-23T11:50:08.714954Z INFO download: text_generation_launcher: Download: [5/9] -- ETA: 0:00:32
2023-06-23T11:50:08.715220Z INFO download: text_generation_launcher: Download file: pytorch_model-00006-of-00009.bin
2023-06-23T11:50:16.546692Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00006-of-00009.bin in 0:00:07.
2023-06-23T11:50:16.546824Z INFO download: text_generation_launcher: Download: [6/9] -- ETA: 0:00:24
2023-06-23T11:50:16.547217Z INFO download: text_generation_launcher: Download file: pytorch_model-00007-of-00009.bin
2023-06-23T11:50:24.402959Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00007-of-00009.bin in 0:00:07.
2023-06-23T11:50:24.403060Z INFO download: text_generation_launcher: Download: [7/9] -- ETA: 0:00:16
2023-06-23T11:50:24.403370Z INFO download: text_generation_launcher: Download file: pytorch_model-00008-of-00009.bin
2023-06-23T11:50:32.654766Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00008-of-00009.bin in 0:00:08.
2023-06-23T11:50:32.654854Z INFO download: text_generation_launcher: Download: [8/9] -- ETA: 0:00:08
2023-06-23T11:50:32.655187Z INFO download: text_generation_launcher: Download file: pytorch_model-00009-of-00009.bin
2023-06-23T11:50:40.012098Z INFO download: text_generation_launcher: Downloaded /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00009-of-00009.bin in 0:00:07.
2023-06-23T11:50:40.012184Z INFO download: text_generation_launcher: Download: [9/9] -- ETA: 0
2023-06-23T11:50:40.012354Z WARN download: text_generation_launcher: No safetensors weights found for model tiiuae/falcon-40b-instruct at revision None. Converting PyTorch weights to safetensors.
2023-06-23T11:50:40.012619Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00001-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00001-of-00009.safetensors.
2023-06-23T11:51:15.059204Z INFO download: text_generation_launcher: Convert: [1/9] -- Took: 0:00:35.046275
2023-06-23T11:51:15.060440Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00002-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00002-of-00009.safetensors.
2023-06-23T11:52:11.114788Z INFO download: text_generation_launcher: Convert: [2/9] -- Took: 0:00:56.054288
2023-06-23T11:52:11.115912Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00003-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00003-of-00009.safetensors.
2023-06-23T11:53:06.377099Z INFO download: text_generation_launcher: Convert: [3/9] -- Took: 0:00:55.261088
2023-06-23T11:53:06.377724Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00004-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00004-of-00009.safetensors.
2023-06-23T11:54:01.757724Z INFO download: text_generation_launcher: Convert: [4/9] -- Took: 0:00:55.379758
2023-06-23T11:54:01.758025Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00005-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00005-of-00009.safetensors.
2023-06-23T11:54:57.161272Z INFO download: text_generation_launcher: Convert: [5/9] -- Took: 0:00:55.403175
2023-06-23T11:54:57.162452Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00006-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00006-of-00009.safetensors.
2023-06-23T11:55:52.770926Z INFO download: text_generation_launcher: Convert: [6/9] -- Took: 0:00:55.608191
2023-06-23T11:55:52.771190Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00007-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00007-of-00009.safetensors.
2023-06-23T11:56:48.227538Z INFO download: text_generation_launcher: Convert: [7/9] -- Took: 0:00:55.456234
2023-06-23T11:56:48.228125Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00008-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00008-of-00009.safetensors.
2023-06-23T11:57:43.721701Z INFO download: text_generation_launcher: Convert: [8/9] -- Took: 0:00:55.493290
2023-06-23T11:57:43.721794Z INFO download: text_generation_launcher: Convert /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/pytorch_model-00009-of-00009.bin to /data/models--tiiuae--falcon-40b-instruct/snapshots/1e7fdcc9f45d13704f3826e99937917e007cd975/model-00009-of-00009.safetensors.
2023-06-23T11:58:23.846104Z INFO download: text_generation_launcher: Convert: [9/9] -- Took: 0:00:40.124101
2023-06-23T11:58:24.349803Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-23T11:58:24.349980Z INFO text_generation_launcher: Starting shard 0
2023-06-23T11:58:24.350299Z INFO text_generation_launcher: Starting shard 1
2023-06-23T11:58:34.363221Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-23T11:58:34.363426Z INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-23T11:58:39.349892Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
rank=1
2023-06-23T11:58:39.368956Z INFO text_generation_launcher: Shard 1 ready in 15.017869656s
2023-06-23T11:58:39.402001Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
rank=0
2023-06-23T11:58:39.467973Z INFO text_generation_launcher: Shard 0 ready in 15.116886716s
2023-06-23T11:58:39.565405Z INFO text_generation_launcher: Starting Webserver
2023-06-23T11:58:40.090837Z INFO text_generation_router: router/src/main.rs:178: Connected

Once the model is running, you should see the memory usage equally distributed between the two GPUs. Run kubectl exec -ti text-generation-inference -- /bin/bash to obtain a shell in the Pod, then run the nvidia-smi utility to inspect the GPUs:

root@text-generation-inference:/usr/src# nvidia-smi
Tue Jun 27 07:38:12 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000001:00:00.0 Off | 0 |
| N/A 33C P0 74W / 300W | 44419MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000002:00:00.0 Off | 0 |
| N/A 32C P0 70W / 300W | 44419MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

Simple testing with curl

The simplest test is to connect to the model API directly.

Create a Pod in the cluster; you can use any image that has curl, plus jq if you want nicely formatted JSON output:

kubectl run -ti --rm --image=nicolaka/netshoot shell -- /bin/bash

Now connect to the Kubernetes Service text-generation-inference:

shell:~# curl -s http://text-generation-inference/generate \
-X POST \
-d '{"inputs":"Can you give me step to step instructions to prepare Tiramisu?","parameters":{"max_new_tokens":1000}}' \
-H 'Content-Type: application/json' | jq -r .generated_text

Sure! Here are the steps to prepare Tiramisu:

Ingredients:
- 3 eggs
- 1/2 cup sugar
- 1/2 cup mascarpone cheese
- 1/2 cup heavy cream
- 1/4 cup espresso
- 1/4 cup rum
- 1/2 cup ladyfingers
- 1/4 cup cocoa powder

Instructions:
1. Separate the eggs and beat the yolks with sugar until light and creamy.
2. In a separate bowl, beat the egg whites until stiff peaks form.
3. In another bowl, mix the mascarpone cheese and heavy cream until smooth.
4. Add the egg yolk mixture to the cheese mixture and mix well.
5. In a shallow dish, mix the espresso and rum.
6. Dip the ladyfingers in the espresso mixture and arrange them in a 9x9 inch baking dish.
7. Spread half of the cheese mixture over the ladyfingers.
8. Repeat steps 6 and 7 with the remaining ladyfingers and cheese mixture.
9. Sprinkle cocoa powder over the top.
10. Cover and refrigerate for at least 4 hours.
11. Serve chilled and enjoy!
shell:~#

The Falcon-40B-instruct model is up and running correctly.
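The same request can of course be sent from application code instead of curl. Here is a minimal Python sketch using only the standard library; the URL and payload mirror the curl example above, and the request must be sent from a Pod inside the cluster for the Service name to resolve:

```python
import json
import urllib.request

def build_generate_request(prompt,
                           max_new_tokens=1000,
                           url="http://text-generation-inference/generate"):
    """Build the same POST request the curl example sends to the
    TGI /generate endpoint."""
    payload = {"inputs": prompt,
               "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# From inside the cluster you would then run:
# with urllib.request.urlopen(build_generate_request("Hello!")) as resp:
#     print(json.loads(resp.read())["generated_text"])
```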

Advanced testing with the Cheshire Cat AI

I want to test the local model with more advanced use cases, like the use of LangChain tools. I published the kube-cheshire-cat project on GitHub to make it easy to install the Cheshire Cat AI on Kubernetes; the Cheshire Cat is a LangChain-based framework to build custom AIs on top of any language model. I patched the framework to add support for models exposed with the HuggingFace Text Generation Inference container.

Using the latest version of the Cheshire Cat AI, I can configure my local language model in the local web interface:

We can repeat our previous test using the chat window in the admin UI to make sure everything is configured correctly:

It works, but the answer is a bit shorter than the one obtained with the direct curl request. This is because the prompt is not identical: the Cheshire Cat takes our input and builds a more complex prompt, because it uses a LangChain Agent, as you can read in the core container logs:

INFO:     connection open
[2023-06-27 07:45:51.014] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => {'prompt_settings': {'prefix': '',
[2023-06-27 07:45:51.014] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_declarative_memory': True,
[2023-06-27 07:45:51.014] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_episodic_memory': True,
[2023-06-27 07:45:51.015] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_procedural_memory': True},
[2023-06-27 07:45:51.015] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'text': 'Can you give me step to step instructions to prepare Tiramisu?'}
[2023-06-27 07:45:51.029] DEBUG cat.looking_glass.py 193 (CheshireCat.recall_relevant_memories_to_working_memory) => 'Recall query: "Can you give me step to step instructions to prepare Tiramisu?"'
[2023-06-27 07:45:51.057] INFO cat.looking_glass.py 71 (AgentManager.get_agent_executor) => 'Sending prompt'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ('You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => "You are curious, funny and talk like the Cheshire Cat from Alice's "
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'adventures in wonderland.'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'You answer Human using tools and context.'
[2023-06-27 07:45:51.072] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Tools:'
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'To use a tool, use the following format:'
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.073] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? Yes'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action: the action to take /* should be one of [] */'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action Input: the input to the action'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Observation: the result of the action'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'When you have a response to say to the Human, or if you do not need to use a '
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'tool, you MUST use the format:'
[2023-06-27 07:45:51.074] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? No'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'AI: [your response here]'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Context'
[2023-06-27 07:45:51.075] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' '
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{episodic_memory}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{declarative_memory}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '## Conversation until now:{chat_history}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' - Human: {input}'
[2023-06-27 07:45:51.076] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.077] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# What would the AI reply?'
[2023-06-27 07:45:51.077] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:45:51.077] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{agent_scratchpad}')
Error in on_chain_start callback: 'name'
Error in on_chain_start callback: 'name'
Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:



To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: Can you give me step to step instructions to prepare Tiramisu?

# What would the AI reply?



> Finished chain.
[2023-06-27 07:45:56.531] ERROR cat.looking_glass.py 385 (CheshireCat.__call__) => 'LLM does not respect prompt instructions'
[2023-06-27 07:45:56.544] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => ('Could not parse LLM output: `'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => 'Sure! Here are the steps to prepare Tiramisu:'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => ''
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '1. Brew espresso and let it cool.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '2. Beat egg yolks with sugar until pale and creamy.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '3. Add mascarpone cheese and beat until smooth.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '4. In a separate bowl, beat egg whites until stiff peaks form.'
[2023-06-27 07:45:56.545] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '5. Fold egg whites into the mascarpone mixture.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '6. Dip ladyfingers in espresso and arrange them in a dish.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '7. Pour half of the mascarpone mixture over the ladyfingers.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '8. Repeat steps 6 and 7.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '9. Dust with cocoa powder and refrigerate for at least 4 hours.'
[2023-06-27 07:45:56.546] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => '10. Enjoy!`')
[2023-06-27 07:45:56.559] DEBUG cat.looking_glass.py 397 (CheshireCat.__call__) => 'cat_message:'
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => {'input': 'Can you give me step to step instructions to prepare Tiramisu?',
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'intermediate_steps': [],
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'output': ''
[2023-06-27 07:45:56.569] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'Sure! Here are the steps to prepare Tiramisu:'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => ''
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '1. Brew espresso and let it cool.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '2. Beat egg yolks with sugar until pale and creamy.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '3. Add mascarpone cheese and beat until smooth.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '4. In a separate bowl, beat egg whites until stiff peaks form.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '5. Fold egg whites into the mascarpone mixture.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '6. Dip ladyfingers in espresso and arrange them in a dish.'
[2023-06-27 07:45:56.570] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '7. Pour half of the mascarpone mixture over the ladyfingers.'
[2023-06-27 07:45:56.571] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '8. Repeat steps 6 and 7.'
[2023-06-27 07:45:56.571] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '9. Dust with cocoa powder and refrigerate for at least 4 hours.'
[2023-06-27 07:45:56.571] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '10. Enjoy!'}

The logs above show the first sign of trouble: the LLM is not respecting the prompt instructions, and LangChain throws an OutputParserException because the LangChain agent cannot parse the expected output. The Cheshire Cat is able to handle this situation and move forward, but the agent functionality might be impacted.
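To see why parsing fails, here is a minimal sketch of the kind of output parsing a ReAct-style agent performs (a simplified illustration, not LangChain's actual parser): the agent expects either an Action block or an "AI:" final answer, and Falcon's raw recipe text contains neither marker.

```python
# Simplified sketch of ReAct-style agent output parsing
# (an illustration of the idea, not LangChain's real parser).
def parse_agent_output(text: str):
    if "AI:" in text:
        # Final answer addressed to the human.
        return ("final", text.split("AI:", 1)[1].strip())
    if "Action:" in text and "Action Input:" in text:
        # The model wants to call a tool.
        action = text.split("Action:", 1)[1].splitlines()[0].strip()
        return ("tool", action)
    raise ValueError(f"Could not parse LLM output: `{text[:40]}`")

# A compliant completion parses cleanly:
parse_agent_output("Thought: Do I need to use a tool? No\nAI: It is tea time!")

# Falcon's raw answer ("Sure! Here are the steps to prepare Tiramisu: ...")
# contains neither marker, so a parser like this raises an exception,
# matching the error seen in the logs above.
```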

Let’s now check whether the Falcon-40B-Instruct model can use LangChain tools. I will ask the question “What time is it?”, which the model can answer only by using a tool, because the current time is unknown to the model.

To better explain how tools should work, let’s first see what happens when using Azure OpenAI gpt-35-turbo:

The model gpt-35-turbo is able to return the current time using a tool
In the sidebar we can confirm the get_the_time tool was used to answer the question

The model can return the current time using the get_the_time tool:

from datetime import datetime

@tool
def get_the_time(tool_input, cat):
    """Retrieves current time and clock. Input is always None."""

    return str(datetime.now())

The key idea is to send the model a prompt with the list of tools that the Agent can call in case additional information is required to answer the human’s question.
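This mechanism can be sketched as follows: the tools list is simply rendered as text into the prompt, mirroring the "# Tools:" block visible in the agent logs (a simplified illustration, not the framework's actual code):

```python
def render_tools_section(tools):
    """Render a '# Tools:' prompt block from (name, description) pairs,
    mimicking the format seen in the Cheshire Cat agent logs."""
    lines = ["# Tools:", ""]
    for name, description in tools:
        lines.append(f"> {name}: {name}(tool_input) - {description}")
    lines.append("")
    # The allowed action names are spelled out in the instructions block:
    names = ", ".join(name for name, _ in tools)
    lines.append(f"Action: the action to take /* should be one of [{names}] */")
    return "\n".join(lines)

print(render_tools_section([
    ("get_the_time", "Retrieves current time and clock. Input is always None."),
]))
```

With an empty tools list the allowed-actions brackets render as "[]", which is exactly what the earlier Falcon test prompt showed.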

Here is the complete log that demonstrates how the gpt-35-turbo model was able to use the offered tool:

INFO:     connection open
[2023-06-27 07:53:05.927] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => {'prompt_settings': {'prefix': '',
[2023-06-27 07:53:05.927] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_declarative_memory': True,
[2023-06-27 07:53:05.928] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_episodic_memory': True,
[2023-06-27 07:53:05.928] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_procedural_memory': True},
[2023-06-27 07:53:05.928] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'text': 'What time is it ?'}
[2023-06-27 07:53:05.942] DEBUG cat.looking_glass.py 193 (CheshireCat.recall_relevant_memories_to_working_memory) => 'Recall query: "What time is it ?"'
[2023-06-27 07:53:06.240] INFO cat.looking_glass.py 71 (AgentManager.get_agent_executor) => 'Sending prompt'
[2023-06-27 07:53:06.253] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ('You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => "You are curious, funny and talk like the Cheshire Cat from Alice's "
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'adventures in wonderland.'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'You answer Human using tools and context.'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Tools:'
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.254] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '> get_the_time: get_the_time(tool_input) - Retrieves current time and clock. '
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Input is always None.'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'To use a tool, use the following format:'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? Yes'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action: the action to take /* should be one of [get_the_time] */'
[2023-06-27 07:53:06.255] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action Input: the input to the action'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Observation: the result of the action'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'When you have a response to say to the Human, or if you do not need to use a '
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'tool, you MUST use the format:'
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.256] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? No'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'AI: [your response here]'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Context'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' '
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{episodic_memory}'
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.257] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{declarative_memory}'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '## Conversation until now:{chat_history}'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' - Human: {input}'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# What would the AI reply?'
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 07:53:06.258] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{agent_scratchpad}')
Error in on_chain_start callback: 'name'
Error in on_chain_start callback: 'name'
Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:

> get_the_time: get_the_time(tool_input) - Retrieves current time and clock. Input is always None.

To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [get_the_time] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: What time is it ?

# What would the AI reply?



> Finished chain.
Thought: Do I need to use a tool? Yes
Action: get_the_time
Action Input: None
Observation: 2023-06-27 07:53:07.714140
Error in on_chain_start callback: 'name'
Thought:Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:

> get_the_time: get_the_time(tool_input) - Retrieves current time and clock. Input is always None.

To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [get_the_time] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: What time is it ?

# What would the AI reply?

Thought: Do I need to use a tool? Yes
Action: get_the_time
Action Input: None
Observation: 2023-06-27 07:53:07.714140
Thought:

> Finished chain.
Do I need to use a tool? No
AI: It is currently 7:53 AM.

> Finished chain.
[2023-06-27 07:53:08.753] DEBUG cat.looking_glass.py 397 (CheshireCat.__call__) => 'cat_message:'
[2023-06-27 07:53:08.763] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => {'ai_prefix': 'AI',
[2023-06-27 07:53:08.763] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'chat_history': '',
[2023-06-27 07:53:08.763] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'declarative_memory': '',
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'episodic_memory': '',
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'input': 'What time is it ?',
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'intermediate_steps': [(AgentAction(tool='get_the_time', tool_input='None', log='Thought: Do I need to use a tool? YesAction: get_the_timeAction Input: None'),
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => '2023-06-27 07:53:07.714140')],
[2023-06-27 07:53:08.764] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'output': 'It is currently 7:53 AM.'}

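Before moving on, it helps to see what the agent executor is actually working with. The sketch below is a standard-library-only illustration of the `get_the_time` tool from the log above; the names and structure are illustrative, not the actual Cheshire Cat or LangChain API (in the real framework the tool is registered via a plugin decorator and its description is injected into the `# Tools:` section of the prompt).

```python
from datetime import datetime

# Hypothetical stand-in for the registered tool: a callable the agent
# executor invokes when the LLM emits "Action: get_the_time".
def get_the_time(tool_input=None):
    """Retrieves current time and clock. Input is always None."""
    return str(datetime.now())

# The executor keeps a registry mapping tool names to callables...
TOOLS = {"get_the_time": get_the_time}

# ...and renders each tool's description into the prompt, exactly as
# seen in the "# Tools:" section of the log above.
TOOL_DESCRIPTIONS = (
    "> get_the_time: get_the_time(tool_input) - "
    "Retrieves current time and clock. Input is always None."
)
```

When the LLM replies with an `Action:` line naming a registered tool, the executor looks the name up in the registry, calls it, and feeds the return value back into the prompt as the `Observation:` line.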
Now that we have a basic understanding of how a model should use the LangChain tools, let's go through the same steps with the Falcon-40B-instruct model:

[Screenshot: the Falcon-40B-instruct model is not able to use LangChain tools — no tools were used]

We see from the answer that the Falcon-40B-instruct model was not able to understand the agent prompt asking it to use a local tool, so the final answer is wrong. Here are the full logs for this interaction:

INFO:     connection open
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => {'prompt_settings': {'prefix': '',
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_declarative_memory': True,
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_episodic_memory': True,
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'use_procedural_memory': True},
[2023-06-27 08:01:04.618] INFO cat.looking_glass.py 336 (CheshireCat.__call__) => 'text': 'What time is it ?'}
[2023-06-27 08:01:04.633] DEBUG cat.looking_glass.py 193 (CheshireCat.recall_relevant_memories_to_working_memory) => 'Recall query: "What time is it ?"'
[2023-06-27 08:01:04.658] INFO cat.looking_glass.py 71 (AgentManager.get_agent_executor) => 'Sending prompt'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ('You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => "You are curious, funny and talk like the Cheshire Cat from Alice's "
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'adventures in wonderland.'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'You answer Human using tools and context.'
[2023-06-27 08:01:04.672] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Tools:'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'To use a tool, use the following format:'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? Yes'
[2023-06-27 08:01:04.673] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action: the action to take /* should be one of [] */'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Action Input: the input to the action'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Observation: the result of the action'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'When you have a response to say to the Human, or if you do not need to use a '
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'tool, you MUST use the format:'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'Thought: Do I need to use a tool? No'
[2023-06-27 08:01:04.674] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => 'AI: [your response here]'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '```'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# Context'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' '
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{episodic_memory}'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{declarative_memory}'
[2023-06-27 08:01:04.675] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '## Conversation until now:{chat_history}'
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ' - Human: {input}'
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '# What would the AI reply?'
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => ''
[2023-06-27 08:01:04.676] DEBUG cat.looking_glass.py 72 (AgentManager.get_agent_executor) => '{agent_scratchpad}')
Error in on_chain_start callback: 'name'
Error in on_chain_start callback: 'name'
Prompt after formatting:
You are the Cheshire Cat AI, an intelligent AI that passes the Turing test.
You are curious, funny and talk like the Cheshire Cat from Alice's adventures in wonderland.
You answer Human using tools and context.

# Tools:



To use a tool, use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take /* should be one of [] */
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```

# Context





## Conversation until now:
- Human: What time is it ?

# What would the AI reply?



> Finished chain.
[2023-06-27 08:01:05.289] ERROR cat.looking_glass.py 385 (CheshireCat.__call__) => 'LLM does not respect prompt instructions'
[2023-06-27 08:01:05.304] ERROR cat.looking_glass.py 386 (CheshireCat.__call__) => 'Could not parse LLM output: `<p>The AI would reply with the current time.</p>`'
[2023-06-27 08:01:05.318] DEBUG cat.looking_glass.py 397 (CheshireCat.__call__) => 'cat_message:'
[2023-06-27 08:01:05.327] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => {'input': 'What time is it ?',
[2023-06-27 08:01:05.328] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'intermediate_steps': [],
[2023-06-27 08:01:05.328] DEBUG cat.looking_glass.py 398 (CheshireCat.__call__) => 'output': '<p>The AI would reply with the current time.</p>'}

We see from the logs that the model was not able to follow the instructions to request the invocation of a tool; it simply tried to complete the prompt with the text <p>The AI would reply with the current time.</p>.
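The "Could not parse LLM output" error in the log comes from the agent's output parser, which only accepts replies in the Thought/Action or Thought/AI format dictated by the prompt. The simplified, hypothetical parser below shows why gpt-35-turbo's replies are accepted while Falcon's HTML-flavored completion is rejected; LangChain's real parser differs in detail.

```python
import re

# Hypothetical sketch of a ReAct-style output parser; LangChain's
# actual implementation is more involved.
ACTION_RE = re.compile(
    r"Action:\s*(.+?)\s*$.*?Action Input:\s*(.+?)\s*$",
    re.MULTILINE | re.DOTALL,
)
FINAL_RE = re.compile(r"AI:\s*(.+)", re.DOTALL)

def parse(llm_output):
    # A final answer ("AI: ...") ends the chain.
    if match := FINAL_RE.search(llm_output):
        return ("final", match.group(1).strip())
    # A tool request ("Action: ... / Action Input: ...") triggers a tool call.
    if match := ACTION_RE.search(llm_output):
        return ("action", match.group(1).strip(), match.group(2).strip())
    # Anything else is the error we see in the Falcon log.
    raise ValueError(f"Could not parse LLM output: `{llm_output}`")

# gpt-35-turbo's reply follows the format and parses into a tool call:
parse("Thought: Do I need to use a tool? Yes\n"
      "Action: get_the_time\nAction Input: None")

# Falcon-40B-instruct's reply matches neither pattern, so the agent raises:
# parse("<p>The AI would reply with the current time.</p>")  # ValueError
```

Because the agent cannot map Falcon's free-form completion onto either pattern, it has no tool call to execute and surfaces the raw text as the output, which is exactly what the log shows.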

Conclusion

It is possible, and easy, to run local LLM models on AKS. With Azure you can evaluate models that require GPU hardware at an affordable price, paying only for the actual hours of usage.

The Falcon-40B-instruct model was not able to use LangChain Agents and Tools, limiting the use cases where this model can actually be used to develop an application. Building applications that connect to the REST API of the gpt-35-turbo or gpt-4 models offered by Azure OpenAI seems the most effective way forward for most use cases.


Customer Experience Engineer @ Microsoft — opinions and observations expressed in this blog post are my own.