VLM & Toys: An autonomous toy car with Generative AI (version 2)

Florian BERGAMASCO
Jul 9, 2024


Generative AI: another approach to driving an autonomous toy car.
This article follows my previous one, based on computer vision:
https://medium.com/totalenergies-digital-factory/ai-toys-how-to-explain-artificial-intelligence-to-children-with-autonomous-toy-car-e749e9931048

Dall-E generation via Copilot

A new approach based on language

Autonomous cars are one of the most challenging and promising areas of artificial intelligence (AI). They require a combination of techniques, such as visual perception, planning, control, navigation and communication, to enable a vehicle to move without human intervention in complex and dynamic environments.

In a previous article, I presented a common approach to enabling cars to navigate automatically: computer vision, which splits the problem into specialised sub-modules for object, colour and shape detection. This approach has the advantage of being easy to optimise and improve, but it also has several drawbacks, such as the need for annotated data and the difficulty of scaling to new objects.
In this article, we propose another, more recent and more innovative approach, based on the use of language as an interface between vision and action. This approach, built on Vision Language Models (VLMs), uses an AI model that can understand and generate text from images to guide autonomous driving.

What is a VLM?

A VLM is a type of AI model that combines two areas:
computer vision and natural language processing.
VLMs can process multimodal data, i.e. data that contains both images and text, and perform tasks such as generating descriptions, searching for images, summarising scenes, answering visual questions or performing visual translation. They use deep learning techniques to learn common or aligned representations between the visual and linguistic modalities, and to exploit the semantic and syntactic relationships between them.

This approach is part of generative AI, a branch of AI that aims to create new content, such as text, images, music or videos, based on some input or data.

A VLM generally consists of two main parts: an encoder and a decoder.

  • the encoder is responsible for transforming an image into a vector representation, called an embedding, which captures the relevant visual characteristics.
  • the decoder is responsible for transforming an embedding into a sequence of words, or vice versa, using an attention mechanism that focuses on the most important parts of the image or text.

They are trained on large quantities of data containing images and text annotations, enabling the model to learn to associate visual features with linguistic expressions.
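To make the encoder/decoder split concrete, here is a purely conceptual sketch. The encode_image and decode_to_text functions are hypothetical placeholders, not a real library API; they only illustrate the image -> embedding -> text flow that a VLM implements internally.

import numpy as np

def encode_image(image_pixels):
    # Hypothetical encoder: a real one is a deep vision network;
    # here we just flatten the pixels to suggest a fixed-size embedding vector.
    return image_pixels.astype(np.float32).flatten()[:512]

def decode_to_text(embedding, prompt):
    # Hypothetical decoder: a real one is a language model conditioned
    # on the embedding and the prompt.
    return "a red arrow pointing left on a white background"

image = np.zeros((375, 375, 3))        # placeholder image
embedding = encode_image(image)        # visual features
caption = decode_to_text(embedding, "describe the picture")
print(caption)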

One of the main challenges in developing VLMs is integrating visual and textual modalities in a coherent and efficient way.
Given these challenges, we're not going to train a VLM ourselves, but rather use models that have already been trained and made available.
To do this, we're going to use Ollama.

Prerequisite

The Raspberry Pi will be the brain of the mechanism.

  • Raspberry Pi 5 with enough RAM to fit the models and run them in (near) real time for an autonomous car. For other, less time-critical uses, a Raspberry Pi 4 may be sufficient (for VLMs under ~2B parameters).

To set this up, update the system and install Ollama:

sudo apt-get update
curl -fsSL https://ollama.com/install.sh | sh

Ollama is a cutting-edge technology solution designed for artificial intelligence and natural language processing enthusiasts. It provides a robust and flexible platform for running advanced language models locally, allowing users to benefit from the power of AI without the need for specialised hardware.

How to use Ollama?

Dall-E generation via Copilot

Among the VLMs that run locally on a Raspberry Pi, I have been able to test a few solutions, but the simplest, most powerful and easiest to use is Ollama.
Ollama hosts its own list of models to which you have access.

You can download these models locally on your Raspberry Pi, then interact with them via a command line. Alternatively, when you run the model, Ollama also runs an inference server hosted on port 11434 (by default) which you can interact with via APIs and other libraries.
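As a minimal sketch of that second option, here is how you could query the inference server from Python using only the standard library. It assumes the server is running on the default port 11434, that the moondream model used later in this article has already been pulled, and that the image path points to a frame saved by the capture script shown further down.

import base64
import json
import urllib.request

# Base64-encode a frame captured by the camera (path used later in this article)
with open("/home/florianbergamasco_RaspPi10/Desktop/AutonomousCar_2/opencv_0.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

# Build the request for the local Ollama inference server
payload = json.dumps({
    "model": "moondream",
    "prompt": "describe the picture and the arrow",
    "images": [img_b64],
    "stream": False
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"}
)

# Send the request and print the model's answer
with urllib.request.urlopen(req) as resp:
    answer = json.loads(resp.read())
print(answer["response"])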

How do you choose which model to use on your Raspberry Pi? As you will see, the characteristics of your Raspberry Pi will quickly limit the use of models with a large number of parameters.

Before choosing your model, here is a quick reminder of what a parameter and a weight are.

The differences between models

VLMs use internal parameters to learn natural-language features and rules. These parameters are adjusted during training based on the data provided to the model. Parameters can take the form of weights, biases or scaling factors that influence the behaviour and output of the model. The number of parameters in a VLM or LLM indicates its complexity and its potential to produce quality text.
Some VLMs have hundreds of millions or billions of parameters, enabling them to handle a wide variety of tasks and linguistic domains.

All the models are available here:
https://ollama.com/library

I tested various vision language models (VLMs) available on Ollama in order to assess their ability to generate quality text on a Raspberry Pi:
llava-phi3
llava-llama3
LLaVA

None of them could run on a Raspberry Pi; the only one that did is moondream2.
Moondream2 is a small vision language model (VLM) designed to run efficiently on low-memory devices. It can be used to analyse various types of images. However, it has limitations due to its small size, among other reasons:
it is not as powerful as the larger models listed above, and numerous iterations are required to adapt the prompt to the input data and thus meet the need. Feel free to test other models.
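Before using it, the model has to be downloaded once. As a small sketch, this can be scripted from Python with the same subprocess approach used later in this article:

import subprocess

# Download the moondream model once; later calls reuse the local copy
subprocess.run(["ollama", "pull", "moondream"], check=True)

# List the models available locally to confirm the download
print(subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout)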

How to interact with LLM models: the prompt

Dall-E generation via Copilot

A prompt is a written instruction given to the language model, regardless of the choice of model. The prompt is used to guide the VLM/LLM model in generating the desired content, indicating the purpose, subject, tone, style, format, etc. of the text to be produced.
A good prompt is therefore essential for obtaining a satisfactory response from the AI.
- A prompt must include clear and precise written instructions. Ambiguous wording, questions that are too broad or too vague, and contradictory or incomplete requests should be avoided. Simple, direct vocabulary should also be used, with no jargon or abbreviations.
- A prompt should provide sufficient context for the VLM/LLM model to understand the subject, audience, purpose and framework of the content to be generated. Context can include background information, reference sources, specific data, concrete examples, etc. Context helps the VLM/LLM to select relevant information, avoid factual errors, adapt tone and level of language, respect ethical or legal constraints, etc.
- A prompt must clearly define the task that the model is going to perform, i.e. the type of content to be generated, its length, structure, format, etc. The task must be formulated explicitly, in the form of an order, a question, a challenge, etc. The task must be achievable, i.e. within the capabilities of the VLM/LLM, which is not an expert in every field, nor a soothsayer, nor a magician.
- A prompt must indicate the format of the content to be generated, i.e. the way in which the text will be presented, organised, highlighted, etc. The format may include elements of typography, page layout, mark-up, punctuation, etc. The format must be consistent with the task, subject, audience and medium of the content to be generated. The format must be specified in the prompt, or deduced from the text preceding or following the prompt (for example, if the prompt is inserted into an existing document).
- A prompt should provide examples to the template, if possible, to illustrate the expected result, or to show what should be avoided. Examples can be extracts from similar texts, models, samples, evaluations, etc.

These guidelines are not strictly sequential, and do not need to be applied systematically. Rather, they should be seen as best practice, which can vary according to the needs, preferences and creativity of the user. The key is to create a prompt that is adapted to the situation, that clearly expresses the request, and that enables the VLM/LLM to generate the most appropriate content possible.
Don’t hesitate to iterate on your prompt so that you end up with the best formulation.

For example, for our use case:
“We would like, in French and in 10 words, the description of the picture with the path /florianbergamasco/picture.png”
A prompt might look something like this:
“Describe the picture /florianbergamasco/picture.png in French, in 10 words.”
1. Task = “the description of the picture”.
2. Persona = N/A.
3. Format = “in French in 10 words”.
4. Context = “the path /florianbergamasco/picture.png”.
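Putting those building blocks together in Python gives a prompt string you can pass to the model later on. This is just a minimal sketch; the path is the placeholder from the example above, not a real file.

# Building blocks of the prompt for our use case
task = "describe the picture"
output_format = "in French, in 10 words"
context = "/florianbergamasco/picture.png"  # placeholder path from the example

prompt = "{} {}: {}".format(task.capitalize(), output_format, context)
print(prompt)
# -> Describe the picture in French, in 10 words: /florianbergamasco/picture.png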

How can a VLM be used for autonomous driving?

The idea of using a VLM for autonomous driving is to replace the traditional computer vision pipeline with a language loop, in which the model receives an image of the road, describes it in simple terms, and generates a driving command based on the description. The driving command can be expressed as text, symbols or numerical values, depending on the desired level of abstraction.
For example:
- if the model receives an image showing an arrow, the objective will be to tilt the wheels in the direction indicated by the model;
- if the model receives an image showing a stop sign or a red light, the car will have to stop.
The main advantage of this approach is that it reduces the complexity of the autonomous driving problem by using language as a means of simplifying and structuring visual information without any specific learning.
Language also provides a natural way of communicating with human users, explaining the vehicle’s intentions and actions, or asking for clarification or instructions in the event of uncertainty or ambiguity.
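As a rough sketch of that language loop, the mapping from free-text description to driving command can be as simple as a keyword lookup. The keywords mirror the full script further down; the command names here are arbitrary labels for illustration.

def description_to_command(description):
    # Map the VLM's free-text description to a discrete driving command
    text = description.upper()
    if "STOP" in text or "HALT" in text:
        return "stop"
    if "LEFT" in text:
        return "turn_left"
    if "RIGHT" in text:
        return "turn_right"
    if "BACK" in text:
        return "reverse"
    return "forward"

# Example: the description of an arrow sign becomes a steering command
print(description_to_command("An arrow pointing to the left"))  # turn_left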

Dall-E generation via Copilot

How does it work in practice?

1. OpenCV

python3-opencv is an open-source library that includes several hundred computer vision algorithms. It allows you to use the webcam, process the images and transmit them to the model.

To use these libraries, run the following commands to install them:

sudo python3 -m pip install tflite-runtime --break-system-packages
sudo apt-get install python3-opencv

And then the Python script:


import time
import cv2

path = "/home/florianbergamasco_RaspPi10/Desktop/AutonomousCar_2/"

# Initialise the camera.
# If you have multiple cameras connected to the device,
# change the value of cam_port accordingly.
cam_port = 0
cam = cv2.VideoCapture(cam_port)
time.sleep(3)  # wait for the camera to start

previous_value = 0
img_counter = 0

while True:

    # Read the input from the camera
    result, image = cam.read()

    # If the image was captured without any error, save it
    if result:
        img_name = "opencv_{}.png".format(img_counter)

        # Resize the picture so it can be read by the VLM
        image = cv2.resize(image, (375, 375))

        cv2.imwrite(path + img_name, image)

        img_path = path + img_name
        img_counter += 1

    # If the captured image is corrupted, report it
    else:
        print("No image detected. Please try again")

    time.sleep(1)  # wait 1 second

2. OLLAMA — VLM

arg1="ollama"
arg2="run"
arg3="moondream"
arg4="describe the direction of the arrow in one word, left, back, upward, down of the picture: "+img_path

result = subprocess.run([arg1,arg2,arg3,arg4], capture_output=True, text=True).stdout.strip("\n")

However, I noticed that the moondream model wasn't stable with a detailed prompt, so I had to simplify the instructions, which lengthens the model's outputs.

arg1="ollama"
arg2="run"
arg3="moondream"
arg4="describe the picture and the arrow : "+img_path

result = subprocess.run([arg1,arg2,arg3,arg4], capture_output=True, text=True).stdout.strip("\n")

Always iterate on the prompt to find the right format.

Some examples:

The aim now is to translate the model outputs into a series of actions for the autonomous car:
- turn right
- turn left
- move forward
- move backwards
- stop
- forward

Here is the Python code, based on my previous article:


import time
import cv2
import os
import subprocess

path = "/home/florianbergamasco_RaspPi10/Desktop/AutonomousCar_2/"

# Initialise the camera.
# If you have multiple cameras connected to the device,
# change the value of cam_port accordingly.
cam_port = 0
cam = cv2.VideoCapture(cam_port)
time.sleep(3)  # wait for the camera to start

previous_value = 0
img_counter = 0

while True:

    # Read the input from the camera
    result, image = cam.read()

    # If the image was captured without any error, send it to the VLM
    if result:
        img_name = "opencv_{}.png".format(img_counter)

        # Resize the picture so it can be read by the VLM
        image = cv2.resize(image, (375, 375))

        cv2.imwrite(path + img_name, image)

        img_path = path + img_name
        img_counter += 1

        # Ask moondream to describe the arrow in the picture
        arg1 = "ollama"
        arg2 = "run"
        arg3 = "moondream"
        arg4 = "describe arrow : " + img_path

        result = subprocess.run([arg1, arg2, arg3, arg4], capture_output=True, text=True).stdout.strip("\n")
        print(result)

        # Map keywords in the model's output to car actions.
        # angle_to_percent, tourne_servoMotor, arretComplet, sens1 and sens2
        # are the motor-control functions from my previous article.
        if len(result.upper().split('LEFT')) > 1:
            # Action: turn left
            angle = angle_to_percent(60)  # left steering angle
            tourne_servoMotor(angle)

        if len(result.upper().split('RIGHT')) > 1:
            # Action: turn right
            angle = angle_to_percent(140)  # right steering angle
            tourne_servoMotor(angle)

        if len(result.upper().split('UPWARDS')) > 1:
            # Action: move forward
            angle = angle_to_percent(100)  # straight-ahead angle
            tourne_servoMotor(angle)

        if len(result.upper().split('BACK')) > 1:
            # Action: move backwards
            angle = angle_to_percent(100)  # straight-ahead angle
            tourne_servoMotor(angle)
            # stop & reverse the motor
            arretComplet()
            sens2()

        if len(result.upper().split('STOP')) > 1 or len(result.upper().split('HALT')) > 1:
            # Action: stop
            angle = angle_to_percent(100)  # straight-ahead angle
            tourne_servoMotor(angle)
            # stop the motor
            arretComplet()

        if len(result.upper().split('GREEN')) > 1 or len(result.upper().split('SAFE')) > 1:
            # Action: start moving forward again
            angle = angle_to_percent(100)  # straight-ahead angle
            tourne_servoMotor(angle)
            # start the motor
            sens1()

    # If the captured image is corrupted, report it
    else:
        print("No image detected. Please try again")

    time.sleep(1)  # wait 1 second

Conclusion

I hope this article has been useful and inspiring for you to use VLMs in your own projects; these technologies open up a huge range of possibilities!
Please don’t hesitate to contact me if you have any comments, questions or suggestions for improvement.
Have fun with your autonomous Lego car version 2 ^^!
