Showcasing Code Llama, GPT-3.5 Instruct, and GPT-4 for generating data visualisations

Integrating an open-source large language model into the Streamlit app Chat2VIS to compare data visualisations generated using natural language text

Paula Maddigan
10 min readOct 17, 2023

Generative AI is moving at lightning speed ⚡️, and you don’t want to blink. New LLMs brimming with exciting features consistently seize the headlines of my news feeds.

With Chat2VIS, you can use natural language to prompt up to 5 LLMs to generate Python code to build plots from a dataset. (Learn more about why I built Chat2VIS from the journal article.)

I wanted to put 3 of the latest LLMs to the test, comparing their performance in generating code for various visualisations. From creating bar charts and time series plots to handling misspelled words and ambiguous prompts, I uncover how each model responds.

The results provide interesting insights into the strengths and limitations of these models, with a focus on Code Llama’s potential and the benefits of GPT-3.5 Instruct and GPT-4.

Why I chose to compare Code Llama and OpenAI models

Open-source models like Code Llama are free to use and easier to fine-tune on your own data. OpenAI models are easy to use “out of the box”, with some now available for fine-tuning, but they come at a cost. Historically, I’ve faced challenges using open-source models in Chat2VIS, finding they often misunderstood the request or failed completely to generate accurate Python code.

When Code Llama, an LLM tuned for code generation, was released, I was keen to see how it compared to the OpenAI models. I’ve been impressed with Code Llama, which shows great potential for this task without incurring the cost associated with the OpenAI models.

I have opted for the “Instruct” fine-tuned variation of Code Llama. It aligns well with the existing prompt style, which issues instructions in natural language followed by the beginning of a Python code script.
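
To illustrate that style, here’s a rough sketch of what such a prompt can look like: natural-language instructions describing the dataframe and the task, followed by the opening lines of the script the model should complete. The wording, column names, and file name below are illustrative rather than the exact Chat2VIS prompt.

"""
Use a dataframe called df containing the columns 'Product Type' and 'Price'.
Label the x and y axes appropriately and add a title.
Create a script using the dataframe df to graph the following:
What is the highest price of product, grouped by product type? Show a bar chart.
"""
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("department_store.csv")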

Let’s see how it stacks up against OpenAI’s GPT-4 and their recent release of GPT-3.5 Instruct.

Quick overview of Chat2VIS

Before we begin comparing LLMs, here’s a look again at how Chat2VIS works (check out the full Streamlit or Medium blog posts to learn more).

Chat2VIS Architecture (Image by author published on Streamlit)

6 case studies using Chat2VIS to compare Code Llama vs. GPT-3.5 Instruct and GPT-4

Using Chat2VIS, I tested how each model performed based on 6 different scenarios. Follow these instructions to generate your HuggingFace token (no credit required) for Code Llama. Acquire an OpenAI API key here and add some credit to your account. I’ll walk you through all the examples from the Chat2VIS published article, this time using GPT-4, the new GPT-3.5 Instruct model, and Code Llama.

For each example, choose the dataset from the sidebar radio button options, select the models using the checkboxes, and enter your API keys for OpenAI and HuggingFace.

Remember, as discussed in the published article, the non-deterministic nature of the LLMs can lead to variability in the generated Python code and subsequent plot generation, even when an identical prompt is resubmitted. So your plots may differ from the examples I have demonstrated here.

Case study 1: Generate code for a bar chart

This example uses the pre-loaded “Department Store” dataset.

Run the following query: “What is the highest price of product, grouped by product type? Show a bar chart, and display by the names in desc.”

Kudos to all three models for producing the same results! (Even though they may have different labels and titles.)

  • GPT-4 ✅
  • Code Llama ✅
  • GPT-3.5 Instruct ✅

Case study 2: Generate code for time series

Using the “Energy Production” dataset, run the query: “What is the trend of oil production since 2004?”

Impressive! All three models generated almost identical plots, showing data from 2004 onwards.

  • GPT-4 ✅
  • Code Llama ✅
  • GPT-3.5 Instruct ✅

Case study 3: Plotting request with an unspecified chart type

Here, I’m using the pre-loaded “Colleges” dataset in the sidebar radio button.

Run the query: “Show debt and earnings for Public and Private colleges.”

  • GPT-4 ✅
  • Code Llama 🤔
    During the initial runs of this example, I discovered that Code Llama had limitations similar to some legacy OpenAI models. It repeatedly attempted to generate scatter plot code that assigned invalid values to the function’s c parameter (as also mentioned in this article), so the code failed to execute. To improve its success rate, I made a slight adjustment to the prompt (for the exact wording, delve into the prompt engineering within this code).
  • GPT-3.5 Instruct 🤔 plotted average values, which is perhaps not as informative as the other models’ plots.

Case study 4: Parsing complex requests

Let’s examine a more complex example where the models need to select a subset of the data. Using the Customers & Products dataset, run the query: “Show the number of products with price higher than 1000 or lower than 500 for each product name in a bar chart, and could you rank y-axis in descending order?”

  • GPT-4 succeeded in this case ✅
  • GPT-3.5 Instruct produced an empty plot ❌
    It’s surprising that GPT-3.5 Instruct didn’t succeed, as this query has previously worked for ChatGPT-3.5, GPT-3, and Codex.
  • Code Llama also failed ❌ for several reasons.
    It did not filter the data to include only prices higher than $1000 or lower than $500, nor did it sort the data as requested.
    I encountered these kinds of limitations frequently while exploring Code Llama’s capabilities.

Case study 5: Misspelled prompts

Using the “Movies” dataset, let’s see how Code Llama handles misspelled words. Run the query: “draw the numbr of movie by gener.”

Look at that! Each model overlooked my spelling mistakes!

  • GPT-4 ✅
  • GPT-3.5 Instruct ✅
  • Code Llama ✅
    While it didn’t sort the results in the same order as the OpenAI models, the prompt didn’t specify any sorting.
    Code Llama, that uninformative legend is not very helpful!

Case study 6: Ambiguous prompts

Continuing with the “Movies” dataset, let’s submit the single word “tomatoes” and observe how the models process it.

  • GPT-4 ✅
  • Code Llama ✅
  • GPT-3.5 Instruct ❌
    This model did not identify a relevant “tomato” visualisation for the movie data set.

How to integrate Code Llama and GPT-3.5 Instruct into Chat2VIS

Let’s examine the code to integrate Code Llama and GPT-3.5 Instruct into the Streamlit app (I discussed GPT-4 in my previous Medium post). We’ll use LangChain to execute Code Llama on the HuggingFace Hub.

To get started, install the langchain and huggingface_hub Python packages in your environment:

pip install langchain
pip install huggingface_hub

Let’s make a few adjustments to the Streamlit interface:

  1. Add “Code Llama” into the title using st.markdown.
  2. Add two columns — one for the OpenAI key and one for the HuggingFace key. Use the help argument of the text_input widget to indicate which models require which type of API key.
  3. Create the text box for the user’s question and add the “Go…” button.
st.markdown("<h1 style='text-align: center; font-weight:bold; font-family:comic sans ms; padding-top: 0rem;'> Chat2VIS</h1>", unsafe_allow_html=True)
st.markdown("<h2 style='text-align: center;padding-top: 0rem;'>Creating Visualisations using Natural Language with ChatGPT and Code Llama</h2>", unsafe_allow_html=True)
key_col1, key_col2 = st.columns(2)
openai_key = key_col1.text_input(label=":key: OpenAI Key:", help="Required for ChatGPT-4, ChatGPT-3.5, GPT-3.", type="password")
hf_key = key_col2.text_input(label=":hugging_face: HuggingFace Key:", help="Required for Code Llama", type="password")
question = st.text_area(":eyes: What would you like to visualise?", height=10)
go_btn = st.button("Go...")

Now, the interface appears as follows:

Next, update the dictionary of available models to include GPT-3.5 Instruct and Code Llama, and construct the list of model checkboxes accordingly.

available_models = {"ChatGPT-4": "gpt-4", "ChatGPT-3.5": "gpt-3.5-turbo", "GPT-3": "text-davinci-003",
                    "GPT-3.5 Instruct": "gpt-3.5-turbo-instruct", "Code Llama": "CodeLlama-34b-Instruct-hf"}
st.write(":brain: Choose your model(s):")
use_model = {}
for model_desc, model_name in available_models.items():
    label = f"{model_desc} ({model_name})"
    key = f"key_{model_desc}"
    use_model[model_desc] = st.checkbox(label, value=True, key=key)

Now, the new models will appear in your checkbox list:

Next, create a list of the models that the user has selected:

# Make a list of the models which have been selected
selected_models = [model_name for model_name, choose_model in use_model.items() if choose_model]
model_count = len(selected_models)

It’s time to initiate the process and click “Go…”!

The script will only run if one or more models are selected. If at least one of the OpenAI models is chosen, check whether the user has entered an OpenAI API key (starting with sk-). If Code Llama is selected, check whether the user has entered a HuggingFace API key (starting with hf_).

Columns will be dynamically created on the interface for the correct number of plots, and the LLM prompt will be prepared for submission to the chosen models (for more details, refer to this code and this article):

# Execute chatbot query
if go_btn and model_count > 0:
    api_keys_entered = True
    # Check API keys are entered
    if "ChatGPT-4" in selected_models or "ChatGPT-3.5" in selected_models or "GPT-3" in selected_models or "GPT-3.5 Instruct" in selected_models:
        if not openai_key.startswith('sk-'):
            st.error("Please enter a valid OpenAI API key.")
            api_keys_entered = False
    if "Code Llama" in selected_models:
        if not hf_key.startswith('hf_'):
            st.error("Please enter a valid HuggingFace API key.")
            api_keys_entered = False
    if api_keys_entered:
        # Place for plots depending on how many models
        plots = st.columns(model_count)
        ...

After constructing the prompt, it’s sent to each model using the run_request function. Extend this function to include the new models. To do this, import the langchain methods in addition to the openai library:

import openai
from langchain import HuggingFaceHub, LLMChain, PromptTemplate

Within the run_request function, the key parameter refers to the user’s OpenAI API key. To accommodate the HuggingFace key, add an additional parameter, alt_key. Then create the function as follows:

  1. Add the first if statement to handle ChatGPT 3.5 & 4 models. GPT-4 tends to be more verbose and includes comments in the script without using the # character. To address this, modify the system's role to only include code in the script and exclude comments.
  2. Add the second if statement to cover the legacy GPT-3 model. Since the new GPT-3.5 Instruct model also uses the same Completion endpoint as GPT-3, include it in this if statement.
  3. Add a third if statement to run Code Llama from HuggingFace using basic LangChain commands.
def run_request(question_to_ask, model_type, key, alt_key):
    if model_type == "gpt-4" or model_type == "gpt-3.5-turbo":
        # Run OpenAI ChatCompletion API
        task = "Generate Python Code Script."
        if model_type == "gpt-4":
            # Ensure GPT-4 does not include additional comments
            task = task + " The script should only include code, no comments."
        openai.api_key = key
        response = openai.ChatCompletion.create(model=model_type,
            messages=[{"role": "system", "content": task}, {"role": "user", "content": question_to_ask}])
        llm_response = response["choices"][0]["message"]["content"]
    elif model_type == "text-davinci-003" or model_type == "gpt-3.5-turbo-instruct":
        # Run OpenAI Completion API
        openai.api_key = key
        response = openai.Completion.create(engine=model_type, prompt=question_to_ask, temperature=0, max_tokens=500,
            top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0, stop=["plt.show()"])
        llm_response = response["choices"][0]["text"]
    else:
        # Hugging Face model
        llm = HuggingFaceHub(huggingfacehub_api_token=alt_key,
            repo_id="codellama/" + model_type, model_kwargs={"temperature": 0.1, "max_new_tokens": 500})
        llm_prompt = PromptTemplate.from_template(question_to_ask)
        llm_chain = LLMChain(llm=llm, prompt=llm_prompt, verbose=True)
        llm_response = llm_chain.predict()
    return llm_response
  • HuggingFaceHub provides access to the models within the HuggingFace Hub platform. Initialise the object using the HuggingFace API token from the Streamlit interface. It requires the full repository name, repo_id="codellama/CodeLlama-34b-Instruct-hf". To limit creativity in the Python code generation, set the temperature to a low value of 0.1. A token limit of 500 should be sufficient to produce the required code.
  • PromptTemplate allows for manipulation of LLM prompts, such as replacing placeholders and keywords within the user's query. I have already dynamically created the prompt (question_to_ask), so it is a simple task to create the prompt template object.
  • LLMChain is a fundamental chain for interacting with LLMs. Construct it from the HuggingFaceHub and PromptTemplate objects. Set the verbose=True option to observe the prompt's output on the console. Then execute the predict function to submit the prompt to the model and return the resulting response.
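
As a quick usage sketch (the prompt_text variable below is a placeholder for the engineered prompt built earlier, not an actual Chat2VIS variable), a call to run_request for Code Llama might look like this:

# Hypothetical call - prompt_text stands in for the engineered prompt built earlier
code_snippet = run_request(question_to_ask=prompt_text,
                           model_type="CodeLlama-34b-Instruct-hf",  # value from available_models
                           key=openai_key,    # not needed for Code Llama
                           alt_key=hf_key)    # HuggingFace token from the sidebar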

The complete script to generate the visualisation for each model is created by combining the code section from the initial prompt with the script returned from the run_request function. Each model’s script is then executed and rendered in its own column on the interface using st.pyplot.
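
A minimal sketch of that final step might look like the following; the primer_code variable name and the try/except wrapper are my own illustrative additions rather than the exact Chat2VIS implementation.

import matplotlib.pyplot as plt

for i, model_desc in enumerate(selected_models):
    model_name = available_models[model_desc]
    llm_code = run_request(question_to_ask, model_name, openai_key, hf_key)
    full_script = primer_code + llm_code   # primer_code: the import/dataframe section of the prompt (illustrative name)
    with plots[i]:
        try:
            plt.figure()
            exec(full_script)              # run the generated script to draw the plot
            st.pyplot(plt.gcf())           # render the current figure in this column
        except Exception as e:
            st.error(f"{model_desc}: generated code failed to run ({e})")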

Can we improve success with open-source models?

I have compared the performance of Code Llama, GPT-3.5 Instruct, and ChatGPT-4 using examples from published literature showcasing ChatGPT-3.5, GPT-3, and Codex.

Initial experiments show promise, but the OpenAI models still outperform Code Llama in several scenarios. I encourage you to experiment and share your opinions.

In the future, I plan to enhance the prompt further and explore various other prompting techniques to potentially improve Code Llama’s accuracy. Although I want to avoid overcomplicating the instructions, I acknowledge its potential for improvement.

For this task, considering my prompting style, ChatGPT-4 is my preferred choice. However, taking into consideration the comparable results of ChatGPT-3.5 in the journal article and previous blog, together with the lower cost of the GPT-3.5 models (costs here), I would ultimately still choose ChatGPT-3.5. Nonetheless, it may be worthwhile to fine-tune a Code Llama model for data visualisations to further explore its capabilities, as it offers a cost-effective solution for Chat2VIS.

Wrapping up

Thank you for reading my post! I’ve shown you how to integrate GPT-3.5 Instruct and Code Llama from HuggingFace into Chat2VIS using the LangChain framework. I’ve discussed how to use these models in the Streamlit app and demonstrated their performance through various case studies.

I’d love to hear your opinions and the outcomes of your experiments. If you have any questions, please contact me on LinkedIn.

(Edited blog with content originally published by the author on Streamlit Tutorials and LLMs)
