Pick Your Fighter: Easy Model Comparisons with Streamlit & Cortex

There can only be one. Except when there are many.

It’s the reality of working in the Generative AI space that the moment I hit ‘publish’ on this article, it’ll immediately become out of date. This week alone, we’ve seen the release of Llama 3 (8b & 70b), the technical write-up of Mixtral-8x22b, the general release of Reka-core, as well as Snowflake’s state-of-the-art open source embedding models: Snowflake-Arctic. It’s equal parts exhilarating and exhausting.

The issue with all this choice is that the question of “which model should I use for…?” becomes increasingly nuanced. Even smaller models are highly capable these days. However, benchmarking well is one thing; performing accurately at a given real-world task is another. So, how do you pick? Equally, if you were previously using a Llama 2 model, how can you ensure that swapping to Llama 3 results in net positive gains?

To rapidly solve this challenge — and most of life’s problems — I find myself turning to Streamlit-in-Snowflake. Combined with Cortex’s unbelievable ease of use, comparing models on a like-for-like prompt becomes a breeze. The best part? Once again, achieved in very few lines of simple code:

import streamlit as st
from snowflake.snowpark.context import get_active_session
session = get_active_session()
st.set_page_config(layout="wide")

st.title("Cortex Model Comparison")
st.write("""[docs.streamlit.io](https://docs.streamlit.io).""")

@st.cache_data #don't issue a new query if nothing else changes
def cortex_query(model, prompt, temp):
    prompt = prompt.replace("'", "\\'") #account for pesky quotes

    # %s (rather than %d) keeps the float temperature intact in the SQL text
    q = """SELECT SNOWFLAKE.CORTEX.COMPLETE('%s',
               [{'role': 'user',
                 'content': '%s'}],
               {'temperature': %s}) as resp,
               TRIM(GET(resp:choices,0):messages,'" ') as response,
               resp:usage:total_tokens::string as total_tokens
           ;""" % (model, prompt, temp)

    exc_q = session.sql(q).to_pandas()
    return exc_q

form = st.form("prompt_compare")

prompt = form.text_area("Enter your prompt:")
submitted = form.form_submit_button("Submit")

col1, col2, spc, col3, col4 = st.columns([2,2, 1, 2,2])

model1 = col1.selectbox("Select the first model:", ("mistral-large", "reka-flash", "mixtral-8x7b", "mistral-7b", "gemma-7b", "llama2-70b-chat"))
temp1 = col2.slider("Select temp:", 0.0, 1.0, 0.2)

model2 = col3.selectbox("Select the second model:", ("reka-flash", "mistral-large", "mixtral-8x7b", "mistral-7b", "gemma-7b", "llama2-70b-chat"))
temp2 = col4.slider("Select temp:", 0.0, 1.0, 0.3)

chat1,spc2,chat2 = st.columns([4,1,4])

with chat1:
    if submitted:
        with st.chat_message("1"):
            reply1 = cortex_query(model1, prompt, temp1)
            st.markdown(reply1['RESPONSE'][0])
            st.info('Total tokens: ' + reply1['TOTAL_TOKENS'][0])

with chat2:
    if submitted:
        with st.chat_message("2"):
            reply2 = cortex_query(model2, prompt, temp2)
            st.write(reply2['RESPONSE'][0])
            st.info('Total tokens: ' + reply2['TOTAL_TOKENS'][0])

Firstly, what should I look for?

Before I explain the code, let’s run through when this app would be particularly helpful.

Beyond accuracy, it is crucial to test how a foundation model responds to a given prompt structure. For example, the Mistral collection of models will respect and add weight to you shouting at it in ALL CAPS far more than other models will. Llama, on the other hand, appears to prefer neatly structured XML tags such as <context> “Hello” </context>. Each time you swap one model for another or upgrade, checking that your previous prompt still holds up is highly valuable.
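One lightweight way to manage this is to keep a per-model prompt template alongside the app. The sketch below is purely illustrative; the template wording and the PROMPT_TEMPLATES dictionary are my own assumptions, not part of the app above:

# Illustrative only: per-model prompt templates for the same underlying task.
# Which phrasing each model actually favours is exactly what the comparison
# app is for, so treat these templates as hypotheses to test.
PROMPT_TEMPLATES = {
    "mixtral-8x7b": "ANSWER IN ONE SHORT PARAGRAPH. {task}",        # emphasis via capitals
    "llama2-70b-chat": "<context>{context}</context>\n\n{task}",    # structured XML-style tags
}

def build_prompt(model: str, task: str, context: str = "") -> str:
    # Fall back to the bare task if there's no template for this model
    template = PROMPT_TEMPLATES.get(model, "{task}")
    return template.format(task=task, context=context)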

An increasingly notable characteristic of models (especially as they converge on accuracy) is the tonality of their responses. Jokes aside, models do appear to house their own unique personalities. Perhaps that's unsurprising: given that they're built on data and fine-tuned by humans, there will be a tendency to reflect the manner in which they've been built.

The last component, particularly important in generation tasks, is temperature. You can think of it as how much creative license you're willing to give a model in its response. Set low (0.0 to 0.5) and replies will favour consistency and predictability. Set high (0.8 to 1.0) and you'll find responses are far more creative, but also more random. Any higher and models begin to resemble near-drunken behaviour, spouting next-to-nonsense.
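As a quick illustration, reusing the cortex_query function defined above, running the same prompt at a few temperatures makes the drift from predictable to free-wheeling easy to see (the prompt and values here are just examples):

# Same prompt, three temperatures: watch the replies drift from consistent
# to creative to borderline chaotic. Reuses cortex_query from the app above.
sample_prompt = "Describe Brighton in one sentence."
for temp in (0.1, 0.7, 1.0):
    result = cortex_query("mixtral-8x7b", sample_prompt, temp)
    st.write(f"temperature={temp}: {result['RESPONSE'][0]}")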

To run through an example, I thought I'd ask two models, Gemma-7b and Mixtral-8x7b, to describe the wonderful city of Brighton (biased) in the ‘style of a Londoner’. There's enough ambiguity in the second part of the ask to make this interesting. How exactly do the models define “a Londoner” — will it be more Dick Van Dyke fumbling around in Mary Poppins? Or more of a reserved banking type?

Either way, let’s find out.

Gemma-7b out here full of lies and ready to pick fights

From a tonality point of view, Gemma-7b was far more in line with expectations when asked for a “Londoner” response. The cockney accent shone through. However, the derogatory comment on “chavs and tourists” — as well as misinformation on donkey rides (banned for several years now) — did make me pause for thought. At least I can give it credit for microbreweries and seagulls.

Mixtral, on the other hand, is far more measured in its response: wordier, more complimentary, and with specific references to Brighton's wonderful independent retailers, pesky seagulls (again), and pebbly beach. The fact that Mixtral is a larger mixture-of-experts model allowed it to outmaneuver Gemma's smaller knowledge base in this case.

So, what if we raise the temperature of Mixtral? Let’s give the model all the creative license it requires and see what happens.

Temperature appears to be correlated with spiciness.

“If you're into that sort of thing”, eh? This time, Mixtral appears to have embraced the British personality I demanded. Written as if it's having a proper moan, where every compliment is buried deep within complaints and peppered with the odd “bloody” for good measure.

Importantly, regardless of how enjoyable Brighton is, comparing these outputs takes seconds. A simple parameter adjustment, prompt change, or filter selection and our Streamlit app quickly provides that adjusted response. Let’s dive under the hood to round things off.

The techy bits

The creation of this app is fantastically simple, requiring roughly 40 lines of Python. Using Streamlit-in-Snowflake means I can focus solely on the code (and not the infrastructure, hosting, security, etc.), while Snowflake Cortex means that, for inference, I simply pick a model and pass it a prompt.

The core of the work comes down to this code segment:

@st.cache_data #don't issue a new query if nothing else changes
def cortex_query(model, prompt, temp):
    prompt = prompt.replace("'", "\\'") #account for pesky quotes

    # %s (rather than %d) keeps the float temperature intact in the SQL text
    q = """SELECT SNOWFLAKE.CORTEX.COMPLETE('%s',
               [{'role': 'user',
                 'content': '%s'}],
               {'temperature': %s}) as resp,
               TRIM(GET(resp:choices,0):messages,'" ') as response,
               resp:usage:total_tokens::string as total_tokens
           ;""" % (model, prompt, temp)

    exc_q = session.sql(q).to_pandas()
    return exc_q

At the top of the code block, you’ll notice that the cortex_query function has been decorated with @st.cache_data. This ensures that each time we call the function with identical parameters, Streamlit reuses the session cache instead of reissuing a query to the Cortex service. In the example above, when adjusting the temperature for the Mixtral-8x7b model, the cached result for Gemma-7b is reused, and only one new query is generated.
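In practice that means something like the following (again reusing cortex_query from above); the second call is served from the cache, so only two queries are actually issued:

# The cache key is simply the tuple of arguments, so an identical call is
# served from Streamlit's cache rather than re-querying Cortex.
first = cortex_query("gemma-7b", "Describe Brighton", 0.2)
second = cortex_query("gemma-7b", "Describe Brighton", 0.2)   # cache hit, no new query

# Changing any argument (model, prompt, or temperature) creates a new cache
# entry and therefore issues a fresh query.
third = cortex_query("mixtral-8x7b", "Describe Brighton", 0.2)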

The other point of note is the Cortex COMPLETE query. A typical call follows a COMPLETE({model}, {prompt}) structure, where the output is simply the response text. For our app, which needs to adjust temperature, the query follows a slightly different pattern: COMPLETE({model}, [prompt array], {options}). This structure allows you to pass multiple prompts, such as conversation history or system prompts, and to define parameters like max_tokens (documentation here).
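To make that second form concrete, here's a sketch that passes a system prompt alongside the user prompt and caps the response length. The system prompt wording and the max_tokens value are illustrative only; check the Cortex documentation for the full list of supported options:

# Illustrative COMPLETE call using the prompt-array form with an options object.
q = """
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mixtral-8x7b',
    [
        {'role': 'system', 'content': 'You are a grumpy but fair travel critic.'},
        {'role': 'user',   'content': 'Describe Brighton in the style of a Londoner.'}
    ],
    {'temperature': 0.7, 'max_tokens': 300}
) AS resp;
"""
result = session.sql(q).to_pandas()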

Calling the Cortex service in this manner returns a detailed JSON object containing the response alongside information on when the call was made, which model answered, and how many tokens were consumed. To streamline things, I'm leveraging Snowflake's native support for semi-structured data to pull out the raw response and token counts directly in SQL (the approximate response shape is sketched below). With that — you've completed the foundation!
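For reference, the object that comes back looks roughly like this (my reading of the response shape; the exact fields are worth confirming against the Cortex documentation), which is why the colon-path expressions in the query work:

# Approximate shape of the COMPLETE response when called with an options object:
example_resp = {
    "choices": [{"messages": " Brighton is a lovely seaside city..."}],
    "created": 1713523456,          # unix timestamp of the call
    "model": "mixtral-8x7b",        # model that served the request
    "usage": {"completion_tokens": 142, "prompt_tokens": 21, "total_tokens": 163},
}

# The paths used in the query map straight onto that structure:
#   GET(resp:choices, 0):messages   -> the response text itself
#   resp:usage:total_tokens         -> token usage for cost tracking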

Wrap up

Streamlit-in-Snowflake combined with Cortex’s ease of use provides a simple and effective way to compare models on a like-for-like prompt basis. By testing models on various prompts and adjusting parameters such as temperature, you can quickly determine the best model for your specific task — ultimately saving you time.

Given Streamlit's iterative nature, a natural next step is to define system prompts separately, or perhaps to append sample table data to the input for extraction and manipulation tasks (a rough sketch of the first idea follows below). Go nuts — and as Gemma said, watch out for seagulls.
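Here's one possible shape for that system-prompt extension. It's illustrative only: the cortex_query_with_system function and the extra widget below aren't part of the app above, they just mirror its escaping and query structure:

# Possible extension: collect a system prompt separately and pass both messages
# through COMPLETE's prompt array.
system_prompt = form.text_area("System prompt (optional):", "You are a helpful assistant.")

@st.cache_data
def cortex_query_with_system(model, system, prompt, temp):
    system = system.replace("'", "\\'")
    prompt = prompt.replace("'", "\\'")
    q = """SELECT SNOWFLAKE.CORTEX.COMPLETE('%s',
               [{'role': 'system', 'content': '%s'},
                {'role': 'user',   'content': '%s'}],
               {'temperature': %s}) as resp,
               TRIM(GET(resp:choices,0):messages,'" ') as response,
               resp:usage:total_tokens::string as total_tokens
           ;""" % (model, system, prompt, temp)
    return session.sql(q).to_pandas()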
