Prompt Engineering the GPT-4 model and evaluating LLM response using a Streamlit app

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

4 min readOct 25, 2023

Everyone is excited about the power of Generative AI and Large Language Models. Every data and AI team is experimenting with LLM models. But the rate at which new models, methods and tools are released everyday, it is almost impossible to keep up with.

Worry not! Here is an overview of what LLMs are and the many different ways they are built.

How to build LLMs today?

There are three major approaches to building LLMs today. In the order of increasing difficulty, cost and complexity:

1. Prompt engineering
2. Retrieval Augmented Generation (RAG)
3. Training proprietary LLM models

In this blog, we will dive deeper into approach #1 Prompt Engineering and learn how to take the existing foundation models and use prompt engineering to make the model work for your use case.

Key Terminologies

Before we go further, let us understand a few key terms related to LLMs.

Foundation Model

A foundation model is a general-purpose AI model pre-trained on large text datasets, serving as a versatile base for various natural language processing tasks, and can be fine-tuned for specific applications.

Prompt Engineering

Prompt engineering involves designing and crafting precise prompts or instructions to effectively interact with language models, ensuring clear and desired outputs by formulating input instructions.

LLM Evaluation

LLM evaluation assesses the performance and limitations of Large Language Models, like GPT-3, through tasks such as quality assessment of generated text, question-answering ability, and domain-specific evaluations.

Prompt Injection

Prompt injection is the process of inserting predefined or custom prompts into model interactions to guide responses, maintain context, and ensure that language models generate desired outputs or adhere to specific constraints during conversations.

Prompt Engineering use-case

You are a data practitioner analyzing the reviews of cosmetic products. Customer review data is in a standard Snowflake table.

Applications of LLMs on this data are plenty. We can use the foundation models such as GPT-4 to analyze customer sentiments on these products, run topic modeling on customer reviews, answer Q&A about the cosmetic products, and even create product descriptions that include the ingredients list and allergy precautions and so on.

LLM Evaluation

In this example, we will use GPT-4 and two different prompt templates to generate product description. However, we need to evaluate the product description responses to understand which prompt performs better.

There are different methods to evaluate LLM model responses. First and the simplest one is human evaluation to check for errors, bias, safety, etc. However, it is not a comprehensive solution. You can build an automated prompt engineering and evaluation pipeline end to end for this use case as well.

What if the dataset is corrupted and we have a random product name instead of a cosmetic product in the table?
What if you tried to automate the product description generation without no humans in the loop? It only makes the development cycle faster.
What if someone injects a malicious prompt?

Although human evaluation is a good starting point, it is not the end-all be-all of LLM evaluation methods.

Hands-on Demo

Here is the brief outline for the hands-on demo. To follow along and build a prompt engineering and evaluation pipeline, here is the Quickstart.

Prompt Engineering and Evaluation of LLM responses

This quickstart will cover the basics of prompt engineering on your Large Language Models (LLMs) and how to evaluate…

quickstarts.snowflake.com

Access the cosmetics review data from Snowflake Marketplace
Invoke OpenAI’s ChatCompletion API with different variations of prompts and capture the model responses for product description
Save the model responses in a Snowflake table so we can compare the responses and evaluate which prompts give desired results
Build a Streamlit app to let users compare and evaluate the model responses

Conclusion & Resources

We learnt how to perform prompt engineering on your Large Language Models (LLMs) and how to evaluate the responses of different LLMs through human feedback by building an interactive Streamlit Application.

If you are looking to build more LLM Apps using Snowflake and Streamlit, check out these quickstarts:

Thanks for Reading!

If you like my work and want to support me…

The BEST way to support me is by following me on Medium.
For data engineering best practices, and Python tips for beginners, follow me on LinkedIn.
Feel free to give claps so I know how helpful this post was for you.

Prompt Engineering the GPT-4 model and evaluating LLM response using a Streamlit app

How to build LLMs today?

Key Terminologies

Foundation Model

Prompt Engineering

LLM Evaluation

Prompt Injection

Prompt Engineering use-case

LLM Evaluation

Hands-on Demo

Prompt Engineering and Evaluation of LLM responses

This quickstart will cover the basics of prompt engineering on your Large Language Models (LLMs) and how to evaluate…

Conclusion & Resources

Thanks for Reading!

Written by Vino Duraisamy