Using Large Language Models with Your Own Data

Michael Fox
ProfitOptics
Oct 16, 2023 · 8 min read

Shortly after discovering the amazing capabilities of modern large language models (LLMs), I began asking the question, “How can I apply this power to my own data?” After all, this is really what we need from them, regardless of how fun it is to test the limits of their reasoning ability, songwriting skills, or inherent knowledge of Newtonian physics. For them to be truly helpful, they must be able to solve problems that are relevant to me, and for that to happen, I need to share my data with them so that they can process questions and deliver insightful responses based on that information. In this article, we will explore some of the techniques that have been developed for doing just that.

Simple Prompt

A set of instructions passed to the model, briefly explaining what you are looking for. This is the way we all begin querying LLMs: the model responds based solely on information from its initial training.
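For example, sending a simple prompt through the API might look like this (a minimal sketch using the pre-1.0 interface of the openai Python package; the model name is an illustrative choice):

import openai  # pip install openai; expects OPENAI_API_KEY in the environment

# A simple prompt: a single instruction with no added context or examples.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": "Summarize Newton's three laws of motion in one sentence each."}],
)
print(response.choices[0].message.content)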

Zero Shot Prompt

Like a simple prompt, but with the addition of context information to assist the process. Note that to create a line feed in ChatGPT without submitting your prompt, use Shift+Enter.

Prompt:
Determine the sentiment of the text, "Good" or "Bad".
Text: I really did not have a good time at the movies last night. My seat was sticky and the popcorn was stale.
Classification:

Response:
The text states negative experiences about a visit to the movies. Specifically:

1. "really did not have a good time"
2. "seat was sticky"
3. "popcorn was stale"
These are all unfavorable descriptions. Based on this, the classification of the sentiment of the text is:

Classification: Bad.

Here I have provided the model with some data of my own, including the text whose sentiment I would like it to judge and some basic formatting instructions for the response I expect.

Few Shot Prompt

Adding a small set of labeled examples to the prompt to assist in satisfying the request. I’ve found this most helpful for showing ChatGPT the format of the answer I want, making the results easier to parse. It can be quite limiting, though, as the labeled examples and the prompt itself all count toward the token limit of the request.

Prompt:
Determine the sentiment of the text, "Good" or "Bad". The output should be in json format, with key "classification".
Text: I really did not have a good time at the movies last night. My seat was sticky and the popcorn was stale.
Classification: {"classification": "Bad"}

Text: It is raining today. This will ruin my plans. I will just sit inside and work instead.
Classification: {"classification": "Bad"}

Text: Tomorrow is Christmas. It is my favorite holiday of the year. I am so excited.
Classification: {"classification": "Good"}

Text: I have a test today and I haven't studied. This is a math test and I have trouble with math. I will probably do poorly.
Classification: {"classification": "Bad"}

Text: I started sneezing this morning and could not stop. I believe I am getting a cold. I really don't like being sick.
Classification:

Response:
{"classification": "Bad"}

With the added information provided by the examples, the response is even more likely to be formatted as I would like, making it easier to parse automatically. The model also begins with better information about the desired content of the response and is more likely to process the prompt successfully.
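Because the response now arrives as predictable JSON, extracting the label in code is straightforward. Here is a minimal sketch, assuming the model returns exactly the JSON object shown above:

import json

# The raw model output produced by the few-shot prompt above.
raw_response = '{"classification": "Bad"}'

# The examples pinned down the output format, so standard JSON parsing works.
result = json.loads(raw_response)
print(result["classification"])  # Bad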

Retrieval-Augmented Generation (RAG)

This gets around the token limit for the most part, allowing an LLM to work effectively with large collections of data. It consists of retrieving relevant information or examples from a database, text file, spreadsheet, or other knowledge source and using that information to build a context, which is then passed in with the desired prompt (often automatically). This allows the use of large data sets without requiring additional training.

Here is a quick example making use of the LlamaIndex data framework for LLMs.

from llama_index import (GPTVectorStoreIndex, LLMPredictor, PromptHelper,
                         ServiceContext, SimpleDirectoryReader)
from llama_index.llms import OpenAI

if __name__ == '__main__':
    try:
        query = "What did the queen want done to the Dormouse?"

        # Wrap the OpenAI completion model for LlamaIndex to use.
        llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model="text-davinci-003"))

        # Settings that control how retrieved chunks are packed into the prompt.
        context_window = 4096
        num_output = 256
        chunk_overlap_ratio = 0.1

        prompt_helper = PromptHelper(context_window, num_output, chunk_overlap_ratio)

        # Bundle the predictor and prompt settings so the index actually uses them.
        service_context = ServiceContext.from_defaults(
            llm_predictor=llm_predictor, prompt_helper=prompt_helper)

        # Ingest every document found in ./data/ into an in-memory vector store.
        documents = SimpleDirectoryReader('./data/').load_data()
        index = GPTVectorStoreIndex.from_documents(
            documents, service_context=service_context)

        # Retrieve the most relevant chunks and let the LLM answer from them.
        query_engine = index.as_query_engine()
        response = query_engine.query(query)

        print(response)
    except Exception as e:
        print(f"An error occurred: {e}")
Output:
The queen wanted the Dormouse to be taken outside and have its head taken off.

The code ingests the documents it finds in the ./data directory into a vector store, which it uses to locate the relevant passages and provide the LLM with the context it needs to generate an answer. In this case, the ./data directory contains just one document, a text file of the story Alice in Wonderland. This is just a simple example; LlamaIndex is an incredibly flexible tool that can work with many different types of data sources and data structures.

Be sure to ‘pip install llama-index’ and provide an OPENAI_API_KEY as an environment variable. For more information on LlamaIndex, see https://docs.llamaindex.ai/en/stable/
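One practical follow-up: you will usually want to avoid re-embedding the documents on every run. This sketch, which assumes the index built in the example above and the same llama_index version, persists the vector store to disk and reloads it later:

from llama_index import StorageContext, load_index_from_storage

# Save the vector index built above to disk...
index.storage_context.persist(persist_dir='./storage')

# ...and reload it later without re-ingesting the source documents.
storage_context = StorageContext.from_defaults(persist_dir='./storage')
index = load_index_from_storage(storage_context)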

Agents

These allow the model to look outside of itself for an answer, for example by searching the web or calling other functions.

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chat_models import ChatOpenAI

if __name__ == '__main__':
    try:
        llm = ChatOpenAI(temperature=0)

        prompt = 'How many degrees cooler is it in Tampa Florida than in Miami Florida today?'

        # load_tools builds the Google search and calculator tools for us,
        # reading GOOGLE_API_KEY and GOOGLE_CSE_ID from the environment.
        tools = load_tools(["google-search", "llm-math"], llm=llm)

        # A ReAct-style agent: it picks a tool, observes the result, and
        # repeats until it can produce a final answer.
        agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
        agent.run(prompt)

    except Exception as e:
        print(f"An error occurred: {e}")
Input:
How many degrees cooler is it in Tampa Florida than in Miami Florida today?

Output:
> Entering new AgentExecutor chain...
I can use Google Search to find the current temperatures in Tampa and Miami.
Action: google_search
Action Input: "current temperature in Tampa Florida"
Observation: 10 Day Weather-Tampa, FL. As of 6:54 am EDT. Rip Current Statement ... Current weather in Tampa, FL. Check current conditions in Tampa, FL with radar ... Temperature History. 10/16. High. Low. Forecast. 73°. 55°. Average. 84°. 68°. Tomorrow's temperature is forecast to be COOLER than today. Radar; Satellite. WunderMap|Nexrad · icon. TonightSun 10/15 Low 63 °F. 5% Precip. / 0.00 °in Tampa, FL Weather Forecast, with current conditions, wind, air quality, and what to expect for the next 3 days. Location, Time (EDT), Weather, Vsby. (SM), Temp. (ºF), Dewpt. (ºF), Hum. (%), Wind (mph), Wind Chill / Heat Index (ºF), Pres. (in). Cooler Temperatures from the Great Plains to the East Coast Early This Week · Tampa, Tampa International Airport (KTPA) · Tampa FLSimilar City Names · Detailed ... Get Tampa, FL current weather report with temperature, feels like, air quality, humidity, UV report and pollen forecast from TheWeatherNetwork.com. Tampa, Tampa International Airport. Enter Your "City, ST" or zip code. metric. D a t e, Time (edt), Wind (mph), Vis. (mi.) Weather, Sky Cond. Temperature (ºF) ... Current Conditions · Sky 9 Cameras · Marine Weather · Travel Weather · Pollen Forecast ... Hourly Temperature. Hourly Wind Speed. NEIGHBORHOOD. 7-county | Florida ... Tampa FL detailed current weather report for 33601 in Hillsborough county, Florida ... The following chart reports what the hourly Tampa, FL temperature has been ...
Thought:I need to search for the current temperature in Miami as well.
Action: google_search
Action Input: "current temperature in Miami Florida"
Observation: Be prepared with the most accurate 10-day forecast for Miami, FL with highs ... Today · Hourly · 10 Day · Radar · Video · Try Premium free for 7 days. Learn More ... Current weather in Miami, FL. Check current conditions in Miami, FL with radar ... Temperature History. 10/16. High. Low. Forecast. 81°. 63°. Average. 86°. 74°. Cooler Temperatures from the Great Plains to the East Coast Early This Week · NWS Forecast Office Miami - South Florida. Miami, FL Weather Forecast, with current conditions, wind, air quality, and what to expect for the next 3 days. Today's and tonight's Miami, FL weather forecast, weather conditions and Doppler radar from The Weather Channel and Weather.com. Tomorrow's temperature is forecast to be MUCH COOLER than today. Radar; Satellite. WunderMap| ... Current conditions at. Miami, Miami International Airport (KMIA). Lat: 25.79°NLon: 80.32°WElev: 10ft. Mostly Cloudy. 78°F. 26°C. Humidity, 67%. 2 days ago ... Today's Forecast. Get the latest weather updates from the NBC6 ... WeatherMiamiSouth FloridaFort Lauderdalefirst alert weather. Trending ... Get Miami, FL current weather report with temperature, feels like, air quality, humidity, UV report and pollen forecast from TheWeatherNetwork.com. 14-day forecast. Add to your locationsAdd to your locations. Weather warnings issued. Forecast - Miami (FL). Day by day forecast. Last updated today at 05:13.
Thought:I have found the current temperatures in Tampa and Miami. Now I can calculate the difference.
Action: Calculator
Action Input: 73 - 81
Observation: Answer: -8
Thought:The temperature in Tampa is 8 degrees cooler than in Miami today.
Final Answer: 8 degrees

Here, I have used another helpful library, LangChain, to load both a math tool and a Google search tool and instantiate an agent from them. Internally, these tools are defined in LangChain with the context necessary for the LLM to choose between them and employ their capabilities in producing a response to the user prompt. I’m sure you can see how much more powerful a large language model becomes with access to the data of the entire web. In addition to agents, LangChain offers a host of additional tools that further enhance your code’s ability to process the inputs and outputs of language models.
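Agents are not limited to the tools LangChain ships with; you can expose your own functions as well. Here is a minimal sketch of a custom tool (the inventory lookup function and its return value are purely hypothetical):

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI

def check_inventory(product_name: str) -> str:
    # Hypothetical stand-in for a real database or API lookup.
    return f"There are 42 units of {product_name} in stock."

llm = ChatOpenAI(temperature=0)
tools = [
    Tool(
        name="inventory-lookup",
        func=check_inventory,
        description="Returns the current stock level for a named product.",
    )
]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("How many widgets do we have in stock?")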

Be sure to pip install langchain, html2text and google-api-python-client and provide your OPENAI_API_KEY, GOOGLE_CSE_ID and GOOGLE_API_KEY as environment variables.

For more information, see https://docs.langchain.com/docs/

Parameter-Efficient Fine-Tuning (PEFT)

This is a training approach that allows you to fine-tune a model with additional data by working with just a subset of its parameters, making the training process faster and less resource-intensive. Through a little mathematical magic, PEFT can perform close to, or as well as, traditional fine-tuning, and it can often be done on just one or two graphics cards rather than giant banks of them. Models trained this way with additional data become better suited to tackling specific tasks.

You may have already been exposed to the concept of LoRA models if you have ever worked with Stable Diffusion via the AUTOMATIC1111 web UI for image generation, or with Hugging Face chat models using a tool like Oobabooga. By creating LoRA models, users are able to “train” their favorite base model with much smaller data sets using just their personal computer or an inexpensively leased cloud instance (see https://vast.ai/ to “borrow” a GPU or two for a reasonable price). While a detailed example of model training is beyond the scope of this article, effective training of LLMs as well as diffusion models using PEFT methods is quite possible without million-dollar budgets and access to university supercomputing resources. To learn more, see https://huggingface.co/blog/peft
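To give a flavor of what this looks like in practice, here is a minimal sketch using Hugging Face’s peft library (the base model and LoRA hyperparameters are illustrative choices, not recommendations):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load a full base model, then wrap it so that only the small LoRA adapter
# matrices are trainable while the original weights stay frozen.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # illustrative
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # rank of the adapter matrices
    lora_alpha=32,   # scaling applied to the adapter output
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)

# Report how small the trainable fraction is compared to the full model.
model.print_trainable_parameters()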

Traditional Fine-Tuning

This method of additional training also aims to make a model more capable at specific tasks through the use of additional data. Unlike PEFT approaches, which achieve their results by adjusting just a subset of a model’s parameters, traditional fine-tuning updates all of the model’s parameters throughout the entire training process and can require a tremendous amount of data and compute. Accordingly, this approach is often not the best option when looking to enhance large language models with your own data. Traditional fine-tuning of commercial models, such as those offered by OpenAI, can get very expensive very quickly. Luckily, as we have been discussing, there are other options available.
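For reference, starting a hosted fine-tuning job with the pre-1.0 openai package looks roughly like this (the training file ID is a placeholder for a JSONL file you have already uploaded, not a real value):

import openai  # expects OPENAI_API_KEY in the environment

# Create a hosted fine-tuning job from a previously uploaded training file.
# "file-abc123" is a placeholder ID, not a real file.
job = openai.FineTuningJob.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
)
print(job.id, job.status)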
