The Power Trio: How Unicorn, Gemini Pro, and Vertex Search Supercharge Conversational Solutions
Large language models (LLMs) are powerful AI systems capable of remarkably human-like text generation. Trained on vast amounts of text data, they excel at tasks like translation, summarization, and creative writing. However, a significant challenge with LLMs is their tendency to hallucinate, that is, to generate factually incorrect or nonsensical responses. This occurs due to biases in their training data and a lack of true understanding of the words they produce.
To combat this issue, there are various techniques:
Grounding: Connecting LLMs to reliable knowledge sources (e.g., databases, search engines) so they can “ground” their responses in verifiable facts.
Reasoning: Integrating logical reasoning capabilities into LLMs, allowing them to follow step-by-step processes rather than simply relying on statistical association.
Programmatic Functions: Enabling LLMs to interact with and pull information from external programs (like calculators or web APIs) to supplement their knowledge and reduce errors.
This article shows how to combine these techniques to create powerful conversational solutions. I will not be using a chaining framework such as LangChain or LlamaIndex: I prefer not to let a framework control function calls in the background, and building everything from scratch promotes a deeper understanding of LLM mechanics while avoiding the limitations of automated frameworks.
Tools Used
Vertex Search
It is a solution for information retrieval that leverages techniques such as vector databases and embeddings to turn your data, whether structured or unstructured, into a contextual search engine. More information here.
Google Foundational Model: Unicorn
Unicorn is a model under the PaLM 2 umbrella that excels at generating different creative text formats, translating languages, and writing various kinds of creative content. It stands out for its ability to follow complex instructions, adapt its responses to suit various contexts, and provide remarkably detailed and informative answers to open-ended questions. More information here.
Google Foundational Model: Gemini Pro
Gemini Pro is a powerful language model (a type of advanced AI) developed by Google DeepMind. It is part of the larger Gemini family of models, which also includes Ultra and Nano. More information here.
Topology
The following architecture represents the solution. Two data sources are used: Wikipedia and Vertex Search. Unicorn handles the reasoning over both sources by asking itself questions and making observations; another function (text-bison or gemini-pro) summarizes the data if required.
Steps
Vertex Search is the only component we have to build before calling foundational models.
1. Create a Google Cloud Storage bucket and upload files.
Go to the Google Cloud Console: Products > Storage > Cloud Storage > Create a bucket, and name it:
2. Upload a PDF file to the bucket (a programmatic alternative to steps 1 and 2 is sketched after this list).
3. Go to Products > Artificial Intelligence > Search & Conversation > Create a New App and select Search:
Name it:
4. Create a new data store, select Cloud Storage, and point it to the bucket created before (or directly to the file):
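If you prefer to script steps 1 and 2 instead of using the console, a minimal sketch with the google-cloud-storage Python client could look like the following (the project, bucket, location, and file names are placeholders, not values from this article):

from google.cloud import storage

# Hypothetical project, bucket and file names; replace with your own.
client = storage.Client(project="my-project")
bucket = client.create_bucket("my-search-demo-bucket", location="us-central1")

blob = bucket.blob("document.pdf")
blob.upload_from_filename("document.pdf")  # the PDF that Vertex Search will index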
Once the app has been created, it starts a process of reading and parsing the files, creating embeddings (vector representations that ease the search), and storing them in a vector database ready to be queried.
The code snippet to access the search engine is as follows:
- Define a client (SearchServiceClient).
- Define parameters such as max_extractive_answer_count and max_extractive_segment_count, which control how many extractive answers and segments (paragraphs of text) are returned; segments are the more verbose of the two.
- Iterate through the response and extract the text, the link, and the page number.
#region Vertex Search
# Requires: from google.cloud import discoveryengine
#           from google.protobuf.json_format import MessageToDict
def vertex_search(self, prompt):
    # Build the Vertex AI Search (Discovery Engine) client and serving config.
    self.vsearch_client = discoveryengine.SearchServiceClient()
    self.vsearch_serving_config = self.vsearch_client.serving_config_path(
        project=self.project,
        location=self.location,
        data_store=self.data_store,
        serving_config="default_search",
    )
    # Ask for snippets, a short summary with citations, and extractive content.
    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
            return_snippet=True),
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=2, include_citations=True),
        extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
            max_extractive_answer_count=2,
            max_extractive_segment_count=2),
    )
    request = discoveryengine.SearchRequest(
        serving_config=self.vsearch_serving_config,
        query=prompt,
        page_size=2,
        content_search_spec=content_search_spec,
    )
    response = self.vsearch_client.search(request)
    documents = [MessageToDict(i.document._pb) for i in response.results]
    # Flatten the extractive segments into a numbered context dictionary.
    ctx = {}
    num = 0
    for i in documents:
        for ans in i["derivedStructData"]["extractive_segments"]:
            num += 1
            link = "https://storage.googleapis.com" + "/".join(
                i["derivedStructData"]["link"].split("/")[1:])
            context = ans["content"]
            page = ans["pageNumber"]
            ctx[f"context: {num}"] = "text: {}, source: {}, page: {}".format(
                context, link, page)
    return ctx
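As a quick sanity check, the method can be called through whatever wrapper object holds it (referred to simply as client later in this article); the exact class is not shown here, so treat this as a sketch:

# Assumes "client" is an instance of the wrapper class whose project,
# location and data_store fields point to the data store created above.
results = client.vertex_search("What is the Apple Remote?")
for key, value in results.items():
    print(key)    # e.g. "context: 1"
    print(value)  # text, source link and page number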
Design the Prompt.
It is time for coding. In this section I will cover the prompt engineering technique used to create a ReAct agent.
What is ReAct (short for Reasoning and Acting)? It is a way to instruct the large language model to follow a chain of thoughts that can take actions, for example:
Query: “What is an Apple Remote?”
- Thought 1: I need to search Apple Remote and find the program it was originally designed to interact with.
- Action 1: Search[Apple Remote]
- Observation 1: The traditional remote controls functions such as power, volume, channels, playback, track change, heat, fan speed, and various other features.
As you can see from the example above, we are telling the model to reason about the question being asked and to search external resources if needed.
If the information is not found in the first source, we iterate and call the model again so it can use another data source, in this case Vertex Search, as RAG (Retrieval-Augmented Generation).
Response from LLM: “I have not found any information about Apple Remote Control just traditional remote”
Thought 2: I probably need to find the answer somewhere else… Here I have a connection to Vertex Search.
Action 2: Search using Vertex Search [Apple Remote]
Observation 2: The Apple Remote is a remote control introduced in October 2005 by Apple.
Response: The Apple Remote is a remote control introduced in October 2005 by Apple.
And that is it! Kind of easy, isn't it?
More information on how this works:
Prompt:
You run in a loop of Thought, Action, PAUSE, Observation. At the end of the loop you output an Answer. Use Thought to describe your thoughts about the question you have been asked. Use Action to run one of the actions available to you - then return PAUSE. Observation will be the result of running those actions.
Your available actions are:
wikipedia:
e.g. wikipedia: Python
Returns a summary from searching Wikipedia.
rag_search (for culture questions):
e.g. rag_search: Python
Search the vector database (RAG) for that term.
summarization:
e.g. summarization: Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured, object-oriented and functional programming.
Returns a summarization for the description.
For culture questions, prioritize rag_search first; for the rest, use wikipedia.
Example session:
Question: What is the capital of France?
Thought: I should look up France on Wikipedia
Action: wikipedia: France
PAUSE
You will be called again with this:
Observation: France is a country. The capital is Paris.
You then output:
Answer: The capital of France is Paris
Here is the code.
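In code, the prompt above can be kept in a single Python string that is later passed to the Chatbot class (this is a reconstruction of the prompt described in this section, with line breaks restored and the long summarization example abbreviated):

prompt = """
You run in a loop of Thought, Action, PAUSE, Observation.
At the end of the loop you output an Answer.
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then return PAUSE.
Observation will be the result of running those actions.

Your available actions are:

wikipedia:
e.g. wikipedia: Python
Returns a summary from searching Wikipedia.

rag_search (for culture questions):
e.g. rag_search: Python
Search the vector database (RAG) for that term.

summarization:
e.g. summarization: <text to summarize>
Returns a summarization of the given description.

For culture questions prioritize rag_search first; for the rest use wikipedia.

Example session:

Question: What is the capital of France?
Thought: I should look up France on Wikipedia
Action: wikipedia: France
PAUSE

You will be called again with this:

Observation: France is a country. The capital is Paris.

You then output:

Answer: The capital of France is Paris
""".strip()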
Create the Functions and Prepare the Models.
I will be using 3 functions:
- For searching Wikipedia using the Python library httpx.
- For searching Vertex Search (built previously).
- For summarization, if needed, using Gemini Pro or text-bison.
import httpx

def wikipedia(q):
    # Query the MediaWiki search API and return the snippet of the top result.
    return httpx.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "list": "search",
        "srsearch": q,
        "format": "json"
    }).json()["query"]["search"][0]["snippet"]
def rag(q):
    return client.vertex_search(q)  # look at the code above or go to the github repo
def summarization(prompt):
    # _summ_model (model name) and _sum_parameters (generation settings)
    # are defined elsewhere in the app configuration.
    if _summ_model == "gemini-pro":
        gemini_model = GenerativeModel(_summ_model)
        response = gemini_model.generate_content(
            "Give me a summarization of the following:" + prompt,
            generation_config=_sum_parameters
        )
    else:
        bison_model = TextGenerationModel.from_pretrained(_summ_model)
        response = bison_model.predict(
            "Give me a summarization of the following:" + prompt,
            **_sum_parameters
        )
    return response.text
client.vertex_search comes from a file created earlier, because it serves multiple demos and I am too lazy to rewrite the code again; the function is called vertex_search (shown above).
summarization receives information from either RAG or Wikipedia and produces a summary using text-bison or gemini-pro.
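For example, a verbose observation coming back from rag_search can be condensed before showing it to the user (the query below is hypothetical, and _summ_model / _sum_parameters are assumed to be configured, e.g. _summ_model = "gemini-pro"):

observation = rag("renaissance painters")   # dict of extractive segments
summary = summarization(str(observation))   # condensed, readable text
print(summary)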
Create a Class to Handle the Context and Unicorn Calls
Feel free to run it without a class; for me it was easier to create one to establish the prompt grounding and iterate through the LLM responses:
class Chatbot:
    def __init__(self, system=""):
        # "system" holds the ReAct prompt; it is prepended to the conversation.
        self.system = system
        self.messages = []
        if self.system:
            self.messages.append("Context: {}".format(system))

    def __call__(self, message):
        self.messages.append(message)
        result = self.execute()
        self.messages.append(result)
        return result

    def execute(self):
        # print(self.messages)
        response = unicorn_model.predict("\n".join(self.messages), **_react_parameters)
        st.write(response.text)
        print(response.text)
        return response.text
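The class assumes that unicorn_model, _react_parameters, and st (Streamlit) already exist. They are not shown in the article, but a plausible initialization looks like this (parameter values are illustrative, not the author's):

import streamlit as st
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-project", location="us-central1")  # assumed project/region
unicorn_model = TextGenerationModel.from_pretrained("text-unicorn@001")
_react_parameters = {
    "temperature": 0.0,        # keep the reasoning steps deterministic
    "max_output_tokens": 1024,
}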
Final Step is the Iteration
def query(question, max_turns=5):
    i = 0
    bot = Chatbot(prompt)
    next_prompt = question
    while i < max_turns:
        i += 1
        result = bot("Question: {}".format(next_prompt))
        actions = [action_re.match(a) for a in result.split('\n') if action_re.match(a)]
        if actions:
            # There is an action to run
            action, action_input = actions[0].groups()
            if action not in known_actions:
                raise Exception("Unknown action: {}: {}".format(action, action_input))
            st.write(" -- running {} {}".format(action, action_input))
            observation = known_actions[action](action_input)
            st.write("Observation:", observation)
            next_prompt = "Observation: {}".format(observation)
        else:
            return
Basically, I am creating a loop that builds a message list with the context prompt in the first position and iterates up to the maximum number of turns to produce Thought, Action and Observation / Response. A regular expression parses lines like wikipedia: Python, rag_search: Python or summarization: Python, where the first value is the action and the second is the action_input used in known_actions[action](action_input).
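Two pieces referenced by the loop are defined outside of it: the regular expression that parses the Action lines and the dispatch table of available actions. A minimal version consistent with the prompt format would be:

import re

# Matches lines such as "Action: wikipedia: France" produced by the model.
action_re = re.compile(r"^Action: (\w+): (.*)$")

# Maps the action name used in the prompt to the Python function that runs it.
known_actions = {
    "wikipedia": wikipedia,
    "rag_search": rag,
    "summarization": summarization,
}

# Example run:
query("What is an Apple Remote?")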
Once we have it all together, it looks like this: