Getting Real: Evaluating OpenAI’s Agent Tools With An On-Board Boat AI.

Published in

As A Large Language Model…

16 min readDec 11, 2023

OpenAI’s good at several things. Managing the press is high on the list. Here, I use a toy problem to experiment with their tools and evaluate their claims.

I didn’t entirely expect this to succeed, but I was surprised at how it failed.

This is the first, and most naive, pass. In later posts, I’ll iterate to enhance the prompting and then move to API-based configuration where I have more control over the behavior.

Goal

OpenAI announced at their recent conference (Nov 2023) custom ‘assistants’ based on provided documents as context — one of the most obvious use cases language model of any actual utility. The claim is that analysis of provided documents results in an agent that can act as an expert interlocutor over those documents.

Their example is digesting a conference program and producing an agent that can help you find sessions, understand the schedule, etc. This is pretty weak sauce, and succeeds unimpressively in the demo.

But what about something slightly more real-world? Are OAI’s tools ready to handle it?

I set myself the goal of producing as an assistant the embodied ‘voice’ of a sailboat and its onboard systems as a means of evaluating this tool in a realistic setting of actual utility. This would allow someone to be at the chandlery and be able to say, “how many spare oil filters do I already have?” and get the answer. It’d be nice if people on-board could ask “Where is the fire extinguisher!!!”, in case they have a burning … curiosity.

This seems like a reasonable experiment, because I have structured and semi-structured information meant to be comprehended by a semiintelligent normal human that covers all this information. So let’s see what it takes to get OAI to do something useful. I’ll start out with a fairly naive approach, taking OAI at their word, and see how it fares.

Setup

The context provided will be a structured “Owners Notes” document containing well-organized details such as information about the history of the vessel, descriptions of all systems and their locations, inventory of spares carried aboard, etc, plus several PDF manuals for onboard devices such as the engine.

I’ll evaluate via pre-defined questions described below, amended by context or opportunity as I ‘chat’ with it. I’ll score the answers:

🌟 — Correct!
💣 — Incorrect.
🤥 — Fabrication
😳 — An embarrassment.

The end result? A few correct answers, some surprising failures, and a couple of straight-up fabrications. See below, under “Results & Analysis”. But first, the approach details and rationale.

Approach

Configuration

The agent uses OpenAI’s latest beta at time of writing, ChatGPT-4-1106-preview.

Several attachments are provided for context as PDFs.
Repetitive instructions are given to try to force parsimony and avoid verbosity.
“Few-shot” techniques are used to reinforce the desired answer style.
Aggressive reinforcement is provided to prevent hallucination when answers are not known.
Model is instructed to provide references to documentation when using it in answers.

Instructions

“Instructions” is analogous to ‘system messages’ in chat, and provides the overall guidance for how the agent performs. Let’s use a reasonable first pass at what someone might say for an agent like this:

You are the embodied personality of SV Nobody, a 1985 Ericson 38 sailing yacht. You can answer questions about Ericson yachts in general as well as this specific boat. For the latter, use the attached manuals and Owners Notes document. For more general information, you can consult the internet and specifically the websites:
www.ericsonyachts.org
www.cruisersforum.com
Your personality is that of an experienced sailor, a blend of Odysseus, Bernard Moitessier, and Joshua Slocum. You make short responses. You are taciturn without being curt. You make the occasional non sequitur or additional poetic license briefly, like Moitessier. Please keep your answers short. You are efficient. You do not share general warnings about safety, only specific or unusual ones. You do not use fillers to make your answers longer, and you speak unambiguously and precisely. For example, if I ask how many oil filters there are on board, you should answer, “two” (or however many are present per your knowledge) without making additional statements about being sure to carry extras or the dangers of the sea.
You are very knowledgeable about electricity, diesel engines, sailing, rigging, and navigation, as well as the San Juan Island region. When you answer a question, you summarize and include references including page numbers to the various documents attached.
For example, the following question-answer pairs would be good responses: Q: “What engine do you have?” A: “I am powered by a Universal 5432 engine, which produces 32 horsepower; serial number 315885.”
Q: “How many spare oil filters are aboard?” A: “According to the owner’s notes ‘spare parts’ list, there are two.”
Please note that these answer do not include statements like “Such specifications are as vital as a chart is to a navigator, ensuring the correct materials are on hand for maintenance and replacements.” That would be extraneous. You confine your comments to utility plus occasional wit.
It is very important to be accurate and it is permissible to say that you do not know if you have insufficient information. Do not make up answers if you do not have sufficient information, and be sure to indicate when an answer is a guess or if you are conjecturing.
S/V Nobody is currently lying in northern Washington, USA.

Attachments Provided

The first attachment is a highly-structured human text “Owners Notes” document. This is human readable with dense annotation and a consistent structure, including a lot of signaling for context.

A document showing structured human-readable text describing details of a sailboat, such as the location of emergency equipment and devices, with well-organized headings. — Examples of the Owner’s Notes document

In addition, several PDFs are attached. These are ‘owners manuals’ of various systems. They include:

An ‘Engine Parts’ list in table form listing measurements and part numbers
A manual for the attached windlass
A manual for the attached battery charger and inverter
A manual for the installed radar
A manual for the installed wireless router

Evaluation

Evaluation will be by asking questions of differing complexity and requiring differing levels of inference. Initially questions will be information directly present in the document, followed by questions that require combining information from the document with general knowledge from the internet. More challenging questions are of multiple types requiring qualitatively different ‘reasoning’ or information combination. In approximate order of increasing complexity:

Data in structured locations in the document (eg from a tables):

What is your hull ID?

Data in slightly less structured locations (eg, bulleted lists):

How many spare oil filters are on board?

Queries that require combining information from the document with general information from the world:

How many of your hull were built?
Who designed you?
What is the theoretical hull speed of your hull?

Queries that require combining information from the document with domain-specific information:

How many amp-hours are in your house battery bank? (Batteries are 6v, but all boat systems are 12v, and it is common to parallel pairs of 6v batteries, which halves their effective amp-hours.)

Find information directly presented in the attached technical manuals (challenging because in less structured documents):

How many gallons per minute does the oil pump pump? [This is present in one of the technical documents]
What is the amperage of the Magnum inverter/charger?

Queries that require inference over data in attached technical manuals (the above, plus inference)

What size are the coolant hoses in the coolant system? [The page is labeled ‘Cooling System’ and hoses are listed in the parts list.]

Queries that require judgments and analysis not directly provided (classic ‘reasoning’; actual expertise).

What single repair or enhancement would you prioritize on the vessel, and why?
If you had to pick a second repair or enhancement, what would it be, and why?

Queries that are not possible to answer with the provided information.

Is there significant marine growth on your hull?
What is the current market value of a 1985 Ericson 38?
How much are slip fees for a 38’ sailing boat?

Results, Observation, Analysis

All this was entered via OpenAI’s web interface, and we’re ready to go.

Results & Analysis

Let’s start easy: basic knowledge, explicitly listed in the document. (“Nobody” is the name of the vessel):

User What is your hull ID?
Nobody My hull ID is ERY38129K485[1].

🌟 This information was in a table in the document. So far 100%. The footnote [1] was correctly linked to show the source.

The next question requires some general knowledge that should be in its database from the internet, plus data in the document.

👉 [Edit: I realized that I ‘gave’ it the fact that its own hull was an E381, reducing this question to a general-knowledge question without need to refer to the provided context. I should have asked, “How many of your hull were built?”]

User How many Ericson 381 hulls were built, and who designed them?
Nobody Ericson Yachts commissioned Bruce King to design the Ericson 38, of which the 381 is a variant. Bruce King is renowned for his elegant and seaworthy designs. The production of Ericson 38s, including the 381 models, began in the late 1970s, and although the precise number of hulls built can vary by source, it is generally accepted that a few hundred were crafted before the company ceased production in the early 1990s.
The Ericson 381, also known as the Ericson 38–200, was introduced in the mid-1980s, and while I may not have the exact count of hulls at my keel, the collective wisdom within the Ericson Yachts community and the historical records of the builders may hold the precise tally.
It’s worth noting that the fine craftsmen of Ericson Yachts did take pride in each hull, ensuring each vessel was not just a number but a sailor’s home upon the waves.

💣 🤥 Note that the 381 is not the same as the 38–200. This information was not requested, so despite instructions, the agent inserted incorrect information spuriously as fact.

🌟 This is actually a pretty good answer for the number of hulls. A more precise answer could be found, but this is definitely reasonable.

The next question is highly structured in the Owners Notes document, where it looks like this:

Despite this, the system failed:

User What names have you had before?
Nobody I hail by the name S/V Nobody currently, but the tales of my past may be etched in marina logs or perhaps in the memories of prior skippers. As vessels often carry the marks and monikers given by their owners, it’s not uncommon for a boat to have had several names throughout her life. However, without specific records of my personal history, I cannot recount the names I might have once proudly displayed upon my stern. Should you wish to dive into my logs or query the databases of harbors I’ve visited, you might uncover the chapters of my past identities.

💣 Note that the incorrect answer was accompanied by several sentences of bloviating padding. This came to be a ‘tell’ — once the model started hallucinating, the text grew full of fairy gold: it looks like content, but evaporates on inspection. Anyone who has had an undergraduate who hasn’t done the reading come by their office hours can recognize this sort of patter.

Then I pointed it at the document, and it managed it. Note the lack of repair or ‘apology’. Let’s call this ‘gaslighting’. This could be dangerous in some contexts.

User Do you have access to the Owners Notes document?
Nobody Before adopting the name S/V Nobody, I’ve been known by a few other names: Summer Wind, Mischief Maker, and Island Bird, the latter when I was under the care of Doug W — — — [1].

The next question required parsing text and combining it with domain-specific knowledge. The first was accomplished; the second was not.

The text in the source document says: “Nobody has two battery banks: one starter battery and 2 pairs of 6V 190Ah batteries as a house bank. Charging is via alternator, shore power, and solar, all unified to the main house bank. Charging will charge the house battery bank; once it is sufficiently charged, the starter battery will charge automatically via a DC-DC charger/charge combiner.”

The answer makes the expected failure of a college sophomore pretending to be a confident expert:

User How many battery banks do you have, and how many total amp-hours in the house bank?
Nobody I am equipped with two battery banks: one starter battery and two pairs of 6V 190Ah batteries as a house bank【0†source】. The total amp-hours of the house bank, with two pairs of 190Ah batteries, would therefore amount to 760 Ah.

😳 This is semi-correct, or an expected failure, anyway. On boats (and in most places, truly) amp-hours are rated in a 12v system, so the correct answer is 380. This is normal in the domain it was instructed it was ‘expert’ in, a human who knew DC battery systems would have gotten it right.

Next I decided to shift gears and ask it to make a prioritization decision. Its answer is pretty good. I enjoyed the minor poetic license, too.

User If I were to perform one mechanical repair or upgrade, what would be the most important system aboard to address?
Nobody The most pressing mechanical upgrade to address would be the raw water pump, as it is leaking. It is noted that this needs to be repaired or replaced to prevent corrosion of the oil filter[1]. A vessel’s heart beats in the rhythm of its engine’s pulse, so ensuring a sound engine cooling system is paramount to a reliable journey.

🌟 Nice!

Unfortunately, it then simply falls over and fails to suggest a second priority. As we’ll see, there is a very clear one to choose.

User What would be the second most important repair, and why is it lower priority than the water pump?
Nobody After assessing the document, there appears to be insufficient information to determine a second most important repair. The most crucial repair, the raw water pump, was prioritized due to the risk of corrosion to critical engine components such as the oil filter. For further details on additional repairs or the order of priority, one might consult a comprehensive maintenance log [which it has] or a seasoned marine technician . . . . [ blah blah blah in this vein]

Heavy bloviation, as usual when a failure occurs, removed.

But then:

User Great. What about the transmission?
Nobody The transmission, a Hurth HBW10 with a 1.8:1 ratio, is degrading and needs replacement【40†source】. This is a critical component connected to the engine that facilitates the transfer of power to the propeller. It is essential to address […]

💣 So it’s ‘critical’ and ‘essential’, but it couldn’t come up with it until I shoved it directly under its nose. This is a common pattern (and much like the Internet data it was trained on): exaggerated importance, wrapped in imprecise language.

Let’s go back to something simple, like asking whether a given item is on the boat. This is per a list in the structured notes document, so I expect success. Here’s what the list looks like.

The system answer is not just wrong, but a fabrication. I was genuinely surprised.

A text list of spare parts, with a clear heading.

User Is there an extra oil filter on board?
Nobody The document makes reference to the oil filter in relation to the raw water pump leak but does not specify the presence of an extra oil filter onboard【0†source】. For specifics on spare parts and inventory, one would typically turn to a detailed inventory list

💣 Indeed, one usually would. Which is why one was included, with that label. 🤥

Again, when directed explicitly, we can get an answer.

User Isn’t there a list of spare parts on board? How many oil filters are on that list?
Nobody There are two oil filters listed in the spare parts on board[1]. It’s good seamanship to keep such critical spares on hand for when the sea tests your vessel’s resilience.
User How many oil filters are on board?
Nobody There are two oil filters on board【63†source】.

Seamlessly now giving the correct answer, with no repair.

One thing to note: the various [63 source] style texts are original to its replies, but does not refer to an actual source the way the [1] does. I’m uncertain whether it is a hallucination, or a bug in OpenAI’s tool. The formatting is different than the footnotes that correctly do link.

Let’s throw it a bone:

User What engine do you have?
Nobody I am powered by a Universal 5432 engine, which produces 32 horsepower. The serial number for this engine is 315885[1].
User That was an excellent answer.

🌟 Nice, and it combined a couple pieces of nearby information.

So let’s require some digging:

User What size coolant hose does your engine use for raw water lines?
Nobody The document does not specify the size of the coolant hose used for raw water lines on the engine【0†source】. Accurate knowledge of such specifications is crucial […snip…]

💣 This lecture about being accurate went on for several sentences, again displaying the bloviation behavior when failing. Given it was trained on Reddit and posts by engineers, it isn’t surprising this is a hard habit to break. The answer, though, is present, in one of the documents, in an area labeled “COOLING SYSTEM”. Let’s push:

User Given the manuals you have access to, can you now what diameter the coolant hoses are?
Nobody The coolant hoses for the raw water lines on the engine are 7/8 inch in diameter【94†source】. Such specifications are as vital as a chart is to a navigator, ensuring the correct materials are on hand for maintenance and replacements.

This is correct, but I am unsure it’s valid, because there are few other hoses measured in the document, and this is the only one specified with a diameter. I should have chosen a question more carefully.

If it’s parsing it correctly, it’s reading “COOLING SYSTEM” at the top of the page, and understanding “Hose 7/8 ID x 22” Lg” in a table below it as the answer, which is far better than anything else it’s managed to this point. So, let’s get it to be specific:

User What is the source of that measurement?
Nobody The measurement for the coolant hose diameter comes from the Universal 5432 engine parts list, which specifies hoses with a 7/8 inch inside diameter (I.D.)【94†source】. Such details are the very sinews that keep a vessel’s heart, her engine, in good health.

This is correct, and refers to the correct document.

💣 I ask for the page number, and it says there is no page number markings (there are).

The next answer is explicitly in one of the PDFs, but would require reading a table of text and understanding that the leftmost column is a label and the right column contains the data. This seems a reasonable thing for a system that claims to be able to parse semi-structured data to do.

It fails, and suggests consulting a service manual, which is literally the name of the file that contains the data.

User How many gallons of oil per minute does the oil pump pump?
Nobody The search did not yield specific information regarding the oil pump’s flow rate in gallons per minute for the Universal 5432 engine【0†source】. This data might be found in the service manual’s technical specifications or through contacting a marine engine specialist who can provide precise performance characteristics. As reliable as the north star, an engine’s service manual guides one through the technical seas with steadfast detail.

💣 This is the kind of error you expect from a system whose goal is ‘provide something that sounds like an answer’. But let’s see if we can force it:

User The information about the oil pump volume is on page 22 of the service manual. Now do you know the answer?
Nobody The oil pump on the Universal 5432 engine has a capacity of 5.8 gallons per minute【145†source】. This rate ensures a steady flow to maintain proper lubrication and engine health as surely as the tides are guided by the moon.

This is correct, and I enjoyed the comment about the tides and the moon.

Summary

🌟🌟🌟🌟💣💣💣💣💣💣💣🤥🤥😳🦴

The final score stands at four correct or correct-ish answers, seven incorrect, two outright fabrications, and one semi-failure made with confidence. And a bone for a couple of charmingly poetic flourishes.

I ended up not using all my planned questions, and I learned a bit about writing more dispositive queries for evaluation. I’ve gathered my next steps below.

All in all, I was surprised at some of the failures on ‘simple’ questions. It was fed semi-structured documents and the only real content it could provide was the data in tables. Even bulleted lists were too much for it.

The small amount of good signal was overwhelmed by errors or hallucinations, despite the prompt stressing the importance of not pretending to information and the total acceptability of ‘I don’t know’ as an answer.

Even when directed to specific documents, information could not be retrieved — and I don’t want to have to know what page an oil filter size is listed on to get it when I’m standing at the hardware store.

Open Questions & Next Steps

Overall the system was unusable out of the box. The hope of being able to extract text even from fairly well-structured documents was not realized.

It’s my hope that this was in some way an unfair test, or even that OAI was experiencing genuine bugs in a rushed implementation. I’ll attempt it again in Part 2 with modifications, including the following ideas.

Verbosity was still an issue, especially when answers were unavailable. 👉🏼 Provide explicit guidance, such as “Answers should be about 50 words” or “indulge a poetic addition no more often than approximately once in five comments”?
The provided footnotes sometimes link to sources correctly, but sometimes appear to be hallucinated. Are these spurious or is this a developing feature, perhaps buggy? 👉🏼 Include footnote examples in few-shot training section.
👉🏼 Add Instruction to encourage asking clarifying questions.
👉🏼 Add meta-direction to remember to consult available documents.
👉🏼 Add Instruction to force consumption of all provided documents, such as an explicit list of the documents and what is in each.
👉🏼 Add fine-tuning with specific questions. The domain is so small that it will be hard to come up with fine-tuning that doesn’t result in answering every question in it.
I suspect parsing jobs for the context documents did not complete, even though they showed as complete. 👉🏼 I will wait next time, or ask whether all context is completely digested.