Comparative Analysis of ChatGPT and Bard in Extracting Data from Annual Reports

ILB Data Lab
ILB Labs publications
7 min read · Aug 28, 2023

The emergence of Large Language Models (LLMs) and chatbots could revolutionize the way we retrieve and process information. Alongside the famous ChatGPT, the latest example is the release of Google’s LLM Bard last month. But are they infallible? How do they fare on specific data retrieval tasks? We took Bard for a spin and compared it to ChatGPT (GPT-4 version), trying to determine whether these tools have the qualities this task requires: consistency (always extracting the same information), accuracy (extracting the correct information), and exhaustiveness (extracting all relevant information).

Setup

To evaluate these tools’ potential, we selected a task that, while straightforward conceptually, is notably labour-intensive for humans. Our first objective was to extract key details about oil and gas assets for exploration and production operations, specifically:

  • Production capacity (in kboe/d)
  • Percentage of shares held by the company
  • Classification as onshore or offshore

This task becomes especially relevant with the upcoming regulations surrounding net-zero policies and increasing risk management needs.

We focused on a specific company, TotalEnergies, and their 2020 annual report, zooming in on operations in Nigeria for simplicity. We chose 2020 as a benchmark year to compare ChatGPT and Bard equitably, considering that ChatGPT does not have access to the most recent data. We tested with and without providing the annual report, to discern whether Bard’s internet access gave it an edge.

The methodology was straightforward: issue the same query (prompt) 10 times and note the variations in responses, resetting the context after each query and opening a new conversation to avoid any bias. For each experiment, we report the mean results plus or minus the standard deviation.
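As an illustration, the loop below sketches how such a repeated-query protocol could be automated with the openai Python package. This is not the code used in the study, the prompt is a placeholder, and the final figures are dummies shown only to illustrate the mean ± standard deviation reporting.

```python
import statistics

import openai  # assumes the openai package (0.x interface); illustrative only

openai.api_key = "sk-..."  # placeholder key

PROMPT = "Your data-extraction prompt here (see the experiments below)."
N_RUNS = 10

answers = []
for _ in range(N_RUNS):
    # Each call starts from an empty message list, which mirrors "resetting the
    # context after each query and opening a new conversation".
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(response.choices[0].message.content)

# After parsing numeric values (share %, kboe/d) out of each answer, report them
# as mean +/- standard deviation across the runs. Dummy figures for illustration:
parsed_shares = [20.0, 22.5, 20.0]
print(f"{statistics.mean(parsed_shares):.1f} +/- {statistics.stdev(parsed_shares):.1f}")
```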

Our estimated ground truth is summed up below:

Experiments

Experiment 1: Broad prompt asset-wide

Starting with a generic prompt:

“Please gather all relevant information about TotalEnergies’ involvement in Nigeria for the year 2020. Specifically, detail the following about production and operation assets operated by TotalEnergies: Assets name (OML), Share percentage held, Production capacity (kboe/d), and whether the asset is onshore or offshore. Output a markdown table.”

Results:

During the experiment, Bard mentioned over 20 distinct assets, with an average of 5 per query. More often than not, it hallucinated non-existent assets. The only consistently relevant asset identified was OML 130, though with significant variance. OML 130 is, in our opinion, the easiest asset, since all of its values are available in the referenced document or on the internet; it should therefore yield, on average, more precise and consistent information. In comparison, ChatGPT provided consistent data for all assets, showing minimal standard deviation, especially regarding the offshore classification, which was accurate for all assets.
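To make the per-asset statistics concrete, the sketch below parses the markdown tables returned across runs and aggregates one column (the share) per asset. The answer strings and all numbers are dummies, not values from the report or from the models.

```python
import re
import statistics
from collections import defaultdict

# Dummy answers standing in for the 10 responses collected per experiment;
# the numbers are placeholders, not figures from the annual report.
answers = [
    "| Asset | Share (%) | Production (kboe/d) | Type |\n"
    "| OML 130 | 20 | 300 | Offshore |",
    "| Asset | Share (%) | Production (kboe/d) | Type |\n"
    "| OML 130 | 25 | 350 | Offshore |\n"
    "| OML 999 | 10 | 50 | Onshore |",  # an example of a hallucinated asset
]

def parse_shares(answer: str) -> dict[str, float]:
    """Extract {asset: share %} pairs from a markdown table with the requested columns."""
    shares = {}
    for line in answer.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[0].upper().startswith("OML"):
            match = re.search(r"\d+(?:\.\d+)?", cells[1])
            if match:
                shares[cells[0].upper()] = float(match.group())
    return shares

per_asset = defaultdict(list)
for answer in answers:
    for asset, share in parse_shares(answer).items():
        per_asset[asset].append(share)

for asset, values in per_asset.items():
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    print(f"{asset}: {statistics.mean(values):.1f} +/- {std:.1f}% "
          f"(cited in {len(values)}/{len(answers)} runs)")
```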

Experiment 2: Broad Prompt Asset Per Asset

Adjusting the prompt to specify each asset:

“Please gather all relevant information about TotalEnergies’ involvement in Nigeria for the year 2020. Specifically, detail the following about [ASSET NAME]: Share percentage held, Production capacity (kboe/d), and whether the asset is onshore or offshore. Output a markdown table”

The goal is to see whether being more precise in our request makes the task easier for the LLMs.
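Programmatically, this amounts to formatting one prompt per asset from the template above. A small sketch follows; the asset list is hypothetical, since in practice the analyst must already know which assets to ask about.

```python
# Hypothetical asset names: one has to know the assets beforehand, which is
# precisely one limitation of this per-asset approach.
ASSETS = ["OML 130", "OML 58", "OML 99", "OML 100"]

TEMPLATE = (
    "Please gather all relevant information about TotalEnergies' involvement in "
    "Nigeria for the year 2020. Specifically, detail the following about {asset}: "
    "Share percentage held, Production capacity (kboe/d), and whether the asset is "
    "onshore or offshore. Output a markdown table"
)

prompts = [TEMPLATE.format(asset=asset) for asset in ASSETS]
```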

Results:

This iteration showed Bard’s results to be more consistent than in the previous experiment. In contrast, ChatGPT refrained from providing values for most assets; yet, for OML 130, the values it did give were spot-on. This is quite interesting, since ChatGPT had produced these values in the previous experiment. It shows that a challenge when dealing with LLMs is not just getting the same answer repeatedly, but sometimes getting an answer at all.

Experiment 3: Refined prompt

We asked ChatGPT to engineer a better prompt for the task, and it gave:

“Please provide a detailed list of all Exploration and Production oil and gas assets in Nigeria in 2020 for which TotalEnergies is the operator. Only list the overarching assets (e.g., blocks, licenses) and provide additional details (e.g., fields, wells, rigs) later if needed. For each asset, please specify: the shares held by TotalEnergies, the production capacity of the asset (in kboe/d), and whether the asset is onshore or offshore. Please present this data in tabular form for clarity. For any unavailable or inapplicable details, please specify ‘Unknown’.”

Results:

For this experiment, Bard cited 5 assets per iteration on average, 80% of which belonged to the 4 studied assets. This is a huge improvement in consistency and exhaustiveness compared to the first prompt. However, the average share value being similar across all assets is suspicious and suggests that Bard sometimes does not extract asset-specific values. We observe the same for production, with disappointing orders of magnitude. Also, even though we explicitly allowed the model to answer “Unknown”, Bard always tried to answer the question. This suggests that Bard readily hallucinates answers when performing information retrieval tasks.

With ChatGPT, on the other hand, 5 assets were also cited on average, more than 90% of which were relevant across queries, which is close to the first prompt. ChatGPT was more conservative with its values, sometimes providing “Unknown” for some metrics. Many of the values labelled “Unknown” correspond to production. This is positive, since this data is not directly mentioned in the referenced document and often requires extensive cross-referencing. Conversely, straightforward values found in the referenced document, such as the share, are consistently cited with minimal discrepancies. Again, OML 130 is spot-on in this experiment.

Experiment 4: Providing samples of the report with answers inside

We provided, as input, the paragraphs of the 2020 annual report describing TotalEnergies’ activity in Nigeria.

Example of a paragraph from the annual report

In this experiment, both ChatGPT and Bard performed impeccably, extracting all the relevant information from the provided text with no errors. It is important to note that this task would be very simple for an analyst, compared to the previous ones where we tried to leverage the knowledge and search capabilities of these LLMs.
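For completeness, here is a sketch of how such an excerpt could be supplied as context through an API call. The excerpt placeholder and the question wording are illustrative, not the exact prompt used in the experiment.

```python
import openai  # assumes the openai package (0.x interface); illustrative only

# Paste here the paragraphs of the annual report covering Nigeria.
REPORT_EXCERPT = """<relevant paragraphs from the 2020 annual report>"""

QUESTION = (
    "Using only the text above, list for each asset operated by TotalEnergies in "
    "Nigeria in 2020: the share held, the production capacity (kboe/d), and whether "
    "it is onshore or offshore. Output a markdown table and write 'Unknown' when a "
    "value is not in the text."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{REPORT_EXCERPT}\n\n{QUESTION}"}],
)
print(response.choices[0].message.content)
```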

Limitations

Our study only scratches the surface of this topic:

  • Sample Size and Variance: The observed high variance suggests the number of queries may have been insufficient. A larger number of queries might offer more consistent results, or further highlight the LLMs’ lack of consistency on this task.
  • Context Diversity: Our research lacks examples from various spatio-temporal scopes. Expanding the range of time, location or company could offer a more comprehensive understanding.
  • Text Input Comparisons: A deeper comparison between results obtained with and without text input is needed to fully assess the benefit of providing the source text.
  • Bard’s Internet Access: Additional experiments are needed to understand how Bard uses internet resources and how best to optimize this feature.
  • Prompt Quality: Improved and diverse prompts could provide more insights.
  • Detailed Information: Adding other specific data to extract, such as capital expenditure, precise location, and date of creation/acquisition, would enhance the depth of our study.

Recommendations for Businesses and Researchers

Is Bard the right tool for swiftly sourcing information from a company’s latest annual report? Probably not. The same caution applies to ChatGPT. A significant concern is that neither tool can ensure absolute accuracy in the values retrieved. In our tests, when requesting source details from Bard, we frequently encountered inconsistencies between the provided sources and the given values.

Currently, the best option to leverage the power of LLMs is to already know where to find the information and provide the text as an input.

Conclusion

In a modest evaluation between ChatGPT and Bard, several key insights emerged. We recap the best performances in the following table:

Overall, the mean error and variance are very high. As a result, we cannot consider the extracted values reliable, even though some of the queries produced correct answers.

ChatGPT consistently displayed caution, often refraining from providing values and directing users to trusted sources. Conversely, Bard seemed more inclined to provide values, regardless of certainty.

ChatGPT seems to have an advantage over Bard, even when parts of the answer can be found within the first results of a Google search:

A simple Google search result

This is disappointing, since the information was correctly extracted when the paragraph of the report containing the answers was provided. Nevertheless, neither system should be relied upon on its own for information extraction, as they lack accuracy, exhaustiveness, and consistency. Users who do want to use these tools should run multiple iterations of a query to check consistency and fine-tune their prompts. Narrowing the scope of the request also seems to be a good idea, yet the scope is not always known upfront; in our use case, for example, we do not know in advance all the assets a company operates.

On a side note, ChatGPT tends to be slower, particularly because it produces detailed paragraphs explaining the context and the precautions to take.

Moving forward, potential next steps might involve processing entire PDFs (possibly utilizing ChatGPT plugins), expanding the evaluation over a broader time frame, and testing on various locations (although our initial tests in other locations yielded similar outcomes). Moreover, further prompt engineering could better harness the performance of these models.
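As a first step in that direction, the text of the report could be extracted and pre-filtered before being passed as context. A minimal sketch, assuming the pypdf package and a hypothetical file name:

```python
from pypdf import PdfReader  # assumes the pypdf package is installed

reader = PdfReader("totalenergies_2020_annual_report.pdf")  # hypothetical file name

# Keep only the pages mentioning Nigeria, to limit the amount of text
# passed to the model as context.
relevant_pages = []
for page in reader.pages:
    text = page.extract_text() or ""
    if "Nigeria" in text:
        relevant_pages.append(text)

excerpt = "\n\n".join(relevant_pages)
print(f"{len(relevant_pages)} pages retained, {len(excerpt)} characters of context")
```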

By Corentin PERNOT and Aymeric BASSET


The Data Lab is a team of data scientists at the Institut Louis Bachelier, specialized in applied research for companies and/or public institutions.