Exploring LLMs Performance on Accounting Exams

Ahmedhelow
Published in Sage Ai
8 min read · Mar 27, 2024

Introduction

In the realm of artificial intelligence, the integration of Large Language Models (LLMs) into our daily technological toolkit has become increasingly prevalent. These advanced models, equipped with the ability to process and generate human-like text, have shown remarkable success across a variety of tasks and are making significant inroads into domain-specific applications. Yet, when it comes to the specialized fields of accounting and finance, the question remains: how effectively can these models perform on complex, domain-specific, and specialized tasks like the CPA (Certified Public Accountant) exam? Taking on this challenge, Sage Ai embarked on a project to explore the extent to which GPT-4 and similar LLMs can comprehend and solve CPA exam questions. Leveraging a suite of methodologies and techniques developed at Sage Ai, we aim not only to assess the capabilities of these models but also to amplify them, significantly improving their success rate. Our ultimate goal is clear: to harness AI's potential in specialized domains, transforming expert accounting knowledge into a widely accessible asset and using the power of AI to enhance effectiveness in financial management across the globe.

Background

The CPA exam stands as a crucial milestone for aspiring accounting professionals, serving as a comprehensive assessment of their proficiency in financial regulations, accounting principles, and strategic management skills. It is a rigorous examination that not only assesses the theoretical understanding and practical application of accounting principles but also evaluates analytical and problem-solving skills in real-life scenarios, making it a major challenge for those entering the profession.

When it comes to Large Language Models (LLMs) like GPT-4, navigating the CPA exam’s complex, math-heavy concepts presents a unique set of challenges. Unlike tasks that primarily involve natural language understanding and generation, the CPA exam requires a deep grasp of numerical reasoning, financial analysis, and the application of accounting standards — areas where LLMs have historically shown limitations [8]. These models, while exceptional in processing and generating text, often struggle with the precise and logic-driven demands of mathematical reasoning inherent in accounting tasks. Previous efforts to evaluate LLMs on similar tasks have revealed these shortcomings, highlighting a gap between their current capabilities and the specialized requirements of the CPA exam.

Project goals

Recognizing this gap, our project sought not only to evaluate AI's current capabilities but also to identify strategies that could enhance its performance in this specialized domain. To address these challenges, we focused on assessing the problem-solving abilities of LLMs. Our approach involved experimenting with various prompt engineering techniques known to improve LLM performance, such as few-shot prompting, chain of thought reasoning, and external LangChain tools like calculators and Python interpreters. These techniques, widely regarded for their ability to guide LLMs toward more accurate responses, were essential to our strategy for improving the models' performance on complex problems and their understanding of accounting concepts.

Past research has tested the capabilities of earlier LLMs, like GPT-3.5, against the CPA exam, only to find that they fell short of expectations, as these models struggled to navigate the problem-solving scenarios presented by the exam. Motivated by these findings, Sage Ai aimed to leverage its unique expertise and the advanced capabilities of GPT-4 to bridge this gap. Our goal was to overcome previous limitations and to demonstrate that, with the right approach, LLMs can indeed tackle the specialized challenges of the CPA exam, breaking new ground for AI applications in accounting and finance.

Methodology

In our quest to evaluate the proficiency of LLMs against CPA exam questions, we constructed a methodology that enabled us to have a detailed overview of the LLM performance. The CPA exam is divided into four key sections: Auditing and Attestation (AUD), Business Environment and Concepts (BEC), Financial Accounting and Reporting (FAR), and Regulation (REG). Each section brings its own set of challenges, from staying updated on the latest accounting rules and regulations, which the LLM may not be trained on, to tackling math-heavy questions — a significant barrier given that LLMs lack built-in calculators for direct computation.

FAR Example Question [3]

Starting our evaluation, we curated a dataset of 339 multiple-choice questions gathered from three trusted sources. These questions were parsed into JSON objects, capturing essential metadata such as the correct answer, question type, source, and other relevant details. This structured approach allowed us to carefully assess the LLM's performance across the diverse areas of the CPA exam.

CPA Question Represented in JSON Format
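
For illustration, a single parsed question might look something like the record below; the field names and the sample question are illustrative rather than the exact schema and items we used.

{
  "id": "FAR-041",
  "section": "FAR",
  "type": "multiple-choice",
  "source": "practice-exam",
  "question": "Which of the following is reported as a component of other comprehensive income?",
  "choices": {
    "A": "Unrealized gains on available-for-sale debt securities",
    "B": "Dividends paid to shareholders",
    "C": "Gain on sale of equipment",
    "D": "Interest expense"
  },
  "answer": "A"
}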

To interface with the LLM, we developed a two-part code infrastructure. The first part is a custom API function, get_gpt4_response, designed specifically to communicate with OpenAI's GPT-4 via the Azure API, specifying parameters such as the engine, model, temperature, and top_p to customize each request to our needs.

API function to query GPT-4 with CPA exam questions
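
As a rough sketch of how such a function can be put together, assuming the openai Python client (v1+) pointed at an Azure deployment; the deployment name, API version, and environment variables are placeholders rather than our production configuration.

import os
from openai import AzureOpenAI

# Client configured for an Azure OpenAI deployment (endpoint and key are placeholders).
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def get_gpt4_response(prompt, engine="gpt-4", temperature=0.0, top_p=1.0):
    """Send a single CPA exam prompt to GPT-4 and return the model's text reply."""
    response = client.chat.completions.create(
        model=engine,  # Azure deployment name for GPT-4
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # low temperature keeps answers close to deterministic
        top_p=top_p,
    )
    return response.choices[0].message.content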

Building upon this, the second function, get_answer, systematically processes a JSON file containing an array of CPA exam questions. It iterates through each question, constructing prompts that include the question text along with its multiple-choice options. These prompts are then fed into our get_gpt4_response function to retrieve the LLM's answers. Upon receiving responses, the function records them, pairing each with its corresponding question number to enable a structured analysis of the model's performance.

Processing CPA Exam Questions
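
In outline, that processing loop looks roughly like the following; the prompt wording and file layout are simplified for illustration, and get_gpt4_response refers to the sketch above.

import json

def get_answer(questions_path, results_path):
    """Run every parsed CPA question through GPT-4 and record the responses."""
    with open(questions_path) as f:
        questions = json.load(f)  # array of question objects

    results = []
    for number, q in enumerate(questions, start=1):
        # Build a zero-shot prompt from the question text and its answer choices.
        options = "\n".join(f"{key}. {text}" for key, text in q["choices"].items())
        prompt = (
            "Answer the following CPA exam question by selecting one option.\n\n"
            f"{q['question']}\n{options}\n\nAnswer:"
        )
        reply = get_gpt4_response(prompt)
        results.append({"question_number": number, "model_answer": reply})

    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)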

Our evaluation employed a variety of prompting techniques and practices to best simulate the exam environment. Zero-shot prompting tested the LLM's ability to solve questions without prior examples, while the chain of thought method encouraged the model to break down complex problems into simpler, manageable steps. Additionally, we leveraged outside tools such as a LangChain Python agent, capable of writing and executing code to solve math-related problems.

The snippet below showcases our integration of a LangChain Python agent. By writing and executing Python code on demand, the agent provides a dynamic approach to problem-solving, bridging the gap between theoretical knowledge and practical application. This method not only enhances the accuracy of our evaluations but also mirrors real-world scenarios where professionals often rely on computational tools to resolve complex issues.

Integration of LangChain Python Agent
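
For reference, an agent of this kind can be assembled roughly as follows with langchain_experimental; the package paths and the AzureChatOpenAI wrapper reflect the LangChain releases available at the time and are assumptions rather than our exact setup.

from langchain_openai import AzureChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain_experimental.tools import PythonREPLTool

# GPT-4 acts as the reasoning engine behind the agent (deployment name is a placeholder).
llm = AzureChatOpenAI(azure_deployment="gpt-4", api_version="2024-02-01", temperature=0)

# The agent can write and execute Python code to work through math-heavy questions.
agent = create_python_agent(llm=llm, tool=PythonREPLTool(), verbose=True)

question = (
    "A machine costs $60,000, has a salvage value of $6,000, and a useful life of "
    "9 years. What is the annual straight-line depreciation expense?"
)
print(agent.invoke(question))  # the agent computes (60000 - 6000) / 9 = 6000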

The criteria for success were straightforward: correctness was determined solely by whether the LLM’s response matched the original question’s answer, without considering any explanatory text the model might provide. This focus on the final answer allowed for a clear, objective measure of the LLM’s understanding and application of accounting knowledge.
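
In practice, that check reduces to pulling the selected option out of the model's reply and comparing it with the answer key; a simplified version (the extraction regex is illustrative) could look like this.

import re

def is_correct(model_reply, correct_choice):
    """Return True if the first standalone option letter in the reply matches the key."""
    match = re.search(r"\b([A-D])\b", model_reply)
    return bool(match) and match.group(1) == correct_choice

print(is_correct("The correct answer is B, because ...", "B"))  # True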

While our primary focus was on evaluating GPT-4, given its capabilities compared to GPT-3.5, we also allocated time to experiment with Claude-v2. The results of these experiments promise to shed further light on the potential and challenges of utilizing LLMs within the domain of professional accounting knowledge.

LLM Evaluation Pipeline Overview

Results and Analysis

In our investigation into the capabilities of LLMs like GPT-4 in tackling CPA exam questions, we discovered quite a bit about their strengths and where they need a bit more help. Starting off with GPT-4, we found that when we used zero-shot prompting — basically asking it to pick the right answer out of four choices without giving any explanations or showing its work — it got a success rate of 71.37%. That’s pretty good, but still just short of the 75% you need to pass the CPA exam. However, the integration of a Python agent, which writes and executes Python code for calculations, marked a significant improvement, pushing the success rate up to 78.4%.

LangChain Python Agent Example

The real game-changer came with chain of thought prompting. By instructing GPT-4 to break down its thinking step by step, we not only decomposed complex questions into more manageable parts but also achieved an average success rate of 81.8%, surpassing the passing mark. This method’s effectiveness highlights the potential of LLMs in processing and solving intricate problems through a step-by-step approach.

Chain of Thought Example
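
A chain of thought prompt in this spirit can be as simple as wrapping the question in step-by-step instructions; the wording below is an example rather than our exact template, and it reuses q, options, and get_gpt4_response from the earlier sketches.

# Illustrative chain of thought wrapper around a single question.
cot_prompt = (
    "You are taking the CPA exam. Work through the problem step by step: "
    "identify the relevant accounting rule, perform any calculations, and only "
    "then state your final choice on a new line as 'Answer: <letter>'.\n\n"
    f"{q['question']}\n{options}"
)
reply = get_gpt4_response(cot_prompt)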

On the other side, our experiments with Anthropic's Claude-v2, particularly using chain of thought prompting, yielded an average success rate of 63%. While this performance falls behind GPT-4, Claude-v2 demonstrated faster response times and a stronger ability to follow instructions than GPT-4.

Model Performance on the CPA Exam

Moreover, the integration of RAG (Retrieval Augmented Generation) into our methodology could further enhance LLM performance. RAG allows the model to pull in external information relevant to the task at hand, essentially giving it access to a broader knowledge base than it was originally trained on. This can be particularly beneficial for questions that rely on the latest accounting standards or up-to-date tax regulations. By combining RAG with the prompting techniques above, the overall success rates of LLMs could improve substantially in complex domains like accounting and finance.
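
Conceptually, this amounts to retrieving a few relevant passages and prepending them to the prompt. In the sketch below, retrieve_passages is a hypothetical helper backed by an index of current standards and regulations, not something we built for this evaluation; q, options, and get_gpt4_response reuse the earlier sketches.

# Hypothetical retrieval step: fetch the passages most relevant to the question.
passages = retrieve_passages(q["question"], top_k=3)
context = "\n\n".join(passages)

rag_prompt = (
    "Use the reference material below to answer the CPA exam question.\n\n"
    f"Reference material:\n{context}\n\n"
    f"{q['question']}\n{options}"
)
reply = get_gpt4_response(rag_prompt)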

Selection Bias

To enrich our methodology and ensure a comprehensive evaluation of LLMs on CPA exam questions, it is crucial to address the concept of selection bias, particularly in the context of multiple-choice questions. LLMs, including GPT-4, face inherent challenges with multiple-choice formats, primarily due to selection bias. This bias arises in two main forms: token bias, where the model favors particular option labels (such as "A") regardless of their content, and position bias, where it favors answers that appear in certain positions among the choices.

To address this issue of selection bias, we implemented two strategies: firstly, by randomizing the choices to ensure we bypass position bias, and secondly, by removing choice IDs (e.g., A, B, C, D) to eliminate token bias. These adjustments were pivotal in creating a more neutral testing environment, allowing us to more accurately measure the LLM’s comprehension and reasoning abilities without the interference of external biases.
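
A small helper along these lines captures both adjustments; it assumes the illustrative question schema shown earlier.

import random

def debias_prompt(q, seed=0):
    """Shuffle the answer options and drop their letter labels before prompting."""
    texts = list(q["choices"].values())
    random.Random(seed).shuffle(texts)  # randomized order counters position bias
    options = "\n".join(f"- {text}" for text in texts)  # no A-D labels counters token bias
    return (
        f"{q['question']}\n{options}\n\n"
        "Respond with the exact text of the correct option."
    )

With the labels removed, responses are graded by matching the option text rather than a letter.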

Conclusion

The project's outcomes highlight a promising future in which LLMs can serve as valuable resources for professionals, offering assistance with complex financial analysis, regulatory compliance, and strategic decision-making. The work also gave us a deeper understanding of the strengths and limitations of current AI technologies in professional settings. LLMs passing the CPA exam under certain conditions marks a significant step forward in the integration of AI into accounting education and practice.

As we look towards the future, it is clear that the application of LLMs in accounting and beyond holds tremendous promise. The continuous advancement of AI models, coupled with innovative methodologies and a deeper integration of domain-specific knowledge, will undoubtedly pave the way for more accessible, efficient, and sophisticated tools for professionals across a range of industries.

References

  1. https://www.accountingtoday.com/news/we-ran-the-cpa-exam-through-chatgpt-and-it-failed-miserably
  2. https://openreview.net/pdf?id=shr9PXz7T0#:~:text=Multiple choice question (MCQ) is,as exemplified in Figure 1
  3. https://www.vellum.ai/blog/chain-of-thought-prompting-cot-everything-you-need-to-know
  4. https://python.langchain.com/docs/integrations/toolkits/python
  5. https://www.xda-developers.com/why-llms-are-bad-at-math/
  6. https://python.langchain.com/docs/integrations/tools/python
