Evaluating GPT-4o on Financial Tasks

Jacqueline Garrahan
May 16, 2024


On May 13, OpenAI announced its new flagship model, GPT-4o, capable of reasoning over visual, audio, and text input in real time. The model has clear benefits in response time and multimodality, offering voice-processing latency on the order of hundreds of milliseconds, comparable to human conversational cadence. OpenAI has also reported performance comparable to GPT-4 across a set of standard text benchmarks, including MMLU, GPQA, MATH, HumanEval, MGSM, and DROP. However, standard benchmarks often focus on narrow, artificial tasks that don't necessarily translate to aptitude on real-world applications. For example, both the MMLU and DROP datasets have been reported to be highly contaminated, with significant overlap between training and test samples.

At Aiera, we employ LLMs on a variety of financial-domain tasks, including topic identification, summarization, speaker identification, sentiment analysis, and financial mathematics. We've developed our own set of high-quality benchmarks for evaluating LLMs on each of these tasks, enabling intelligent model selection so that our clients can be confident in our commitment to the highest performance.
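
To give a sense of what a task-level benchmark looks like, here is a minimal sketch of a sentiment-classification evaluation loop. The prompt, examples, and scoring are illustrative stand-ins rather than our production harness, and the client usage assumes an OPENAI_API_KEY in the environment.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative examples only -- not Aiera's benchmark data.
EXAMPLES = [
    {"text": "Revenue grew 12% year over year, beating guidance.", "label": "positive"},
    {"text": "Margins compressed sharply on higher input costs.", "label": "negative"},
    {"text": "The company reiterated its full-year outlook.", "label": "neutral"},
]

PROMPT = (
    "Classify the sentiment of the following financial text as positive, "
    "negative, or neutral. Reply with a single word.\n\nText: {text}"
)

def classify(text: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def accuracy(model: str = "gpt-4o") -> float:
    hits = sum(classify(ex["text"], model) == ex["label"] for ex in EXAMPLES)
    return hits / len(EXAMPLES)

print(f"gpt-4o sentiment accuracy: {accuracy():.0%}")

Scoring each candidate model against a shared labeled set like this is what makes the per-task comparisons below possible.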

In our tests, GPT-4o trailed both Anthropic's Claude 3 models and OpenAI's prior releases in several domains, including identifying the sentiment of financial text, performing computations against financial documents, and identifying speakers in raw event transcripts. On reasoning (BBH), GPT-4o exceeded Claude 3 Haiku but fell behind the other OpenAI GPT models, Claude 3 Sonnet, and Claude 3 Opus.

[Figure: Spider plot comparison of task performance]
[Figure: Raw evaluation scores]

As for speed, we used the LLMPerf library to measure and compare performance across both Anthropic's and OpenAI's model endpoints. This was a modest analysis: 100 synchronous requests against each model, using the LLMPerf tooling as shown below:

python token_benchmark_ray.py \
--model "gpt-4o" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 100 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
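
Each run writes a summary JSON into the --results-dir. The exact file layout and key names vary across LLMPerf versions, so the short helper below simply surfaces anything latency- or throughput-related rather than assuming a fixed schema:

import glob
import json

# LLMPerf writes one *_summary.json per run into the results directory.
for path in sorted(glob.glob("result_outputs/*_summary.json")):
    with open(path) as f:
        summary = json.load(f)
    print(f"== {path} ==")
    for key, value in summary.items():
        # Key names differ by LLMPerf version; match loosely.
        if any(tag in key for tag in ("ttft", "latency", "throughput")):
            print(f"  {key}: {value}")
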
[Figure: Performance test results]

Our results reflect the touted speedup over gpt-3.5-turbo and gpt-4-turbo. Despite the improvement, Claude 3 Haiku remains superior on every dimension except time to first token and, given its performance advantage, remains the best choice for timely text analysis.

Despite it’s shortcomings on the highlighted tasks, I’d like to note that the OpenAI release remains impressive considering its multimodality and indication of a world to come. GPT-4o is assumedly quantized given it’s impressive latency and therefore may suffer from performance degradation due to reduced precision. The decrease in computation burden from such reduction facilitates quicker processing, but introduces potential errors in output accuracy and variations in model behavior across different types of data. These trade-offs necessitate careful tuning of the quantization parameters to maintain a balance between efficiency and effectiveness in practical applications.

Consistent with OpenAI's release cadence to date, subsequent versions of the model will roll out in the coming months and will doubtless demonstrate significant performance jumps across domains.

To learn more about how Aiera can support your research, check out our website or send us a note at hello@aiera.com.
