
Bullish, Bearish, or Just Meh? Fine Tuning LLMs to Beat Traditional ML at Financial Sentiment

7 min read · Sep 11, 2025
A head-to-head benchmark of ML vs LLMs on financial news sentiment.

Before we get started, if you’d like to follow along and see my scripts for this project, you can find them on my GitHub along with a detailed README!

The Spark Behind the Project

Ever wondered whether the financial news you skim over each morning is actually optimistic, pessimistic, or just plain neutral? As someone who’s been diving deeper into LLMs, I wanted to see whether modern large language models could truly understand financial context better than traditional machine learning models.

For this project, I used the zeroshot/twitter-financial-news-sentiment dataset from Hugging Face, which contains tweets about financial markets labeled as bullish, bearish, or neutral. The goal? Compare classic ML approaches like Random Forest and SVC against base and fine-tuned LLMs, both open and closed source, and see who comes out on top in understanding market sentiment.

Spoiler alert: the LLMs left traditional ML models in the dust!

Dataset Overview

The dataset is straightforward but rich:

  • Source: Twitter (financial news and market commentary)
  • Labels: Bullish, Bearish, Neutral
  • Size: 10k+ labeled examples, perfect for benchmarking models

The training split is heavily skewed towards neutral tweets, which creates an extra challenge for the models:

  • Neutral: 65%
  • Bullish: 20%
  • Bearish: 15%

A typical entry looks like this:

Tweet: "Tech stocks are rallying hard today! Big gains for the Nasdaq."  
Label: Bullish

Financial language can be tricky, with sarcasm, abbreviations, and financial jargon adding extra layers of complexity, making it a great playground for testing LLMs.
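If you want to poke at the raw data yourself, here’s a minimal sketch for loading the dataset and checking the class balance with the Hugging Face datasets library (it assumes the dataset’s standard text/label columns; the exact scripts are on my GitHub):

from collections import Counter
from datasets import load_dataset

# Load the tweet sentiment dataset from the Hugging Face Hub
ds = load_dataset("zeroshot/twitter-financial-news-sentiment")

# Count how often each integer label appears in the training split
counts = Counter(ds["train"]["label"])
total = sum(counts.values())
for label_id, n in sorted(counts.items()):
    print(f"label {label_id}: {n} tweets ({n / total:.1%})")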

Traditional ML Approach

First, I started with the classics: Logistic Regression, Random Forest, Support Vector Classifier, and MLP. Just for kicks, I even tried a random labeler that simply guessed between the three labels. You’ll find how each model performed below.

Workflow:

  1. Clean text: remove URLs, numbers, codes, etc.
  2. Tokenize and vectorize: from bag-of-words/TF-IDF features (baseline) through Word2Vec and Doc2Vec embeddings, culminating in SBERT sentence embeddings.
  3. Train and evaluate models: accuracy, F1-score, and confusion matrix (a rough sketch of these steps follows below).
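Here’s roughly what the cleaning and evaluation helpers look like; the regex rules (e.g. treating cashtags like $AAPL as “codes”) and the function names are illustrative rather than the exact project scripts:

import re
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def clean_tweet(text: str) -> str:
    # Strip URLs, cashtag-style codes, and numbers before vectorizing
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\$[A-Za-z]+", " ", text)
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def evaluate(y_true, y_pred):
    # The metrics reported for every model in this post
    print("accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=3))  # includes per-class F1
    print(confusion_matrix(y_true, y_pred))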

Results:
While the best traditional ML models performed decently (~65-70% F1 scores), they struggled to distinguish subtle bullish/bearish sentiment in nuanced tweets, often misclassifying slightly optimistic news as neutral.

Enter LLMs

Next, I experimented with closed-source LLMs like GPT-4o Mini and Claude 3.5 Sonnet as well as open-source LLMs like LLaMA, Qwen, Gemma and Phi. I followed this up by fine-tuning the best-performing model (Qwen 2.5 7B Instruct) specifically on financial sentiment tweets.

Why LLMs shine:

  • Context-aware embeddings capture the subtle difference between “slight optimism” vs “neutral tone.”
  • Fine-tuned models learn financial jargon and abbreviations specific to market discussions.

Results:

  • Base LLMs already outperformed Random Forest/MLP models, hitting ~80-85% F1 scores.
  • Fine-tuned LLMs reached 90%+ F1 scores on the dataset, handling the tricky semantic nuances far more reliably.

Example:

Tweet: "Earnings not bad, but could be better 🙃"  
Random Forest → Neutral
Fine-tuned LLM → Bearish

Whether you’re bullish on AI or just neutral, one thing’s clear: LLMs are changing the game!

Model Results

Bag of Words Logistic Regression with Count Vectorizer

This extremely simple and interpretable model is fast to train and forms a good baseline to compare against more advanced models. However, it does not understand context or word order (e.g. ‘dog bites man’ and ‘man bites dog’ look identical to the model) and it does not understand synonyms (e.g. ‘wise’ and ‘knowledgeable’). It also tends to overfit on small, noisy datasets.
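A minimal version of this baseline looks something like the following (hyperparameters and variable names like train_texts are illustrative; the real pipeline lives in the GitHub scripts):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

bow_lr = make_pipeline(
    CountVectorizer(min_df=2),                                    # raw token counts, no word order
    LogisticRegression(max_iter=1000, class_weight="balanced"),   # "balanced" helps with the neutral-heavy skew
)
bow_lr.fit(train_texts, train_labels)   # cleaned tweets + integer labels
preds = bow_lr.predict(test_texts)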


Logistic Regression with Word2Vec

Each word in this model gets a vector representation, where similar words sit close together. Each document is then represented as the average of its word embeddings, on top of which the classifier acts. This model generalizes better to unseen data points.

However, averaging Word2Vec embeddings loses word order and nuance. Different texts collapse into similar dense vectors, which makes the classes harder to separate. As a result, the classifier cannot find a clear boundary and ends up predicting a single class.
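For reference, here’s a sketch of the averaged-Word2Vec representation with gensim, which is exactly where the collapse problem comes from (vector size and other parameters are illustrative):

import numpy as np
from gensim.models import Word2Vec

tokenized = [t.split() for t in train_texts]   # assumes already-cleaned tweets
w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens, model):
    # Average the word vectors; word order and emphasis are lost here
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train = np.vstack([doc_vector(t, w2v) for t in tokenized])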


Support Vector Classifier with Word2Vec

The Support Vector Classifier fails for similar reasons as the previous model, owing to the same drawbacks of averaged Word2Vec embeddings.


Random Forest Classifier with Doc2Vec

Doc2Vec uses paragraph vectors to learn a document-level embedding instead of averaging word vectors like Word2Vec, which lets it capture document-level semantics. This preserves distinctive information and fixes the ‘collapse-to-one-class’ issue.
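A hedged sketch of this stage with gensim’s paragraph-vector model and scikit-learn (the vector size, epochs, and tree count are illustrative, not the exact training config):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier

# Tag every tweet so Doc2Vec can learn one vector per document
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(train_texts)]
d2v = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=2, workers=4)

X_train = [d2v.dv[i] for i in range(len(tagged))]            # learned document vectors
X_test = [d2v.infer_vector(t.split()) for t in test_texts]   # inferred for unseen tweets

rf = RandomForestClassifier(n_estimators=300, class_weight="balanced")
rf.fit(X_train, train_labels)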


MLP with SBERT

This model uses the BERT transformer architecture, further optimized for sentence-level embeddings. It produces dense vectors that capture both meaning and context. Unlike Word2Vec, it can model word order and nuanced meaning thanks to its attention mechanisms.

Because it is pre-trained, it performs well even on smaller datasets and works great out of the box. The Multi-Layer Perceptron classifier further improves results by learning non-linear decision boundaries across classes.
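In sketch form, with the sentence-transformers library (the exact checkpoint name is an assumption; any general-purpose SBERT model works the same way):

from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

sbert = SentenceTransformer("all-MiniLM-L6-v2")   # pre-trained sentence encoder
X_train = sbert.encode(train_texts)
X_test = sbert.encode(test_texts)

mlp = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=300)   # non-linear decision boundary
mlp.fit(X_train, train_labels)
preds = mlp.predict(X_test)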


GPT 4o Mini

I went with the smaller version of the 4o model because I didn’t feel like splurging. In the scripts, you will see how I constructed the prompt fed to the LLM. Since I was using the base model, there was no training involved; the model is already great at capturing context and nuance. The biggest downside of using frontier LLMs is that there is a cost associated with each API call, even if it’s on the order of a fraction of a cent. Another drawback is the loss of explainability and interpretable decision boundaries.
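The call itself is straightforward; here’s an illustrative version with the OpenAI Python client (the exact prompt wording lives in the GitHub scripts, so treat this one as a stand-in):

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def classify(tweet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Classify the financial tweet as Bullish, Bearish, or Neutral. Reply with one word."},
            {"role": "user", "content": tweet},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("Tech stocks are rallying hard today! Big gains for the Nasdaq."))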


Claude 3.5 Sonnet

This model belongs to Anthropic’s Claude family. It has similar characteristics to the GPT 4o Mini model. It is optimized around safety, reasoning, and helpfulness, but it performed worse than the GPT model in my tests.
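The Claude call follows the same pattern through Anthropic’s Python SDK (again an illustrative sketch; the snapshot ID and prompt are assumptions):

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def classify(tweet: str) -> str:
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=5,
        system="Classify the financial tweet as Bullish, Bearish, or Neutral. Reply with one word.",
        messages=[{"role": "user", "content": tweet}],
    )
    return message.content[0].text.strip()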


Open Source LLMs (Qwen 2.5, Gemma 2 and Phi 3 base models)

While each model has minor differences in strengths, weaknesses, and how it works, they are all light enough to run on smaller GPUs while still providing great results. I used 4-bit quantization to make the models even smaller so they could run on the free tier of Google Colab. Some drawbacks include less robustness than the frontier LLMs on edge cases and a lack of domain specialization.

Out of the three, Qwen 2.5 had the best performance. It outperformed Claude 3.5 Sonnet and almost matched the performance of GPT 4o Mini.
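Here’s roughly how a 4-bit quantized model such as Qwen 2.5 can be loaded on a free Colab GPU with transformers and bitsandbytes (the quantization settings shown are illustrative defaults, not necessarily my exact config):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # shrink weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,   # T4-friendly compute dtype
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)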


Qwen 2.5 7B Instruct (fine-tuned)

To overcome the domain-specialization gap, where the model can misinterpret finance jargon, I further fine-tuned the Qwen 2.5 model using LoRA/QLoRA adapters to keep compute costs low. This shifts the model from being a generalist to being a specialized classifier with improved accuracy and consistency.
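The adapter setup looks roughly like this with the peft library (rank, alpha, and target modules are illustrative assumptions, not the exact training recipe):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)   # `model` is the 4-bit Qwen loaded earlier
model.print_trainable_parameters()           # only the small adapter weights get updated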

Here’s the shocker! The fine-tuned version even outperformed GPT 4o Mini by miles!

Qwen delivered an F1 score of 93% vs GPT’s 85%. This shows how we can maximize performance and deliver better results than the latest frontier LLMs on a minimal budget too!


Key Takeaways

  1. Context matters: Financial sentiment isn’t just positive/negative words. Market jargon and semantic nuances change meaning.
  2. LLMs understand nuance: Fine-tuned models can differentiate between slightly bullish and strongly bullish statements, which traditional ML often misses.
  3. Open vs closed-source: Open-source LLMs (Qwen, LLaMA, etc.) give closed-source models (aka the frontier LLMs) a run for their money, without breaking the bank. The complete training run on Google Colab with 8,000+ tweets cost me less than 50 cents!

Potential Applications of Fine-tuned LLMs

  • Real-time market sentiment monitoring
  • Portfolio risk analysis
  • Automated news summarization for financial analysts

Conclusion

Traditional ML models still have their place — fast, lightweight, and explainable — but when it comes to understanding subtle financial language, LLMs are the real heavy-hitters. Fine-tuning them on domain-specific datasets like the zeroshot/twitter-financial-news-sentiment dataset from Hugging Face can dramatically boost performance, giving analysts and traders a sharper tool for decoding market mood.

If you’d like to recreate these results or repurpose code for your project, you can find the scripts used throughout this project on my GitHub along with a detailed README!
