Bullish, Bearish, or Just Meh? Fine-Tuning LLMs to Beat Traditional ML at Financial Sentiment
Before we get started: if you'd like to follow along, you can find the scripts for this project on my GitHub, along with a detailed README!
The Spark Behind the Project
Ever wondered whether the financial news you skim over each morning is actually optimistic, pessimistic, or just plain neutral? As someone who’s been diving deeper into LLMs, I wanted to see whether modern large language models could truly understand financial context better than traditional machine learning models.
For this project, I used the zeroshot/twitter-financial-news-sentiment dataset from Hugging Face, which contains tweets about financial markets labeled as bullish, bearish, or neutral. The goal? Compare classic ML approaches like Random Forest and SVC against base and fine-tuned LLMs, both open and closed source, and see who comes out on top in understanding market sentiment.
Spoiler alert: the LLMs left traditional ML models in the dust!
Dataset Overview
The dataset is straightforward but rich:
- Source: Twitter (financial news and market commentary)
- Labels: Bullish, Bearish, Neutral
- Size: 10k+ labeled examples, perfect for benchmarking models
The training dataset is heavily skewed towards neutral tweets with the following distribution, which creates a challenge for training models:
- Neutral: 65%
- Bullish: 20%
- Bearish: 15%
A typical entry looks like this:
Tweet: "Tech stocks are rallying hard today! Big gains for the Nasdaq."
Label: Bullish
Financial language can be tricky, with sarcasm, abbreviations, and financial jargon adding extra layers of complexity, making it a great playground for testing LLMs.
Traditional ML Approach
First, I started with the classics: Logistic Regression, Random Forest, Support Vector Classifier, and MLP. Just for kicks, I even tried a random labeler that simply guessed between the three labels. You'll find how each model performed below.
Workflow:
- Clean text: remove URLs, numbers, codes, etc.
- Tokenize and vectorize: from TF-IDF features (baseline) to Word2Vec and Doc2Vec embeddings, culminating in SBERT sentence embeddings.
- Train and evaluate models: accuracy, F1-score, and confusion matrix (a minimal pipeline sketch follows below).
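To make the workflow concrete, here's a minimal sketch of the clean → vectorize → train loop for the TF-IDF baseline. The cleaning rules, column names, and hyperparameters are illustrative assumptions, not the exact ones used in the repo scripts:

```python
import re
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def clean(text: str) -> str:
    """Strip URLs, cashtags, and numbers before vectorizing."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\$[A-Za-z]+", " ", text)    # cashtags like $AAPL
    text = re.sub(r"\d+", " ", text)            # numbers
    return text.lower().strip()

# Assumes the dataset exposes "text" and "label" columns.
ds = load_dataset("zeroshot/twitter-financial-news-sentiment", split="train")
texts = [clean(t) for t in ds["text"]]
labels = ds["label"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# TF-IDF baseline; class_weight="balanced" offsets the skew toward Neutral.
tfidf_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
tfidf_model.fit(X_train, y_train)
print(classification_report(y_test, tfidf_model.predict(X_test)))
```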
Results:
While the best traditional ML models performed decently (~65-70% F1 scores), they struggled to distinguish subtle bullish/bearish sentiment in nuanced tweets, often misclassifying slightly optimistic news as neutral.
Enter LLMs
Next, I experimented with closed-source LLMs like GPT-4o Mini and Claude 3.5 Sonnet, as well as open-source LLMs like LLaMA, Qwen, Gemma, and Phi. I followed this up by fine-tuning the best-performing model (Qwen 2.5 7B Instruct) specifically on financial sentiment tweets.
Why LLMs shine:
- Context-aware embeddings capture the subtle difference between “slight optimism” vs “neutral tone.”
- Fine-tuned models learn financial jargon and abbreviations specific to market discussions.
Results:
- Base LLMs already outperformed Random Forest/MLP models, hitting ~80-85% F1 scores.
- Fine-tuned LLMs reached 90%+ F1 scores on this dataset, handling the tricky semantic nuances that tripped up the classical models.
Example:
Tweet: "Earnings not bad, but could be better 🙃"
Random Forest → Neutral
Fine-tuned LLM → Bearish
Whether you're bullish on AI or just neutral, one thing's clear: LLMs are changing the game!
Model Results
Bag of Words Logistic Regression with Count Vectorizer
This extremely simple and interpretable model is fast to train and forms a good baseline to compare against more advanced models. However, it does not understand context or word order (e.g. 'dog bites man' and 'man bites dog' look identical to it), and it does not understand synonyms (e.g. 'wise' and 'knowledgeable'). It also tends to overfit on small, noisy datasets.
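As a rough illustration (not the exact script from the repo), the bag-of-words baseline can be sketched like this, reusing the X_train/y_train split from the workflow sketch above; the label-to-name mapping is an assumption:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Raw token counts + logistic regression: fast, simple, and interpretable.
bow_model = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),  # one feature per token, order ignored
    ("clf", LogisticRegression(max_iter=1000)),
])
bow_model.fit(X_train, y_train)

# Interpretability: the learned coefficients rank each token's pull toward a class.
vocab = bow_model.named_steps["counts"].get_feature_names_out()
coefs = bow_model.named_steps["clf"].coef_           # shape: (n_classes, n_vocab)
bullish_idx = 1                                       # assumption: label 1 == Bullish
print(vocab[np.argsort(coefs[bullish_idx])[-10:]])    # ten most "bullish" tokens
```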
Logistic Regression with Word2Vec
Each word in this model gets a vector representation where similar words are close together. Each document is then represented as the average of the word embeddings upon which the classifier acts. This model generalizes better to unseen datapoints.
However, averaging Word2Vec embeddings discards word order and nuance. Different texts collapse into similar dense vectors, which makes the classes harder to separate. As a result, the classifier cannot find a clear decision boundary and ends up predicting mostly one class.
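A minimal sketch of the averaging approach, assuming gensim and the tokenized training texts from the earlier pipeline:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

tokenized = [t.split() for t in X_train]   # simple whitespace tokenization
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens, model):
    """Average the embeddings of in-vocabulary tokens; zeros if none are known."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train_w2v = np.vstack([doc_vector(t, w2v) for t in tokenized])
w2v_clf = LogisticRegression(max_iter=1000).fit(X_train_w2v, y_train)
```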
Support Vector Classifier with Word2Vec
The support vector classifier struggles for the same reasons as the previous model, owing to the drawbacks of averaged Word2Vec embeddings.
Random Forest Classifier with Doc2Vec
Doc2Vec uses paragraph vectors to learn a document-level embedding directly, instead of averaging word vectors like Word2Vec does, which captures document semantics. This preserves distinctive information and fixes the 'collapse-to-one-class' issue.
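Here's a hedged sketch of the Doc2Vec + Random Forest combination (vector size, epochs, and tree count are illustrative assumptions):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier

# Each tweet becomes a TaggedDocument so Doc2Vec can learn a vector per document.
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(X_train)]
d2v = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=2)

X_train_d2v = [d2v.dv[i] for i in range(len(tagged))]   # learned paragraph vectors
rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train_d2v, y_train)

# Unseen tweets are embedded with infer_vector before prediction.
new_vec = d2v.infer_vector("fed signals possible rate cut later this year".split())
print(rf.predict([new_vec]))
```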
MLP with SBERT
This model uses the BERT transformer architecture, further optimized for sentence-level embeddings. It produces dense vectors that capture both meaning and context. Unlike Word2Vec, it can model word order and nuanced meanings through attention mechanisms.
Because it is pre-trained, it performs well even on smaller datasets and is great out of the box. The Multi-layer Perceptron classifier further improves results by drawing non-linear decision boundaries between classes.
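A minimal sketch of this combination, assuming the sentence-transformers library and the widely used 'all-MiniLM-L6-v2' checkpoint (the actual model used may differ):

```python
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

sbert = SentenceTransformer("all-MiniLM-L6-v2")
X_train_emb = sbert.encode(X_train, show_progress_bar=True)  # dense, context-aware vectors
X_test_emb = sbert.encode(X_test)

mlp = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=300, random_state=42)
mlp.fit(X_train_emb, y_train)
print(mlp.score(X_test_emb, y_test))
```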
GPT 4o Mini
I went with the smaller version of the 4o model because I didn't feel like splurging. In the scripts, you will see how I constructed the prompt fed to the LLM. Since I was using the base model, there was no training step: the model is already great at capturing context and nuance. The biggest downside of using frontier LLMs is that there is a cost associated with each API call, even if it's on the order of a fraction of a cent. Another drawback is the loss of explainability and interpretable decision boundaries.
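The repo contains the exact prompt; the snippet below is a hedged sketch of what zero-shot classification with the OpenAI API looks like (the wording of the system prompt is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(tweet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are a financial sentiment classifier. "
                        "Reply with exactly one word: Bullish, Bearish, or Neutral."},
            {"role": "user", "content": tweet},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("Earnings not bad, but could be better 🙃"))
```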
Claude 3.5 Sonnet
This model belongs to Anthropic's Claude family. It has similar characteristics to the GPT 4o Mini model. It is optimized for safety, reasoning, and helpfulness, but performs worse than the GPT model, as you can see below.
Open Source LLMs (Qwen 2.5, Gemma 2 and Phi 3 base models)
While each model has minor differences in strengths, weaknesses, and how it works, they are all light enough to run on smaller GPUs while still providing great results. I used 4-bit quantization to make the models even smaller so they could run on the free version of Google Colab. Some drawbacks include less robustness than the frontier LLMs on edge cases and a lack of domain specialization.
Out of the three, Qwen 2.5 had the best performance. It outperformed Claude 3.5 Sonnet and almost matched the performance of GPT 4o Mini.
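As a rough sketch, loading one of these models in 4-bit with transformers + bitsandbytes looks something like this (the model ID is Qwen 2.5 7B Instruct; the generation settings are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # shrink weights to fit a free Colab GPU
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user",
      "content": "Classify this tweet as Bullish, Bearish, or Neutral: "
                 "'Tech stocks are rallying hard today! Big gains for the Nasdaq.'"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```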
Qwen 2.5 7B Instruct (fine-tuned)
In order to overcome domain-specialization challenges, where the model can misinterpret finance jargon, I further fine-tuned the Qwen 2.5 model using LoRA/QLoRA adapters to keep compute costs low. This shifts the model from being a generalist to being a specialized classifier with improved accuracy and consistency.
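A minimal QLoRA configuration sketch with peft; the rank, alpha, and target modules below are illustrative assumptions rather than the exact hyperparameters behind the reported scores:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 'model' is the 4-bit Qwen 2.5 7B Instruct loaded in the previous sketch.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                        # rank of the low-rank adapter matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()   # typically well under 1% of total weights

# Training then proceeds with a standard supervised fine-tuning loop
# (e.g. trl's SFTTrainer) over prompt/label pairs built from the tweets.
```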
Here’s the shocker! The fine-tuned version even outperformed GPT 4o Mini by miles!!
Qwen delivered an F1 score of 93% vs GPT’s 85%. This shows how we can maximize performance and deliver better results than the latest frontier LLMs on a minimal budget too!
Key Takeaways
- Context matters: Financial sentiment isn’t just positive/negative words. Market jargon and semantic nuances change meaning.
- LLMs understand nuance: Fine-tuned models can differentiate between slightly bullish and strongly bullish statements, which traditional ML often misses.
- Open vs. closed-source: Open-source LLMs (Qwen, LLaMA, etc.) give closed-source models, a.k.a. the frontier LLMs, a run for their money, without breaking the bank. The complete training run on Google Colab with 8,000+ tweets cost me less than 50 cents!
Potential Applications of Fine-tuned LLMs
- Real-time market sentiment monitoring
- Portfolio risk analysis
- Automated news summarization for financial analysts
Conclusion
Traditional ML models still have their place — fast, lightweight, and explainable — but when it comes to understanding subtle financial language, LLMs are the real heavy-hitters. Fine-tuning them on domain-specific datasets like the zeroshot/twitter-financial-news-sentiment dataset from Hugging Face can dramatically boost performance, giving analysts and traders a sharper tool for decoding market mood.
If you’d like to recreate these results or repurpose code for your project, you can find the scripts used throughout this project on my GitHub along with a detailed README!
