Integrating AI and Finance: My MS Portfolio Project Showcasing a Sophisticated Financial Analysis Model

azhar
azhar labs
6 min read · Feb 24, 2024


This article presents an in-depth exploration of my master’s portfolio project, where Artificial Intelligence (AI) meets finance. The project focuses on developing a comprehensive model to automate financial analysis by harnessing advanced AI techniques, leveraging the Mistral 7B model, Together AI API, LangChain framework, and Streamlit for web application development.

Before we proceed, let’s stay connected! Please consider following me on Medium, and don’t forget to connect with me on LinkedIn for a regular dose of data science and deep learning insights. 🚀📊🤖

Project Synopsis — FINQA: A New Frontier in Financial Analysis:

The crux of the project is the FINQA dataset, a large-scale, expert-curated collection of financial question-answer pairs extracted from S&P 500 earnings reports. The dataset is designed to challenge and enhance AI’s capacity in handling complex financial data, involving both structured and unstructured formats. The primary objective is to enable deep, automated analysis of extensive financial documents, addressing the challenges posed by the volume and complexity of financial data.

Dataset Composition and Structure: FINQA encompasses 8,281 rows across four JSON files: train.json, test.json, dev.json, and private_test.json. The dataset is meticulously structured to include various components critical for financial analysis:

  • Pre-text and post-text surrounding tables
  • Filename details, including stock ticker symbols, report year, and page number
  • Original and processed tables
  • A comprehensive ‘qa’ section with questions, answers, explanations, annotated rows, computational steps, and reasoning programs

This structure empowers the AI model to perform nuanced financial analysis, emphasizing explainability and transparency, essential in finance.
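To make the layout above concrete, here is a sketch of what a single record looks like. The field names follow the public FinQA release, but the values are purely illustrative, not taken from the actual dataset:

```python
import json

# A minimal record mirroring the FINQA layout described above
# (field names follow the public FinQA release; values are illustrative).
sample = {
    "filename": "ETR/2016/page_23.pdf",
    "pre_text": ["entergy corporation and subsidiaries ..."],
    "post_text": ["the increase was primarily driven by ..."],
    "table": [["", "2016", "2015"], ["net revenue", "1542", "1362"]],
    "qa": {
        "question": "what was the change in net revenue from 2015 to 2016?",
        "answer": "180",
        "explanation": "subtract 2015 net revenue from 2016 net revenue",
        "program": "subtract(1542, 1362)",
    },
}

# In practice each split is just a JSON array of such records:
# with open("train.json") as f:
#     records = json.load(f)

print(sample["qa"]["question"])
print(sample["qa"]["program"])
```

The `qa` block is what makes the dataset explainable: every numerical answer carries the reasoning program that produced it.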

Data Analysis and Visualization: Through Exploratory Data Analysis (EDA), I extracted valuable insights, focusing on operations used in financial calculations and the distribution of these operations across various dimensions like companies, report years, and page numbers. This analysis is visually represented through bar charts, heatmaps, and other graphical tools, providing an intuitive understanding of the dataset’s characteristics.

(A) The filename column syntax consists of the following:

  • The stock ticker symbol of the publicly traded company
  • The year of the report
  • The page number from which the report content was taken

So we can extract these three aspects and visualize which companies, report years, and page numbers appear most often.
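The extraction itself is a small parsing step. A minimal sketch, assuming the filenames follow a "TICKER/YEAR/page_N.pdf" layout (as in the sample records; the regex is my assumption, not the project's exact code):

```python
import re

# Assumed filename layout: "<TICKER>/<YEAR>/page_<N>.pdf", e.g. "ETR/2016/page_23.pdf".
FILENAME_RE = re.compile(r"^(?P<ticker>[A-Z.]+)/(?P<year>\d{4})/page_(?P<page>\d+)\.pdf")

def parse_filename(filename: str) -> dict:
    """Split a FINQA filename into ticker, report year, and page number."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return {
        "ticker": m.group("ticker"),
        "year": int(m.group("year")),
        "page": int(m.group("page")),
    }

print(parse_filename("ETR/2016/page_23.pdf"))
# → {'ticker': 'ETR', 'year': 2016, 'page': 23}
```

Applying this across the dataset yields the three columns behind the bar charts that follow.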

Bar Chart for Number of Times Reports by Companies have been Used

  • Here, we can see that reports from the company with ticker “ETR” are used the most.

Years of Publications of Reports considered for Analysis

  • Here, we can see that, across 1999 to 2019, the largest number of reports come from the year 2017 and the smallest number from the year 1999.

Page Numbers of Reports w.r.t the Counts

  • Here, we can see that the largest number of contexts come from page 83, and that usage falls off toward the end of the reports: early pages are used far more often than later ones.

(B) Also, since solving these questions requires both mathematical and tabular operations, we can analyze which operations are used most often, and how many operations each question needs to reach its answer.

Percentage Distribution of the Operations used to answer the Financial Questions

  • Here, we can see that the operation “divide” is used the most, which makes sense, since we tend to calculate percentages frequently in numerical context problems. It is followed by “subtract”, “addition”, “multiply”, and so on.

Percentage Distribution of the Number of Operations required to answer the Financial Question

  • As the chart above shows, more than half of the problems require only one operation to reach the answer; questions needing six operations are very rare.
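Both distributions can be computed directly from the `program` strings in the `qa` section. A sketch, assuming the FinQA program syntax of comma-separated operation calls where `#0` refers to the result of step 0 (the example programs here are illustrative):

```python
import re
from collections import Counter

# FINQA reasoning programs are sequences of operation calls, e.g.
# "subtract(5829, 5735), divide(#0, 5735)" where "#0" is step 0's result.
OP_RE = re.compile(r"([a-z_]+)\(")

def operations(program: str) -> list[str]:
    """Return the operation names used in a reasoning program, in order."""
    return OP_RE.findall(program)

# Illustrative programs standing in for the real dataset column.
programs = [
    "divide(100, 200)",
    "subtract(5829, 5735), divide(#0, 5735)",
    "add(3, 4)",
]

op_counts = Counter(op for p in programs for op in operations(p))
steps = Counter(len(operations(p)) for p in programs)
print(op_counts)  # which operations dominate
print(steps)      # how many operations each question needs
```

Normalizing these counters by the total number of questions gives exactly the two percentage distributions charted above.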

(C) Also, for every answer generated, the dataset provides an explanation. So we can see how many explanations accompany each solution, as below.

Percentage Distribution of the Number of Explanations generated after answering the Financial Question

  • We can see that more than 96% of the solutions need only one explanation to justify the calculation.

(D) Since we provide both unstructured text and structured tabular data to answer the questions, it is worth noting that both are not necessarily required every time. We can see that below:

Percentage Distribution for Data considered while solving from Context having both Texts and Tables

  • We can see that 62.5% of the questions need only tabular data to get the answer, around 23.4% need only text data, and the remaining 14% need both text and table data.
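This split can be derived from the evidence annotations in each `qa` block. A minimal sketch, assuming the `ann_table_rows` / `ann_text_rows` field names from the public FinQA release (treat the names as an assumption):

```python
def evidence_type(qa: dict) -> str:
    """Classify where the gold evidence comes from, based on the annotated
    table rows and text sentences (field names assumed from the FinQA release)."""
    has_table = bool(qa.get("ann_table_rows"))
    has_text = bool(qa.get("ann_text_rows"))
    if has_table and has_text:
        return "table+text"
    if has_table:
        return "table-only"
    if has_text:
        return "text-only"
    return "unknown"

print(evidence_type({"ann_table_rows": [2], "ann_text_rows": []}))  # prints "table-only"
```

Tallying this label over all questions reproduces the 62.5% / 23.4% / 14% breakdown above.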

(E) We also have a “model_input” field, which records how many sentences are fed to our model as input data to fetch the answer. So we can analyze the number of model inputs as well, as below:

Year of Publication vs Number of Model Inputs based on Counts

  • This shows that, for every year, the majority of questions need only three input sentences to reach the solution.

(F) We also have the top-n facts required for retrieval purposes, and we can analyze those as well, as shown:

Percentage Distribution of the Number of Top-N Facts required to answer the Financial Question

  • Most questions rely on the top-2 facts to answer the question; after that, top-1 and top-0 facts are the most common.

Some Other Visualizations based on Years , Facts , Operations and Top-N Facts:

LLM Modeling with Nous-Hermes-2-Mistral-7B-DPO and the Together AI API: At the heart of this project is the application of Large Language Models (LLMs) to financial question-answering. Using Nous-Hermes-2-Mistral-7B-DPO, a DPO-tuned variant of the potent Mistral 7B model, in conjunction with the Together AI API, the project applies prompt engineering to generate precise numerical answers from the dataset. The model handles zero-shot prompting effectively, outperforming models like Google PaLM and Google Gemini Pro in accuracy and explanation quality.
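The zero-shot setup can be sketched roughly as follows. The prompt wording, the `ask_llm` helper, and the exact model string are my illustrative assumptions based on the Together Python SDK, not the project's exact code:

```python
import os

def build_prompt(context: str, question: str) -> str:
    """Zero-shot prompt: no worked examples, just the financial context, the
    question, and an instruction to answer with a number plus an explanation."""
    return (
        "You are a financial analyst. Using only the context below, "
        "answer the question with a single numerical value, then briefly "
        "explain the calculation.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def ask_llm(context: str, question: str) -> str:
    # Requires `pip install together` and TOGETHER_API_KEY in the environment.
    from together import Together
    client = Together(api_key=os.environ["TOGETHER_API_KEY"])
    resp = client.chat.completions.create(
        model="NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
        messages=[{"role": "user", "content": build_prompt(context, question)}],
    )
    return resp.choices[0].message.content
```

Because the prompt asks for the number first and the explanation second, the answer is easy to extract programmatically while still remaining auditable by a human reader.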

Web Application Development Using Streamlit and LangChain: To showcase the project’s practical application, I developed a Streamlit-based web application. LangChain, a framework for chaining language models, was instrumental in integrating the AI model with the web interface. The application allows users to input financial contexts and questions, receiving AI-generated answers in real-time, thereby demonstrating the model’s capability in a user-friendly format.

Conclusion: This master’s portfolio project represents a significant leap in integrating AI with financial analysis. By developing an AI-driven tool capable of deciphering complex financial data, the project not only showcases my technical prowess in AI and finance but also marks a step forward in the future of automated financial decision-making. The success of this project lies in its ability to transform vast, intricate financial data into actionable insights, paving the way for more informed and efficient financial analysis and decision-making.


Data Scientist | Exploring interesting (research paper / concepts). LinkedIn : https://www.linkedin.com/in/mohamed-azharudeen/