Building AI Assistant for Investment Research — Part 1

Oleh Davydiuk
12 min read · Feb 1, 2024


Image generated by DALL-E

Intro

Data Analysis in the Investment Research space can be (and is) a cumbersome process. Let's make it a lot easier by building an AI-driven research assistant. Thankfully, there are powerful LLMs out there, and frameworks like LangChain and LlamaIndex we can use to build versatile AI Assistants.

In this article, I will describe the process of building a pretty powerful AI assistant using LangChain for LLM orchestration.

Here is my GitHub repo, where I develop and publish demonstration projects showcasing AI Assistants designed for Financial Research. There you can find the complete code for this article along with instructions on how to use it in your projects: packaging, notebook examples, a Streamlit frontend, and running the whole stack in Docker with a Postgres database and stock data from yfinance.

Tools + SQL Agent for Stocks Analysis

Our primary goal is to enable our agent to do stock market research with the following capabilities:

Company Profile Retrieval: Get detailed information like market cap, latest earnings numbers, dividend yield and rate, operating and profit margins, etc.

News and Events Search: Utilize Tavily Search to find the latest news, events, and M&A deals.

Analytical Operations: Perform SQL RAG on stock market data including candles, volume, and net changes.

Python Code Execution: For more complex calculations and plotting charts, the agent can write and execute Python code.

Streamlit Integration: The assistant is accessible through a user-friendly Streamlit application.

This will be a fairly basic version of such an agent, to be tested and improved in the next iterations.

If you want a deep (very deep) dive

First of all, if you prefer to dive deep into the topic of creating LLM applications, be aware that a lot will be out of the scope of this article. I'll try to explain things at a high level anyway. Still, if that is not enough, there is this article that collects a list of papers suggested by Andrej Karpathy, essential for achieving ninja-level mastery of LLMs.

Add the following papers to the list:

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

ReAct: Synergizing Reasoning and Acting in Language Models

Language Models are Few-Shot Learners

to better understand what LLM reasoning is and how LLMs achieve that ability.

Also, I will leave useful links wherever possible, because covering everything would be too much for one article :)

So what is an Agent?

CoT + ReAct

You can think of an agent as a companion with an LLM-powered brain. This brain is capable of reasoning about a task you want to solve. Modern LLMs (gpt-4-1106, for example) are pre-trained with these reasoning capabilities in mind, so we can "program" them to plan and execute our tasks via prompt engineering.

There are two main methodologies: Chain of Thought (CoT) and Reason and Act (ReAct). To better understand what they are about, let's take a look at the following example from the ReAct paper:

From a paper by Yao et al on ReAct

There is a complex query that requires decomposition and looking outside of the LLM's own brain to avoid hallucination.

How can we, as humans, approach answering this query?

“Aside from the Apple Remote, what other device can control the program Apple Remote was originally designed to interact with?” © From paper by Yao et al on ReAct

First of all, we will decompose that query into the following steps:

  1. What is Apple Remote?
  2. What program was it originally designed to interact with?
  3. Take that program Apple Remote was originally designed to control and look for all devices intended to control it.

CoT (on the far left of the image above) succeeds in decomposing the query by utilizing a "step-by-step" approach. The problem, however, is that the LLM doesn't know the answers to our first and second steps, so it tries to figure them out by itself; in other words, it hallucinates. We need to put tools at the disposal of those reasoning capabilities, a web search tool in our case. Take a look at the successful ReAct flow on the far right: instead of hallucinating, the agent uses a Search tool to find relevant answers to the sub-questions of the decomposed query.

Agent Loop

As you may have noticed, there is a loop in the example above: we simply call the LLM with updated context after every ReAct cycle until the model decides it has reached a final state. LangChain implements this agent loop for us in the powerful AgentExecutor abstraction, which also handles constructing ReAct prompts with tool descriptions.
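For intuition, here is a minimal sketch of what such a loop looks like. This is illustrative pseudocode, not LangChain's actual implementation; helper names like plan, is_final, tool, and tool_input are hypothetical:

# Illustrative agent loop; `llm.plan`, `step.is_final`, `step.tool`, and
# `step.tool_input` are hypothetical names, not LangChain's real API.
def agent_loop(llm, tools, query, max_steps=10):
    scratchpad = []  # accumulates (action, observation) pairs
    for _ in range(max_steps):
        step = llm.plan(query, scratchpad)  # Thought: decide on the next action
        if step.is_final:
            return step.output  # the model decided we reached a finished state
        observation = tools[step.tool].run(step.tool_input)  # Act: call a tool
        scratchpad.append((step, observation))  # feed the result back as context
    raise RuntimeError("Agent did not finish within max_steps")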

See also:

Agent Concepts in the LangChain documentation.

Now, hands-on!

First, let’s visualize a big picture of what we are building:

Building AgentExecutor

Here is the code for creating the AgentExecutor:

from typing import Dict, List, Union
import logging

from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.tools.render import format_tool_to_openai_function
from langchain_core.language_models import BaseLanguageModel

# Project-local tools (DateTool, CalculatorTool, PythonREPLTool,
# CompanyOverviewTool, get_tavily_search_tool, get_sql_database_tool)
# and the TAVILY_API_KEY constant are imported from the repo's own modules.


def get_tools(
    llm: BaseLanguageModel,
    db_connection_string: str = None,
    preinitialize_database: bool = False,
    stocks: List[Dict[str, str]] = None,
    tavily_api_key: str = None,
) -> List:
    date_tool = DateTool()
    calculator_tool = CalculatorTool()
    repl_tool = PythonREPLTool()

    company_overview_tool = CompanyOverviewTool()
    tools = [date_tool, calculator_tool, repl_tool, company_overview_tool]

    tavily_api_key = tavily_api_key or TAVILY_API_KEY
    if tavily_api_key:
        tavily_search_tool = get_tavily_search_tool(tavily_api_key=tavily_api_key)
        tools.append(tavily_search_tool)
    else:
        logging.info(
            msg="tavily_tool initialization failed, please provide Tavily API key"
        )
    sql_database_tool = get_sql_database_tool(
        llm=llm,
        db_connection_string=db_connection_string,
        preinitialize_database=preinitialize_database,
        stocks=stocks,
    )

    tools.append(sql_database_tool)

    return tools


def create_sql_market_agent(
    llm: BaseLanguageModel,
    db_connection_string: str = None,
    preinitialize_database: bool = False,
    stocks: Union[List[str], List[Dict[str, str]]] = None,
    tavily_api_key: str = None,
) -> AgentExecutor:
    # Not proceeding if db_connection_string is None and preinitialize_database is False

    # Some input processing logic...

    tools = get_tools(
        llm=llm,
        db_connection_string=db_connection_string,
        preinitialize_database=preinitialize_database,
        stocks=stocks,
        tavily_api_key=tavily_api_key,
    )
    assistant_system_message = """You are a helpful assistant. \
Use tools (only if necessary) to best answer the user's questions."""
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", assistant_system_message),
            MessagesPlaceholder(variable_name="chat_history"),
            ("user", "{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )
    # Advertise the tools to the model as OpenAI function definitions
    llm_with_tools = llm.bind(
        functions=[format_tool_to_openai_function(t) for t in tools]
    )
    # LCEL chain: map inputs -> prompt -> LLM with tools -> parse function calls
    agent = (
        {
            "input": lambda x: x["input"],
            "chat_history": lambda x: x["chat_history"],
            "agent_scratchpad": lambda x: format_to_openai_function_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm_with_tools
        | OpenAIFunctionsAgentOutputParser()
    )
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

    return agent_executor

The overall process is pretty simple:

  1. We define the system prompt.
  2. Then the ChatPromptTemplate, where we pass the system prompt and MessagesPlaceholders for chat history and the agent scratchpad. The agent scratchpad is where we store information about function calls and the respective observations; AgentExecutor updates it on each step of the ReAct loop.
  3. Define tools and "bind" them to the OpenAI model. We will also enable our agent to use other models in future iterations.
  4. Create the agent inference chain using LCEL. I highly recommend getting familiar with this concept of chaining basic LangChain components; a minimal example follows.
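For a taste of LCEL outside the agent context, here is a tiny standalone chain; a sketch, assuming a chat model instance llm is already constructed:

# Minimal LCEL chain: prompt -> model -> string parser, composed with `|`.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Explain {topic} in one sentence.")
chain = prompt | llm | StrOutputParser()  # `llm` is any chat model instance
print(chain.invoke({"topic": "dividend yield"}))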

Define tools

To define a tool, we need to specify its description and arguments so that our AgentExecutor can select tools properly. There are multiple ways of doing this in LangChain; please refer to Defining Custom Tools in the documentation.

Company Overview Tool

from pydantic.v1 import BaseModel, Field
from langchain.tools import BaseTool
from typing import Type
import yfinance as yf


class SymbolRelatedToolsInput(BaseModel):
    symbol: str = Field(..., description="Company's stock symbol")


class CompanyOverviewTool(BaseTool):
    name = "company_overview_tool"
    description = """Useful when you need to get a comprehensive company overview by its stock symbol.
    Metrics of interest include:
    Market Capitalization, Enterprise Value, Price-to-Earnings (P/E) Ratio, Price-to-Sales Ratio (P/S),
    Price-to-Book Ratio (P/B), Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA),
    Revenue and Revenue Growth, Net Income and Growth, Earnings Growth, Dividend Yield and Rate, Operating and Profit Margins,
    Return on Assets (ROA) and Equity (ROE), Free Cash Flow, Debt-to-Equity Ratio, Current and Quick Ratios,
    Beta, 52-Week High and Low, Average Volume and many others. If there are ambiguous variants for answers,
    think carefully which answer is most appropriate by reasoning from key in large dict you obtain."""
    args_schema: Type[BaseModel] = SymbolRelatedToolsInput

    def _run(self, symbol: str) -> dict:
        # yfinance returns a large dict of fundamentals for the ticker
        stock = yf.Ticker(symbol)
        info = stock.get_info()

        return info
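To sanity-check the tool outside the agent, you can call it directly; a quick sketch, assuming network access (the exact keys in the returned dict vary by ticker):

overview_tool = CompanyOverviewTool()
info = overview_tool.run("CAT")  # tool input is the stock symbol
print(info.get("marketCap"), info.get("dividendYield"))  # keys vary by ticker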

Tavily Search Tool

For the Tavily Search Tool to work, we need to obtain a Tavily API key and either pass it to create_sql_market_agent or define it as an environment variable.

from langchain.tools import Tool
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.utilities.tavily_search import TavilySearchAPIWrapper


# Search tool
def get_tavily_search_tool(tavily_api_key: str = None) -> Tool:
    description = """A search engine optimized for comprehensive, accurate, \
and trusted results. Useful for when you need to answer questions \
about current events or about recent information. \
Input should be a search query. \
If the user is asking about something that you don't know about, \
you should probably use this tool to see if that can provide any information."""
    search = TavilySearchAPIWrapper(tavily_api_key=tavily_api_key)
    tavily_tool = TavilySearchResults(api_wrapper=search, description=description)

    return tavily_tool
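A quick invocation sketch (assumes a valid Tavily API key; the key string below is a placeholder):

search_tool = get_tavily_search_tool(tavily_api_key="tvly-...")  # your key here
results = search_tool.run("latest M&A deals in the energy sector")
print(results)  # a list of search result snippets with source URLs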

Python REPL Tool

Sometimes we need to perform complex calculations, like correlation, or plot charts, using data from other tools as inputs. For example, if we want to plot %DailyChange for some stock over the last month, our agent will use the SQL tool to get that data and the Python REPL Tool will plot the chart. Warning: the Python REPL Tool can execute arbitrary code; use sandboxed environments wherever possible.

from typing import Type

from langchain.pydantic_v1 import BaseModel, Field
from langchain.tools import BaseTool, Tool
from langchain_experimental.utilities import PythonREPL


# PythonREPL tool. Warning: This executes code locally, which can be unsafe when not sandboxed
class PythonREPLInput(BaseModel):
    code: str = Field(
        description="The python code to execute to do code calculations or generate chart."
    )


class PythonREPLTool(BaseTool):
    name = "PythonREPLTool"
    description = """Use this to execute python code. If you want to see the output of a value,
    you should print it out with `print(...)`. This is visible to the user."""
    args_schema: Type[BaseModel] = PythonREPLInput
    repl = PythonREPL()

    def _run(self, code: str) -> str:
        try:
            result = self.repl.run(code)
        except BaseException as e:
            return f"Failed to execute. Error: {repr(e)}"
        return f"Successfully executed:\n```python\n{code}\n```\nStdout: {result}"
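A quick manual check of the REPL tool (remember the sandboxing caveat above):

repl_tool = PythonREPLTool()
print(repl_tool.run("import math\nprint(math.sqrt(2))"))  # only printed output is captured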

SQL Tool

Here we will use the create_sql_agent function from LangChain, which implements an AgentExecutor specifically tailored to creating and running SQL queries over arbitrary databases. As with the Python REPL Tool, the SQL Agent is capable of executing insert, update, and delete queries over the database you connect to it. Please use it with caution and sandbox the database environment wherever possible.

Here is how we create our SQL Agent:

from pathlib import Path
from langchain.sql_database import SQLDatabase
from langchain_core.language_models import BaseLanguageModel
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain.tools import Tool
from langchain.agents.agent_types import AgentType
from langchain.agents import create_sql_agent
from sql_market_agent.agent.tools.prompts.sql_prompts import (
    SQL_PREFIX,
    SQL_SUFFIX,
)
from sql_market_agent.agent.tools.storage.db_fetcher import run_fetch_job
from typing import List, Dict
import logging

# Setup basic logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


def get_database(
    db_connection_string: str = None,
    preinitialize_database: bool = False,
    stocks: List[Dict[str, str]] = None,
) -> SQLDatabase:
    if not db_connection_string:
        logging.info("DB connection string not provided, using local db on disk...")
        db_connection_string = (
            f"sqlite:///{Path(__file__).parent / 'storage/StockData.db'}"
        )

    # Fetch and store stock data before handing the database to the agent
    run_fetch_job(
        stocks=stocks,
        db_connection_string=db_connection_string,
        preinitialize_database=preinitialize_database,
    )
    logging.info("Connecting to db for sql agent...")
    db = SQLDatabase.from_uri(db_connection_string)
    logging.info("Connected to db for sql agent successfully")
    return db


def get_sql_database_tool(
    llm: BaseLanguageModel,
    db_connection_string: str = None,
    preinitialize_database: bool = True,
    stocks: List[Dict[str, str]] = None,
) -> Tool:
    db = get_database(
        db_connection_string=db_connection_string,
        preinitialize_database=preinitialize_database,
        stocks=stocks,
    )
    # Inline the table schema into the agent's system prompt to save LLM calls
    schema_to_insert = db.get_table_info()
    sql_prefix = SQL_PREFIX.replace("{schema}", schema_to_insert)
    toolkit = SQLDatabaseToolkit(db=db, llm=llm)
    db_agent = create_sql_agent(
        llm=llm,
        toolkit=toolkit,
        verbose=True,
        agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        prefix=sql_prefix,
        suffix=SQL_SUFFIX,
    )

    sql_database_tool = Tool(
        name="sql_database_tool",
        func=db_agent.run,
        description="Useful when you need to answer questions about data in database, but only when there are no other tools to answer this question. "
        "You must use this tool only if there is no other tool for answering this specific question about data in database "
        "where there is information about US stocks symbols, sectors, OHLC data and %DailyChange data. "
        "Input should be in the form of a question containing full context. Do not use this tool if you have an answer in chat history. "
        "If there is no data available for given query - say I don't know and never hallucinate!",
    )

    return sql_database_tool

First, we prepare our database for connection to our SQL Agent. This involves:

  1. Connecting to the database and creating the SQLDatabase SQLAlchemy wrapper over it.
  2. Fetching data for stocks if necessary. More on that here.
  3. Creating the SQLDatabaseToolkit, a wrapper class for the tools our SQL Agent will use to create and run SQL queries. Take a look at get_tools inside the SQLDatabaseToolkit implementation.
  4. Creating the SQL Agent with a custom PREFIX into which we paste the schema of our database. We do this to reduce the LLM calls needed to select and use ListSQLDatabaseTool, by passing the schema of the SQL tables directly into the agent's system prompt.
  5. Wrapping this agent into a tool using the Tool class with a name, description, and main "run" function.

Note that there are separate setups for different types of agents inside the create_sql_agent implementation. The pattern is similar: create a system prompt by passing PREFIX, tools, and SUFFIX into a PromptTemplate. It can use either the default PREFIX and SUFFIX, or we can pass custom predefined values. In our case, we use the latter approach, with a schema placeholder that we fill before calling create_sql_agent.
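For intuition, the custom prefix is roughly shaped like the sketch below; the real SQL_PREFIX text lives in the repo, so treat this as an assumed approximation:

# Hypothetical shape of the custom prefix; the actual SQL_PREFIX is in the repo.
SQL_PREFIX = """You are an agent designed to interact with a SQL database.
Here is the schema of the tables you can query:

{schema}

Given an input question, create a syntactically correct SQL query to run,
then look at the results of the query and return the answer."""

# Before calling create_sql_agent we fill the placeholder with the live schema:
# sql_prefix = SQL_PREFIX.replace("{schema}", db.get_table_info())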

Essentially, the AgentExecutor for our SQL Agent works like this:

  1. Receive query.
  2. Use ListSQLDatabaseTool for getting table names. (Unnecessary when we pass that info into prefix).
  3. Use InfoSQLDatabaseTool for getting tables schemas. (Unnecessary when we pass that info into prefix).
  4. Create and check SQL query using QuerySQLCheckerTool.
  5. Run SQL query and return text result using QuerySQLDataBaseTool.

Ok, now let’s test our agent!
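Here is a minimal end-to-end sketch of wiring everything up, assuming an OpenAI API key in the environment and the package installed per the repo instructions; the model name is just an example:

from langchain_openai import ChatOpenAI  # `pip install langchain-openai`

llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
agent_executor = create_sql_market_agent(
    llm=llm,
    preinitialize_database=True,  # fetch stock data into a local SQLite db
    stocks=["XOM", "CAT"],
)
result = agent_executor.invoke(
    {"input": "What was XOM's close price over the last 5 trading days?",
     "chat_history": []}
)
print(result["output"])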

Try this out

As mentioned above, the GitHub repo contains the complete code for this article along with instructions on how to use it in your projects: packaging, notebook examples, the Streamlit frontend, and running the whole stack in Docker with a Postgres database and stock data from yfinance.

Note that plotting currently does not work out of the box inside the Streamlit app.

Here is a notebook demo for creating and running a basic sql market agent using a local SQLite database on disk:

In this notebook, we ask the agent to plot the close price for XOM over the most recent 5 days. For our agent, that involves:

  1. Writing and executing a SQL query to get those close prices. Update: I added a simple date_tool for our agent to convert verbal descriptions of dates like "last Monday" into concrete dates. I did this by analogy with the tool definition process described above.
  2. Invoking the REPL tool to plot a chart.
Invoking sql_database_tool. Part 1
Invoking sql_database_tool. Part 2
Invoking python_repl_tool

Our agent did well on that simple task; I double-checked the result on TradingView.

The problem with this approach, however, is that we pass data obtained by our agent through an ever-growing prompt: each new LLM call in the ReAct loop carries more and more tokens. Remember, we use agent_scratchpad for that: the agent scratchpad is where we store information about function calls and the respective observations.

The process goes something like this (approximately, to convey the general idea). We call the LLM with the original query:

-> Thought 1: "I need to query the SQL database for close prices"
-> Act: queries data from the database
-> Again, call the LLM in a new agent-loop iteration, now with details about the SQL tool usage in the agent scratchpad, where we store the observation value (essentially the list of required close prices)
-> Thought 2: "I now have data to plot a chart, I need to use the REPL tool for that"
-> Act: write and execute code.

Essentially, we carry more and more information in agent_scratchpad (or any other placeholder for storing the agent's activity history). That is not efficient, especially if we want to pass large data structures between tools.

We will tackle this problem with a multi-agent approach in future posts, using tools like LangGraph and AutoGen.

Streamlit

Our GitHub repo contains instructions for running Streamlit frontend for our Agent.

Screenshots of the Streamlit interface, getting revenue TTM for Caterpillar:

Final Thoughts

Congratulations, you now know how to create your first Stock Market Research Assistant using LangChain for LLM orchestration. There are still plenty of experiments and improvements to work on:

  1. Compare the approach of storing and processing stock market data inside pandas data frames with the SQL approach.
  2. Use LangGraph or AutoGen to introduce multi-agent workflows. We will look at those approaches and their benefits in future posts.
  3. Use open-source models as brains for our agents.
  4. Add RAG over unstructured PDFs of SEC forms (10k and 10q) for our stocks of interest.
  5. Extensive evaluation of our agents and their RAG workflows.

More to come, stay tuned, and good luck in your AI journey!

Feedback and propositions are highly welcome. Feel free to reach out via LinkedIn.


Oleh Davydiuk

Founder and CEO of Arkad. Automation and AI Assistants for Banking and Investment Management.