Building an AI Assistant for Investment Research — Part 2: Comparing Pandas and SQL Agents

Oleh Davydiuk
10 min read · Feb 3, 2024


Image generated by DALL-E

Intro

In the previous post of the series we created an AI Agent with the following functionality:

  1. Company Profile Retrieval: Get detailed information like market cap, latest earnings numbers, dividend yield and rate, operating and profit margins, etc.
  2. News and Events Search: Utilize Tavily Search to find the latest news, events, and M&A deals.
  3. Analytical Operations: Perform SQL RAG on stock market data including candles, volume, and net changes.
  4. Python Code Execution: The agent can write and execute Python code for more complex calculations and plotting charts.
  5. Streamlit Integration: The assistant is accessible through a user-friendly Streamlit application.

We discussed the process of creating an AI Agent capable of executing those tasks using LangChain for LLM orchestration.

We used the SQL “stockdata” table to store stock candle data (Open, High, Low, Close, Volume) along with a performance measure in the form of DailyChangePercent.

Now let’s compare this with an approach where we store this data in a pandas DataFrame instead of a SQL table.
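
Before diving in, here is a minimal sketch of what one of those DataFrames might look like. The column names mirror the stockdata table from Part 1, the rows are placeholder values, and pct_change is just one reasonable way to derive DailyChangePercent:

import pandas as pd

# Placeholder OHLCV rows; the real data comes from a market data fetch job
df = pd.DataFrame(
    {
        "Symbol": ["XOM", "XOM", "XOM"],
        "Date": pd.to_datetime(["2024-01-30", "2024-01-31", "2024-02-01"]),
        "Open": [102.1, 101.8, 103.0],
        "High": [103.0, 103.2, 104.1],
        "Low": [101.5, 101.2, 102.4],
        "Close": [102.0, 103.0, 103.8],
        "Volume": [18_000_000, 21_000_000, 19_500_000],
    }
)

# One reasonable definition of DailyChangePercent: day-over-day close change
df["DailyChangePercent"] = df.groupby("Symbol")["Close"].pct_change() * 100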

TL;DR: The only difference between our current and previous approaches is that we use a Pandas Tool instead of a SQL Tool. Here we walk through the implementation of the Pandas Tool; refer to the previous article for more details on the whole Agent development process.

Here is my GitHub repo, where I develop and publish demonstration projects showcasing AI Assistants for Financial Research. There you can find the complete code for this article along with instructions on how to use it in your own projects, including packaging, notebook examples, and the Streamlit frontend.

Hands-on!

Let’s revisit the big picture of what we are building:

Descriptions Index

Recently, I found an interesting approach in a LlamaIndex Pack that stores a Pandas query engine per stock. That means we have a separate pandas DataFrame agent per stock. Those engines are indexed by summaries (descriptions) of the data stored in the respective DataFrames.

Think of Pandas Query Engine in LlamaIndex as an analog for AgentExecutor tailored specifically for working with pandas DataFrames.

summaries = [f"{ticker} historical market data" for ticker in tickers]
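
To make the routing idea concrete, here is a minimal sketch of a descriptions index built with the same LangChain and Chroma pieces we use later in this article, rather than LlamaIndex's own classes. Here, make_engine, dataframes, and the sample query are hypothetical:

from langchain.schema.document import Document
from langchain_community.vectorstores.chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# make_engine is a hypothetical factory wrapping one DataFrame in a query engine/agent
engines = {i: make_engine(df) for i, df in enumerate(dataframes)}

# One summary Document per engine; metadata remembers which engine it describes
docs = [
    Document(page_content=summary, metadata={"idx": i})
    for i, summary in enumerate(summaries)
]
db = Chroma.from_documents(docs, OpenAIEmbeddings())

# Route a user query to the engine whose summary is semantically closest
best = db.similarity_search("How did ENPH perform last week?", k=1)[0]
engine = engines[best.metadata["idx"]]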

We will use quite a similar approach, but instead of creating an agent per stock DataFrame, we will create an agent per sector DataFrame, storing OHLC data for a whole sector in one DataFrame: say, a separate DataFrame for the Oil sector (“XOM”, “CVX”, etc.) and a separate one for Solar (“ENPH”, “SPWR”, etc.). This is more effective for sector performance comparison. A quick sketch of that grouping follows.
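
In this sketch, fetch_ohlcv is a hypothetical helper returning an OHLCV DataFrame for a single ticker; the grouping itself is just a concat per sector:

import pandas as pd

sectors = {
    "Oil": ["XOM", "CVX"],
    "Solar": ["ENPH", "SPWR"],
}

# One DataFrame per sector, with a Symbol column to tell the tickers apart
sector_dfs = {
    sector: pd.concat(
        [fetch_ohlcv(ticker).assign(Symbol=ticker) for ticker in tickers],
        ignore_index=True,
    )
    for sector, tickers in sectors.items()
}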

Here is a visualization of our Pandas Tool to better grasp the idea:

Here is the full code for our Agent (the main AgentExecutor, as a reminder):

Note that we replaced the SQL Tool with the Pandas Tool.

from typing import Union, List, Dict
import sys
import os


# Calculate the path to the root of the project
current_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.dirname(current_dir)

# Append the project root to the system path
sys.path.append(project_root)

from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain_core.language_models import BaseLanguageModel
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_community.tools.render import format_tool_to_openai_function
from pandas_market_agent.agent.tools.tools import get_tavily_search_tool, CalculatorTool
from pandas_market_agent.agent.tools.datetime_tools import DateTool
from pandas_market_agent.agent.tools.pandas_tools import PandasTool
from pandas_market_agent.agent.tools.company_overview_tools import CompanyOverviewTool
from dotenv import load_dotenv
import logging

# Setup basic logging to stdout
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

load_dotenv()

TAVILY_API_KEY = os.environ.get("TAVILY_API_KEY")


def get_tools(
    llm: BaseLanguageModel,
    stocks: List[Dict[str, str]] = None,
    tavily_api_key: str = None,
) -> List:
    date_tool = DateTool()
    calculator_tool = CalculatorTool()
    company_overview_tool = CompanyOverviewTool()
    tools = [date_tool, calculator_tool, company_overview_tool]

    tavily_api_key = tavily_api_key or TAVILY_API_KEY
    if tavily_api_key:
        tavily_search_tool = get_tavily_search_tool(tavily_api_key=tavily_api_key)
        tools.append(tavily_search_tool)
    else:
        logging.info(
            msg="tavily_tool initialization failed, please provide Tavily API key"
        )

    pandas_tool = PandasTool(llm=llm, stocks=stocks)
    tools.append(pandas_tool)

    return tools


def create_pandas_market_agent(
    llm: BaseLanguageModel,
    stocks: Union[List[str], List[Dict[str, str]]] = None,
    tavily_api_key: str = None,
) -> AgentExecutor:
    # Check if stocks is a list of plain ticker strings
    if stocks and all(isinstance(stock, str) for stock in stocks):
        stocks = [{"ticker": stock, "sector": ""} for stock in stocks]

    # Check if stocks is a list of dictionaries with the keys 'ticker' and 'sector'
    elif stocks and all(
        isinstance(stock, dict) and "ticker" in stock and "sector" in stock
        for stock in stocks
    ):
        pass  # stocks is already in the desired format

    else:
        # Handle the case where stocks is not in one of the expected formats
        # (logging.warn is deprecated, so we use logging.warning)
        logging.warning(
            "Invalid format for stocks. Expected a list of strings or a list of "
            "dictionaries with 'ticker' and 'sector' keys. "
            "Will use default stocks from internal stocks.json."
        )

    tools = get_tools(
        llm=llm,
        stocks=stocks,
        tavily_api_key=tavily_api_key,
    )
    assistant_system_message = """You are a helpful assistant. \
Use tools (only if necessary) to best answer the users' questions."""
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", assistant_system_message),
            MessagesPlaceholder(variable_name="chat_history"),
            ("user", "{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )
    llm_with_tools = llm.bind(
        functions=[format_tool_to_openai_function(t) for t in tools]
    )
    agent = (
        {
            "input": lambda x: x["input"],
            "chat_history": lambda x: x["chat_history"],
            "agent_scratchpad": lambda x: format_to_openai_function_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm_with_tools
        | OpenAIFunctionsAgentOutputParser()
    )
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

    return agent_executor

And here is our Pandas Tool:

import logging
import pysqlite3 as sqlite3
import sys

# Swap in pysqlite3 so Chroma gets a recent sqlite3 build
sys.modules["sqlite3"] = sqlite3
from typing import List, Dict, Optional, Type
from langchain.tools import BaseTool
from langchain_core.language_models import BaseLanguageModel
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain.schema.document import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.chroma import Chroma
from langchain.callbacks.manager import (
    CallbackManagerForToolRun,
)
from pydantic.v1 import BaseModel, Field
from pandas_market_agent.agent.tools.storage.data_fetcher import run_fetch_job
from pandas_market_agent.agent.tools.prompts.pandas_prompts import PREFIX


# Setup basic logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class PandasToolInput(BaseModel):
    query: str = Field(
        description="Query which requires constructing and "
        "running analytics code over a pandas dataframe with historical "
        "market data to do code calculations or generate a chart."
    )


class PandasTool(BaseTool):
    name = "PandasTool"
    description = """Tool to do code calculations or generate charts
    over a pandas dataframe with historical market data."""
    args_schema: Type[BaseModel] = PandasToolInput

    llm: BaseLanguageModel = None
    stocks: Optional[List[Dict[str, str]]] = None
    sectors_data: List = []
    sector_dfs_agents: List = []
    summaries: List = []
    index_to_agent: Dict = {}
    db: Chroma = None

    def __init__(
        self, llm: BaseLanguageModel, stocks: Optional[List[Dict[str, str]]] = None
    ):
        # Call the parent class constructor with empty kwargs to bypass Pydantic validation
        super().__init__(**{})

        self.llm = llm
        self.stocks = stocks

        # Fetch market data and build one DataFrame agent per sector
        self.sectors_data = run_fetch_job(stocks=self.stocks)
        self.sector_dfs_agents = [
            create_pandas_dataframe_agent(
                llm=self.llm, df=sector_data["data"], prefix=PREFIX
            )
            for sector_data in self.sectors_data
        ]

        # Summaries describe each sector DataFrame; metadata keeps the agent index
        self.summaries = []
        for index, sector_data in enumerate(self.sectors_data):
            sector_summary = Document(
                page_content=f"Historical {list(sector_data['data'].columns)} data for "
                f"{sector_data['data']['Symbol'].unique().tolist()} "
                f"stocks which belong to {sector_data['sector']} market sector",
                metadata={"sector_index": index},
            )
            self.summaries.append(sector_summary)

        self.index_to_agent = {
            index: sector_df_agent
            for index, sector_df_agent in enumerate(self.sector_dfs_agents)
        }
        self.db = Chroma.from_documents(self.summaries, OpenAIEmbeddings())

    def _run(
        self, query: str, run_manager: Optional[CallbackManagerForToolRun] = None
    ) -> str:
        # Pick the sector agent whose summary is closest to the query
        result = self.db.similarity_search(query=query, k=1)
        sector_index = result[0].metadata["sector_index"]
        sector_df_agent = self.index_to_agent[sector_index]
        response = sector_df_agent(
            query,
            # Propagate tool callbacks to the nested agent run
            callbacks=run_manager.get_child() if run_manager else None,
        )
        return response

Pandas Tool in more detail

We define our Pandas Tool as a class rather than a simple function because we need more control over creating, storing, and retrieving agents from the DataFrame agents index.

  1. We initialize the stocks field and run run_fetch_job to fetch data for each sector.
  2. Then we create a pandas DataFrame agent per sector, using the ReAct loop implementation from LangChain. The logic of this agent is similar to the logic of the SQL agent in the previous article, except now it uses an internal Python REPL Tool for pandas calculations. Also, it is provided with df.head() to figure out the schema of the given DataFrame and perform the required manipulations. It is a good exercise to explore the implementation of LangChain’s pandas DataFrame agent, and I highly encourage you to do that.
  3. We create summary descriptions of each sector DataFrame, including the sector itself and the tickers present in that DataFrame. We also add a metadata field to save the index of the corresponding agent.
  4. Then we create the index_to_agent mapping.
  5. Then we embed those summary Documents and save the embeddings in a Chroma vector store for further retrieval; a usage sketch follows below.
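
Putting it together, here is a hypothetical usage sketch of the tool on its own, outside the full AgentExecutor. It assumes an OPENAI_API_KEY in the environment (for the embeddings) and network access for run_fetch_job:

from langchain_openai import ChatOpenAI
from pandas_market_agent.agent.tools.pandas_tools import PandasTool

llm = ChatOpenAI(temperature=0, model="gpt-4")

# Builds the per-sector DataFrame agents and the Chroma descriptions index
pandas_tool = PandasTool(
    llm=llm,
    stocks=[
        {"ticker": "XOM", "sector": "Oil"},
        {"ticker": "ENPH", "sector": "Solar"},
    ],
)

# The query is embedded, matched to the Solar summary, and answered by that sector's agent
print(pandas_tool.run("What was the average DailyChangePercent for ENPH last week?"))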

Try this out

Let’s compare our pandas DataFrame agent with the SQL agent on a sector performance analysis task: plotting the average DailyChangePercent for the Oil vs Solar sectors over the last 5 trading days.

SQL Agent

Here is the full experiment in a notebook:

First, we create the SQL market agent and load stocks from the stocks.json file into a local SQLite database on disk (the default setup if we don’t pass the stocks list explicitly and don’t provide Postgres credentials for a remote database).

openai.api_key = OPENAI_API_KEY


llm = ChatOpenAI(temperature=0, model="gpt-4", streaming=True)

agent_executor = create_sql_market_agent(
    llm=llm,
    preinitialize_database=True,
)

Then we call our agent_executor:

response = agent_executor(
    {
        "input": "Calculate average performance for Oil and Solar sectors for last 5 days. "
        "You must take average performance of "
        "every stock in a sector as an average performance of given sector in a given day. "
        "Plot comparative chart for both sectors per date. Use date on x axis.",
        "chat_history": [],
    }
)

Let’s look at how Agent reasons and acts on our task:

  1. Agent selects sql_database_tool and rephrases the query:
  2. It looks at the table and its schema:
  3. Writes and checks a query to get the last 5 days’ “dailychangepercent” for stocks from the sectors of interest:
  4. Writes code to execute in the Python REPL Tool. Note that we carry the stock performance data selected by sql_database_tool inside the agent_scratchpad placeholder of our ChatPromptTemplate. Note also that the agent first collects that performance input in a dictionary and then creates two respective DataFrames for subsequent plotting.
  5. Here is the plot comparing the sectors’ performance:

Our Agent did a good job, except it failed to display dates on the x-axis. The chart itself is fine: I checked the averages for the given stocks for the following dates: 2024-01-29, 2024-01-30, 2024-01-31, 2024-02-01, 2024-02-02. The Solar sector had a pretty volatile week with a rough ending.
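
For readers who want to reproduce that last step by hand, here is a representative (not verbatim) version of the code the agent generates in the Python REPL Tool; the numbers are placeholders, not the actual averages from the run:

import pandas as pd
import matplotlib.pyplot as plt

# Shape of the data the agent carried over from sql_database_tool (placeholder values)
performance = {
    "2024-01-29": {"Oil": 0.45, "Solar": -1.10},
    "2024-01-30": {"Oil": 0.12, "Solar": 2.30},
    "2024-01-31": {"Oil": -0.30, "Solar": -3.80},
    "2024-02-01": {"Oil": 0.80, "Solar": 1.50},
    "2024-02-02": {"Oil": 0.25, "Solar": -4.20},
}

# Dates as the index, one column per sector
df = pd.DataFrame(performance).T
df.index = pd.to_datetime(df.index)

df.plot(marker="o")
plt.title("Average DailyChangePercent: Oil vs Solar")
plt.xlabel("Date")
plt.ylabel("Average DailyChangePercent")
plt.show()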

Pandas Agent

Here is the full experiment in a notebook:

First, we create the pandas market agent and load stocks from the stocks.json file into multiple DataFrames, one per sector (the default setup if we don’t pass the stocks list explicitly).

openai.api_key = OPENAI_API_KEY


llm = ChatOpenAI(temperature=0, model="gpt-4", streaming=True)

agent_executor = create_pandas_market_agent(
    llm=llm,
)

Then we call our agent_executor:

response = agent_executor(
    {
        "input": "Calculate average performance for Oil and Solar sectors for last 5 days. "
        "You must take average performance of "
        "every stock in a sector as an average performance of given sector in a given day. "
        "Plot comparative chart for both sectors per date. Use date on x axis.",
        "chat_history": [],
    },
)

Let’s look at how Agent reasons and acts on our task:

  1. Agent selects pandas_tool:
  2. Selects Solar sector data from the Solar sector DataFrame, which was chosen from the descriptions index:
  3. Writes code and plots performance for the Solar sector for the last 5 days; see the representative sketch after this list. Note that the Agent failed to plot performance for the Oil sector. That would have involved repeating the same process of getting data from the Oil DataFrame and passing it to a separate Python REPL Tool, which we defined to be used for more complex calculations and plotting.
  4. Here is the chart for the Solar sector, which is similar to what the SQL Agent produced for the same sector:
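
As promised above, here is a representative (hypothetical) version of the pandas code the sector agent runs internally for that step; the rows are placeholders standing in for the Solar sector DataFrame produced by run_fetch_job:

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder stand-in for the Solar sector DataFrame
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2024-02-01", "2024-02-01", "2024-02-02", "2024-02-02"]
        ),
        "Symbol": ["ENPH", "SPWR", "ENPH", "SPWR"],
        "DailyChangePercent": [1.2, 1.8, -4.0, -4.4],
    }
)

# Keep the last 5 trading dates, then average across all stocks per date
last_dates = df["Date"].drop_duplicates().sort_values().tail(5)
avg_by_date = (
    df[df["Date"].isin(last_dates)]
    .groupby("Date")["DailyChangePercent"]
    .mean()
)

avg_by_date.plot(marker="o", title="Solar sector average DailyChangePercent")
plt.xlabel("Date")
plt.show()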

Streamlit

Our GitHub repo contains instructions for running the Streamlit frontend for our Agent.

Final thoughts

Ok, so we observed how to implement the Pandas Tool as an alternative to the SQL Tool we used in the previous article. Both approaches are already pretty robust, but there are inefficiencies:

  1. Passing large data structures as inputs and outputs between tools while solving complex tasks for our users.
  2. SQL and Pandas Agents must be constantly corrected via the initial prompts we pass to them. It is an iterative process where we pass some input, observe some output, and ask our agent to adjust its output. Our prompt in both experiments for this article is already pretty complex, but the Pandas Agent still failed to fetch and plot data for the Oil sector. What if we had a supervisor agent to handle those iterations for us?
  3. From now on, we will use the SQL agent for further improvements. The main reason is that it is more scalable in terms of data volumes compared to in-memory pandas DataFrames.
  4. To tackle 1. and 2., we will upgrade our SQL agent by introducing multiagent workflows with LangGraph and AutoGen in the next article.

There are still a ton of experiments and improvements to work on:

  1. Use LangGraph or AutoGen to introduce multiagent workflows. We will cover those approaches and their benefits in future posts.
  2. Use open-source models as brains for our agents.
  3. Add RAG over unstructured PDFs of SEC forms (10k and 10q) for our stocks of interest.
  4. Extensive evaluation of our agents and their RAG workflows.

More to come, stay tuned, and good luck in your AI journey!

Feedback and propositions are highly welcome. Feel free to reach out via LinkedIn


Oleh Davydiuk

Founder and CEO of Arkad. Automation and AI Assistants for Banking and Investment Management.