Auto-Analyst 2.0 — The AI data analytics system

Overview and open-sourcing the project

Arslan Shahid
FireBird Technologies
7 min read · Aug 11, 2024


Image by Author

A few weeks ago, I shared an update about creating the “Auto-Analyst” and talked about the initial setup. Now, I want to update you on the new features we’ve added and explain how you can access the project since it’s open-sourced under the MIT license. I’ll also cover the project’s future plans.

You can read about the first iteration here:

User Interface

I built a user interface using low-code tooling, namely Streamlit. The UI lets you chat directly with the system or with individual agents, with additional options to load a CSV or use the provided sample data.

The UI captures the stdout produced when the Python code generated by the doer agents is executed, and it can display Plotly charts alongside tables and text output.
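The stdout capture described above can be sketched with the standard library alone. This is a simplified illustration, not the repo's actual implementation: `run_generated_code` is a hypothetical helper, and sandboxing, Plotly figure rendering, and error handling are omitted.

```python
import io
import contextlib

def run_generated_code(code: str) -> str:
    """Execute agent-generated Python and capture anything it prints.

    Sketch only: the real app also surfaces Plotly charts and tables,
    and would validate the code before running it.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # run in a fresh, empty namespace
    return buffer.getvalue()

output = run_generated_code("print('rows:', 3 * 7)")  # captures "rows: 21\n"
```

The captured string can then be handed to Streamlit (e.g. `st.text`) for display next to any charts the code produced.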

Image by Author — UI in one image
Code generated by sk_learn_agent, using the sample data provided. It performs standard-scaling preprocessing, then builds 3 clusters, and finally visualizes them as requested.
The visualization created by the above code, showing the 3 clusters
Code generated by a query to statistical_analytics_agent, building an OLS regression for price (the dependent variable) with three explanatory variables
Regression Summary built by the above code

You can call the agents directly or let the planner agent route for you; the UI is configured to help the user follow each agent's process, code, and generated output.

Visualization made by the system

Cool, right? Looking for a developer to build a custom AI agentic system, such as a financial analyst agent or a marketing analytics system, to showcase to your team, class, or stakeholders? Here is what my clients say about me:

Client Testimonial

Link to contact:

https://form.jotform.com/240744327173051

Backend

Here is how the different components of the backend were designed.

Doer Agents

Currently, the system has 4 doer agents: the agents that actually generate the code to be run based on the user's query. Here is the structure of a doer agent, using data_viz_agent as an example:

import dspy

# DSPy signature for the data_viz_agent
class data_viz_agent(dspy.Signature):
    """
    You are AI agent who uses the goal to generate data visualizations in Plotly.
    You have to use the tools available to your disposal
    If row_count of dataset > 50000, use sample while visualizing
    use this
    if len(df)>50000:
    .......
    Only this agent does the visualization
    Also only use x_axis/y_axis once in update layout
    {dataset}
    {styling_index}

    You must give an output as code, in case there is no relevant columns, just state that you don't have the relevant information

    Make sure your output is as intended! DO NOT OUTPUT THE DATASET/STYLING INDEX
    ONLY OUTPUT THE CODE AND COMMENTARY. ONLY USE ONE OF THESE 'K','M' or 1,000/1,000,000. NOT BOTH

    You may be give recent agent interactions as a hint! With the first being the latest
    DONT INCLUDE GOAL/DATASET/STYLING INDEX IN YOUR OUTPUT!
    You can add trendline into a scatter plot to show it changes,only if user mentions for it in the query!
    You are logged in streamlit use st.write instead of print
    """
    goal = dspy.InputField(desc="user defined goal which includes information about data and chart they want to plot")
    dataset = dspy.InputField(desc="Provides information about the data in the data frame. Only use column names and dataframe_name as in this context")
    styling_index = dspy.InputField(desc="Provides instructions on how to style your Plotly plots")
    code = dspy.OutputField(desc="Plotly code that visualizes what the user needs according to the query & dataframe_index & styling_context")
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed, this should not include code")

# Defining a module
query = "Visualize X with Y"
data_viz_ai = dspy.ChainOfThought(data_viz_agent)

# For information on the dataframe index & styling index (the retrieved
# contexts passed below), visit this:
# https://medium.com/firebird-technologies/building-auto-analyst-a-data-analytics-ai-agentic-system-3ac2573dcaf0
data_viz_ai(goal=query, styling_index=styling_index, dataset=dataset)

Most doer agents share a similar structure:

  1. Prompt: Defines the task the agent will perform.
  2. Retrievers: These provide additional context. For instance, the `data_viz_agent` utilizes two retrievers: one for the dataset and one for the styling index.
  3. User Query/Goal: Captures the user’s input or objective.
  4. Code & Commentary (Output): All doer agents produce these two fields. The “Code” field includes the Python code executed, while the “Commentary” field helps interpret the results, explaining what the agent did correctly and what could be improved.
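To make the retriever step concrete, here is a stdlib-only sketch of building the dataset context a retriever hands to a doer agent. `describe_dataset` is a hypothetical helper (the repo builds its dataframe index differently; see the linked article), but the idea is the same: the agent only ever sees this textual description, never the raw data.

```python
def describe_dataset(dataframe_name: str, columns: dict, row_count: int) -> str:
    """Build the textual dataset context supplied to a doer agent.

    Hypothetical helper for illustration: column names and dtypes are
    flattened into plain text so they can be interpolated into the prompt.
    """
    lines = [f"dataframe_name: {dataframe_name}", f"row_count: {row_count}"]
    for name, dtype in columns.items():
        lines.append(f"column: {name} ({dtype})")
    return "\n".join(lines)

context = describe_dataset("df", {"price": "float64", "area": "float64"}, 545)
```

This string would then be passed as the `dataset` input field of a signature like `data_viz_agent` above.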

Helper Agents

A helper agent is any agent that aids or augments the doer agents in performing their task, like the planner agent, which routes the query and plans the analysis to be sent to the appropriate doer agent.

A good example is the code_fix agent, which is triggered when code generated by the initial agent fails to execute (here is the DSPy signature):

class code_fix(dspy.Signature):
    """
    You are an AI which fixes the data analytics code from another agent, your fixed code should only fix the faulty part of the code, rest should remain the same
    You take the faulty code, and the error generated and generate the fixed code that performs the exact analysis the faulty code intends to do
    You are also give user given context that guides you how to fix the code!

    please reflect on the errors of the AI agent and then generate a
    correct step-by-step solution to the problem.
    You are logged in streamlit use st.write instead of print
    """
    faulty_code = dspy.InputField(desc="The faulty code that did not work")
    previous_code_fixes = dspy.InputField(desc="User adds additional context that might help solve the problem")
    error = dspy.InputField(desc="The error generated")

    fixed_code = dspy.OutputField(desc="The fixed code")

Unlike doer agents, helpers have a distinct design and can be built with different inputs and outputs.
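The execute-then-fix loop around a helper like code_fix can be sketched as follows. This is an assumption-laden illustration: `fix_agent` stands in for the DSPy code_fix module (here modeled as any callable taking `faulty_code` and `error` keywords and returning corrected code), and the real pipeline would also pass the user's additional context.

```python
def execute_with_fix(code: str, fix_agent, max_attempts: int = 2) -> str:
    """Run generated code; on failure, ask a fixer agent for a corrected version.

    Sketch only: `fix_agent` is a stand-in for the code_fix module, and
    execution here is a bare exec() with no sandboxing.
    """
    for attempt in range(max_attempts):
        try:
            exec(code, {})
            return code  # success: return the code that actually ran
        except Exception as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the UI
            code = fix_agent(faulty_code=code, error=str(err))

# Toy fixer for illustration; a real one would call the code_fix agent.
def toy_fixer(faulty_code, error):
    return "x = 1"

result = execute_with_fix("x = ", toy_fixer)  # SyntaxError, then fixed
```

The same loop structure works regardless of how many helper agents sit between the doer agent and the executor.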

Query Routing

Image by Author

So, the query routing process involves two steps:

1. Direct Routing to Agents: If the query specifies exact agent names (e.g., `@data_viz_agent`), it is routed directly to those agents.

2. Planning and Routing: If no specific agent names are mentioned, the query is first sent to the planner agent. The planner agent determines the appropriate agent(s) to handle the query and routes it accordingly.

This approach ensures that queries are efficiently directed to the correct agents or planned based on the context.
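The two-step routing above can be sketched in a few lines. The agent names below are the ones mentioned in this article, and `planner_agent` is an assumed identifier for the planner; the actual routing code in the repo may differ.

```python
import re

# Doer agents named in this article; the real system registers more.
AGENTS = {"data_viz_agent", "sk_learn_agent", "statistical_analytics_agent"}

def route(query: str) -> list:
    """Return the agent(s) a query should be sent to.

    Step 1: if the query mentions agents by name (e.g. "@data_viz_agent"),
    route directly to them. Step 2: otherwise, hand off to the planner.
    """
    mentioned = [name for name in re.findall(r"@(\w+)", query) if name in AGENTS]
    if mentioned:
        return mentioned           # direct routing
    return ["planner_agent"]       # let the planner decide

route("@data_viz_agent plot price vs area")  # → ["data_viz_agent"]
```

Queries without an explicit `@agent_name` fall through to the planner, which then plans and dispatches the analysis itself.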

Want help designing and building AI agents, RAG pipelines, or LLM applications? Want someone experienced to help you figure out how to integrate LLMs into your application? You can reach out to me directly here:

https://form.jotform.com/240744327173051

More comprehensive docs have been added in the GitHub repo:

Future Plans

I’m fully open-sourcing the project under the MIT license and would greatly appreciate contributions from the community. Here are some of my proposed plans and areas where your help would be invaluable:

Immediate Plans

Here are the features I have planned for the next iteration:

  1. Prompt Optimization: Given that the project is built using DSPy, which focuses on optimizing prompts programmatically, it would be beneficial to develop an optimization pipeline to systematically enhance prompts.
  2. Improve Code-Fix: I’ve developed a code-fix pipeline, but it often doesn’t resolve execution errors effectively. I plan to add a RAG pipeline that will include a codebase of common errors to improve error handling and resolution.
  3. Add more UI options: The current UI is built with Streamlit, which is excellent for rapid development and iteration. However, it has limitations in execution speed and introduces latency, making it less suitable for large-scale applications.
  4. Add more agents: Adding more agents means more functionality

Long-Term Plans/Problems to Solve

It is hard to predict what the project will look like in the future, but I am leaving the long-term problems this project should address as questions that we (the community) need to focus on:

Q1. What is the optimal structure for agents?

This is possibly the most important question: I’ve created a structure based on my understanding of the problem. Is this the best approach? I’ve developed agents focused on specific Python packages or typical actions performed by data scientists. This problem is linked to how data scientists generally solve data-related problems.

Q2. How do we handle different industries/analytics functions?

Should we use separate systems for marketing analytics and financial analytics? Different industries face common analytics challenges — like portfolio optimization in finance and “matching” problems in supply chain management. Do we change the system for each of these or introduce industry-specific agents/tools?

Q3. What is the best UX for this?

Is the chatbot the best approach? Or do we add common data analytics UX features like dashboards? Even if we keep the chatbot approach there are many things we can change to make for a better UX.

Thank you for reading! Please follow me & FireBird Technologies on Medium to stay updated on the project.
