Auto-Analyst 2.0 — The AI data analytics system

Overview and open-sourcing the project

Arslan Shahid
FireBird Technologies
7 min read · Aug 11, 2024


Image by Author

A few weeks ago, I shared an update about creating the “Auto-Analyst” and talked about the initial setup. Now, I want to update you on the new features we’ve added and explain how you can access the project since it’s open-sourced under the MIT license. I’ll also cover the project’s future plans.

You can read about the first iteration here:

User Interface

I built a user interface using low-code tooling, namely Streamlit. The UI lets you chat directly with the system or with individual agents, with additional options to load a CSV or use the provided sample data.

The UI captures the stdout produced when the Python code generated by the doer agents is executed, and it can display Plotly charts alongside tables and text output.
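The stdout capture described above can be sketched with the standard library alone. This is a simplified illustration, not the repo's actual implementation: `run_generated_code` is a hypothetical helper, and sandboxing, Plotly figure rendering, and error handling are omitted.

```python
import io
import contextlib

def run_generated_code(code: str) -> str:
    """Execute agent-generated Python and capture anything it prints.

    Sketch only: the real app also surfaces Plotly charts and tables,
    and would validate the code before running it.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # run in a fresh, empty namespace
    return buffer.getvalue()

output = run_generated_code("print('rows:', 3 * 7)")  # captures "rows: 21\n"
```

The captured string can then be handed to Streamlit (e.g. `st.text`) for display next to any charts the code produced.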

Image by Author — UI in one image
Code generated by sk_learn_agent, using the sample data provided. It performs standard-scaling preprocessing, then builds 3 clusters, and finally visualizes them as requested.
The visualization created by the above code, showing the 3 clusters
Code generated by a query to statistical_analytics_agent, building an OLS regression for price (the dependent variable) with three explanatory variables
Regression Summary built by the above code

You can call the agents directly or let the planner agent route for you; the UI is configured to help the user follow each agent's process, code, and generated output.

Visualization made by the system

Cool, right? Looking for a developer to build a custom AI agentic system, such as a financial analyst agent or a marketing analytics system, to showcase to your team, class, or stakeholders? Here is what my clients say about me:

Client Testimonial

Link to contact:

https://form.jotform.com/240744327173051

Backend

Here is how the different components of the backend were designed.

Doer Agents

Currently, the system has 4 doer agents: the agents that actually generate the code to be run based on the user's query. Here is the structure of a doer agent, using data_viz_agent as an example:

import dspy

# DSPy signature for the data_viz_agent
class data_viz_agent(dspy.Signature):
    """
    You are AI agent who uses the goal to generate data visualizations in Plotly.
    You have to use the tools available to your disposal
    If row_count of dataset > 50000, use sample while visualizing
    use this
    if len(df)>50000:
    .......
    Only this agent does the visualization
    Also only use x_axis/y_axis once in update layout
    {dataset}
    {styling_index}

    You must give an output as code, in case there is no relevant columns, just state that you don't have the relevant information

    Make sure your output is as intended! DO NOT OUTPUT THE DATASET/STYLING INDEX
    ONLY OUTPUT THE CODE AND COMMENTARY. ONLY USE ONE OF THESE 'K','M' or 1,000/1,000,000. NOT BOTH

    You may be give recent agent interactions as a hint! With the first being the latest
    DONT INCLUDE GOAL/DATASET/STYLING INDEX IN YOUR OUTPUT!
    You can add trendline into a scatter plot to show it changes,only if user mentions for it in the query!
    You are logged in streamlit use st.write instead of print
    """
    goal = dspy.InputField(desc="user defined goal which includes information about data and chart they want to plot")
    dataset = dspy.InputField(desc="Provides information about the data in the data frame. Only use column names and dataframe_name as in this context")
    styling_index = dspy.InputField(desc="Provides instructions on how to style your Plotly plots")
    code = dspy.OutputField(desc="Plotly code that visualizes what the user needs according to the query & dataframe_index & styling_context")
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed, this should not include code")

# Defining a module
query = "Visualize X with Y"
data_viz_ai = dspy.ChainOfThought(data_viz_agent)

# For information on the dataframe index & styling index (the retrieved
# contexts passed below), visit this:
# https://medium.com/firebird-technologies/building-auto-analyst-a-data-analytics-ai-agentic-system-3ac2573dcaf0
data_viz_ai(goal=query, styling_index=styling_index, dataset=dataset)

Most doer agents share a similar structure:

  1. Prompt: Defines the task the agent will perform.
  2. Retrievers: These provide additional context. For instance, the `data_viz_agent` utilizes two retrievers: one for the dataset and one for the styling index.
  3. User Query/Goal: Captures the user’s input or objective.
  4. Code & Commentary (Output): All doer agents produce these two fields. The “Code” field includes the Python code executed, while the “Commentary” field helps interpret the results, explaining what the agent did correctly and what could be improved.
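To make the retriever step concrete, here is a stdlib-only sketch of building the dataset context a retriever hands to a doer agent. `describe_dataset` is a hypothetical helper (the repo builds its dataframe index differently; see the linked article), but the idea is the same: the agent only ever sees this textual description, never the raw data.

```python
def describe_dataset(dataframe_name: str, columns: dict, row_count: int) -> str:
    """Build the textual dataset context supplied to a doer agent.

    Hypothetical helper for illustration: column names and dtypes are
    flattened into plain text so they can be interpolated into the prompt.
    """
    lines = [f"dataframe_name: {dataframe_name}", f"row_count: {row_count}"]
    for name, dtype in columns.items():
        lines.append(f"column: {name} ({dtype})")
    return "\n".join(lines)

context = describe_dataset("df", {"price": "float64", "area": "float64"}, 545)
```

This string would then be passed as the `dataset` input field of a signature like `data_viz_agent` above.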

Helper Agents

A helper agent is any agent that aids or augments the doer agents in performing their task, like the planner agent, which routes the query and plans the analysis to be sent to the appropriate doer agent.

A good example is the code_fix agent, which is triggered when code generated by the initial agent fails to execute (here is the DSPy signature):

class code_fix(dspy.Signature):
    """
    You are an AI which fixes the data analytics code from another agent, your fixed code should only fix the faulty part of the code, rest should remain the same
    You take the faulty code, and the error generated and generate the fixed code that performs the exact analysis the faulty code intends to do
    You are also give user given context that guides you how to fix the code!

    please reflect on the errors of the AI agent and then generate a
    correct step-by-step solution to the problem.
    You are logged in streamlit use st.write instead of print
    """
    faulty_code = dspy.InputField(desc="The faulty code that did not work")
    previous_code_fixes = dspy.InputField(desc="User adds additional context that might help solve the problem")
    error = dspy.InputField(desc="The error generated")

    fixed_code = dspy.OutputField(desc="The fixed code")

Unlike doer agents, helpers have a distinct design and can be built with different inputs and outputs.
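The execute-then-fix loop around a helper like code_fix can be sketched as follows. This is an assumption-laden illustration: `fix_agent` stands in for the DSPy code_fix module (here modeled as any callable taking `faulty_code` and `error` keywords and returning corrected code), and the real pipeline would also pass the user's additional context.

```python
def execute_with_fix(code: str, fix_agent, max_attempts: int = 2) -> str:
    """Run generated code; on failure, ask a fixer agent for a corrected version.

    Sketch only: `fix_agent` is a stand-in for the code_fix module, and
    execution here is a bare exec() with no sandboxing.
    """
    for attempt in range(max_attempts):
        try:
            exec(code, {})
            return code  # success: return the code that actually ran
        except Exception as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the UI
            code = fix_agent(faulty_code=code, error=str(err))

# Toy fixer for illustration; a real one would call the code_fix agent.
def toy_fixer(faulty_code, error):
    return "x = 1"

result = execute_with_fix("x = ", toy_fixer)  # SyntaxError, then fixed
```

The same loop structure works regardless of how many helper agents sit between the doer agent and the executor.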

Query Routing

Image by Author

So, the query routing process involves two steps:

1. Direct Routing to Agents: If the query specifies exact agent names (e.g., `@data_viz_agent`), it is routed directly to those agents.

2. Planning and Routing: If no specific agent names are mentioned, the query is first sent to the planner agent. The planner agent determines the appropriate agent(s) to handle the query and routes it accordingly.

This approach ensures that queries are efficiently directed to the correct agents or planned based on the context.
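The two-step routing above can be sketched in a few lines. The agent names below are the ones mentioned in this article, and `planner_agent` is an assumed identifier for the planner; the actual routing code in the repo may differ.

```python
import re

# Doer agents named in this article; the real system registers more.
AGENTS = {"data_viz_agent", "sk_learn_agent", "statistical_analytics_agent"}

def route(query: str) -> list:
    """Return the agent(s) a query should be sent to.

    Step 1: if the query mentions agents by name (e.g. "@data_viz_agent"),
    route directly to them. Step 2: otherwise, hand off to the planner.
    """
    mentioned = [name for name in re.findall(r"@(\w+)", query) if name in AGENTS]
    if mentioned:
        return mentioned           # direct routing
    return ["planner_agent"]       # let the planner decide

route("@data_viz_agent plot price vs area")  # → ["data_viz_agent"]
```

Queries without an explicit `@agent_name` fall through to the planner, which then plans and dispatches the analysis itself.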

Want help designing and building AI agents, RAG pipelines, or LLM applications? Want someone experienced to help you figure out how to integrate LLMs into your application? You can reach out to me directly here:

https://form.jotform.com/240744327173051

More comprehensive docs have been added in the GitHub repo:

Future Plans

I’m fully open-sourcing the project under the MIT license and would greatly appreciate contributions from the community. Here are some of my proposed plans and areas where your help would be invaluable:

Immediate Plans

Here are the features I have planned for the next iteration:

  1. Prompt Optimization: Given that the project is built using DSPy, which focuses on optimizing prompts programmatically, it would be beneficial to develop an optimization pipeline to systematically enhance prompts.
  2. Improve Code-Fix: I’ve developed a code-fix pipeline, but it often doesn’t resolve execution errors effectively. I plan to add a RAG pipeline that will include a codebase of common errors to improve error handling and resolution.
  3. Add more UI options: The current UI is built with Streamlit, which is excellent for rapid development and iteration. However, it has limitations in execution speed and introduces latency, making it less suitable for large-scale applications.
  4. Add more agents: Adding more agents means more functionality

Long-Term Plans/Problems to Solve

It is hard to predict what the project will look like in the future, but I am leaving the long-term problems this project should address as questions that we (the community) need to focus on:

Q1. What is the optimal structure for agents?

This is possibly the most important question: I’ve created a structure based on my understanding of the problem. Is this the best approach? I’ve developed agents focused on specific Python packages or typical actions performed by data scientists. This problem is linked to how data scientists generally solve data-related problems.

Q2. How do we handle different industries/analytics functions?

Should we use separate systems for marketing analytics and financial analytics? Different industries face common analytics challenges — like portfolio optimization in finance and “matching” problems in supply chain management. Do we change the system for each of these or introduce industry-specific agents/tools?

Q3. What is the best UX for this?

Is the chatbot the best approach? Or do we add common data analytics UX features like dashboards? Even if we keep the chatbot approach there are many things we can change to make for a better UX.

Thank you for reading! Please follow me & FireBird Technologies on Medium to stay updated on the project.
