Introducing Auto-Analyst 🎉

An Open-Source Tool for Data Analysis Using Natural Language

Mar 29, 2023

Introduction

Data analysis has become a crucial part of modern businesses and industries, as the need to make sense of vast amounts of data keeps growing. But for many people, running analytical queries and creating visualizations can be time-consuming and daunting, especially if they lack data skills. Enter Auto-Analyst, an open-source project designed to simplify this process by allowing you to perform complex data analysis using natural language.

Auto-Analyst provides an intuitive interface that enables users to upload a CSV file and start asking analytical questions about their data. The tool processes these questions, performs the required data analysis, and returns the results in an easy-to-understand format. Auto-Analyst is written in Python, and users can run the tool on their local machines by cloning the code repository and following its ‘Run Locally’ instructions.

Opportunity

The landscape of data analysis is rapidly evolving, with significant advancements in natural language processing and large language models (LLMs) fueling this change. LLMs such as OpenAI’s ChatGPT, Meta’s LLaMA, and Stanford’s Alpaca have demonstrated remarkable success in tasks like code generation, reasoning, and language understanding. Their capabilities open up new opportunities to make data analysis more accessible and user-friendly for non-technical users.

Auto-Analyst was born out of the need to harness the power of these LLMs to make data analysis more approachable. With the increasing importance of data in decision-making, it’s essential to empower people to analyze and visualize data without relying on programming skills or deep technical expertise. Auto-Analyst provides a user-friendly interface that allows you to ask questions about your data using natural language, simplifying the data analysis process and saving time. By leveraging the potential of LLMs, Auto-Analyst aims to become a powerful tool in the data analysis ecosystem, bridging the gap between complex data and its interpretation.

Under the Hood

Auto-Analyst combines the power of large language models and a well-structured pipeline to facilitate seamless data analysis. Let’s take a closer look at the detailed steps involved in the process:

Parsing the data, description, and question: Auto-Analyst takes the dataset, a brief description of the data, and a question as input. Using a large language model such as ChatGPT, it interprets the context and structure of the data, as well as the intent behind the question.
Basic data cleaning and preprocessing: Before diving into the analysis, Auto-Analyst performs essential data cleaning tasks, such as removing duplicate entries, handling missing values, and converting data types. This ensures the data is ready for further processing and analysis.
Determining the answer type: Based on the input question, Auto-Analyst classifies the required response as either an aggregation or a visualization. This decision is critical in guiding the subsequent steps of the analysis process.
Aggregation: If the question requires an aggregated answer, Auto-Analyst uses a large language model to generate an SQL query tailored to the specific question. It then attempts to execute the query on the dataset. If the query fails, the large language model is called again to refine and correct the query. This process repeats until a working query is obtained or the user-defined maximum number of tries is reached. The final aggregation results are then returned to the user.
Visualization: If the question calls for a plot, Auto-Analyst first identifies the aggregated data needed for the visualization by employing large language models. It uses the aggregation steps described above to obtain this data, ensuring it’s in the correct format for plotting. Next, the large language model is used to generate Python code for the plot, taking into account the specific visualization type (e.g., bar chart, line chart, pie chart) and any customization options required. Finally, Auto-Analyst renders the visualization and returns it to the user.
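
To make the cleaning and aggregation steps above a little more concrete, here is a minimal Python sketch. It is not Auto-Analyst’s actual code. It assumes SQLite (through the standard-library sqlite3 module) as the query engine, which the project may or may not use, and the names load_csv, answer_with_retries, and fake_llm are hypothetical. fake_llm stands in for a real LLM call: it returns a broken query first and a corrected one after it sees the error message.

```python
import csv
import io
import sqlite3
from typing import Callable, Optional


def coerce(cell: str):
    """Convert a CSV cell to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(cell)
        except ValueError:
            pass
    return cell


def load_csv(conn: sqlite3.Connection, csv_text: str, table: str = "data") -> None:
    """Basic cleaning: skip rows with missing values, drop duplicates, convert types."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    seen, rows = set(), []
    for raw in reader:
        if len(raw) != len(header) or any(cell.strip() == "" for cell in raw):
            continue  # malformed row or missing value
        row = tuple(coerce(cell) for cell in raw)
        if row in seen:
            continue  # exact duplicate
        seen.add(row)
        rows.append(row)
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)


def answer_with_retries(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
    max_tries: int = 3,
):
    """Ask for a query, run it, and feed any error back until it works or tries run out."""
    error = None
    for _ in range(max_tries):
        query = generate_sql(question, error)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # passed to the next generation attempt
    raise RuntimeError(f"no working query after {max_tries} tries: {error}")


def fake_llm(question: str, error: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong column first, fixed after an error."""
    if error is None:
        return "SELECT region, SUM(sale) FROM data GROUP BY region"
    return "SELECT region, SUM(sales) FROM data GROUP BY region ORDER BY region"


conn = sqlite3.connect(":memory:")
load_csv(conn, "region,sales\neast,10\neast,10\nwest,5\nnorth,\nwest,7\n")
result = answer_with_retries(conn, "Total sales by region?", fake_llm)
print(result)  # [('east', 10), ('west', 12)]
```

The first query fails because the column sale does not exist, so the error text goes back to the generator and the second attempt succeeds. In a real setup the error message, the question, and the table schema would presumably all be part of the prompt sent to the model.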

Next Steps

Auto-Analyst is a continually evolving open-source project, and we have several exciting plans to expand its capabilities and scope:

Support for multiple data sources: Currently, Auto-Analyst supports analytics on a single file. We aim to extend its functionality to handle multiple files, databases, APIs, and even unstructured data from documents. This will enable users to extract insights from diverse and complex data sources more efficiently.
Advanced statistical analyses and open-ended questions: At present, Auto-Analyst supports summary statistics and visualizations. Our goal is to incorporate more sophisticated statistical analyses, such as regression, clustering, and time series analysis. Additionally, we want to enhance the tool’s ability to handle more open-ended questions, enabling users to dive deeper into their data and discover hidden patterns.
Community-driven development: As an open-source project, we believe that the community has a significant role to play in shaping the future of Auto-Analyst. We welcome contributions from developers, data scientists, and enthusiasts to improve the tool’s functionality, add new features, and refine the user experience. By collaborating, we can ensure that Auto-Analyst becomes a comprehensive and user-friendly data analysis solution for everyone. GitHub repository: https://github.com/aadityaubhat/auto-analyst

By addressing these goals, we aim to make Auto-Analyst an even more powerful and versatile tool for data analysis, empowering users to unlock the full potential of their data, regardless of their technical expertise.

Conclusion

There is a growing demand for user-friendly data analysis tools that cater to both technical and non-technical users. While startups like seek.ai and olli.ai are already providing similar solutions for enterprise data, there is a pressing need for open-source projects in this space to democratize access to these tools and encourage innovation.

Auto-Analyst is one such promising open-source project that aims to simplify data analysis. Although the project is in the early stages of development, its potential to transform data analysis for a wide range of users is evident. By supporting and contributing to projects like Auto-Analyst, we can work together to make data analysis more accessible and empower users from diverse backgrounds to unlock the full potential of their data.