How to Leverage AI for Data Visualization Using pandas

Vivek Muskan
4 min read · Jul 1, 2023

Pandas-AI is a Python library that adds generative AI capabilities to pandas.

Pandas-AI connects the traditional pandas library to the OpenAI language model through an OpenAI API key. The key acts as a bridge: it lets the library send requests about your data to the model and use its generative capabilities. With it, you can perform data summarization, cleaning, imputation, feature generation, and more using natural language commands instead of writing every step by hand. pandas becomes a more versatile tool that helps you derive deeper insights from your data and automate complex analytical tasks.

Installation:

Pandas-AI is designed to complement pandas, the popular data analysis and manipulation tool, rather than replace it. In this article, we will install Pandas-AI, walk through code examples, and explain its key features. Install the library with pip:

pip install pandasai

Code Examples and Explanation:

Clean Data:

One of the key features of Pandas-AI is the ability to clean data efficiently. Here’s an example of how to use the clean_data function:

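import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Sample data with one missing value in 'Age' and one in 'Salary'
data = {
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Age': [25, 32, None, 28, 35],
    'Salary': [5000, None, 4500, 7000, 5500]
}

df = pd.DataFrame(data)

# PandasAI needs a language model; replace the placeholder with your own OpenAI API key
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")
pandas_ai = PandasAI(llm)

# Ask the model to clean the DataFrame
cleaned_df = pandas_ai.clean_data(df)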

The clean_data function handles common data cleaning tasks such as missing values, inconsistent formats, and outliers. In the example above, the 'Age' and 'Salary' columns each contain one missing value (for Liam and Emma, respectively), which the function is expected to detect and fill.
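Before handing a DataFrame to the model, it can help to see the gaps for yourself. The snippet below is a minimal plain-pandas sketch (it does not use Pandas-AI) that counts the missing values in the sample data and lists the affected rows:

```python
import pandas as pd

data = {
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Age': [25, 32, None, 28, 35],
    'Salary': [5000, None, 4500, 7000, 5500],
}
df = pd.DataFrame(data)

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Show only the rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Comparing this summary with cleaned_df is a quick way to confirm that the gaps were actually filled.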

Impute Missing Values:

Imputing missing values is a crucial step in data preprocessing. Pandas-AI provides the impute_missing_values function to handle this task effectively. Here's an example:

imputed_df = pandas_ai.impute_missing_values(df)

The impute_missing_values function identifies missing values in the DataFrame and replaces them with appropriate values based on the column's data type. For instance, it can replace missing numeric values with the mean or median, while categorical values can be imputed with the most frequent category.
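To make the strategy described above concrete, here is a rough plain-pandas sketch of median imputation for numeric columns and most-frequent-value imputation for categorical ones. It is not how Pandas-AI works internally; impute_simple is a hypothetical helper, and the 'Department' column is added only to have a categorical gap to fill.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Age': [25, 32, None, 28, 35],
    'Salary': [5000, None, 4500, 7000, 5500],
    # Hypothetical categorical column, added for illustration only
    'Department': ['HR', 'Sales', None, 'Sales', 'IT'],
})

def impute_simple(frame):
    """Fill numeric gaps with the median and other gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = impute_simple(df)
print(imputed)
```

A baseline like this gives you something deterministic to compare against whatever values the model chooses.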

Generate Features:

Feature generation is essential for creating meaningful predictors in machine learning tasks. Pandas-AI offers the generate_features function to simplify this process. Here's an example:

feature_df = pandas_ai.generate_features(df)

The generate_features function allows you to create new features from existing ones using various transformation techniques. It provides a convenient way to derive insights from your data and improve model performance.
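If you are wondering what "new features from existing ones" might look like, here is a small hand-written sketch in plain pandas. The two features, AgeGroup and SalaryPerYearOfAge, are made-up examples chosen for illustration; they are not what generate_features is guaranteed to return.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Age': [25, 32, None, 28, 35],
    'Salary': [5000, None, 4500, 7000, 5500],
})

features = df.copy()

# Bucket ages into labeled ranges; a missing age stays missing
features['AgeGroup'] = pd.cut(
    features['Age'],
    bins=[0, 30, 40, 120],
    right=False,  # intervals are [0, 30), [30, 40), [40, 120)
    labels=['<30', '30-39', '40+'],
)

# Ratio feature; the result is NaN when either input is missing
features['SalaryPerYearOfAge'] = features['Salary'] / features['Age']

print(features)
```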

Plot Histogram:

Data visualization is crucial for understanding the distribution of variables. Pandas-AI includes the plot_histogram function for easy visualization. Here's an example:

pandas_ai.plot_histogram(df, column="Salary")

The plot_histogram function generates a histogram plot for the specified column in the DataFrame. It helps visualize the distribution of values and identify any patterns or anomalies.
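A histogram is just a set of bin counts drawn as bars. As a rough illustration of the idea, the sketch below computes those counts for the 'Salary' column with NumPy and prints a text version. It does not use Pandas-AI, and the choice of three bins is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Age': [25, 32, None, 28, 35],
    'Salary': [5000, None, 4500, 7000, 5500],
})

# Missing salaries cannot be binned, so drop them first
salaries = df['Salary'].dropna()

# Split the salary range into three equal-width bins and count the values in each
counts, edges = np.histogram(salaries, bins=3)

# Print one row per bin as a simple text histogram
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:8.1f} to {right:8.1f}: {'#' * int(n)}")
```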

Asking Questions:

Pandas-AI enables users to ask questions related to their data and obtain insightful answers. Here’s an example:

# Two related tables that share the 'EmployeeID' column
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance']
}

salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

# Asking a question (reuses the pandas_ai instance created earlier)
result = pandas_ai([employees_df, salaries_df], "Who gets paid the most?")
print(result)

By passing multiple DataFrames and a question to the PandasAI instance, you can use its generative AI capabilities to get answers that draw on several data sources at once.
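For a question this simple, you can check the model's answer by hand. The sketch below uses only plain pandas: it joins the two tables on 'EmployeeID' and picks the row with the highest salary.

```python
import pandas as pd

employees_df = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance'],
})

salaries_df = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500],
})

# Join the two tables on their shared key
merged = employees_df.merge(salaries_df, on='EmployeeID')

# Select the row with the largest salary
top = merged.loc[merged['Salary'].idxmax()]
print(top['Name'], top['Salary'])
```

With this sample data the highest earner is Olivia, with a salary of 7000, so that is the answer to look for in result.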

Try it yourself (Google Colab)

pip repository

Congratulations on using artificial intelligence for your data visualization!

PandasAI brings generative AI capabilities to pandas, augmenting its data analysis and manipulation abilities. In this article, we covered the installation process, along with practical code examples and explanations for various functionalities provided by PandasAI, such as data cleaning, imputing missing values, generating features, and plotting histograms. By integrating PandasAI into your data analysis workflow, you can enhance your productivity and gain valuable insights from your datasets.
