How an AI Assistant used CRISP-DM to solve Data Science problems

Yash Kumar
7 min read · Sep 30, 2024

When my Data Mining professor asked us to try doing data science with an AI assistant, I was initially skeptical. Sure, it might give me sample code to run and explain each step of the process, but could it carry out an end-to-end project and hand back the visualizations and graphs along the way?

Well, twenty dollars and thirty minutes later, I was in shock! After some prompt engineering and discussion with my assistant, I had a clear understanding of how an experienced Data Scientist (ChatGPT 4o mini, that is) would approach the Insurance Charges Prediction problem on the Medical Insurance dataset by following the CRISP-DM methodology.

The first step in this process was to set the context for the Assistant using the initial prompt.

You are an industry-leading data scientist who is expert at utilizing CRISP-DM methodology to solve data science problems in every domain. I want you to guide me step-by-step through the dataset that I have provided. We will do so one step at a time, and you will not move onto the next step until I say “Next” in my prompt. Create a clean response and clearly explain your thought process. Use reasonably-sized chunks carefully considering we have limited compute power.

We are trying to predict the ‘charges’ column. You must provide me with the code (and the response from execution) and the report of each step. While following CRISP-DM, remember to include data understanding, viz, cleaning, preprocessing, feature selection, and regression (using different models) in the appropriate step. Compare the performance of the models over numerous metrics and recommend me the best model. At the end of each step, summarize what we did and what comes next.

You can utilize the scikit-learn and pandas libraries of Python along with any others you require. The chunks should be divided into each step of the CRISP DM methodology.

In the first paragraph, I asked it to simulate an expert data scientist’s role and gave it a list of clear instructions to follow when generating the result. In the following paragraph, I provided it with the problem statement and dataset and asked it to generate certain artifacts per step, including code, executed response, and a summary. I also defined the different tasks that it must do throughout the process in the appropriate phase. In the final paragraph, I just added some extra information I deemed necessary.

The prompt took a few rewrites before it produced all the required artifacts in an appropriate format. Once I was satisfied with the output, I proceeded with the actual generation steps.

In its first response, it performed the entirety of the Business Understanding phase and some basic tasks of the Data Understanding phase.

Business Understanding Phase

Similar to how a Data Scientist would listen to the problem statement and formulate a set of requirements, the assistant was able to do the same by clearly defining our Objective.

Data Understanding Phase: Loaded and obtained basic information on the data

Moving on to the Data Understanding phase, it called the dataframe's info() and head() methods on the provided dataset and displayed the results. It also checked the columns for missing values and generated the summary and next steps.

Data Understanding Phase: Described the columns and generated a summary and next actions
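For readers who want to reproduce this first look, here is a minimal sketch of the same checks in pandas. It is not the assistant's actual code. The handful of inline rows are hypothetical, and the column names other than 'charges' are an assumption based on the commonly distributed version of the Medical Insurance dataset.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real CSV file.
# Column names other than 'charges' are assumed, not taken from the article.
CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.77,1,no,southeast,1725.55
28,male,33.0,3,no,southeast,4449.46
33,male,22.705,0,no,northwest,21984.47
32,male,28.88,0,no,northwest,3866.86
"""

df = pd.read_csv(io.StringIO(CSV))

df.info()                     # column dtypes and non-null counts
print(df.head())              # first few rows
missing = df.isnull().sum()   # number of missing values per column
print(missing)
print(df.describe())          # summary statistics for the numeric columns
```

With the real file, the `io.StringIO(CSV)` argument would be replaced by the path to the dataset.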

After this chunk was generated, I asked it to proceed by saying “Next”, as defined in my prompt. It chose to proceed with Exploratory Data Analysis (EDA), but I was confused about why it didn’t use the title “Data Understanding”. I asked it about this and it gave me a valid response, much like a project partner would in a discussion!

Just like a conversation!

It justified its own decision while acknowledging that it should have done it differently, which was quite cool! Coming back to the problem at hand, it followed the same pattern for the EDA step, generating the code and its executed output. It used the Matplotlib and Seaborn libraries to create the following histograms, boxplots, and correlation matrix.

Data Understanding Phase: Histogram
Data Understanding Phase: Boxplot Diagrams
Data Understanding Phase: Correlation Matrix

It also generated a clear report summary from each of these visualizations.

Data Understanding Phase: EDA Report
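The plots themselves are images, but the numbers behind them are easy to compute. The sketch below derives histogram bin counts, boxplot-style outliers (points beyond 1.5 times the interquartile range), and the correlation matrix using only pandas and NumPy. The data is synthetic and the column names are assumptions; plotting with Matplotlib or Seaborn is left out so the example stays small.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; columns and coefficients are illustrative assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
})
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 2000, n)
)

# Histogram: how many rows fall into each of 10 equal-width bins of 'charges'.
counts, edges = np.histogram(df["charges"], bins=10)

# Boxplot outliers: values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["charges"] < q1 - 1.5 * iqr) | (df["charges"] > q3 + 1.5 * iqr)]

# Correlation matrix over the numeric columns (what a heatmap would display).
corr = df.select_dtypes("number").corr()

print(counts)
print(len(outliers), "potential outliers in 'charges'")
print(corr.round(2))
```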

With the Data Understanding phase complete, I said “Next” and we proceeded to the Data Preparation phase.

Data Preparation Phase: Preprocessing columns

The assistant decided to convert the categorical columns into numerical form and normalize the numerical ones. The preprocessed data looks different from what we started with, but it is far more ready to be fed to a model.

Data Preparation Phase: Updated columns
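One common way to do this kind of preprocessing is one-hot encoding followed by standardization. The sketch below shows that approach with pandas and scikit-learn; the article does not record which encoding or scaling the assistant actually chose, so treat both as assumptions, along with the sample rows and column names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows; column names other than 'charges' are assumed.
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "sex": ["female", "male", "male", "male", "male", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 33.4],
    "children": [0, 1, 3, 0, 0, 1],
    "smoker": ["yes", "no", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 8240.59],
})

# One-hot encode the categorical columns; drop_first avoids a redundant column
# per category (for example, 'smoker_no' is implied by 'smoker_yes' == 0).
encoded = pd.get_dummies(
    df, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)

# Standardize the numeric features to zero mean and unit variance.
numeric_cols = ["age", "bmi", "children"]
scaled = StandardScaler().fit_transform(encoded[numeric_cols].astype(float))
encoded[numeric_cols] = scaled

print(encoded.head())
```

In a real project the scaler would normally be fitted on the training split only, so that no information from the test rows leaks into the preprocessing.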

But before we proceeded, I wanted to return to the Data Understanding phase, the second phase of CRISP-DM, since we had created new columns. I asked it to regenerate the correlation matrix along with its findings.

Data Understanding Phase: Correlation Matrix with updated columns
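The reason this second look is useful is that an encoded column such as a smoker flag only shows up in a correlation matrix once it is numeric. A small sketch of that idea, with made-up values and an assumed 'smoker' column:

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46, 60, 25],
    "smoker": ["yes", "no", "no", "no", "no", "no", "yes", "yes"],
    "charges": [16884.0, 1725.0, 4449.0, 5100.0, 3866.0, 8240.0, 48000.0, 17500.0],
})

# After encoding, 'smoker_yes' is numeric and can take part in the correlation.
encoded = pd.get_dummies(df, columns=["smoker"], drop_first=True, dtype=int)

# Correlation of every feature with the target, strongest first.
target_corr = (
    encoded.corr()["charges"].drop("charges").sort_values(ascending=False)
)
print(target_corr)
```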

With these new findings in place, we proceeded to the next step, the Modeling phase.

Modeling Phase: Training and Testing

It selected three different regression models: Linear Regression, Decision Tree Regressor, and Random Forest Regressor. It generated and executed the code to train the models on our dataset and displayed their performance on some common metrics: Mean Absolute Error, Mean Squared Error, and R-squared. It then summarized their performance and recommended Random Forest Regressor as the best, with a justification.

Modeling Phase: Comparison
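A comparison like this can be sketched in a few lines of scikit-learn. The example below trains the same three model types on synthetic data and reports the same three metrics. The data, the 80/20 split, and the hyperparameters are all assumptions, so the numbers it prints say nothing about the results in the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features and the 'charges' target.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker_yes": rng.integers(0, 2, n),
})
y = (
    250 * X["age"]
    + 20000 * X["smoker_yes"]
    + 800 * X["smoker_yes"] * (X["bmi"] - 30)   # an interaction term
    + rng.normal(0, 2000, n)
)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

print(pd.DataFrame(results).T.round(3))
```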

I probed a little further to understand why each model might have performed the way it did relative to the others, and it offered some general reasoning. It cited Linear Regression’s assumption of linearity and Decision Trees’ tendency to overfit as the primary reasons they performed worse than Random Forest Regressor, which balances complexity and generalization, capturing both non-linear relationships and interactions without overfitting.

Finally, we moved on to the Evaluation phase. After choosing Random Forest Regressor, it evaluated the trained model on the test data and generated the final metrics and summary.

Evaluation Phase: Metrics

When I asked it to go to the next phase, I was expecting some Deployment strategies. However, it chose to perform some Feature Importance Analysis. I questioned its decision to do so, to which it pointed out that while this is generally a part of the Data Understanding phase, performing this step here enables us to understand the model behavior and potentially refine the model. This is the essence of the CRISP-DM methodology: being able to revisit previous steps in order to improve, which was a great learning point for me!

Evaluation Phase: Feature Importance Analysis
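For a random forest, one standard way to run this analysis is to read the fitted model's `feature_importances_` attribute, which scikit-learn exposes on its tree ensembles. The sketch below does that on synthetic data. The feature names and the way the target is generated are assumptions made for illustration, and the article does not say which importance measure the assistant used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which smoking status drives most of the cost by design.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "smoker_yes": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 20000 * X["smoker_yes"] + rng.normal(0, 2000, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances describe what the fitted model relies on, not causal effects, so they are best read alongside the earlier correlation analysis.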

It was able to discover some key insights into which trends seem to be impacting the insurance costs, along with its final recommendations.

With the Evaluation phase complete, we moved to the final step, the Deployment phase. It provided me with a deployment strategy and sample Flask code to implement it, along with some points to note for monitoring and maintenance.

Deployment Phase
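To give a sense of what such a Flask sample can look like, here is a minimal sketch of a prediction endpoint. It is not the code the assistant produced. The `/predict` route, the `FEATURES` list, and the placeholder model trained on synthetic data are all hypothetical; a real service would load the model saved at the end of the Evaluation phase instead.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature order expected by the model.
FEATURES = ["age", "bmi", "smoker_yes"]


def train_placeholder_model():
    """Train a small stand-in model on synthetic data (illustration only)."""
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.integers(18, 65, 300),
        rng.normal(30, 6, 300),
        rng.integers(0, 2, 300),
    ])
    y = 250 * X[:, 0] + 20000 * X[:, 2] + rng.normal(0, 2000, 300)
    return RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)


app = Flask(__name__)
model = train_placeholder_model()


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object with one numeric value per feature.
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    row = [[float(payload[name]) for name in FEATURES]]
    return jsonify({"predicted_charges": float(model.predict(row)[0])})
```

Calling `app.run()` starts Flask's development server for local testing. The sketch does not validate value types or ranges, which a production service would need, together with the monitoring and maintenance points mentioned above.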

Well, there we go: the CRISP-DM methodology applied to a data science problem, done entirely by an AI assistant. Being able to do this in a matter of minutes is going to revolutionize how we work in Computer Science! It is certainly something we can no longer ignore, and the developments are worth being excited about. I, for one, can’t wait to use it for my personal projects!

Feel free to view the Transcript and GitHub repository or reach out to me on my LinkedIn.

Yash Kumar

MS student in Software Engineering at San Jose State University (2026). Software Engineer at Fidelity Investments (2021-2024).