Leveraging ChatGPT in the Daily Life of a Data Scientist

Reean Liao
Published in GumGum Tech Blog
Jun 19, 2024 · 9 min read

Introduction

ChatGPT is an advanced AI language model developed by OpenAI that can understand and generate human-like text based on the input it receives. It is designed to assist with tasks such as answering questions, generating text, and providing insights across various domains. The recent ChatGPT-4 release even includes an Advanced Data Analysis feature, which allows users to upload data directly to ChatGPT to automate code generation and insight extraction (though it is only available to premium/paid accounts).

As a data scientist, I have always had mixed feelings about ChatGPT. Would this advanced and ever-improving AI technology eventually take my job? Would my skillset become obsolete in a few years? Driven by curiosity and a touch of professional paranoia, I decided to explore what ChatGPT-4 was all about.

To my delight, I discovered that rather than being a threat, ChatGPT could actually make a very valuable ally. This blog explores FIVE practical applications (with a bonus 6th at the end) of ChatGPT in a data scientist’s daily routine, along with some limitations it still faces.

Pre-configuration

Before using ChatGPT, it’s important to configure a “persona” so that its responses can be more tailored to your use cases. Common tasks, such as plotting distributions, identifying outliers, engineering new features, and fitting machine learning models, are already part of the Advanced Data Analysis feature offered by ChatGPT-4, so it’s not necessary to include these in the instruction. However, if you are using bespoke calculated metrics, be sure to include the calculation method here. It’s also worth adding an instruction like “install any library as needed”, as ChatGPT won’t do this on its own (yet).
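As an illustration (not the exact wording of my own configuration), a persona instruction might read:

You are assisting a data scientist working on ad attention prediction. When I upload data, treat “gazeDuration” as the target unless told otherwise, report regression metrics as MAE and R2, and install any Python library you need before using it.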

1. Data Exploration and Preprocessing

One of the most common and time-consuming tasks for data scientists is data exploration and preprocessing. ChatGPT can assist with:

Data Cleansing: ChatGPT can provide strategies and implementations for handling missing values, outliers, and data normalisation. For example, it can suggest using interpolation for missing time series data or applying a log transformation to normalise skewed data (see the sketch after this list). Alternatively, it can also just write the code and implement the standard transformation or cleansing technique that you request.

Data Summarisation & Insights: By feeding raw data into ChatGPT, you can quickly perform a comprehensive Exploratory Data Analysis (EDA) within seconds. It can generate summary statistics for each column, spot anomalies, plot distributions, and perform correlation analysis, saving valuable time in the initial data exploration phase.
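To make the cleansing suggestions above concrete, here is a minimal pandas sketch of the kind of code ChatGPT might produce; the “timestamp” and “sessions” columns are made up for illustration, and only “gazeDuration” comes from our dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with a gap in a time series column and a skewed target.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="D"),
    "sessions": [10.0, 12.0, np.nan, 15.0, np.nan, 20.0],
    "gazeDuration": [0.5, 1.2, 30.0, 2.4, 150.0, 3.1],
})

# Interpolate missing time series values (linear interpolation by default).
df["sessions"] = df["sessions"].interpolate()

# Log-transform the right-skewed target; log1p handles zeros gracefully.
df["log_gazeDuration"] = np.log1p(df["gazeDuration"])

print(df)
```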

Here at GumGum, we specialise in Attention Prediction: we use interaction and behaviour signals to predict the attention (eye gaze duration) spent on a display or video ad. The example dataset used throughout this article is based on this work.

Example prompts for conducting EDA could look like this:

Attached is a dataset used to train a prediction model, the train test split is denoted in column “split”, target is denoted in column “gazeDuration”, id is denoted in column “id”, the rest of the columns are features. Please first show the summary statistics of the target column as well as the numeric features. Then plot the histogram distribution of the target and the numeric features.

Results:
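The result screenshots aren’t reproduced here, but the code ChatGPT generates for a prompt like this is typically along these lines (the file name is a placeholder and the column handling is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file name for a CSV export of the attention dataset described above.
df = pd.read_csv("attention_dataset.csv")

# Summary statistics of the target ("gazeDuration") and the numeric features.
numeric_cols = df.select_dtypes(include="number").columns.drop("id", errors="ignore")
print(df[numeric_cols].describe())

# Histogram of the target and each numeric feature.
df[numeric_cols].hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```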

2. Feature Engineering

Feature engineering is a crucial step in the data science workflow that involves creating new features or transforming existing ones to improve the performance of machine learning models. Here are some key aspects in this area that ChatGPT-4 can assist with:

Generating Feature Ideas: Based on the dataset you upload and its description, in addition to the feature engineering principles and strategies it has learned, ChatGPT can brainstorm potential features tailored to your problem. For instance, for user web browsing data, it might suggest creating interaction features (e.g. multiplication, division, polynomial terms) out of existing independent signals to capture the more complex relationships in human behaviour; for time series data, it might suggest extracting date/time components, lead/lag terms, or cumulative and rolling features over a window.

Feature Selection: High dimensionality can be a blessing or a curse; more features are not necessarily better. Within a few commands, ChatGPT can perform Principal Component Analysis (PCA, a technique that reduces dimensionality while retaining most of the variance) to reduce the number of features and avoid overfitting. Using methods like Recursive Feature Elimination (RFE) or simply showing feature importance from models, ChatGPT can help you quickly identify your most important features (a sketch of this step follows the list below).

Automating Feature Creation: Don’t let ChatGPT’s wonderful feature engineering ideas stop at just being ideas. Once you approve them, you can tell ChatGPT to implement them, show you the code, fit a model, and show the feature importance to verify the value of these new features.
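As a rough illustration of the feature selection step, here is a scikit-learn sketch on synthetic data; the real workflow would use the attention dataset’s features and the gazeDuration target.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the attention features and gazeDuration target.
X_arr, y = make_regression(n_samples=500, n_features=20, n_informative=8, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(20)])

# PCA: keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
pca.fit(X)
print(f"PCA kept {pca.n_components_} of {X.shape[1]} dimensions")

# RFE: recursively drop the weakest features until 10 remain.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
rfe.fit(X, y)
print("RFE-selected features:", list(X.columns[rfe.support_]))
```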

Example prompts for feature engineering could look like this:

Based on the data, could you give me 5 feature engineering ideas, implement these ideas and fit a model with the new features. After that, plot the new feature importance.

Results (after describing the steps, it couldn’t complete the task due to limited computational resources, so it output the code for me to run locally):
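Since the generated code has to be run locally anyway, here is a sketch of what it might look like; the engineered features and file name are hypothetical, while “split”, “id” and “gazeDuration” follow the earlier prompt.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Placeholder file; scroll_depth, time_on_page, hover_count and viewability are made-up signals.
df = pd.read_csv("attention_dataset.csv")
train = df[df["split"] == "train"].copy()

# Two example engineered features: a rate and an interaction term.
train["scroll_per_second"] = train["scroll_depth"] / (train["time_on_page"] + 1)
train["hover_x_viewability"] = train["hover_count"] * train["viewability"]

# Fit a model on the numeric features and inspect feature importance.
features = [c for c in train.select_dtypes("number").columns if c not in ("id", "gazeDuration")]
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(train[features], train["gazeDuration"])

pd.Series(model.feature_importances_, index=features).sort_values().plot(kind="barh", figsize=(8, 10))
plt.tight_layout()
plt.show()
```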

3. Model Building and Evaluation

Building and evaluating machine learning models involves numerous steps, where ChatGPT can be particularly useful:

Model Selection: ChatGPT can recommend suitable algorithms based on the problem at hand (e.g. it might suggest starting with Logistic Regression if the target is binary), data characteristics, and desired outcomes. It can explain the pros and cons of various algorithms, aiding informed decision-making. In addition, it can also fit multiple algorithms and return the best one based on an evaluation metric of your choosing (a sketch of this pattern follows the list below).

Hyper-parameter Tuning: Get advice on effective hyper-parameter tuning techniques, including suggestions on search spaces and optimisation methods like Grid Search or Bayesian Optimisation. In terms of implementation, at the time of writing, ChatGPT does not seem able to perform computationally heavy tasks such as cross-validated grid search. However, after a few unsuccessful attempts, I was surprised to see ChatGPT naturally (without my explicit command) pivot to writing out the code for me to run locally.
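For reference, the “fit multiple algorithms and pick the best by a metric” behaviour mentioned above boils down to roughly this pattern (shown on synthetic data with illustrative candidates):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the attention data.
X, y = make_regression(n_samples=500, n_features=15, noise=10, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Score every candidate with 5-fold cross-validated R2 and keep the best one.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="r2").mean() for name, model in candidates.items()}
print(scores)
print("Best model:", max(scores, key=scores.get))
```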

Example prompts for conducting Hyper-parameter Tuning could look like this:

Please fit a random forest model, and perform grid search on max_depth and n_estimators to find the best model that can produce the best r2 and MAE. Please output the result of each try then identify which one is the best

Results (again, it couldn’t complete the task due to limited computational resources, so it output the code for me to run locally):
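In case it helps, here is roughly what that locally-run grid search could look like (shown on synthetic data; the parameter values are illustrative, not the ones from my actual run):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in; the real run would use the attention features and gazeDuration.
X, y = make_regression(n_samples=500, n_features=15, noise=10, random_state=42)

param_grid = {"max_depth": [5, 10, None], "n_estimators": [100, 300]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
    refit="r2",  # pick the final model by R2; MAE is reported alongside
    cv=5,
)
search.fit(X, y)

# Result of each try (note: the MAE column is negated, as scikit-learn maximises scores).
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_r2", "mean_test_mae"]]
print(results)
print("Best params:", search.best_params_)
```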

4. Code Review and Debugging

Writing and debugging code are integral parts of a data scientist’s job. ChatGPT can help with:

Code Snippet Generation: Provide ChatGPT with a description of the task or problem, and it can generate relevant code snippets, reducing the time spent on routine coding tasks.

Debugging Assistance: Describe the error or unexpected output, and ChatGPT can offer insights into potential causes and suggest debugging strategies.

Example prompts for debugging could look like this:

I’m working on a machine learning project using Python. My code loads a dataset, preprocesses it, and trains a RandomForestClassifier. However, I’m getting an error during the training step. Here is my code:{}. The error message I’m getting is: ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Can you help me identify the problem and fix it?

Results (not only did it identify and fix the problem, it also provided an explanation to help me understand the error):
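The exact fix from the screenshot isn’t reproduced here, but the standard remedy for that ValueError is to clean the NaN/infinite values before fitting, along these lines:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy feature matrix containing the kinds of values that trigger the error.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, np.inf, 6.0]})

# Replace infinities with NaN, then impute NaNs (here with the column median)
# before passing the data to RandomForestClassifier.fit.
X = X.replace([np.inf, -np.inf], np.nan)
X_clean = SimpleImputer(strategy="median").fit_transform(X)
print(X_clean)
```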

5. Documentation and Reporting

Effective communication of findings and methodologies helps business stakeholders understand and gain trust and confidence in data scientists’ work. For data scientists who are less proficient at or less inclined towards writing, ChatGPT can be a great help with:

Process Documentation: Generate clear and concise documentation for your data processing steps, models, and analysis procedures, ensuring that your work is reproducible and understandable to others. You can also ask for an “ELI5” (explain like I’m 5) version to make complicated or technical concepts easier for stakeholders to understand, or a summary to provide a high-level overview for executives.

Model Performance Report: Summarise model performance, key insights, and actionable recommendations in well-structured reports. ChatGPT can help format and phrase these reports to suit different audiences, from technical teams to business stakeholders.

Example prompts for report generation could look like this (feature importance needs to be generated at runtime so the prompt explicitly requested a refit):

Attached is a dataset used in training a machine learning model. The train-test split is denoted by the “evaluation_mode” column. The actual target column is “gazeDuration”. The predicted outcome by the model is denoted by the “predicted_gaze_duration” column. Please provide a summary report on the fitted model. It must include a table containing MAE, R2 and bias metrics broken by training set and test set. Then refit the model with a random forest algorithm and plot a bar chart of the top 10 most important features. Please omit code and make the report short and concise.

Results:
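If you’d rather double-check the numbers in such a report yourself, the metrics table can be reproduced with a few lines of pandas/scikit-learn; the file name is a placeholder, the column names follow the prompt above, and “bias” here is one common definition (mean of prediction minus actual).

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Placeholder file name; the column names follow the prompt above.
df = pd.read_csv("attention_predictions.csv")

rows = []
for split, g in df.groupby("evaluation_mode"):
    rows.append({
        "split": split,
        "MAE": mean_absolute_error(g["gazeDuration"], g["predicted_gaze_duration"]),
        "R2": r2_score(g["gazeDuration"], g["predicted_gaze_duration"]),
        # Bias taken here as mean(prediction - actual); adjust to your own definition.
        "bias": (g["predicted_gaze_duration"] - g["gazeDuration"]).mean(),
    })
print(pd.DataFrame(rows))
```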

Limitations

Although ChatGPT has impressive capabilities, it is not without its flaws; here are a few limitations I have come across when experimenting with it:

Connection Issues: When trying to attach a dataset (be it a local upload or one from a linked cloud drive), it would intermittently fail to attach. The root cause is unknown and the only solution is to keep trying (restarting the chat sometimes helps).

Resource Constraints: As mentioned above, internal resource constraints will not allow it to perform computationally heavy tasks. When this happens, we need to copy the code it generously provides and run it locally.

Incorrect Insights: When giving evaluations or insights, ChatGPT would sometimes give a “subjective” assessment that is not aligned with the actual objective values. For example, it could incorrectly describe a refined model with a lower R2 (compared with the original) as “an improved model”.

Pseudo Results: When asked to deliver results ChatGPT doesn’t know how to obtain, it would at times generate pseudo results, which look correct at first glance but are actually fabricated. Data scientists need to be cautious with these silent failures and verify the results as needed.

Conclusion

Incorporating ChatGPT into the daily workflow of a data scientist can significantly enhance productivity, streamline complex processes, and foster innovation. Whether it’s for data preprocessing, feature engineering, model building, code debugging, or documentation, ChatGPT offers versatile support, making it an invaluable tool in the data scientist’s toolbox. With its ever-improving capabilities, we data scientists must embrace this technology to stay ahead in the competitive landscape of analytics and drive more impactful data-driven decisions.

While impressive, ChatGPT is still far from perfect. When leveraging it, data scientists must be cautious with its responses and remain the ultimate gatekeepers of quality.

Bonus Application

Unconventionally, a data scientist can also use ChatGPT to prepare for interviews! Ask it to run a mock interview by testing how well you know statistical theories and concepts, as well as giving you coding tests. Here at GumGum, we’re always looking for new talent! Why not have a look at the jobs here and use ChatGPT to prepare for your next interview with us?

Follow us: Facebook | Twitter | LinkedIn | Instagram
