My Coffee Breaks with ChatGPT

Insights and Tips for Using ChatGPT in Real World Data Science Work — Part 1: Visualizations

Jeff Braun
11 min read · Jan 28, 2023
Image created with the assistance of DALL-E 2

If your day-to-day work experience is anything like mine, you spend a lot of time consulting cheat sheets, Stack Overflow, Medium, your prior code and other resources to remember the best way to accomplish a specific task. Sure, you have done it many times before. But after spending a long stretch on, say, data acquisition, data cleansing and exploratory data analysis (EDA), you draw a blank when you move on to other tasks. So, off you go to consult your tried-and-true resources to jog your memory and get started.

After reading the use case examples that Sadrach Pierre, Ph.D. included in his excellent Mastering ChatGPT in Python, I wanted to see if ChatGPT could help me be more productive in my ongoing project work. I was pleasantly surprised to discover just how helpful ChatGPT could be across a wide variety of data science endeavors. And it is so courteous!

“Sure! Here is how you can do that …”, or “I am so sorry about that error. The error I made in my prior answer was …”

Using ChatGPT really felt like consulting with colleagues over coffee breaks. (Okay, maybe colleagues are not always that courteous.)

Key Takeaways

  • ChatGPT provides a coding pattern that you can follow to accomplish the challenge you give it, not just code snippets as you might find using web searches;
  • Resubmitting a request may get you an alternative pattern for accomplishing your task. This helps you learn new approaches and coding techniques;
  • Depending on how much information you give it, ChatGPT will do a lot of the work for you. Its response will incorporate your dataset names, your variable names, etc., eliminating the time you would spend cutting and pasting code examples and adjusting them to fit your situation;
  • It does a reasonable job of documenting code;
  • Because it maintains the context of your conversation, even across sessions, it is easy to pick up where you left off, and, more importantly, it tailors new responses to include what it has learned from you in earlier parts of the dialog;
  • Responses often explain how the code works. This can help with understanding and learning; and,
  • ChatGPT is not foolproof. It will at times propose wrong answers. It takes feedback well, though. Tell it where it made a mistake and give it the error message, and it will propose a corrected solution. Even if it does not get the code exactly right, the code pattern it does provide is usually correct and can get you most of the way to your final solution.

The Experiments

I focused my experiments for this article on visualizations because I often feel like I am starting from scratch every time I need to create one. I find myself asking things like: “What’s that option for rotating the text?” Or “How do I animate the plot to show changes over time?”

For my experiments, I used citizen service request data (“311” data) from the City of Chicago Open Data portal (https://data.cityofchicago.org/). This is data I have used extensively in the past and I know many of its quirks. The Terms of Use for Chicago Data are here. Chicago 311 Service Request data and the Chicago Crime data are licensed for non-commercial use under CC BY-NC-SA 3.0. Attribution: 311 Service Request data — City of Chicago.

I used the ChatGPT portal for my experiments, not OpenAI’s GPT APIs. Screenshots of my requests and ChatGPT’s responses are shown below. ChatGPT usually displays the entire program in each response; I show only the new or changed portions of the code for each request. I also show screenshots of the output from running the ChatGPT-provided Python code in a notebook.

A link to the repo with the full Python code for this series of experiments appears at the end of the article. That repo also includes a summary of the tips mentioned throughout the article.

Let’s dig in, starting with a few data acquisition, data engineering, and EDA steps.

My requests look like this:

Screen clip by the author

And ChatGPT’s responses appear as:

Screen clip by the author

Some background: sodapy is the Python client library for the Socrata Open Data API, the platform Chicago and numerous other government entities use to make data publicly available. From past experience, I know the default timeout value is not adequate for this dataset, so I explicitly told ChatGPT to use a higher value (100 seconds). The API also returns only 1,000 records by default. I wanted a lot more for my tests, so I asked ChatGPT to increase the limit to 1,000,000. As you can see above, ChatGPT dutifully complied with both requests in the code it gave me.
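For readers who want the pattern in text form, here is a minimal sketch of the kind of code ChatGPT produced (the dataset identifier v6vf-nfxy is my assumption for the Chicago 311 dataset; check the portal for the current ID):

```python
import pandas as pd
from sodapy import Socrata

# Unauthenticated, read-only client for the Chicago portal; the
# 100-second timeout overrides sodapy's much shorter default
client = Socrata("data.cityofchicago.org", None, timeout=100)

# Fetch up to 1,000,000 records instead of the API's default of 1,000
results = client.get("v6vf-nfxy", limit=1000000)

df = pd.DataFrame.from_records(results)
print(df.head())
```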

Executing the supplied code “as is” worked, resulting in this output:

Screen clip by the author

(Additional columns omitted)

We are off to a good start. Let’s give ChatGPT a few more data challenges. I know that a lot of the records in this dataset are uninteresting “311 INFORMATION ONLY CALL”-type requests, so I’ll ask ChatGPT to exclude those.

Screen clip by the author
Screen clip by the author

The sodapy API provides a SQL-like query syntax (SoQL) for filtering data on the read, which would make the list comprehension unnecessary here. I’ll try coaching ChatGPT to use that approach:

Screen clip by the author
Screen clip by the author

That’s more like it! ChatGPT even notices and mentions that including the limit in the read saves the resources that would otherwise be needed to read the entire dataset (several million records) and select just 1,000,000 records from it.
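A sketch of that server-side filter, again assuming the same dataset identifier:

```python
# Filter with a SoQL where clause so the "311 INFORMATION ONLY CALL"
# records are excluded on the server and never transferred
results = client.get(
    "v6vf-nfxy",
    where="sr_type != '311 INFORMATION ONLY CALL'",
    limit=1000000,
)
df = pd.DataFrame.from_records(results)
```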

This code worked correctly, dropping the 311 information only requests:

Screen clip by the author

(Additional columns omitted)

Let’s now ask ChatGPT to do a bit of data engineering. I want to find out how long the service requests have been open.

Screen clip by the author
Screen clip by the author

This worked well:

Screen clip by the author

Three of the above records have suspiciously low elapsed_time_open values. As it turns out, some city departments may not record accurate closed dates. We’ll see this phenomenon again in later EDA steps.

It is interesting that ChatGPT used the dt.total_seconds() method to calculate the total seconds between created_date and closed_date, then divided by 86,400 (the number of seconds in a day) to arrive at the elapsed time open in days and fractions of days. In prior conversations in which I asked it to find the elapsed time between these same two columns, it simply subtracted created_date from closed_date to create the elapsed_time_open column. That also works, but the elapsed_time_open column then has a timedelta64 data type. That data type caused problems in the EDA steps of that earlier conversation, forcing ChatGPT to fall back to the seconds conversion and division by 86,400 to arrive at days and fractions of days.
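For reference, that pattern looks roughly like this (a sketch, assuming the date columns arrive from the API as strings):

```python
# Parse the API's string timestamps into proper datetimes
df["created_date"] = pd.to_datetime(df["created_date"])
df["closed_date"] = pd.to_datetime(df["closed_date"])

# Elapsed time open as a plain float of days and fractions of days,
# avoiding the timedelta64 dtype that complicated later EDA steps
df["elapsed_time_open"] = (
    df["closed_date"] - df["created_date"]
).dt.total_seconds() / 86400
```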

ChatGPT used the seconds conversion and division approach on the first try here, even though my request was done in a completely new, different conversation. I am not sure whether that was luck or whether ChatGPT learned from my separate earlier conversation thread. Either way, I got the desired result.

I now want to try some simple EDA, identifying records with elapsed_time_open values that may be outliers.

Screen clip by the author
Screen clip by the author

Note that I asked ChatGPT to exclude records with sr_type = “Aircraft Noise Complaint” in this request, since I know from experience that there are thousands of those records (the Chicago residents living around the major airports are quite vocal!) and all of them have closed dates equal to their created dates, causing incorrect time open values.

ChatGPT correctly added import statements for matplotlib and Seaborn at the beginning of the code (not shown above) but neglected to leave the import statements for pandas and sodapy in the program. I manually re-added those statements to the code and tried executing it:

Screen clip by the author

Oops.

I asked ChatGPT to solve this problem, giving ChatGPT the statement causing the error and the error message itself:

Screen clip by the author
Screen clip by the author

Executing the program with these changes produced the desired result:

Screen clip by the author
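The working code followed a pattern roughly like this (a sketch; the boxplot and figure options are my assumptions, as ChatGPT’s exact choices may have differed):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Exclude the request type known to carry bogus closed dates
eda_df = df[df["sr_type"] != "Aircraft Noise Complaint"]

fig, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(x=eda_df["elapsed_time_open"], ax=ax)
ax.set_xlabel("Elapsed time open (days)")
ax.set_title("Elapsed time open for Chicago 311 service requests")
plt.show()
```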

If you suspect that portions of the data are inaccurate given the heavy concentration of very low elapsed time open values, you would be correct. As it turns out, several types of 311 service requests in Chicago’s data have unrealistic closed dates and, consequently, unrealistic elapsed time open values. I’ll ask ChatGPT to help me identify those request types in a moment, but first let’s ask it to repeat the outlier analysis excluding records with an elapsed time open of less than one day. That should filter out the records where the closed date and time equal the created date and time:

Screen clip by the author
Screen clip by the author
Screen clip by the author
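The only change needed is an extra filter before plotting, something like the following sketch, after which the same plot runs on filtered_df:

```python
# Keep only requests open at least one full day, dropping records
# whose closed timestamp equals their created timestamp
filtered_df = eda_df[eda_df["elapsed_time_open"] >= 1]
```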

Well, that helps some, but let’s really focus on the remaining short elapsed time open values to see if certain types of requests are causing the heavy concentration at the low end of the distribution:

Screen clip by the author
Screen clip by the author

Note that I asked ChatGPT to only show me the new or changed code lines here. ChatGPT will often reprint the entire program in its responses, including the new and changed lines. That can take a long time and result in time-out issues in the current ChatGPT research preview. Asking it to just show the new and changed code can really speed up the process. Here was the result:

Screen clip by the author

This is helpful. Seeing this would lead me to look closely at the “Graffiti Removal Request” and “Weed Removal Request” records to see if the underlying data are valid.

I really appreciate that ChatGPT showed the correct parameters and values for things like axis labels, plot title and label text rotation in the code above. Those are the type of details I often forget and that can cause no small amount of frustration as I work to tidy up a visualization.
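Those finishing touches look roughly like this (a sketch; the ten-bar cutoff and 45-degree rotation are my assumptions):

```python
# Count the request types among records open for less than one day
short = eda_df[eda_df["elapsed_time_open"] < 1]
counts = short["sr_type"].value_counts().head(10)

fig, ax = plt.subplots(figsize=(10, 6))
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Service request type")
ax.set_ylabel("Requests open less than one day")
ax.set_title("Request types with suspiciously short open times")
# The rotation detail I always forget
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
plt.tight_layout()
plt.show()
```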

In a few final experiments (not shown), I tried pushing ChatGPT to work on more complex visualizations. For example, I asked it to construct a Python program to read the same 311 service request data, plot the requests by zip code, and animate the plot by request creation date. ChatGPT swiftly got me to the static plot by zip code but broke down on the code for the animation by date. I went back and forth with ChatGPT several times, asking it to correct errors related to the animation. It never did get the syntax right, and it became obvious I could get the answer more easily with a simple query to Stack Overflow.
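For the curious, that kind of animation is achievable with Matplotlib’s FuncAnimation. Here is my own minimal sketch (not ChatGPT’s code), assuming a zip_code column and aggregating requests per day:

```python
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Daily request counts per zip code (zip_code is an assumed column name)
daily = (
    df.assign(created_day=df["created_date"].dt.date)
      .groupby(["created_day", "zip_code"])
      .size()
      .unstack(fill_value=0)
)

fig, ax = plt.subplots(figsize=(12, 6))

def draw_frame(i):
    # Redraw the bar chart for day i, showing the 15 busiest zip codes
    ax.clear()
    daily.iloc[i].nlargest(15).plot(kind="bar", ax=ax)
    ax.set_title(f"311 requests by zip code on {daily.index[i]}")
    ax.set_ylabel("Requests created")

anim = FuncAnimation(fig, draw_frame, frames=len(daily), interval=500)
anim.save("requests_by_zip.gif", writer="pillow")
```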

Visualizations done in other, less popular or more complex libraries (for example, d3.js) also proved challenging for ChatGPT. The frameworks it presented were largely accurate, but you would have to make some significant adjustments to the code to get it to work.

You can find the code from the experiments shown above here. The README file of that repo has a list of the tips and suggestions coming out of this article.

Summary

ChatGPT is a helpful assistant as you work through creating Python-based data visualizations. Asking it to help with data acquisition, data engineering, and EDA coding tasks will save time. It does a good job of suggesting a code pattern or framework for accomplishing a task, though it can get some details wrong. And it helps you remember the code syntax for tasks you perform infrequently.

It will correct errors in its code if you tell it the error and ask for a correction. But it may take several iterations to get to correct code. This can get frustrating quickly, especially if you think a simple web search will get you the correct answer. Every practitioner will have to decide when switching to a different tool is more efficient.

ChatGPT is a valuable teacher. I found that it suggests efficient, performant code patterns that are easy to understand. Resubmitting a request can get you an alternate approach, one that you may not normally consider. It is refreshing to see and can help you expand your skills.

All in all, I think ChatGPT is a time-saver and I will incorporate it into my workflow. It will only get better as OpenAI refines the models behind ChatGPT.

Additional resources

An article by Josep Ferrer titled 5 ChatGPT features to boost your daily work highlights additional, creative ways to use ChatGPT to save time and improve your code, including refactoring your code to a given style (think PEP 8), documenting your code, and understanding what someone else’s spaghetti code is doing.

Author The PyCoach recently published several excellent articles with examples of using ChatGPT as well as OpenAI’s GPT APIs for coding and data science tasks. Check out his posts on Medium and his videos on YouTube.

OpenAI has an API-based coding-focused toolset trained on natural language as well as on a massive amount of code in many software languages. It is called Codex and is currently in alpha and available for free trial. More information is here.

Coming up in the next Coffee Break segment — How can ChatGPT help me create time series models for predicting crime based on city data?


Jeff Braun

Data scientist, Python dev, attorney. Passions: projects to promote social good, performant code, ethics and privacy. www.linkedin.com/in/j-braun-43263018