Using ChatGPT to Explore NYC Open Data

Nathan Storey
10 min read · May 22, 2024

--

Out of curiosity, I started to experiment with uploading New York City Open Datasets into ChatGPT. I ran through a rudimentary data science pipeline, consisting of research question formulation, data gathering, data cleaning, data analysis, and data visualization. I was impressed with what I saw. Uploading the COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths dataset from the NYC Open Data Portal into ChatGPT, I got descriptive statistics and maps about COVID trends in less than five minutes. You can too, for any dataset on the NYC Open Data portal! Let’s talk about how this works, why I think it’s amazing, and what the pitfalls are.

https://chat.openai.com/share/89cdb299-d74c-40e0-9266-da9413e564a2

It was exciting to see a new natural language interface capable of generating working code that turned datasets into maps and graphs with a simple prompt. I immediately saw the potential for these tools to lower the barrier to generating useful data analysis. However, the pitfalls of using any generative AI tool, such as bias in training data and hallucinations in the responses, are also present when doing data analysis.

In late 2023, OpenAI, the company behind ChatGPT, launched a new feature that allowed users to create and share their own custom GPTs that focused the general GPT model on a specific task, like generating recipe ideas for dinner or creating a tutor for a specific topic.

I created a tutor GPT called “NYC 311 Open Data Tutor” that focused on teaching students how to use ChatGPT to explore NYC Open Data and learn Data Science concepts. Additionally, I experimented with creating a general “NYC Open Dataset Helper” GPT that could programmatically interact with the NYC Open Data portal based on user input, using the Actions functionality.

By specifying a schema (an expected format for an output) that aligned with the SODA API (the Application Programming Interface that allows users to access NYC Open Datasets using computer code), I created a GPT that takes a URL for any dataset on the Open Data portal and returns information about it, such as:

  • A summary of metadata
  • Descriptive statistics (e.g., column names and number of rows)
  • A download URL for a filtered view of a dataset
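Under the hood, those three outputs map onto two real endpoints on the portal: a metadata endpoint and the SODA resource endpoint. As a rough illustration of the URLs involved (using erm2-nwe9, the 311 Service Requests dataset ID, as an example), the GPT essentially constructs requests like these:

```python
from urllib.parse import urlencode

PORTAL = "https://data.cityofnewyork.us"

def metadata_url(dataset_id):
    """Metadata endpoint: dataset summary, column names, row counts."""
    return f"{PORTAL}/api/views/{dataset_id}.json"

def resource_url(dataset_id, params=None):
    """SODA resource endpoint, optionally with SoQL parameters like $limit or $where."""
    base = f"{PORTAL}/resource/{dataset_id}.json"
    return f"{base}?{urlencode(params)}" if params else base

# erm2-nwe9 is the 311 Service Requests dataset id on the portal
meta = metadata_url("erm2-nwe9")
preview = resource_url("erm2-nwe9", {"$limit": 5})
```

This is a simplified sketch, not the GPT's actual Action schema; the custom GPT does the equivalent URL construction from the dataset URL a user pastes in.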

Using the NYC Open Dataset Helper GPT, the NYC 311 Open Data Tutor GPT, and uploading filtered views of Open Datasets into a ChatGPT interface, users could:

  • Ask follow-up questions
  • Get suggestions for possible research questions
  • Receive recommendations for data cleaning steps for a specific dataset
  • Obtain working code with good documentation for analysis that can be tested in an independent coding environment
  • Get suggestions and working code for producing meaningful data visualizations

While it didn’t work perfectly all the time, I felt that it worked well enough to demonstrate the concept of using a natural language chat interface to explore and interact with datasets.

Give it a try here (custom GPTs and access to the new GPT4o model are now available to all GPT accounts, even free accounts):

Try the NYC 311 Open Data Tutor GPT

Try the NYC Open Dataset Helper GPT

My hope is that chat based interfaces to access data will make it easier for people to go from a question that could be addressed with data to having relevant data at their fingertips. When more people can easily access and analyze data, it can lead to better-informed decisions across various domains, including public policy. However, as chat interfaces for data discovery and analysis become more prevalent, it will be crucial to develop guidelines and best practices to address data privacy, security, and ethical issues.

I dove into these questions at a presentation I gave for the 2024 School of Data conference, demonstrating how you can use ChatGPT to run through the steps of a typical Data Science pipeline:

Research Question Formulation with ChatGPT

Data Gathering with ChatGPT

Data Visualization with ChatGPT

But it wasn’t all smooth sailing. When I first started experimenting with using ChatGPT for code analysis, I was able to generate impressive maps with simple prompts. Then, a few weeks before my School of Data presentation, ChatGPT started refusing to generate the maps, instead producing “pseudocode” in markdown cells that I could copy and paste into a coding environment and execute there.

This extra step is not difficult if you know what you are doing, and ChatGPT can guide you through the process of using Replit or Google Colab if you are unfamiliar, but I missed the frictionless ability to conjure a complex map with a sentence. I found out that the Code Interpreter module had been changed, removing some Python libraries available to ChatGPT. This change might have been made due to security concerns or compatibility issues. I expect that in general, the future “tooling” available to LLM interfaces will continue to expand, but sometimes things are removed or get worse, or the model that works best for a specific kind of task gets supplemented by a new model. It’s a lot to keep up with, but it’s important to be aware of the fluid and fast state of development of these models and tools.
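When ChatGPT hands back pseudocode instead of a rendered chart, the gap to a runnable script is small. Here is a minimal sketch of the kind of plotting code you can paste into Colab and run yourself; the column names mirror the COVID daily counts dataset, but the numbers are made-up sample values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside a notebook
import matplotlib.pyplot as plt

# Synthetic stand-in for the COVID daily counts dataset (illustrative only)
df = pd.DataFrame({
    "date_of_interest": pd.date_range("2020-03-01", periods=7),
    "case_count": [12, 45, 80, 150, 210, 305, 420],
})

# Plot the daily case trend and save it to a file
fig, ax = plt.subplots()
ax.plot(df["date_of_interest"], df["case_count"])
ax.set_title("NYC COVID-19 Daily Case Counts (synthetic sample)")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
fig.savefig("daily_cases.png")
```

In Colab you would replace the synthetic DataFrame with `pd.read_csv(...)` pointed at the file ChatGPT helped you filter and download.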

The current state of working with LLMs requires contextual knowledge of how the models work and prompting techniques that need to be learned to get the best results. For instance, when asking the NYC Open Dataset Helper to provide a download URL for a dataset with a filter applied (311 Complaints in August 2020, for example), the GPT will often correctly supply the URL, but sometimes it will reply that it is unable to do this. Asking the GPT to construct the URL in a markdown cell will solve this, but that’s a bit of arcane knowledge to come by. Sometimes saying something like “I know you can do this, try again” or “Think through step by step how to do this”, or applying any number of other prompt engineering “hacks”, will also work.
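For reference, the URL the GPT is being asked to produce in that example is just a SoQL-filtered resource link. A rough reconstruction (again using erm2-nwe9, the 311 Service Requests dataset ID; the exact `$where` clause and `$limit` here are my own illustration, not the GPT's verbatim output):

```python
from urllib.parse import urlencode

# Filtered CSV download for 311 Service Requests created in August 2020
base = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"
params = {
    "$where": "created_date between '2020-08-01T00:00:00' and '2020-08-31T23:59:59'",
    "$limit": 50000,
}
download_url = f"{base}?{urlencode(params)}"
print(download_url)
```

Asking for the URL in a markdown cell likely works because it sidesteps the model's attempt (and refusal) to actually fetch the link itself.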

To get the most out of the current models, these kinds of maneuvers are necessary, but as the models improve and get better at predicting user intent, this kind of prompt engineering may become less and less necessary. In the meantime, here are a few general principles that can help improve your prompts:

  • Be specific and precise with what you want.
  • Ask for one thing at a time when possible.
  • Specify the format (including style) of the output you are looking for.
  • Give examples of the output you are looking for.
  • For more complex queries, ask the LLM to first come up with a plan and then execute it sequentially (the step-by-step method).

To take a look under the hood at the custom instructions I used to create these GPTs, check them out on GitHub:

Check out my School of Data presentation here: Full School of Data Slides

LLMs, like any tool, have limitations and biases that can impact the accuracy and reliability of the generated output. I am highlighting these recommendations from my presentation here to emphasize the importance of being critical when using LLMs for data analysis (or anything else). Many of these recommendations about how to critically approach using LLMs were informed by the writing of Ethan Mollick, whose recent book “Co-Intelligence” is a succinct and informative guide to “living and working with AI”.

Here are some key recommendations for using LLMs critically:

  • Learn and recognize the limitations of LLMs
  • Verify generated information
  • Be aware of built-in biases
  • Understand and evaluate LLM source and training model data
  • Consider the relevance of information being provided by an LLM
  • Use an LLM as a tool, not a crutch
  • Assess ethical implications, particularly for sensitive content
  • Spend lots of hands-on time using them
  • Reject a lot of output from LLMs
  • Be humble enough to accept the output when it is good

Of all these recommendations, I think that “Spend lots of hands-on time using them” is the most important. Hands-on experience allows users to develop a deeper understanding of how LLMs work, their strengths, and their limitations. If you’ve used my GPTs or other LLM tools to explore NYC Open Data or teach yourself data science concepts, let me know — I’d love to hear how others are experimenting and learning with these tools. Feel free to reach out to me or connect with me on Twitter/X or LinkedIn.

Update:

As I was preparing to publish this blog post, Google and OpenAI each made major releases of new LLM models. I have focused on exploring the enhanced coding and data analysis capabilities of GPT4o (GPT4 Omni), an updated version of the GPT4 model with a larger context window. The context window refers to the amount of prompted information the model can “remember,” and it is particularly useful for doing initial analyses of open datasets. The model has better coding abilities and new tooling that includes excellent visualization of data tables. Oh, and the new model is a lot faster, further reducing friction for chat-based data analysis. And the new model is free to all users.

I ran variations on a prompt suggested by Ethan Mollick on a few uploaded NYC Open Datasets, some of them filtered to get under the current file size limit. The prompt asks the model to provide a comprehensive analysis of the dataset, including key insights, trends, and potential areas for further exploration. The new GPT4o model produced useful and more comprehensive results much quicker than the previous GPT4 model.
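To make the flavor of that analysis concrete, here is a small sketch of the kind of pandas summary the comprehensive-analysis prompt elicits behind the scenes. The rows are made-up sample data, not real 311 records; the column names mirror the real dataset:

```python
import pandas as pd

# Tiny synthetic sample standing in for an uploaded 311 Open Dataset extract
df = pd.DataFrame({
    "complaint_type": ["Noise", "Noise", "Heat/Hot Water", "Illegal Parking", "Noise"],
    "borough": ["BROOKLYN", "QUEENS", "BRONX", "BROOKLYN", "MANHATTAN"],
})

# Key insights: most common complaint types and counts per borough —
# the kind of summary the "comprehensive analysis" prompt surfaces first
top_complaints = df["complaint_type"].value_counts()
by_borough = df.groupby("borough").size().sort_values(ascending=False)
```

The model then layers trends and suggested follow-up questions on top of summaries like these.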

The larger context window allowed me to upload all 311 data for one community board for one year and then get a series of analyses based on that data. However, this is only a fraction of the data needed to do a complete analysis of the 311 data. Even with the filtered data, I had trouble getting useful results from larger datasets as the Chat instance would generate lots of text but then crash. This might be due to resource limitations or model constraints when dealing with larger datasets. I expect that performance issues, along with context window size, file size limits, and speed, will continue to improve in future iterations of the model.

To further test the capabilities of GPT4o, I re-created the analysis of the COVID data and generated an analysis of the 2018 Central Park Squirrel Census. In both cases, I obtained more comprehensive results in a much shorter timeframe compared to using the previous GPT4 model. These examples demonstrate the potential of GPT4o for efficient and insightful data analysis, while also highlighting some of its current limitations when dealing with larger datasets.

NYC COVID Data Analysis using GPT4o

NYC Squirrel Census Analysis using GPT4o

In one session, ChatGPT integrated dummy weather data to illustrate an analysis idea correlating 311 service requests to weather patterns. In another session, it suggested that I download the Motor Vehicle Collision dataset to augment the analysis.

I tried to get GPT4o to make a map with boundaries of the boroughs, and it wrote code to access the NYC Open Data portal on its own (without custom instructions). However, it then realized that it couldn’t access the internet. I uploaded the borough boundary geojson file from the Open Data Portal and asked it to use that instead.

Just like when using the GPT4 model, I ran into the same multi-part geometry issue. But instead of reporting the issue to me and giving up right away, GPT4o rapidly iterated through several possible solutions. It came up with a workaround solution to make a downloadable html map much more quickly and with much less prompting from me.
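For readers who hit the same wall: the multi-part geometry issue arises because borough boundaries are typically stored as MultiPolygons (one borough, many islands), which some plotting code chokes on. A minimal illustration of the shape of the problem and one common fix, splitting the geometry into single parts (this is my illustration using shapely, not necessarily the exact workaround GPT4o produced):

```python
from shapely.geometry import MultiPolygon, Polygon

# Two disjoint squares standing in for a borough made of multiple islands
mp = MultiPolygon([
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
])

# "Explode" the multi-part geometry into single Polygons before plotting;
# geopandas offers GeoDataFrame.explode() for the same step on a whole layer
parts = list(mp.geoms)
```

With the geometry exploded into single parts, per-feature plotting and styling proceed normally.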

When I was doing this testing, the new GPT4o model had not yet become deployable to custom GPTs, so I wasn’t able to test the new model in combination with custom instructions. This will likely be implemented soon. However, as the performance of GPT4o indicates, the necessity to create complex custom instructions may decrease as the base models improve.

These are just some initial results. If anyone has had any success producing interesting and useful analyses of NYC Open Data using GPT4o, I’d love to hear about it! Please feel free to share your experiences in the comments below or reach out to me directly.


Nathan Storey

Urban planner, civic technologist, and open data enthusiast. Director of Data Governance at NYC Office of Technology and Innovation. Views here my own.