Summarizing data with ChatGPT

Dean Allemang
Apr 7, 2023

Those of you who are familiar with data.world know that the company has a mission to build the most meaningful, collaborative, and abundant data resource in the world. One way data.world approaches this mission is by providing a community data portal, which is what you see when you create a free account and log in to data.world. There are currently over 150K open data resources available on the community site, from sources including the US Census Bureau, NASA, and data.gov.uk.

I have outlined in another blog how you can be one of the curators of this data resource. If you’ve given that a try, you know that uploading a spreadsheet to a hosting and querying service like data.world is a start, but it takes more than that to make your data FAIR and genuinely usable by someone else. Among other things, you need to describe your data, maybe make a data dictionary for it, and perhaps even advertise it to potential users.

By now it shouldn’t come as any surprise that ChatGPT can help in all of these tasks. As an example of this, I have used ChatGPT to provide a summary of the data that Syngenta provides to track progress on its Good Growth Plan.

Syngenta is an international agrichemical company. Their business is to provide chemicals and know-how to help farmers all over the world produce food. Starting back around 2014, Syngenta launched its “Good Growth Plan” — a set of commitments aimed not just at making money by selling chemicals, but at improving the efficiency and sustainability of food production all over the world.


I know what you’re thinking: some big international conglomerate wants to greenwash its activities by saying magic words like ESG and paying lip service to specific goals. But Syngenta expects you to think that, too; after all, they are a big international chemical conglomerate. How can a company like that possibly gain your trust that they are actually making progress on any of the goals they claim to be pursuing?

One way to do this is to publish the metrics they are using to measure progress, collect data to evaluate these metrics, then (and this is the innovative bit) publish that data as open data so that citizen data scientists of the world can check their work. Toward that end, Syngenta has been publishing this data since 2014.

Since Syngenta is a chemical company, not a data management service provider, they stopped hosting the data a few years back. Fortunately, Agroknow, a company whose business is agricultural data, has been hosting it for some time now. In accordance with data.world’s mission, and following the CC-BY-SA license on the data, I have echoed this data on data.world, in an organization called Syngenta Good Growth Plan, in a dataset called Good Growth Plan Open Data.

If you visit that link, you’ll see a description and summary of the dataset, taken from the Agroknow web page describing the productivity data set. This is useful information to know if you want to use this data.

Good Growth Plan information from its web page, copied into data.world

But it takes a bit more to really orient someone who isn’t already familiar with the data to its contents. This is a problem with a lot of open data; even if we make it available, making it understandable and usable takes a good deal of human labor to summarize, document and interpret the data.

If you scroll down the page a bit, you’ll see three files in the dataset, called Business.md, DataDictionary.md, and DataSummary.md. If you have a look at these, you probably won’t be surprised to learn that they were generated by ChatGPT. Since they are generated, I may well have regenerated them by the time you read this, so I’m including screenshots of their current state.

Before we look at each of these, let’s talk about what information ChatGPT has about the data that lets it generate them. First off, we have the name of the dataset, “Good Growth Plan Open Data”, and we have the human-provided (from the webpage) summary of the dataset. But we also have the filenames of the source data, as well as the column headers. In principle, we also have the data itself, but in this exercise I used only the headers, making this a metadata-only exercise. That’s pretty important, and I’ll come back to it later.
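
As a concrete illustration, here is a minimal sketch of that metadata-gathering step against data.world’s REST API. The dataset slug and the shape of the collected dictionary are illustrative assumptions, not the exact code behind the files described below:

```python
import os
import requests  # pip install requests

API = "https://api.data.world/v0"
HEADERS = {"Authorization": f"Bearer {os.environ['DW_AUTH_TOKEN']}"}

# Hypothetical slug for illustration; substitute your own owner/dataset-id.
OWNER, DATASET = "syngenta-ggp", "good-growth-plan-open-data"

# Dataset-level metadata: title, description, and the list of files.
meta = requests.get(f"{API}/datasets/{OWNER}/{DATASET}", headers=HEADERS).json()

summary = {
    "title": meta.get("title"),
    "description": meta.get("description"),
    "files": {},
}

for f in meta.get("files", []):
    name = f["name"]
    if not name.lower().endswith(".csv"):
        continue  # CSVs only, as described in the post
    # Stream the file and keep only the header row; no data values are read.
    with requests.get(f"{API}/file_download/{OWNER}/{DATASET}/{name}",
                      headers=HEADERS, stream=True) as resp:
        header = next(resp.iter_lines()).decode("utf-8")
    summary["files"][name] = [col.strip() for col in header.split(",")]

print(summary)
```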

First, let’s look at DataSummary.md. As its name suggests, it is a summary of the dataset, synopsizing what it covers. It was difficult to get ChatGPT to do more than just echo back the headers of each table, but with a bit of coaxing it was willing to push further and describe the big picture.

The first part of DataSummary.md
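
For the curious, the generation step is essentially one chat-completion call over the collected metadata. Here is a sketch using the openai package as it looked at the time of writing (the pre-1.0 interface); the prompt wording is a reconstruction, not my exact prompt:

```python
import json
import os

import openai  # pip install openai (pre-1.0 interface, current as of this writing)

openai.api_key = os.environ["OPENAI_API_KEY"]

def summarize(metadata: dict) -> str:
    """Ask for a big-picture summary rather than a recitation of headers."""
    prompt = (
        "Below is metadata for a dataset: its title, description, file names, "
        "and column headers (no data values).\n\n"
        f"{json.dumps(metadata, indent=2)}\n\n"
        "Write a summary of what this dataset covers as a whole. Do not simply "
        "repeat the column headers; describe the big picture and how the files "
        "relate to one another."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The data dictionary and the business questions come from the same pattern; only the prompt changes.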

Next we get to the data dictionary. This is pretty long, since it outlines a bit of information about each field in each file in the dataset. Fortunately, the dataset isn’t so big, so this fits into one medium-long report.

A snippet from the data dictionary for Syngenta’s Good Growth Plan Data.

Finally, and this is my favorite one, we asked ChatGPT to think of business questions that the data might be used to answer, who would ask them, and why.

Business questions for this dataset

In many cases, the questions just focus on a single file in the dataset. This is not surprising, since each of these studies was devised to respond to a particular Good Growth Goal. But some of them combine information from multiple sources to draw new insights.

The whole process is automated: if you create a dataset on data.world and upload some CSVs (right now, the process only looks at CSVs, which is why one of the resources in the GGP data is not taken into account), I can run this program and create the annotations for you. If you are reading this, I want to ask you to give it a try. Using the instructions in my earlier blog, create a dataset on the data.world community site, upload some CSVs, then tell me about it in the comments below. I’ll run my ChatGPT process and annotate your dataset. Let’s see how well it does on your data.
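
To round out the picture, the last automated step just writes the generated text back into the dataset as markdown files. A sketch, reusing the names from the snippets above and data.world’s file upload endpoint:

```python
def publish_annotation(owner: str, dataset: str, filename: str, text: str) -> None:
    """Upload a generated markdown file into the data.world dataset."""
    url = f"{API}/uploads/{owner}/{dataset}/files/{filename}"
    requests.put(url, headers=HEADERS, data=text.encode("utf-8")).raise_for_status()

# Tie the pieces together, using summarize() and summary from the sketches above.
publish_annotation(OWNER, DATASET, "DataSummary.md", summarize(summary))
```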

I said I would get back to the metadata aspect. There is considerable concern among developers who might want to use ChatGPT that anything you send to a third party (like openai.com) will violate the data agreements you have with your users or customers. This isn’t a matter of distrust of OpenAI or any other company, but simply one of information control: I’m not allowed to share information that I receive in confidence with anyone, no matter how much I trust them.

But metadata has historically been treated as much less sensitive than data. It is a lot less of a concern for me to tell you that I keep first name, last name, address, phone, and email information for my customers than it is to actually tell you what those values are for real customers. This application uses only the metadata; not a single data point was examined to create these annotations.

This also means that the application can be used to analyze a data catalog or data schema, where no actual data is available. This gives it much more applicability than a process that requires access to whole datasets.

One more point of interest about this whole process: the program that implements it is written in Python, and it makes heavy use of the data.world API and the openai API, not to mention any number of Python idioms. I can’t really say that I wrote this program; I told ChatGPT what I wanted to do, and it wrote the first draft. It is quite conversant in both of those APIs, as well as in common Python coding idioms. It even gave me instructions for using pip to install the required packages in my Python environment. As I wanted more specific features, I took over editing the code, but the first operational version was written purely by ChatGPT. The whole development took about a day to reach an MVP, and a few more hours of honing before I felt good about writing a blog about it.

Some features that are currently missing, but coming soon:

  • Process XLSX files as well as CSVs. There’s no reason to limit it to CSVs.
  • For each of the business questions, have it write SPARQL queries that answer them (see the sketch after this list).
  • Allow the user to provide hints to focus the business questions, or even provide something like a dialog to support them.
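
As a taste of where the SPARQL item is headed: data.world already exposes a SPARQL endpoint for every dataset, and the datadotworld Python package can query it. The query below is a deliberately generic placeholder; the envisioned feature would have ChatGPT write queries tailored to each business question.

```python
import datadotworld as dw  # pip install datadotworld

# Generic placeholder query; the real feature would generate a query
# matched to a specific business question.
results = dw.query(
    "syngenta-ggp/good-growth-plan-open-data",  # hypothetical slug, as above
    "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
    query_kind="sparql",
)
print(results.dataframe)
```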

Are there any other features you think would be cool or useful? Let me know in the comments.

