3 Ways ChatGPT Can Help You Document Your Data Analytics Work

Maciej Gieparda
Published in Microsoft Power BI
8 min read · Mar 22, 2023
That’s DALL-E again with a great visualization of “Writing Documentation”

Hello everybody! First, I am touched and thrilled by your huge interest in my previous article about my use cases for ChatGPT in data analytics work. Thanks a lot! I also noticed that many of you were interested in using ChatGPT to support writing documentation, so I decided to prepare a case study.

I will touch here on 3 aspects of writing documentation where ChatGPT can help:

  • Commenting code (some people file this under “clean code”, but it is also part of documentation)
  • Writing a draft of documentation for your code
  • Writing documentation of datasets/tables

Commenting Code

Let’s say we did the analysis below: we wanted to predict the GDP of countries up to 2030 by applying simple linear regression.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

df = pd.read_csv('data1.csv')

def predict_gdp(df, country):
    years = np.arange(2010, 2023).reshape(-1, 1)
    gdp = df[df['Country'] == country].iloc[:, 1:14].values.flatten()
    model = LinearRegression()
    model.fit(years, gdp)
    future_years = np.arange(2023, 2031).reshape(-1, 1)
    predictions = model.predict(future_years)

    return predictions

for year in range(2023, 2031):
    df[str(year)] = 0

for country in df['Country']:
    predictions = predict_gdp(df, country)
    df.loc[df['Country'] == country, '2023':'2030'] = predictions

df['Percentage_Increase'] = ((df['2030'] - df['2022']) / df['2022']) * 100
df = df.sort_values(by='Percentage_Increase', ascending=False)

plt.figure(figsize=(12, 6))
years = list(map(str, range(2010, 2031)))

for _, row in df.head(5).iterrows():
    plt.plot(years, row[1:-1], label=row['Country'], marker='o')

plt.xlabel('Years')
plt.ylabel('GDP')
plt.title('GDP Predictions for Fastest-Growing Economies (2023-2030)')
plt.legend()
plt.grid()
plt.show()
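To run the script above without the original data1.csv, a stand-in file can be generated. This is only a sketch: the layout (a Country column followed by the year columns 2010 to 2022) is inferred from the code above, the helper name write_sample_csv is hypothetical, and the country names and numbers are made-up placeholders, not real GDP figures.

```python
import csv


def write_sample_csv(path="data1.csv"):
    """Write a small placeholder dataset with the layout the script expects."""
    # 13 year columns, 2010..2022, matching iloc[:, 1:14] in the script
    years = [str(y) for y in range(2010, 2023)]

    # (name, value in 2010, yearly increase): invented numbers, NOT real GDP data
    placeholders = [
        ("Country A", 100.0, 5.0),
        ("Country B", 250.0, 2.5),
        ("Country C", 80.0, 8.0),
        ("Country D", 400.0, 1.0),
        ("Country E", 150.0, 4.0),
    ]

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Country"] + years)
        for name, start, step in placeholders:
            values = [start + step * i for i in range(len(years))]
            writer.writerow([name] + values)
    return path
```

Calling write_sample_csv() once creates data1.csv in the working directory, after which the script can be run as it is.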

We did a great job and we want to publish our results, but we also want to be sure that our colleagues will understand the code. Do we need to add comments everywhere by hand? No, we don’t. Pasting the code into ChatGPT gives quite good results; all I need to add is the command “Return this code snippet with comments describing what is happening in this code”. Remember to be precise about your needs, so use phrases like “Return this code”, “comments”, and “in this code”. ChatGPT is a language model (as probably everybody has heard), so it needs a direct command. ChatGPT 4 does noticeably better here.

The output looks like this:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Read in the CSV data file and store it in a Pandas DataFrame
df = pd.read_csv('data1.csv')

# Define a function to predict GDP for a given country using linear regression
def predict_gdp(df, country):
    # Create a numpy array of years from 2010 to 2022
    years = np.arange(2010, 2023).reshape(-1, 1)
    # Extract the GDP values for the given country from the DataFrame and flatten them into a 1D array
    gdp = df[df['Country'] == country].iloc[:, 1:14].values.flatten()
    # Create a LinearRegression model and fit it to the years and GDP data
    model = LinearRegression()
    model.fit(years, gdp)
    # Create a numpy array of future years from 2023 to 2030
    future_years = np.arange(2023, 2031).reshape(-1, 1)
    # Use the model to predict GDP values for the future years
    predictions = model.predict(future_years)
    # Return the predicted GDP values
    return predictions

# Add columns for years 2023-2030 to the DataFrame, initialized to 0
for year in range(2023, 2031):
    df[str(year)] = 0

# Loop through each country in the DataFrame and use the predict_gdp function to populate the GDP values for the future years
for country in df['Country']:
    predictions = predict_gdp(df, country)
    df.loc[df['Country'] == country, '2023':'2030'] = predictions

# Calculate the percentage increase in GDP from 2022 to 2030 for each country and add it as a new column to the DataFrame
df['Percentage_Increase'] = ((df['2030'] - df['2022']) / df['2022']) * 100
# Sort the DataFrame by the percentage increase column in descending order
df = df.sort_values(by='Percentage_Increase', ascending=False)

# Plot a line graph of the GDP predictions for the top 5 fastest-growing economies from 2023 to 2030
plt.figure(figsize=(12, 6))
years = list(map(str, range(2010, 2031)))
for _, row in df.head(5).iterrows():
    plt.plot(years, row[1:-1], label=row['Country'], marker='o')
plt.xlabel('Years')
plt.ylabel('GDP')
plt.title('GDP Predictions for Fastest-Growing Economies (2023-2030)')
plt.legend()
plt.grid()
plt.show()

Nice! The comments are clear and very detailed, which will definitely help in understanding the code when you have to come back to it in 1–2 years. What I personally don’t like is that the comments on the function do not follow clean-code standards: what a function does belongs in a docstring. So I asked ChatGPT to correct it.

def predict_gdp(df, country):
    """Predicts the GDP for a given country from 2023 to 2030 using linear regression.

    Args:
        df (pd.DataFrame): A Pandas DataFrame containing GDP data.
        country (str): The name of the country to predict GDP for.

    Returns:
        np.ndarray: A numpy array containing the predicted GDP values for the given country.
    """

    # Create a numpy array of years from 2010 to 2022
    years = np.arange(2010, 2023).reshape(-1, 1)

    # Extract the GDP values for the given country from the DataFrame and flatten them into a 1D array
    gdp = df[df['Country'] == country].iloc[:, 1:14].values.flatten()

    # Create a LinearRegression model and fit it to the years and GDP data
    model = LinearRegression()
    model.fit(years, gdp)

    # Create a numpy array of future years from 2023 to 2030
    future_years = np.arange(2023, 2031).reshape(-1, 1)

    # Use the model to predict GDP values for the future years
    predictions = model.predict(future_years)

    # Return the predicted GDP values
    return predictions

Much better! Sometimes ChatGPT needs more guidance from us about certain details.

Remember one thing! Never take anything that ChatGPT generates for you as 100% correct! Always go through its answer and check that everything is okay. People make mistakes, machines make mistakes, so just check each other :)
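One part of that check can be automated: confirming that the commented version still contains the same code as the original. The sketch below is one possible approach, not something ChatGPT provides. It uses Python's standard ast module, which ignores comments when parsing, and it strips docstrings before comparing. The function names same_logic and _strip_docstrings are hypothetical.

```python
import ast


def _strip_docstrings(tree):
    """Remove docstrings from a parsed module, its functions, and its classes."""
    containers = (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if not isinstance(node, containers) or not node.body:
            continue
        first = node.body[0]
        is_docstring = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        if is_docstring:
            rest = node.body[1:]
            # A function or class body cannot be empty, so keep a placeholder
            if not rest and not isinstance(node, ast.Module):
                rest = [ast.Pass()]
            node.body = rest
    return tree


def same_logic(original_src, commented_src):
    """Return True if both sources parse to the same code, ignoring comments and docstrings."""
    original = ast.dump(_strip_docstrings(ast.parse(original_src)))
    commented = ast.dump(_strip_docstrings(ast.parse(commented_src)))
    return original == commented


if __name__ == "__main__":
    before = "def f(x):\n    return x + 1\n"
    after = 'def f(x):\n    """Add one."""\n    # add one to x\n    return x + 1\n'
    print(same_logic(before, after))
```

This only shows that the code itself was not altered. Whether the comments describe the code correctly still has to be read and judged by a person.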

Writing a draft of documentation for your code

Let’s say that we are satisfied with our commented code and we want to add documentation for it. This is also simple with ChatGPT. A template for your README file is very helpful here. I am using one that was proposed in my Udacity Nanodegrees.


# Table of contents
1. [Installation](#introduction)
2. [Project Motivation](#paragraph1)
3. [File Descriptions](#paragraph2)
4. [Interaction](#paragraph3)
5. [Author](#paragraph4)

### Installation <a name="introduction"></a>

**Libraries:**<br/>


**Tools:**<br/>


### Project Motivation <a name="paragraph1"></a>


### File Descriptions <a name="paragraph2"></a>


### Interaction <a name="paragraph3"></a>


### Author <a name="paragraph4"></a>

All I did was:

  • Type into ChatGPT “Create a README file of this code:”
  • Paste the code above
  • Then type “The template of README file looks like this:”
  • Paste the template above and send the message to ChatGPT!
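If you repeat these steps often, the message can be assembled in code instead of by hand. This is a minimal sketch that only builds the text to paste into ChatGPT; it does not call any API, and the function name build_readme_prompt is hypothetical.

```python
def build_readme_prompt(code: str, template: str) -> str:
    """Assemble the README request in the same order as the manual steps."""
    parts = [
        "Create a README file of this code:",
        code.strip(),
        "The template of README file looks like this:",
        template.strip(),
    ]
    # Blank lines between the parts keep instructions and pasted content apart
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Tiny placeholder inputs; in practice, paste the real script and template
    example_code = "print('hello')"
    example_template = "# Table of contents\n1. [Installation](#introduction)"
    print(build_readme_prompt(example_code, example_template))
```

Keeping the two instruction sentences as separate parts makes it easy to adjust the wording later without touching the pasted code or template.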

And this is what we get back:

# README File

## Table of Contents
1. [Installation](#installation)
2. [Project Motivation](#project-motivation)
3. [File Descriptions](#file-descriptions)
4. [Usage](#usage)
5. [Author](#author)

## Installation <a name="installation"></a>
This code requires the following libraries:
- numpy
- pandas
- scikit-learn
- matplotlib

These libraries can be installed using pip package installer:
pip install numpy pandas scikit-learn matplotlib


## Project Motivation <a name="project-motivation"></a>
The purpose of this project is to predict the GDP of various countries for the years 2023 to 2030 using linear regression. The predicted values are then used to determine the percentage increase in GDP for each country from 2022 to 2030. Finally, a line graph is created to visualize the GDP predictions for the top 5 fastest-growing economies from 2023 to 2030.

## File Descriptions <a name="file-descriptions"></a>
- `data1.csv`: a CSV file containing historical GDP data for various countries
- `README.md`: this file
- `gdp_prediction.py`: Python script that reads in the data from `data1.csv`, creates a LinearRegression model for each country, predicts the GDP values for 2023-2030, calculates the percentage increase in GDP from 2022 to 2030 for each country, sorts the DataFrame by the percentage increase column in descending order, and plots a line graph of the GDP predictions for the top 5 fastest-growing economies.

## Usage <a name="usage"></a>
To use this code, simply run the `gdp_prediction.py` script in a Python environment with the required libraries installed. The script will output a line graph of the GDP predictions for the top 5 fastest-growing economies from 2023 to 2030.

## Author <a name="author"></a>
This code was written by ChatGPT, an AI language model developed by OpenAI.

That’s it! We have a ready-to-paste markdown file. We need to review it, edit it if needed, and check for mistakes before publishing.
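One thing worth checking during that review is whether every link in the table of contents points to an anchor that actually exists. The sketch below assumes the README uses the same convention as the template above, with links of the form (#name) and targets of the form <a name="name"></a>. The function name find_broken_toc_links is hypothetical.

```python
import re


def find_broken_toc_links(markdown: str):
    """Return anchors that are linked with (#name) but never defined with <a name="name">."""
    linked = re.findall(r"\]\(#([^)]+)\)", markdown)
    defined = set(re.findall(r'<a name="([^"]+)"></a>', markdown))
    return [anchor for anchor in linked if anchor not in defined]


if __name__ == "__main__":
    sample = (
        "1. [Installation](#installation)\n"
        "2. [Usage](#usage)\n"
        "\n"
        '## Installation <a name="installation"></a>\n'
    )
    # The Usage section is missing from the sample, so its link is reported
    print(find_broken_toc_links(sample))  # ['usage']
```

An empty list means all table of contents links resolve. It says nothing about whether the text of each section is correct, so the manual review is still needed.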

Writing documentation of datasets/tables

This is the last example. ChatGPT can also help you write documentation drafts for the tables and datasets that you are planning to keep for the future. Because the data1.csv dataset is quite simple (just a country name and columns with years), I will use a different example. If you are following me, you know that I recently started my adventure with football data analytics, so I will describe the Coaches dataset, which looks like this.

Let’s ask ChatGPT to draft the documentation. Just copy and paste a sample of the dataset and ask “Write a draft of documentation based on this sample”.

The result? WOW!

IMPORTANT: Please keep in mind that you might be sharing sensitive, private, or classified data with ChatGPT. I mention this because we should be aware of our digital footprint on the web, even if we are confident about security, and because ChatGPT uses conversations with users to improve its model (nothing bad, that is how machine learning works). Before sharing this kind of information, think about your data and consult the proper authorities (GDPR officer, etc.). You can also use techniques like randomizing data, using dummy values, hashing, etc.
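As an illustration of the hashing idea, here is a minimal sketch that replaces the values of chosen columns with salted SHA-256 digests before a sample is pasted anywhere. The column names (coach_name, club) and the sample values are hypothetical placeholders, not the real Coaches dataset, and the function name pseudonymize is made up for this example.

```python
import hashlib


def pseudonymize(rows, sensitive_columns, salt):
    """Return copies of the rows with sensitive values replaced by short salted hashes."""
    masked_rows = []
    for row in rows:
        masked = dict(row)  # copy, so the original rows stay untouched
        for column in sensitive_columns:
            value = masked.get(column)
            if value:
                digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
                masked[column] = digest[:12]  # shortened for readability
        masked_rows.append(masked)
    return masked_rows


if __name__ == "__main__":
    # Dummy rows with hypothetical column names
    sample = [
        {"coach_name": "Example Coach", "club": "Example FC", "matches": "38"},
        {"coach_name": "Another Coach", "club": "Sample United", "matches": "34"},
    ]
    for row in pseudonymize(sample, ["coach_name"], salt="keep-this-secret"):
        print(row)
```

The same input always maps to the same hash, so relationships between rows survive while the names do not. Hashing is not full anonymization: the salt has to stay private, and values from a small, guessable set can still be recovered, so this does not replace talking to the people responsible for data protection.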

Summing Up

Here are some tips on how to use ChatGPT as support for your documentation process.

  • Be direct and descriptive at the same time about your needs, and use precise wording like “Using this CODE SNIPPET IN PYTHON” if you use Python, for example.
  • Always read and review the output from ChatGPT. Always.
  • Keep in mind that you might be sharing data with ChatGPT that should not be shared, as I mentioned above ☝️


Maciej Gieparda
Microsoft Power BI

Product Analyst, Data Enthusiast. I like Football, Travel, good food and playing Football Manager. https://linktr.ee/maciej.gieparda