Streamlining Graph Data Analysis with GPT-4’s AI-Powered Code Interpreter

Shreyareddy Edulakanti
8 min read · Nov 6, 2023


Introduction

In the realm of data science, efficiency and precision are paramount. My recent endeavor involved a complex graph dataset where I harnessed the capabilities of OpenAI’s GPT-4, the latest AI model known for its sophisticated code interpretation and generation. This article narrates my journey, highlighting how GPT-4’s assistance transformed the typically arduous tasks of exploratory data analysis (EDA), data preprocessing, clustering, and anomaly detection into a seamless workflow.

The Graph Dataset Challenge

The dataset presented a world where developers are the stars and repositories the galaxies they dwell in. Each line of data held a story — a repository forked, a contribution made, a partnership formed. But raw data is a language not everyone speaks, so I turned to GPT-4 to translate this binary prose into actionable insights. Graph data structures are intricate by nature, embodying relationships and connections that standard tabular analysis can overlook. My objective was to uncover insights within a citation network, working with both imbalanced and balanced versions of the data. The goal? Not only to categorize the scholarly articles in that network, but also to pinpoint anomalies that could signal irregularities in citation patterns.

Graph datasets are a fascinating frontier in data science, offering a canvas to portray the complex relationships and interactions between entities. In this project, the graph data structure was at the core of our exploration, encapsulating a citation network that potentially held patterns and anomalies indicative of the scholarly world’s dynamics.

The dataset comprised several key components:

Nodes: Representing entities such as scholarly articles, authors, or research topics.

Edges: Illustrating the citations between articles, collaborations among authors, or relationships between articles and topics.

Attributes: Each node and edge came with metadata. For articles, this could include the year of publication, the field of study, or the number of citations received.
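The article does not show how the graph itself was assembled, but a structure like this maps naturally onto a library such as networkx. The snippet below is a minimal illustrative sketch: the node names, attributes, and edges are invented for demonstration and are not taken from the actual dataset.

import networkx as nx

# Build a small illustrative citation-style graph.
# Nodes carry attributes (field, year); directed edges represent citations.
G = nx.DiGraph()
G.add_node('paper_A', field='machine learning', year=2019)
G.add_node('paper_B', field='machine learning', year=2021)
G.add_node('paper_C', field='statistics', year=2020)

G.add_edge('paper_B', 'paper_A')  # paper_B cites paper_A
G.add_edge('paper_C', 'paper_A')
G.add_edge('paper_C', 'paper_B')

print(G.number_of_nodes(), 'nodes,', G.number_of_edges(), 'edges')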

Diving into this data, the goal was threefold:

Categorization: Identify distinct groups or communities within the network. This could reveal clusters of articles frequently citing each other or groups of researchers collaborating closely.

Trend Analysis: Understand the flow of information and influence through the network. Which articles or authors stood out as central or influential within this scholarly web?

Anomaly Detection: Spot unusual patterns that could signal irregular citation behavior or potential areas of emerging research that break away from established trends.
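Continuing the illustrative networkx sketch above, these three goals map onto standard graph algorithms: community detection for categorization, centrality measures such as PageRank for influence, and simple degree-based checks as a first pass at anomaly detection. Again, this is a hedged sketch rather than the exact code used in the project.

import networkx as nx
from networkx.algorithms import community

# Trend analysis: PageRank scores highlight central, influential nodes
pagerank_scores = nx.pagerank(G)

# Categorization: community detection on the undirected view of the graph
communities = community.greedy_modularity_communities(G.to_undirected())

# Rough anomaly check: nodes cited far more often than the average
in_degrees = dict(G.in_degree())
mean_in_degree = sum(in_degrees.values()) / len(in_degrees)
heavily_cited = [n for n, d in in_degrees.items() if d > 3 * mean_in_degree]

print(pagerank_scores)
print([sorted(c) for c in communities])
print(heavily_cited)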

import pandas as pd

# Load the data for the repositories
repositories_path = '/mnt/data/repositories.csv'
repositories_df = pd.read_csv(repositories_path)

# Display the first few rows of the repositories dataframe
repositories_df.head()
   repo_id owner_username          repo_name  \
0        0       karpathy            nanoGPT
1        1       karpathy          micrograd
2        2       karpathy  arxiv-sanity-lite
3        3       karpathy          convnetjs
4        4  janhuenermann    Tesla-Simulator

                                         description           created_at  \
0  The simplest, fastest repository for training/...  2022-12-28 00:51:12
1  A tiny scalar-valued autograd engine and a neu...  2020-04-13 04:31:18
2  arxiv-sanity lite: tag arxiv papers of interes...  2021-11-13 04:34:22
3  Deep Learning in Javascript. Train Convolution...  2014-01-05 00:12:15
4  A JavaScript deep learning and reinforcement l...  2022-12-16 20:34:56

             pushed_at   size  stargazers_count  has_projects  has_wiki  ...  \
0  2023-07-11 12:30:39    935             22667          True      True  ...
1  2023-07-07 04:13:18    248              5652          True      True  ...
2  2023-06-19 16:23:02   1015               865          True      True  ...
3  2023-01-07 21:33:23  27806             10601          True      True  ...
4  2022-12-19 18:57:28  19511                17          True      True  ...

   forks_count  open_issues_count      license  is_template  \
0         2877                151  MIT License        False
1          682                 26  MIT License        False
2          102                 10  MIT License        False
3         2058                 75  MIT License        False
4           13                  0  MIT License        False

                                              topics  watching  \
0                                                 []       283
1                                                 []       113
2  ['arxiv', 'deep-learning', 'machine-learning',...        21
3                                                 []       599
4                                                 []         1

   contributors_count  commits_count  \
0                  31            195
1                   2             24
2                   2             75
3                  15             89
4                   5             65

                                           languages  \
0                                         ['Python']
1                     ['Jupyter Notebook', 'Python']
2  ['Python', 'HTML', 'CSS', 'JavaScript', 'Makef...
3                      ['JavaScript', 'HTML', 'CSS']
4                                     ['JavaScript']

                                              readme
0  \n# nanoGPT\n\n![nanoGPT](assets/nanogpt.jpg)\...
1  \n# micrograd\n\n![awww](puppy.jpg)\n\nA tiny ...
2  \n# arxiv-sanity-lite\n\nA much lighter-weight...
3  \n# ConvNetJS\n\nConvNetJS is a Javascript imp...
4  <img src="./images/icon.png" alt="neurojs" hei...

[5 rows x 21 columns]

Next, I will:

Provide a descriptive summary of the dataset.
Check for missing values.
Visualize the distributions of numeric features.
Visualize the distribution of categorical features such as license and languages.

# Descriptive summary of the dataset
repositories_summary = repositories_df.describe(include='all').transpose()

# Check for missing values
missing_values = repositories_df.isnull().sum()

# Compile the information into a single dataframe for a clear overview
summary_df = pd.DataFrame({
    "Data Type": repositories_df.dtypes,
    "Unique Values": repositories_df.nunique(),
    "Missing Values": missing_values,
    "Missing Values %": (missing_values / len(repositories_df)) * 100
}).join(repositories_summary.drop(columns=["count", "unique"]))

summary_df

The summary of the repositories.csv dataset reveals the following:

There are a total of 2954 entries and 21 columns.

There are no missing values for most of the columns. However, description has 78 missing values (2.64%), and readme has 1 missing value (0.034%).

The repo_id is unique for every repository, as expected.

owner_username has 1111 unique values, indicating there are multiple repositories per owner.

The size, stargazers_count, forks_count, open_issues_count, watching, contributors_count, and commits_count columns are numeric and vary widely in their ranges.

The has_projects, has_wiki, has_pages, and is_template columns are boolean fields.

license has 22 unique values, suggesting different types of licenses used across repositories.

languages appears to contain stringified lists, which need further processing before analysis, as sketched below.
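One concrete example of that further processing is parsing the stringified lists in languages into real Python lists; the languages_list column that appears in the missing-value output later in this article is presumably the result of a step like this. The exact code GPT-4 produced is not reproduced here, so treat the following as a plausible sketch.

import ast

# Parse the stringified lists in 'languages' into actual Python lists.
# Anything malformed or missing falls back to an empty list.
def parse_languages(value):
    try:
        parsed = ast.literal_eval(value)
        return parsed if isinstance(parsed, list) else []
    except (ValueError, SyntaxError):
        return []

repositories_df['languages_list'] = repositories_df['languages'].apply(parse_languages)

# Example: the ten most common languages across repositories
repositories_df['languages_list'].explode().value_counts().head(10)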

Now, let’s visualize the distributions of numeric features and the distribution of categorical features such as license and languages to gain deeper insights. We’ll start with histograms for numeric columns and bar charts for categorical columns that have a reasonable number of unique values.
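The plotting code itself is not reproduced above; a minimal sketch of what these visualizations could look like follows. The column selection and styling are assumptions, not the assistant's exact output.

import matplotlib.pyplot as plt

# Histograms for the numeric columns
numeric_cols = ['size', 'stargazers_count', 'forks_count', 'open_issues_count',
                'watching', 'contributors_count', 'commits_count']
repositories_df[numeric_cols].hist(bins=50, figsize=(14, 8))
plt.suptitle('Distributions of Numeric Features')
plt.tight_layout()
plt.show()

# Bar chart of the most common licenses
repositories_df['license'].value_counts().head(10).plot(
    kind='bar', figsize=(10, 4), title='Top 10 Licenses')
plt.ylabel('Number of Repositories')
plt.tight_layout()
plt.show()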

The Power of GPT-4 in EDA and Preprocessing

Data Cleaning and Preprocessing

Missing values were filled, anomalies were smoothed over, and features were engineered with precision. With each command, the AI peeled back layers of data, revealing the clean skeleton upon which our analysis would be built. My hands-on work with GPT-4 began with preprocessing, and it guided me through each step:

Handling Missing Values: With GPT-4’s suggestions, I replaced or imputed missing data, ensuring no valuable information was lost.

# Handling missing values
# Fill missing 'description' with a placeholder string
repositories_df['description'].fillna('No description', inplace=True)

# Since there's only one missing 'readme', we could either fill it with a placeholder or drop it.
# Given that it's just one row, we will fill it with a placeholder as well.
repositories_df['readme'].fillna('No readme', inplace=True)

# Verify if the missing values are handled
missing_values_after = repositories_df.isnull().sum()
missing_values_after
repo_id               0
owner_username        0
repo_name             0
description           0
created_at            0
pushed_at             0
size                  0
stargazers_count      0
has_projects          0
has_wiki              0
has_pages             0
forks_count           0
open_issues_count     0
license               0
is_template           0
topics                0
watching              0
contributors_count    0
commits_count         0
languages             0
readme                0
languages_list        0
dtype: int64

The missing values in the description and readme columns have been successfully handled by filling them with placeholder text. There are now no missing values in the repositories.csv dataset.

The next steps in data preprocessing before clustering and anomaly detection typically include:

Feature Engineering: Creating new features that could be helpful for analysis. For instance, we could extract the year from the created_at and pushed_at dates if we’re interested in temporal trends.

Feature Scaling: Since clustering algorithms like K-Means are sensitive to the scale of the data, we should scale numerical features.

Encoding Categorical Variables: If we plan to use categorical variables in clustering, we need to encode them, often using one-hot encoding or similar methods.

Feature Selection: GPT-4’s advice on feature engineering was invaluable, helping me to craft features that encapsulated the essence of the graph structure.

Data Scaling: GPT-4 provided code snippets for standardizing the feature set, a crucial step before any machine learning could commence.
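The full preprocessing pipeline is not reproduced in the article, but a minimal sketch of these steps, ending in the repositories_df_scaled matrix that the clustering code below expects, might look like the following. The specific feature set and the choice to one-hot encode only license are assumptions.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Feature engineering: extract the year from the date columns
repositories_df['created_year'] = pd.to_datetime(repositories_df['created_at']).dt.year
repositories_df['pushed_year'] = pd.to_datetime(repositories_df['pushed_at']).dt.year

# Assumed numeric feature set for clustering
feature_cols = ['size', 'stargazers_count', 'forks_count', 'open_issues_count',
                'watching', 'contributors_count', 'commits_count',
                'created_year', 'pushed_year']

# One-hot encode a categorical variable (license)
features = pd.concat(
    [repositories_df[feature_cols],
     pd.get_dummies(repositories_df['license'], prefix='license')],
    axis=1)

# Feature scaling: standardize so no single feature dominates K-Means
repositories_df_scaled = StandardScaler().fit_transform(features)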

Exploratory Data Analysis (EDA)

With GPT-4’s guidance, the EDA was not just a routine step but a revelation. Through visualizations, we saw the dance of data points, clustering into natural groups, their relationships mapped out in colors and shapes.

Visualizations: From generating code for complex visualizations to interpreting them, GPT-4’s insights allowed me to understand the data’s underlying patterns.

Statistical Summaries: GPT-4 effortlessly produced statistical summaries, interpreting the nuances with ease.

Clustering and Anomaly Detection

Clustering

Clustering was a tale of discovery. GPT-4’s algorithms scoured through the data, grouping repositories and developers into clusters that told stories of collaboration and innovation.

Elbow Method: By plotting the within-cluster sum of squares (WCSS) against the number of clusters, you look for the point where the rate of decrease in WCSS slows down significantly. This point suggests that adding more clusters doesn’t provide much better modeling of the data.

K-Means Clustering: After determining the optimal number of clusters, you would run the K-Means algorithm with that number to assign each repository to a cluster.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Determine the range of cluster numbers to try
cluster_range = range(1, 11)

# Calculate the within-cluster sum of squares for each number of clusters
wcss = []
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(repositories_df_scaled)
    wcss.append(kmeans.inertia_)

# Plot the elbow graph
plt.figure(figsize=(10, 5))
plt.plot(cluster_range, wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.xticks(cluster_range)
plt.show()
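The cluster-assignment step that follows the elbow plot is not shown in the article; assuming the plot pointed to, say, four clusters (a placeholder value, not a reported result), the final fit might look like this:

# Fit K-Means with the number of clusters suggested by the elbow plot
# (4 is a placeholder; use whatever the elbow actually indicates)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
repositories_df['cluster'] = kmeans.fit_predict(repositories_df_scaled)

# Inspect how repositories are distributed across the clusters
repositories_df['cluster'].value_counts()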

Anomaly Detection

Isolation Forests: GPT-4’s expertise extended to anomaly detection, where it furnished code for implementing Isolation Forests, ensuring outliers were identified accurately. Anomalies in data can be harbingers of innovation or signs of disorder, allowing us to probe into the unusual and the exceptional.
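The Isolation Forest code is not reproduced in the article body; a minimal sketch of the approach, reusing the scaled feature matrix from the preprocessing step, could look like the following. The contamination rate is an assumed, tunable parameter rather than a value from the article.

from sklearn.ensemble import IsolationForest

# Fit an Isolation Forest on the scaled features.
# contamination is the assumed fraction of outliers, not a reported value.
iso_forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
repositories_df['anomaly'] = iso_forest.fit_predict(repositories_df_scaled)

# -1 marks anomalies, 1 marks normal points
anomalies = repositories_df[repositories_df['anomaly'] == -1]
print(f"Flagged {len(anomalies)} repositories as anomalous")
anomalies[['repo_name', 'stargazers_count', 'forks_count', 'commits_count']].head()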

Reflections and Learnings

Accelerated Workflow

GPT-4’s role was akin to a co-pilot. It accelerated my workflow significantly, providing instant code solutions and interpretations that would otherwise require extensive manual effort.

AI as a Collaborator

My experience cemented the idea that AI can act as a collaborator, not just a tool. GPT-4’s conversational capabilities made it feel like a dialogue, a two-way interaction that enriched the analytical process. For the modeling stage itself, I also used an AutoML tool (DataRobot).

Limitations and Workarounds

No journey is without challenges. GPT-4’s limitations in executing long-running tasks led me to adapt, splitting complex tasks into manageable segments and sometimes resorting to local execution.

Conclusion: The Future of Data Science with AI Assistants

My venture with GPT-4 was enlightening. It demonstrated not just what AI can do today, but a future where data scientists are free from the drudgery of repetitive tasks, focusing instead on strategic analysis and decision-making. GPT-4 hasn’t just been a tool; it’s been a catalyst for a more profound and insightful exploration into data.
