Exploratory Data Analysis (EDA) of the paper list in ICML 2024

Taks@skyfoliage.com
6 min read · Jul 30, 2024



In a previous article, I shared a tip on how to get the paper list for ICML 2024. In this article, using the allPapers.json file obtained there, I'd like to conduct a brief exploratory data analysis (EDA) of the trends in ICML 2024.

First, I load the JSON file using Python and remove some unnecessary columns from the dataframe.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import ujson as json
import plotly.express as px
from plotly.subplots import make_subplots

#Data loading
file_path = './allPapers.json'
df = pd.read_json(file_path)
# nothing in keywords...
df = df.drop(columns=['keywords', 'starttime', 'endtime','starttime_special', 'read', 'bookmarked'])
# create easiest feature
df['num_authors'] = df['authors'].apply(len)
print("No. of records:", len(df)) # No. of records: 2634
# calculate the number of authors listed in each paper
author_counts = df['num_authors'].value_counts().sort_index()
author_count_df = pd.DataFrame({
    'num_authors': author_counts.index,
    'count': author_counts.values
})

# calculate the percentage
total_papers = author_count_df['count'].sum()
author_count_df['percentage'] = (author_count_df['count'] / total_papers) * 100

# create a histogram
fig = px.bar(author_count_df, x='num_authors', y='count',
             text=author_count_df['count'].astype(str) + '<br>' + author_count_df['percentage'].round(1).astype(str) + '%',
             labels={'num_authors': 'Number of Authors', 'count': 'Number of Papers'},
             title='Number of Papers by Author Count')
fig.update_layout(xaxis_title='Number of Authors',
                  yaxis_title='Number of Papers',
                  xaxis=dict(tickmode='linear', tick0=0, dtick=5))
fig.update_traces(textposition='outside')
fig.show()

Only one paper has more than 70 authors, but, as you might expect, the typical number of authors for an ICML paper is around four.
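This can be backed up with summary statistics. A minimal sketch, using a toy Series that stands in for the df['num_authors'] column built above:

```python
import pandas as pd

# Toy stand-in for df['num_authors']; with the real allPapers.json,
# run the same calls on the actual column.
num_authors = pd.Series([2, 3, 4, 4, 4, 5, 5, 6, 7, 70])

print("median:", num_authors.median())      # robust to the 70-author outlier
print("mode:", num_authors.mode().iloc[0])  # most common author count
```

The median and mode are more informative than the mean here, since a single 70-author paper drags the mean upward.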

Next, the ‘sessions’ column in the JSON file contains some noise, such as entries that start with a comma or are empty. I also corrected the ‘eventtype’ column using the ‘sessions’ column, since every entry in ‘eventtype’ was listed as ‘Poster’.

def clean_sessions(session):
    # skip empty entries
    if not session:
        return None
    # drop a leading comma
    if session.startswith(','):
        session = session[1:].strip()
    return session

# correct the sessions col
df['sessions'] = df['sessions'].apply(lambda x: [clean_sessions(s) for s in x if s])
# skip entries left empty after cleaning when joining
df['sessions_str'] = df['sessions'].apply(lambda x: ', '.join(s for s in x if s))
df['sessions_str'] = df['sessions_str'].replace('', 'not set')

# change 'Oral' or 'not set' as eventtype, if sessions_str includes it
df.loc[df['sessions_str'].str.contains('Oral', na=False), 'eventtype'] = 'Oral'
df.loc[df['sessions_str'].str.contains('not set', na=False), 'eventtype'] = 'not set'

# check the data
df.head()

The entries in the ‘topic’ column are represented as ‘A → B’, where ‘A’ is the main topic and ‘B’ is the subtopic. Let’s split the column into ‘main_topic’ and ‘sub_topic’.

# create new cols from 'topic' and strip surrounding whitespace
df[['main_topic', 'sub_topic']] = df['topic'].str.split('->', expand=True)
df['main_topic'] = df['main_topic'].str.strip()
df['sub_topic'] = df['sub_topic'].str.strip()

# fill missing values with 'not set'
df['main_topic'] = df['main_topic'].fillna('not set')
df['sub_topic'] = df['sub_topic'].fillna('not set')

# check the result
df[['topic', 'main_topic', 'sub_topic']].head()

Okay, let’s visualize the data!

main_topics = df['main_topic'].unique()
# set colors to match these topics for consistency
color_map = {topic: color for topic, color in zip(main_topics, px.colors.qualitative.Plotly)}
# create sunburst chart
# omit 'values' so each paper counts once (passing df.index would weight papers by row number)
fig = px.sunburst(df, path=['main_topic', 'num_authors'], color='main_topic', color_discrete_map=color_map)
fig.update_traces(textinfo="label+percent entry")
fig.update_layout(title='Distribution of Papers by Main Topic and Number of Authors')
fig.show()

# boxplot
fig = px.box(df, x='main_topic', y='num_authors', title='Distribution of Number of Authors by Main Topic',
             color='main_topic', color_discrete_map=color_map)
fig.update_layout(
    xaxis_title='Main Topic',
    yaxis_title='Number of Authors',
    yaxis=dict(tickmode='linear', dtick=5),
    xaxis_tickangle=-45
)
fig.show()

ICML 2024 has 8 main topics plus ‘not set’ for unknown data. The topic with the highest number of papers is Deep Learning, accounting for 31%, followed by Applications and Social Aspects at 12%. The topic with the fewest papers is Probabilistic Methods, at 4%.

Additionally, when examining the distribution of the number of authors per paper, it appears that the Theory area tends to have fewer authors compared to other fields.
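This tendency can be quantified with a per-topic groupby. A minimal sketch, using a hypothetical mini-frame in place of df[['main_topic', 'num_authors']]:

```python
import pandas as pd

# Hypothetical mini-frame; on the real df, run the same
# groupby over df[['main_topic', 'num_authors']].
toy = pd.DataFrame({
    'main_topic': ['Theory', 'Theory', 'Deep Learning', 'Deep Learning', 'Applications'],
    'num_authors': [2, 3, 6, 5, 4],
})

# median author count per main topic, smallest first
median_by_topic = toy.groupby('main_topic')['num_authors'].median().sort_values()
print(median_by_topic)
```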

Next, let’s look into the relationship between subtopics and presentation types.

# sunburst(all data)
# omit 'values' so each paper counts once
fig = px.sunburst(df, path=['main_topic', 'eventtype', 'sub_topic'],
                  color='main_topic', color_discrete_map=color_map)
fig.update_traces(textinfo="label+percent entry")
fig.update_layout(title='Distribution of Papers by Main and Sub Topics (All)')
fig.show()

# sunburst (Poster)
fig_poster = px.sunburst(df[df['eventtype'] == 'Poster'], path=['main_topic', 'sub_topic'],
                         color='main_topic', color_discrete_map=color_map)
fig_poster.update_traces(textinfo="label+percent entry")
fig_poster.update_layout(title='Distribution of Papers by Main and Sub Topics (Poster)')

# sunburst (Oral)
fig_oral = px.sunburst(df[df['eventtype'] == 'Oral'], path=['main_topic', 'sub_topic'],
                       color='main_topic', color_discrete_map=color_map)
fig_oral.update_traces(textinfo="label+percent entry")
fig_oral.update_layout(title='Distribution of Papers by Main and Sub Topics (Oral)')

# create subplot
fig = make_subplots(rows=1, cols=2, subplot_titles=('Poster', 'Oral'), specs=[[{'type': 'sunburst'}, {'type': 'sunburst'}]])

# add these chart to subplot
for trace in fig_poster.data:
    fig.add_trace(trace, row=1, col=1)
for trace in fig_oral.data:
    fig.add_trace(trace, row=1, col=2)

fig.update_layout(title_text="Distribution of Papers by Event Type", showlegend=True)
fig.show()

Oral presentations are extremely rare across all fields, but the graph below highlights Deep Learning, the topic with the highest number of submissions.

Together, the ‘LLMs’ and ‘Generative Models and Autoencoders’ subtopics account for nearly 71% of this topic, indicating the significant attention they receive.

Here is the graph that separates Poster and Oral presentations.

In Poster presentations, Applications is the second most common topic, while Social Aspects ranks second in Oral presentations; this suggests that Applications likely had more demonstrations. Looking within these second-ranked topics, Computer Vision tops the subtopic list for Poster presentations, while privacy-related research leads in Oral presentations.
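These per-event-type topic shares can also be read off a column-normalized crosstab. A minimal sketch, with toy data standing in for df[['main_topic', 'eventtype']]:

```python
import pandas as pd

# Toy stand-in for df[['main_topic', 'eventtype']]
toy = pd.DataFrame({
    'main_topic': ['Deep Learning', 'Applications', 'Applications',
                   'Deep Learning', 'Social Aspects', 'Social Aspects'],
    'eventtype':  ['Poster', 'Poster', 'Poster', 'Oral', 'Oral', 'Oral'],
})

# normalize='columns' makes each event type's shares sum to 1.0,
# so topics can be ranked within Poster and Oral separately
shares = pd.crosstab(toy['main_topic'], toy['eventtype'], normalize='columns')
print(shares.round(2))
```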

Finally, let’s create word clouds from the titles and abstracts. Rather than applying stemming or lemmatization, I used the simple set of stopwords declared below for this visualization.

# stemming or a lemmatizer would be better, but the easiest way is the following
custom_stopwords = {
    "model", "models", "learning", "based", "via", "using", "method", "methods", "approach", "approaches", "data",
    "toward", "towards", "Position", "problem", "problems", "ai", "AI", "artificial intelligence",
    "algorithm", "algorithms", "task", "tasks", "show", "shows", "shown", "propose", "proposes", "proposed",
    "result", "results", "setting", "settings", "exist", "existing", "exists",
    "improve", "improves", "improved", "improvement", "improvements",
    "function", "functions", "novel", "novels", "machine", "performance", "solution", "solutions",
    "work", "worked", "works",
    "paper", "papers", "introduce", "introduced", "train", "trained", "training",
    "one", "two", "three", "first", "second", "third",
    "framework", "frameworks", "demonstrate", "demonstrates", "demonstrated",
    "new", "dataset", "datasets", "code", "codes", "information",
}

# add
stopwords = set(STOPWORDS).union(custom_stopwords)

# If you look at the word cloud as a whole, there are only common words, so focus on individual fields.
# Simply look at the most common words in the title
# Create a word cloud for each main_topic
main_topics = df['main_topic'].unique()

fig, axes = plt.subplots(3, 3, figsize=(12, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

# generate WordCloud
for ax, topic in zip(axes.flatten(), main_topics):
    titles = ' '.join(df[df['main_topic'] == topic]['title'])
    wordcloud = WordCloud(width=300, height=200, background_color='white', stopwords=stopwords).generate(titles)
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(topic, fontsize=12)

plt.suptitle('Common Keywords in Paper Titles by Main Topic', fontsize=16)
plt.show()

# wordcloud for abstract
fig, axes = plt.subplots(3, 3, figsize=(12, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for ax, topic in zip(axes.flatten(), main_topics):
    abstracts = ' '.join(df[df['main_topic'] == topic]['abstract'])
    wordcloud = WordCloud(width=300, height=200, background_color='white', stopwords=stopwords).generate(abstracts)
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(topic, fontsize=12)

plt.suptitle('Common Keywords in Paper Abstract by Main Topic', fontsize=16)
plt.show()
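The same stopword filtering can also feed an exact frequency count to complement the word clouds. A minimal sketch, with a few made-up titles standing in for one topic's df['title'] values:

```python
import re
from collections import Counter

stopwords = {"for", "with", "via", "learning"}  # illustrative subset

# Made-up titles standing in for df[df['main_topic'] == topic]['title']
titles = [
    "Diffusion Transformers for Graph Learning",
    "Efficient Diffusion Sampling via Distillation",
    "Graph Transformers with Sparse Attention",
]

# lowercase, tokenize, drop stopwords, then count
tokens = [w for t in titles for w in re.findall(r"[a-z]+", t.lower())
          if w not in stopwords]
top_words = Counter(tokens).most_common(3)
print(top_words)
```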

We can get a good sense of the trends for each topic. In Deep Learning, which has the highest number of presentations, terms like Transformer, Graph, LLM, and Diffusion are prevalent. In Social Aspects, which we previously highlighted in oral presentations, terms like Federated, Private, and Fairness stand out.

In this article, we performed a brief EDA to explore the trends in ICML 2024.

I hope this helps those who are undecided about which field of papers to read and that you find papers in the areas of your interest based on the EDA results!

Thank you!

Written by Taks.skyfoliage.com

This post is republished from skyfoliage.com
