Exploring the AI Landscape. Part 1: Crafting the Data Foundation — Data Selection and Preparation

Tatiana Petrova
8 min read · Mar 22, 2023


Image generated by the author using Stable Diffusion.

Welcome to the first part in our three-part series, designed to serve as a tutorial for data analysts. Each article in this series covers the fundamentals of working with data, empowering you to create and implement your own solutions:

  • Part 1: Crafting the Data Foundation — Data Selection and Preparation.
    In this first article, we dive into the process of collecting and preparing data from the popular scientific repository arXiv.org. Focusing on articles from 2018 to 2022, we set the stage for an engaging journey into the world of AI topics.
  • Part 2: Words of Change — Gaining Insights from Article Titles.
    This section delves into the language of innovation by examining AI topics through trigrams in article titles. We will clarify the methods used and provide helpful tables to visualize and understand the data.
  • Part 3: Picturing the Present — Visualizing the Rise and Fall of Topics.
    In the final (and the most exciting) piece of the series, we will interpret our findings, and reveal the crucial topics and trends that are shaping the AI landscape in 2022. Using the power of visualization, we will present an exciting view of AI’s present and future.

By following this guide, you’ll not only gain a solid understanding of data analysis and visualization techniques but also learn how to apply them to the ever-evolving world of artificial intelligence. This series offers a streamlined approach to swiftly analyze trends and provides an overview of the current state and future direction of AI.

Our method is grounded in the analysis of trigram frequency in article titles. We posit that when certain trigrams frequently emerge in research article titles, they can establish stable terms commonly employed to describe specific areas or techniques in a given knowledge domain. The prevalence of these trigrams can act as a qualitative measure of the popularity of particular research topics, and in this series, we apply this method to the titles of computer science articles.
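To make the trigram idea concrete, here is a minimal sketch of counting word-level trigrams across a couple of invented titles. The actual preprocessing used in the series (covered in Part 2) may differ, for example in tokenization and stopword handling:

```python
from collections import Counter

def title_trigrams(title):
    """Return the word-level trigrams of a lowercased title."""
    words = title.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

# Two invented titles, just to illustrate the counting
titles = [
    "Deep Reinforcement Learning for Robotics",
    "A Survey of Deep Reinforcement Learning",
]

# Count trigram occurrences across all titles
counts = Counter(t for title in titles for t in title_trigrams(title))
print(counts.most_common(1))  # [('deep reinforcement learning', 2)]
```

A trigram that recurs across many titles, like "deep reinforcement learning" here, is exactly the kind of stable term the method is designed to surface.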

Part 1: Selecting and Preparing Data consists of five sections:

  • Selecting Data: Why arXiv.org? Why the “Computer Science” group?
  • Downloading Data,
  • Filtering Data,
  • Visualising Data: Number of Articles by Year, Popularity of Categories Over the Years,
  • Conclusion.

Selecting Data

Why arXiv.org?

ArXiv.org is a well-known preprint repository in fields like physics, mathematics, computer science, and more. It provides a fast publishing platform for preprints, bypassing the delays of traditional peer-review processes. This means articles on arXiv.org are often accessible before being published in peer-reviewed journals, granting early access to cutting-edge research and insights. This is especially valuable in rapidly evolving fields like AI, where staying current with emerging trends is crucial.

Why the “Computer Science” group?

We chose articles from the arXiv.org group called “Computer Science.” This group covers a wide array of areas closely related to AI, including machine learning, computer vision, natural language processing, robotics, and more. By considering all categories within this group, we can capture a diverse spectrum of research and data that might not be as evident if we were to limit our selection to just a few specific categories. However, it’s important to note that not every topic within the Computer Science group is directly related to AI. Therefore, while interpreting our results, we’ll evaluate the topics for AI relevance and exclude those that aren’t.

The analysis targets articles from 2018 to 2022, a five-year period that encapsulates the latest trends and advancements in AI.

Downloading Data

In this analysis, we use a freely available arXiv.org dataset from Kaggle, which comprises over 1.7 million articles and includes useful features such as article titles, authors, categories, abstracts, full text PDFs, and more. We download the arXiv dataset from Kaggle and store it as a JSON file called arxiv-metadata-oai-snapshot.json in the current working directory.

To process the data, we first read the JSON file and extract specific fields like article ‘ID’, ‘title’, ‘categories’, and ‘versions’ into a list of dictionaries. The code consists of two functions: parse_article_data and create_article_dataframe.

The parse_article_data function reads the file, iterates through each line, and appends the desired data to the article_list.

import json
import pandas as pd

def parse_article_data(file_path):
    article_list = []

    with open(file_path, "r") as file:
        for line in file:
            data = json.loads(line)

            article_list.append(
                {
                    "ID": data["id"],
                    "title": data["title"],
                    "categories": data["categories"],
                    "versions": data["versions"],
                }
            )

    return article_list

Then, the create_article_dataframe function converts the list of dictionaries into a pandas DataFrame for further analysis. The functions are called with the appropriate file path, resulting in an article_df DataFrame containing the parsed data:

def create_article_dataframe(article_data_list):
    return pd.DataFrame.from_records(article_data_list)


file_path = "arxiv-metadata-oai-snapshot.json"
article_data = parse_article_data(file_path)
article_df = create_article_dataframe(article_data)
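Since the snapshot contains over 1.7 million lines, it can help to preview a handful of records before parsing the whole file. A minimal sketch, using a hypothetical `preview_articles` helper built on `itertools.islice` (not part of the article's pipeline):

```python
import json
from itertools import islice

def preview_articles(file_path, n=3):
    """Load only the first n JSON lines of the snapshot, for a quick look
    at the available fields before committing to a full parse."""
    with open(file_path, "r") as file:
        return [json.loads(line) for line in islice(file, n)]
```

Inspecting a few records this way confirms the field names ('id', 'title', 'categories', 'versions') before the full, memory-hungry parse.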

Filtering Data

Next, we create a function called filter_cs_articles to filter a DataFrame of articles, retaining only those belonging to the Computer Science category (with categories starting with “cs”). Within the function, we first extract the version and creation date corresponding to the first version of each article. Next, we calculate the publication year and store it in a new ‘year’ column. We then filter the DataFrame to keep articles published between the specified start_year and end_year. Finally, we remove unnecessary columns from the DataFrame and return the filtered cs_articles DataFrame. The function is then called with the original article_df DataFrame and the desired year range (2018 to 2022) to obtain the filtered cs_articles DataFrame:

def filter_cs_articles(article_df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame:
    # Keep only Computer Science articles (categories starting with "cs");
    # copy to avoid a SettingWithCopyWarning on the assignments below
    cs_articles = article_df[article_df['categories'].str.contains('^cs', regex=True)].copy()

    # Extract the version and created date for the first version
    cs_articles[['version', 'created']] = cs_articles['versions'].apply(
        lambda x: pd.Series([x[0]['version'], x[0]['created']])
    )

    cs_articles['year'] = pd.to_datetime(
        cs_articles['created'], format='%a, %d %b %Y %H:%M:%S GMT'
    ).dt.year

    cs_articles = cs_articles[
        (cs_articles["year"] >= start_year) & (cs_articles["year"] <= end_year)
    ]

    cs_articles = cs_articles.drop(columns=["versions", "version", "created"])

    return cs_articles

cs_articles = filter_cs_articles(article_df, 2018, 2022)

cs_articles.head()
cs_articles.info()
The cs_articles DataFrame, containing articles from the arXiv.org Computer Science group from 2018 to 2022. Table created by the Author.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 259577 entries, 929308 to 1795204
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   ID          259577 non-null  object
 1   title       259577 non-null  object
 2   categories  259577 non-null  object
 3   year        259577 non-null  int64
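Before saving, a few lightweight consistency checks can catch filtering mistakes early. A sketch with a hypothetical `sanity_check` helper, not part of the article's code:

```python
import pandas as pd

def sanity_check(df: pd.DataFrame, start_year: int = 2018, end_year: int = 2022) -> bool:
    """Illustrative checks on the filtered frame: year range, CS-only
    categories, and no missing titles."""
    assert df["year"].between(start_year, end_year).all(), "year outside the requested range"
    assert df["categories"].str.startswith("cs").all(), "non-CS article slipped through"
    assert df["title"].notna().all(), "missing titles"
    return True
```

Running such checks once after filtering is cheap insurance before the DataFrame feeds the rest of the series.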

To continue the arXiv.org Computer Science analysis, we save the DataFrame to a cs_articles.csv file:

cs_articles.to_csv("cs_articles.csv", sep=",")

Visualising Data

Number of Articles by Year

Before diving into a more detailed analysis, it’s essential to understand the structure of the data we’re working with. We’ll create a visualization to display how the number of articles in the Computer Science group on arXiv.org has evolved over the years.

First, we import the necessary libraries, such as Plotly Express. Next, we count the number of articles by year and sort them in ascending order. Then, we create a bar plot to represent the year-wise article counts. After that, we set x and y-axis labels, and the plot’s title. Finally, we customize the font size and family for the title and tick labels:

import plotly.express as px

# Count the number of articles by year
year_counts = cs_articles['year'].value_counts()

# Sort the counts by year
year_counts = year_counts.sort_index()

# Create a bar plot of the year counts
fig = px.bar(
    x=year_counts.index,
    y=year_counts.values,
    color=year_counts.index,
    title='Number of Articles by Year'
)

# Set the x and y axis labels
fig.update_xaxes(title_text='Year')
fig.update_yaxes(title_text='Number of Articles')

# Set the title of the plot
fig.update_layout(
    title={
        'text': 'Number of Computer Science Articles on arXiv.org by Year',
        'font': {'size': 16, 'family': 'Helvetica'}
    }
)

# Set the x and y axis ticks font size
fig.update_xaxes(tickfont=dict(size=12))
fig.update_yaxes(tickfont=dict(size=12))

fig.show()
Number of Articles in the arXiv.org group Computer Science from 2018 to 2022. Image by the author.

The number of articles in the arXiv.org group Computer Science has been rising each year, growing from approximately 36,000 in 2018 to nearly 65,000 in 2022.
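As a quick back-of-the-envelope check on those rounded figures, the implied compound annual growth rate over the four year-to-year steps works out to roughly 16%:

```python
# Approximate article counts read off the chart (rounded in the text)
articles_2018, articles_2022 = 36_000, 65_000

# Compound annual growth rate across the four year-to-year steps
cagr = (articles_2022 / articles_2018) ** (1 / 4) - 1
print(f"{cagr:.1%}")  # roughly 16% growth per year
```

In other words, CS submissions to arXiv.org grew by about one sixth every year over the period we study.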

Popularity of Categories Over the Years

We aim to observe the popularity of the TOP-15 categories in 2022 and track changes over the years. We loop through the years 2018 to 2022, count the articles by category, and store the top 15 categories for each year in separate DataFrames. If an article is assigned to multiple categories, we add every category listed in its ‘categories’ column to the popularity table. Then, we concatenate these DataFrames and create an animated histogram using Plotly, where the x-axis represents categories, the y-axis represents the number of articles, and the animation frame represents the year:

# Define the years to loop over
years = [2022, 2021, 2020, 2019, 2018]

# Create an empty list to store the dataframes for each year
dfs = []

# Loop over each year
for year in years:
    # Count the number of articles by category for the current year,
    # splitting multi-category strings so every listed category is counted
    category_counts = pd.Series(
        cs_articles[cs_articles['year'] == year]['categories']
        .str.split(expand=True)
        .values.ravel()
    ).value_counts()

    # Keep only the top 15 categories
    category_counts = category_counts[:15]

    # Create a dataframe for the current year
    df = pd.DataFrame(
        {'Category': category_counts.index,
         'Number of Articles': category_counts.values,
         'Year': year}
    )

    # Append the dataframe to the list of dataframes
    dfs.append(df)

# Concatenate the dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)

# Create an animated histogram using Plotly
fig = px.histogram(
    df,
    x='Category',
    y='Number of Articles',
    color='Category',
    animation_frame='Year',
    nbins=len(df['Category'].unique())
)

# Set the layout of the plot
fig.update_layout(
    title='Top 15 Categories by Number of Articles',
    xaxis_title='Category',
    yaxis_title='Number of Articles',
    yaxis=dict(range=[0, 24000])
)

fig.show()
Top-15 Categories of 2022: Popularity Change in arXiv.org Computer Science Group (2018–2022). Image by the author.
  • The most popular category over the last four years is cs.LG (Machine Learning), which covers a wide range of machine learning research, including supervised, unsupervised, reinforcement learning, bandit problems, and more. It also addresses topics such as robustness, explanation, fairness, and methodology, making it a fitting primary category for machine learning applications.
  • The second most popular category is cs.CV (Computer Vision and Pattern Recognition), encompassing research on image processing, computer vision, pattern recognition, and scene understanding. Interestingly, cs.CV was the most popular category in 2018.
  • In recent years, the third most popular category has been cs.AI (Artificial Intelligence), which covers a broad range of AI topics, excluding Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), each of which has separate subject areas. This category includes topics such as Expert Systems, Theorem Proving, Knowledge Representation, Planning, and Uncertainty in AI.

Conclusion

We have successfully collected and processed 259,577 articles in the field of computer science published on arXiv.org between 2018 and 2022 for trend analysis. This substantial dataset provides us with a strong foundation for discovering key patterns and emerging topics in the AI landscape. We also explored the process of transforming JSON data into a CSV format, employed various filtering techniques, and conducted elementary visualization of DataFrame content. The complete code for this process can be found in my GitHub repository.

In the upcoming Part 2: Words of Change — Gaining Insights from Article Titles, we will delve deeper into the hot topics and trends within AI by examining the trigrams present in article titles. This approach will allow us to uncover the most influential and innovative areas of research in the ever-evolving field of artificial intelligence. Stay tuned for a fascinating exploration of the language of innovation and its impact on the future of AI.
