Exploring the AI Landscape. Part 2: Words of Change — Gaining Insights from Article Titles

Tatiana Petrova
14 min read · Mar 22, 2023


Image generated by the author using Stable Diffusion.

In Part 1: Crafting the Data Foundation — Data Selection and Preparation, we explored computer science articles on arXiv.org from 2018 to 2022 and prepared the relevant dataset for analysis.

In Part 2: Words of Change — Gaining Insights from Article Titles, we will delve deeper into AI topics and trends using trigrams in article titles. By investigating this method of analysis and presenting supporting tables, we offer a straightforward and effective way to understand the current state of AI and pinpoint the trends shaping the industry in 2022.

This part consists of the following sections:

  • Word Clouds,
  • Understanding Trigrams,
  • Titles Preparation: Text Cleaning, Find and Replace Abbreviations, Stopword Removal, Lemmatization,
  • Extracting TOP-1000 Trigrams from Article Titles: Getting Frequency Table, Getting Rank Table,
  • Trigram Trends Uncovered,
  • Conclusion.

Word Clouds

Before preparing the article titles for trend analysis, we will utilize a simple yet powerful technique for visualizing word frequencies — Word Clouds.

  • What is a Word Cloud?

A Word Cloud is a visual representation of words, where the size of each word corresponds to its frequency or importance in the dataset. The larger a word appears in the cloud, the more frequently it occurs in the data.

Word Clouds are often used in data analysis, presentations, and marketing to highlight key themes, ideas, or concepts. They are especially useful for presenting complex data in an easily digestible format, allowing viewers to quickly grasp the main points of the analysis.

  • Generate Word Clouds

First, we load the dataset obtained in the first part of the series:

import pandas as pd

df = pd.read_csv("cs_articles.csv", sep = ",")

We use the wordcloud and matplotlib libraries to create multiple word cloud visualizations. The code loops through each unique year in the ‘year’ column of a Pandas DataFrame and generates a separate word cloud for each year:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

%matplotlib inline

def generate_wordcloud(year, titles):
    words = ' '.join(titles)
    wordcloud = WordCloud(
        width=800,
        height=400,
        random_state=21,
        max_font_size=110,
        background_color='white'
    ).generate(words)

    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.title(f'{year}', fontsize=20)
    plt.show()

for year in df['year'].unique():
    year_titles = df[df['year'] == year]['title'].tolist()
    generate_wordcloud(year, year_titles)
Word Cloud of Titles from Computer Science Papers on arXiv.org (2018–2022) for Each Year. Image by the author.

Though Word Clouds offer a visually captivating way to present information, they fall short when it comes to effectively tracking changes in word frequency and overall topics over time. To delve deeper into the evolution of topics across the years, we’ll turn our attention to the powerful method of identifying trigrams. This approach will allow us to better understand the shifting landscape of AI trends and interests.

Understanding Trigrams

Defining Trigrams

Trigrams are sequences of three consecutive words found within a text. They play a crucial role in natural language processing, assisting in various tasks such as language modeling, text classification, and information retrieval. Generating trigrams involves sliding a window of three words across the text.

In our analysis, we’ll focus on extracting trigrams from article titles. We are planning to transform article titles into sets of trigrams in the following manner:

Table of Article Titles and their Corresponding Trigrams. Table created by author.
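
As a quick illustration of the sliding-window idea, here is a minimal sketch on a made-up title (the full pipeline below first cleans the titles, so words like 'for' would already be removed at that point):

title = "deep reinforcement learning for autonomous driving"
words = title.split()
# Slide a window of three words across the title
trigrams = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
print(trigrams)
# ['deep reinforcement learning', 'reinforcement learning for',
#  'learning for autonomous', 'for autonomous driving']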

So before diving in, we’ll first prepare the text data and store it in a new column. Then, we’ll create a table showcasing the popularity of trigrams between 2018 and 2022, highlighting the hottest topics and trends in the AI landscape.

Titles Preparation

Text Cleaning

Text cleaning is an essential step in data preprocessing. By removing unnecessary characters such as punctuation and multiple spaces, we can standardize the text and make it easier to analyze. The following code defines a function called process_title that removes all punctuation except hyphens from a given title and eliminates any excess spaces:

import string

def process_title(title):
    # Create a translation table that replaces all punctuation except the hyphen with a space
    translation_table = str.maketrans(
        {char: ' ' if char not in '-' else char for char in string.punctuation}
    )

    # Replace punctuation in the title with a space, keeping hyphens
    title_processed = title.translate(translation_table)

    # Collapse multiple spaces in the title
    title_processed = ' '.join(title_processed.split())

    return title_processed

df['title_processed'] = df['title'].apply(process_title)

We preserve hyphens because, when extracting trigrams, it is more reasonable to treat words like ‘multi-agent’ or ‘pre-trained’ as single tokens. Hyphenated words often represent a single concept or entity, and splitting them into multiple words can lead to incorrect interpretations and inaccurate results in natural language processing tasks. By keeping hyphens intact, process_title ensures these words are not split, allowing for a more accurate analysis of titles.
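
A quick sanity check on a made-up title illustrates the behavior:

print(process_title("Multi-Agent Reinforcement Learning: A Survey!"))
# Multi-Agent Reinforcement Learning A Survey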

Find and Replace Abbreviations

After the titles have been cleaned, we identify any abbreviations they may contain. Abbreviations stand for stable combinations of words, so they need to be replaced with their full versions before we extract the trigrams. Identifying and standardizing abbreviations helps improve the accuracy and interpretability of text analysis results.

  • Extract Abbreviations

We have developed a custom function find_abbreviations that scans through the titles and extracts every word that is longer than two letters and written entirely in uppercase:

def find_abbreviations(text):
    words = text.split()
    result = []

    for word in words:
        if len(word) > 2 and word.isupper():
            result.append(word)

    return ", ".join(result)

The function then returns the abbreviations as a single string, with each abbreviation separated by a comma and a space. By applying this function to the title column of our dataset, we can easily generate a new column containing all of the abbreviations found in each title:

df["title_abbreviations"] = df["title"].apply(find_abbreviations)
df.head(7)
Data with added column of abbreviations present in each title (title_abbreviations column). Table generated by author.
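
For example, on a made-up title the function picks out only the fully uppercase tokens:

print(find_abbreviations("Federated Learning for MIMO Systems with CNN Denoisers"))
# MIMO, CNN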
  • Create a List of the Most Common Abbreviations

In the next step, we extract the TOP-200 most frequently occurring abbreviations from the “title_abbreviations” column and store them in a list:

abbs = [
    word
    for row in df['title_abbreviations']
    for word in row.split(", ")
    if word  # skip empty strings coming from titles without abbreviations
]

abbs_freq = pd.Series(abbs).value_counts()
top_200_abbs = abbs_freq.head(200)

abbs_list = " ".join(top_200_abbs.index)

We used ChatGPT to decode the frequently used abbreviations and saved the expanded output in a word_replacements.txt file. Some abbreviations may contain prepositions and punctuation marks, but we will handle this during the final stage of text processing:

An Example of Decoding Frequently used Abbreviations. List generated by author using chatGPT.
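
The replacement code below expects one abbreviation per line, separated from its expansion by a colon. The exact contents depend on the abbreviations found in your data; a few illustrative lines might look like this:

GAN: Generative Adversarial Network
CNN: Convolutional Neural Network
BERT: Bidirectional Encoder Representations from Transformers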
  • Replace Abbreviations

To replace abbreviations with their full versions, we use a Python dictionary and apply it to the “title_processed” column of the DataFrame using the code below. The dictionary is created by reading the list of word replacements from a text file; each key-value pair specifies an abbreviation and its corresponding expansion:

# Create a dictionary for replacing abbreviations in the 'title_processed' column
word_replacements = {}
with open('word_replacements.txt') as file:
    for line in file:
        key, value = line.strip().split(':', 1)
        word_replacements[key] = value

# Replace abbreviations in the 'title_processed' column
def replace_words(text):
    return ' '.join([word_replacements.get(word, word) for word in text.split()])

df['title_processed'] = df['title_processed'].apply(replace_words)

Next, we convert the “title_processed” column to lowercase, remove punctuation except hyphens (using the previously created function process_title), and eliminate extra spaces to clean and standardize the text data:

df['title_processed'] = df['title_processed'].str.lower().apply(process_title)

Stopword Removal

At this step, we need to remove stop words from the titles. Stop words are commonly occurring words in a language that are often filtered out from text analysis because they do not carry significant meaning. Examples of typical stop words in English include “a”, “the”, “and”, “in”, “is”, etc. By removing stop words, we can improve the accuracy and interpretability of text analysis results.

First, we download and import the “stopwords” module from the NLTK library, which provides a list of English stop words, and add our own custom stop words to the set using the ‘update’ method. We then apply the stop-word set to the ‘title_processed’ column using a lambda function with a list comprehension:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Get a set of stop words from the NLTK library
stop_words = set(stopwords.words('english'))

# Add custom stop words to the set
custom_stop_words = [
    'driving', 'driven', 'via', 'based', 'using'
]
stop_words.update(custom_stop_words)

def remove_stop_words(text, stop_words):
    return ' '.join([word for word in text.split() if word not in stop_words])

# Remove stop words from the 'title_processed' column
df['title_processed'] = df['title_processed'].apply(
    lambda x: remove_stop_words(x, stop_words)
)
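
As a small check (the exact result depends on NLTK’s stop word list), a lowercased title loses both the standard and the custom stop words:

print(remove_stop_words("a survey of attention based methods for vision", stop_words))
# survey attention methods vision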

Lemmatization

Lemmatization is the process of reducing words to their base or root form, which can help standardize text data and improve text analysis results.

For example, the words “optimizing”, “optimized”, and “optimizes” all have the same base form “optimize”. By applying lemmatization to these words, we can reduce them to their common base form, which can help eliminate redundant or variant forms of the same word. This process improves the accuracy and consistency of text analysis results.

We implement lemmatization on the ‘title_processed’ column using the “WordNetLemmatizer” module from the NLTK library. Since we do not pass part-of-speech tags, the lemmatizer treats each word as a noun, so in our pipeline it mainly normalizes plural forms (e.g., ‘networks’ → ‘network’). We create the WordNetLemmatizer object and apply it to the ‘title_processed’ column through a lambda function. Finally, we join the resulting words into a string and store them in a new ‘title_lem’ column:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text, lemmatizer):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Perform lemmatization on the 'title_processed' column
df['title_lem'] = df['title_processed'].apply(
    lambda x: lemmatize_text(x, lemmatizer)
)
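
A small example (made-up input) shows the effect on plural forms:

print(lemmatize_text("graph neural networks recommendation systems", lemmatizer))
# graph neural network recommendation system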

Extracting TOP-1000 Trigrams from Article Titles

In this central stage of our analysis, we shift our focus to extracting trigrams from article titles to uncover trends for each year under consideration. Trigrams are instrumental in text analysis tasks, such as topic modeling, which involves detecting patterns and themes within vast text collections. By extracting trigrams, we can identify recurring sequences of three words, offering valuable insights into the primary topics and themes present in the text data.

Our approach is based on examining trigram frequency within article titles. We suggest that trigrams appearing frequently in research article titles correspond to stable terms commonly used to describe particular areas or techniques within a given knowledge domain. The prevalence of these trigrams can therefore serve as a qualitative indicator of the popularity of specific research topics.

Extracting Trigrams

To do this, we’ll define a function called get_trigrams that generates trigrams from a given string. The function first splits the input string into a list of individual words using the ‘split’ method. Then, it uses list comprehension to iterate over each word in the list, creating a trigram by concatenating the current word with the two subsequent words using string interpolation.

We’ll apply the get_trigrams function to the ‘title_lem’ column using the ‘apply’ method with a lambda function. The resulting list of trigrams is stored in a new ‘trigrams’ column in the DataFrame and saved to cs_articles_processed.csv:

def get_trigrams(text):
    words = text.split()
    # Titles with fewer than three words yield an empty list
    trigrams = [
        f"{words[i]} {words[i + 1]} {words[i + 2]}"
        for i in range(len(words) - 2)
    ]
    return trigrams

df['trigrams'] = df['title_lem'].apply(get_trigrams)
df.to_csv("cs_articles_processed.csv", sep=",")
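
For instance, on an already cleaned and lemmatized title (a made-up example) we get:

print(get_trigrams("graph neural network recommendation system"))
# ['graph neural network', 'neural network recommendation', 'network recommendation system']

Titles with fewer than three remaining words simply produce an empty list.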

Getting Frequency Table

As we approach the end of our analysis, our focus shifts to determining the popularity of trigrams for each year.

Initially, we extract the most common trigrams for 2022 and create an empty table with these trigrams as the index. Subsequently, we iterate through each year, calculating the relative frequency of these trigrams and populating the table with these values. Ultimately, we remove rows with all zeros to create a concise and informative frequency table:

from collections import Counter

# Filter data for the year 2022
df_2022 = df[df['year'] == 2022]

# Get the 1100 most common trigrams for 2022 (the extra ones allow for merging later)
trigram_list = [
    trigram
    for trigrams in df_2022['trigrams']
    for trigram in trigrams
]

trigram_counter = Counter(trigram_list)
top_trigrams = trigram_counter.most_common(1100)

# Create a table indexed by the top 2022 trigrams; the year columns
# will hold the relative frequencies of these trigrams
table = pd.DataFrame(top_trigrams, columns=['Trigram', 'Frequency'])
table.set_index('Trigram', inplace=True)
table.drop('Frequency', axis=1, inplace=True)

# Fill the table with relative frequencies for each year
years = [2022, 2021, 2020, 2019, 2018]
for year in years:
    year_df = df[df['year'] == year]
    year_trigram_list = [
        trigram
        for trigrams in year_df['trigrams']
        for trigram in trigrams
    ]

    year_trigram_counter = Counter(year_trigram_list)
    total_trigrams = len(year_trigram_list)
    table[year] = 0

    for trigram, frequency in year_trigram_counter.items():
        if (
            trigram in table.index
            and frequency > 0
            and not all(word.strip() == '' for word in trigram.split())
        ):
            table.loc[trigram, year] = frequency / total_trigrams

# Drop rows with all zeros
table_freq = table.loc[(table != 0).any(axis=1)]

Some trigrams can be combined into a single quadrigram. Such trigrams are located close to each other in the frequency table — their frequency for each year is almost identical (differing by no more than one percent), and their names can be combined. That’s one of the reasons why we initially took 1100 of the most common trigrams to ultimately obtain the top 1000 at the end of this part.

For example, the two trigrams bidirectional encoder representation and encoder representation transformer together form the composite quadrigram bidirectional encoder representation (transformer), i.e., BERT.

Let’s implement such a combination as a function. The compare_trigrams function takes a frequency table as input and combines trigrams that have almost the same frequency across all years. It iterates over all pairs of trigrams and checks whether their frequencies are similar within a 1% margin. If so, the trigrams are combined into a single entry, and the second trigram is removed from the table. The function is called recursively until no more trigrams can be combined. The resulting table is a more concise representation of the trigram frequencies; we will save it to the trigram_frequency_table.csv file at the end of this part:

def compare_trigrams(table):
    # Check if there are at least 2 trigrams to compare
    if len(table) < 2:
        return table

    # Iterate over all pairs of trigrams
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            trigram_1 = table.index[i]
            trigram_2 = table.index[j]
            same_freq = True

            # Check if the frequency is (almost) the same for both trigrams in every year
            for year in table.columns:
                freq_1 = table.loc[trigram_1, year]
                freq_2 = table.loc[trigram_2, year]
                if abs(freq_1 - freq_2) > 0.01 * freq_1:
                    same_freq = False
                    break

            # If the frequencies match, combine the trigrams
            if same_freq:
                new_trigram = f"{trigram_1} ({trigram_2.split()[-1]})"
                # Replace the first trigram with the combined trigram
                table = table.rename(index={trigram_1: new_trigram})
                # Delete the row with the second trigram
                table = table.drop(trigram_2)
                # Recursively call the function with the updated table
                return compare_trigrams(table)

    return table

table_freq = compare_trigrams(table_freq)
table_freq.head(10)
Frequency Table of the 10 Most Popular Trigrams for 2022 in Computer Science Articles Titles from arXiv.org (2018–2022). Table created by author.
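
To see the merging logic in action, here is a tiny synthetic check (the numbers are made up and are not taken from the real frequency table): two trigrams whose yearly frequencies differ by less than 1% collapse into a single row.

demo = pd.DataFrame(
    {2022: [0.0100, 0.0100], 2021: [0.00500, 0.00502]},
    index=['bidirectional encoder representation', 'encoder representation transformer']
)
print(compare_trigrams(demo))
# A single row remains, named 'bidirectional encoder representation (transformer)'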

Getting Rank Table

The rank table offers a more insightful representation than the frequency table, as it highlights the shifts in a specific trigram’s position within the overall trigram rankings for each year. In some cases, multiple trigrams have identical frequencies and would therefore receive the same rank. To refine the 2022 rankings, we use the 2021 frequencies: if two trigrams have the same frequency in 2022, the one with the higher frequency in 2021 is placed higher in the ranking.

To construct the rank table, we first create a copy of the frequency table and initialize a new DataFrame to store the ranks. The code then loops through the years, sorting the frequencies in descending order and assigning a rank to each trigram based on its position. To break ties in the 2022 rankings, we add a tiny value proportional to each trigram’s 2021 frequency to its 2022 frequency. Finally, the ranks are added to the table and converted to integers for easier analysis. We obtain the TOP-1000 highest-ranked trigrams by filtering the table:

# Create a copy of table_freq to work with
table_copy = table_freq.copy()

# Create a DataFrame to store the ranks
table_rank = pd.DataFrame(columns=years, index=table_copy.index)

# Iterate over the years
for year in years:
    # Get the frequencies for the year
    freqs = table_copy[year]

    if year == 2022:
        # Break ties in 2022 by adding a tiny value proportional to the 2021
        # frequency, so the trigram that was more popular in 2021 ranks higher
        tiebreaker = table_copy[2021] * 1e-10
        freqs = freqs + tiebreaker

    # Sort the frequencies in descending order
    sorted_freqs = freqs.sort_values(ascending=False)

    # Get the rank for each trigram (1 = most frequent)
    ranks = sorted_freqs.rank(ascending=False, method='first')

    # Add the ranks to the table_rank DataFrame
    table_rank[year] = ranks

table_rank = table_rank.astype(int)
table_rank = table_rank[table_rank[2022] <= 1000]
table_rank = table_rank.sort_values(by=2022, ascending=True)
table_rank.head(10)
Rank Table of the 10 Most Popular Trigrams for 2022 in Computer Science Articles Titles from arXiv.org (2018–2022). Table created by author.

We store the TOP-1000 trigram frequencies in the trigram_frequency_table.csv file and the ranks in the trigram_rank_table.csv file:

table_freq.to_csv("trigram_frequency_table.csv", index=True, sep=',')
table_rank.to_csv("trigram_rank_table.csv", index=True, sep=',')

We’ve just successfully generated and saved the 1000 Most Popular Trigrams for 2022 in Computer Science Articles from arXiv.org (2018–2022).

Trigram Trends Uncovered

Diving into the world of trigrams, we’ve seen a fascinating evolution in the TOP-5 over the years, reflecting the ever-changing landscape of computer science:

  1. Graph Neural Network (GNN) is the new hot topic, skyrocketing from a humble rank 44 in 2018 all the way to the top spot in 2022. Graph-based machine learning techniques are now all the rage.
  2. The reliable Deep Neural Network (DNN) has kept a firm grip on its high-ranking status, holding the 2nd position for most years — deep learning techniques remain crucial in AI research.
  3. Deep Reinforcement Learning (Deep RL) has consistently held its own among the top positions, bobbing between ranks 3 and 5 throughout the years. This shows the ongoing dedication to reinforcement learning as a way to tackle complex decision-making challenges.
  4. Convolutional Neural Network (CNN) was the darling of the computer science world from 2018 to 2020, but it slipped to rank 4 in 2022. With new techniques like transformers entering the computer vision arena, the field is ripe for innovation.
  5. Multiple Input Multiple Output (MIMO) has been the steady workhorse in the rankings, securing spots between ranks 5 and 7 over the years. This reflects a lasting interest in wireless communications research, where MIMO antenna systems remain a core topic.

As we can see, trigram rankings offer a wealth of insights into the shifting focus and priorities within computer science. To uncover even more valuable information about AI topics and trends, we need to visualize the frequency table (trigram_frequency_table.csv) and rank table (trigram_rank_table.csv).

Conclusion

We have successfully extracted the 1000 Most Popular Trigrams for 2022 from titles of computer science articles on arXiv.org (2018–2022), providing a solid foundation for a deeper exploration of the most prominent trends and subjects in the field. We also gained initial insights into hot topics in the field of AI. Throughout this process, we learned valuable techniques for text preprocessing, including:

  • Text cleaning,
  • Finding and Replacing Abbreviations,
  • Stopword Removal,
  • Lemmatization,
  • Extracting Trigrams,
  • Generating Frequency and Rank Tables.

The complete code for this process can be found in my GitHub repository.

Don’t miss Part 3: Picturing the Present — Visualizing the Rise and Fall of Topics, where we’ll transform these data tables into striking visualizations that effectively showcase the most influential topics and trends in artificial intelligence, making it easy to grasp the essence of the current state of the field. Join us as we delve into the fascinating world of AI through visually compelling representations, building on the knowledge and skills we have acquired in text preprocessing and analysis!
