Analyzing ArXiv data using Neo4j — Part 1

Sep 6, 2020

Exploring the public ArXiv dataset with Neo4j

All scientists know the famous website ArXiv, which hosts over 1.7 million scientific papers in fields such as mathematics, physics, computer science and economics (and the list is not exhaustive!).

Recently, Cornell University, which has been managing ArXiv for 30 years, released a dataset containing all the articles of the platform in the public domain. Information about this dataset can be found here: https://www.kaggle.com/Cornell-University/arxiv (see also the introduction blog post here).

The dataset uploaded to Kaggle contains metadata for each article (DOI, title, authors, categories, abstract…). Although the full PDFs are also accessible, we will only use this metadata file in this post.

In this post, we will go through this dataset and import the data into Neo4j for further analysis. The steps we are going to follow are:

Import the data into Neo4j using the Neo4j import tool
Simple data analysis

Data import

Since the dataset is quite large, we will import it using the Neo4j import tool, which is very fast. It processes input CSV files containing the node and relationship definitions. Some parsing and formatting is needed beforehand so that the tool can understand the data, so we are going to go through these steps:

Data parsing: read the raw data
Data cleaning: remove duplicates
Data formatting: format the data for Neo4j
And finally, import the data

Parsing raw JSON

Each row in the raw data contains a JSON element with the following keys:

{
    "id": str,
    "submitter": str,
    "authors": str,  # authors as they appear in the paper
    "title": str,
    "comments": str,
    "journal-ref": str,
    "doi": str,
    "report-no": str,
    "categories": str,   # space separated list of categories
    "license": str,
    "abstract": str,
    "versions": list,
    "update_date": str,
    "authors_parsed": list
}

We are going to use the id, title, doi and authors_parsed fields in the following to create a graph whose schema is:

(Author)-[:WROTE]->(Article)

As a first step, we will parse each row as JSON and extract the information we need to create a graph:

The article ID (that will be its unique identifier), its title and DOI
The article’s authors. The author ID is the concatenation, with underscores, of all the elements in the authors_parsed entry of each author (an empty last element leaves a trailing underscore).
E.g.

[
  ["Balázs", "C.", ""],      # ==> "Balázs_C._"
  ["Berger", "E. L.", ""],   # ==> "Berger_E. L._"
]

Probably, we will have some homonyms, but let’s ignore this for now.

A list of (article ID, author ID) pairs (the relationships)
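The author ID rule can be checked in isolation before running it on the whole file. The sketch below wraps the join in a helper called make_author_id, a name introduced here for illustration only. It shows that a plain "_".join keeps the empty third element of each entry, which leaves a trailing underscore in the ID.

```python
def make_author_id(parsed_author):
    """Join the elements of one authors_parsed entry with underscores."""
    return "_".join(parsed_author)


# Same two entries as in the example above.
authors_parsed = [
    ["Balázs", "C.", ""],
    ["Berger", "E. L.", ""],
]

author_ids = [make_author_id(a) for a in authors_parsed]
print(author_ids)
```

The trailing underscore is harmless as long as the same rule is applied everywhere, since the ID only has to be consistent between the node and relationship files.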

So, here is the Python code to extract these three lists out of the raw data:

import json

# extracting data
author_list = []
article_list = []
article_to_author_list = []

with open("arxiv-metadata-oai-snapshot.json", "r", encoding="utf-8") as f:
    for line in f:  # the file holds one JSON document per line
        d = json.loads(line)
        article_id = d["id"]
        
        article_list.append(
            {
                "articleId": article_id,
                "doi": d["doi"],
                "title": d["title"],
            }
        )
        
        authors = d["authors_parsed"]
        for a in authors:
            author_id = "_".join(a)
            
            author_list.append({
                "authorId": author_id, 
                "name": a[0]
            })
            article_to_author_list.append({
                "articleId": article_id, 
                "authorId": author_id,
            })

Some more steps are still required before the data can be processed by Neo4j, so let’s create pandas DataFrames from these lists, which will make the data processing step easier:

import pandas as pd

article_df = pd.DataFrame.from_records(article_list)
author_df = pd.DataFrame.from_records(author_list)
article_to_author_df = pd.DataFrame.from_records(article_to_author_list)

The lengths of these dataframes are:

len(article_df)=1753042
len(author_df)=7268183
len(article_to_author_df)=7268183

Data cleaning

By default, the Neo4j import tool fails if several rows have the same node ID, so we need to clean the data and remove these duplicates. For the author file, we can just keep the first occurrence using the drop_duplicates method of pandas. Even though the article ID should be unique, it turns out that there are a couple of duplicates, probably due to mistakes in data extraction. Since there are very few of them, we will apply the same rule and keep only the first occurrence. Finally, we make sure that the IDs in the relationship data are present in each of the node files:

article_df = article_df.drop_duplicates(subset=["articleId"])
author_df = author_df.drop_duplicates(subset=["authorId"])
article_to_author_df = article_to_author_df[
    article_to_author_df.articleId.isin(article_df.articleId)
    & article_to_author_df.authorId.isin(author_df.authorId)
]

NB: we could also have used the option --skip-duplicate-nodes=true when importing the data, but I prefer to control here which rows are kept.
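To make the cleaning rule concrete, here is a minimal sketch on toy records. It is written in plain Python rather than pandas so that it runs without any dependency. The helper keep_first and all sample values are made up for illustration; the logic mirrors drop_duplicates with its default behaviour of keeping the first occurrence, followed by the isin filter on the relationships.

```python
def keep_first(records, key):
    """Keep only the first record seen for each value of `key`."""
    seen = {}
    for record in records:
        # setdefault stores the record only if the key is not there yet
        seen.setdefault(record[key], record)
    return list(seen.values())


# Toy data: article "a1" appears twice, and one relationship points to an
# author ID ("ghost") that has no corresponding node.
articles = [
    {"articleId": "a1", "title": "First row"},
    {"articleId": "a1", "title": "Duplicate row"},
    {"articleId": "a2", "title": "Another paper"},
]
authors = [{"authorId": "Smith_J._", "name": "Smith"}]
wrote = [
    {"articleId": "a1", "authorId": "Smith_J._"},
    {"articleId": "a2", "authorId": "ghost"},
]

articles = keep_first(articles, "articleId")
authors = keep_first(authors, "authorId")

# Keep only relationships whose two endpoints exist in the node lists.
article_ids = {a["articleId"] for a in articles}
author_ids = {a["authorId"] for a in authors}
wrote = [
    r for r in wrote
    if r["articleId"] in article_ids and r["authorId"] in author_ids
]

print(articles)
print(wrote)
```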

Preparing data for the Neo4j import tool

To use this tool, we will need three CSV files:

authors.csv containing unique authors, with the following headers: authorId:ID,name
articles.csv containing unique articles: articleId:ID,doi,title
wrote.csv containing the relationship between authors and articles: :START_ID,:END_ID

The :ID, :START_ID and :END_ID annotations are required to tell Neo4j which column of the CSV file is the unique node identifier and which columns contain the identifiers of the start and end nodes of the relationship.
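As an illustration of the expected layout, the sketch below builds a one-row version of each file in memory with the standard csv module. The helper to_csv_text and every value in the rows are made up; only the header names come from the list above.

```python
import csv
import io


def to_csv_text(header, rows):
    """Serialize a header and some rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


authors_csv = to_csv_text(["authorId:ID", "name"], [["Smith_J._", "Smith"]])
articles_csv = to_csv_text(
    ["articleId:ID", "doi", "title"],
    [["a1", "10.0000/toy", "A toy title, with a comma"]],
)
# The relationship goes from the author (start) to the article (end).
wrote_csv = to_csv_text([":START_ID", ":END_ID"], [["Smith_J._", "a1"]])

print(authors_csv)
print(articles_csv)
print(wrote_csv)
```

Note that the title containing a comma is quoted automatically by the writer, which is what keeps the columns aligned when the file is read back.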

So let’s rename the columns of our DataFrames that require such annotations:

article_df = article_df.rename(columns={"articleId": "articleId:ID"})
author_df = author_df.rename(columns={"authorId": "authorId:ID"})
article_to_author_df = article_to_author_df.rename(columns={
    "authorId": ":START_ID",
    "articleId": ":END_ID",
})

Finally, our dataframes are ready and we can save them to CSV, excluding the index which is not relevant here:

article_df.to_csv("articles.csv", index=False)
author_df.to_csv("authors.csv", index=False)
article_to_author_df.to_csv("wrote.csv", index=False)

In the end, we have three CSV files that we will be able to import into a newly created Neo4j graph. Let’s go ahead.

Finally, importing the data

To import this data, follow these steps:

Create a new Neo4j graph WITHOUT STARTING IT YET
Through the management tab, open a terminal and copy the CSV files we have just created into the import folder
Run the following command from the NEO4J_HOME (default location when you open a terminal from your graph management panel):

bin/neo4j-admin import \
    --nodes=Article=import/articles.csv \
    --nodes=Author=import/authors.csv \
    --relationships=WROTE=import/wrote.csv \
    --multiline-fields=true

Neo4j will process the data for you, and after a few seconds, you should see something like:

IMPORT DONE in 16s 109ms. 
Imported:
  3155051 nodes
  7268183 relationships
  7223379 properties
Peak memory usage: 755.5MiB

which means the import completed successfully.

You can start the database and it’s now time for some data analysis!

Analyzing data

Number of articles per author

There are many things we can do with a graph. For instance, we can compute the number of articles per author:

MATCH (author:Author)-[:WROTE]->(paper:Article)  // pattern matching
WITH author, count(paper) as numberOfArticles  // aggregation
RETURN author.name, numberOfArticles    // result
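For readers less familiar with Cypher, the aggregation step can be pictured with a plain Python counter. This is only a sketch on made-up (author ID, article ID) pairs, not a replacement for the query.

```python
from collections import Counter

# Toy WROTE relationships as (authorId, articleId) pairs; values are made up.
wrote = [
    ("Smith_J._", "a1"),
    ("Smith_J._", "a2"),
    ("Zhang_Y._", "a1"),
]

# Rough equivalent of: WITH author, count(paper) AS numberOfArticles
number_of_articles = Counter(author_id for author_id, _ in wrote)

for author_id, count in number_of_articles.items():
    print(author_id, count)
```

On the real graph, the Cypher query above does this counting inside Neo4j, one group per Author node.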

We can run this query from Python to extract the data again in a dataframe:

import neo4j
import pandas as pd

driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "<password>"))

with driver.session() as session:
    res = session.run(
        """MATCH (author:Author)-[:WROTE]->(paper:Article)
           WITH author, count(paper) as numberOfArticles
           RETURN author.name, numberOfArticles
        """
    )
    data = res.data()
    df = pd.DataFrame.from_records(data)
df.head()

Let’s get an idea of the distribution of this numberOfArticles variable with a df.describe():

           numberOfArticles
count      1.402012e+06
mean       5.184109e+00
std        1.743480e+01
min        1.000000e+00
25%        1.000000e+00
50%        1.000000e+00
75%        3.000000e+00
max        1.513000e+03

Apparently, we have at least one outlier with more than 1000 papers (A LOT!), but most of the distribution is between 1 and 3 papers per author. The distribution in the range [0, 20] is reproduced below:
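A histogram restricted to that range could be computed along these lines. This is a sketch on made-up counts, using only the standard library and a text rendering instead of a plotting library, so the numbers it prints are not those of the real dataset.

```python
from collections import Counter

# Made-up numberOfArticles values, one per author (not the real data).
number_of_articles = [1, 1, 1, 2, 3, 3, 7, 25, 1513]

# Keep only the values inside the displayed range, then count the number
# of authors for each value.
max_value = 20
histogram = Counter(n for n in number_of_articles if n <= max_value)

for n in range(1, max_value + 1):
    if histogram[n]:
        print(f"{n:>2} | {'#' * histogram[n]}")
```

With the dataframe built earlier, the same restriction would be df[df["numberOfArticles"] <= 20] before plotting.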

If we look more carefully at the authors with the highest number of papers:

df.sort_values("numberOfArticles", ascending=False).head(20)

we’ll find the ATLAS and CMS collaborations, which are international groups made of hundreds of physicists, so the fact that they have issued almost 1000 papers each is not surprising (fun fact: I was part of the ATLAS collaboration during my Ph.D.). On the other hand, we find names such as Zhang, Li or Smith, which are very common and are likely affected by the homonym issue we talked about at the beginning of this post.

In a follow-up post, we will study the community structure of this dataset using the Neo4j Graph Data Science library!

Stay tuned!

PS: want to practice by yourself? Try to import the articles’ categories as well! You will then be able to answer questions such as: “how many articles are there in each category?” or “how many categories does a given author contribute to on average?”
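As a starting point for this exercise, here is one possible sketch, not a full solution. The two input rows are made up, and the idea of treating each category as a node linked to its articles is a suggestion rather than something defined earlier in this post. The split tolerates both spaces and commas, since the exact separator is worth checking against the raw file.

```python
import json

# Two made-up rows in the same one-JSON-document-per-line layout.
raw_lines = [
    '{"id": "a1", "categories": "hep-ph hep-ex"}',
    '{"id": "a2", "categories": "hep-ph"}',
]

category_ids = set()        # unique category nodes
article_to_category = []    # article -> category relationships

for line in raw_lines:
    d = json.loads(line)
    # Tolerate both space- and comma-separated category lists.
    for category in d["categories"].replace(",", " ").split():
        category_ids.add(category)
        article_to_category.append(
            {":START_ID": d["id"], ":END_ID": category}
        )

print(sorted(category_ids))
print(len(article_to_category))
```

Written to CSV in the same way as the author and article files, these two collections could be passed to the import tool as one more --nodes and one more --relationships argument.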