Text Classification of Quantum Physics Papers

Anna Kómár
Published in The Startup
10 min read · Sep 8, 2020

Have you ever gone looking for the most recent research findings in machine learning or AI and found yourself on arxiv.org, perusing the Computer Science section? (If you haven’t, you should definitely check it out; it’s a great open-access repository of the latest research findings in a variety of topics.) How about browsing vixra.org?

If you have heard of the first but not the second, you’re not alone. For an in-depth comparison of the two, I refer the reader to each website’s Wikipedia article (arXiv, viXra); what’s important to us, however, is that the two sites are very different. It’s enough to look at the first paragraphs of two papers side by side to see the distinct styles:

A paper from arxiv.org [1]
A paper from viXra.org [2]

While neither website runs a strict peer-review process, arxiv.org has stricter submission guidelines, and as a result the papers there are usually high-quality research. The guidelines of vixra.org, meanwhile, are more lax, and notes published there tend to be less mathematically rigorous and more speculative in nature.


The question I wanted to answer is: while a researcher in Quantum Physics can usually tell the difference between two papers from each of these websites, would an algorithm be able to tell whether a paper originates from arxiv.org or from vixra.org?

Acquiring data

For this classification task, I decided to focus on a research area that exists on both arXiv and viXra: quantum physics. This way, the classification won’t focus on the topic (for example, astrophysics vs. quantum physics papers will have wildly different verbiage and style).

Both websites require all uploads to be a rendered, readable pdf file. Most arXiv uploads also have the source files from which the pdf was compiled; these are plain-text TeX files. As I ultimately need text from the papers, the best approach would be to get these text-based source files. However, not all arXiv papers have them, and most viXra papers don’t have them either (they are usually pdf versions of a Microsoft Word document).

A quicker solution is to download only the pdfs from both sources and use the pdfminer Python library to extract the text.

I used Python’s requests library together with BeautifulSoup to find the specific pdf urls and download 600+ papers from each website. Here’s a code snippet for finding all papers in the “quant-ph” category of arXiv, uploaded in August 2020:

import re
import requests
from bs4 import BeautifulSoup

# Get the listing of all papers submitted in August 2020
topic = 'quant-ph'
year = '20'
month = '08'
max_records = '1000'
base_url = f"https://arxiv.org/list/{topic}/{year}{month}?show={max_records}"
r = requests.get(base_url)
soup = BeautifulSoup(r.content, "html.parser")
sub_urls = soup.find_all("a")  # Find all urls on this page

# Cycle through all urls and find the ones pointing at a paper
quant_ph_papers = []
for item in sub_urls:
    url = item.get('href')
    if url:  # If this item has a url, grab the paper identifier
        result = re.search(r'/pdf/(\d\d\d\d\.\d+)', url)
        if result:
            quant_ph_papers.append(result.group(1))

After we have a list of papers we’re interested in downloading, we simply need to cycle through them and download them one-by-one:

import requests
import time

# Cycle through all paper identifiers found, and download each pdf
for paper_id in quant_ph_papers:
    r = requests.get(f"https://arxiv.org/pdf/{paper_id}.pdf")
    if r.status_code == 200:
        with open(f"./arxiv/{paper_id}.pdf", 'wb') as f:
            f.write(r.content)
    # Sleep for 1 second after each download
    time.sleep(1)

Downloading papers from viXra follows the same pattern, except that using the requests library out of the box results in a 406 response from vixra.org. This is likely the maintainers of viXra trying to prevent programmatic, mass access to their website, since changing the User-Agent from the default “python-requests” one circumvents the problem. For example, you can set a browser-like User-Agent on your GET request:

base_url = f"https://vixra.org/{topic}/{year}{month}"
r = requests.get(base_url, headers={"User-Agent": "Mozilla/5.0 (Linux; Android 7.0; SM-T827R4 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.116 Safari/537.36"})
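Here is what the rest of the viXra download step could look like, in the same spirit as the arXiv snippet above. Note that the listing URL, the pdf link pattern, and the output directory are assumptions about viXra’s page structure and may need adjusting:

import re
import time
import requests
from bs4 import BeautifulSoup

# The custom User-Agent avoids the 406 response; the listing URL and the
# ".pdf" link pattern below are assumptions about viXra's page structure.
headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 7.0; SM-T827R4 Build/NRD90M) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/60.0.3112.116 Safari/537.36"}
base_url = "https://vixra.org/quant/2008"  # hypothetical listing: quantum physics, Aug 2020
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# Collect links that look like direct pdf links, then download them one by one
for item in soup.find_all("a"):
    url = item.get("href")
    if url and re.search(r"\.pdf$", url):
        pdf_url = url if url.startswith("http") else f"https://vixra.org{url}"
        pdf = requests.get(pdf_url, headers=headers)
        if pdf.status_code == 200:
            # Assumes a ./vixra directory already exists
            with open(f"./vixra/{pdf_url.split('/')[-1]}", "wb") as f:
                f.write(pdf.content)
        time.sleep(1)  # be polite to the server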

Extracting text

After acquiring the pdf files, the next step is to extract the text. Loading and performing the extraction on a pdf file is a slow process, so it’s best to do this upfront, and only once:

import os
from pdfminer.high_level import extract_text

arxiv_files = [f for f in os.listdir('./arxiv')]
for fn in arxiv_files:
    # Load pdf and extract its text
    try:
        full_text = extract_text('./arxiv/' + fn)
        failed = False
    except Exception as e:
        print(f"Error {e} happened:")
        print(f">>Failed to parse {fn}")
        failed = True

    if not failed:
        # Save the extracted text as a text file
        with open(f"texts/arxiv/{fn[0:-4]}.txt", 'wb') as f:
            f.write(full_text.encode('utf8'))

Represent each paper as a set of features

We have all our labeled data on our hard drive — now what?

For the purposes of text classification, we’ll need to create a set of features from each paper. For this part of the tutorial, I will assume that the reader is familiar with basic NLP concepts like stop words, tokenization, and vector representations of tokens, but hasn’t done text classification before (as I hadn’t!).

Creating features can be approached in several ways. Some ideas for features I found helpful are:

  1. Use the number of words in the paper and the average number of characters in a word (a minimal sketch of this appears after the list);
  2. Bag-of-words model, i.e., counting the occurrence of each unique word in the text;
  3. Creating a vector representation of each word, and averaging all representations within the text;
  4. Feeding vector representations of words through a trained neural net to create a representation of the text.
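For completeness, here is what idea 1 could look like as code. This is a minimal sketch using a naive whitespace split; it is not something used later in the post:

import numpy as np

def simple_features(text: str) -> np.ndarray:
    """Feature idea 1: word count and average word length (in characters)."""
    words = text.split()  # naive whitespace tokenization
    n_words = len(words)
    avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
    return np.array([n_words, avg_word_len])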

As I wanted to try advanced techniques in this project, I started with 3 and 4.

Using spaCy to create an averaged vector representation

spaCy is an NLP package with pre-trained models that lets you tokenize and vectorize words quite easily. I used their large English vectors model:

import numpy as np
import spacy

nlp = spacy.load('en_vectors_web_lg')

def get_average_token_for_text(text: str) -> np.ndarray:
    tokens = nlp(text)
    vectors = [token.vector for token in tokens]
    all_vecs = np.array(vectors)
    vec = np.mean(all_vecs, axis=0)
    return vec

This approach vectorizes each word/token individually, then takes an element-wise average of the resulting vectors throughout the text.

The resulting representation is a single vector that can be interpreted as a “master token” standing in for the entire text: essentially a one-word summary of it.

I applied this method to all texts downloaded from arXiv and viXra.

Without further preprocessing, I trained a couple of different classifiers (Logistic Regression, Random Forest, K Nearest Neighbors, Naive Bayes) on this data, with a 70–30% train-test split (training set: 942, test set: 405 papers, with a close to 50–50% allocation for each category).
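For reference, the training setup looked roughly like the following sketch. The feature matrix X, the label vector y (0 for arXiv, 1 for viXra), and the GaussianNB variant of Naive Bayes are assumptions filled in for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

# Placeholders: X is an (n_papers, 300) array of averaged spaCy vectors,
# y holds the labels (0 = arXiv, 1 = viXra)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(f"{name}: F1 = {f1_score(y_test, preds):.3f}")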

Logistic Regression produced the best results, outperforming the others by about 5–10 percentage points in the resulting F1-score (0.897).

Using BERT with the first 512 tokens

BERT is another state-of-the-art language model, developed by Google [published in 2018 — look, an arXiv paper!]. Using BERT’s pre-trained tokenizers and models allows you to leverage already learned representations of words, as well as representations of texts. The latter, contrary to the simplistic approach we took with spaCy above, takes structure and word order into account.

I used the base cased BERT model, available through Hugging Face’s transformers Python library. This base model can take at most 512 tokens. It then creates a representation of these by feeding a combination of the word vectors (a tensor) through a pre-trained neural network. The output of the network is a single vector, representing the entire text that was fed into it.

from typing import List
import numpy as np
import torch
from transformers import BertModel, BertTokenizer, BertConfig

# Load BERT tokenizer and model
model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
config = BertConfig.from_pretrained(model_name)

def get_tokens(text: str, tokenizer: BertTokenizer, config: BertConfig) -> List[str]:
    tokens = tokenizer.tokenize(text)
    max_length = config.max_position_embeddings
    # Truncate so that the [CLS] token plus the text fit within the model's limit
    tokens = tokens[:max_length - 1]
    tokens = [tokenizer.cls_token] + tokens
    return tokens

def make_vector(text: str) -> np.ndarray:
    tokens = get_tokens(text, tokenizer, config)
    token_ids: List[int] = tokenizer.convert_tokens_to_ids(tokens)
    token_ids_tensor = torch.tensor(token_ids)
    token_ids_tensor = torch.unsqueeze(token_ids_tensor, 0)  # Add a batch dimension
    last_hidden_state, pooler_output = model(token_ids_tensor)
    vector = pooler_output  # The pooled representation of the whole text
    np_vector = vector.detach().numpy()
    np_vector = np_vector.squeeze()
    return np_vector

Applying this representation to all papers, Logistic Regression is still the best choice for classifier, and it yields results comparable to the spaCy model.

However, as this implementation of BERT only takes into account the first 512 tokens of every paper, I wanted to try explicitly adding the length of the text as a feature. I chose character length and tried both adding it as a raw feature and applying standard scaling to all features before classification. While this step in itself didn’t improve the model much, it nicely illustrates the need for feature scaling!
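In sketch form, the length feature and the scaling step could look like this; the variable names (bert_vectors, raw_texts) are placeholders rather than names from my actual code:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholders: bert_vectors is an (n_papers, 768) array of pooled BERT outputs,
# raw_texts is the list of corresponding full texts
char_lengths = np.array([len(t) for t in raw_texts]).reshape(-1, 1)
X = np.hstack([bert_vectors, char_lengths])

# Without scaling, the character-length column (tens of thousands) dwarfs the
# BERT components (order 1), which hurts Logistic Regression; scaling evens this out
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))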

Performance of different feature encoding approaches [3]

How to improve

In order to further improve the models, I inspected the misclassified papers. Some arXiv papers that were misclassified are: a typical single-author arXiv paper, a paper on the consequences of living inside a computer simulation, a paper touching on philosophy and Gödel’s theorem, among others.

Some of these papers have a style or topic that can indeed be misconstrued as the other category. However, the terms and rigor used throughout the papers should be sufficient indication to identify these correctly.

Maybe I tried to use word order and correlations when I should simply have used the list of words as features? I took a step back and tried a simple bag-of-words model.

Bag-of-words model

This model represents each text as a simple “dictionary” of sorts: each word is assigned an id, and for each text every id is mapped to a count, the number of times that word appears in the text.

It is not too difficult to write an implementation of bag-of-words, but for this exercise I used BERT’s readily available tokenizer and token ids. I needed to modify the previous “get_tokens” method so it doesn’t cut the text off after 512 tokens:

from typing import List
from transformers import BertTokenizer

def get_tokens_unlimited(text: str, tokenizer: BertTokenizer) -> List[str]:
    tokens = tokenizer.tokenize(text)
    tokens = [tokenizer.cls_token] + tokens
    return tokens

def token_ids_to_bow(id_list: List[int]) -> dict:
    # Count how many times each token id appears in the text
    bow = dict()
    for i in id_list:
        if i in bow:
            bow[i] += 1
        else:
            bow[i] = 1
    return bow

Then, we loop through all files and build a matrix over all the token ids encountered:

import os
import pickle
import pandas as pd
from transformers import BertTokenizer

# Load BERT tokenizer
model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)

arxiv_files = [f for f in os.listdir('./texts/arxiv')]
df = pd.DataFrame()
i = 0
for fn in arxiv_files:
    try:
        with open(f"texts/arxiv/{fn}", 'r', encoding='utf8') as f:
            full_text = f.read()
        failed = False
    except Exception as e:
        print(f"Error {e} happened:")
        print(f">>Failed to parse {fn}")
        failed = True
    if not failed:
        tokens = get_tokens_unlimited(full_text, tokenizer)
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        bow = token_ids_to_bow(token_ids)

        # One row per paper, one column per token id seen so far
        df_i = pd.DataFrame(data=bow, index=[i])
        df = pd.concat([df, df_i])
        i += 1

with open('bow/arxiv/arxiv_bow_features.p', 'wb') as f:
    pickle.dump(df, f)

After filling all missing values with “0” and applying standard scaling before classification, this model outperforms all previous ones.
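As a sketch, that preparation step could look like the following; the DataFrame names df_arxiv and df_vixra and the label convention (0 for arXiv, 1 for viXra) are placeholders:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholders: df_arxiv and df_vixra are the bag-of-words DataFrames built above
df_all = pd.concat([df_arxiv, df_vixra], ignore_index=True)
df_all = df_all.fillna(0)  # a token id missing from a paper means zero occurrences

y = np.array([0] * len(df_arxiv) + [1] * len(df_vixra))
X = StandardScaler().fit_transform(df_all.values)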

Here is a comparison of ROC curves for each model considered:

ROC curves of Logistic Regression classifier, with different feature encodings [4]
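If you want to produce similar curves, a minimal scikit-learn sketch looks like this; clf, X_test, and y_test stand in for a fitted classifier and a held-out test split:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Placeholders: clf is a fitted classifier with predict_proba,
# (X_test, y_test) is the held-out test split
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance-level diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()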

What do we still misclassify?

Some arXiv papers that are still misclassified are: a single-author paper on a unique interpretation of quantum mechanics, a short note meant as a “Comment” on a previous paper, and a response to a previous comment.

A number of these indeed have a different, more informal format and tone!

Some papers from viXra.org that are misclassified are: a paper about compressed data, measurements, and observations; one about applying quantum annealing to a scheduling problem; and an addendum by a former colleague of mine to a previous arXiv paper.

All in all, not bad! A number of these seem rigorous enough that they could just as well have been published on arXiv.

Conclusions

In summary, we have achieved quite good classification results with most models. Typically, arXiv and viXra papers differ in length, structure, rigor, and verbiage, so it is usually quite clear which category a given paper falls into. This allows us to approach the problem with different models and achieve similar results.

It turns out that the classic bag-of-words model for feature encoding, combined with Logistic Regression for classification, performed best. This should reinforce the practice of always starting simple and adding complexity only as needed!

A bit of data exploration also tells us that bag-of-words is the way to go. This is based on 600+ papers from each category [5]
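The word-frequency comparison behind this figure can be reproduced roughly as follows; this sketch reuses the bag-of-words DataFrame and the BERT tokenizer built above (the name df_arxiv is a placeholder):

# Reuses the bag-of-words DataFrame and the BERT tokenizer from the snippets above;
# df_arxiv is a placeholder name for the arXiv feature DataFrame
totals = df_arxiv.fillna(0).sum(axis=0).sort_values(ascending=False)
top_ids = [int(i) for i in totals.head(50).index]
top_words = tokenizer.convert_ids_to_tokens(top_ids)
print(list(zip(top_words, totals.head(50).astype(int).tolist())))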

What’s next? Well, if you’re interested in research paper classification, there’s a number of avenues to explore:

  • Can we build a good classifier with even simpler features, e.g., only the text length and the 50 most frequent words? What are the most important features?
  • Can we improve on the current classifiers by toggling capitalization, removing stop words, and other preprocessing steps?
  • Can we extend this classifier to other topics? I looked at quantum physics only. What about a topic-agnostic arXiv vs. viXra classifier?
  • Can we build something more sophisticated for arXiv papers, like finding the closest paper (stylistically) to another paper?
  • What about identifying the author(s) of a new paper, based on its similarity to papers already published?

References

Some of the code and intuition for how to use spaCy and BERT for encoding documents into a single vector comes from Jonathan Mugan’s NLP videos.

Figure sources:

[1] R. Yousefjani, A. Bayat, Parallel entangling gate operations and two-way quantum communication in spin chains (2020), arxiv.org

[2] J. Deutsch, Duality or Unity in Quantum Mechanics? (2020), vixra.org

[3] A. Kómár, Model Performance: Precision and Recall (2020)

[4] A. Kómár, Model Performance: ROC Curves (2020)

[5] A. Kómár, Most Frequent Words (2020)
