Coleridge Initiative — NLP-based Automated Dataset Discovery in Scientific Publications

József Dudás
Jan 21, 2024 · 20 min read

Introduction

Our project taps into the rich dataset from the “Coleridge Initiative” Kaggle competition. This competition seeks to reveal the ways in which public data is employed in scientific research, aiding governments in making more informed and transparent decisions regarding public investments. It also strives to streamline the identification of datasets used in problem-solving, the metrics produced, and key researchers in specific fields.

The dataset provided includes the complete texts of scientific papers across diverse research areas, obtained from CHORUS publisher members and additional sources. Our objective is to pinpoint the specific datasets cited by authors in these scientific publications.

Methodology

This project aims not only to recognize known dataset strings but also to detect new, previously unseen datasets. To accomplish this, we will employ a diverse methodology:

  • N-gram Models: These will be utilized to discern patterns and associations in word sequences within scientific text.
  • RNN Models: Specifically, the implementation of two types of RNN models, Bidirectional LSTM and GRU, will be essential in capturing the sequential nature of the text and accurately classifying it.
  • CNN Model: A sep-CNN model will be used for pinpointing specific features within the text that may signify the mention of a dataset.
  • spaCy NER: The Named Entity Recognition (NER) functionality in spaCy will aid in tagging and pinpointing mentions of datasets as named entities.

It’s important to clarify that this project is not intended as a solution for the Kaggle competition. Instead, it is an exploration into various NLP and deep learning techniques for text analysis and sequence labeling, specifically in the realm of scientific literature. The project aims to shed light on methods to process unstructured text data, identify pertinent scientific articles, and categorize them using established labels.

Dataset

For this project, we utilized the “Coleridge Initiative — Show US the Data” dataset from Kaggle (https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data). The key components of this dataset that were crucial for our study included:

  • train.csv, which provides labels and metadata for the training set.
  • A train folder, containing 14.3k JSON files, each holding a scientific publication.

The structure of train.csv includes several columns:

  • id: This is the publication id. It's important to note that some training documents are listed multiple times, reflecting the mention of multiple datasets.
  • pub_title: The title of the publication. A small subset of these publications may share identical titles.
  • dataset_title: This denotes the title of the dataset mentioned in the publication.
  • dataset_label: This is a fragment of the text that references the dataset.
  • cleaned_label: The dataset_label after being passed through the competition's clean_text function (a sketch of that function follows this list).
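
For reference, the competition describes clean_text as lowercasing the label and collapsing every run of non-alphanumeric characters into a single space. A minimal sketch of that function, written from the competition's description (treat the exact regex as an assumption):

import re

def clean_text(txt):
    # Replace each run of non-alphanumeric characters with a space, then lowercase
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt)).lower()

print(clean_text("Alzheimer's Disease Neuroimaging Initiative (ADNI)"))
# -> "alzheimer s disease neuroimaging initiative adni "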

A sample JSON from the train folder:

[
{
"section_title": "",
"text": "1 Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. ADNI investigators include (complete listing available at http://adni.loni.ucla.edu/research/active-investigators/). "
}
]

Data Processing, Integration, and Analysis

First, we undertook the task of synthesizing a wealth of scientific information for our project, sourced from the “Coleridge Initiative — Show US the Data” dataset on Kaggle. The challenge was to create a unified dataset that not only contains metadata from research publications but also integrates the full text of these documents, which were originally in a JSON format.

The heart of our dataset_analysis Jupyter Notebook is a function that reads through each JSON file, extracting and concatenating text from the various sections of the scientific articles. By iterating over each row in the train.csv DataFrame, we mapped the metadata to its corresponding full text, ensuring that every snippet of information was accounted for.

This process resulted in a new, comprehensive DataFrame.

While it is common practice in NLP projects to preprocess text by removing stop words, tokenizing, and other cleaning methods, we had concerns that such processes might alter the dataset labels, which we wanted to preserve intact. Therefore, we limited our text cleaning to the bare minimum, choosing only to strip away punctuation and convert both the dataset_label and combined_text fields to lowercase. This cautious approach ensured that we maintained the original context and integrity of the dataset labels, which are pivotal for accurate identification and analysis in our project.

Then we outputted the dataset to a CSV file. This file is not just a collection of data; it’s the synthesis of context and content, a confluence of identifiers, titles, dataset labels, and the combined textual data from numerous scientific publications.

import json
import pandas as pd
import os
import re

# Constants for file paths
TRAIN_CSV_PATH = 'dataset/train.csv'
TRAIN_FOLDER_PATH = 'dataset/train/'
OUTPUT_CSV_PATH = 'dataset/full.csv'

# Function to read and combine text from a JSON file
def read_and_combine_text(json_file_path):
    with open(json_file_path, 'r') as file:
        data = json.load(file)
    combined_text = " ".join([f"{section['section_title']}: {section['text']}" for section in data])
    return combined_text

# we do not want to remove punctuation or stop words, because maybe that would alter the dataset labels as well
def clean_text(text):
    # Lowercase and remove punctuation
    text = re.sub(r'[^\w\s]', '', text.lower())
    return text

# Load train.csv
train_df = pd.read_csv(TRAIN_CSV_PATH)

# Create a new DataFrame
new_data = []

# Iterate over rows in train_df
for index, row in train_df.iterrows():
    json_file_path = os.path.join(TRAIN_FOLDER_PATH, f"{row['Id']}.json")
    if os.path.exists(json_file_path):
        combined_text = read_and_combine_text(json_file_path)
        new_data.append({
            "id": row["Id"],
            "pub_title": row["pub_title"],
            "dataset_label": row["dataset_label"],
            "combined_text": combined_text
        })

# Convert new data to DataFrame
new_df = pd.DataFrame(new_data)

# Clean the text
new_df['dataset_label'] = new_df['dataset_label'].apply(lambda x: clean_text(x))
new_df['combined_text'] = new_df['combined_text'].apply(lambda x: clean_text(x))

new_df.to_csv(OUTPUT_CSV_PATH, index=False)

We now have a large (1.1 GB) unified dataset, full.csv, which we can use in the next steps of the project. The columns of this new file are: id, pub_title, dataset_label, combined_text.

N-gram models

N-gram models are a simple yet powerful tool for text analysis, particularly in the field of Natural Language Processing (NLP). An N-gram is a contiguous sequence of ’n’ items from a given sample of text or speech. The ‘items’ here can be phonemes, syllables, letters, words, or base pairs according to the application. When dealing with text, as in our project, these items are typically words.

For instance, in the sentence “The quick brown fox jumps over the lazy dog,” a 1-gram (or unigram) model would break down the text into individual words like “The”, “quick”, “brown”, and so on. A 2-gram (or bigram) model would look at pairs of consecutive words, such as “The quick”, “quick brown”, “brown fox”, and the like. Similarly, a 3-gram (or trigram) model would involve triplets of words like “The quick brown”, “quick brown fox”, and so on.

N-gram models are built on the assumption that the probability of a word depends only on the previous ‘n-1’ words. This is a Markovian assumption, implying a memory-less model of text generation where the syntax and semantics are captured by considering local context. In practice, n-grams are used to develop language models which can predict the next item in a sequence, making them useful for tasks like text prediction, spelling correction, and even for generating new text that mimics a given style or corpus.
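
To make the Markov assumption concrete, here is a toy sketch (on a made-up corpus, not our data) that estimates the probability of the next word from bigram counts:

from collections import Counter
from nltk import bigrams

# Toy corpus to illustrate the Markov assumption behind a bigram model
corpus = "the quick brown fox jumps over the lazy dog the quick brown fox sleeps".split()

bigram_counts = Counter(bigrams(corpus))
unigram_counts = Counter(corpus)

# P(next_word | word) is estimated as count(word, next_word) / count(word)
def bigram_probability(word, next_word):
    return bigram_counts[(word, next_word)] / unigram_counts[word]

print(bigram_probability("quick", "brown"))  # 1.0  -- "brown" always follows "quick" here
print(bigram_probability("the", "quick"))    # 0.67 -- "the" is followed by "quick" in 2 of 3 cases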

In the context of our project, we utilized N-gram models to detect patterns and correlations in the sequences of words within scientific texts. This is crucial because certain phrases or combinations of words are more likely to indicate the mention of a dataset, which is exactly what we need to identify.

Histogram of Word Counts in Dataset Labels

To get a better intuition, and to be able to select the appropriate word count for N-grams, we needed to understand the distribution of word counts across dataset labels. This histogram will provide a visual representation of how many words typically make up these labels, aiding in determining the optimal N-gram size for our analysis.

import matplotlib.pyplot as plt

# Calculate word count for each dataset label
df['word_count'] = df['dataset_label'].apply(lambda x: len(str(x).split()))

# Plotting the histogram
plt.figure(figsize=(10,6))
plt.hist(df['word_count'], bins=range(1, df['word_count'].max()+1), edgecolor='black')
plt.title('Histogram of Word Counts in Dataset Labels')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.xticks(range(1, df['word_count'].max()+1))
plt.show()
Histogram of Word Counts in df[‘dataset_label’]

We see that dataset labels with five words are the most common, so we can run some analysis with 5-grams.

N-gram calculation and analysis

from collections import Counter
from nltk import ngrams

def generate_ngrams(text, n):
    words = text.split()
    return list(ngrams(words, n))

all_ngrams = []

for text in df['combined_text']:
    n_grams = generate_ngrams(text, 5)
    all_ngrams.extend(n_grams)

# Counting the frequency of each ngram
ngram_freq = Counter(all_ngrams)

# Display the most common ngrams
print(ngram_freq.most_common(10))
# Display the total number of 5-grams found
total_ngrams = len(all_ngrams)
print(f"Total number of 5-grams found in combined_text: {total_ngrams}")

# Filter entries where word_count is 5
five_word_labels = df[df['word_count'] == 5]['dataset_label']

# How many dataset_labels are there with 5 words
five_word_count = len(five_word_labels)
print(f"Number of dataset labels with 5 words: {five_word_count}")

# How many unique dataset_labels are there
unique_five_word_labels = set(five_word_labels)
unique_label_count = len(unique_five_word_labels)
print(f"Number of unique dataset labels with 5 words: {unique_label_count}")

# Create a set of 5-gram strings for comparison
all_ngram_strings = {' '.join(ngram) for ngram in all_ngrams}

# How many matches exist in the unique dataset_labels with the found 5-grams
matches = unique_five_word_labels.intersection(all_ngram_strings)
match_count = len(matches)
print(f"Number of matches between unique dataset labels and 5-grams: {match_count}")

The findings from our analysis are as follows:

We extracted a total of 163,370,884 5-grams from combined_text. In the labeled data, 4,803 rows have dataset labels composed of exactly five words, and among these there are 16 unique 5-word labels. Matching these unique labels against the 5-grams present in our texts, we found 15 of them verbatim.

Initially, these numbers might seem trivial; however, they play a critical role in validating the accuracy of the dataset labels provided in the train.csv from Kaggle. This confirmation is significant—it establishes that the dataset labels found in the training data do indeed correspond to those mentioned within the scientific articles and papers supplied.

Moving forward, this groundwork lays a foundation for more advanced applications. The N-grams hold potential value as features for machine learning models. Since models that process text require numerical input, N-grams can serve as that bridge: they encapsulate local word-order patterns, giving models a richer view of language structure, which is useful for tasks such as text classification, search, and prediction.
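
As an illustration of how N-grams become numerical features, here is a small sketch using scikit-learn's CountVectorizer (the two snippets below are invented stand-ins for publication text):

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical snippets standing in for publication text
texts = [
    "data from the national education longitudinal study",
    "we analysed the alzheimer s disease neuroimaging initiative cohort",
]

# Turn each document into counts of word uni-, bi- and tri-grams
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

print(X.shape)                                 # (2, number_of_distinct_ngrams)
print(vectorizer.get_feature_names_out()[:5])  # first few n-gram features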

Thus, while the immediate results may seem modest, they are a stepping stone. They assure us of our data’s reliability and open the door to employing N-grams in machine learning models to harness their full predictive and analytical power.

RNN models

Recurrent Neural Networks (RNNs) are a class of neural networks that are particularly suited to processing sequences, such as sentences in a text. They are designed to maintain a ‘memory’ of previous inputs by using their internal state (hidden layers) to process sequences of inputs. This makes them ideal for tasks like language modeling and text generation. However, RNNs often face challenges with long sequences due to problems like vanishing gradients, where the influence of inputs becomes exponentially weaker with each additional time step, making it difficult for the network to learn long-range dependencies.

To address this, variants of RNNs such as Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs) were developed. Both are designed to better capture long-term dependencies within the data.

GRU (Gated Recurrent Unit):

  • GRUs are a more streamlined version of LSTMs as they use fewer parameters.
  • They combine the forget and input gates into a single “update gate.”
  • They also merge the cell state and hidden state, and make several other changes that make the model simpler.
  • They are generally faster to compute and can perform better or on par with LSTMs if the sequence length is not too long or the problem is not too complex.

Bidirectional LSTM (Long Short-Term Memory):

  • LSTMs have a more complex structure with three gates (input, output, and forget gates) and a cell state that runs horizontally on top of them.
  • They are designed to capture information from both past (backward) and future (forward) states. This is particularly useful for tasks where the context from both directions is crucial for understanding the content.
  • Bidirectional LSTMs are essentially two LSTMs stacked on top of each other, processing the data in both directions.
  • They tend to have a higher capacity and are better suited for more complex problems with longer dependencies.

Which is Simpler? The GRU is simpler in terms of architecture and the number of parameters than the LSTM, which makes it computationally more efficient.

Which is Better? There’s no definitive answer to which model is better; it depends on the specific application and the nature of the data. For simpler tasks or when training time and computational resources are limited, GRUs may be preferable. For more complex tasks with longer sequences, Bidirectional LSTMs may provide better performance due to their ability to capture information from both past and future contexts.

How They Fit Into the Current Problem: For the task of identifying dataset names in scientific papers, both GRU and Bidirectional LSTM models can be useful:

  • GRUs could efficiently process the text to identify dataset labels, especially if the sequences are not excessively long and the complexity of the task is moderate. They can learn which words or phrases are likely to be part of a dataset name and predict them accordingly.
  • Bidirectional LSTMs would be particularly beneficial if the context before and after the mention of a dataset is important for its identification. Since dataset names can be influenced by the surrounding text, having both preceding and following context can significantly improve the model’s ability to recognize dataset labels accurately.

In conclusion, the choice between a GRU and a Bidirectional LSTM for this task would largely depend on the length and complexity of the dataset names and the context in which they appear. Experimentation and performance evaluation on a development set are typically required to determine the best model for a given problem.

Implementation of GRU (Gated Recurrent Unit)

Data Splitting: We commenced by splitting our dataset into training and testing sets using the train_test_split function from sklearn.model_selection. This is crucial for evaluating the model's performance on unseen data. We allocated 80% of our data for training and reserved 20% for testing, ensuring a randomized split with a set seed for reproducibility.

Data Preprocessing for GRU: Our raw text data underwent tokenization, a process where text is split into tokens (in this case, words), and each token is replaced with a corresponding numerical index. We used the Tokenizer from tensorflow.keras.preprocessing.text, limiting our vocabulary to the top 10,000 words for efficiency.

Following tokenization, we converted our text into sequences of these indices. To maintain consistency in input size, we padded these sequences to a fixed length. For our labels, which are categorical, we employed a LabelEncoder to transform them into numerical format, and then used to_categorical to convert the integer-encoded labels into a binary class matrix, which is a format suitable for classification with a neural network.

Building and Training the GRU Model: We constructed a sequential GRU model using tensorflow.keras. The model starts with an Embedding layer that turns positive integers (indexes) into dense vectors of fixed size, which is a standard approach for handling text data in neural networks. Next, the GRU layer was added, which is adept at processing sequences due to its internal gating mechanism. We also incorporated dropout in the GRU to prevent overfitting. The final layer is a Dense layer with a 'softmax' activation function that outputs a probability distribution over our encoded labels.

The model was compiled with the ‘adam’ optimizer and ‘categorical_crossentropy’ loss function, both standard choices for multi-class classification problems. We then trained the model on our preprocessed training data, tuning the batch size and epochs to fit our computational resources and the specific demands of our dataset.

The model’s performance is evaluated on the validation set during training, giving us insight into how well the GRU is generalizing to new data. The hope is that this model will effectively learn from the textual patterns to accurately classify text based on the presence of dataset names within scientific papers.

# Split the Data into Train and Test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['combined_text'], df['dataset_label'], test_size=0.2, random_state=42
)

# Preprocess the Data for GRU
# Text data must be tokenized and converted to sequences that the GRU can process.
# The labels also need to be encoded if they are categorical.

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

# Define a number of words parameter for the tokenizer
num_words = 10000

# Tokenize the text
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(X_train)

# Convert text to sequences of integers
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad the sequences so they are all the same length
max_seq_length = 50 # You can choose a different length if necessary
X_train_pad = pad_sequences(X_train_seq, maxlen=max_seq_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_seq_length)

# Encode the labels
label_encoder = LabelEncoder()
label_encoder.fit(df['dataset_label']) # Fit on the entire dataset

y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert integer-encoded labels to binary class matrix
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)

# Build the GRU Model
gru_model = Sequential(name="gru_model")
gru_model.add(Embedding(input_dim=num_words, output_dim=64, input_length=max_seq_length))
gru_model.add(GRU(units=64, dropout=0.2, recurrent_dropout=0.2))
gru_model.add(Dense(y_train_categorical.shape[1], activation='softmax'))

gru_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = gru_model.fit(
    X_train_pad, y_train_categorical,
    batch_size=16,  # Batch size can be adjusted depending on memory availability
    epochs=10,      # The number of epochs can be adjusted based on observed performance
    validation_data=(X_test_pad, y_test_categorical),
    verbose=1
)

The two charts provided show the training and testing progress of a machine learning model across epochs, measured in terms of accuracy and loss:

In the accuracy chart, we observe that the training accuracy continues to increase as the number of epochs grows, suggesting that the model is effectively learning from the training data. However, the testing accuracy plateaus after the third epoch, indicating that subsequent training does not lead to better generalization on the test set. This could be a sign of overfitting, where the model is becoming too tailored to the training data and is not improving its performance on unseen data.

The loss chart reinforces this interpretation. The training loss decreases, which is expected as the model optimizes its weights. The test loss decreases until around the third epoch, after which it begins to stabilize or slightly increase. This suggests that the model’s ability to generalize (i.e., its performance on new, unseen data) is not improving after the third epoch.

The model seems to reach its optimal generalization performance at around the third epoch. Further training doesn’t improve the model’s performance on the test set, which is the real-world proxy for how well the model will perform. This can be used as a cue to stop training the model to prevent overfitting and to save computational resources.
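
If we were to rerun training, one way to act on this cue would be Keras's EarlyStopping callback; a minimal sketch, with illustrative settings rather than the exact ones used above:

from tensorflow.keras.callbacks import EarlyStopping

# Stop automatically once the validation loss stops improving
early_stopping = EarlyStopping(
    monitor='val_loss',        # watch the validation (test) loss
    patience=2,                # allow two epochs without improvement
    restore_best_weights=True  # roll back to the best epoch seen so far
)

history = gru_model.fit(
    X_train_pad, y_train_categorical,
    batch_size=16,
    epochs=10,
    validation_data=(X_test_pad, y_test_categorical),
    callbacks=[early_stopping],
    verbose=1
)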

Testing the trained model on a text:

def predict_from_text(model, sample_text):
    # Tokenize and pad the sample text
    sample_seq = tokenizer.texts_to_sequences([sample_text])
    sample_pad = pad_sequences(sample_seq, maxlen=max_seq_length)

    # Predict using the trained model
    predicted_categorical = model.predict(sample_pad)

    # Get the class with the highest probability
    predicted_class_index = predicted_categorical.argmax(axis=-1)[0]

    # Decode the prediction
    predicted_label = label_encoder.inverse_transform([predicted_class_index])
    print(f"Model: {model.name}")
    print(f"Predicted dataset label: {predicted_label[0]}")

# Sample text
sample_text = "what is this study about this study used data from the national education longitudinal study nels88 to examine the effects of dual enrollment programs for high school students on college degree attainment the study also reported whether the impacts of dual enrollment programs were different for first generation college students versus students whose parents had attended at least some college in addition a supplemental analysis reports on the impact of different amounts of dual enrollment coursetaking and college degree attainment dual enrollment programs offer collegelevel learning experiences for high school students the programs offer college courses andor the opportunity to earn college credits for students while still in high school the intervention group in the study was comprised of nels participants who attended a postsecondary school and who participated in a dual enrollment program while in high school n 880 the study author used propensity score matching methods to create a comparison group of nels participants who also attended a postsecondary school but who did not participate in a dual enrollment program in high school n 7920 features of dual enrollment programs dual enrollment programs allow high school students to take college courses and earn college credits while still in high school these programs are intended to improve college attainment especially among lowincome students by helping students prepare academically for the rigors of college coursework and enabling students to accumulate college credits toward a degree the study reported program impacts on two outcomes attainment of any college degree and attainment of a bachelors degree these impacts were examined for various subgroups of students which are described below wwc single study review what did the study find"

predict_from_text(gru_model, sample_text)

Predicted dataset label: national education longitudinal study

Conclusion on GRU

Although the current model is adept at identifying dataset names it encountered during training, the tokenization process it relies on prevents it from recognizing dataset names it has never seen, so it cannot predict new labels in fresh texts.

Implementation of Bidirectional LSTM (Long Short-Term Memory)

For the BiLSTM, we only need to change the model architecture as follows:

from tensorflow.keras.layers import Bidirectional, LSTM

bilstm_model = Sequential(name="bilstm_model")
bilstm_model.add(Embedding(input_dim=num_words, output_dim=64, input_length=max_seq_length))
bilstm_model.add(Bidirectional(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2)))
bilstm_model.add(Dense(y_train_categorical.shape[1], activation='softmax'))

bilstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

After training for 10 epochs and checking the accuracy and loss history, we can conclude that, with the current hyperparameters, it is enough to train this model for ~4 epochs:

Conclusion on GRU and BiLSTM models

The Bidirectional LSTM model, like the GRU model, faces the same fundamental challenge when it comes to detecting completely new dataset names in fresh texts, due to the tokenization issue. This challenge stems from the way these models, and most standard neural network architectures in NLP, handle vocabulary:

  1. Fixed Vocabulary: Both GRU and Bidirectional LSTM models typically rely on a fixed vocabulary set during training. This vocabulary is used to tokenize the text and convert words into numerical representations (like word embeddings). Words not in this vocabulary at training time are often represented as unknown tokens.
  2. Generalization Limitation: While these models are good at understanding and generalizing patterns they have seen during training, their ability to recognize entirely new words or phrases not seen in the training data is limited. If a dataset name did not appear in the training data, the model might not recognize it in new texts.
  3. Tokenization of Unseen Words: Words not present in the training vocabulary are usually replaced with an “out-of-vocabulary” token or simply ignored, depending on the implementation. This means that entirely new dataset names composed of words not in the training vocabulary will not be effectively recognized (see the sketch below).
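
A tiny illustration of this out-of-vocabulary behaviour with the Keras Tokenizer (the dataset names here are just examples):

from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the tokenizer on one dataset name only
toy_tokenizer = Tokenizer(num_words=50, oov_token="<OOV>")
toy_tokenizer.fit_on_texts(["national education longitudinal study"])

print(toy_tokenizer.texts_to_sequences(["national education longitudinal study"]))  # [[2, 3, 4, 5]]
print(toy_tokenizer.texts_to_sequences(["baltimore longitudinal study of aging"]))  # [[1, 4, 5, 1, 1]]
# Words never seen at fit time ("baltimore", "of", "aging") all collapse to the
# <OOV> index 1, so the model cannot tell one unseen dataset name from another.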

Sep-CNN Model

The sep-CNN model, short for “separable convolutional neural network,” refers to a type of CNN that uses depthwise separable convolutions. These are a more efficient variant of the standard convolution operation. The sep-CNN model is particularly known for its efficiency and reduced computational cost because it breaks down the learning process into two parts:

  1. Depthwise Convolution: Applies a single filter per input channel (input depth).
  2. Pointwise Convolution: Then, a 1x1 convolution is applied to combine the outputs of the depthwise convolution.

This model has been successfully used in tasks such as image recognition and processing but can also be adapted for NLP tasks, including identifying dataset mentions in text.
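
To see why this is cheaper, the following sketch (with made-up toy dimensions) compares the parameter counts of a standard Conv1D layer and a SeparableConv1D layer in Keras:

import tensorflow as tf

# Toy dimensions: sequence length 50, 50-dim embeddings, kernel size 3, 32 filters
inputs = tf.keras.Input(shape=(50, 50))

standard = tf.keras.layers.Conv1D(filters=32, kernel_size=3, padding='same')
separable = tf.keras.layers.SeparableConv1D(filters=32, kernel_size=3, padding='same')

# Calling the layers on the input tensor builds their weights
_ = standard(inputs)
_ = separable(inputs)

print("Conv1D parameters:         ", standard.count_params())   # 3*50*32 + 32 = 4832
print("SeparableConv1D parameters:", separable.count_params())  # 3*50 + 50*32 + 32 = 1782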

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SeparableConv1D, GlobalMaxPooling1D, Dense

# Parameters
vocab_size = 10000 # This should be the size of your vocabulary
embedding_dim = 50 # This should match the size of your word embeddings
max_length = 50 # This should match the length of your padded sequences
num_classes = y_train_categorical.shape[1] # This should be the number of classes in your dataset

# Build the sep-CNN model
sepcnn_model = Sequential(name="sepcnn_model")
sepcnn_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
sepcnn_model.add(SeparableConv1D(filters=32, kernel_size=3, activation='relu', strides=1, padding='same'))
sepcnn_model.add(GlobalMaxPooling1D())
sepcnn_model.add(Dense(units=num_classes, activation='softmax'))

# Compile the model
sepcnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Comparing the GRU, BiLSTM and sep-CNN

def compare_models(models, X_test, y_test):
    """
    Evaluates a list of models on the same test set and displays their performance metrics.

    Parameters:
    models (list): List of Keras models to be evaluated.
    X_test: Test features.
    y_test: Test labels.
    """
    for model in models:
        scores = model.evaluate(X_test, y_test, verbose=0)
        print(f"Model: {model.name} - Loss: {scores[0]:.4f}, Accuracy: {scores[1]:.4f}")

models = [gru_model, bilstm_model, sepcnn_model]
compare_models(models, X_test_pad, y_test_categorical)

Model: gru_model - Loss: 2.0622, Accuracy: 0.3956
Model: bilstm_model - Loss: 2.0016, Accuracy: 0.4063
Model: sepcnn_model - Loss: 2.1569, Accuracy: 0.3781

The performance metrics for the three models — gru_model, bilstm_model, and sepcnn_model — are relatively similar, with losses and accuracies clustering around the same values. The GRU model shows a loss of 2.0622 and an accuracy of 0.3956, the Bidirectional LSTM model a slightly lower loss of 2.0016 and an accuracy of 0.4063, and the sepCNN model a loss of 2.1569 with an accuracy of 0.3781. The tokenization approach used in these models hampers their ability to effectively predict new, previously unseen dataset names in novel scientific articles.

spaCy NER

spaCy is a powerful and flexible NLP library that can be particularly effective for tasks like Named Entity Recognition (NER).

spaCy's NER is good at identifying entities such as names of people, locations, and more. However, the spaCy NER model, as it comes pre-configured, does not have the capability to identify datasets. This limitation led us to fine-tune an existing model via transfer learning to specifically recognize dataset entities.
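
A quick sanity check of what the pre-trained pipeline tags out of the box (the sentence is adapted from the sample JSON shown earlier):

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Data were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# The pre-trained model typically tags this span as ORG (or misses it entirely);
# there is no built-in DATASET label, which is why we fine-tune below.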

Prepare the Training Data

For NER, spaCy expects training data in a specific format. Each sample should be a tuple consisting of the text and a dictionary. The dictionary should have a key 'entities' with a list of tuples, each representing (start_offset, end_offset, label).

# Example training data format
TRAIN_DATA = [
("Text of the first document", {"entities": [(start, end, "DATASET")]}),
("Text of the second document", {"entities": [(start, end, "DATASET")]}),
# And so on...
]

Because of our limited hardware capacities, we had to do this training data preparation in batches:

import spacy
from spacy.tokens import DocBin
import math
from pathlib import Path

# Load a blank English model
nlp = spacy.blank("en")

def create_docbin(df_subset, nlp, label_name):
    doc_bin = DocBin()
    for _, row in df_subset.iterrows():
        text = row['combined_text']
        label = row['dataset_label']
        start = text.find(label)
        end = start + len(label)
        if start != -1 and len(text) <= nlp.max_length:
            doc = nlp.make_doc(text)
            ents = [(start, end, label_name)]
            doc.ents = [doc.char_span(s, e, label=l) for s, e, l in ents if doc.char_span(s, e, label=l) is not None]
            doc_bin.add(doc)
    return doc_bin

# Function to split the DataFrame into chunks
def split_dataframe(df, chunk_size):
    num_chunks = math.ceil(len(df) / chunk_size)
    return (df[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks))

# Splitting the DataFrame
df_chunks = split_dataframe(df, chunk_size=1000)  # Adjust chunk_size as needed

# Create a DocBin for each chunk and save to disk
for i, df_chunk in enumerate(df_chunks):
    doc_bin = create_docbin(df_chunk, nlp, label_name='DATASET')
    doc_bin.to_disk(f"dataset/spacy/spacy_data_part_{i}.spacy")

After that, we downloaded the small English model (en_core_web_sm) and tried to fine-tune its NER component to recognize our new entity, DATASET:

# Create a blank pipeline to provide a vocab for deserializing the DocBins
nlp = spacy.blank("en")

# Load the annotated data
def load_data_from_disk(path_to_directory):
    train_data = []
    for path in Path(path_to_directory).rglob('*.spacy'):
        doc_bin = DocBin().from_disk(path)
        train_data.extend(list(doc_bin.get_docs(nlp.vocab)))
    return train_data

# Path to our annotated data
path_to_data = 'dataset/spacy/'
train_data = load_data_from_disk(path_to_data)

# Load the pre-trained small English model
nlp = spacy.load("en_core_web_sm")

# Get (or add) the NER component
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner")
else:
    ner = nlp.get_pipe("ner")

ner.add_label("DATASET")
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

from spacy.training import Example
from spacy.util import minibatch, compounding

import random

with nlp.disable_pipes(*other_pipes):  # Only train NER
    optimizer = nlp.resume_training()
    for itn in range(10):  # Number of training iterations
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=compounding(4., 32., 1.001)):
            examples = []
            for doc in batch:
                # Create Example objects
                annotations = {"entities": [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]}
                examples.append(Example.from_dict(nlp.make_doc(doc.text), annotations))
            # Update the model
            nlp.update(
                examples,
                drop=0.5,  # Dropout rate
                losses=losses,
                sgd=optimizer
            )
        print(f"Iteration {itn} - Losses: {losses}")

Nonetheless, after three days of continuous training, the model had still not finished, leaving us unable to evaluate the effectiveness of the newly trained model.

Conclusion on spaCy's NER

While spaCy’s NER technology is capable of identifying new dataset mentions in many cases, its success is not guaranteed, especially for mentions that are significantly different from the training data. The model’s performance can be enhanced by carefully preparing the training data to be as representative as possible of the texts it will encounter and by continuously updating the model with new data.

For tasks requiring high precision in identifying very specialized or constantly evolving entities, combining NER with other approaches (like rule-based systems or manual verification) might be necessary.
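
As a sketch of what such a rule-based complement could look like, spaCy's PhraseMatcher can match a list of known dataset names exactly (the names below are just examples; in practice the list would come from train.csv):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# Illustrative list of known dataset names
known_datasets = [
    "National Education Longitudinal Study",
    "Alzheimer's Disease Neuroimaging Initiative",
]
matcher.add("DATASET", [nlp.make_doc(name) for name in known_datasets])

doc = nlp.make_doc("This study used data from the national education longitudinal study.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "national education longitudinal study"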

Further research

In conclusion, our exploration suggests that the quest to effectively generalize and identify new, unseen datasets — as outlined in our objective — may find its answer in the latest advancements in machine learning. Based on our current understanding and the state of the art in the field, a transformer-based model appears to be the most promising approach for tackling this challenge.

Transformers have revolutionized the landscape of NLP by leveraging self-attention mechanisms, which allow models to weigh the importance of different parts of the input data differently. This capability is particularly advantageous when attempting to discern the subtle patterns and contexts that signify dataset references in scientific texts.

Further investigation and experimentation with transformer architectures, such as BERT or GPT, which are pre-trained on large corpora and fine-tuned for specific tasks, could potentially yield a model with superior performance in recognizing and generalizing dataset names across diverse scientific documents. This approach could harness the power of deep learning to capture the intricate nuances of language and dataset citation styles, significantly advancing our current methodologies.
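
As a pointer for that future work, a token-classification setup with the Hugging Face transformers library could look roughly like this (the model name and label set are placeholders, and the classification head is untrained until fine-tuned on dataset-mention annotations):

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "bert-base-cased"  # placeholder; any pre-trained encoder could be tried
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)  # O, B-DATASET, I-DATASET

# After fine-tuning, the pipeline would label dataset mentions token by token
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner("This study used data from the National Education Longitudinal Study."))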
