Using NLP To Improve Grantee Discovery

Dustin Marshall
Oct 18, 2022


How I built a simple data science subject classifier to tag The Rockefeller Foundation’s grant database.

Written by Dustin Marshall, this post builds upon many of the themes presented in an interview between the Towards Data Science Editors and The Rockefeller Foundation from last year’s Community Spotlight series.

Image: DALL-E mini/Craiyon

Data Science Team at The Rockefeller Foundation

I spent the past summer embedded in The Rockefeller Foundation’s Data Science Team, which is part of its Innovation Team. One of the team’s focuses for the year was monitoring the innovative ways the social sector uses artificial intelligence for social good. This work requires the team to take stock of the data science techniques and expertise within the Foundation’s grantee network.

As their summer Data Science Associate, I was tasked with building a simple but effective classification model to determine whether or not a grant is data science-related. The model applies a binary tag (is or is not data science-related) to each of the foundation’s grants. Knowing which organizations use data science techniques will help us identify opportunities for future applications of data science. I hope that sharing the process (and code) below will be instructive to novice data scientists looking to adapt it for their own projects.

Building a Data Science Classification Model

Text classification is at the foundation of most modern applications of Natural Language Processing (NLP), whether that’s surfacing the most relevant links in a web search or filtering spam out of your email inbox. For a task like this Data Science Classification Model, though, the approach needn’t be nearly so complicated.

When selecting a data source for an NLP model, texts rich in industry-specific vocabulary tend to perform best with simple models. Our internal administrative documents contained too much repetition and boilerplate, whereas the proposals submitted at the start of each grant stood out as good candidates for text classification, given their tendency to use more technical jargon.

In supervised machine learning, a model must be trained on labeled data before it can make predictions on unlabeled data. For this reason, after extracting the 1,000 most recent proposals submitted to the Foundation, roughly 100 were hand-labeled as data science-related. The criterion was that a proposal had to be using grant money to develop or improve a data product.

# Import the data analysis libraries used throughout
import pandas as pd
import numpy as np

# Convert csv with data science proposals text to pandas dataframe & add classification label
data_science_proposals = pd.read_csv("DataScienceProposals.csv", header=None)
data_science_proposals.columns = ["raw_text"]
data_science_proposals["label"] = np.ones(len(data_science_proposals), dtype=np.int32)

# Convert csv with other proposals text to pandas dataframe & add classification label
other_proposals = pd.read_csv("OtherProposals.csv", header=None)
other_proposals.columns = ["raw_text"]
other_proposals["label"] = np.zeros(len(other_proposals), dtype=np.int32)

For NLP models to perform their best, the data needs to be pre-processed to remove unnecessary and inconsistent details. For this model, the Python data analysis library pandas and the numerical library NumPy were used to import and prepare the data. The Natural Language Toolkit (NLTK) was then used to pre-process the text, removing special characters and punctuation. Finally, using the same NLTK package, different forms of the same word were reduced to a common root in a process called lemmatization.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires one-time downloads: nltk.download("stopwords"), nltk.download("wordnet")
STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Create function to pre-process text to improve model performance
def text_preprocessing(text):
    # lowercase and strip everything except letters
    text = text.lower()
    text = re.sub("[^a-zA-Z]", " ", text)
    # drop stopwords and reduce each remaining word to its root form
    words = [lemmatizer.lemmatize(word) for word in text.split()
             if word not in STOPWORDS]
    return " ".join(words)
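One intermediate step is implied but not shown: combining the two labeled dataframes and running the proposals through this function. A minimal sketch of that step, using the data and processed_text names that the evaluation code later in this post expects:

# Combine the labeled dataframes and pre-process the raw text
# (sketch of an implied step; "data" and "processed_text" are the
# names the evaluation code below expects)
data = pd.concat([data_science_proposals, other_proposals], ignore_index=True)
data["processed_text"] = data["raw_text"].apply(text_preprocessing)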

In order for a model to be trained, the data must be passed to it in a form it can understand. To do this, the model includes a pipeline that first transforms each unique word or phrase into a number based on how frequently it appears within a given document, offset by how frequently it appears across the full set of documents, so that common, uninformative terms receive less weight. This process is called Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization and can be done simply using the Python machine learning package Scikit-learn (sklearn). To further improve the model’s ability to learn from the data, the pipeline uses sklearn again to perform Singular Value Decomposition (SVD) on the vectors, compressing them to remove redundancies, and then standardizes the result so that each feature has a mean of 0 and a standard deviation of 1. Once the data was processed, it was split into training and testing sets.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Build pipeline to transform and process text
model = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(ngram_range=(2, 4), min_df=5,
                              max_df=0.5, sublinear_tf=True, stop_words="english")),
    ("svd", TruncatedSVD(n_components=64, n_iter=10, random_state=1)),
    ("scaler", StandardScaler())])

Since there are multiple classification models that could be used, and each has hyperparameters that can be tuned to improve performance, it is important to set aside data within the training set to test and validate these choices. For this model, the training set was split again to carve out a validation set, and a variety of classifiers from the sklearn library were evaluated on it, with a Support Vector Machine (SVM) classifier performing best. The model’s hyperparameters were then tuned using the same approach (a sketch of that step follows the comparison code below).

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Iterate through classifiers, building a pipeline that transforms text
# and passes it through each classifier in turn
for name, classifier in [
        ("LogReg", LogisticRegression()),
        ("RF", RandomForestClassifier()),
        ("KNN", KNeighborsClassifier()),
        ("SVM", SVC(probability=True)),
        ("GNB", GaussianNB()),
        ("XGB", XGBClassifier())]:
    model = Pipeline(steps=[
        ("tfidf", TfidfVectorizer(ngram_range=(2, 4), min_df=5,
                                  max_df=0.5, sublinear_tf=True, stop_words="english")),
        ("svd", TruncatedSVD(n_components=64, n_iter=10, random_state=1)),
        ("scaler", StandardScaler()),
        (name, classifier)])
    # each pipeline is then fit and scored on the validation set (see below)
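The tuning code itself isn’t shown in this post. As one illustration of “the same approach,” a cross-validated grid search over the SVM pipeline might look like the sketch below. Note two assumptions: the parameter grid is invented for illustration (it is not the grid actually searched), and GridSearchCV handles the train/validation splitting itself via cross-validation, standing in for the manual validation split described above. It also relies on the combined data frame from the earlier sketch.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

# Rebuild the winning SVM pipeline, then search an illustrative grid
svm_model = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(ngram_range=(2, 4), min_df=5,
                              max_df=0.5, sublinear_tf=True, stop_words="english")),
    ("svd", TruncatedSVD(n_components=64, n_iter=10, random_state=1)),
    ("scaler", StandardScaler()),
    ("SVM", SVC(probability=True))])
param_grid = {"SVM__C": [0.1, 1, 10], "SVM__kernel": ["linear", "rbf"]}
search = GridSearchCV(svm_model, param_grid,
                      scoring=make_scorer(fbeta_score, beta=2), cv=5)
search.fit(data["processed_text"].values, data["label"].values)
print(search.best_params_, search.best_score_)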

Deciding how to evaluate the performance of a classification model depends on where the needs of the use case fall on the precision/recall tradeoff. In this case, it’s most important that the model flag all of the data science proposals, even if a few non-data-science proposals get flagged along the way. This calls for a recall-biased model and an evaluation measure that takes this into account. For this reason, the model is evaluated using the F2-Score, which combines precision and recall but weights recall more heavily.
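Concretely, the F2-Score is 5 · precision · recall / (4 · precision + recall), and sklearn exposes it directly as fbeta_score with beta=2. A quick toy illustration with invented labels:

from sklearn.metrics import fbeta_score

# Toy example: the model catches all 4 true data science proposals
# (recall = 1.0) at the cost of one false positive (precision = 0.8)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# F2 = 5 * 0.8 * 1.0 / (4 * 0.8 + 1.0) ≈ 0.95, despite the imperfect precision
print(fbeta_score(y_true, y_pred, beta=2))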

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Evaluate average model performance across 10 random states
# (to account for the small sample size)
f2_scores_byrandomstate = []
for n in range(1, 11):
    # split data into training, validation, and testing sets
    X = data["processed_text"].values
    y = data["label"].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=n, shuffle=True, stratify=y)
    train_X, validate_X, train_y, validate_y = train_test_split(
        X_train, y_train, test_size=0.25, random_state=n,
        shuffle=True, stratify=y_train)
    # fit the pipeline for the classifier being evaluated on the training set
    model.fit(train_X, train_y)
    # find the max F2-Score for the model across classification thresholds
    y_score = model.predict_proba(validate_X)[:, 1]
    precision, recall, thresholds = precision_recall_curve(validate_y, y_score)
    # F2 = 5PR / (4P + R); the epsilon guards against division by zero
    f2_score = (5 * precision * recall) / (4 * precision + recall + 1e-10)
    f2_scores_byrandomstate.append(np.max(f2_score))
average_model_performance = np.average(f2_scores_byrandomstate)

After all of the text pre-processing, vectorization, model selection, and hyperparameter tuning, the best-performing model achieved an F2-Score of 0.828. As a rough rule of thumb, anything above 0.7 is serviceable, though a score above 0.9 would be ideal.

The complete code for this model can be found here.

Applying the Model to The Rockefeller Foundation’s Corpus of Proposals

After training the model on the 1,000 most recent proposals submitted to The Rockefeller Foundation, we can now turn it loose on the 4,000+ older proposals stored in our grant database, as well as any future submissions. The result is a predicted classification for every proposal the foundation has received or will receive. This will help The Rockefeller Foundation streamline the process of identifying grantees that use data science techniques.
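The scoring step itself might look like the minimal sketch below, assuming the older proposals have been loaded into an additional_proposals dataframe with a year column from the grant records and pre-processed with the same text_preprocessing function:

# Score the remaining proposals with the fitted model (sketch; assumes
# additional_proposals holds a pre-processed "processed_text" column)
additional_proposals["predicted_label"] = model.predict(
    additional_proposals["processed_text"].values)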

import matplotlib.pyplot as plt

# Plot predicted data science proposals across time
proposals_by_year = additional_proposals.groupby("year").sum()
plt.figure(figsize=(20, 10))
plt.bar(proposals_by_year.index.values, proposals_by_year["predicted_label"])
plt.xlabel("Year", fontsize=15)
plt.ylabel("Number of Predicted Data Science Proposals", fontsize=15)
plt.title("Number of Predicted Data Science Proposals by Year",
          fontweight="bold", fontsize=20)
plt.show()

Data Science Grants at The Rockefeller Foundation

The Rockefeller Foundation has taken a hypothesis- and data-driven approach to philanthropy since its founding, but only recently has it doubled down on its commitment to funding initiatives and organizations focused on promoting and applying data science for social impact.

To close out, I’ve highlighted a few stand-out grantees from the Innovation Team in recent years that are taking techniques and models from data science (not too dissimilar from the one shared above) and using them to innovate on solutions to long-standing problems facing communities across the globe.

  • DataKind: Since 2017, The Rockefeller Foundation has granted over 1 million USD to the US-based non-profit DataKind. This ongoing grant has been used to support the organization’s varied work using artificial intelligence and predictive analytics to improve lives around the world, and in general, its mission to connect the data science and nonprofit communities for greater social impact. To read more about their most recent grant, see this post.
  • Data.org: Since 2021, The Rockefeller Foundation has granted over 10 million USD to the US-based non-profit Data.org to support their efforts in building the field of data science for social impact and establish high-impact use cases. To read more about their amazing work, see this post.
  • AtlasAI: In the last month, The Rockefeller Foundation awarded a 1.8 million USD grant to the US-based B-Corporation AtlasAI with the purpose of helping them build the next generation of the Human and Economic Atlas, a data science tool to measure poverty and model development progress. To read more about this grant, see this post.
