Helping the United Nations Infer Document Labels Using Knowledge Graphs

Andrew Marmon
Slalom Technology
Oct 7, 2020



The United Nations Office for the Coordination of Humanitarian Affairs (OCHA) manages a multibillion-dollar fund intended for projects related to humanitarian emergencies. Donors provide these funds to OCHA with a list of strategic priorities that they'd like their donations to be aligned with, including initiatives centered around education in crisis, women and girls, protection, and disability. There's a challenge in prioritizing where these funds land, however. Non-governmental organizations (NGOs) request funding from this pool by submitting lengthy project proposals that then must be aligned to donor funds. Once projects are selected and funded, the OCHA team must also justify their selection decisions by aligning key outcomes to the goals of the donor fund.

Currently, the entire process of project proposal ingestion, alignment determination, and reporting is manual. The Emerging Technologies Lab (ETL) at the UN partnered with Slalom to find ways that innovative technology like AI/ML and knowledge graphs can:

· Reduce the time required to process these project proposals

· Increase the rate at which these proposals successfully align with the key strategic initiatives

· Explain how and why these decisions were made

The end result was a highly explainable natural language processing pipeline that leveraged keyword entity extraction and an open source knowledge graph to achieve these goals.

How we work

For every project Slalom takes on, we follow a standard framework to solve the problem meaningfully and responsibly. We:

i) Determine metrics that constitute a successful outcome

ii) Identify ethical considerations to address during data profiling and modeling

iii) Explore the dataset and prototype approaches

iv) Implement a solution using metrics as a benchmark

Measure what matters

Ensuring that funds are well spent is paramount for OCHA, its donors, and ultimately the people in need. A manual process inherently limits the speed of proposal assessments and the variety and granularity of reporting on finished projects. An automated tool to help humans in this process would need to be accurate, fast, and flexible. We defined success as achieving greater than 85% classification accuracy.

An important component of every solution we propose is model explainability. Any funding that OCHA provides must have an associated, discernible reason behind that decision that can be reported back to the donors. In contrast with many modern black box models, the text extraction and knowledge graph pipeline we implemented embeds a high level of transparency into the inference process, allowing for a robust reporting structure that ties these model outcomes to key terms and phrases that had a disproportionate impact on the assigned label.

Do what’s right

A key question to ask during any machine learning engagement is: what kind of outcomes are we driving? Is it possible that our dataset is inherently biased, and if so, do we have the tools to mitigate bias that could lead to the perpetuation of unjust outcomes for a subset of people? As machine learning practitioners on the front lines of innovation, we have a responsibility to ask these questions to prevent machine learning models from perpetuating biases learned from historical data. For this engagement, we investigated the ethical considerations relating to the project's purpose, the dataset, and the machine learning models.

The core of this collaboration was to help the United Nations coordinate assistance for people in need more efficiently. When this pipeline we built is deployed into production, it will have a significant positive impact on people in need of this OCHA funding. Money designated for the strategic initiatives will be more efficiently allocated, allowing NGOs to leverage this funding to help people faster. Taking these points into consideration, we concluded that this was an ethical engagement that will make a positive difference as long as we implement it correctly.

For the dataset and machine learning models, we prepared a set of ethical considerations to investigate while exploring and prototyping approaches. We knew we would be working with NGO proposal documents and detailed descriptions of the strategic initiatives. These are already heavily vetted by humanitarian experts, so we assumed they were mostly free of glaring inherent biases. However, the machine learning approach we settled upon was an open source knowledge graph trained on Wiktionary called ConceptNet. Models trained on open, crowd-sourced data are notorious for containing bias absorbed from users who employ discriminatory language on these platforms. One of the most striking examples of this is Microsoft Tay, which quickly became racist, homophobic, and antisemitic after being exposed to Twitter users for a few hours. During our prototyping phase, we committed to eliminating bias like this in ConceptNet before using the model for our purposes. Read more about our findings in the section on addressing open source model bias below.

Begin with the end in mind

The OCHA dataset

The solution required data from NGO project proposals and formal definitions for the four strategic initiatives we targeted. Each NGO that applies for funding must submit an extensive (30+ pages) proposal which includes a statement of purpose, detailed descriptions of its mission, metrics that show why it needs funding, and a rich set of diagrams and images that help visualize the information. These proposals have no associated labels linking them to the strategic initiatives: education in crisis, women and girls, protection, and disability.

Every project proposal that the UN receives needs to be tagged with one of these strategic initiative labels by our NLP pipeline. Each of these topics is accompanied by a descriptive supporting document that outlines the definition of that strategic initiative and what OCHA’s responsibility is in addressing it. For example, see the definition of protection.

Linking strategic initiatives to project proposals

So how do you create an explainable document classification algorithm with no explicitly assigned labels? The idea is fairly simple: use commonsense word relationships derived from an open source knowledge graph to link key words extracted from project proposal documents to words related to the strategic initiatives ‘education in crisis,’ ‘protection,’ ‘women and girls,’ and ‘disability.’

The first step is to extract key words from the project proposals that are representative of the documents. Once we have these terms, we can start building a pipeline that labels which strategic initiative these terms are closest to. There are many well-documented methods to find words or phrases that are important to a document. Methods like TF-IDF, LDA, and other variants are commonly used for this purpose. In this case, the ETL had already developed a robust implementation of YAKE (Yet Another Keyword Extractor) to gather these phrases from its documents. In addition to this extraction method, Slalom implemented LDA to gather topics from these documents and the keywords related to those topics. Below is an extract of terms found in one of the UN project proposals:

keywords = ['wfp', 'provides', 'dispatch', 'rations', 'idps', 'delivery', 'assistance', 'severely', 'situation', 'percent', 'programme', 'partners', 'hudaydah', 'insecurity', 'security', 'insecure', 'million', 'humanitarian', 'food', 'emergency', 'general', 'world', 'response']
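The extraction step can be illustrated with a minimal TF-IDF sketch. This is a stand-in for the ETL's YAKE implementation and our LDA topic modeling, not the production code; the whitespace tokenizer, smoothed IDF, and toy corpus are simplifying assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer — a stand-in for real preprocessing."""
    return text.lower().split()

def tfidf_keywords(doc, corpus, top_n=5):
    """Rank terms in `doc` by TF-IDF against a corpus of documents."""
    doc_tokens = tokenize(doc)
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many corpus documents contain the term
        df = sum(1 for d in corpus if term in tokenize(d))
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative mini-corpus, not the actual UN proposals
corpus = [
    "wfp provides emergency food rations and humanitarian assistance to idps",
    "partners coordinate a security programme for the humanitarian response",
    "general situation report on delivery of assistance",
]
print(tfidf_keywords(corpus[0], corpus))
```

Terms that appear across many proposals (like 'humanitarian') are down-weighted, while document-specific terms rise to the top — the same intuition YAKE and LDA apply with more sophistication.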

The million-dollar question is: how do we map these key terms to the strategic initiatives the UN cares about? Our proposed solution was to create a knowledge graph centered around those humanitarian concepts (education in crisis, women and girls, protection, and disability) and implement a distance metric that finds the closest initiative given a set of key words.

For example, if we extracted these keywords from a UN proposal:

keywords = ['assist', 'safe', 'defend', 'guard']

We’d want to label that document as ‘protection.’ Alternatively, if the keywords we extracted were:

keywords = ['learning', 'traumatic', 'schooling', 'unstable']

Our mechanism for labeling these documents would ideally label the document as ‘education in crisis.’

We used an open source knowledge graph trained on Wiktionary called ConceptNet to represent the logical similarities between words and phrases. However, this knowledge graph contains many words and relationships that are not useful to us because of the vast number of topics it was trained on through Wiktionary. To narrow down the scope of what was represented in this graph, we used the ConceptNet API to gather words related to our strategic initiatives. Below is an extract of the first-depth graph centered around the strategic initiative ‘protection.’

‘Protection’ ConceptNet Extract
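Gathering these related terms amounts to walking the edge list that the ConceptNet API returns for a concept. The sketch below parses a small payload shaped like the JSON from a query such as GET http://api.conceptnet.io/c/en/protection; the sample edges are illustrative, not real API output.

```python
# Illustrative payload mimicking the "edges" shape of a ConceptNet
# API response — in the live pipeline this would come from an HTTP call.
sample_response = {
    "edges": [
        {"start": {"label": "protection", "language": "en"},
         "end": {"label": "safety", "language": "en"},
         "rel": {"label": "RelatedTo"}, "weight": 2.0},
        {"start": {"label": "protection", "language": "en"},
         "end": {"label": "shelter", "language": "en"},
         "rel": {"label": "RelatedTo"}, "weight": 1.0},
    ],
}

def related_terms(response, concept):
    """Collect the English-language neighbors of `concept` from an edge list."""
    neighbors = set()
    for edge in response["edges"]:
        for node in (edge["start"], edge["end"]):
            if node.get("language") == "en" and node["label"] != concept:
                neighbors.add(node["label"])
    return neighbors

print(related_terms(sample_response, "protection"))
```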

As you can see, each one of these terms is closely associated with the protection initiative that OCHA is invested in. These relationships are best thought of in terms of sentences found in these NGO documents. For example, in the Yemen humanitarian needs funding proposal, the organization explicitly calls out its need for protection:

“The escalation of the conflict since March 2015 has dramatically aggravated the protection crisis in which millions face risks to their safety and basic rights.”

Similarly, the South Sudan humanitarian needs proposal shows the need for protection, but doesn’t call it out specifically:

“Five years of the most recent conflict has forced almost 4.2 million people to flee their homes in search of safety, nearly 2 million of them within and nearly 2.2 million outside the country.”

In both of these situations, funding for protection is paramount. However, notice how in this South Sudan excerpt, 'protection' isn't mentioned, but a related word, 'safety', is. This knowledge graph is key for linking the implicit need for protection highlighted in, "flee their homes in search of safety," to the initiative 'protection' that will help provide them with funding. We expanded this simple 8-node knowledge graph to encompass 4,538 concepts by including each initiative in the knowledge graph and running a depth-3 breadth-first search.

Full ConceptNet Extract with 4,538 Nodes
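The depth-3 expansion described above is a depth-limited breadth-first search from each initiative. A minimal sketch, using a plain adjacency dict with toy edges in place of the real ConceptNet extract:

```python
from collections import deque

def expand(graph, seeds, max_depth=3):
    """Breadth-first expansion of `seeds` up to `max_depth` hops.
    `graph` is an adjacency dict; returns every concept reached."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Toy adjacency list standing in for the ConceptNet subset
graph = {
    "protection": ["safety", "shelter"],
    "safety": ["security"],
    "security": ["guard"],
    "guard": ["watchman"],
}
print(expand(graph, ["protection"], max_depth=3))
```

Here 'watchman' sits four hops from 'protection', so the depth-3 cutoff excludes it — the same mechanism that kept our full graph to 4,538 concepts rather than all of ConceptNet.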

The last piece is to implement a function that takes keywords and scores extracted from documents as input and outputs an initiative label based on the concepts embedded in the knowledge graph. To do this, we use a simple softmax of keyword scores divided by the distance of each keyword to the initiative in the knowledge graph:

r_i = Σ (k = 1 to n) s(k) / d(k, i)

where r_i is the relevance score for initiative i over the n keywords extracted from a document, s(k) is the score for each keyword k, and d(k, i) is the distance from word k in the knowledge graph to the specific initiative i. We then take the softmax over all relevance scores:

softmax(r_i) = e^(r_i) / Σ (j ∈ m) e^(r_j)

where m = {education in crisis, protection, disability, women and girls} and r_j is the relevance score of the strategic initiative at index j in m.

For example, given a set of terms and their associated scores:

entity_scores = {'learning': .93, 'traumatic': .87, 'schooling': .80, 'unstable': .23, 'money': .77, 'poverty': .52}

We can calculate the shortest path between each of these terms and an associated initiative in the knowledge graph. The following shows the distances between these terms and ‘education in crisis’:

entity_distances = {'learning': 1, 'traumatic': 1, 'schooling': 1, 'unstable': 1, 'money': 2, 'poverty': 3}

Based on the entity scores and distances, we get the following relevance scores r for each initiative:

initiative_relevance_scores = {'women_and_girls': 1.17, 'education_in_crisis': 2.58, 'protection': 1.26, 'disability': 0.99}

We then apply softmax for each initiative i:

initiative_softmax_scores = {'women_and_girls': 0.142, 'education_in_crisis': 0.583, 'protection': 0.156, 'disability': 0.119}

This gives an upper-bounded score mapping the relevance of the entities extracted from a given document to the strategic initiatives the documents will be labeled with. In this case, the document would be labeled 'education in crisis' due to its high score of 0.583. The graph below depicts the holistic pipeline:

Process to Calculate Document Label Likelihood
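The scoring steps of the pipeline can be sketched in a few lines. This is a simplified reading of the formulas above, not the production implementation; the softmax is applied to the relevance scores from the worked example.

```python
import math

def relevance(entity_scores, entity_distances):
    """r_i = sum over keywords k of s(k) / d(k, i) for one initiative i."""
    return sum(entity_scores[k] / entity_distances[k] for k in entity_scores)

def softmax(scores):
    """Normalize a dict of relevance scores into an upper-bounded distribution."""
    exps = {i: math.exp(r) for i, r in scores.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Relevance scores from the worked example above
initiative_relevance_scores = {
    "women_and_girls": 1.17,
    "education_in_crisis": 2.58,
    "protection": 1.26,
    "disability": 0.99,
}
probs = softmax(initiative_relevance_scores)
label = max(probs, key=probs.get)
print(label, round(probs[label], 3))  # education_in_crisis 0.583
```

Because a keyword's contribution is divided by its graph distance, terms one hop from an initiative (like 'learning' for education in crisis) dominate, while distant terms (like 'poverty' at distance 3) contribute only a fraction of their extraction score.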

Addressing open source model bias

We quickly identified discriminatory language and bias while using ConceptNet to find terms related to the key strategic initiatives. We found the most prominent biases when we looked for terms related to ‘women and girls.’ Here are some of the terms we found while using the ConceptNet API: ‘woman is a gold digger,’ ‘woman is a nymphet,’ ‘woman is a baggage,’ ‘girl is a lolicon,’ ‘woman is an attendant,’ etc. We then looked at terms like ‘refugee’ and ‘disability,’ which are related to the initiatives, and found some disturbing results: ‘refugee related to refujew’ and ‘disability related to useless eater.’ Note that these are specific examples taken out of many valuable results, but regardless, it’s inexcusable to allow these associations to remain.

To address these issues, the first step we took was to remove all terms not found in the project proposal vocabulary. This eliminated much of the slang and discriminatory language that would never appear in an official document unless it was specifically relevant. The second was to remove associations with 'woman' that we deemed biased, including 'baggage,' 'servant,' and 'attendant.' We completed a manual review of all concepts directly related to the strategic initiatives and ensured this bias would not persist into future iterations of the knowledge graph we built. That said, we believe the ideal future process includes a group of subject matter experts from various backgrounds reviewing this terminology to promote inter-rater reliability and ensure all model bias is removed.
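These two filters — a vocabulary whitelist and a blocklist of flagged associations — can be sketched as a single pass over the graph. The graph, vocabulary, and blocklist below are illustrative placeholders, not the actual ConceptNet contents or our review output.

```python
def filter_graph(graph, vocabulary, blocklist):
    """Keep only nodes and edges whose terms appear in the proposal
    vocabulary, dropping any edge flagged as a biased association."""
    filtered = {}
    for node, neighbors in graph.items():
        if node not in vocabulary:
            continue  # slang/out-of-domain concepts never enter the graph
        kept = [n for n in neighbors
                if n in vocabulary and (node, n) not in blocklist]
        filtered[node] = kept
    return filtered

# Illustrative data only
graph = {"woman": ["mother", "attendant", "teacher"],
         "slangword": ["woman"]}
vocabulary = {"woman", "mother", "attendant", "teacher"}
blocklist = {("woman", "attendant")}
print(filter_graph(graph, vocabulary, blocklist))
# {'woman': ['mother', 'teacher']}
```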

ConceptNet has worked to address many of the issues we outlined here and has specifically highlighted its ability to reduce inherent gender bias in its model in a 2017 blog post. But despite its best efforts, there’s still clear and present bias that we were fortunately able to mitigate with our small knowledge graph subset.

There’s quite a bit of work left to do in the machine learning field to automatically remove bias from tainted datasets. An example of recent work aimed at reducing inherent bias outlines a method using neutralizing edits in Wikipedia articles to find instances of demographic bias, like using the term ‘mankind’ instead of ‘humankind’ or ‘his career’ as opposed to ‘their careers’ when describing a group of people (Pryzant et al., 2020). Although this work did not directly serve our use case, it’s encouraging to see the research community start to address this problem plaguing many of the most used machine learning datasets.

Tying it all together

Our implementation was incredibly fast and inexpensive. With the current limited-size knowledge graph, we were able to run it on a micro-instance virtual machine far cheaper than our $150,000 benchmark. As the solution is scaled to accommodate real-time document classification, this cost will increase but not above the target amount. Also, through our initial discovery, we saw very promising results for meeting the 85% required classification accuracy necessary to justify the scaling of this model. However, there’s still a significant amount of expert review required to verify these results.

There are many future refinements that could improve the current pipeline. The UN team is interested in leveraging expertise in its organization to embed domain-specific knowledge into the existing knowledge graph. This would seamlessly integrate into the current pipeline by allowing the distance metric to better link extracted keywords to the strategic initiatives most relevant to a given document. The distance metric itself is fairly simplistic and could be modified to better capture how graph distance between a keyword and an initiative maps to that term's relevance. The current solution is powerful, and it's truly just the first step in what will become a scalable, explainable natural language processing pipeline that helps the UN more efficiently allocate donor funds to the humanitarian crises that need them.

Learn about us

This work was delivered by Slalom's AI Center of Purpose, which is dedicated to improving the world with artificial intelligence. Learn more about this team and the associated Innovation for Good initiative.


Big thanks to Lambert Hogenhout, Kevin Bradley, Mulin Tang, and Jiayue Wu at the United Nations Emerging Technology Lab for partnering with our team at Slalom. Also, to Raki Rahman, Benjamin Chan, Adrien Galamez, Michelle Yi, Tony Ko, Zahra Zahid, Parul Patel, and many others at Slalom for making this solution come to life.