Re-thinking grant data: a graph approach?

Matthew Upson
Published in Wellcome Data
Nov 30, 2020 · 12 min read

Networks, also called graphs, are everywhere. It’s probably not an overstatement to say that the 21st century is the graph century; many of the innovations we take for granted are possible because of, or can be described by, graph theory. The internet, the physical network of cables and infrastructure that underlies the world wide web, can be represented by a graph. The world wide web itself can also be represented by a graph of websites connected by hyperlinks; indeed, Google’s first innovation was to use this graph as a means of measuring the importance of websites to sort search results (see the PageRank algorithm). Social networks are probably the most obvious application of graphs in the public imagination; of course, these networks existed in the real world long before Facebook and Twitter rendered them in software. This blog post is about how we are experimenting with similar graph approaches applied to the Wellcome Trust’s portfolio of grants.

Graphs for Wellcome grants

When thinking about Wellcome’s grants, graphs are useful because they allow us to model relationships between entities and use those relationships to make inferences. For example, one problem of interest to the Wellcome Trust is how to measure the similarity between grant applications. This is useful because it can help to: suggest an appropriate grant scheme to an applicant; improve search through the database of grants and grantees; find similar grants under different programmes; or identify areas that are not well served by existing funding.

Wellcome’s Data Labs have adopted a number of approaches for measuring grant similarity, but these have mostly used the textual information that accompanies a grant application: for example the title, the synopsis, or the full text of the grant. Natural language processing (NLP) approaches have been used to measure the similarity between grants, for example by measuring the distance between vectors that represent each grant, generated using word embedding methods. Whilst these approaches have been successful, they do not use all the data available in the accompanying metadata: for example, when two similar grants share an author, or when they originate from the same organisation. Making use of this ‘structural’ information may allow us to improve our estimates of similarity, whilst providing other benefits, such as allowing more complex queries to be made of the data.
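To make this concrete, one common choice for comparing two grants’ embedding vectors is cosine similarity. A minimal sketch, with made-up 4-dimensional vectors standing in for the real high-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings standing in for the real
# embedding vectors of two grant synopses.
grant_a = np.array([0.1, 0.8, 0.3, 0.4])
grant_b = np.array([0.2, 0.7, 0.1, 0.5])

print(f"Similarity: {cosine_similarity(grant_a, grant_b):.3f}")
```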

Wellcome has a great deal of data about the many thousands of successful and unsuccessful grant applications. A simple approach to representing this data in a graph is to make the grants the nodes, and connect them to each other based on the grant metadata. For instance, we could use co-occurrence of grant contacts (i.e. researchers named as contacts on the grant), connecting nodes (the grants) with edges (the connections) when grants share the same principal author; this might look as follows.

A simple graph of grant nodes connected by edges of contact co-occurrence.
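A minimal networkx sketch of this construction, assuming a list of grant records with hypothetical `id` and `contacts` fields:

```python
from itertools import combinations
import networkx as nx

# Hypothetical grant records; in practice these would come from
# Wellcome's grants database.
grants = [
    {"id": "G1", "contacts": ["smith", "jones"]},
    {"id": "G2", "contacts": ["smith"]},
    {"id": "G3", "contacts": ["patel"]},
]

G = nx.Graph()
G.add_nodes_from(g["id"] for g in grants)

# Connect two grants whenever they share at least one contact.
for g1, g2 in combinations(grants, 2):
    if set(g1["contacts"]) & set(g2["contacts"]):
        G.add_edge(g1["id"], g2["id"])

print(G.edges())  # [('G1', 'G2')]
```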

The intuition behind this approach is that researchers typically work in one or a few subject areas, hence if the same researcher is named on two grants, those grants are likely to be similar.

Another approach is to look at how papers arising from grants are cited by papers arising from other grants. This works on the assumption that any paper arising from a grant will reflect its content, and will predominantly cite papers from the same field. We create an edge between two grants if a paper arising from one grant is cited by a paper arising from a second grant. For example, in the illustration below we take the graph of co-citations (a.), in which papers (P) arise from the funding of grants (G), and collapse this information into a simpler graph (b.) where the ‘weight’ of each edge depends on the number of citations in a. Note that in practice we are unlikely to use a co-citation graph on its own, because it would not capture information about grants that do not produce scientific papers as an output.

A graph of grants connected by co-citations in articles arising from the grant.
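A sketch of the collapsing step, assuming we already know which grant funded each paper and which papers cite which (all identifiers here are hypothetical):

```python
from collections import Counter
import networkx as nx

# paper -> funding grant, and citing paper -> cited papers (made-up data).
funded_by = {"P1": "G1", "P2": "G1", "P3": "G2", "P4": "G3"}
cites = {"P1": ["P3"], "P2": ["P3", "P4"], "P3": [], "P4": []}

# Count citations between grants via their papers.
weights = Counter()
for paper, cited_papers in cites.items():
    for cited in cited_papers:
        g1, g2 = funded_by[paper], funded_by[cited]
        if g1 != g2:
            weights[frozenset((g1, g2))] += 1

# Build the collapsed, weighted grant-grant graph.
G = nx.Graph()
for pair, weight in weights.items():
    G.add_edge(*pair, weight=weight)

print(G.edges(data=True))
# e.g. [('G1', 'G2', {'weight': 2}), ('G1', 'G3', {'weight': 1})]
```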

These simplified graphs, while potentially useful, still do not capture all of the information that we have about a grant; to do this we can use a heterogeneous entity graph, or knowledge graph, which represents all of the information we hold on a grant. Expanding on the previous graph, the below example shows a more complete graph including entity types for authors (A), organisations (O), grants (G), and papers (P), and edge types denoting the various relationships between these entities. We could grow this graph further by including the authors and institutional affiliations of all papers arising from a grant, or by linking authors to individual papers, not just individual grants. We might also include the department from which a researcher comes, rather than the entire organisation.

A simple knowledge graph incorporating organisations, contacts, and grants as nodes, with edges of various types.

In the proof of concept project, we tried all three of the above graphs, which we call the ‘grant-author graph’, the ‘co-citation graph’, and the ‘knowledge graph’. We used a slightly different knowledge graph architecture to the one shown above, which included authors, organisations, grants, and funding area (a tag applied by Wellcome denoting the rough funding area).

Graphs also allow us to set attributes on each of the nodes (and the edges), which can be numerical arrays of arbitrary size. For each of our three graphs, we added to the grant nodes the vector representing the grant synopsis, generated using SciBERT. We created the graphs using a Python library called networkx.
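A sketch of attaching a synopsis embedding to a grant node in networkx; the `embed_synopsis` function here is a random stand-in for the real SciBERT encoding pipeline:

```python
import networkx as nx
import numpy as np

def embed_synopsis(text: str) -> np.ndarray:
    """Stand-in for a real SciBERT encoder returning a 768-d vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=768)

G = nx.Graph()
synopses = {"G1": "Malaria vaccine trial...", "G2": "Genomic epidemiology..."}

# Store the synopsis embedding as a node attribute on each grant node.
for grant_id, text in synopses.items():
    G.add_node(grant_id, feature=embed_synopsis(text))

print(G.nodes["G1"]["feature"].shape)  # (768,)
```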

Using the graph

Now that we have our data represented in a graph, how can we make inferences from it? In this proof of concept, we were primarily interested in how to generate vector representations of the entities in our graphs, analogous to the word embeddings used in NLP. For each node in the graph, we would create a vector of numbers that captures useful information about the relationship of that node to other nodes in the network. Our motivation was that this would allow us to create a library of grant ‘node embeddings’ that could be used as a drop-in replacement for our existing word embeddings in any downstream machine learning task. This meant that we needed to create our embeddings in an unsupervised manner, i.e. without a definitive downstream task in mind, and that we would need to generate a vector for every Wellcome grant.

This approach has the problem that there is no single way to assess the quality of the node embeddings. If we had intended to use the node embeddings in a single downstream machine learning task, we could simply have measured the performance of our embeddings on that task against a baseline model. Since we wanted to develop embeddings that were task agnostic, we developed three baseline models for problems of varying difficulty and measured node embedding performance against all of them. Our baseline models were simple classifiers with the SciBERT embeddings of the grant synopses as features: the same SciBERT embeddings that we attached to the grant nodes as attributes in our graph. To test the graph embeddings, we substituted the node embeddings for the SciBERT embeddings and re-ran the same models using a replicated k-fold cross validation strategy.

A table showing three machine learning tasks of varying difficulty.
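A sketch of this evaluation setup using scikit-learn; the logistic regression stands in for whichever simple classifier was used, and the random `X` and `y` stand in for the real embedding matrix and task labels (the 30 × 5-fold design follows the replication described later in the results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# X: (n_grants, n_dims) embedding matrix; y: task labels.
# Random data here stands in for the real grant embeddings and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 768))
y = rng.integers(0, 2, size=500)

# 5-fold cross validation repeated 30 times gives 150 scores per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=30, random_state=42)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"Mean f1: {scores.mean():.2f} (sd {scores.std():.2f})")
```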

Creating node embeddings

Once we had a framework for evaluation in place, we created embeddings for the graph nodes using the StellarGraph framework to implement graph neural networks (although there are multiple other options, including Spektral, Deep Graph Library, and Graph Nets). Within these frameworks there are many options for creating node embeddings; the field of graph neural networks has taken off since around 2013 and has become a popular subject of research.

Early approaches to creating a vector approximation of a node’s neighbourhood used a ‘random walk’ approach, essentially treating a node as a word within a random ‘sentence’ of nodes incorporating the target node and several nodes in its neighbourhood. Vector representations for each node are then created by feeding these ‘sentences’ to a word2vec algorithm, generating embeddings in the same way as for a corpus of natural language sentences.
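A condensed, DeepWalk-style sketch of this idea, using networkx for the walks and gensim’s Word2Vec (gensim 4.x API) for the embeddings; the toy graph stands in for a real grant graph:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G: nx.Graph, start, length: int = 10) -> list:
    """Generate one random 'sentence' of node ids starting from `start`."""
    walk = [str(start)]
    node = start
    for _ in range(length - 1):
        neighbours = list(G.neighbors(node))
        if not neighbours:
            break
        node = random.choice(neighbours)
        walk.append(str(node))
    return walk

# A toy graph stands in for the real grant graph.
G = nx.karate_club_graph()
walks = [random_walk(G, node) for node in G.nodes() for _ in range(10)]

# Treat each walk as a sentence and learn embeddings with skip-gram.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
print(model.wv[str(0)].shape)  # (64,)
```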

We experimented with the random walk approach on our knowledge graph, since this was a highly connected graph that was likely to benefit from such an approach (more on this later). We also experimented with Graph Convolutional Networks (GCNs), and a particular implementation of a GCN called GraphSAGE. GraphSAGE is a particularly promising approach because it is ‘inductive’, i.e. it allows embeddings to be calculated for previously unseen nodes. Since the Wellcome Trust receives many thousands of grant applications each year, inductivity is a desirable characteristic.
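The sketch below condenses the pattern from StellarGraph’s documented unsupervised GraphSAGE example, and is not our exact training code; `G` is assumed to be a StellarGraph object with node features already attached, and walk lengths, batch size, and layer sizes are illustrative:

```python
from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import GraphSAGELinkGenerator
from stellargraph.layer import GraphSAGE, link_classification
from tensorflow import keras

# G is assumed to be a StellarGraph with feature vectors on its nodes.
nodes = list(G.nodes())

# Short random walks provide (target, context) node pairs as the
# unsupervised training signal.
unsupervised_samples = UnsupervisedSampler(G, nodes=nodes, length=5, number_of_walks=1)
generator = GraphSAGELinkGenerator(G, batch_size=50, num_samples=[10, 5])
train_gen = generator.flow(unsupervised_samples)

# Two layers, sampling 10 then 5 neighbours; the final layer size is the
# embedding dimension.
graphsage = GraphSAGE(layer_sizes=[128, 128], generator=generator, bias=True, dropout=0.0)
x_inp, x_out = graphsage.in_out_tensors()

# Train by predicting whether a pair of nodes co-occurred on a walk.
prediction = link_classification(output_dim=1, output_act="sigmoid", edge_embedding_method="ip")(x_out)
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss=keras.losses.binary_crossentropy, metrics=["acc"])
model.fit(train_gen, epochs=5)
```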

Unlike random walk algorithms, which only capture structural information about the graph, GCNs also capture information about the attributes of the graph. These attributes can be any numerical information relevant to a particular node; in our case, for grant nodes, the SciBERT embeddings of the grant synopsis. A GCN combines this information with information from a target node’s neighbours; this combined information is then passed to the next ‘layer’ of nodes, which complete the same process, and so on. Hence, the convolutional aspect of the neural network comes not from a typical convolutional neural network (CNN) structure, but from the way that information from the previous node and its neighbours is combined with information from the present node and its neighbours, and then passed on to the next node. Similarly, when we talk about the concept of depth within a GCN, this has nothing to do with the number of hidden layers in our neural networks; it relates to how many steps are made across the graph, i.e. how many times information from one node is combined with that of its neighbours and passed on.
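A minimal numpy sketch of a single GCN propagation step; the symmetric normalisation follows Kipf and Welling’s formulation, and the feature matrix and weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Adjacency matrix of a toy 4-node graph, with self-loops added so each
# node also keeps its own features when aggregating.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A + np.eye(4)

# Symmetric degree normalisation: D^(-1/2) A_hat D^(-1/2).
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

H = rng.normal(size=(4, 8))   # node features (e.g. synopsis embeddings)
W = rng.normal(size=(8, 4))   # learnable layer weights

# One GCN layer: aggregate neighbour features, transform, apply ReLU.
H_next = np.maximum(0, A_norm @ H @ W)
print(H_next.shape)  # (4, 4)
```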

GraphSAGE works in a similar way but SAmples the neighbourhood around a node (as opposed to combining information from all neighbours) before AGgrEgating the information. This approach ensures both that the computational complexity of the problem remains constant for each node, and that embeddings can be calculated for unseen nodes, since it is not necessary to know (and aggregate) all a node’s neighbours in advance: we just need a sample.
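For reference, the mean-aggregator form of the GraphSAGE update from Hamilton (2017) can be written as:

$$h_v^{(k)} = \sigma\left(W^{(k)} \cdot \mathrm{CONCAT}\left(h_v^{(k-1)},\; \mathrm{MEAN}\left(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\right)\right)\right)$$

where $h_v^{(k)}$ is the representation of node $v$ after $k$ aggregation steps, $\mathcal{N}(v)$ is a fixed-size sample of $v$’s neighbours, and the weights $W^{(k)}$ are shared across all nodes, which is what makes the method inductive.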

In the example below (reproduced from a lecture by Jure Leskovec, co-author on the original GraphSAGE paper) we can see that the embeddings for nodes A and B comprise information from every other node in the graph (including themselves). In a real application this neighbourhood around A and B would be sampled, rather than including every node. The blank boxes represent a neural network which could be a fully connected network with a single hidden layer. Inductivity is ensured by sharing the parameters used in all of these neural networks, rather than learning them independently for each blank box (neural network).

A visual explanation of the GraphSAGE algorithm.

Adapted from Hamilton, 2017 and a lecture by co-author Jure Leskovec given at Stanford University.

Note that GraphSAGE as described by Hamilton (2017) is not readily applicable to knowledge graphs, so for this application we used Heterogeneous GraphSAGE (HinSAGE), an adaptation of GraphSAGE available in the StellarGraph framework that learns a unique parameter matrix W for each combination of distinct node types. In our knowledge graph, for example, unique W matrices would be learnt for each pairing of our node types: grant-author, grant-organisation, grant-funding area, organisation-author, organisation-funding area, etc.

Results and outcomes

The grant-author graph

Our best performing embedding was created using a GCN applied to the grant-author graph. This embedding performed similarly to the baseline, with a mean f1 score of 0.76 compared to the baseline’s 0.77 (calculated from 30 replicates of 5-fold cross validation, giving 150 scores) on the easy task. This performance was particularly good given that the node vectors had just 128 dimensions (compared to the baseline’s 768), of which on average 80 were zeros. Against the medium and hard problems, the grant-author embeddings performed less well.

A table showing the results derived from the grant-author graph and the baseline

The figure below shows part of the grant-author graph, which is made up of a large number of unconnected subgraphs, some of which probably represent subject areas, or research groups. That these embeddings were the best performing is surprising given that we expected the algorithms to perform better on a more fully connected graph.

Hundreds of clusters of nodes, well connected in their clusters, but disconnected from each other from the grant-author graph

The co-citation graph

Embeddings derived from the co-citation graph did not perform well, but the approach suffered from missing data. To produce a complete graph we would need a record of all papers arising from all Wellcome grants, and the papers that they cite. Not all grants result in published research papers, and even when they do, the papers do not always get reported and linked back to the Wellcome grant. Furthermore, we accessed the Europe PubMed Central (EPMC) database to find the citations for each paper, and this information is not always complete. The result was that we could only generate embeddings for approximately 13% of the grants used in our baseline model.

A table showing the results of the co-citation graph.

Like the grant-author graph, the co-citation graph was made up of thousands of unconnected subgraphs, typically larger than the subgraphs of the grant-author graph, but still unconnected. We expect that this information is likely to be most useful when incorporated into a larger knowledge graph.

The knowledge graph

As expected, the knowledge graph showed a much higher degree of connectivity than the two simpler graphs (note the below image shows only about 15% of the whole graph).

A visualisation of the Wellcome Trust knowledge graph showing a large central cluster of nodes that are well connected.

Unfortunately, the knowledge graph node embeddings did not perform as well as the baseline, but this approach also suffered from missing data. In this case, when using the HinSAGE algorithm, each node of every type must have attributes attached to it for the algorithm to work effectively. As noted, we used SciBERT embeddings of the grant synopsis to add a 768 dimensional vector to each grant node.

Finding suitable attributes for the organisation and author nodes was a challenge. For the author nodes, a sensible approach would have been to use SciBERT embeddings of the abstracts of each author’s papers, which would ensure a similar type of information to the SciBERT vectors of the grant synopses. This solution was not practicable in the relatively short timescale of the proof of concept, as there are a number of non-trivial adjacent problems that would need to be solved first, such as author disambiguation and handling missing data. In the event, we experimented with initialising the non-grant entity nodes randomly or with zero vectors; neither solution was particularly successful. It’s clear that to generate effective HinSAGE embeddings, we need to spend time developing suitable attributes for all the nodes.
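A sketch of the two fallback initialisations we tried for nodes without natural features (the dimension and node ids are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # match the SciBERT embedding dimension of the grant nodes

author_ids = ["A1", "A2"]
org_ids = ["O1"]

# Option 1: zero vectors for all non-grant nodes.
zero_features = {node: np.zeros(DIM) for node in author_ids + org_ids}

# Option 2: random vectors, giving each node a distinct (but meaningless)
# starting representation.
random_features = {node: rng.normal(size=DIM) for node in author_ids + org_ids}
```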

A table showing results for embeddings produced from the knowledge graph by node2vec and HinSAGE.

Knowledge graphs are increasingly widely used, and several exist in academia which we could incorporate into our Wellcome knowledge graph — not least Microsoft’s Academic Knowledge Graph. However, the sheer size of this graph creates a computational challenge that was well beyond the scope of this initial project.

Future directions

Whilst our initial research into using graphs did not provide us with new embeddings to supplant the existing ones, the project has opened up a previously unexplored set of tools and approaches.

As a means of storing data, graph representations enable more intuitive yet sophisticated queries to be made of the data, based on semantic relationships. This project has generated interest, and a first codebase, with which to take this work forward. Storing Wellcome’s grant data in a graph database such as Neo4j or Amazon Neptune would not only allow us to interrogate our data in new ways directly on the database, but would also provide the basis for a more complete knowledge graph from which we could generate better node embeddings.
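As an illustration of the kind of semantic query this would enable, the sketch below uses the official Neo4j Python driver with an entirely hypothetical schema (node labels, relationship types, and credentials are made up):

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical schema: find pairs of grants that share a named contact.
query = """
MATCH (g1:Grant)<-[:CONTACT_ON]-(a:Author)-[:CONTACT_ON]->(g2:Grant)
WHERE g1 <> g2
RETURN g1.title AS grant_1, g2.title AS grant_2, a.name AS shared_contact
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["grant_1"], record["grant_2"], record["shared_contact"])

driver.close()
```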

Finally, one of the areas we did not explore in this proof of concept was graph-native algorithms. These are algorithms that can be executed on the graph itself, without the need to first derive embeddings from it. Such algorithms could provide powerful and performant solutions to problems that the Wellcome Trust Data Labs are trying to solve, for example community detection.
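For instance, a community detection pass could run directly on a grant graph in networkx, with no embedding step; a minimal sketch on a toy graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# A toy graph stands in for the real grant graph.
G = nx.karate_club_graph()

# Greedy modularity maximisation groups nodes into communities using
# only the graph structure itself.
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"Community {i}: {sorted(community)}")
```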
