Finding shoulders to stand on: predicting scientific discovery with graph networks

Will G
Published in One Cool Thing
3 min read · May 10, 2024
Photo by Alina Grubnyak on Unsplash

Paper here (Note: paywall). Another synopsis available from The Batch here (bottom of page).

Executive Summary for Managers/Leaders:

What is it?: An ML method that predicts likely scientific discoveries by learning the structured relationships among the subject of an investigation, the desired properties of that subject, and the scientists doing the work. By encoding the relationships between authors/investigators as explicit connections, the method identifies new discoveries more accurately than approaches that use only the text of prior studies.

Why should you care?: Many organizations encode some of their most important institutional knowledge in text, but it’s the context for those documents that gives them significance. People wrote those documents, and those people work most closely with other people. Graph networks provide information on those social/structural connections. However good knowledge retrieval with an LLM may be, providing known structure or context that exists outside of the texts themselves can likely make that retrieval better.

What questions should you be asking your DS/ML folks?:

  • In addition to text documents, what other information or data sources do you need? Or what else might be important for context related to those documents?
  • (If they’re presenting a graph network) How have you selected what is a node or an edge? What decisions led you to that design?
  • What connections might be important that we aren’t thinking about right now, or that aren’t already part of your design?

Summary for Data Scientists/ML Engineers/The-Technically-Curious

What is it?: The authors created a graph that linked materials science researchers/paper authors to materials and to the properties of those materials. They developed a similar structure for pharmaceutical discovery. The goal was to identify areas of the graph that lacked coverage but were potentially connected to desirable outcomes in terms of material property innovation. They treated these gaps as predicted connections. With this graph embedding method, they predicted novel discoveries more accurately than a baseline that applied Word2Vec to prior publications alone.
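To make the "gaps as predicted connections" idea concrete, here is a minimal Python sketch. It is not the authors' method: instead of a learned graph embedding, it uses a much simpler stand-in that scores each unpublished material/property pair by how many authors have worked on both sides of the gap. The records, the names (mat_A, prop_X, alice, and so on), and the rank_gaps function are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records, one per paper: (authors, material, property).
PAPERS = [
    ({"alice", "bob"}, "mat_A", "prop_X"),
    ({"bob", "carol"}, "mat_B", "prop_Y"),
    ({"carol"}, "mat_C", "prop_X"),
    ({"dave"}, "mat_D", "prop_Z"),
]


def rank_gaps(papers):
    """Rank material/property pairs that no paper has linked yet.

    Score = number of authors who have published on the material AND,
    in some paper, on the property. This is a simplified stand-in for
    a graph embedding, assumed here purely for illustration.
    """
    author_materials = defaultdict(set)
    author_properties = defaultdict(set)
    known = set()  # material/property links already in the literature

    for authors, material, prop in papers:
        known.add((material, prop))
        for author in authors:
            author_materials[author].add(material)
            author_properties[author].add(prop)

    scores = defaultdict(int)
    for author, materials in author_materials.items():
        # Each author bridges every material they know to every property they know.
        for pair in product(materials, author_properties[author]):
            if pair not in known:
                scores[pair] += 1

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for (material, prop), score in rank_gaps(PAPERS):
        print(f"{material} + {prop}: bridged by {score} author(s)")
```

On this toy data, the top-ranked gap is mat_B with prop_X, because two authors (bob and carol) have experience on both sides of it. A pair like mat_D with prop_X never shows up at all, since no author connects them, which is exactly the kind of blind spot I come back to in the bias question further down.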

What is cool about it?: One cool thing is the framing/encoding of the social aspects/dynamics of research into the prediction problem. Science isn’t just a process of reading prior knowledge in isolation and then doing something that extends or enhances it. It’s done in collaboration, and those networks matter. The other cool thing was identifying a useful embedding that captured these dynamics.

Questions I am thinking about:

  • What about bias? A graph constructed in this manner would encode relationships as they have been, but would not necessarily drive as-yet-unidentified collaborations that could be even more beneficial. And unexpected connections are frequently a driver of innovation.
  • What about funding sources? Underneath this network of research collaborators is a system that provides the money to perform that research. “Successful” research begets funding begets more research, etc. What impact might funding have on this picture that is not explicitly called out?
  • Where else could this method be applied? The materials science and drug discovery examples are good, practical ones, but I wonder what other areas could benefit from a derivative of this approach.


I write about the joys of fatherhood and motoring, and some cool things in the world of AI/ML