Visual interrogation of a cancer knowledge graph using Neo4j Bloom

Live Data Concepts
May 18, 2019


What is a knowledge graph?

As a trained molecular biologist, I have always wondered how one could capture knowledge about a particular domain of interest. For me, this domain is the discovery and development of drugs for cancer. Like most people, I started by setting up a list of drugs and the companies that develop them. Soon, however, it became apparent that a simple list in Excel would not suffice to capture an understanding of this knowledge domain.

Having worked in an information technology environment for years, I knew that a list typically grows into a database of sorts, with tables and relationships connecting them. Naturally, as my understanding of the domain increased, so did the list, which had by now become a collection of lists. Clearly, a database was needed to take the next step. Having seen many databases being defined and populated, I knew about the problems relational databases face when a domain is still evolving and cannot be clearly defined at the outset. I wanted to avoid this problem and became fascinated by the emerging idea of defining nodes and connecting them via relationships: a graph database.

Luckily for me, I came across a then-nascent database vendor called Neo4j and its database product, whose community edition was, and still is, completely free. How I managed to get my knowledge from lists into a graph database can be a topic for another story. This story is about exploring knowledge accumulated over several years about biotech and pharma companies, their drugs in development, and the clinical trials they run to determine safety and efficacy before gaining approval to market and sell them.

Neo4j Desktop and Bloom

Neo4j Desktop is, as the name implies, a desktop application for managing a graph database locally or remotely. For simplicity's sake I run my database locally, but it could easily be deployed to the cloud. Neo4j comes with a browser application that is set up to visualize its data as a graph, and this works well for simple graphs and clearly defined questions. For deeper exploration, especially visual and interactive exploration of database content from a broad perspective as well as in detail, another tool is needed. Meet Bloom, a plug-in for Neo4j Desktop. Here is a first look at what I learned by interacting visually with my knowledge domain using Bloom.

Biotechs and Pharmas focused on Oncology

The universe of companies developing drugs for cancer is substantial. My database covers over 1000 companies, yet I don’t consider it a large database, and it is definitely not big data. It is, however, highly connected, meaning there are a lot of relationships connecting various types of data. An attempt to look at all Companies and their immediate connections therefore results in the following, which I refer to as the typical graph hairball problem.

Fig.1: An artful hairball of nodes and relationships

Luckily, Bloom has tools that allowed me to do the necessary cleanup by removing nodes of types I don’t care about at this time. I wanted to limit my visual to Companies and Drugs only, to get a high-level picture of connections in this simplified landscape.

Fig.2: Established, mid-stage and start-up companies all together
Fig.3: A close-knit network
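
The cleanup described above, keeping only Companies and Drugs, happens interactively in Bloom. As a rough illustration of the idea only, here is a minimal Python sketch on an in-memory toy graph. All node names, labels and edges are hypothetical and are not taken from my database, and the function is not part of Bloom or Neo4j.

```python
# Hypothetical toy graph: node id -> label (illustrative data only).
nodes = {
    "CompanyA": "Company",
    "CompanyB": "Company",
    "Drug1": "Drug",
    "Trial1": "ClinicalTrial",
    "Event1": "Event",
}

# Undirected connections between node ids (also hypothetical).
edges = [
    ("CompanyA", "Drug1"),
    ("CompanyB", "Drug1"),
    ("Drug1", "Trial1"),
    ("CompanyA", "Event1"),
]


def filter_by_labels(nodes, edges, keep):
    """Keep only nodes whose label is in `keep`, plus edges between kept nodes."""
    kept_nodes = {n: label for n, label in nodes.items() if label in keep}
    kept_edges = [(a, b) for a, b in edges if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges


# Reduce the "hairball" to Companies and Drugs only.
kept_nodes, kept_edges = filter_by_labels(nodes, edges, {"Company", "Drug"})
```

Dropping the edges that touch a removed node matters as much as dropping the node itself, since dangling relationships are what make the hairball so dense.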

Connections between companies are represented via the molecules that two companies are working on together. This can be due to an Event, such as a collaboration, or, as is most often the case, a Clinical Trial. As one would expect, the center is made up of the household names of biotech and pharma, such as Novartis, Merck, Amgen, Celgene and more. They are highly connected since they tend to work together on the same molecules used in clinical trials. The center is surrounded by smaller companies with a few drugs in development: biotechs with an early drug development pipeline. Even further out are companies that do not yet have any drugs in their pipeline but have received early funding to develop candidate molecules. They are waiting to be connected once they reveal the molecules selected for development into a drug. Depending on one's interest, one can focus on any of these regions of the knowledge landscape.
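
The way companies become connected through shared molecules can be sketched in a few lines of Python. This is only an illustration under assumptions: the company and drug names are hypothetical placeholders, and the two helper functions are my own, not something Bloom or Neo4j provides. The sketch links two companies whenever they share a drug, then counts each company's links, which roughly mirrors why well-connected companies end up in the center and unconnected ones on the outside.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: all companies, and (company, drug) pairs for drugs they work on.
companies = ["CompanyA", "CompanyB", "CompanyC", "CompanyD", "CompanyE"]
develops = [
    ("CompanyA", "Drug1"),
    ("CompanyB", "Drug1"),
    ("CompanyA", "Drug2"),
    ("CompanyC", "Drug2"),
    ("CompanyD", "Drug3"),  # has a drug, but shares it with no one
    # CompanyE has no drugs in its pipeline yet
]


def company_links(develops):
    """Map each pair of companies to the set of drugs they both work on."""
    by_drug = defaultdict(set)
    for company, drug in develops:
        by_drug[drug].add(company)
    links = defaultdict(set)
    for drug, members in by_drug.items():
        for a, b in combinations(sorted(members), 2):
            links[(a, b)].add(drug)
    return dict(links)


def link_counts(companies, links):
    """Count how many other companies each company is linked to."""
    counts = {c: 0 for c in companies}
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts


links = company_links(develops)
counts = link_counts(companies, links)
```

In this toy data CompanyA sits in the "center" with two links, while CompanyD and CompanyE remain unconnected on the periphery.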

Delving into the details of an early stage company

Fig.4: Company details

Bloom enables a dive into the details for each node and relationship. Sometimes not much is known about a company, but the knowledge of its existence by itself is important to acknowledge. As more information becomes available it can easily be added to the database and connected with the respective company record or records. The flexibility of dealing with information whether or not it is available and the option of dealing with one or many relationships is a big selling point for Neo4j and makes it super easy to extend an existing data model.
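
To make the point about extending a model concrete, here is a minimal Python sketch of a property graph held in plain dictionaries. It is not Neo4j's API, and every name in it (the helper functions, "StealthCo", "Molecule9", the DEVELOPS relationship type, the `founded` property) is a hypothetical placeholder. It only shows that a node can start with almost no information and gain properties and relationships later without any up-front schema change.

```python
# A toy property graph: nodes carry a label and free-form properties.
graph = {"nodes": {}, "rels": []}


def add_node(graph, node_id, label, **props):
    """Create a node with whatever properties are known right now."""
    graph["nodes"][node_id] = {"label": label, "props": dict(props)}


def set_props(graph, node_id, **props):
    """Add or update properties on an existing node."""
    graph["nodes"][node_id]["props"].update(props)


def relate(graph, src, rel_type, dst):
    """Connect two existing nodes with a typed relationship."""
    graph["rels"].append((src, rel_type, dst))


# At first only the company's existence is known.
add_node(graph, "StealthCo", "Company", name="StealthCo")

# Later, more information arrives and is simply attached.
set_props(graph, "StealthCo", founded=2018)
add_node(graph, "Molecule9", "Drug", name="Molecule9")
relate(graph, "StealthCo", "DEVELOPS", "Molecule9")
```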

Understanding a knowledge neighborhood

Molecular Targets are what drug molecules typically act on to influence a disease process. As novel targets get discovered, companies start to emerge that see an opportunity to develop therapeutic molecules. By setting up a predefined, parameterized search with auto-suggestions in Bloom, one can start to interactively investigate the data landscape around a molecular target.

Fig.5: Defining a basic lookup for Molecular Target by Name

Starting with a single node, I can expand and grow the landscape relationship by relationship. This is crucial because, as we have seen earlier, hairball graphs are an all too common distraction.

Fig.6: Expanding a target neighborhood

A few steps of expanding and shrinking the graph let me generate a picture of a connected data landscape representing companies that are developing drugs targeting a molecule called CD47. I call it the CD47 star system; others might refer to it as the Molecular Target 360-degree view. Bloom’s tools to expand and dismiss nodes are excellent for trimming a graph to a desired shape, almost like getting a great haircut.

Fig.7: A more complete picture of target, molecules and companies with added events
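
The expand-and-dismiss workflow described above can be sketched in Python as a visible set of nodes that grows one relationship type at a time. This is an illustration under assumptions, not how Bloom works internally: apart from CD47, all molecule, company and event names are hypothetical, and so are the relationship type names.

```python
# Hypothetical triples: (source, relationship type, target).
relationships = [
    ("Molecule1", "TARGETS", "CD47"),
    ("Molecule2", "TARGETS", "CD47"),
    ("CompanyA", "DEVELOPS", "Molecule1"),
    ("CompanyB", "DEVELOPS", "Molecule2"),
    ("CompanyA", "PARTICIPATES_IN", "Event1"),
]


def expand(visible, relationships, rel_type):
    """Return `visible` plus all nodes one hop away via `rel_type`, in either direction."""
    added = set()
    for src, rel, dst in relationships:
        if rel != rel_type:
            continue
        if src in visible:
            added.add(dst)
        if dst in visible:
            added.add(src)
    return visible | added


def dismiss(visible, unwanted):
    """Remove nodes from the visible scene."""
    return visible - set(unwanted)


scene = {"CD47"}                                   # start from a single target node
scene = expand(scene, relationships, "TARGETS")    # adds Molecule1 and Molecule2
scene = expand(scene, relationships, "DEVELOPS")   # adds CompanyA and CompanyB
scene = dismiss(scene, {"CompanyB"})               # trim what is not of interest
```

Because each step follows only one relationship type, Event1 never enters the scene unless it is asked for, which is what keeps the picture from turning back into a hairball.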

My wishlist of Bloom features

As with any tool, there are usually a few things that could be added to increase its utility even further. As I was navigating this database for further knowledge discovery, I came across the following features I would love to see available in Bloom:

  • Hyperlinks on the property panel to enable links to outside resources. Not all information is typically in a single place, and getting to it via a single click is crucial.
  • Ability to expand multiple nodes by a particular relationship or related node type only, which would allow for fewer expand and dismiss steps.
  • Launching multiple windows to enable simultaneous perspectives on the same data, much like a browser has many tabs. Sometimes different perspectives are needed to fully understand a data landscape.
  • Panning on a layout to view items that have disappeared from view.
  • Additional layout algorithms to change the way the graph layout is organized.
  • Picture overlays on nodes based on a picture URL. An icon is great for identifying what kind of node is represented, but in some cases a logo is preferred.

Closing thoughts

Connectivity is the name of the game with graphs. In biology in particular, most “things” will in the end have many-to-many relationships with each other, even if they don’t start out that way. Schema-based data models, such as the ones developed with relational databases, cannot easily be modified after they have been defined and populated. Graph-based models can evolve easily as more knowledge accumulates. Graph-based, interactive explorations work well to generate 360-degree views of “things”, such as a Company, Molecular Target, Therapeutic Molecule, Event and so forth, each forming a knowledge star system. Once we connect these star systems and view them from a distance, we see a galaxy of knowledge, and eventually a knowledge universe. Neo4j Bloom is a lens that lets us view and ask about “things” in this universe from a distance or close up.
