The Graph Computing Differentiator for Life Sciences

Farshid Sabet
Katana Graph

Few industries are as data-intensive — and as highly regulated — as life sciences. Organizations in this space routinely contend with data at a scale few other verticals encounter, especially for mission-critical use cases like precision medicine and bringing new pharmaceuticals to market.

Moreover, life science applications often involve sensitive data (including PII) about patients and their health. The regulatory requirements governing who can access this data, where it resides, and how it's secured are as extensive as they are severe.

Therefore, efficiency and effectiveness, horizontal virtues in any industry, take on new meaning here. These advantages aren't simply about completing tasks faster, decreasing time to market, reducing costs, or boosting revenues.

They’re ultimately about saving lives and considerably improving them for patients in need.

By employing progressive graph computing paradigms, organizations can realize these societal benefits. These methods enable life science companies to make better predictions, more easily than they could before. For example, a graph neural network recently outperformed long-standing predictive baselines in this space.

These graph computing paradigms are helpful for chemical compound property predictions that make precision medicine a reality and bring new pharmaceuticals to patients. Pairing graph computing with data flow technologies enables firms to complete the entire pipeline for these use cases in a single platform, which organically integrates with tools common to this vertical like RDKit, Dask, and others.

By letting organizations access these external resources through a comprehensive graph intelligence platform, this solution results in negligible data movement, more effective treatments for society, and increased revenues.

Flawless Tooling Integrations

Chemical compound property predictions for precision medicine and creating new drugs have historically used narrow, siloed training sets with limited applicability. With graph intelligence, there is now the opportunity to build elaborate data pipelines spanning different data and computational models. These can include integration with external toolsets, including advanced analytics frameworks like Python's Dask. This approach is useful for predicting a wide range of biological activities, side effects, toxicology, and properties using not just preclinical data (like chemistry, screening results, etc.) but phenotypic, clinical, and post-market data, too. Access to bespoke capabilities like RDKit within the graph platform is critical for effecting a competitive advantage in this space.

RDKit is a valuable asset when working in life sciences, particularly for chemical compound property predictions (which impact precision medicine and new pharmaceutical development). It's an industry-specific, open-source cheminformatics library for running analytics on chemicals and molecules. Top graph computing solutions natively integrate with this valued resource via a dedicated RDKit cartridge. The seamlessness of the ensuing integration means users can deploy RDKit on the same database — and the same data — as their data flow graph solution.
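
To make this concrete, here is a minimal sketch using plain open-source RDKit (not the cartridge API itself): it parses two molecules from SMILES, computes a descriptor, and compares their fingerprints.

```python
# A minimal, standalone RDKit sketch (open-source RDKit, not the cartridge API):
# parse two molecules from SMILES, compute a descriptor, and compare fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
ibuprofen = Chem.MolFromSmiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O")

print(Descriptors.MolWt(aspirin))  # molecular weight of aspirin (~180.16)

# 2048-bit Morgan fingerprints (ECFP4-like) for structural comparison
fp1 = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(ibuprofen, 2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp1, fp2))  # similarity in [0, 1]
```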

With almost any other graph computing option (like those heavily reliant on Java), users would have to write a query, move data to the library installed on additional compute resources, perform computations there, then move the data back to the graph platform. Within the graph data flow environment, however, everything can be done without data movement, yielding better efficiency and throughput than the approach described above. Keeping data in place is also ideal for safeguarding it and strengthening its protection. Moreover, organizations can work with the same libraries this industry has used for decades without reinventing the proverbial wheel.

Precision Medicine

Precision medicine is based on implementing treatment measures for specific groups or subsets of the population, delivering a tailored, personalized approach to medicine. The reality is that companies usually spend approximately $250,000 on each patient in clinical drug trials for precision medicine. Recruiting the wrong patient for these trials (someone firms know won't benefit from the drug) wastes a considerable amount of money. Graph data flow platforms can craft Graph Neural Networks (GNNs) that accurately predict patient responses to drugs in clinical trials, which produces a huge monetary impact for companies routinely investing $10 million over a period of years to bring new drugs to market.

The GNN approach lets them save the $250,000 per patient that might otherwise have been lost. This gain is quantifiable, accelerates time to market, and amplifies the health benefits of this method. Although it's only a modest part of the drug development lifecycle, it is one of the most impactful. The datasets required to build these GNNs are predicated on patients' physiological and genomic data, which encompasses millions of data points. This data is used to create a similarity graph connecting patients based on these features. Ultimately, these GNNs are deployed for binary classifications in a clinical trial revolving around a new pharmaceutical targeting a specific segment of the population.
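
For a sense of the modeling shape, here is a hedged sketch of a binary patient-response classifier built with the open-source PyTorch Geometric library rather than any platform-specific API; the cohort size, features, edges, and labels are all random placeholders.

```python
# A hedged GNN sketch: patients are nodes, edges connect similar patients, and
# the model predicts a binary responder / non-responder label per node.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

num_patients, num_features = 100, 64                   # illustrative cohort size
x = torch.randn(num_patients, num_features)            # physiological/genomic features
edge_index = torch.randint(0, num_patients, (2, 400))  # placeholder similarity edges
y = torch.randint(0, 2, (num_patients,))               # responder vs. non-responder
data = Data(x=x, edge_index=edge_index, y=y)

class ResponderGCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(num_features, 32)
        self.conv2 = GCNConv(32, 2)                    # two classes: responds or not

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = ResponderGCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(50):                                    # tiny illustrative training loop
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), data.y)
    loss.backward()
    optimizer.step()
```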

New Pharmaceutical Development

As the above precision medicine use case reveals, the scale of developing new drugs in life sciences is substantial. For example, it's not unusual for companies to work with datasets involving one to two million molecules for chemical compound property predictions, both within and outside the realm of precision medicine. Data flow graph computing platforms that seamlessly integrate with other life science tools can create molecular fingerprints on that data and store them in their database. With perhaps three or four fingerprints for each molecule, the scale of this application quickly exceeds the aforementioned two million molecules. Users can drastically enhance the performance of the overall pipeline by indexing the fingerprints in the database, which accelerates downstream computations for everything from traditional analytics to machine learning models.
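
As a rough illustration of how an analytics framework like Dask can parallelize the fingerprinting step (generic open-source Dask and RDKit here, not the platform's internal pipeline):

```python
# A hedged sketch of parallel fingerprinting with Dask and RDKit; the SMILES
# list is a tiny stand-in for the ~2M-molecule datasets described above.
import dask.bag as db
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles):
    """Return a 2048-bit Morgan fingerprint as a bit string, or None if unparsable."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048).ToBitString()

smiles_list = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
bag = db.from_sequence(smiles_list, npartitions=2)
fingerprints = bag.map(morgan_fingerprint).filter(lambda fp: fp is not None).compute()
```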

A multi-sharded RDKit implementation works well with this approach. It's well suited to combinatorial/virtual libraries and DNA-Encoded Libraries (DELs), where many virtual compounds must be tested on the fly. When multi-sharded implementations are combined with High-Performance Computing (HPC), they operate at an enterprise scale that options without these techniques can't match. For example, alternative solutions built on PostgreSQL or MongoDB offer only a single-instance implementation. Depending on how the indexing was done for the aforementioned use case involving the fingerprints of two million molecules, data flow graph intelligence platforms might use as many as 10 machines to expedite the task.

The fingerprints are necessary to build the similarity graphs on which the GNNs operate for predicting which molecules make credible compounds for drug development. The GNNs draw on node features such as the number of bonds, the number of double bonds, and more. Those features are the basis for predicting specific molecule properties, accelerating the research needed to bring new drugs to market compared to graph techniques that lack multi-sharding and HPC and require moving data in and out of PostgreSQL.
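
Here is a small-scale, illustrative sketch of that fingerprint-to-similarity-graph step, built in memory with networkx; a data flow graph platform would construct and index this in-database, and the similarity threshold is a placeholder.

```python
# An illustrative sketch: connect molecules whose Tanimoto similarity clears a
# threshold; the molecule set and cutoff are placeholders.
import networkx as nx
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "CCCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
       for s in smiles]

graph = nx.Graph()
graph.add_nodes_from(range(len(fps)))
for i in range(len(fps) - 1):
    # compare molecule i against every later molecule in one vectorized call
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
    for offset, sim in enumerate(sims):
        if sim >= 0.3:  # illustrative similarity cutoff
            graph.add_edge(i, i + 1 + offset, weight=sim)
```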

Without this graph data flow approach, organizations might have to integrate a Jupyter notebook, for example, with PostgreSQL. They'd need a driver to extract the list of molecules and then compute similarity fingerprints on the extracted molecules. Then, they'd have to move the data into another tool to construct a graph. Additionally, they'd have to make a separate call to RDKit, requiring yet more data movement. With modern graph computing methods, however, they can do everything within a single platform. Choosing this option also lets users write far fewer lines of code for the entire pipeline. In fact, one can get a GNN prediction while writing just a few lines of code.

The Way to Go

There is a plethora of benefits to leveraging a holistic graph computing platform for typical life sciences use cases involving chemical compound property predictions for pharmaceutical development and precision medicine. Data stays within the platform — which improves data privacy and data security — while advancing through a pipeline involving graph query, graph mining, graph analytics, and graph AI. Moreover, data never leaves the platform even when integrating with external resources like RDKit and others.

Users get the added advantage of a distributed architecture for increased scalability and performance, supplemented by HPC techniques. The result is a time to market vastly better than other solutions can offer, maximizing the yield of these applications for both the organizations deploying them and the countless patients who depend on their efforts.

Originally published at https://blog.katanagraph.com on May 11, 2022.
