Full-Text Search in 197M Chemical Names Graph Database

Tom Nijhof-Verhees
Neo4j Developer Blog
6 min read · Jul 28, 2022


PubChem is a database of millions of chemical compounds. All of it can be downloaded and loaded into your own graph database as the basis for a project.

I downloaded 197M synonyms related to 57M compounds for the Open Measurement project.

Open Measurement

This project is being built as part of “Open Measurement”, a platform to share measurements of biological experiments with others. The challenge comes from the many different formats and structures people use. This makes it hard to find out if other people did a similar experiment to yours.
In this blog post, I will look at building a compound search tool. The goal is to link the different synonyms of the same compound to each other.
Techniques: Neo4j graph database, full-text search, Lucene, data wrangling

Loading the Data

The data is provided in Turtle (.ttl) format. For the synonyms, there are 18 files, each with over 10M rows. Each row connects a synonym value (the name as a string) to an ID (the MD5 hash of the name).
Alongside those are 11 Turtle files that link synonyms to compounds. The first 10 of these files also have 10M rows each, and every row pairs a synonym ID with a compound ID (PubChem ID).
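To make the two row types concrete, here is a minimal Python sketch of how such rows could be parsed. The exact line layout, the predicate names (`sio:has-value`, `sio:is-attribute-of`) and the `compound:CID…` form are my assumptions about the files, not something stated above, and the helper names are hypothetical. Any normalization applied to a name before hashing is also not specified here, so `md5_id` only illustrates what an MD5-based ID looks like.

```python
import hashlib
import re

# ASSUMED row shapes (not confirmed by the article):
#   synonym:MD5_<32 hex>  sio:has-value        "<name>"@en .
#   synonym:MD5_<32 hex>  sio:is-attribute-of  compound:CID<number> .
SYNONYM_LINE = re.compile(
    r'^synonym:MD5_(?P<md5>[0-9a-f]{32})\s+\S+\s+'
    r'"(?P<name>(?:[^"\\]|\\.)*)"(?:@[\w-]+)?\s*\.$'
)
LINK_LINE = re.compile(
    r'^synonym:MD5_(?P<md5>[0-9a-f]{32})\s+\S+\s+compound:CID(?P<cid>\d+)\s*\.$'
)


def parse_synonym_line(line):
    """Return (synonym_id, name), or None for prefix lines and other rows.

    Escape sequences inside the name are returned as written in the file.
    """
    match = SYNONYM_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("md5"), match.group("name")


def parse_link_line(line):
    """Return (synonym_id, compound_id), or None if the row does not match."""
    match = LINK_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("md5"), int(match.group("cid"))


def md5_id(name):
    """Hex MD5 digest of a name: 32 hexadecimal characters."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()
```

Reading a file line by line and skipping every `None` result keeps memory use flat, which matters with files of 10M rows each.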
The data has a few challenges:

  • Empty strings for synonyms (957 synonyms)
  • Duplicates of synonyms (1,591,056 synonyms)
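Both issues can be handled in a single pass over the rows. The sketch below is one possible approach, not necessarily the one used in this project: it assumes rows arrive as `(synonym_id, name)` pairs and that a "duplicate" means a synonym ID that has already been seen. The function name `clean_synonyms` is hypothetical.

```python
def clean_synonyms(rows):
    """Drop rows with an empty name and rows whose synonym ID was already seen.

    `rows` is any iterable of (synonym_id, name) pairs.
    Returns (kept_rows, n_empty, n_duplicate).
    """
    seen = set()      # synonym IDs kept so far (ASSUMPTION: duplicate = same ID)
    kept = []
    n_empty = 0
    n_duplicate = 0
    for synonym_id, name in rows:
        if not name.strip():          # empty or whitespace-only synonym
            n_empty += 1
            continue
        if synonym_id in seen:        # repeated synonym
            n_duplicate += 1
            continue
        seen.add(synonym_id)
        kept.append((synonym_id, name))
    return kept, n_empty, n_duplicate


# Small illustrative input; the IDs are made up for readability.
rows = [
    ("a1", "aspirin"),
    ("b2", ""),
    ("a1", "aspirin"),
    ("c3", "acetylsalicylic acid"),
]
kept, n_empty, n_duplicate = clean_synonyms(rows)
```

For the full data set, an in-memory set of well over a hundred million IDs may be too large for a typical machine, so in practice the deduplication could instead be pushed to the database (for example through a uniqueness constraint) or done on sorted files.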


I do have a focus in my blog: "whatever interests me this week"