Downloading PubChem Bioassays made easy

Gemma Turon
Published in ersiliaio
5 min read · Jan 26, 2022

A simple script to download bioactivity data for small molecules

The PubChem Bioassay Database is the largest open-source repository of bioactivity data. With over 290 million data points from well-described biological experiments, it is an invaluable source of information for data scientists. At Ersilia, we often use it to complement the training sets of our machine learning models, but downloading the data of interest is not always straightforward. Here, we describe a code snippet that downloads a bioassay of interest in a format suitable for subsequent model training: https://github.com/ersilia-os/bioassay-db. Please note that it focuses on small molecule-based bioassays; RNAi experiments are not accessible with this protocol.

Modified from Kim et al, 2016

Before jumping into the code, it is important to understand the relationship between PubChem data types. Bioassays, labelled with an Assay ID (AID), contain a general description of the experiment (Description), the description of the results columns (Assay Result Field Type ID or TID), and the bioactivity results themselves. Each data point is linked to a unique Substance, identified by the Substance ID (SID). SIDs are subsequently standardized to unique chemical structures, the Compounds, identified by the Compound ID (CID). Multiple SIDs can therefore share the same CID, but a single SID is never assigned more than one CID. Importantly, we need the CID to obtain the SMILES string of each molecule (Wang et al, 2009, Kim et al, 2016).
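To make the many-to-one relationship concrete, here is a minimal sketch that inverts a SID-to-CID mapping. Only the first pair (SID 10321992, CID 647501) comes from the AID 1851 example in this post; the other IDs and the helper name group_sids_by_cid are made up for illustration.

```python
from collections import defaultdict

# SID -> CID pairs. Only the first pair is taken from this post's example;
# the remaining IDs are invented placeholders.
sid_to_cid = {
    10321992: 647501,
    90000001: 647501,    # hypothetical second deposition of the same structure
    90000002: 90000003,  # hypothetical unrelated substance/compound
}


def group_sids_by_cid(mapping):
    """Invert a SID -> CID mapping into CID -> sorted list of SIDs."""
    grouped = defaultdict(list)
    for sid, cid in mapping.items():
        grouped[cid].append(sid)
    return {cid: sorted(sids) for cid, sids in grouped.items()}


cid_to_sids = group_sids_by_cid(sid_to_cid)
print(cid_to_sids)
```

Because each SID is a dictionary key, it can only ever point to one CID, while a CID can collect several SIDs. This is worth keeping in mind later on: the same structure may appear in more than one row of a bioassay.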

Here, we will use the bioassay AID 1851, a high-throughput screening experiment that measured Cytochrome P450 inhibition, as an example, since it is a fairly large (>15000 molecules) and complex (>30 TIDs) record. We access the database using PUG-REST requests and download the records in JSON format. JSON files are essentially nested dictionaries and lists, which allows us to retrieve the information of interest using specific keys. If you are not familiar with JSON files, you can skip directly to the CSV Conversion section for a more human-readable format!
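For readers curious about what a PUG-REST request looks like, the sketch below builds two assay URLs and defines a small download helper. It is not the code from the repository: the helper names assay_url and fetch_json are hypothetical, and the URL layout reflects my reading of the PUG-REST documentation (base / assay / aid / AID / operation / output format).

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_url(aid, operation, output="JSON"):
    """Build a URL of the form <base>/assay/aid/<AID>/<operation>/<output>."""
    return f"{PUG_REST}/assay/aid/{aid}/{operation}/{output}"


def fetch_json(url, timeout=60):
    """Download a URL and parse the body as JSON (nested dicts and lists)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URLs needs no network access.
sids_url = assay_url(1851, "sids")
description_url = assay_url(1851, "description")
print(sids_url)
print(description_url)
```

Calling fetch_json(sids_url) would then return a Python dictionary that can be explored key by key, which is all "nested dictionaries" means in practice.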

JSON Download

  1. Clone the GitHub repository or download and save the relevant scripts (src/pubchem.py, download_json.py)
  2. Modify download_json.py as needed. It has three commands: import the PubChemBioAssayRecord class from pubchem.py (change the path if necessary), instantiate the class specifying the desired AID (in the example, 1851), and call the save_json() function, passing the path to the destination folder.
  3. From the command-line interface (Command Prompt on Windows), navigate to the folder containing the scripts and run python download_json.py

In practice, we first download the list of all SIDs tested in the bioassay using the PUG-REST API. On the one hand, we obtain the corresponding CID and canonical SMILES for each SID; on the other hand, we retrieve all results associated with each SID. Both steps have to be done in batches to avoid hitting the retrieval limit of a single request.
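The batching idea can be sketched as follows. This is an illustration rather than the repository's implementation: the batch size of 100 is an arbitrary assumption, the function names are hypothetical, and the substance-to-compound URL pattern is based on my understanding of the PUG-REST documentation.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sid_to_cid_urls(sids, batch_size=100):
    """Build one SID -> CID lookup URL per batch of SIDs."""
    urls = []
    for batch in batched(sids, batch_size):
        id_list = ",".join(str(sid) for sid in batch)
        urls.append(f"{PUG_REST}/substance/sid/{id_list}/cids/JSON")
    return urls


example_sids = list(range(1, 251))  # placeholder IDs, not real assay SIDs
urls = sid_to_cid_urls(example_sids, batch_size=100)
print(len(urls), "requests needed for", len(example_sids), "SIDs")
```

Each URL would be requested in turn and the partial responses merged. Keeping the batch size as a parameter makes it easy to lower it if a request is rejected for being too large.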

Caption of the code snippet necessary to download a bioassay in JSON format

This will download a JSON file in the specified folder, under the name: “PUBCHEM1851.json”. This file is organized in three main objects:

  • Assay ID: AID number.
  • Description: bioassay text description, which includes several fields (aid, description, protocol, results…). Each item of the results list corresponds to a specific TID (e.g. IC50, % of inhibition, etc.)
  • Data: a list containing all results. Each item of the list represents an experimental data point (i.e. one substance tested against a number of bioactivity assays) and holds the following information in dictionary format: SID / CID / SMILES / Outcome (active-inactive) / Version / Rank / Data (the value associated with each TID, if it exists)
# example of the .json file downloaded:
{"AssayID": "PUBCHEM1851",
"Description": {"aid":{}, "aid_source":{}, "name":"...",
"description":[], "protocol":[], "xref":[],
"results": [{"tid":1, ...},{"tid":2, ...}, {...}]},
"Data":[{"sid":10321992,
"version":0,
"outcome":2,
"rank":95,
"data":[{"tid":1, "name":"...", "description": "...",
"type":1, "unit": 11},{"tid":2, ...}, ... ],
"cid":647501,
"smiles":"CCN1C2=NC(=O)N(C(=O)C2=NC(=N1)C3=CC=CC=C3)C"},
...]}

Some SIDs may not have been tested under all experimental conditions, so some TID fields might be empty.
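A quick way to see which TIDs are missing for each substance is sketched below. It uses a trimmed, hand-written record that mirrors the structure shown above (only the keys needed here are kept, and the three TIDs are illustrative); the function name missing_tids is hypothetical.

```python
# Trimmed stand-in for the downloaded file; a real record has many more
# fields per entry and more than 30 TIDs.
record = {
    "AssayID": "PUBCHEM1851",
    "Description": {"results": [{"tid": 1}, {"tid": 2}, {"tid": 3}]},
    "Data": [
        {"sid": 10321992, "cid": 647501, "outcome": 2,
         "data": [{"tid": 1}, {"tid": 2}]},
    ],
}


def missing_tids(record):
    """Map each SID to the TIDs declared in Description but absent from its data."""
    declared = [result["tid"] for result in record["Description"]["results"]]
    report = {}
    for entry in record["Data"]:
        present = {item["tid"] for item in entry["data"]}
        report[entry["sid"]] = [tid for tid in declared if tid not in present]
    return report


report = missing_tids(record)
print(report)
```

With the real file, record would come from json.load() on the downloaded PUBCHEM1851.json instead of being typed by hand.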

If you need further clarification on what each field represents, check the PubChemDocs.

CSV Conversion

JSON files are ideal for subsequent processing, but they are not very easy for humans to read. To convert the downloaded JSON file into a CSV table, follow these instructions:

  1. Clone the GitHub repository or download and save the relevant scripts (src/json2df.py, save_df.py)
  2. Modify the save_df.py script as needed: it imports the PubChemBioAssayJsonConverter class from json2df.py (you might need to change the path to the downloaded folder), instantiates the class (specify the folder where the JSON file is stored and its name), and calls the get_all_results() function. This function returns a pandas DataFrame that is then saved to the destination folder. A .txt file with the assay description is also saved to the destination folder.
  3. From the command-line interface (Command Prompt on Windows), navigate to the folder containing the scripts and run python save_df.py

In practice, we convert the dict-like format of the Data object in the JSON file into a CSV file where each row corresponds to one SID/CID/SMILES and each TID corresponds to a column with the bioactivity results.
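The reshaping step can be illustrated with the standard library alone. The sketch below is a plain-Python stand-in, not the pandas-based converter from the repository. It assumes each item of an entry's data list carries a "name" and a "value" key; the "value" key, the result names, the second entry, and the function names are all assumptions made for this example.

```python
import csv
import io

# Two hand-written entries. The first SID/CID/SMILES come from this post;
# everything else (values, result names, second entry) is made up.
entries = [
    {"sid": 10321992, "cid": 647501,
     "smiles": "CCN1C2=NC(=O)N(C(=O)C2=NC(=N1)C3=CC=CC=C3)C",
     "outcome": 2,
     "data": [{"tid": 1, "name": "Result A", "value": 0.5}]},
    {"sid": 90000001, "cid": 90000002, "smiles": "CCO", "outcome": 1,
     "data": [{"tid": 1, "name": "Result A", "value": 1.5},
              {"tid": 2, "name": "Result B", "value": 7.0}]},
]


def entries_to_rows(entries):
    """Flatten entries into one dict per substance, one key per result name."""
    result_names = []
    rows = []
    for entry in entries:
        row = {"SID": entry["sid"], "CID": entry["cid"],
               "SMILES": entry["smiles"], "Outcome": entry["outcome"]}
        for item in entry["data"]:
            if item["name"] not in result_names:
                result_names.append(item["name"])
            row[item["name"]] = item["value"]
        rows.append(row)
    header = ["SID", "CID", "SMILES", "Outcome"] + result_names
    return header, rows


def rows_to_csv(header, rows):
    """Serialize rows to CSV text, leaving untested results as empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


header, rows = entries_to_rows(entries)
csv_text = rows_to_csv(header, rows)
print(csv_text)
```

Results that were not measured for a substance simply end up as empty cells, which matches the note above about empty TID fields.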

Caption of the code snippet necessary to convert a bioassay from JSON to CSV

The CSV file we have created has the following columns:

  • SID
  • CID
  • Canonical SMILES
  • Outcome: the primary screening result (1: Inactive, 2: Active, 3: Inconclusive, 4: Unspecified, 5: Probe)
  • TID columns: as many columns as there are assay results. Each TID column is labelled with its name. Further information on each result can be found in the description file, including measurement units if they are not specified in the column name. Note that units are also encoded as numbers (1: ppt (parts per thousand), 2: ppm (parts per million), 3: ppb (parts per billion), 4: mm (milliM), 5: um (microM), 6: nm (nanoM), 7: pm (picoM), 8: fm (femtoM), 9: mgml (milligrams per mL), 10: ugml (micrograms per mL), 11: ngml (nanograms per mL), 12: pgml (picograms per mL), 13: fgml (femtograms per mL), 14: m (Molar), 15: percent (percent ratio), 16: ratio, 17: sec (seconds), 18: rsec (reciprocal seconds), 19: min (minutes), 20: rmin (reciprocal minutes), 21: day (days), 22: rday (reciprocal days), 23: mlminkg (milliliter / minute / kilogram), 24: lkg (liter / kilogram), 25: hrngml (hour * nanogram / milliliter), 26: cmsec (centimeter / second), 27: mgkg (milligram / kilogram), 254: none, 255: unspecified)
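Since both the outcome and the units are stored as numbers, a small lookup table makes the CSV easier to read. The codes below are copied from the lists above; the function names decode_outcome and decode_unit are hypothetical helpers, not part of the repository.

```python
OUTCOMES = {1: "Inactive", 2: "Active", 3: "Inconclusive",
            4: "Unspecified", 5: "Probe"}

UNITS = {
    1: "ppt", 2: "ppm", 3: "ppb", 4: "mm", 5: "um", 6: "nm", 7: "pm",
    8: "fm", 9: "mgml", 10: "ugml", 11: "ngml", 12: "pgml", 13: "fgml",
    14: "m", 15: "percent", 16: "ratio", 17: "sec", 18: "rsec", 19: "min",
    20: "rmin", 21: "day", 22: "rday", 23: "mlminkg", 24: "lkg",
    25: "hrngml", 26: "cmsec", 27: "mgkg", 254: "none", 255: "unspecified",
}


def decode_outcome(code):
    """Translate a numeric outcome code into its label."""
    return OUTCOMES.get(code, f"unknown ({code})")


def decode_unit(code):
    """Translate a numeric unit code into its short unit name."""
    return UNITS.get(code, f"unknown ({code})")


print(decode_outcome(2), decode_unit(5))
```

Codes that are not in the tables are reported as unknown rather than raising an error, so an unexpected value does not interrupt a long conversion.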

So, finally, with two simple commands you can download any desired bioassay from PubChem as an easy-to-read CSV file!

I hope you found this useful. Leave any questions in the comments, or open a GitHub issue if you encounter any problems. We will keep posting small hacks that make our lives easier when working with biomedical data, so stay tuned to our channel for more!
