Kipoi: utilizing machine learning models for genomics

Within four years, the number of papers presenting deep learning models for genomics has increased 40-fold.

Number of publications per year containing keywords ‘deep learning’ and ‘genomics’. Credit Gokcen Eraslan. Source app.dimensions.ai.

Using deep learning (and more broadly machine learning), researchers model how DNA sequence encodes molecular phenotypes, and how a ‘bug’ in this code may disrupt those phenotypes and lead to diseases.

The explosion of machine learning models in genomics has created an environment where researchers struggle to keep up with the increasing number of models published.

We developed Kipoi (kipoi.org), a model zoo for genomics, to facilitate using, sharing, archiving and building such models. We standardized the definition of all steps required for making model predictions on new data, pre-seeded a repository with 2,000 models from 19 different publications, implemented an API to use these models, and set up continuous end-to-end testing of all the models.

Making model predictions also involves data-loading and pre-processing

To make model predictions on new data, one needs to:

  1. obtain model parameters
  2. install all the required software packages
  3. extract and pre-process the relevant information from the raw files
  4. run model predictions.

The major difference between genomics and other fields like computer vision is in the third step. In genomics, the data are extracted from domain-specific file formats and processed using bioinformatics tools. These tools implement the required operations, such as overlapping intervals, extracting sequences or parsing genome annotations. Since all four steps are required to successfully apply the model, it can often take days or even weeks to obtain and re-run a published model on new data. We built Kipoi to remove the obstacles in this process and reduce the time needed to make model predictions to seconds or minutes.
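To make the third step concrete, here is a minimal sketch of extracting sequences for BED-style intervals and one-hot encoding them. It is not Kipoi code: the in-memory `GENOME` dictionary stands in for a FASTA file, and the helper names `parse_bed_line`, `extract_sequence` and `one_hot` are hypothetical.

```python
# Toy in-memory "genome"; a real pipeline would read a FASTA file from disk.
GENOME = {"chr1": "ACGTACGTTTGACCA"}
ALPHABET = "ACGT"


def parse_bed_line(line):
    """Parse one BED line into (chrom, start, end).

    BED coordinates are 0-based with an exclusive end.
    """
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def extract_sequence(genome, chrom, start, end):
    """Return the upper-cased sequence of the requested interval."""
    return genome[chrom][start:end].upper()


def one_hot(seq):
    """Encode a sequence as rows of 4 values; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


bed_lines = ["chr1\t0\t4", "chr1\t8\t12"]
intervals = [parse_bed_line(line) for line in bed_lines]
sequences = [extract_sequence(GENOME, *interval) for interval in intervals]
encoded = [one_hot(seq) for seq in sequences]

print(sequences)
```

Even in this toy form, the pre-processing already involves coordinate conventions and encoding choices that must match what the model saw during training, which is why it belongs with the model.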

Main ingredients of Kipoi

1. Standardization of trained models

A Kipoi model is a directory containing files that describe two main components: data-loader and model.

Data-loader

The data-loader loads data from canonical file formats, pre-processes it and returns arrays consumable by the model. It can be implemented as a Python function returning the whole dataset or as a generator returning batches of data. Specification files:

  • dataloader.yaml description (example)
  • dataloader.py implementation (example)
  • dataloader_files/ (optional) directory with further required files
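The generator variant of a data-loader could look roughly like the sketch below. The function name `batch_dataloader` and the exact layout of the yielded dictionaries (`inputs` and `metadata` keys) are assumptions made for illustration, not a copy of any dataloader.py in the repository.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values; unknown bases become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def batch_dataloader(sequences, batch_size):
    """Yield batches of encoded sequences together with simple metadata.

    Each batch is a dict with the model inputs and the indices of the
    original sequences, so predictions can be traced back to their source.
    """
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        yield {
            "inputs": [one_hot(seq) for seq in chunk],
            "metadata": {"index": list(range(start, start + len(chunk)))},
        }


batches = list(batch_dataloader(["ACGT", "TTGA", "GGCC"], batch_size=2))
print(len(batches))
```

Yielding batches instead of the whole dataset keeps memory use bounded, which matters once the input is a whole genome rather than three short strings.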

Model

The model takes one or multiple arrays and makes predictions. It can be implemented in various frameworks like Keras, PyTorch or TensorFlow. Alternatively, it can be implemented using arbitrary Python code, which makes it possible to use other frameworks or even invoke command-line tools written in other programming languages. Specification files:

  • model.yaml description (example)
  • model.py (optional) class implementing predict_on_batch(x)
  • model_files/ directory with required files like model parameters
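For the arbitrary-Python-code route, the essential contract is a class with a predict_on_batch(x) method. The sketch below shows that shape with a hypothetical toy model, `MotifCountModel`, which counts motif occurrences. It deliberately does not build on Kipoi's own model classes so that it runs standalone, and it takes plain strings where a real model would take arrays.

```python
class MotifCountModel:
    """Toy model: scores each sequence by the number of motif occurrences."""

    def __init__(self, motif="TGA"):
        self.motif = motif

    def _count(self, seq):
        """Count (possibly overlapping) occurrences of the motif in seq."""
        k = len(self.motif)
        return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == self.motif)

    def predict_on_batch(self, x):
        """Take a batch (list of DNA strings) and return one score per sequence."""
        return [float(self._count(seq)) for seq in x]


model = MotifCountModel()
predictions = model.predict_on_batch(["TTGA", "ACGT", "TGATGA"])
print(predictions)
```

Because the interface is just "batch in, predictions out", the body of predict_on_batch is free to call into another framework or an external tool.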

Software environment

Required software dependencies (conda and pip packages) are specified in both model.yaml and dataloader.yaml. Thanks to efforts like Anaconda, conda-forge and Bioconda, this covers a vast number of packages, including more than 3,000 bioinformatics packages in Bioconda.

Test examples

Each model also comes with small example files used for model testing:

  • example_files/ directory with small test files

2. Model repository

Models are stored in a GitHub repository (github.com/kipoi/models). We use Git Large File Storage (LFS) to store model parameters and test files. This makes it possible to contribute models the same way code is contributed on GitHub, and to access past versions of models the same way past versions of code are accessed on GitHub.

3. API for accessing and using the models

You can access and use the models from Python, R or the command line:

Python API. All components (model, dataloader, dependencies) can be accessed and used separately.
Command-line API. Each model makes predictions from standard file formats. BED (.bed) and FASTA (.fa) are two such standard formats in bioinformatics: a BED file specifies the queried intervals and a FASTA file holds the genome sequence. Output can be written sequentially to a compressed binary format (HDF5, .h5 extension) or as plain text (tab-separated values, .tsv).
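As a rough illustration of the Python route, the sketch below loads a model by name and runs the whole pipeline (data-loading plus prediction) on a BED and a FASTA file. It assumes the kipoi package is installed and that the chosen model's data-loader accepts arguments named `intervals_file` and `fasta_file`; both the argument names and the helper functions `build_dataloader_kwargs` and `predict_with_kipoi` are assumptions for illustration, so check the model's dataloader.yaml before relying on them.

```python
def build_dataloader_kwargs(intervals_file, fasta_file):
    """Collect data-loader arguments for a sequence-based model.

    The key names are an assumption; the authoritative list is in the
    dataloader.yaml of the model you use.
    """
    return {"intervals_file": intervals_file, "fasta_file": fasta_file}


def predict_with_kipoi(model_name, dl_kwargs, batch_size=32):
    """Load a model from the Kipoi repository and predict from raw files."""
    # Imported lazily so the helper above can be used without Kipoi installed.
    import kipoi

    model = kipoi.get_model(model_name)
    return model.pipeline.predict(dl_kwargs, batch_size=batch_size)


dl_kwargs = build_dataloader_kwargs("intervals.bed", "genome.fa")
print(dl_kwargs)

# With Kipoi, the model's dependencies and real input files in place,
# a call would look like this (the model name is a placeholder):
# predictions = predict_with_kipoi("<model name>", dl_kwargs)
```

The call is left commented out because it needs the model's software environment and actual input files to run.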

The command-line example above shows a generic command for creating a new conda environment and making model predictions. Combined with Snakemake (a Python-based workflow management system), Kipoi’s API makes it possible to write a single rule that runs multiple models. For example, here is the code we used for comparing 5 different models in our manuscript (Fig. 2).

4. Plugins

To enable additional functionality beyond just running model predictions and giving the user access to the model, we implemented plugins for variant effect prediction and model interpretation.

kipoi-veff: Variant effect prediction

To assess the impact of genetic mutations on molecular phenotypes, two model predictions can be compared: one where the sequence doesn’t contain any mutations (reference) and one where the sequence contains the mutation of interest (alternative). The kipoi-veff package reads the mutations from a VCF file, makes model predictions and writes the obtained differences back to the VCF file as additional information.

Using kipoi-veff, models trained to predict molecular phenotypes from DNA sequence can be used to assess how genetic mutations alter these phenotypes. Thanks to standardization, the plugin works with most of the sequence-based models in Kipoi and can directly annotate mutations stored in the standard file format (VCF).
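The reference-versus-alternative comparison can be sketched in a few lines. This is not the kipoi-veff implementation: `toy_model` (GC fraction) is a hypothetical stand-in for a trained model, the variant is given directly instead of being parsed from a VCF file, and positions here are 0-based whereas VCF positions are 1-based.

```python
def apply_snv(seq, pos, ref, alt):
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(
            f"reference mismatch at position {pos}: expected {ref}, found {seq[pos]}"
        )
    return seq[:pos] + alt + seq[pos + 1:]


def toy_model(seq):
    """Hypothetical stand-in for a trained model: fraction of G and C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


def variant_effect(model, seq, pos, ref, alt):
    """Difference between the alternative and the reference prediction."""
    alt_seq = apply_snv(seq, pos, ref, alt)
    return model(alt_seq) - model(seq)


effect = variant_effect(toy_model, "ACGTACGT", pos=0, ref="A", alt="G")
print(effect)
```

Checking the reference base before substituting catches the common mistake of pairing a variant file with the wrong genome build.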

kipoi-interpret: Model interpretation

Kipoi-interpret computes feature importance scores that quantify which parts of the input data the model relies on to make a prediction. Here is an example of a Kipoi-interpret visualization of DNA sequence importance for a DeepBind model, which predicts protein binding affinity:

Importance scores visualization (notebook). The scores highlight the parts of the input that were most important for making the prediction. For DNA sequence-based models, the most important regions are typically those bound by proteins.
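One simple way to obtain such scores is in-silico mutagenesis: mutate every position in turn and record how much the prediction changes. The sketch below illustrates that idea only; it is an assumption that this resembles what the plugin computes for the figure, and `motif_score` is a hypothetical toy model rather than DeepBind.

```python
def motif_score(seq, motif="TGA"):
    """Hypothetical toy model: number of motif occurrences in the sequence."""
    k = len(motif)
    return float(sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif))


def ism_importance(model, seq, alphabet="ACGT"):
    """Per-position importance via in-silico mutagenesis.

    For each position, try every alternative base and keep the largest
    absolute change in the model prediction relative to the original sequence.
    """
    baseline = model(seq)
    scores = []
    for i, base in enumerate(seq):
        deltas = [
            abs(model(seq[:i] + alt + seq[i + 1:]) - baseline)
            for alt in alphabet
            if alt != base
        ]
        scores.append(max(deltas))
    return scores


scores = ism_importance(motif_score, "CCTGACC")
print(scores)
```

In this toy case only the three positions covering the motif receive a non-zero score, mirroring how importance concentrates on binding sites in real sequence models.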

How to get started?

Concluding remarks

We built Kipoi to simplify the adoption and re-use of predictive models in genomics. This enabled us to easily compare models of protein-DNA binding, transfer knowledge in models of chromatin accessibility and combine alternative splicing models to predict the pathogenicity of genetic variants (see our white paper). While our work has focused on genomics, the Kipoi framework could be re-used in other domains.

We believe that sustained investment in infrastructure will return high dividends in the future. We invite you to join the Kipoi community and help build a robust research community. We look forward to driving value for academic research and industrial applications.

Žiga Avsec is a PhD student at the TU Munich working on deep learning models for genomics. This blog post was written together with Johnny Israeli and Roman Kreuzhuber. Kipoi is a collaborative project initiated in three research labs led by Anshul Kundaje (Stanford), Julien Gagneur (TU Munich) and Oliver Stegle (EBI).