Kipoi: utilizing machine learning models for genomics
Within 4 years, the number of papers with deep learning models for genomics has increased 40-fold.
Using deep learning (and more broadly machine learning), researchers model how DNA sequence encodes molecular phenotypes, and how a ‘bug’ in this code may disrupt those phenotypes and lead to diseases.
The explosion of machine learning models in genomics has created an environment where researchers struggle to keep up with the increasing number of models published.
We developed Kipoi (kipoi.org), a model zoo for genomics, to facilitate using, sharing, archiving and building such models. We standardized the definition of all steps required to make model predictions on new data, pre-seeded a repository with 2,000 models from 19 different publications, implemented an API to use these models, and set up continuous end-to-end testing of all the models.
Making model predictions also involves data-loading and pre-processing
To make model predictions on new data, one needs to:
- obtain model parameters
- install all the required software packages
- extract and pre-process the relevant information from the raw files
- run model predictions.
The major difference between genomics and other fields like computer vision is in the third step. In genomics, the data are extracted from domain-specific file formats and processed using bioinformatics tools. These tools implement the required operations, such as overlapping intervals, extracting sequences or parsing genome annotations. Since all four steps are required to successfully apply a model, it can often take days or even weeks to obtain and re-run a published model on new data. We built Kipoi to remove the obstacles in this process and reduce the time needed to make model predictions to seconds or minutes.
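To make the third step concrete, here is a minimal sketch of extracting sequences for a set of intervals and one-hot encoding them. The in-memory `GENOME` and `INTERVALS` are hypothetical stand-ins for a FASTA file and a BED file, and the helper names are illustrative; in practice this step would rely on dedicated bioinformatics tools.

```python
# Hypothetical toy data standing in for a FASTA genome and a BED interval file.
GENOME = {"chr1": "ACGTACGTNNACGT"}
INTERVALS = [("chr1", 0, 4), ("chr1", 6, 10)]

ALPHABET = "ACGT"


def extract_sequence(genome, chrom, start, end):
    """Return the sequence of a 0-based, half-open interval (BED convention)."""
    return genome[chrom][start:end].upper()


def one_hot_encode(seq):
    """Encode a DNA string as rows of 4 floats; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


# One encoded matrix (sequence length x 4) per interval.
encoded = [one_hot_encode(extract_sequence(GENOME, *interval)) for interval in INTERVALS]
```

Each entry of `encoded` is an array-like structure that a sequence-based model could consume directly.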
Main ingredients of Kipoi
1. Standardization of trained models
A Kipoi model is a directory containing files that describe two main components: data-loader and model.
The data-loader loads data from canonical file formats, pre-processes it and returns arrays consumable by the model. It can be implemented as a Python function returning the whole dataset or as a generator returning batches of data. Specification files:
- dataloader_files/ (optional): directory with further required files
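The two data-loader flavours can be sketched roughly as follows. This is an illustration only: the function names, the integer encoding and the `"inputs"`/`"metadata"` dictionary layout are assumptions made for the example, not a specification of the Kipoi interface.

```python
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # any other character maps to -1


def preprocess(seq, length):
    """Upper-case, trim or pad with N to `length`, and map bases to integer codes."""
    seq = seq.upper()[:length].ljust(length, "N")
    return [BASE_TO_INDEX.get(base, -1) for base in seq]


def load_all(records, length=6):
    """Function flavour: return the whole dataset at once."""
    return {
        "inputs": [preprocess(seq, length) for _, seq in records],
        "metadata": {"ids": [name for name, _ in records]},
    }


def load_batches(records, length=6, batch_size=2):
    """Generator flavour: yield one batch of records at a time."""
    for start in range(0, len(records), batch_size):
        yield load_all(records[start:start + batch_size], length)


# Hypothetical records: (identifier, raw sequence) pairs.
RECORDS = [("seq1", "acgtac"), ("seq2", "GG"), ("seq3", "TTTTTTTT")]
batches = list(load_batches(RECORDS))
```

The generator flavour keeps memory use bounded for large inputs, while the function flavour is simpler when the dataset fits in memory.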
The model takes one or multiple arrays and makes predictions. It can be implemented in various frameworks like Keras, PyTorch or TensorFlow. Alternatively, it can be implemented using arbitrary Python code, which makes it possible to use other frameworks or even invoke command-line tools written in other programming languages. Specification files:
- model.py (optional): Python class implementing the model
- model_files/: directory with required files like model parameters
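As an illustration of a model written in arbitrary Python code, the sketch below scores each sequence in a batch by its best match to a fixed motif. The class name, the `predict_on_batch` method name and the toy weights are assumptions chosen for this example and are not taken from the Kipoi specification.

```python
class MotifScanModel:
    """Hypothetical model: best match score of a fixed motif within each sequence."""

    def __init__(self, weights):
        # weights: one dict per motif position, mapping base -> score
        self.weights = weights

    def _score_window(self, window):
        """Sum the per-position scores of one motif-length window."""
        return sum(pos.get(base, 0.0) for pos, base in zip(self.weights, window))

    def predict_on_batch(self, batch):
        """Return one score per sequence; sequences shorter than the motif score 0.0."""
        width = len(self.weights)
        predictions = []
        for seq in batch:
            scores = [
                self._score_window(seq[i:i + width])
                for i in range(len(seq) - width + 1)
            ]
            predictions.append(max(scores) if scores else 0.0)
        return predictions


# Crude, made-up weights for a "TATA" motif.
TOY_WEIGHTS = [{"T": 1.0}, {"A": 1.0}, {"T": 1.0}, {"A": 1.0}]
model = MotifScanModel(TOY_WEIGHTS)
predictions = model.predict_on_batch(["GGTATAGG", "CCCCCCCC", "TA"])
```

Because the class only needs to turn a batch of inputs into predictions, the body of the method could just as well wrap another framework or call an external tool.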
Required software dependencies (conda and pip packages) are specified in both model.yaml and dataloader.yaml. Thanks to efforts like Anaconda, Conda-forge and Bioconda, this covers a vast number of packages, including more than 3,000 bioinformatics packages in Bioconda.
Each model also comes with small example files used for model testing:
- example_files/: directory with small test files
2. Model repository
Models are stored in a GitHub repository (github.com/kipoi/models). We use Git Large File Storage (LFS) to store model parameters and test files. This enables contributing models the same way code is contributed on GitHub, and accessing past models the same way past code is accessed on GitHub.
3. API for accessing and using the models
You can access and use the models from Python, R or the command line:
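From Python, usage might look like the sketch below. It relies on `kipoi.get_model`, `model.pipeline.predict_example` and `model.pipeline.predict` from the `kipoi` package; the wrapper function `predict_with_kipoi` is a hypothetical helper introduced here. The import is placed inside the function so that the sketch can be loaded without kipoi installed, and running it requires the package plus the dependencies of the chosen model.

```python
def predict_with_kipoi(model_name, dl_kwargs=None, batch_size=4):
    """Load a model from the Kipoi repository and run predictions.

    model_name: name of a model in the repository.
    dl_kwargs:  keyword arguments for the model's data-loader (for example paths
                to input files); the accepted keys depend on the chosen model.
                If None, the small example files bundled with the model are used.
    """
    import kipoi  # imported lazily; requires the kipoi package to be installed

    model = kipoi.get_model(model_name)
    if dl_kwargs is None:
        # Run the model on its own bundled example files.
        return model.pipeline.predict_example(batch_size=batch_size)
    # Run data-loading, pre-processing and prediction on user-provided files.
    return model.pipeline.predict(dl_kwargs, batch_size=batch_size)
```

Calling the helper with only a model name is a quick way to check that a model and its environment work before pointing it at new data.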
The command-line interface provides generic commands for creating a new conda environment and making model predictions. Combined with Snakemake (a Python-based workflow management system), Kipoi’s API makes it possible to write a single rule that runs multiple models. For example, we used such a rule to compare 5 different models in our manuscript (Fig. 2).
To enable additional functionality beyond just running model predictions and giving the user access to the model, we implemented plugins for variant effect prediction and model interpretation.
kipoi-veff: Variant effect prediction
Using Kipoi-veff, models trained to predict molecular phenotypes from DNA sequence can be used to assess how genetic mutations alter these phenotypes. Thanks to standardization, the plugin works with most of the sequence-based models in Kipoi and can directly annotate mutations stored in the standard VCF file format.
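The underlying idea can be sketched as scoring the reference and the mutated sequence with the same model and comparing the two predictions. The code below is a simplified illustration, not the plugin's implementation: `toy_predict` is a made-up model (GC fraction), positions are 0-based (VCF positions are 1-based), and all function names are hypothetical.

```python
def apply_variant(seq, pos, ref, alt):
    """Replace `ref` with `alt` at 0-based `pos`, checking that the reference allele matches."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + alt + seq[pos + len(ref):]


def variant_effect(predict, seq, pos, ref, alt):
    """Difference between the prediction for the mutated and the reference sequence."""
    return predict(apply_variant(seq, pos, ref, alt)) - predict(seq)


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: fraction of G/C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


reference = "AATTAATT"
# A T>G substitution at 0-based position 2.
effect = variant_effect(toy_predict, reference, 2, "T", "G")
```

A positive `effect` means the model predicts a higher value for the mutated sequence than for the reference.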
kipoi-interpret: Model interpretation
Kipoi-interpret computes feature importance scores that quantify which parts of the input data the model uses to make a prediction. For example, it can visualize DNA sequence importance for a DeepBind model, which predicts protein binding affinity.
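One simple, model-agnostic way to obtain such scores is in-silico mutagenesis: mutate each position in turn and record how much the prediction changes. The sketch below illustrates that idea on a made-up model and is not necessarily the method the plugin uses; `toy_predict` and `mutagenesis_importance` are hypothetical names.

```python
def mutagenesis_importance(predict, seq, alphabet="ACGT"):
    """For each position, the largest absolute change in prediction over all substitutions."""
    baseline = predict(seq)
    scores = []
    for i, original in enumerate(seq):
        changes = [
            abs(predict(seq[:i] + base + seq[i + 1:]) - baseline)
            for base in alphabet
            if base != original
        ]
        scores.append(max(changes))
    return scores


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: 1.0 if the sequence contains GATA."""
    return 1.0 if "GATA" in seq else 0.0


importance = mutagenesis_importance(toy_predict, "CCGATACC")
```

Positions whose mutation destroys the motif receive a high score, while flanking positions that the toy model ignores receive zero.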
How to get started?
- Explore hundreds of trained models for genomics and apply them to new data in a few lines of Python or R, or via the command line.
- Fine-tune an existing model on a new dataset.
- Compose a new model using existing Kipoi models as building blocks.
- Contribute models and make them accessible for others.
- Use and develop new plugins (see kipoi-veff and kipoi-interpret).
We built Kipoi to simplify the adoption and re-use of predictive models in genomics. This enabled us to easily compare models of protein-DNA binding, transfer knowledge in models of chromatin accessibility and combine alternative splicing models to predict the pathogenicity of genetic variants (see our white paper). While our work has focused on genomics, the Kipoi framework could be re-used in other domains.
We believe that sustained investment in infrastructure will return high dividends in the future. We invite you to join the Kipoi community and help build a robust research community. We look forward to driving value for academic research and industrial applications.
Žiga Avsec is a PhD student at the TU Munich working on deep learning models for genomics. This blog post was written together with Johnny Israeli and Roman Kreuzhuber. Kipoi is a collaborative project initiated in three research labs led by Anshul Kundaje (Stanford), Julien Gagneur (TU Munich) and Oliver Stegle (EBI).