Kipoi 0.6 release notes

Žiga Avsec
Nov 6, 2018

We are very excited to announce a major update to Kipoi and its model specification format. The changes are described below.

Note: Since we ported all the models on kipoi/models to the new API in this PR, please update kipoi to 0.6 by running pip install -U kipoi.

Major updates

Hosting models on Zenodo/Figshare instead of Git-LFS

For better scalability, easier setup, and easier contribution, we decided to abandon Git-LFS and use external services like Zenodo or Figshare to host model parameters and example files. These services let you store 50 GB of data per project for free and give you a citable digital object identifier (DOI). To contribute a model to Kipoi’s model repository, you now have to upload the model parameters to one of these services and provide a download link in model.yaml.

You can obtain the md5 hash of the file either from Zenodo’s website or by running md5sum <file> on Linux or md5 <file> on macOS.
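
If neither command-line tool is available, the same hash can be computed in Python. The following is a minimal sketch using only the standard library; the helper name md5_of_file is made up for illustration and is not part of kipoi.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of the file at `path`.

    The file is read in chunks so that large parameter files
    do not have to fit into memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until an empty chunk is returned
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file on the uploaded file should give the same value as md5sum, which you can then compare with the hash shown by the hosting service.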

Kipoiseq — standard dataloaders for sequence-based models

We now provide a fast implementation of common dataloaders in kipoiseq. If your model takes a DNA sequence as input (either as a one-hot-encoded numpy array or as a string), you can simply use the kipoiseq dataloader in model.yaml.

Even if your model uses a different ordering of the letters (say ATCG) or requires a different order of the axes than (batch, sequence position, letter), you can use default_args to specify these.
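
To make the letter ordering and axis order concrete, here is a small pure-Python sketch of one-hot encoding. It is not kipoiseq's implementation: the function names are hypothetical, nested lists stand in for numpy arrays, and mapping unknown letters such as N to an all-zero row is an assumption made for this example.

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """One-hot-encode a DNA string as a (sequence position, letter) nested list.

    Assumption for this sketch: letters missing from `alphabet`
    (for example N) become an all-zero row.
    """
    index = {letter: i for i, letter in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


def swap_axes(matrix):
    """Turn (sequence position, letter) into (letter, sequence position)."""
    return [list(column) for column in zip(*matrix)]


default_order = one_hot_encode("AC")                 # columns ordered A, C, G, T
custom_order = one_hot_encode("AC", alphabet="ATCG")  # columns ordered A, T, C, G
letters_first = swap_axes(default_order)              # letter axis comes first
```

The same sequence yields different arrays depending on the alphabet and axis order, which is why a model trained with one convention needs the dataloader configured to match it.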

The package structure was inspired by torchvision and provides three kinds of objects:

  • dataloaders: final objects used to train models and make predictions. Examples: SeqIntervalDl, MMSpliceDl.
  • transforms: simple functions or callable classes that, for example, resize the genomic intervals or one-hot-encode the DNA sequence.
  • extractors: given a genomic interval, extract the values from genome-wide files like FASTA or BigWig. See also genomelake for more extractors.

These building blocks allow you to write new dataloaders for your own models. See our Colab notebook for how to use kipoiseq dataloaders to train a Keras model.

Contributing multiple very similar models with a template

To easily contribute model groups with multiple models of the same kind, you can now specify two files describing all the models:

  • model-template.yaml: template for model.yaml
  • models.tsv: tab-separated file holding custom model variables

[Figure: First few lines of model-template.yaml.]

[Figure: First few lines of models.tsv.]

Each row in models.tsv represents a single model and is used to populate model-template.yaml and construct model.yaml using the jinja2 templating language. This even allows you to write if statements in model-template.yaml. See the CpGenie model for an example.
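
The row-to-file idea can be sketched with the Python standard library alone. Note the substitutions: string.Template stands in for jinja2 here so that the example runs without extra dependencies (unlike jinja2, it has no if statements), and the template fields, column names, and URLs are invented for illustration rather than taken from the real model.yaml schema.

```python
import csv
import io
from string import Template

# Hypothetical template; NOT the real model.yaml schema.
MODEL_TEMPLATE = Template("name: $model\nweights_url: $url\n")

# Hypothetical table: one row per model, columns are the template variables.
MODELS_TSV = (
    "model\turl\n"
    "modelA\thttps://example.org/a.h5\n"
    "modelB\thttps://example.org/b.h5\n"
)


def render_models(template, tsv_text):
    """Return {model name: rendered text}, one entry per row of the table."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["model"]: template.substitute(row) for row in rows}


rendered = render_models(MODEL_TEMPLATE, MODELS_TSV)
```

Each value in rendered plays the role of one generated model.yaml, so adding a model to the group only requires adding a row to the table.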

Prediction testing

We now also test that the model's predictions match the expected ones. This is configured through an additional field, test.expect, in model.yaml.

The file specified under test.expect is an HDF5 file containing the input values and model predictions. You can generate this file by running either

kipoi test <model> -o expect.h5 or

kipoi predict ... -o expect.h5 --keep-inputs.

Note that this command is only used to generate the file and has to be run only once. If model.yaml contains the test.expect entry, invoking kipoi test <model> will also check that the predictions still match.

Testing whether the predictions match is extremely important because deep learning frameworks release new versions frequently, and we have to make sure that models stored with an older version still yield the same predictions. This becomes even more important once we start porting models from one framework to another via ONNX.
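
Conceptually, such a check compares stored predictions with freshly computed ones up to a small numerical tolerance, since different framework versions rarely agree bit for bit. The sketch below illustrates the idea on flat lists of floats; the function name and the tolerance values are assumptions for this example and do not describe how kipoi test performs the comparison.

```python
import math


def predictions_match(expected, observed, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if both sequences have the same length and close values.

    The tolerances are illustrative defaults, not kipoi's settings.
    """
    if len(expected) != len(observed):
        return False
    return all(
        math.isclose(e, o, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, o in zip(expected, observed)
    )


stored = [0.1, 0.9]
same_up_to_noise = predictions_match(stored, [0.1000000001, 0.9])
changed = predictions_match(stored, [0.2, 0.9])
```

A check like this tolerates floating-point noise while still flagging a real change in model behavior after a framework upgrade.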

Common conda environments

We now also provide a set of hand-curated conda environments suitable for multiple model groups. These environments can be installed through kipoi env create. Run the following two commands to install two common environments covering almost all the models in Kipoi:

kipoi env create shared/envs/kipoi-py3-keras1.2

kipoi env create shared/envs/kipoi-py3-keras2

You can see the list of models covered by these two environments here. For each model, you can get the appropriate environment name by running

kipoi env get <model>.

This allows you to automatically activate the right environment in bash scripts or Snakemake rules:

source activate $(kipoi env get <model>)

If you instead just want to invoke a single kipoi command within a custom environment, you can get the absolute path to the kipoi binary:

$(kipoi env get_bin <model>) predict .... <model> -o file.tsv

We test that all the model predictions still match in these new common environments.

Minor updates

  • Add kipoi get-example command.
  • Allow parametrizing custom models (PR #245).
  • Keep track of the kipoi version required by the model source and display a warning if kipoi has to be updated (PR #377).
  • Allow older kipoi versions to read yaml files with additional fields (only a warning is displayed).
  • Add option to disable automatic updates of the model repository. Use auto_update: False in ~/.kipoi/config.yaml.
