Open Software Packaging for Science

Modern scientific applications typically depend on a very large number of libraries written in various programming languages, ranging from Fortran to TypeScript, C, C++, Python, etc. So, we need to pin down substantial requirements to be able to share and distribute these applications.

Distributing a consistent collection of packages is a really hard problem, for it brings into play many constraints, from binary compatibility and platform idiosyncrasies to the solving of complex version combinations. For all these reasons, the open-source scientific computing community has been de facto crowd-sourcing the packaging of the scientific computing stack. This challenge is simply too large for a single team to tackle.

Today, a large part of the community relies on the conda package manager and on the community-led open-source conda-forge project, where volunteers collaborate on the packaging of thousands of libraries in our ecosystem. After only a few years of existence, conda-forge has thousands of contributors and is one of the main sources of packages for the community.

We have become extremely dependent on conda and conda-forge. We must think of their sustainability.

Conda was developed at Continuum Analytics (now Anaconda Inc.). The project is BSD-licensed and still largely maintained by employees of the company. They are very supportive of the conda-forge initiative, engage very regularly with the community, and graciously provide hosting for the build artifacts from conda-forge.

  1. However, while it is easy to serve conda packages with a simple web server, there is no open-source implementation of a conda repository, which would let organizations manage multiple channels with fine-grained permissions.
    - The public hosting service operated by Anaconda Inc. is not open-source. Rather, it is probably an ad hoc project tuned to the constraints that come with very high traffic. As a consequence, there is no off-the-shelf option for third parties to host their own packages.
    - There is no public mirror of the conda-forge channel and no official backup of the build artifacts managed by the community. In our opinion, this is a vulnerability for the project and the broader open-source scientific computing community.
  2. Another issue with conda is that, while it relies on a very efficient SAT solver to resolve dependency constraints, its preprocessing of the repository data makes it slow and memory-inefficient with large repositories such as conda-forge.
  3. Finally, tackling these two issues may be a good occasion to create a multi-stakeholder open-source organization that serves as a home for a formal specification of protocols and file formats, and for alternate reference implementations.

We set out to tackle these different challenges.

The Mamba Org

Today, we are pleased to announce:

  • The initial release of Quetz, the first open-source server for conda packages, supporting multiple channels and role-based access control for channels and packages.
  • The new version of Mamba, a (very) fast and reliable drop-in replacement for conda, written in C++.
  • The development of the Boa project, a replacement for conda-build leveraging the improved performance of Mamba.
  • The creation of a new GitHub organization, mamba-org, the home of the Quetz, Mamba, and Boa projects. This is also where we hold discussions on an improved specification of conda recipes.

Mamba

Mamba is a fully compatible drop-in replacement for conda. It was started in 2019 by Wolf Vollprecht. It can be installed as a conda package with the command conda install mamba -c conda-forge.

Speed… The main motivation for starting Mamba was to solve the performance issues of conda, both in terms of speed and memory usage.

  • Conda's poor performance is particularly striking in the case of large software repositories (such as conda-forge), where the entire repository index is parsed using the standard Python JSON parser. Tens of thousands of Python objects are created (as many as there are objects in the JSON index file) and processed before being fed into the SAT solver.
  • On the other hand, Mamba makes use of openSUSE's libsolv dependency resolver, bypassing most of that processing to resolve dependencies.
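
As a rough illustration of the first point, the sketch below parses a tiny, made-up index with Python's standard json module and builds one Python object per package record. The "packages", "name", "version", and "depends" layout is a simplified assumption modeled on a channel index file, and PackageRecord and load_records are hypothetical names, not conda internals.

```python
import json
from dataclasses import dataclass

# Made-up miniature index (file names and contents are illustrative only);
# a real channel index holds vastly more records.
RAW_INDEX = json.dumps({
    "packages": {
        "xtl-0.6.9-0.tar.bz2": {
            "name": "xtl", "version": "0.6.9",
            "depends": ["libgcc-ng >=7.3.0"],
        },
        "xtensor-0.21.5-0.tar.bz2": {
            "name": "xtensor", "version": "0.21.5",
            "depends": ["xtl >=0.6.9,<0.7"],
        },
    }
})


@dataclass
class PackageRecord:
    """Hypothetical per-package wrapper object."""
    filename: str
    name: str
    version: str
    depends: list


def load_records(raw):
    """Decode the whole index, then wrap every entry in a Python object."""
    index = json.loads(raw)  # the entire document is decoded in one go
    return [
        PackageRecord(filename, entry["name"], entry["version"], entry["depends"])
        for filename, entry in index["packages"].items()
    ]


records = load_records(RAW_INDEX)
print(len(records), "records loaded")
```

Every record costs at least one dictionary from the JSON parser plus one wrapper object before any solving starts, so time and memory grow with the size of the index rather than with the size of the request.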

In March 2019, Wolf Vollprecht put together the first alpha release of Mamba. It still relied on conda's internals for most operations, except for solving dependency constraints, which was delegated to libsolv. It already demonstrated significant performance improvements over conda:

The first public announcement of the mamba project by Wolf Vollprecht

The adoption of libsolv proved extremely beneficial. The solver is fast, battle-tested, and very actively maintained, as it is used in the RPM world. Since this initial announcement, the Mamba project has come a long way:

  • Conda version specifications are now fully supported by libsolv, which has been ported to all major platforms.
  • Mamba relies less and less on the conda codebase, the goal being to become a completely standalone implementation, in C++. Mamba supports parallel downloads of packages and repository data.
  • The command-line interface of Mamba has become a lot more user-friendly, spitting out meaningful error messages when version constraints cannot be satisfied.
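
To make "version constraints" concrete, here is a deliberately simplified sketch of checking a version against a comma-separated constraint such as ">=0.6.9,<0.7". It is not how libsolv or Mamba evaluate specifications: it assumes purely numeric dotted versions and only the six comparison operators, and the function names are hypothetical. Real conda specifications are richer than this.

```python
import operator

# Two-character operators are listed first so ">=" is not mistaken for ">".
_OPERATORS = [
    (">=", operator.ge), ("<=", operator.le),
    ("==", operator.eq), ("!=", operator.ne),
    (">", operator.gt), ("<", operator.lt),
]


def parse_version(text):
    """Turn '0.21.5' into (0, 21, 5); assumes numeric components only."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause of `spec`."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not compare(candidate, bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


print(satisfies("0.6.9", ">=0.6.9,<0.7"))
print(satisfies("0.7.0", ">=0.6.9,<0.7"))
```

Versions are compared as integer tuples, so "0.7.0" sorts after "0.7" here; a production implementation would need to normalize such cases.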

Finally, Mamba includes an extra utility, mamba repoquery.

  • mamba repoquery search xtensor will show you all available xtensor packages. You can also specify more constraints on this search query, for example mamba repoquery search "xtensor>=0.18"
  • mamba repoquery depends xtensor will show you a tree view of the dependencies of xtensor.
$ mamba repoquery depends xtensor

xtensor == 0.21.5
├─ libgcc-ng [>=7.3.0]
│ ├─ _libgcc_mutex [0.1 conda_forge]
│ └─ _openmp_mutex [>=4.5]
│   ├─ _libgcc_mutex already visited
│   └─ libgomp [>=7.3.0]
│     └─ _libgcc_mutex already visited
├─ libstdcxx-ng [>=7.3.0]
└─ xtl [>=0.6.9,<0.7]
  ├─ libgcc-ng already visited
  └─ libstdcxx-ng already visited

You can also ask for the inverse, namely which packages depend on a given package (e.g. ipython), using whoneeds.

$ mamba repoquery whoneeds ipython

 Name            Version Build          Channel
──────────────────────────────────────────────────
 ipykernel       5.2.1   py37h43977f1_0 installed
 ipywidgets      7.5.1   py_0           installed
 jupyter_console 6.1.0   py_1           installed

With the --tree (or -t) flag, you can display the same information in a tree.
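
Conceptually, a whoneeds query is a lookup in the inverted dependency graph. The following sketch shows the idea under stated assumptions: INSTALLED is a made-up list of records, each dependency string is assumed to start with the package name followed by optional constraints, and whoneeds here is a hypothetical Python function rather than Mamba's actual code.

```python
from collections import defaultdict

# Made-up installed records; the dependency strings are illustrative only.
INSTALLED = [
    {"name": "ipython", "depends": ["python", "traitlets >=4.2"]},
    {"name": "ipykernel", "depends": ["ipython >=5.0", "tornado"]},
    {"name": "ipywidgets", "depends": ["ipython >=4.0", "ipykernel"]},
    {"name": "jupyter_console", "depends": ["ipython", "ipykernel"]},
]


def build_reverse_index(records):
    """Map each package name to the set of packages that depend on it."""
    needed_by = defaultdict(set)
    for record in records:
        for spec in record["depends"]:
            dependency_name = spec.split()[0]  # drop any version constraint
            needed_by[dependency_name].add(record["name"])
    return needed_by


def whoneeds(records, target):
    """Return the sorted names of packages that directly depend on `target`."""
    return sorted(build_reverse_index(records).get(target, set()))


print(whoneeds(INSTALLED, "ipython"))
```

This only reports direct dependents; a tree view would apply the same lookup recursively to each result.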

Most typically, the conda package manager is distributed through Miniconda or Miniforge. These distributions include the conda package manager and all its dependencies, such as a Python interpreter, PycoSAT, OpenSSL, etc.

A byproduct of moving to C++ is that we can produce a single standalone binary executable, Micromamba, that can be used to bootstrap conda environments. This completely removes the need for Miniconda or Miniforge. Micromamba can bootstrap a root conda environment or conda environments at other locations.

With a lower memory footprint, a lighter-weight distribution mechanism, and improved performance, Mamba is especially well-suited for:
- building Docker images for Jupyter's Binder, where performance is critical to the user experience;
- building documentation on Read the Docs, where the performance issues of conda with large channels have been problematic in the past.

Conda is often mistaken for a "Python package manager" when it really is a general-purpose one, very much like APT or RPM. While it was initially adopted by the Python data science community, thousands of R packages are available on conda-forge, as well as entire compiler stacks and native applications.

The native implementation of Mamba demonstrates that this is really not about the Python programming language. Using a compiled language for Mamba also opens the door to language bindings, e.g. for R or Julia, that do not require a Python interpreter. We hope that other communities will engage with us on the project and make Mamba a more widely adopted solution.

Quetz

Quetz is an open-source package server for Mamba and conda packages, which has been a missing piece in the ecosystem for a long time.

It was started just a few weeks ago and is developed with an "API first" approach: it is built upon FastAPI, provides REST endpoints for most operations, and implements Role-Based Access Control (RBAC) for channels and packages. A web front-end for Quetz is also being developed.

We have defined the different types of roles (package/channel owners and maintainers) with the idea that Quetz should ideally be a good fit for the conda-forge project, or for organizations setting up conda-forge-like infrastructures for building packages.

An issue with anaconda.org's permission system is that it only provides channel-level, and not package-level access. This lack of granularity does not fit well with conda-forge's approach, where recipe maintainers only control a handful of packages (the outputs of a recipe) and should not be able to upload other artifacts to the channel.
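
The difference between channel-level and package-level permissions can be sketched in a few lines. This is not Quetz's data model or API: the Channel class, the can_upload function, and the rule that owners and maintainers may upload are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

# Assumption for this sketch: these two roles grant upload rights.
UPLOAD_ROLES = {"owner", "maintainer"}


@dataclass
class Channel:
    """Hypothetical channel holding roles at two levels of granularity."""
    name: str
    channel_roles: dict = field(default_factory=dict)  # user -> role
    package_roles: dict = field(default_factory=dict)  # package -> {user: role}


def can_upload(channel, user, package):
    """Allow uploads for channel-wide roles or for roles on this one package."""
    if channel.channel_roles.get(user) in UPLOAD_ROLES:
        return True
    return channel.package_roles.get(package, {}).get(user) in UPLOAD_ROLES


forge = Channel(
    name="example-forge",
    channel_roles={"alice": "owner"},
    package_roles={"xtensor": {"bob": "maintainer"}},
)

print(can_upload(forge, "bob", "xtensor"))  # bob maintains this package
print(can_upload(forge, "bob", "numpy"))    # but has no rights on others
```

With channel-level roles only, giving bob upload rights for xtensor would also let him upload any other package to the channel, which is exactly the situation a per-recipe maintainer model wants to avoid.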

We are planning on setting up a public-facing instance of Quetz. We hope it will be useful to the community.

What Else in The Mamba Org?

The Mamba community is currently developing additional tools.

  • The Mamba navigator is a web-based UI for exploring local conda environments and package dependencies. It is still under active development.
The mamba navigator project.
  • Boa is a fast drop-in replacement for conda-build, relying on Mamba for solving dependency constraints.
  • Setup-mamba provides a GitHub action for setting up Mamba in CI.

Acknowledgements

The development of Mamba at QuantStack was funded by Bloomberg. The development of Quetz and Boa is self-funded by QuantStack.

About the authors

Wolf Vollprecht is a scientific software engineer at QuantStack and the creator of Mamba. Wolf started the Mamba project in 2019 and remains the main contributor to the project. When he is not developing Mamba, Wolf works on the Xtensor project and develops Jupyter extensions and robotics applications!

Mario Buikhuizen is a freelance full-stack web developer working with QuantStack on several projects related to Jupyter, Voilà, and Mamba. Mario is the main author of Quetz.

Marianne Corvellec works with QuantStack as a freelance software developer. Marianne developed the front-end of the Mamba navigator. Marianne also contributes to Voilà, Plotly, and scikit-image.

Johan Mabille is a C++ expert and the main author of the Xtensor and Xeus C++ libraries. He joined the Mamba endeavour recently, developed the support for multiple channels and the handling of query results, and took up a general cleanup of the codebase.

David Brochart is a scientific software developer at QuantStack. He contributed significantly to the Mamba codebase. Besides Mamba, David maintains numerous projects from the Jupyter stack (nbconvert, jupyter server, Voilà, ipyleaflet).
