An interesting way to speed up conda
Conda has rapidly become the standard way to install data-science libraries, largely because it just works. A major part of this is a community-led packaging effort known as conda-forge. Conda-forge allows users to build conda packages for virtually anything and has over 6000 packages at the time of writing.
This success does come with some costs, sadly. Conda determines which packages it needs to install by using a SAT solver. SAT solvers are very good at finding optimal package resolutions for large problems, but they scale super-linearly with the number of clauses that need to be resolved. For conda-forge, this presently (January 2019) means around 50 thousand artifacts that need to be considered. This can lead to long install times in which users just see a spinning wheel while conda works out which packages need to be installed.
This is not a great user experience
Users don’t like waiting to install packages, leading to people choosing to use other systems for managing packages.
Can we fix this?
We could change conda and improve the solver. There are efforts underway to do that, but it is not easy. Given that improving the solver is hard, is there something that can be done more easily?
Let’s unpack what happens when you install a package from a conda channel.
- The channel’s repository index (repodata.json) is downloaded from the channel for your machine’s platform and architecture.
- The repodata is parsed and a set of metadata about all packages in that channel is loaded into memory
- A SAT-solver is used to determine which packages to download from the channel and in which order to install them
- Packages get installed.
- Happy users
So, if we were to change the repodata, we could affect the behavior of the channel…
Structure of a repodata.json
Conda stores the metadata for a channel in a file called repodata.json. It lists every package in the channel along with its versions, its dependencies, and various other metadata fields.
This is a minimal chunk of a repodata.json. Other fields have been removed for clarity.
The interesting part here is the `depends` key of each package. Using this we can construct a simple graph of all of the packages (ignoring version constraints).
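As a rough illustration, here is a hypothetical repodata fragment written as a Python dict, together with a small helper that builds a name-level graph from the `depends` keys. The file names, versions, and the `build_graph` helper are invented for this sketch, and real repodata carries many more fields per package.

```python
from collections import defaultdict

# Hypothetical repodata fragment: file names and versions are invented.
REPODATA = {
    "packages": {
        "package_a-1.0-0.tar.bz2": {
            "name": "package_a",
            "version": "1.0",
            "depends": ["package_b >=1.0"],
        },
        "package_b-1.0-0.tar.bz2": {
            "name": "package_b",
            "version": "1.0",
            "depends": [],
        },
        "package_c-1.0-0.tar.bz2": {
            "name": "package_c",
            "version": "1.0",
            "depends": [],
        },
    }
}


def build_graph(repodata):
    """Map each package name to the set of package names it depends on."""
    graph = defaultdict(set)
    for record in repodata["packages"].values():
        deps = graph[record["name"]]  # creates the node even with no deps
        for spec in record.get("depends", []):
            # A spec looks like "name [version [build]]"; keep only the name.
            deps.add(spec.split()[0])
    return dict(graph)


graph = build_graph(REPODATA)
```

Collapsing every artifact of a package into a single node keeps the graph small, at the cost of over-approximating: a dependency used by any version of a package is treated as a dependency of the package as a whole.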
Most users are never going to install all 6000 packages from conda-forge into a single environment; they install a small subset instead.
So, given that we can build a graph of packages, we can remove the packages we don't need.
In the example above, if we wanted to install `package_a`, we need to install `package_b` but not `package_c`. We can turn this into a graph.
To work out which packages are viable candidates when installing `package_a`, we find all the nodes that it depends on, recursively, until we run out of nodes.
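A minimal sketch of that traversal, assuming the graph is a plain dict mapping each package name to the set of names it depends on. The `reachable` helper is hypothetical, and it uses an explicit stack instead of recursion so that deep dependency chains and cycles are handled safely.

```python
def reachable(graph, roots):
    """Return every package name reachable from `roots` via dependency edges."""
    seen = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited; this also guards against cycles
        seen.add(name)
        # Names missing from the graph are treated as leaves.
        stack.extend(graph.get(name, ()))
    return seen


# Toy graph matching the package_a / package_b / package_c example.
GRAPH = {
    "package_a": {"package_b"},
    "package_b": set(),
    "package_c": set(),
}

needed = reachable(GRAPH, ["package_a"])
```

Here `needed` contains `package_a` and `package_b`, while `package_c` is never visited and can be dropped.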
In practice for many common installs using conda-forge we can reduce the size of the repodata by 90% or more. This massively speeds up the solving process as well as reducing the size of the repodata that we need to retrieve from the internet.
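Once the set of needed names is known, shrinking the repodata is a simple filter. The sketch below assumes a repodata dict with a top-level `packages` mapping whose records carry a `name` field; `prune_repodata` and the sample data are hypothetical.

```python
def prune_repodata(repodata, keep_names):
    """Return a copy of `repodata` holding only artifacts whose name is kept."""
    pruned = dict(repodata)  # shallow copy so other top-level keys survive
    pruned["packages"] = {
        filename: record
        for filename, record in repodata.get("packages", {}).items()
        if record["name"] in keep_names
    }
    return pruned


# Invented sample data for illustration only.
FULL = {
    "info": {"subdir": "linux-64"},
    "packages": {
        "package_a-1.0-0.tar.bz2": {"name": "package_a", "depends": ["package_b"]},
        "package_b-1.0-0.tar.bz2": {"name": "package_b", "depends": []},
        "package_c-1.0-0.tar.bz2": {"name": "package_c", "depends": []},
    },
}

small = prune_repodata(FULL, {"package_a", "package_b"})
```

All versions and builds of a kept package stay in the output, so the solver still gets to choose between them; only packages that can never be selected are removed.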
Putting these ideas together
I’ve built a version of this at https://github.com/regro/conda-metachannel. We’ll have a version of it hosted soon for people to play with.
The only remaining point that is somewhat interesting is how users can tell our fake channel that they are interested in a particular subset of packages. Conda does not have a way to specify arguments to pass to a channel.
A channel, however, is just a URL. As such, we can express our “user interface” in the form of URL segments.
Then in our application, we can just decompose the URL by splitting on slashes and use the package constraints to serve up our graph.
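To make the idea concrete, here is a sketch that assumes a hypothetical path layout of `/<channel>/<comma-separated packages>/<platform>/<filename>`. This layout, the `ChannelRequest` type, and `parse_channel_path` are assumptions made for illustration and are not necessarily the scheme conda-metachannel uses.

```python
from typing import NamedTuple, Tuple


class ChannelRequest(NamedTuple):
    channel: str
    packages: Tuple[str, ...]
    platform: str
    filename: str


def parse_channel_path(path):
    """Split a request path into its (hypothetical) four segments."""
    segments = [s for s in path.split("/") if s]
    if len(segments) != 4:
        raise ValueError(f"expected 4 path segments, got {len(segments)}: {path!r}")
    channel, constraints, platform, filename = segments
    # The constraint segment is a comma-separated list of package names.
    packages = tuple(p for p in constraints.split(",") if p)
    return ChannelRequest(channel, packages, platform, filename)


request = parse_channel_path("/conda-forge/pandas,scikit-learn/linux-64/repodata.json")
```

The parsed `packages` tuple would then act as the set of root nodes for the graph traversal.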
When we encounter a URL that is not for repodata, we instead find the corresponding real package and serve a redirect to the real file.
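The routing decision can be sketched as follows. The `handle` and `redirect_target` functions are hypothetical, the upstream base URL is an assumption about where the real channel is hosted, and the example file name is invented.

```python
def redirect_target(channel, platform, filename,
                    base_url="https://conda.anaconda.org"):
    """Build the upstream URL a non-repodata request could be redirected to."""
    return f"{base_url}/{channel}/{platform}/{filename}"


def handle(channel, platform, filename):
    """Return (status, location): serve repodata ourselves, redirect the rest."""
    if filename == "repodata.json":
        return 200, None  # the caller would serve the reduced repodata here
    return 302, redirect_target(channel, platform, filename)


status, location = handle("conda-forge", "linux-64", "numpy-1.16.0-py37_0.tar.bz2")
```

Because packages are redirected and not proxied, the fake channel only ever has to serve the small repodata file itself.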