Writing a Zip File Importer: The Path Hook

Part 1

There are essentially three parts to any importer in Python 3:

  1. The path hook
  2. The finder (meta or path entry)
  3. The loader

In three posts I hope to write about the creation of each needed class in order to come up with a new zip-based importer to replace zipimport (issue #17630).

Why Replace zipimport?

You might be wondering why I would go to all of this trouble to replace a module that seems to basically work. In essence: maintaining C code sucks compared to Python code. You see, zipimport is written entirely in C for bootstrapping reasons; you can’t load stdlib modules from a zip file if the zip file importer itself depends on those stdlib modules.

But that leaves you maintaining C code. Did you know zipimport has its own implementation of a zip file reader? Yeah, which means any fixes or improvements which go into the zipfile module do not make their way into zipimport. E.g. you ain’t about to get ZIP64 support in zipimport unless you want to write a patch to support it in the C code. It also means that making the importer more flexible simply doesn’t happen, as no one wants to put the effort in. But if a zip file importer were written in Python, then supporting things that zipfile supports, allowing for alternative loaders, etc. wouldn’t seem like such a stretch.

The Path Hook

Creating a path hook involves writing an object which, when called, can make a decision as to whether it can provide a finder for a specific path. In the case of importlib.machinery.FileFinder, its path hook returns a finder whenever it is presented with a directory. For a zip file path hook, it needs to detect a zip file. Sounds easy, right?
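To make the "object which, when called" part concrete, here is a small sketch that pokes at the path hook FileFinder already provides. The describe() helper is a made-up name for illustration only; the thing to notice is that a path hook is just a callable that either returns a finder or raises ImportError to say "not mine".

```python
import importlib.machinery as machinery

# FileFinder.path_hook() builds a path hook (a closure) from loader details.
directory_hook = machinery.FileFinder.path_hook(
    (machinery.SourceFileLoader, machinery.SOURCE_SUFFIXES)
)


def describe(hook, path):
    """Hypothetical helper: name the finder a hook returns, or None if it declines."""
    try:
        finder = hook(path)
    except ImportError:
        # Raising ImportError is how a path hook says it can't handle a path.
        return None
    return type(finder).__name__
```

Calling describe(directory_hook, some_directory) gives back the finder's class name, while a path that is not a directory gives None. A zip file path hook follows the same contract, just with a different test.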

The trick with path hooks is handling the case where the “directory” for a package has been appended to the original thing typically found in sys.path. In the case of zip files, it’s when you have a path to something like /path/to/code.zip but are in some package named pkg: /path/to/code.zip/pkg/ (notice how the two paths are mushed together). Because people tweak package locations in __path__ on occasion you typically want to try and handle this case instead of just throwing your hands up and saying “screw __path__ manipulations, I will figure out where a module is from the top of a zip file and not care about funky choices made by users”. While it’s tempting to do this (especially for things like database importers where the concept of a file path is in no way inherent in the storage mechanism), in the case of zip files it makes sense to try and support this (unlike zipimport).

How do you do this? Well, you start with the path you were given, and check if it’s a file. If it is, is it a zip file? If so then great! If not then you can’t handle this path. If the path isn’t a file at all, you slice off the end of the path and check again, walking up until you hit a file or run out of path parts to slice off. It’s a bit clunky, but the only alternative is to match based on some substring like “.zip” which is rather brittle.
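The walk described above can be sketched roughly like this. The function name find_zip_archive() and its (archive, inner path) return value are my own choices for illustration, not part of any existing API; zipfile.is_zipfile() is used as one way to answer "is it a zip file?".

```python
import os
import zipfile


def find_zip_archive(path):
    """Return (archive_path, path_inside_archive) or raise ImportError.

    Hypothetical helper: walks up `path` one component at a time until it
    reaches an existing file, then checks whether that file is a zip file.
    """
    candidate = path
    inner_parts = []  # components sliced off so far, innermost first
    while candidate:
        if os.path.isfile(candidate):
            if zipfile.is_zipfile(candidate):
                return candidate, "/".join(reversed(inner_parts))
            # Hit a file, but it isn't a zip file: can't handle this path.
            raise ImportError("not a zip file", path=path)
        parent, tail = os.path.split(candidate)
        if parent == candidate:
            break  # reached the filesystem root
        if tail:  # tail is empty when the path ends with a separator
            inner_parts.append(tail)
        candidate = parent
    raise ImportError("no zip file found in path", path=path)
```

Given /path/to/code.zip/pkg/ this would hand back the archive path plus "pkg", which is the bit a finder needs to know where in the zip file to look.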

But that’s actually it, by definition, for supporting a path hook: is this path something I can provide a finder for? Sometimes, though, you have one other question to worry about: is there something I should be caching in my path hook?

See, a single zip file might end up with a bunch of different paths pointing into it thanks to the various packages it contains (e.g. /path/to/code.zip/pkg/, /path/to/code.zip/pkg/subpkg/, etc.). Each of those paths will have its own finder stored in sys.path_importer_cache. Each of those finders will have to check if a path exists in the zip file, and then eventually pass on the zip file object to the loader so it can actually read the data it needs to load a module. You really don’t want to be constantly opening and closing a zip file for every import. So in this instance it’s best to cache zip files in the path hook. That way finders can just ask the path hook for the zip file when they need it, and there is only a single zipfile.ZipFile object open per zip file. Saves on stat calls, overhead, etc.
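Here is one way the caching could look, as a sketch only. CachingZipPathHook, archive_for() and the finder_factory argument are all names I made up for illustration; the finder itself is a stand-in since it isn't covered here.

```python
import os
import zipfile


def _split_zip_path(path):
    """Hypothetical helper: walk up `path` to find the zip file it points into."""
    candidate, inner = path, []
    while candidate:
        if os.path.isfile(candidate):
            if zipfile.is_zipfile(candidate):
                return candidate, "/".join(reversed(inner))
            break  # a file, but not a zip file
        parent, tail = os.path.split(candidate)
        if parent == candidate:
            break  # reached the filesystem root
        if tail:
            inner.append(tail)
        candidate = parent
    raise ImportError("no zip file found in path", path=path)


class CachingZipPathHook:
    """Sketch of a path hook that keeps one open ZipFile per archive."""

    def __init__(self, finder_factory):
        # finder_factory is an assumed callable: (hook, archive_path, inner_path) -> finder
        self._finder_factory = finder_factory
        self._archives = {}  # absolute archive path -> zipfile.ZipFile

    def archive_for(self, archive_path):
        """Return the shared ZipFile for an archive, opening it on first use."""
        key = os.path.abspath(archive_path)
        if key not in self._archives:
            self._archives[key] = zipfile.ZipFile(key)
        return self._archives[key]

    def __call__(self, path):
        # Raises ImportError when the path doesn't lead into a zip file.
        archive_path, inner_path = _split_zip_path(path)
        return self._finder_factory(self, archive_path, inner_path)
```

An instance of this would be what gets appended to sys.path_hooks. Handing the hook itself to the finder is just one design choice; it lets every finder for the same archive call archive_for() and get the same zipfile.ZipFile object back.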

And that’s it! The code for a zip file path hook is all about finding out if a path has a zip file in it and managing the cache of zip files.