Don’t Use Docker to Package Python Lambdas on Your Mac
MacOS is not exactly like Vanilla Linux. It’s a close cousin, sure, but not really a sibling.
MacOS runs on open-source Unix-based internals known as Darwin, and because not all *NIX-based systems are created equal, they are not 100% compatible with each other. You might see some weird behavior as soon as you do something that needs OS-level access or depends on internal Linux conventions.
Most of the time, these nuanced differences shouldn’t bother you too much. If you need to run a few shell commands or copy-paste something from Stack Overflow, you’ll be just fine.
Where you may run into trouble is when you package your Python repositories and upload them to a cloud provider. Your Python package may or may not run just fine, depending on how much it relies on OS specifics.
There are two solutions to this problem. One is slow and costs money; the other is fast and free.
The question is: Which do you prefer?
The Obvious Choice — Docker
Most Python packages are OS agnostic. They will work on Linux, MacOS and Windows. And because MacOS and Linux are such close cousins, most packages are cross-compatible between the two, more so than with Windows. But some packages, mostly in the cryptography area, ship OS-specific implementations, and the builds for Darwin are incompatible with the builds for Linux. Those packages will break your Python distributions: if your program uses them, you need a separate wheel (or binary) for each OS you want to run it on.
That’s why most developers who target a cloud-based runtime build Python packages on their MacBooks using an intermediary: Docker. Most cloud providers do not offer a Darwin-based runtime, so developers must build their package inside a Linux container to ensure full compatibility.
The Disadvantages of Docker
But there are a few cons to Docker too.
First, Docker is really resource intensive. It’s like a mini Chrome running in the background, eating up your RAM and battery. My i9 MacBook Pro really gets its fans spinning when I build one of our larger Python projects.
Second, the workflow is quite complex. You must spin up a container, copy in your project’s dependency file, resolve and install Linux-compatible Python packages, and copy them back out of the container (or set up a shared volume). We use pip + pipenv, by the way, so the code and examples will be based on those tools.
Even if you use cloud-native build tools, they use Docker in the background and spin up Linux containers for you. They may also require additional configuration, as you’d need to make them use your environment variables and build paths and external dependencies and so on.
Third — and this is a more recent reason — you may need to buy a license for Docker Desktop. This is mostly relevant for businesses, yes, but let’s explore a potentially better and free alternative.
Building Python Packages on MacOS Natively
The idea we decided to go with was to use pip to install packages while enforcing specific OS compatibility. We stumbled upon this feature of pip almost by accident and thought it was worth a go.
The first thing you need to know about this magical pip feature is that it won’t resolve your packages’ sub-dependencies.
Forcing pip to download OS-specific packages using the “--platform” flag also forces usage of the “--no-deps” flag. Passing this flag to pip stops it from resolving transitive dependencies, and it will just download the packages you asked for and finish.
Command examples:
- “pip install requests --platform manylinux2014_x86_64”
This attempt to install a package for a specific OS won’t work. Pip will error out, saying “either --no-deps must be set or --only-binary=:all: must be set and --no-binary must not be set”.
- “pip install requests --platform manylinux2014_x86_64 --no-deps”
This installs but does not work, as requests does not have any of its dependencies installed.
- “pip install requests certifi charset-normalizer idna urllib3 --platform manylinux2014_x86_64 --no-deps”
Installing requests with all its dependencies solves this problem.
The second issue concerns those platform-specific packages we talked about. One of our packages refused to install because it needed to compile binaries as part of its installation, and it was failing to compile those Linux-specific .bin files on our Darwin distribution.
It seemed we were at a dead end. We had no dependency resolution and one package that wouldn’t install at all.
Fixing Dependency Resolution
Let’s start with the second issue (installing platform-specific packages) because the solution was marginally simpler. Apparently, you can force pip to download pre-compiled binaries for packages that need them. One of those packages is pyssl. All we had to do was add the “--only-binary” flag, pass our package name to it and voilà, it works.
“--only-binary pyssl”
This flag tells pip to use only pre-compiled binaries (wheels) for the listed packages instead of building them from source. You can add more comma-separated values as needed: “--only-binary pyssl,cryptography”
Dependency resolution was a bit trickier. We tried using some external tools, but none of them worked. The only one that came close to a workable solution was johnydep, which worked for most packages but got stuck in an endless recursive loop for packages with multi-level dependency trees.
This caused me to think to myself, “Wait a minute. I’m a software engineer! I, too, can write recursion!” And I did, ending up with our in-house dependency-tree resolver for pip packages.
You can grab the code from this repository.
Getting a single package’s dependencies was easy enough using “pip show <package_name>” and parsing the output. Add some memoization to remember the packages you have already resolved dependencies for, and it runs quickly.
This produces a “flat” dependency list, like the one you can find in a Pipfile.lock file, but for the OS you specified. We could now run our pip command with the platform enforcement flags and produce a working, deployable Zip.
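The recursion described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function names (parse_requires, pip_show, flatten_dependencies) are hypothetical, it relies on the “Requires:” line that “pip show” prints, it only sees packages that are installed in the current environment, and it returns bare names without pinned versions.

```python
import subprocess
import sys
from typing import Callable, List, Set


def parse_requires(pip_show_output: str) -> List[str]:
    """Extract package names from the 'Requires:' line of `pip show` output."""
    for line in pip_show_output.splitlines():
        if line.startswith("Requires:"):
            names = line[len("Requires:"):].split(",")
            return [name.strip() for name in names if name.strip()]
    return []


def pip_show(package: str) -> str:
    """Run `pip show` for a package installed in the current environment."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "show", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def flatten_dependencies(
    packages: List[str],
    show: Callable[[str], str] = pip_show,
) -> List[str]:
    """Return the given packages plus all transitive dependencies, sorted."""
    resolved: Set[str] = set()  # memo of packages we have already visited

    def visit(package: str) -> None:
        key = package.lower()
        if key in resolved:
            return  # already handled; this also stops dependency cycles
        resolved.add(key)
        for dependency in parse_requires(show(key)):
            visit(dependency)

    for package in packages:
        visit(package)
    return sorted(resolved)
```

Passing the lookup in as the show parameter is a design choice made for this sketch: it lets you swap “pip show” for a stub when testing, or for another metadata source later.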
Here is the complete pip command you need to run on your flattened dependencies list:
“pip install --target <lambda_build_dir> --requirement <flattened_requirements_list> --platform manylinux2014_x86_64 --no-deps --only-binary <your_binary_only_dependencies>”
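If you drive the build from a Python script, the command above can be assembled as an argument list. The sketch below is only an illustration: build_pip_command and its parameter names are hypothetical, while the flags are the same ones used in the command above.

```python
import sys
from typing import List, Sequence


def build_pip_command(
    target_dir: str,
    requirements_file: str,
    binary_only: Sequence[str] = (),
    platform: str = "manylinux2014_x86_64",
) -> List[str]:
    """Assemble the platform-restricted `pip install` call as an argv list."""
    command = [
        sys.executable, "-m", "pip", "install",
        "--target", target_dir,
        "--requirement", requirements_file,
        "--platform", platform,
        "--no-deps",  # the requirements list is assumed to be flattened already
    ]
    if binary_only:
        # pip expects a single comma-separated value for --only-binary
        command += ["--only-binary", ",".join(binary_only)]
    return command
```

You could then run it with subprocess.run(build_pip_command("build", "flat-requirements.txt", ["cryptography"]), check=True), where the directory and file names are placeholders.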
Improving Deployment Speed
It still wasn’t perfect. This was a costly process — even on subsequent runs when pip had already cached the packages and didn’t need to download them all. To speed up all subsequent runs, we ended up using another layer of cache. By hashing the contents of a requirements.txt file we had already resolved and flattened dependencies for, we could skip the dependency resolution process.
We could not just run pip freeze or pipenv lock and use the generated requirements list, because it is generated only for the OS you ran the command from, which is exactly what we are trying to avoid.
You generally deploy the same project several times, making only minor code changes and not changing dependencies, so this caching really made sense for us.
It’s important to note that we only had this problem because we insisted on specifying different dependencies for each deployable Zip file and did not want to include dependencies that are natively provided by the cloud runtime environment. If you don’t care about that, you can just package all your dependencies from your requirements.txt or Pipfile.lock, and it probably won’t matter that much in terms of Zip size or run time.
So, we ended up with sort of a double cache which, when hit, is blazingly fast.
Here’s how we did it:
- First, we calculate a hash of the contents of a requirements file, which takes a fraction of a second. If we already have a flattened dependencies list for it, we just use it.
- Next is the caching built into pip. When downloading a package, pip saves it locally. If we need to install it again (say, into a different virtual environment), pip doesn’t need to download it again; it fetches the package from cache. This feature is enabled by default (but can be disabled if you want), so we got that one for free.
- When both caches are hit, pip just installs the already-downloaded packages. It took about 15 seconds for a project with 10 packages (approximately 60 packages, including all transitive dependencies).
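The first cache layer, keyed on a hash of the requirements file, could look something like the sketch below. It is an illustration under assumptions, not our actual code: the function names, the choice of SHA-256, and the one-text-file-per-hash cache layout are all made up for the example, and flatten stands in for whatever resolver you use.

```python
import hashlib
from pathlib import Path
from typing import Callable, List


def requirements_digest(requirements_file: Path) -> str:
    """Hash the raw bytes of a requirements file."""
    return hashlib.sha256(requirements_file.read_bytes()).hexdigest()


def cached_flatten(
    requirements_file: Path,
    cache_dir: Path,
    flatten: Callable[[List[str]], List[str]],
) -> List[str]:
    """Return the flattened list, resolving only when the file has changed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{requirements_digest(requirements_file)}.txt"
    if cache_file.exists():
        # Cache hit: skip dependency resolution entirely.
        return cache_file.read_text().splitlines()

    # Cache miss: read the top-level requirements, skipping blanks and comments.
    top_level = [
        line.strip()
        for line in requirements_file.read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    flattened = flatten(top_level)
    cache_file.write_text("\n".join(flattened))
    return flattened
```

Because the key is derived from the file contents, any edit to the requirements file produces a new hash and triggers a fresh resolution, while code-only changes reuse the cached list.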
Reducing Deployment Size
Now we can quickly build Zip files that are ready to be deployed. You can stop here and be happy with your results. You have successfully built Python packages with fully working Linux dependencies natively on your Mac.
One final improvement we made to minimize the size of files is to remove duplicate or unnecessary code.
Here are a few more tips for keeping Zip archives down to size:
- Do not package any test or infrastructure files (if any). Cloud runtimes need neither your infrastructure code nor your tests. You should also avoid packaging any dependencies that your project uses only for testing and infrastructure purposes.
- Do not package dependencies that are supplied natively by the cloud provider of your choice. For example, the AWS Lambda Python runtime provides boto3 and botocore out of the box, so you do not need to pack them.
- If you’re deploying to multiple clouds, create a separate package for each of them. You may have some vendor-specific dependencies that you can remove for each individual cloud deployment.
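As a rough sketch of the first tip, the archive step can skip whole directories while zipping. Everything named here is an assumption made for illustration: the build_zip function, and the directory names in EXCLUDED_DIRS, which you would replace with whatever your project actually uses. The Zip file should be written outside the source directory so it does not end up inside itself.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List

# Hypothetical directory names; adjust them to your project layout.
EXCLUDED_DIRS = ("tests", "infrastructure", "__pycache__")


def build_zip(
    source_dir: Path,
    zip_path: Path,
    excluded_dirs: Iterable[str] = EXCLUDED_DIRS,
) -> List[str]:
    """Zip source_dir, skipping files that live under an excluded directory."""
    excluded = set(excluded_dirs)
    written: List[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            relative = path.relative_to(source_dir)
            if not path.is_file():
                continue
            # Only directory components are checked, not the file name itself.
            if excluded.intersection(relative.parts[:-1]):
                continue
            archive.write(path, relative.as_posix())
            written.append(relative.as_posix())
    return written
```

The function returns the archived paths, which makes it easy to log or check what actually went into the deployment package.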
My Debt of Gratitude to Tech
This is probably the end of my tinkering with our build process for the time being, but I’m not done with it. During my long hours trying to get this whole orchestra to play nicely together, I’ve come to learn that Amazon Linux 2 Python platform versions support Pipenv-based requirements files.
I’d like to get rid of our requirements.txt files next. We use Pipenv to manage our virtual environments and only generate requirements files for our AWS deployments. Those requirements files must match the exact versions from Pipfile.lock, so we can’t just pip freeze (or pipenv lock -r) and be on our merry way. We have some god-awful code, which I’m ashamed to even mention here, that generates a requirements file from an already-locked Pipenv environment.
Hopefully, I’ll get to write another post soon about how I got rid of them.
P.S. If you’re planning to use AWS’s new arm64 runtime, you just need to change the platform to “manylinux2014_aarch64.” From what I understand, it should work — but I did not test it. Note that some packages may require older Python versions and older CPython compilers.
Special thanks go out to my colleagues Noa Gelber and Dor Ganor for their contributions. You guys are awesome!