General Strategy for Reducing Docker Image Size, with Python Flask Example

Daniel Tan, AI2 Labs
Jun 17, 2020

Docker is a great tool for containerizing our applications, but images can easily grow to a few GiB. This post details general strategies that apply to most languages, and I'll use a Python Flask app to show how applying them can reduce image size by up to 90%.

Remove Dead Cruft

When transitioning from an exploratory phase to pre-production, try to remove dead code and unused dependencies left over from exploration. In agile processes this can happen quite often, so it helps to set up CI or a linter like PyFlakes for the team.

Removing these also speeds up the Docker build, since some dependencies require a compilation step when installed on a fresh system.

In Python, we can use several tools for this, but the steps are pretty generic and can be applied to any language.

1. Remove dead code

Vulture

Use vulture to find obvious dead code in your project, which often carries unneeded imports along with it. Remove everything it flags that you can confirm is unused.
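A minimal sketch of a vulture run (the `app/` path and confidence threshold are illustrative; adjust to your project layout):

```shell
# Install vulture, then scan the project for unused functions,
# classes, variables, and imports. --min-confidence filters out
# low-confidence guesses.
pip install vulture
vulture app/ --min-confidence 80
```

Review each report before deleting: vulture can only guess, so code reached via dynamic dispatch may be flagged even though it is live.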

2. Remove unused dependencies

List Imports

Next, review which imports are actually required by our code. list-imports will find the imports in your code.

Put the result in a requirements.in or setup.py file, and remove imports of your own modules and of the standard library, so that you keep a list of only the top-level third-party dependencies.
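If you prefer not to install another tool, the standard library's `ast` module can do a rough version of the same job. This sketch writes a small sample module (the file and module names are illustrative) and then lists the modules it imports:

```shell
# Create a sample module, then list the modules it imports,
# using only the Python standard library (ast).
cat > sample_app.py <<'EOF'
import os
import flask
from requests import get
EOF

python3 - <<'EOF'
import ast

with open("sample_app.py") as f:
    tree = ast.parse(f.read())

found = set()
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        found.update(alias.name for alias in node.names)
    elif isinstance(node, ast.ImportFrom) and node.module:
        found.add(node.module)

# os is standard library; flask and requests belong in requirements.in
print(sorted(found))   # prints ['flask', 'os', 'requests']
EOF
```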

Pip-tools

Do not use pip freeze to maintain the list of required dependencies; it makes pruning dependencies later quite difficult. Instead, compile the dependency list with pip-tools.

You can then treat requirements.txt as a generated file, and work directly with setup.py or requirements.in.
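A typical pip-tools workflow looks like this (assuming a requirements.in that lists only your direct dependencies):

```shell
pip install pip-tools
# requirements.in holds only top-level dependencies, e.g. "flask".
pip-compile requirements.in   # writes a fully pinned requirements.txt
pip-sync requirements.txt     # makes the environment match it exactly
```

Because requirements.txt is generated, dropping a dependency is just deleting one line from requirements.in and re-running pip-compile.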

Bundling dependencies with your app

This step bundles all the required libraries to create a portable app you can move around. The point is to strip away build files and extract only the compiled files needed to run the app.

This also enables the next step, Docker multi-stage builds, which is the largest space saver.

Pyinstaller

PyInstaller works pretty well with Flask since it can include our template and static files as well. However, there is one change to be made for Flask apps: how we find the folders for the template and static directories.

import os
import sys

from flask import Flask

# When frozen by PyInstaller, bundled data files are unpacked to a
# temporary folder exposed as sys._MEIPASS.
if getattr(sys, 'frozen', False):
    template_folder = os.path.join(sys._MEIPASS, 'templates')
    static_folder = os.path.join(sys._MEIPASS, 'static')
    app = Flask(__name__, template_folder=template_folder, static_folder=static_folder)
else:
    app = Flask(__name__, static_folder='static')

After that, the command you use to build your app can look like the following example.

pyinstaller -w -F \
    --add-data "templates:templates" \
    --add-data "static:static" \
    --hidden-import='pkg_resources.py2_warn' \
    app.py

This particular hidden import is needed due to a problem with setuptools 45.0.0 and later.

You might need a few tries to get this working, since PyInstaller is only able to find top-level, obvious imports in your code.

You can build the app with --debug=imports and run it to find hidden imports.
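One possible loop for tracking hidden imports down (the module name in the last command is a placeholder for whatever your run reports missing):

```shell
# Build with import tracing enabled, then run the binary and watch
# the startup log for ModuleNotFoundError / ImportError messages.
pyinstaller --debug=imports -F app.py
./dist/app

# Add each missing module explicitly and rebuild, repeating until
# the app starts cleanly.
pyinstaller -F --hidden-import=<missing_module> app.py
```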

Docker Multi-Stage Builds

Sometimes we might hear that Dockerfile best practice involves chaining all your RUN commands into one line to reduce the number of layers and thus the image size.

This is no longer necessary with multi-stage builds; instead, we should take advantage of layer caching when it makes sense.

An example Dockerfile illustrating these steps:

FROM python:3.6 AS build-stage
RUN apt-get update && apt-get install -y cmake
ADD app /
RUN pip install -r requirements.txt
RUN pyinstaller -w -F \
    --add-data "templates:templates" \
    --add-data "static:static" \
    --hidden-import='pkg_resources.py2_warn' \
    --noconfirm \
    app.py

FROM debian:buster-slim AS deploy-stage
RUN apt-get update && apt-get install -y libxcb1
COPY --from=build-stage /dist/app /app
EXPOSE 3001
CMD ["/app"]

Picking the right base images

Most of your space (and time!) savings actually come from this.

Since the build and deployment stages are split, we can pick a heavy image with all the tools we need for the build stage, avoiding the extra step of downloading and installing them. For the deploy stage, we can then choose a minimal, thin image.

For most applications, this means building on a full-featured image like node:latest or python:3.6 and then deploying with an nginx:alpine or debian:buster-slim image.

Note that Alpine images are usually a poor fit for Python deployments: Alpine uses musl libc instead of glibc, so prebuilt manylinux wheels do not work and many packages must be compiled from source, which slows builds and can even make the image larger.

Layer caching considered helpful

You don't always get the Dockerfile right the first time, and it's helpful when Docker doesn't repeat steps that have already completed. For that we can leverage layer and stage caching: Docker won't rebuild layers and stages that haven't changed, unless an earlier instruction, or a file pulled in by a COPY or ADD command, invalidates them.

This means you can work out the correct steps for your image incrementally, provided you split up the RUN, COPY, and ADD commands in your build stage sensibly, placing the COPY and ADD of frequently changing files last. In the deploy stage, however, you still want to keep commands to a minimum to reduce image size. You can find more information here: https://docs.semaphoreci.com/ci-cd-environment/docker-layer-caching/
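A sketch of what that ordering looks like in a build stage (paths are illustrative):

```dockerfile
FROM python:3.6 AS build-stage

# Dependencies change rarely, source code changes often: copy the
# requirements file first so the pip install layer stays cached
# across code-only edits.
COPY app/requirements.txt /requirements.txt
RUN pip install -r /requirements.txt

# Only this layer (and those after it) rebuilds when code changes.
COPY app /app
```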

A stage will not be invalidated unless its inputs have changed.
