Horizontally scaling GIS/Python analysis with Docker
I’ve recently been working on a project that involves running ETL and a series of spatial analysis calculations using SQL/PostGIS and Python on approximately 800 million data points. The project quickly grew from analyzing a small subset of data to needing to process multiple years of data to support the initial findings and it was soon apparent that some form of parallel processing would be necessary to process the volume of data and be able to run the spatial analysis multiple times. Refactoring the code to utilize Python multiprocessing or threading was a possibility but I wanted the ability to scale beyond multiple servers if necessary (and was looking for an excuse to play with Swarm!).
Docker Engine now has native clustering via Swarm and offers the ability to instantly scale a container across multiple nodes with one command. One minor change to the python code was all it took to be able to leverage Docker for parallel processing.
Parallel processing with Docker Swarm…
Step 1: Configure Python to use environment variables
Step 2: Build a Docker image using the Dockerfile
Step 3: Create and scale Swarm services
Step 4: Sit back and watch it run (aka watch the load averages)
Obviously this is a simplified example but the key benefit to using Docker was that I was able to easily scale an otherwise lengthy ETL and analysis process with one minor modification to my Python code (using environment variables instead of argparse). I would strongly recommend keeping a close watch on the load averages on the Swarm nodes and any associated databases when initially running the service scale command.
I also have no doubt that there are better and more efficient ways to scale out these types of processes. This is meant as a quick and relatively easy way to scale if you have the resources for multiple Swarm nodes or already use other parts of the Docker ecosystem.