Installing GDAL onto a Databricks cluster

James Clark Dixon
Published in OS TechBlog · Feb 17, 2022 · 3 min read

Here at OS we process a lot of data. We use Python and Databricks to help us with that, and my team does a lot of image processing. GDAL is a useful tool for image processing; however, it’s difficult to install GDAL onto a Python Databricks environment. This is how we managed to do it.

Why is GDAL difficult to install?

GDAL is a commonly used library for reading and writing raster and vector geospatial data formats. It can’t easily be installed into a Python environment with pip, as GDAL is a C++ library with Python bindings that must be built against the native library.
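For illustration, the pip route looks something like the sketch below; it assumes the native GDAL library and headers are already installed (gdal-config is provided by the libgdal development package):

# pip builds the bindings from source, so the native library and headers
# must already be present, and the bindings version must match the library:
pip install GDAL==$(gdal-config --version)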

It is much easier to install with Conda; however, due to Anaconda’s recent change to commercial licensing, Databricks has deprecated Conda from their most recent runtimes. The latest runtime that still allowed us to install GDAL, via a Conda init script, was 7.3 LTS, which is now three major versions behind the most up-to-date runtime and going out of support in September.
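For comparison, the Conda route is essentially a one-liner, because conda-forge ships pre-built GDAL binaries alongside the Python bindings:

conda install -c conda-forge gdal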

Why not use the method recommended by Databricks?

We had to find a solution that would allow us to use the latest runtimes for greater supportability, and Databricks’ recommendation (see the orange box here) was to use a Conda-based Docker container to pre-install any libraries that required Conda.

Databricks have their own “Blue Peter: here’s one I made earlier” container images for Conda on DockerHub; however, we thought it best not to use Conda at all, to avoid any potential licensing issues we might come across in the future.

Why not use Databricks’ standard container?

Databricks also have a standard container available on DockerHub, and we thought it would be easier to install GDAL onto that image if we could also install the correct Python bindings required for GDAL.

This was somewhat successful. The system packages required were python3.8-dev, gdal-bin and libgdal-dev, plus a few additional environment variables so the Python bindings could find the GDAL headers. This is the Dockerfile we used to successfully install GDAL without Conda:

FROM databricksruntime/standard:9.x

# Install the GDAL binaries, native library and Python dev headers
RUN apt-get update \
  && apt-get install --yes python3.8-dev gdal-bin libgdal-dev

# Tell the pip build where to find the GDAL headers
ENV CPLUS_INCLUDE_PATH=/usr/include/gdal
ENV C_INCLUDE_PATH=/usr/include/gdal

# Install the Python bindings into the Databricks Python environment
RUN /databricks/python3/bin/pip install GDAL==2.2.3

However, you may notice that this version of GDAL is quite old; the latest version is 3.4.1. This is because the Databricks standard image is itself based on Ubuntu 18.04, and GDAL 2.2.3 is the latest version available on that distribution. This version was too old to be compatible with some of our other libraries, so we had to find another solution.
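A quick way to confirm which version actually landed in the Databricks Python environment is a one-liner you can run inside the container:

/databricks/python3/bin/python3 -c "from osgeo import gdal; print(gdal.__version__)"
# prints 2.2.3 on the Ubuntu 18.04 based standard image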

Let’s flip it on its head

The problem we have been trying to solve here is “Can we install GDAL onto a Databricks Runtime container?” and the main blocker is that it’s really difficult to install GDAL. So instead of doing that we thought it was worth inverting the question and asking, “Can we install a Databricks Runtime onto a GDAL container?”.

OSGeo have an Ubuntu image on their DockerHub with the latest version of GDAL already installed. The Pythonised Databricks container Dockerfile can also be found on GitHub, and this installs and sets up all of the required dependencies for the container to be used by Databricks.

By combining these Dockerfiles we finally found a way to have an up-to-date version of GDAL installed on a Databricks cluster without any need for Conda. The final Dockerfile that we used is below:

FROM osgeo/gdal:ubuntu-small-3.4.1 as base

# From https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
# and https://github.com/databricks/containers/blob/master/ubuntu/python/Dockerfile
RUN apt-get update \
  && apt-get upgrade -y \
  && apt-get install -y \
    openjdk-8-jdk \
    iproute2 \
    bash \
    sudo \
    coreutils \
    procps \
    virtualenv \
  && /var/lib/dpkg/info/ca-certificates-java.postinst configure \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Initialize the default environment that Spark and notebooks will use
RUN virtualenv -p python3.8 --system-site-packages /databricks/python3

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect DBR 9.0
RUN /databricks/python3/bin/pip install \
  six==1.15.0 \
  # ensure minimum ipython version for Python autocomplete with jedi 0.17.x
  ipython==7.19.0 \
  numpy==1.19.2 \
  pandas==1.2.4 \
  pyarrow==4.0.0 \
  matplotlib==3.4.2 \
  jinja2==2.11.3

# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
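To put the image to work, build it and push it to a container registry your workspace can reach, then reference it when creating a cluster with Databricks Container Services enabled. The repository name below is a placeholder:

# 'myregistry/databricks-gdal' is a placeholder; use your own registry and repository
docker build -t myregistry/databricks-gdal:3.4.1 .
docker push myregistry/databricks-gdal:3.4.1

Once a cluster starts from this image, from osgeo import gdal should work in any attached notebook, with no Conda in sight.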
