Compiling a Slim Version of PyArrow for Lambda

Steve Fraser
4 min read · Apr 28, 2020


Overview

As a practice, we prefer Lambda functions over cron jobs for moving data around on a schedule: they reduce compute cost and keep maintenance simple.

Our teams like to use Lambda layers for reusable, easy-to-consume packages.

When maintaining Lambda layers, I find it’s important to be thoughtful about which libraries are brought in and how large they are.

We start to have trouble with certain larger libraries that are particularly prevalent in the data science space.

In our case, we wanted to convert JSON data into Parquet and store it for easy analysis at a later date.
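For context, the conversion itself is only a few lines of Python. Here’s a minimal sketch, assuming JSON records that fit in memory; the data and file path are illustrative:

import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Load JSON records into a DataFrame, then write them out as Parquet.
records = json.loads('[{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]')
table = pa.Table.from_pandas(pd.DataFrame(records))
pq.write_table(table, "/tmp/output.parquet", compression="snappy")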

The team chose PyArrow as the library to do the conversion.

As it turns out, PyArrow is an exceptionally big package.

For PyArrow, we only cared about the conversion, so we could strip out a few of the larger features by building a custom wheel.

Journey

I always use Docker images to compile and install all Python and C libraries for Lambda.

I started by following my normal process: running the Lambda python3.7 container image with a local folder mounted.

cd /Users/stevenfraser/Documents/Personal/side_project/medium-art
mkdir python
docker run -it \
-v /Users/stevenfraser/Documents/Personal/side_project/medium-art/:/opt \
lambci/lambda:build-python3.7 \
bash

I proceeded to install all of the required libraries through pip.

cd /opt/
pip3 install --target python/ pandas
pip3 install --target python/ pyarrow

Finally, let’s zip it up and upload it to Lambda as a new layer.

exit
zip -r9 pyarrow.zip python
aws s3 cp pyarrow.zip s3://bucket01/libraries/
aws lambda publish-layer-version \
--layer-name pyarrow \
--description "Pyarrow with Pandas" \
--license-info "MIT" \
--content S3Bucket=bucket01,S3Key=libraries/pyarrow.zip \
--compatible-runtimes python3.7
An error occurred (InvalidParameterValueException) when calling the PublishLayerVersion operation: Unzipped size must be smaller than 262144000 bytes

Oh no, it looks like PyArrow with Pandas hits the layer size limit.

How big is it?

du -h
345M    ./python

The local folder is 345 MB, which is indeed bigger than Lambda’s 262,144,000-byte (roughly 262 MB) unzipped layer limit.

How can we shrink this down?

I started by removing the excess test files, but that only slimmed it down slightly.
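For reference, here’s an illustrative cleanup pass in Python (run from the folder containing python/); it assumes the bundled test suites and bytecode caches are safe to drop for your workload:

import pathlib
import shutil

# Delete bundled test suites and bytecode caches from the layer directory.
root = pathlib.Path("python")
for path in list(root.rglob("tests")) + list(root.rglob("__pycache__")):
    if path.is_dir():
        shutil.rmtree(path)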

So, I decided to create a wheel with a slimmed-down feature set.

I created a Dockerfile, starting from the lambci/lambda:build-python3.7 image, that builds the Arrow C++ library and then builds PyArrow with only the features we need.

FROM lambci/lambda:build-python3.7

#REMOVES THE OLD VERSION OF CMAKE AND INSTALLS 3.10
RUN yum remove cmake -y \
&& yum install wget -y \
&& cd /tmp/ \
&& wget https://cmake.org/files/v3.10/cmake-3.10.0.tar.gz \
&& tar -xvzf cmake-3.10.0.tar.gz \
&& cd cmake-3.10.0 \
&& ./bootstrap \
&& make \
&& make install
#INSTALLS Python dependencies
RUN pip3 install --no-cache-dir \
six \
cython \
numpy
#INSTALLS CURL
RUN yum install curl -y

ARG ARROW_VERSION=0.17.0
ARG ARROW_BUILD_TYPE=release
ENV ARROW_HOME=/arrow/dist/

#MAKES AND INSTALL THE ARROW C LIB
RUN mkdir /arrow \
&& curl -o /tmp/apache-arrow.tar.gz -SL https://github.com/apache/arrow/archive/apache-arrow-${ARROW_VERSION}.tar.gz \
&& tar -xvf /tmp/apache-arrow.tar.gz -C /arrow --strip-components 1 \
&& mkdir /arrow/dist \
&& export LD_LIBRARY_PATH=/arrow/dist/lib:$LD_LIBRARY_PATH \
&& mkdir -p /arrow/cpp/build \
&& cd /arrow/cpp/build \
&& cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on \
-DARROW_PYTHON=on \
-DARROW_PLASMA=on \
-DARROW_WITH_SNAPPY=on \
-DARROW_BUILD_TESTS=OFF \
.. \
&& make \
&& make install
#CREATES THE WHEEL
RUN yum install pkgconfig -y \
&& cd /arrow/python/ \
&& export PYARROW_WITH_PARQUET=1 \
&& python setup.py build_ext --build-type=release --bundle-arrow-cpp bdist_wheel

With the above Dockerfile, I built a custom slimmed-down wheel and installed it through pip.

pip3 install --target python pyarrow-0.17.0-cp37-cp37m-linux_x86_64.whl
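Before measuring, a quick smoke test (my own sanity check, not part of the build; run with PYTHONPATH=python so the --target install is importable) confirms the slim wheel still exposes the Parquet bindings we need:

import pyarrow as pa
import pyarrow.parquet as pq

# Round-trip a tiny table to prove the Parquet bindings survived the trim.
table = pa.table({"id": [1, 2, 3]})
pq.write_table(table, "/tmp/smoke.parquet")
assert pq.read_table("/tmp/smoke.parquet").equals(table)
print(pa.__version__)  # expect 0.17.0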

How big is it with the new PyArrow build and Pandas?

exit
du -h
204M ./python

At 204 MB, we are now under the layer limit.

Build and Install a Custom PyArrow Wheel

Start by cloning my pyarrow-slim repo.

git clone https://github.com/steve-fraser/pyarrow-slim.git

Modify the build parameters in Dockerfile.amazon_linux as needed.

Then, build a new container image.

cd pyarrow-slim/
docker build -f Dockerfile.amazon_linux .
Successfully built 044f876a5258

Start the built container and install the wheel.

docker tag 044f876a5258 pyarrow-slim-lambda
docker run -it \
-v /Users/stevenfraser/Documents/Personal/side_project/medium-art/:/opt \
pyarrow-slim-lambda \
bash
cd /opt/
rm -rf python/*
cp /arrow/python/dist/pyarrow-0.17.0-cp37-cp37m-linux_x86_64.whl /opt/
pip3 install --target python pyarrow-0.17.0-cp37-cp37m-linux_x86_64.whl

Install the rest of the required libraries.

pip3 install  --target python/ pandas

Now, let’s package it up and upload it to Lambda.

rm pyarrow.zip
zip -r9 pyarrow.zip python
aws s3 cp pyarrow.zip s3://bucket01/libraries/
aws lambda publish-layer-version \
--layer-name pyarrow \
--description "Pyarrow with Pandas" \
--license-info "MIT" \
--content S3Bucket=bucket01,S3Key=libraries/pyarrow.zip \
--compatible-runtimes python3.7
{
    "LayerVersionArn": "",
    "Description": "Pyarrow with Pandas",
    "CreatedDate": "2020-04-28T20:45:58.634+0000",
    "Content": {
        "CodeSize": 55185418
    },
    "Version": 3,
    "CompatibleRuntimes": [
        "python3.7"
    ],
    "LicenseInfo": "MIT"
}
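With the layer published and attached to a function, the handler can import PyArrow directly. Here’s a hypothetical handler sketch; the event shape, bucket, and key are placeholders:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def handler(event, context):
    # Convert the incoming JSON records to Parquet in /tmp, then upload.
    df = pd.DataFrame(event["records"])
    pq.write_table(pa.Table.from_pandas(df), "/tmp/batch.parquet")
    s3.upload_file("/tmp/batch.parquet", "bucket01", "data/batch.parquet")
    return {"rows": len(df)}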
