Machine Learning on AWS Lambda

AWS Icons. The “S” stands for “Spot The Difference.”

AWS Lambda is a service that lets you execute a single function on the AWS cloud and pay only for the actual execution time. This is tremendously helpful for computation-intensive tasks like cropping and converting images uploaded by users: you can trigger an AWS Lambda function when a file lands in S3, and it will dutifully perform its task independently of the rest of your architecture. It doesn't matter whether you're converting ten thousand pictures an hour or ten every month, there is absolutely no effort in scaling, and no architectural differences.

That makes Lambda incredibly appealing for a lot of distributed computation tasks. However, it can be a bit of a pain to set up: you have to bundle all of your dependencies into a single zip file along with your code. If you've ever tried that for a complex machine learning environment involving numpy and sklearn, you will have already experienced the torment and misery this brings upon you.

Here is how to do it while maintaining your sanity.

First, create an EC2 instance using Amazon Linux and log in to it. That's where we'll install all of our dependencies.

Remember Fortran? Yeah, we need it.

sudo yum -y update
sudo yum -y upgrade
sudo yum -y groupinstall "Development Tools"
sudo yum -y install blas
sudo yum -y install lapack
sudo yum -y install atlas-sse3-devel
sudo yum -y install python27-devel python27-pip gcc

Scikit-learn won't compile with less than 1 GB of RAM. If you're using a free micro instance, create a swap file of roughly 1.5 GB (1,500,000 blocks of 1024 bytes):

sudo dd if=/dev/zero of=/swapfile bs=1024 count=1500000
sudo mkswap /swapfile
sudo chmod 0600 /swapfile
sudo swapon /swapfile

Okay, let’s create a virtual environment and install everything we need:

virtualenv ~/stack
source ~/stack/bin/activate
sudo $VIRTUAL_ENV/bin/pip2.7 install numpy
sudo $VIRTUAL_ENV/bin/pip2.7 install scipy
sudo $VIRTUAL_ENV/bin/pip2.7 install pandas
sudo $VIRTUAL_ENV/bin/pip2.7 install sklearn
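
At this point it's worth a quick sanity check that numpy compiled correctly and its linear algebra, which rides on the BLAS/LAPACK libraries we just installed, actually works. A minimal sketch; run it with the virtualenv's interpreter:

```python
# check_numpy.py -- sanity-check numpy's BLAS/LAPACK-backed linear algebra.
# Run with the virtualenv's python: $VIRTUAL_ENV/bin/python check_numpy.py
import numpy as np

a = np.array([[2.0, 0.0],
              [1.0, 3.0]])
a_inv = np.linalg.inv(a)  # exercises LAPACK under the hood

# A matrix multiplied by its inverse should be (numerically) the identity
assert np.allclose(a.dot(a_inv), np.eye(2))
print("numpy linear algebra OK")
```

If this fails with an import error or a missing shared library, fix the build now, before zipping anything up.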

Another challenge is that Lambda limits the zipped deployment package, including all dependencies, to 50 MB. Let's use some dirty tricks to bring our bundle size down. We can use strip to discard symbols we probably won't need from the compiled shared libraries, in both lib and lib64:

find $VIRTUAL_ENV/lib*/python2.7/site-packages/ -name "*.so" | xargs strip

Finally, let's bundle up all of our modules into a zip archive in ~/ (bundle.zip here is just a placeholder name, use whatever you like):

pushd $VIRTUAL_ENV/lib/python2.7/site-packages/
zip -r -9 -q ~/bundle.zip *
pushd $VIRTUAL_ENV/lib64/python2.7/site-packages/
zip -r -9 -q ~/bundle.zip *

Lambda will be looking for shared libraries in /var/task/lib, so let’s put everything we need into a lib folder:

mkdir -p lib
# exact sonames may differ on your instance; verify with: ls /usr/lib64/atlas-sse3/
cp /usr/lib64/atlas-sse3/liblapack.so.3 lib/.
cp /usr/lib64/atlas-sse3/libf77blas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libptf77blas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libcblas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libptcblas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libclapack.so.3 lib/.
cp /usr/lib64/atlas-sse3/libatlas.so.3 lib/.
cp /usr/lib64/libgfortran.so.3 lib/.
cp /usr/lib64/libquadmath.so.0 lib/.

And add this to our bundle (again using the placeholder name bundle.zip):

zip -r -9 -q ~/bundle.zip lib/

Great! Back on your local machine, get the zip file from EC2 (substitute your own key file and the instance's public IP):

scp -i pemfile.pem ec2-user@<instance-ip>:~/bundle.zip .

Now you can add your handler, lambda_function.py, to the zip. If you have a more complex program, I recommend putting it into a module that you import in lambda_function.py. Let's remove all .pyc files first, though:

find my_module/ -name '*.pyc' -delete
zip -9 bundle.zip lambda_function.py
zip -9 -r bundle.zip my_module/
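
For reference, a minimal handler file might look like this. This is only a sketch: the event shape is what an S3 trigger delivers, the file name lambda_function.py matches the function name used below, and the processing step is left as a placeholder for your actual ML workload.

```python
# lambda_function.py -- minimal handler sketch for an S3-triggered function.
# Illustrative only: swap the placeholder step for your actual ML workload.
import json


def lambda_handler(event, context):
    # An S3 trigger delivers the affected object(s) under event["Records"]
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # ... load your model here and process s3://<bucket>/<key> ...

    return {
        "statusCode": 200,
        "body": json.dumps({"bucket": bucket, "key": key}),
    }
```

Set the function's handler to lambda_function.lambda_handler so Lambda finds this entry point.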

Because the bundle is so big, we need to upload it to an S3 bucket before updating Lambda:

aws s3 cp bundle.zip s3://my_bucket/
aws lambda update-function-code --region us-east-1 --function-name lambda_function --s3-bucket my_bucket --s3-key bundle.zip

This should take care of most obstacles you are going to run into. Did this work for you? Let me know in the comments.

Ex-neuroscientist, data wrangler, designer, co-founder of AI consulting firm
