Machine Learning on AWS Lambda

AWS Icons. The “S” stands for “Spot The Difference.”

AWS Lambda is a service that lets you execute a single function on the AWS cloud and pay only for the actual execution time. This is tremendously helpful for computation-intensive tasks like cropping and converting images uploaded by users: you can trigger an AWS Lambda function when a file lands in S3, and it will dutifully perform its task independently of the rest of your architecture. It doesn't matter whether you're converting ten thousand pictures an hour or ten every month, there is absolutely no effort in scaling, and no architectural differences.

That makes Lambda incredibly appealing for a lot of distributed computation tasks. However, it can be a bit of a pain to set up: you have to bundle all of your dependencies into a single zip file along with your code. If you've ever tried that for a complex machine learning environment involving numpy and sklearn, you will have already experienced the torment and misery this brings upon you.

Here is how to do it while maintaining your sanity.

First, create an EC2 instance using Amazon Linux and log in to it. That's where we'll install all of our dependencies.

Remember Fortran? Yeah, we need it.

sudo yum -y update
sudo yum -y upgrade
sudo yum -y groupinstall "Development Tools"
sudo yum -y install blas
sudo yum -y install lapack
sudo yum -y install atlas-sse3-devel
sudo yum -y install python27-devel python27-pip gcc

Scikit-learn won't compile with less than 1 GB of RAM. If you're using a free micro instance, create a swap file of roughly 1.5 GB (1,500,000 blocks of 1024 bytes):

sudo dd if=/dev/zero of=/swapfile bs=1024 count=1500000
sudo mkswap /swapfile
sudo chmod 0600 /swapfile
sudo swapon /swapfile

Okay, let’s create a virtual environment and install everything we need:

virtualenv ~/stack
source ~/stack/bin/activate
sudo $VIRTUAL_ENV/bin/pip2.7 install numpy
sudo $VIRTUAL_ENV/bin/pip2.7 install scipy
sudo $VIRTUAL_ENV/bin/pip2.7 install pandas
sudo $VIRTUAL_ENV/bin/pip2.7 install sklearn
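
At this point it's worth a quick sanity check that numpy compiled correctly and its linear algebra, which rides on the BLAS/LAPACK libraries we just installed, actually works. A minimal sketch; run it with the virtualenv's interpreter:

```python
# check_numpy.py -- sanity-check numpy's BLAS/LAPACK-backed linear algebra.
# Run with the virtualenv's python: $VIRTUAL_ENV/bin/python check_numpy.py
import numpy as np

a = np.array([[2.0, 0.0],
              [1.0, 3.0]])
a_inv = np.linalg.inv(a)  # exercises LAPACK under the hood

# A matrix multiplied by its inverse should be (numerically) the identity
assert np.allclose(a.dot(a_inv), np.eye(2))
print("numpy linear algebra OK")
```

If this fails with an import error or a missing shared library, fix the build now, before zipping anything up.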

Another challenge is that Lambda limits the zipped deployment package, including all dependencies, to 50 MB. Let's use some dirty tricks to bring our bundle size down. We can use strip to discard symbols we probably won't need from the compiled shared libraries, in both lib and lib64:

find $VIRTUAL_ENV/lib*/python2.7/site-packages/ -name "*.so" | xargs strip

Finally, let's bundle up all of our modules into a zip archive in ~/ (bundle.zip here is just a placeholder name, use whatever you like):

pushd $VIRTUAL_ENV/lib/python2.7/site-packages/
zip -r -9 -q ~/bundle.zip *
pushd $VIRTUAL_ENV/lib64/python2.7/site-packages/
zip -r -9 -q ~/bundle.zip *

Lambda will be looking for shared libraries in /var/task/lib, so let’s put everything we need into a lib folder:

mkdir -p lib
# exact sonames may differ on your instance; verify with: ls /usr/lib64/atlas-sse3/
cp /usr/lib64/atlas-sse3/liblapack.so.3 lib/.
cp /usr/lib64/atlas-sse3/libf77blas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libptf77blas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libcblas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libptcblas.so.3 lib/.
cp /usr/lib64/atlas-sse3/libclapack.so.3 lib/.
cp /usr/lib64/atlas-sse3/libatlas.so.3 lib/.
cp /usr/lib64/libgfortran.so.3 lib/.
cp /usr/lib64/libquadmath.so.0 lib/.

And add this to our bundle (again using the placeholder name bundle.zip):

zip -r -9 -q ~/bundle.zip lib/

Great! Back on your local machine, get the zip file from EC2 (substitute your own key file and the instance's public IP):

scp -i pemfile.pem ec2-user@<instance-ip>:~/bundle.zip .

Now you can add your handler, lambda_function.py, to the zip. If you have a more complex program, I recommend putting it into a module that you import in lambda_function.py. Let's remove all .pyc files first, though:

find my_module/ -name '*.pyc' -delete
zip -9 bundle.zip lambda_function.py
zip -9 -r bundle.zip my_module/
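
For reference, a minimal handler file might look like this. This is only a sketch: the event shape is what an S3 trigger delivers, the file name lambda_function.py matches the function name used below, and the processing step is left as a placeholder for your actual ML workload.

```python
# lambda_function.py -- minimal handler sketch for an S3-triggered function.
# Illustrative only: swap the placeholder step for your actual ML workload.
import json


def lambda_handler(event, context):
    # An S3 trigger delivers the affected object(s) under event["Records"]
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # ... load your model here and process s3://<bucket>/<key> ...

    return {
        "statusCode": 200,
        "body": json.dumps({"bucket": bucket, "key": key}),
    }
```

Set the function's handler to lambda_function.lambda_handler so Lambda finds this entry point.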

Because the bundle is so big, we need to upload it to an S3 bucket before updating Lambda:

aws s3 cp bundle.zip s3://my_bucket/
aws lambda update-function-code --region us-east-1 --function-name lambda_function --s3-bucket my_bucket --s3-key bundle.zip

This should take care of most obstacles you are going to run into. Did this work for you? Let me know in the comments.

Ex-neuroscientist, data wrangler, designer, co-founder of AI consulting firm
