Evaluating TensorFlow Models in AWS Lambda
Our journey to classify video frames as quickly as possible — with serverless parallelisation
At the BBC, Artificial Intelligence and Machine Learning are currently being explored heavily by pretty much every product team. Why? For the answer, I'll point you at this blog post, in which Ali, our Head of Emerging Technology, goes into a bit more detail about the why, covering more than just BBC iPlayer.
While exploration is very important, in iPlayer we've already gone a step further: a couple of months ago we trained our first machine learning model using Google's TensorFlow. Its job is to classify video frames by certain characteristics and give us back a score. This has been running in production for a while now. And we're iPlayer: we have a lot of videos, and therefore a lot of video frames to process and classify.
How many, you ask? Well, most of our catalogue is available for 30 days. Every time we broadcast an episode that isn't currently available, it becomes 'available', and thousands of episodes of iPlayer content become available each month.
Every time an episode becomes 'available', we need between 80 and 200 video frames classified, ideally as quickly as possible.
How? The obvious step would be to look at Google, who make TensorFlow, and their Cloud Machine Learning Engine service. And while we're experimenting with Google Cloud Platform (and yes, some BBC teams are looking at Azure), AWS is our production platform, so we had to work with that as a constraint.
I spoiled our solution in the title of this blog post: AWS Lambda. In theory, we should be able to invoke a classification function 150 times in parallel, pay only for that compute time, and get results back pretty quickly.
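The fan-out idea can be sketched with Python's thread pool. This is an illustration of the pattern only: classify_frame here is a stand-in (in production it would be a boto3 Lambda invoke or an HTTP call), and the payload shape is invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_frame(frame_id):
    # Stand-in for the real call: in production this would invoke the
    # Lambda function (via boto3 or API Gateway) with the frame data.
    return {"frame": frame_id, "score": 0.5}

def classify_episode(frame_ids, max_workers=150):
    # One worker per in-flight invocation, capped at our parallelism limit.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_frame, frame_ids))

results = classify_episode(range(200), max_workers=16)
```

Because pool.map preserves input order, the results line up with the frames even though the invocations complete out of order.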
In practice — it works! However, it was a bit of a journey:
AWS does place certain limitations on Lambda functions. Of interest here is the Lambda function deployment package size limit, a neat 50 MB (N.B. when uploading the function through the console, the limit is 10 MB).
Let’s have a look at what we need to provide to classify our video frames:
- A trained model necessary for classification takes up space — in our case around 85 MB.
- The TensorFlow library itself to run a video frame against the model takes up space — uncompressed we’re looking at 280 MB for that and the Python dependencies it requires that don’t come with Amazon Linux (the environment in which Lambda functions get executed).
- Some code to process the data, and return something — this weighs in at around 15 kB in this case.
Doing some maths, this adds up to roughly 365 MB. Just a bit more than the 50 MB AWS allows us to upload.
How could we get around that? Compression, I hear you shout. We use TensorFlow 0.11; the download size for that alone in a Lambda-like environment is 39.8 MB. Add all the dependencies to that and we're way over our limit.
Now, you could download and install the dependencies and TensorFlow at runtime, but over tens of thousands of invocations a day that adds up in compute time and bandwidth. It's probably not a secret that iPlayer doesn't quite have the $500M R&D budget of a certain competitor to even attempt justifying that. Thankfully, there are some workarounds, which we'll get to in due time.
How about the model? Can we compress that? Well, there's some research showing it can be done for Caffe models; for TensorFlow, nothing production-ready. And regardless, as the model gets trained further or becomes more complex, its size will increase, so this wouldn't help us much in the long term.
Let's begin with the hackiest part: --exclude. A TensorFlow 0.11 installation bundles a lot we don't need at inference time: TensorBoard for visualisation, Eigen headers, and libraries for interfacing with tools such as Numpy and ffmpeg. Additionally, when you run Python, the interpreter saves the compiled byte code alongside the source files. None of that needs to go into our zip, so we leave it all out when packaging. To be precise, we exclude *.DS_Store, *.pyc, /external/*, /tensorflow/contrib/* and /tensorflow/include/unsupported/*, which saves us 83 MB or so. We could probably save a lot more on that front by excluding further unnecessary libraries, or by using a different build of TensorFlow just for evaluating models, but it turns out this is enough for us, alongside the other tweaks.
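In our build this is a flag on the zip command, but the same exclusion step can be sketched with Python's zipfile module. The patterns below mirror the ones listed above (written relative to the build directory, so without the leading slash); the directory layout is illustrative.

```python
import fnmatch
import os
import zipfile

# Patterns we never want inside the deployment package.
EXCLUDE = ['*.DS_Store', '*.pyc', 'external/*',
           'tensorflow/contrib/*', 'tensorflow/include/unsupported/*']

def excluded(relpath):
    # A file is skipped if its path (relative to the build dir)
    # matches any exclusion pattern.
    return any(fnmatch.fnmatch(relpath, pat) for pat in EXCLUDE)

def build_package(src_dir, zip_path):
    # Walk the build directory and zip up everything that survives
    # the exclusion filter.
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir)
                if not excluded(rel):
                    zf.write(full, rel)
```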
There is only one more thing to remove before our zipped package weighs in at under 50 MB and is ready to deploy: the model.
The storage: S3 and /tmp
We store the model in S3. Sounds simple, right? The benefits are clear: we decouple the model from the function, can train and update it separately, and don't need to include an extra 85 MB of data in the Lambda.
On the other hand, we have to pay for S3: storage of the model, and requests from the Lambda. It also takes extra compute time that we have to wait (and pay) for, which isn't ideal. But if we go back to the Lambda limits I mentioned earlier, they state that 512 MB of ephemeral disk capacity ("/tmp" space) is available. And that's where we are going to keep our model:
Because we run 200 or so invocations in parallel, we only need to download the model once and save it there; the other invocations should, in theory, have access to it once it's saved. So one of the first things the function does when invoked is check whether the model exists at that location, and download it using Boto3 (the AWS SDK for Python) if it doesn't:
import os
import boto3

if not os.path.isfile('/tmp/graph.pb'):
    s3 = boto3.client('s3')
    s3.download_file('modelbucket', 'graph.pb', '/tmp/graph.pb')
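One way to harden that check against concurrent invocations racing on the same container is to download to a unique temporary file and then rename it into place; os.rename is atomic within a filesystem, so no invocation ever reads a half-written model. This is a sketch of the pattern rather than our deployed code: ensure_model takes the download step as a callable (with Boto3 it would be something like lambda p: s3.download_file('modelbucket', 'graph.pb', p)).

```python
import os
import tempfile

MODEL_PATH = '/tmp/graph.pb'

def ensure_model(download, model_path=MODEL_PATH):
    # Fast path: a previous (or concurrent) invocation already fetched it.
    if os.path.isfile(model_path):
        return model_path
    # Download to a unique temp file in the same directory, then
    # atomically rename into place so readers never see a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(model_path))
    os.close(fd)
    try:
        download(tmp_path)
        os.rename(tmp_path, model_path)
    finally:
        # If the download failed before the rename, clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return model_path
```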
Dandy, right? Except there's one more item on my list of solutions above:
The warm-up: CloudWatch Scheduled Events
Some of you might have questions about the above: if we're kicking off 200 invocations in parallel, wouldn't they all have to download the model, since at that point it won't exist? And how long does /tmp persist anyway? We've done some digging.
Exhibit 1: Understanding Container Reuse in AWS Lambda. This blog post from 2014, when Lambda was still in preview, mentions the following:
Files that you wrote to /tmp last time around will still be there if the sandbox gets reused.
Exhibit 2: AWS Lambda: How It Works:
Each container provides some disk space in the /tmp directory. The directory content remains when the container is frozen, providing transient cache that can be used for multiple invocations. You can add extra code to check if the cache has the data that you stored. For disk space size, see AWS Lambda Limits.
[…] Do not assume that AWS Lambda always reuses the container because AWS Lambda may choose not to reuse the container. Depending on various other factors, AWS Lambda may simply create a new container instead of reusing an existing container.
So, in theory, it should just stay there; except it might not. So we have set up a scheduled event to keep the container 'warm': we invoke the function every couple of minutes, not to do any processing, but to make sure we've got the model ready. This accounts for a very small percentage of invocations, but helps us mitigate race conditions in the model download. (We also artificially limit how many Lambdas we invoke in parallel as an additional measure.)
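Inside the function, these keep-warm pings can be separated from real work by a marker in the event. A minimal sketch of the idea; the 'warmup' key and response bodies are our illustration, not a Lambda convention:

```python
def handler(event, context=None):
    # Keep-warm pings from the CloudWatch scheduled event: make sure the
    # model is cached in /tmp, then return without touching TensorFlow.
    if event.get('warmup'):
        # ensure_model_is_cached()  # e.g. download from S3 if /tmp is empty
        return {'statusCode': 200, 'body': 'warm'}
    # Real request path: decode the frame, run it through the model, etc.
    return {'statusCode': 200, 'body': 'classified'}
```

The scheduled event simply carries {"warmup": true} as its payload, so the real classification path never sees it.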
So how big is our function now? The zip weighs in at 40 MB at the moment; it turns out Python source files compress quite nicely.
And it should stay at that level for this particular function, as we don't need to update the Lambda to train the model further.
This brings us to the final question: How do we invoke the function? — API Gateway is the answer to that one.
The access: API Gateway
We have an API Gateway POST endpoint that takes a base64-encoded JPEG of the video frame and returns JSON about 10 seconds later. This also allows quick and effective testing of changes to the model without having to write any code to invoke the function: base64 is all that's needed to evaluate a frame.
API Gateway does now support binary data, so we could scrap the need for base64, as it does the base64 encoding for you; however, when we first deployed this, that feature was not available via CloudFormation.
All in all, we are quite happy with how this is running in production at the moment in terms of reliability and efficiency, but we are looking forward to improving the pipeline and putting it to use for many more applications to come.
We are gathering usage data for this classifier, and with help from editorial staff we are analysing the accuracy of the model; the training of the model, however, is currently a manual process which we hope to automate in the future.
We are also looking at other metadata that might be available to us elsewhere in the broadcast chain to enrich our content, and at how we might use it to improve the audience experience. Stay tuned for more on what we are actually classifying, if the gif above doesn't give away too much.
If you’re interested and excited by the technical challenges we’re facing, we’re always looking for driven engineers to join us.
You can see all our job listings for BBC iPlayer here by searching for iPlayer.