Lessons learned from launching TubeStats: a completely serverless service

I recently launched TubeStats, a simple service for exporting comments from YouTube videos.

In this post, I’ll give a brief overview of the building process and the technology used to develop the web app.

TubeStats is completely serverless and uses the Serverless Framework under the hood to manage AWS resources. It runs on AWS Lambda and uses SQS and DynamoDB for job queues and data storage.

The setup is a little different from most other AWS Lambda tutorials, as I’m also serving HTML from Lambda handlers.

Here is what it looks like:

[Architecture diagram] cloudcraft.co FTW

To build the service running on Lambda, I turned to Flask, my go-to framework for building web applications. Flask can be used with the Serverless WSGI plugin, which translates Lambda requests for any WSGI-compatible web framework.
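To make the idea of "translating Lambda requests to WSGI" concrete, here is a rough sketch using only the standard library. It is not the plugin's actual code: the handler below is a simplified illustration that assumes an API Gateway proxy-style event (with `httpMethod`, `path`, `headers`, `queryStringParameters` and `body` keys), and `simple_app` is a hypothetical stand-in for the real Flask app. A Flask application object is itself a WSGI callable, so it would take the place of `simple_app`.

```python
import io
import sys
from urllib.parse import urlencode


def simple_app(environ, start_response):
    """A tiny WSGI app returning HTML (hypothetical stand-in for a Flask app)."""
    body = b"<h1>Hello from Lambda</h1>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def handler(event, context, app=simple_app):
    """Turn an API Gateway proxy-style event into a WSGI call (simplified)."""
    body = (event.get("body") or "").encode("utf-8")
    environ = {
        "REQUEST_METHOD": event.get("httpMethod", "GET"),
        "SCRIPT_NAME": "",
        "PATH_INFO": event.get("path", "/"),
        "QUERY_STRING": urlencode(event.get("queryStringParameters") or {}),
        "SERVER_NAME": "lambda",
        "SERVER_PORT": "443",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "https",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": sys.stderr,
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }
    # Request headers become HTTP_* keys; Content-Type and Content-Length
    # are the two exceptions that WSGI stores without the prefix.
    for name, value in (event.get("headers") or {}).items():
        key = name.upper().replace("-", "_")
        if key not in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            key = "HTTP_" + key
        environ[key] = value

    captured = {}

    def start_response(status, headers, exc_info=None):
        # Remember what the app reported so it can be returned to Lambda.
        captured["status"] = status
        captured["headers"] = headers

    response_body = b"".join(app(environ, start_response))
    return {
        "statusCode": int(captured["status"].split()[0]),
        "headers": dict(captured["headers"]),
        "body": response_body.decode("utf-8"),
    }
```

In the real service the plugin does this work (and handles many more cases, such as binary bodies), so the application code stays ordinary Flask.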

The website is also served over HTTPS through AWS Certificate Manager and the Serverless Domain plugin, which manages the subdomains. The process I followed is described here.

Handling long running jobs:

As we know, serverless has its own limitations, like limited disk space and running time. I’ll talk about disk space later, but getting around the limited running time was essential for running long tasks.

TubeStats can handle videos with up to 150,000 comments, which takes a lot more than 15 minutes (the current AWS Lambda running limit), so I built a simple helper utility that packages up the state in a zip file and restores it on the next Lambda run.

It is like hibernating the running job as it gets close to the 15-minute limit and then picking it up again when another Lambda is triggered. The same process repeats if the job spans more than two Lambda invocations.
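The hibernate-and-resume idea can be sketched roughly like this. It is a minimal illustration under assumptions, not the actual TubeStats helper: the function names, the shape of the state, the `fetch_page` callback and the one-minute safety margin are all hypothetical. The only AWS-specific call is `context.get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object. Storing the archive somewhere durable between invocations is left out.

```python
import json
import zipfile

# Assumption: stop this many milliseconds before the Lambda timeout.
SAFETY_MARGIN_MS = 60_000


def hibernate(state, archive_path):
    """Write the job state into a zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("state.json", json.dumps(state))


def restore(archive_path):
    """Load the job state back, or start fresh if no archive exists yet."""
    try:
        with zipfile.ZipFile(archive_path) as zf:
            return json.loads(zf.read("state.json"))
    except FileNotFoundError:
        return {"next_page": 0, "comments": []}


def run_job(context, archive_path, fetch_page):
    """Process pages until done or until the time budget is nearly spent.

    Returns True when the job finished, False when it was hibernated
    and needs another invocation to continue.
    """
    state = restore(archive_path)
    while context.get_remaining_time_in_millis() > SAFETY_MARGIN_MS:
        page = fetch_page(state["next_page"])
        if page is None:  # no more pages: the job is complete
            hibernate(state, archive_path)
            return True
        state["comments"].extend(page)
        state["next_page"] += 1
    # Running out of time: save progress for the next invocation.
    hibernate(state, archive_path)
    return False
```

Each invocation calls `run_job` with the same archive; when it returns False, another Lambda run is triggered and continues from the saved page.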

Disk space issue:

The other issue came up when I tried to download all 5.3 million comments from the Gangnam Style YouTube video. At some point, after downloading thousands and thousands of comments, the archive file grew bigger than 512MB, which is the limit of the disk space provided by the AWS Lambda container.

I haven’t solved this problem yet, but I have a couple of ideas to get around it.

Payment Processing:

Currently, payments are processed using Paddle, which is quite easy to set up, and their support was very responsive throughout the onboarding process.

However, the platform still has some limitations, which I get around by processing orders and generating the different download links within the service itself.

Here is what happens when someone pays using Paddle forms.

Later, at some point, the user clicks on the download link in their email, which I handle in a Lambda that generates a unique download link if the job is finished or displays a page with the progress of the job.
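The decision made by that handler can be sketched as follows. This is only an illustration under assumptions: the `job` dictionary, its keys, the `/download/` path and the use of a random token are all hypothetical and are not taken from the actual TubeStats code.

```python
import secrets


def handle_download(job):
    """Decide what to show when the user opens the link from their email.

    `job` is a hypothetical dict with 'status', 'done' and 'total' keys.
    """
    if job["status"] == "finished":
        # A random, hard-to-guess token makes each generated link unique.
        token = secrets.token_urlsafe(16)
        return {"page": "download", "link": f"/download/{token}"}

    # Job still running: report progress as a whole-number percentage.
    percent = 0
    if job["total"]:
        percent = int(100 * job["done"] / job["total"])
    return {"page": "progress", "percent": percent}
```

A real handler would render these results as HTML and would also need to record the token so the link can be validated later.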

Why not use … ?

Fargate:

This may be the solution for processing a lot more comments, so I may look into using it alongside Lambda. It could be triggered once all the comments are downloaded and the only thing left is to ‘reduce’ the files into a package and upload it to S3.

The VPC setup and other infrastructure setup put me off initially, but this may be a purely serverless solution to get around some of the limitations of AWS Lambda.

AWS Step Functions:

The whole job processing and state management that I implemented with AWS Lambda and DynamoDB tables can be replaced with Step Functions (I think). The only reasons for not using it at the moment are my unfamiliarity with it and wanting to get something launched as soon as possible before optimising further.

Next Steps:

One of the main reasons for putting it all on serverless is so that I can leave it on auto-pilot without having to worry about infrastructure and related concerns.

So the next steps all depend on customer demand and requests. I’ll probably look into Fargate and Step Functions to establish a pattern that can be reused for other product ideas in the pipeline.

Acknowledgements:

AWS Cloud
Serverless Framework
Serverless WSGI Plugin
Serverless Domain Plugin
Paddle