Kinesis -> Lambda -> Firebase -> React

A Real Time Analytics Tracer Bullet in a Day

Note: This post was written in 2014. Great read if you’re interested, but I am not sure if the content is still relevant or accurate!

If you want to migrate an existing batch analytics process to real time, I recommend starting with a “tracer bullet,” pushing through to a working, end-to-end solution as quickly as possible before fleshing things out. To that end, I just put something together in less than a day using Kinesis, Lambda, Firebase, and React, and I will show you how so you can do the same.

tl;dr Watch this video and see the view count update in real time here.

Our Legacy Batch Process

At Adventr, we help creatives build interactive videos like this John Legend piano lesson. Then we help them track performance and user interaction. Unfortunately, our current (soon legacy) analytics are compiled with a traditional batch process.

The videos we create include tracking code that streams events to HTTP endpoints backed by a simple Flask app (just 67 lines of Python). The Flask app receives the events and stores them in Dynamo. A nightly cron job then extracts the day’s records to S3 and kicks off a Hadoop job using Elastic MapReduce. The job writes its results to S3 and then loads them into our production RDS instance, where our app serves them as required.


Kinesis for the Event Stream

I wanted a proof of concept within a day or two, so Kafka would have to wait. Fortunately, Amazon’s Kinesis takes the same approach as Kafka, treating the event stream as a log to which multiple consumers can subscribe. This means if consumers go down, they can spin back up and continue where they left off without dropping data.

Since our Flask app uses Boto to push data to Dynamo, it was literally a matter of 3 lines of code to push the same events to Kinesis in parallel:

Just import, connect, and put. 3 lines and you’re up and running with Kinesis.
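As a rough sketch of those three steps, assuming the legacy `boto` Kinesis client, whose `put_record` takes a stream name, a data blob, and a partition key: the stream name `video-events`, the `video_id` field, and the `push_event` helper are placeholders for illustration, not names from our Flask app. The client is passed in so the function can be exercised without AWS credentials.

```python
import json


def push_event(kinesis, stream_name, event):
    """Serialize one tracking event and put it on a Kinesis stream.

    `kinesis` is any object exposing put_record(stream_name, data,
    partition_key), for example the client returned by
    boto.kinesis.connect_to_region("us-east-1") (assumed setup).
    """
    data = json.dumps(event, sort_keys=True)
    # Hypothetical choice: partition by video so that one video's events
    # land on the same shard and keep their order.
    partition_key = str(event.get("video_id", "unknown"))
    return kinesis.put_record(stream_name, data, partition_key)


class FakeKinesis:
    """In-memory stand-in used only to demonstrate the call shape."""

    def __init__(self):
        self.records = []

    def put_record(self, stream_name, data, partition_key):
        self.records.append((stream_name, data, partition_key))
        return {"SequenceNumber": str(len(self.records))}
```

In the Flask view, the call would sit right next to the existing Dynamo write, something like `push_event(kinesis, "video-events", event)`.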

Lambda for Reactive Task Processing

If Kinesis is Amazon’s answer to Kafka, Lambda is their answer to Storm or Samza. The service allows you to write simple tasks that respond to events in the Amazon stack (including Kinesis events). If you are a fan of micro-services, Lambda takes the idea a step further: you simply write standalone functions.

Writing a Lambda function doesn’t feel very different from writing Resque jobs in Ruby, except there is nothing to enqueue. Any number of Lambda functions can subscribe to any number of streams, the service scales automatically, and you just pay for what you use. Even better, it works with Node. Score.

Here’s a handler for our video events:

Lambda just needs a handler function exported, but you can use npm modules by including them in a zip upload.

For now I had to hard-code the Firebase URL (which is not sensitive like a password), since you cannot set environment variables in Lambda as you can with Heroku or Elastic Beanstalk. I have a hunch I might be able to use dotenv, but that’s for next week. The environment variables question needs answering, though. For the next iteration I need to work with auth tokens.

Update: See this article for how we eventually configured environment variables for Lambda functions.

In any case, the `handler` function consumes the Kinesis event, extracting the required data and pushing it to Firebase. Onward.
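Our handler ran on Node; the sketch below shows the same decode-and-forward shape in Python, to stay consistent with the Flask side. It relies on the fact that Lambda hands a Kinesis batch to the function as `Records`, each carrying a base64-encoded payload under `kinesis.data`. The `make_handler` and `push` names are hypothetical, and `push` stands in for whatever writes to the serving layer.

```python
import base64
import json


def decode_records(event):
    """Yield the JSON payload of each Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        # Kinesis delivers each payload base64-encoded under kinesis.data.
        raw = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(raw.decode("utf-8"))


def make_handler(push):
    """Build a Lambda-style handler that forwards each payload to `push`.

    `push` is a hypothetical callable, injected so that the handler
    carries no hard-coded URL.
    """
    def handler(event, context):
        count = 0
        for payload in decode_records(event):
            push(payload)
            count += 1
        return {"processed": count}

    return handler
```

Injecting `push` also makes the function easy to exercise locally with a hand-built event before zipping anything up.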

Firebase for the Serving Layer

If the event stream is the official historical record, you can think of the Serving Layer as a materialized view or cache of the current state. It is ephemeral in the sense that, because we persist historical events separately, we could replay the transformations from the beginning to regenerate the current state. Or, if we alter our processing logic, we can replay the same events under the new rules to generate the corrected state.
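To make the replay idea concrete, here is a minimal sketch that treats the serving-layer state as a fold over the event log. The event fields (`type`, `video_id`) and the counting rule are assumptions for illustration, not our actual schema.

```python
from collections import defaultdict


def count_views(state, event):
    """Fold one event into the running tallies (hypothetical rule: count views)."""
    if event.get("type") == "view":
        state[event["video_id"]] += 1
    return state


def replay(events, rule=count_views):
    """Rebuild the serving-layer state from scratch by replaying the log in order."""
    state = defaultdict(int)
    for event in events:
        state = rule(state, event)
    return dict(state)
```

Changing the processing logic then amounts to passing a different `rule` and replaying the same log.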

Our serving layer must be able to increment a value atomically rather than simply write a value, since we will be tallying metrics such as view count and duration. Firebase provides this via transactions. Check.
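A Firebase transaction takes an update function that receives the current value and returns the new one, and the write is retried if the value changed in the meantime. The sketch below illustrates that pattern with a toy in-memory reference. `InMemoryRef` is a hypothetical stand-in and not the Firebase SDK; only the shape of the `increment` callback carries over.

```python
import threading


def increment(current):
    """Transaction update: treat a missing value as zero, then add one."""
    return (current or 0) + 1


class InMemoryRef:
    """Toy stand-in for a single serving-layer location (not the Firebase SDK)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def _compare_and_set(self, expected, new):
        # Write only if nobody changed the value since we read it.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

    def transaction(self, update):
        # Optimistic loop: read, compute, and retry if the write lost a race.
        while True:
            current = self.get()
            new = update(current)
            if self._compare_and_set(current, new):
                return new
```

A plain read-then-write could lose counts when two handlers run at once; the retry loop is what makes the tally safe under concurrency.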

And of course, the obvious benefit that sets Firebase apart is the built-in realtime support. As Lambda updates our serving layer, all connected clients receive the update automatically.

We structure Firebase as we would any NoSQL store. As you can see in the gist above, our event data sits at `root/project/:project_id/events/:event_id`.

React for the UI

What could be sexier than standalone, declarative UI components? I am quickly becoming a huge fan of React, and their virtual DOM approach reminds me of the declarative syntax that D3 employs. We are just breaking ground on React/D3 components, but I am confident that we are tilling fertile ground here.

React components are easy to abstract and use across projects. In this case I used Hubspot’s fabulous Odometer library to display view count. They use CSS transitions for the display updates, so everything is performant, and I dropped the JS and CSS into a standalone React component that you’re welcome to use as you like.

The React-Firebase mix-in strips away any need for the glue code you would find in a Backbone view. The example below feels trivial because we are talking about a read-only UI, but in 25 lines of code we have a makeshift router and real time data binding:

In this example I am using CoffeeScript, but I’ve also played with the JSX syntax and plain JavaScript. Ultimately, since the components stand alone, each component can be something different.

Finally, the React-based dashboard is compiled to a single static file delivered via CloudFront. As if everything wasn’t already easy enough, I don’t need server code, and the page never goes down.

You can see the end product here. As you (and others) watch the video, you’ll see the view counts running on the odometer. Not bad for a day’s work.

Next Steps

This approach is showing a lot of potential, but we have some work ahead.

  1. Environment Variables: It is not enough to build 12-Factor Apps. The same rules apply to micro-services and Lambda functions, which need to be able to access config stored via environment variables. Update: Done. See this article.
  2. Authentication: Currently, the proof of concept does not include any sort of authentication on the reporting side. I still want to deliver the real time dashboard as static files over a CDN, but I will want to retrieve Firebase tokens via an API call.
  3. Kill Flask: Our web application traffic is both manageable and uniform. But the event endpoints experience incredible spikes in traffic, so I would love to find a way to pipe events directly to Kinesis or at least buffer them in SQS, eliminating the Flask app outright. If we don’t sunset the Dynamo solution by then, we can push to Dynamo from the event stream, as the gods intended.

Of course, as our use of the event stream and serving layer tandem grows, the footprint of our API will shrink. We will begin to take all our reads from the serving layer, leaning on the API only for authentication and for writes that require validation.

But more on that in a future post. For now, I hope this view into our emerging real time analytics was helpful!

This is article 1 of 3 in a series on Amazon Lambda. If you found the article useful, please recommend it and share with others, and perhaps check out the other two:

Thanks so much!