(Server)less is more

How to build a Google Analytics-like collection platform without deploying a single server

Web analytics in the Age of Context

“Less is more” L. M. van der Rohe

As some tech aficionados eloquently put it, we are now entering the “Age of Context”, an age in which the joint effect of several technological trends (big data, social networks, mobile phone sensors) promises the biggest revolution in consumers’ lives since the Internet.

When used correctly, the unprecedented wealth, frequency and quality of these data streams can power a more personalized and anticipatory experience for all users.

At Tooso we are building a next-generation search and discovery platform in our mission to make online shopping a seamless and customized experience. To achieve this goal, our whole infrastructure needs to support real-time and almost real-time processes at scale.

Most readers will already be familiar with the most famous web analytics platform on Earth, Google Analytics (GA): once you drop some JavaScript into your website, Google gives you a nice app with all sorts of insights about how users interact with your pages. GA use cases are not unique though: any service, like ours, that needs to collect browsing data efficiently and tell users apart (is X the same user that was on page Y yesterday? What is the ratio of new/returning visitors? etc.) will need its own web analytics pipeline.

When designing the third major update of our APIs, we decided to completely revamp our protocol for collecting and ingesting data: could we build our own “Tooso Analytics” by relying entirely on AWS PaaS (Platform-as-a-Service) offerings? In this post, we share our code, infrastructure and tooling choices, hoping that our experience will save you some time when making your own decisions.

We start with a short overview of pixels and use cases for web analytics: if you are already a web ninja, feel free to skip to the next section.

Pixels 101

“Advertising is the greatest art form of the 20th century” M. McLuhan

Web analytics start with pixels.

What is a pixel? While real-time personalization for online shopping is still a cutting-edge frontier, everybody is familiar with pixels through at least one product: online advertising (which explains why they have such a bad rep these days).

Let’s start with a classic (and somewhat simplified) example. You go to awesomeBooks.com and take a look at book XYZ, then you go to your favorite social network mySocial.com and you get an ad showing you precisely XYZ. How does that happen? It happens through a pixel, a 1x1 transparent gif (invisible to the final user) that gets loaded by both awesomeBooks.com and mySocial.com through (and this is the trick) a third domain, say, gugolAnalytics.com.

Simple advertising example: thanks to a pixel from a third party, sites A and B are able to “recognize” the user at two different moments and provide relevant information.

When loading the gif from gugolAnalytics.com, the webpage sends a normal request to the remote server, complete with parameters containing user information. In response, gugolAnalytics.com sends the 1x1 image and a cookie to your browser: it is thanks to these data that mySocial.com can show you an ad based on what happened on awesomeBooks.com.
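
To make the exchange concrete, here is a minimal Python sketch of the server side of a pixel request. Several details are our assumptions for illustration and do not come from any real service: the request is assumed to arrive as an AWS API Gateway proxy-style event dictionary, the cookie is called `uid`, and it lasts one year. Treat it as a sketch of the mechanism, not as the code running behind gugolAnalytics.com.

```python
import uuid
from http.cookies import SimpleCookie

# A widely used base64 encoding of a 1x1 transparent GIF.
PIXEL_B64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
COOKIE_NAME = "uid"               # hypothetical cookie name
ONE_YEAR_SECONDS = 365 * 24 * 3600


def pixel_handler(event, context=None):
    """Answer a pixel request with a 1x1 gif and a user cookie."""
    # Header names may arrive in any casing, so normalize them first.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    cookies = SimpleCookie(headers.get("cookie", ""))

    # Returning visitors keep their identifier; new ones get a fresh one.
    if COOKIE_NAME in cookies:
        user_id = cookies[COOKIE_NAME].value
    else:
        user_id = str(uuid.uuid4())

    # The browsing information travels in the query string; a real service
    # would forward it, together with user_id, to its storage layer here.
    _params = event.get("queryStringParameters") or {}

    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "image/gif",
            "Cache-Control": "no-store",  # force a request on every page view
            # Cookies read across sites generally need Secure + SameSite=None.
            "Set-Cookie": (
                f"{COOKIE_NAME}={user_id}; Max-Age={ONE_YEAR_SECONDS}; "
                "Path=/; Secure; SameSite=None"
            ),
        },
        "body": PIXEL_B64,
        "isBase64Encoded": True,
    }
```

Because the same identifier comes back on every request to the third domain, the server can connect a visit on awesomeBooks.com with a later one on mySocial.com.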

Google Analytics pixel being loaded on Tooso’s website: note the long list of parameters in the HTTP request.

Obviously, pixels can also be used for purposes other than advertising, as they are the most unobtrusive way for online services to collect the information needed to improve their products.

For example, at Tooso we use data from pixels to improve our A.I. models and enhance our understanding of how users browse our partners’ websites, and personalize their experience accordingly.

More generally, as mentioned, any online service that relies on the ability to send real-time information from client to server can use a pixel-like architecture to satisfy its use cases. Now that we know what happens in the browser, let’s look in more detail at the requirements for an end-to-end solution.

Requirements for a modern data ingestion platform

While pixels are very important, they are just the front-end piece of the puzzle: a data platform needs back-end components to reliably store data for further processing (in our case, to do our data science magic). Adding a bit more detail to our previous example, a general “pixel architecture” looks like the following:

Sketch of a web analytics architecture.

Users visit a website that loads a JavaScript library from a CDN (say, tracking.gugolAnalytics.com); the library makes a pixel request to gugolAnalytics.com/pixel.gif, as discussed before (so it passes browsing information in the request and gets back a 1x1 gif and a cookie). Once data have traveled from client to server, thanks to the magical pixel, the backend can process everything and power all the analytics use cases as needed.
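
As a rough illustration of what “passing browsing information in the request” means, the sketch below encodes an event into the query string of the pixel URL and decodes it again on the server side. It is written in Python for consistency with the other examples, although in practice the encoding is done by the JavaScript library in the browser. The parameter names (`v`, `t`, `cid`, `dl`) are borrowed from Google’s Measurement Protocol purely as an example and are an assumption on our part; they are not necessarily the ones any given library sends.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PIXEL_ENDPOINT = "https://gugolAnalytics.com/pixel.gif"


def build_pixel_url(params):
    """Client side: encode the event as query-string parameters."""
    return f"{PIXEL_ENDPOINT}?{urlencode(params)}"


def parse_pixel_url(url):
    """Server side: recover the event from the requested URL."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; we keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


event = {
    "v": "1",                                    # protocol version
    "t": "pageview",                             # hit type
    "cid": "user-123",                           # client identifier
    "dl": "https://awesomeBooks.com/books/XYZ",  # page being viewed
}
url = build_pixel_url(event)
decoded = parse_pixel_url(url)
```

Since everything travels in a plain GET request for an image, the integration on the website boils down to loading one script.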

Putting together front-end and back-end, any analytics platform needs to satisfy three basic requirements:

  • Ease of integration: adding tracking capabilities to partners’ websites should be quick and seamless; Google Analytics’ no-nonsense JS copy+paste is the standard reference here.
  • Scalability: it should handle web-scale traffic seamlessly; bonus points if the platform somehow resizes itself dynamically to respond to peak load.
  • Reliability: it should be robust to underlying (possibly, virtual) hardware failure and it should run with the least possible maintenance and monitoring effort.
While you may think that setting up and maintaining all this is a devOps nightmare, you will be delighted to know that you can add an analytics library to your services without deploying a single server.

We will sketch our solution in the next section.

PaaS to the rescue

“Simplex Sigillum Veri.” TLP (5.4541)

As promised, our solution is completely in the spirit of today’s AWS PaaS offering. In particular, the picture below sketches our AWS-based architecture:

Overview of Tooso’s AWS PaaS architecture

Let’s have a look at all the elements, piece by piece, from left (closer to the final user) to right (deeper in our data pipeline):

  1. JS library: our JavaScript library is inspired by the Google Measurement Protocol, both in syntax and semantics. We serve it globally with minimal latency through AWS CloudFront (in particular, we upload the minified JS to a dedicated AWS S3 bucket linked to our distribution).
  2. Our pixel is served through AWS API Gateway and a Lambda function. We devoted a previous post (with a companion GitHub repo) to explaining how to serve pixels from lambdas, so we are not going to spend more time on that here.
  3. Once the lambda receives the client request with all the user’s data, the event is serialized and dumped to our message broker, i.e. Firehose.
  4. Firehose is configured to write incoming events to a specified S3 bucket — call it s3-target.
  5. A second lambda function is configured to be triggered by objects created in the s3-target bucket. In other words, when a new blob is created by Firehose in s3-target, the second lambda gets invoked. In our use case, this lambda is responsible for ETL, i.e. taking the JSON events written by the first lambda and re-shuffling/enriching the data into the proper format for a database insert.
  6. Finally, an instance of AWS RDS (in our case, if you’re curious, it’s Postgres) hosts the lambda-transformed data in a table structure designed to simplify the job of our data scientists.
  7. (Bonus) AWS CloudWatch hosts the lambdas’ logs and print statements (for monitoring and debugging purposes).
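
Steps 2 to 5 above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not our production code: the stream name `pixel-events`, the row layout and the parameter names (`cid`, `t`, `dl`) are hypothetical, the Firehose output is assumed to be uncompressed, and the AWS clients are passed in as arguments so the logic can be exercised without an AWS account (in a real lambda they would be `boto3.client("firehose")` and `boto3.client("s3")`).

```python
import json
import urllib.parse

STREAM_NAME = "pixel-events"  # hypothetical Firehose delivery stream name


def ingest_handler(event, context=None, firehose=None):
    """Steps 2-3: serialize the incoming request and push it to Firehose."""
    if firehose is None:
        import boto3  # only needed when running on AWS
        firehose = boto3.client("firehose")
    record = {"params": event.get("queryStringParameters") or {}}
    # Firehose concatenates records as-is, so we add our own line delimiter.
    data = (json.dumps(record) + "\n").encode("utf-8")
    firehose.put_record(DeliveryStreamName=STREAM_NAME, Record={"Data": data})
    return {"statusCode": 200}  # the gif response is omitted for brevity


def to_row(raw_event):
    """Re-shuffle one JSON event into a tuple ready for a database insert."""
    params = raw_event.get("params", {})
    return (params.get("cid"), params.get("t"), params.get("dl"))


def etl_handler(event, context=None, s3=None):
    """Step 5: triggered by new objects in s3-target, returns rows to insert."""
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    rows = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.decode("utf-8").splitlines():
            if line.strip():
                rows.append(to_row(json.loads(line)))
    return rows  # step 6 would insert these rows into Postgres
```

Keeping the first lambda this thin is deliberate: it only serializes and forwards, so the raw events land in S3 even if the ETL logic downstream has a bug.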

So, how did we fare on the requirements we set for our solution?

  • Integration: our JavaScript library was inspired by the Google Measurement Protocol to ensure a seamless tag manager integration and great compatibility with established best practices in our partners’ IT departments. Our library is released under an open source license and we are likely to share the un-minified, commented version in the near future.
  • Scalability: AWS lambdas and AWS Firehose (and, in a sense, S3) scale automatically as needed (see below for some caveats though). Amazon RDS databases can also be resized after DB creation to handle increased load/data ingestion (depending on your load/final use cases, other managed solutions may be more appropriate than Postgres).
  • Reliability: our platform is end-to-end built from managed services: as promised, there was no need to deploy a single server to achieve the desired level of functionality. In other words, after the creation of all resources, the solution is basically guaranteed to run insofar as AWS is running — which, while not 100% bulletproof (as nothing is), we can all agree to be a pretty solid baseline! Moreover, a combination of Terraform and Serverless will also provide an amazing solution for a scripted, repeatable process for the initial stack setup.

Where to go from here

Our architecture overview should be enough to give you a clear idea of how to start your own PaaS solution. There are a number of notes, mistakes we made and random thoughts (in no particular order) we would like to share before wrapping up, hoping some of these considerations will save you some time or highlight some important points for your use cases.

  • Security: our overview did not mention the network architecture at all. Our suggestion is to isolate sensitive resources, like ETL lambdas and databases, in a private VPC (in case you’re interested in getting started, Amazon has a nice tutorial). Note to self: remember to set up a VPC endpoint or your in-VPC lambdas will not be able to access S3.
  • Load testing: if you’re curious to verify how it scales for your use cases, there are a bunch of tools you can quickly use to run some tests. We like Goad, for example, which “allows you to load test your websites from all over the world whilst costing you the tiniest fractions of a penny”.
  • Real-time vs quasi real-time: if you choose Firehose as your main ingestion hub, events get dumped to, say, S3 not exactly in real time; in particular, Firehose transfers data only when some size/time threshold is reached. This may be good enough for some use cases, but not for others: if the product needs to adapt at every page visited by the user, a one-minute delay from client event to server processing may look like an eternity. One option is to switch from Firehose to Kinesis, which is also a managed service but is designed for streaming data in real time (the downside is that scaling is a bit trickier with Kinesis); a second option, which we are exploring right now, is to load the data that are essential for real-time personalization into a fast, shared cache (in our case, our beloved Redis). By leveraging Redis’ lightning-fast read and write performance, we are sure that user information becomes available to the entire pipeline milliseconds after the first client event, with no noticeable latency in the pixel HTTP request.
  • Replayability: as should be clear by now, our main design philosophy can be summed up as “dump all data now and worry later”; by using only auto-scaling, managed services from pixel to persistence (in S3), we make sure that, whatever happens later in the pipeline (say, an ETL bug, a DB overload, etc.), we are always able to replay all client events as needed. If you want to achieve something similar, we suggest you add, as we did, two important parameters to the pixel call from the JS library: one is a client timestamp (useful to re-order the events irrespective of how/when they got loaded), the other is a unique client event identifier (if you need to replay/re-ingest data, a UUID on events is very useful to deterministically avoid duplicates).
  • Authorization in API Gateway: in case you didn’t know, API Gateway offers a nice mechanism to decouple your functions from authorization logic. We strongly recommend encapsulating authentication logic and basic API key checks in a separate function. Note: if you use Serverless to deploy lambdas, at the time of writing the first draft of this article the framework had a bug which greatly restricted the authorization options: you can check the progress here.
  • Debugging and monitoring: while lambdas are, for many reasons, a pleasure to work with, sometimes debugging and monitoring through CloudWatch feels a bit frustrating; when possible, make sure to have test events ready to save you from chasing down regression errors later.
  • Lambda scaling: while lambdas are often advertised as “virtually infinitely scalable”, there are indeed some out-of-the-box limits on concurrent executions. Also, lambdas’ advertised cost efficiency (you only pay when they get invoked, not 24/7 like a standard EC2 instance) may break down if you reach very high volumes of traffic. Bottom line: if your service grows beyond a certain size, the trade-off between ease of maintenance/deployment and costs at scale may no longer favor a purely lambda solution; however, to go from zero to end-to-end quickly and cheaply, lambdas are very hard to beat.
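
To illustrate the replayability note above, here is a minimal Python sketch of how a client timestamp and a unique event identifier make re-ingestion safe. The field names `event_id` and `client_ts` are hypothetical, chosen only for this example; the idea is simply to drop events already seen and then restore the order in which the client produced them.

```python
def prepare_replay(events):
    """Deduplicate events by identifier, then order them by client time."""
    seen_ids = set()
    unique_events = []
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # the same event was delivered or loaded twice
        seen_ids.add(event["event_id"])
        unique_events.append(event)
    # Arrival order is not reliable, so we sort on the client timestamp.
    return sorted(unique_events, key=lambda event: event["client_ts"])


# Events as they might be read back from storage: out of order, one duplicate.
raw_events = [
    {"event_id": "b2", "client_ts": 1500000002, "page": "/cart"},
    {"event_id": "a1", "client_ts": 1500000001, "page": "/books/XYZ"},
    {"event_id": "b2", "client_ts": 1500000002, "page": "/cart"},
]
replayed = prepare_replay(raw_events)
```

With a real database, the same effect can usually be obtained by putting a uniqueness constraint on the event identifier, so that replaying a file twice cannot create duplicate rows.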

See you, space cowboys

If you have questions, feedback or comments, please share your serverless story with jacopo.tagliabue@tooso.ai.

Don’t forget to get the latest from Tooso on LinkedIn, Twitter and Instagram.

Acknowledgments

Fabio Melen’s exceptional ingenuity was behind every single good idea in the development of the new platform: if we moved pretty fast and broke not that many things, it’s mostly thanks to his talent and vision.

Ben from Queens has been invaluable in understanding some AWS subnet intricacies and saving us countless sleepless nights: he was indeed the devOps hero we needed (but Queens still deserves him more, apparently).

Katie’s help, as usual, scaled up automatically to give us her timely, invaluable feedback on earlier versions of this article.

Finally, many thanks to Ryan Vilim and Ang Li for very helpful comments on previous drafts of this post.
