Building a distributed, query-friendly application log store with Webtask.io and MongoDB


I had occasion to work with MongoDB Atlas recently, which makes it easy to quickly spin up a MongoDB replica set, with more advanced options for things like sharding as well. I work with so-called “serverless”, Function-as-a-Service tooling like AWS Lambda, among other common and internally developed frameworks, but I decided to try out Webtask.io for this exercise in logging, with MongoDB storing my event records. The idea was that if I could store each log event as a new record, using Webtask as my log endpoint, it would be easy to build tooling around it to query logs, sort by various keys, manage log expiration, and so on.

What attracted me to Atlas as my MongoDB solution, besides the obvious top-tier support from the vendor, were the features that offer deeper integration into an application environment: for example, peering with your AWS VPC, automated security policies, and some truly cool automation around sharding, user ACLs, and replica sets.

For the purposes of this tutorial, you’ll only need a single replica set and a single user, but the rest of these features make the real strengths of the platform apparent, and I recommend using Atlas here (there is a free tier, but I suggest the low-cost option on the next tier up for compatibility reasons with the Node.js driver, which I’ll go into later on).

A few words on “Serverless” Architecture

This example isn’t really about “Serverless”, but it’s important to understand what Function-as-a-Service does at a high level and why this approach may be quicker, easier, or more scalable for this kind of task.

Webtask makes it fairly straightforward to create an HTTP endpoint (which you call exactly like any other kind of HTTP API) that runs your task. In our case, the task takes in log data as the payload of a POST request and executes the following before terminating (see the sketch after this list):

  1. Reads the parameters
  2. Connects to MongoDB
  3. Inserts the record
  4. (If it’s successful) exits 0!
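At the code level, a webtask is just an exported Node.js function that Webtask calls with a context and a callback. Here is a minimal, hypothetical sketch of that shape (the real task, which connects to MongoDB, comes later in this post):

module.exports = function (ctx, cb) {
  // ctx.data carries the URL parameters and any secrets set on the task
  console.log('received log event for app:', ctx.data.app_name);
  // Calling the callback ends the run; the first argument is an error, if any
  cb(null, { status: 'ok' });
};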

How this differs from running a web app to do this work is subtle: to you, the user/developer, the experience is the same, but as the systems administrator, you don’t need to do the legwork of configuring, say, the environment for a Node.js application, or keep it online when it’s not actually in use. Even with traditional PaaS technology, you are still concerned with resource usage and running hours for the application. With “serverless” architecture/Function-as-a-Service, the unit of execution is not taking in and responding to requests, but the successful completion of the task. This is useful when completing the task matters more than uptime, usually because the task is the entirety of the work being done.

I’ve written about this before, but I’ll reiterate a bit: Michael Hausenblas of Red Hat, formerly of Mesosphere, breaks it down in the most concise, approachable way I’ve seen thus far:

PaaS and Serverless are certainly closely related. The main differences so far seem to be:
Unit of execution: with PaaS you’re dealing with a set of functions or methods, in Serverless land with single functions.
Complexity: with PaaS you have to conform to a number of (contextual) requirements, need to set up stuff, etc. while with Serverless you only need to specify your function.
Pricing: with PaaS you’re paying for the whole package and in Serverless land only per (successful) function call/execution.

The reasons I’m using Webtask here are 1) it’s easy to set up an HTTP interface for MongoDB, 2) I can manage secrets like credentials and the data I’m transferring, 3) I don’t have to build anything other than an HTTP(S) POST request from my application (libraries for which are native in most application frameworks), and 4) I don’t need a complex logging framework to hook this up to.

Note: for the purposes of this tutorial, the work done here likely won’t be production-grade, and for the sake of simplicity, things like how to harden this workflow won’t be covered in any real detail.

What you’ll need

You’ll need a MongoDB cluster; there are a lot of ways to do this, but as I said, I used MongoDB Atlas.

You’ll also need a Webtask account and the wt CLI (see the commands below).
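If you don’t have the CLI yet, it ships as an npm package; roughly, the setup looks like this (wt init walks you through linking your account):

npm install wt-cli -g
wt init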

For your MongoDB cluster, I recommend using version 3.2 (available on Atlas), because the version of the Node.js MongoDB driver currently in use may not work as expected with MongoDB 3.4 (important if you plan to use the M0 cluster size on Atlas). If you encounter an error like this while running against 3.4:

{
  "code": 400,
  "error": "Script returned an error.",
  "details": "MongoError: no SNI name sent, make sure using a MongoDB 3.4+ driver/shell.",
  "name": "MongoError",
  "message": "no SNI name sent, make sure using a MongoDB 3.4+ driver/shell.",
  ...

Verify that the version of the driver you are using is compatible with your MongoDB version.
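One quick way to confirm which driver version your task is actually loading (this is a generic Node.js check, nothing Webtask-specific) is to log the driver’s own package version from inside the task:

// Prints the version of the mongodb driver that require() resolves to
console.log('mongodb driver version:', require('mongodb/package.json').version);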

Preparing the Webtask

If you used Atlas to create your MongoDB cluster, you can grab your Mongo connection string from your cluster’s page.

Make note of this, and your credentials, as you’ll be storing them as part of your Webtask.

Before you create your Webtask, you’ll need to define the task. One of their sample tasks connects to MongoDB and inserts a single record with insertOne, so we’ll use that as a base, but modify the following function:

// Insert one log record per request. Unlike the sample, the fields come
// from the webtask context (ctx.data) rather than static data; ctx has to
// be passed in from the task's exported function for it to be in scope here.
function do_something(ctx, db, done) {
  db
    .collection('my-collection')
    .insertOne({
      date: ctx.data.date,
      app_name: ctx.data.app_name,
      client: ctx.data.client_ip,
      response_code: ctx.data.response_code
    }, function (err, result) {
      if (err) return done(err);
      done(null, result);
    });
}

The relevant change is that you are not inserting static data, as in the sample, but taking parameters POSTed to your webtask, which are then parsed and inserted. These can be whatever data you’d like; for my sample application (next), I want the client’s IP, the date of the request, the app name (in case I’m logging for more than one application service and don’t care to have separate collections), and the response code. Note that, unlike the sample, do_something() also takes the webtask context so these fields are in scope.

This data is extracted from the URL (following the ctx.data path to the relevant variable) and accessed when the function executes at run time. You can download this modified script here.
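For reference, the surrounding wrapper from the sample stays essentially as-is; a sketch of it, assuming the MONGO_URL secret described below and the do_something() function above, looks like this:

var MongoClient = require('mongodb').MongoClient;

module.exports = function (ctx, done) {
  // MONGO_URL is supplied as a webtask secret when the task is created
  MongoClient.connect(ctx.data.MONGO_URL, function (err, db) {
    if (err) return done(err);
    // Hand the open connection (and the context) to the insert function above
    do_something(ctx, db, function (err, result) {
      db.close();
      done(err, result);
    });
  });
};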

The Sample Application & Webtask

Preparing your application for this is pretty straightforward. Take a simple Ruby app like this:

require 'sinatra'
get "/" do
   return "hello."
end

Modifying it to log to your Webtask is pretty simple. First, go ahead and create the Webtask from the directory where your modified script lives, on the machine where you installed the wt CLI tool.

You’ll be creating a task from the modified sample (which I called mongolog.js) and initializing it with a secret, MONGO_URL, which is your Mongo connection string including your MongoDB credentials. If you grabbed your URL string from Atlas, you just need to replace <PASSWORD> with the password you set up when the cluster was created, and <DATABASE> with a database name of your choosing; the task handles insertion into the collection (this is a lazy create; if the collection doesn’t exist, it will be created once the first record is inserted).
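Assuming the file name above and the MONGO_URL secret the sample expects, the create command looks roughly like this (paste in your own connection string with <PASSWORD> and <DATABASE> filled in):

wt create mongolog.js --secret MONGO_URL="<your Atlas connection string>"

wt create prints the URL of the new task; that URL is what your application will POST its log data to.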

Now, with your Webtask URL in hand, update your application:

You’ll see all I did was add an HTTP client and POST the data I wanted to my task endpoint, which in this scenario acts as the central HTTP logging endpoint for services (if this were a larger, decoupled app, for example).

Once the app is updated to connect to Webtask, run wt logs on your client machine to stream the task’s log output as requests come in.

Some thoughts on enhancing MongoDB’s role and securing the workflow

Back on your MongoDB cluster, you can use db.<collection>.find() functions to search your logs, or, if you want, access your log data through another method.
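For example, in the mongo shell, assuming the my-collection name used by the task above (the app name here is hypothetical), queries against the log records look like this:

// Most recent error responses; URL parameters arrive as strings, so match accordingly
db.getCollection('my-collection')
  .find({ response_code: '500' })
  .sort({ date: -1 })
  .limit(10)

// Everything logged by a single application service
db.getCollection('my-collection').find({ app_name: 'hello-app' })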

You can see from this example that MongoDB can store data very flexibly, and with Webtask as an HTTP frontend for whatever data you’d like to keep, you can, for example, create a loose time series and refine it into more specific data to track all sorts of characteristics of activity across the services that make up your application. This is a sample “hello world”-style application with no deeper functionality, but in a more complex microservice architecture you could have services register errors and other events, and use that data in a variety of ways upon retrieval.

One potential enhancement for such an architecture, and a way to integrate it further, would be to use the dump and restore tooling to back up old logs (to object storage like Minio or S3, for example) and, in Mongo itself, set TTLs on the records to expire them from the database after a set number of seconds, keeping your cluster storage in check.
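In MongoDB, that expiration is handled by a TTL index rather than a per-record setting; a sketch, assuming the my-collection name from the task and that the date field is stored as a real BSON Date (strings won’t expire), looks like this in the mongo shell:

// Expire log records 7 days (604800 seconds) after their 'date' value;
// the TTL monitor removes expired documents in the background.
db.getCollection('my-collection').createIndex(
  { date: 1 },
  { expireAfterSeconds: 604800 }
)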

Because Webtask endpoints are publicly accessible, securing this workflow for production becomes a little more complicated, but it’s a good exercise in extending your task’s functionality. You could, for example, limit requests to the task to your application server, or have the task validate a token sent along with the event data in the POST request. More ideally, you could implement a fully featured authentication system; Webtask supports a few different options (including integration with Auth0, the maintainers of Webtask).
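As a minimal illustration of the token check (not Webtask’s built-in auth), assuming the expected value is stored as a hypothetical webtask secret named LOG_TOKEN and the caller sends a token parameter that ends up in ctx.data, the top of the task could simply reject mismatches:

module.exports = function (ctx, done) {
  // LOG_TOKEN is a hypothetical secret set with --secret at create time;
  // reject any request that doesn't present the matching shared token.
  if (!ctx.data.token || ctx.data.token !== ctx.data.LOG_TOKEN) {
    return done(new Error('unauthorized'));
  }
  // ...continue with the MongoDB connection and insert as before...
};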

In short, this is one way to have your applications make use of serverless architecture without additional overhead like more application nodes or administering and scaling database clusters, reducing that work to updating your Webtask to reflect how your data gets managed, without modifying your application’s components.