Capturing and Integrating Service Data with Google Cloud Functions and Neo4j

All code mentioned in this article can be found on GitHub.

Many services provide a mechanism for webhooks: the ability to call some custom URL whenever a certain action takes place. With so many different systems and triggering needs, it’s a nice and easy way to make events on one person’s proprietary platform available to publish-and-subscribe-style behaviors somewhere else. It’s also very simple: a single HTTP POST with a JSON body.

Real examples include:

Data Capture

More than just triggering a custom behavior, many use cases require capture first. Recently, when developing some cloud-deployable images, I wanted an easy and flexible way to capture data from the cloud service, which offered me the ability to call a webhook whenever someone deployed my package. I knew I wanted to get this into a Neo4j graph, because I’d eventually need to integrate this data with another graph I had.

Webhook captured data in neo4j

Google Cloud Functions

GCP’s Cloud Functions are a way of creating stand-alone functions running on top of cloud infrastructure. They’re “serverless” because Google manages all of the provisioning and runtime parts. The developer just writes a set of functions, deploys them to the cloud, and they’re run when needed. Ideally, these functions are also stateless and idempotent, which helps with reuse.

Well, when you see the tagline “respond to events in the cloud”... hey, that sounds like webhooks!

For me, another advantage for the webhooks was that I was expecting fairly low volume (say, hundreds per day at most), and I didn’t want to pay to host a VM 24/7 or bother to take care of that VM.

How Neo4j Cloud Functions can capture data

The code associated with this article can be found in the neo4j-serverless-functions repo on GitHub.

Capturing data turns out to be fairly easy. Google’s Cloud Functions expose an API that in JavaScript looks like (or may be) the Express.js API. The simplest possible function would look like this:

exports.myCloudFunction = (req, res) => {
  // Respond with HTTP 200 and a JSON-encoded string.
  return res.status(200).json('All OK!');
};

To capture data from a webhook, all we need to do is look at the data coming in on the request, things like its headers, its POST body and so forth, and then transform that data into a Cypher statement that creates data in Neo4j. You can see that code here.
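As a minimal sketch of that first step, the handler below just collects the parts of an Express-style request that a capture function would care about and echoes them back. The name inspectRequest is made up for illustration and is not taken from the repo, and the sketch assumes the platform has already parsed the JSON body into req.body.

```javascript
// Hypothetical handler (not from the repo). In a deployed function this
// would be assigned to a property of `exports`.
const inspectRequest = (req, res) => {
  // Webhooks arrive as POSTs; reject anything else.
  if (req.method !== 'POST') {
    return res.status(405).json({ error: 'Only POST is supported' });
  }

  // Assumption: the JSON body has already been parsed into req.body.
  const captured = {
    headers: req.headers || {},
    query: req.query || {},
    body: req.body || {},
  };

  return res.status(200).json(captured);
};
```

Everything a capture function needs is in those three objects; the rest is deciding how to map them onto nodes and properties.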

Creating Nodes from Webhook Data

Suppose we had a simple POST, like this:

curl -XPOST -H "myheader: foo" -d '{"name":"Bob"}' http://cloud-endpoint/node?label=Person

That then would turn into the Cypher equivalent of:

CREATE (r:Request { myheader: "foo" })
CREATE (n:Person { name: "Bob" })
CREATE (r)-[:POST]->(n);

This is a simple way of capturing auditable data from an external service: any service, any JSON schema. Of course, the property definitions in Neo4j are going to be far from ideal, but this can be manipulated with Cypher after the fact.
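Here is one way the request-to-Cypher step could be sketched. This is not the repo’s implementation: buildNodeStatement and its return shape are hypothetical. It leans on two things that are true of Cypher: property maps can be passed as parameters, but a label cannot, so the label has to be checked and then spliced into the query text.

```javascript
// Hypothetical helper (not from the repo). Only plain identifiers are
// accepted as labels, because the label is interpolated into the query.
const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

const buildNodeStatement = (req) => {
  const label = req.query && req.query.label;
  if (typeof label !== 'string' || !IDENTIFIER.test(label)) {
    throw new Error('Missing or invalid label');
  }

  return {
    // Header and body values travel as parameters, never as query text.
    cypher: [
      'CREATE (r:Request $headers)',
      `CREATE (n:${label} $body)`,
      'CREATE (r)-[:POST]->(n)',
    ].join('\n'),
    params: { headers: req.headers || {}, body: req.body || {} },
  };
};
```

The result could then be handed to the driver with something like session.run(cypher, params). Note that Neo4j properties can’t hold nested maps, so a nested JSON body would need flattening first; this sketch assumes a flat body like the one in the curl example.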

Creating Edges from Webhook Data

For APIs which will allow some customization of the URL, you can use the same approach to knit together the graph. The code package includes a separate “edge” function which can be given a property name and value, and which will serve to draw new relationships if that’s appropriate for your use. Suppose someone friends someone else on a social network; you could invoke the webhook:

http://cloud-endpoint/edge?fromLabel=Person&fromProp=userid&fromVal=PERSON_A&toLabel=Person&toProp=userid&toVal=PERSON_B&edgeType=KNOWS

Which would have the same effect as Cypher like:

MATCH (a:Person { userid: "PERSON_A" }),
(b:Person { userid: "PERSON_B" })
CREATE (a)-[:KNOWS]->(b);
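The mapping from those query parameters to Cypher might look roughly like the sketch below. Again, buildEdgeStatement is a made-up name and not the repo’s code. Labels, property names and the relationship type can’t be Cypher parameters, so they are validated and interpolated; the two values being matched are passed as parameters.

```javascript
// Hypothetical helper (not from the repo).
const IDENT = /^[A-Za-z_][A-Za-z0-9_]*$/;

const buildEdgeStatement = (query) => {
  // These five end up in the query text, so they must be plain identifiers.
  const structural = ['fromLabel', 'fromProp', 'toLabel', 'toProp', 'edgeType'];
  for (const name of structural) {
    if (typeof query[name] !== 'string' || !IDENT.test(query[name])) {
      throw new Error(`Missing or invalid parameter: ${name}`);
    }
  }

  const { fromLabel, fromProp, toLabel, toProp, edgeType } = query;
  return {
    cypher: [
      `MATCH (a:${fromLabel} { ${fromProp}: $fromVal }),`,
      `      (b:${toLabel} { ${toProp}: $toVal })`,
      `CREATE (a)-[:${edgeType}]->(b)`,
    ].join('\n'),
    // The values to match on are ordinary parameters.
    params: { fromVal: query.fromVal, toVal: query.toVal },
  };
};
```

One thing to keep in mind: query-string values arrive as strings, so a numeric userid stored as a number in the graph would not match without a conversion step.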

“Stateless Functions” aren’t really stateless

An interesting part here is that one of the ways serverless functions are sold involves the idea of statelessness. Ideally, your cloud function is a pure function, in the sense that it produces a value and has no side effects. As usual, reality intrudes on perfection…

Under the covers, of course, Google has to deploy this function to an actual container or server. This can cause confusion, since the reality is that “serverless ain’t serverless”. At least some of the time, your code is hot-deployed somewhere rather than spinning up from cold on every call. Starting cold every time would be very inefficient, doubly so if you’re using a heavyweight runtime (e.g. Java).

In Google’s tips and tricks for Cloud Functions, they point out that it’s a best practice to use variables to reuse objects in future invocations. Let’s take a look at what that means in the Neo4j case.

In the code module that sets up the Neo4j driver, there’s a persistentDriver variable which holds a driver instance between function invocations.

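// Assumes `neo4j` (the neo4j-driver module) and `_creds` (fallback
// credentials) are defined earlier in this module.
const driverSetup = () => {
  // Environment variables take precedence over the fallback credentials.
  const username = process.env.NEO4J_USER || _creds.username;
  const password = process.env.NEO4J_PASS || _creds.password;
  const uri = process.env.NEO4J_URI || _creds.uri;

  const auth = neo4j.auth.basic(username, password);
  return neo4j.driver(uri, auth);
};

// Module-level variable: survives between invocations on a warm instance.
let persistentDriver = null;

exports.getDriver = () => {
  // Create the driver lazily, on first use only.
  if (!persistentDriver) {
    persistentDriver = driverSetup();
  }

  return persistentDriver;
};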

Neo4j driver objects are “heavyweight” according to the documentation, and shouldn’t be created in large numbers willy-nilly, or you’ll see performance impacts. By having a persistent driver, database connections can be reused between function invocations. And because the driver sits behind an accessor function, it gets “lazy created” only when it’s needed, rather than just lying around.

What if we did it the other way? If we didn’t have a persistent driver here, the function would still work; but we’d likely spam the database with connections. Probably OK for dozens of requests, but that function would never scale to large numbers of calls.
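To see the difference in miniature, here is a sketch that swaps the real driver for a counting stand-in. Nothing here talks to Neo4j; makeLazy and fakeDriverSetup are invented for the demonstration, and the three calls stand in for three invocations landing on the same warm instance.

```javascript
// Wraps any expensive factory so it runs at most once per warm instance.
const makeLazy = (factory) => {
  let instance = null;
  return () => {
    if (instance === null) {
      instance = factory();
    }
    return instance;
  };
};

// Stand-in for driverSetup(): counts how many "drivers" get created.
let created = 0;
const fakeDriverSetup = () => {
  created += 1;
  return { id: created };
};

const getDriver = makeLazy(fakeDriverSetup);

// Simulate three invocations hitting the same warm instance.
const drivers = [getDriver(), getDriver(), getDriver()];
console.log(created);                    // 1
console.log(drivers[0] === drivers[2]);  // true
```

Calling fakeDriverSetup() directly in each invocation would push that counter to 3, and with a real driver each of those would bring its own connections to the database.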

Cloud Functions All Over the Place

This article covers some simple examples of how to do a basic integration. But beyond two simple functions, you can (and people surely do) layer on quite a few more.

When you understand the model, you’ll quickly see how an entire backend can be written with only cloud functions. What, no VMs? No docker containers? That’d be a “backend-less backend”. Very zen.

Unfortunately, it’s not quite so simple. As I described with statelessness above, Google Cloud Functions do have both state and servers, of necessity. As a result, the serverless abstraction is a bit leaky, and there are some bits of a function’s lifespan that are good to know as you get into them. I also found this article to be a good, concise overview of some key architectural issues you’ll face in the brave new world. As with all things in engineering, it’s not what’s right or wrong, it’s what’s a good tradeoff for your use case.

Next Steps

In this simple example, I’m glossing over a number of things you’d want to do in a more heavy-duty use case; I’m just after capturing data from a few webhooks here and there. For example:

  • For many cloud functions, you’ll want to have an identity store and verify the caller. The code described here is effectively open, so if someone knows your endpoint, they could fill your DB with junk if they were so inclined.
  • The code we’re discussing stores the data in a “naive” format, for minimal effort and maximum flexibility. But with many webhook calls, it will result in a situation where you have a lot of disconnected nodes in your graph. If you want to change that and iteratively build a better graph, you can take a variant of the code provided. All that’s needed is to modify the Cypher that runs to whatever fits your use case. You can write a body transformer too that filters which JSON payloads it accepts, or transforms them (for example, taking a Slack JSON response, and breaking it into multiple objects/nodes such as channels, users, messages, etc).
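Both bullets could start out as small pre-processing steps in front of the Cypher. The sketch below is only an illustration under loud assumptions: a shared secret in an invented x-webhook-token header stands in for a real identity store, and the payload shape (channel, user, text) is made up for the example and is not Slack’s actual schema.

```javascript
const crypto = require('crypto');

// Assumption: callers send a shared secret in a hypothetical
// "x-webhook-token" header. A real identity store would replace this.
const isAuthorized = (req, expectedToken) => {
  const given = Buffer.from(String((req.headers || {})['x-webhook-token'] || ''));
  const expected = Buffer.from(String(expectedToken || ''));
  // timingSafeEqual requires equal lengths; an empty secret never matches.
  return expected.length > 0
    && given.length === expected.length
    && crypto.timingSafeEqual(given, expected);
};

// Hypothetical body transformer: accepts only payloads with the made-up
// shape { channel, user, text } and splits them into nodes to create.
const transformBody = (body) => {
  const fields = ['channel', 'user', 'text'];
  if (!body || !fields.every((f) => typeof body[f] === 'string')) {
    return []; // filtered out: nothing to write
  }
  return [
    { label: 'Channel', props: { id: body.channel } },
    { label: 'User', props: { id: body.user } },
    { label: 'Message', props: { text: body.text } },
  ];
};
```

A handler would call isAuthorized first and return a 401 on failure, then turn each entry from transformBody into its own CREATE (or MERGE, to avoid duplicate channels and users) with relationships between them.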