Chris Vontas
19 min read · Oct 13, 2016
Light and Shadow in the Carina Nebula. Credits: NASA

Today we are announcing beta availability of our platform, Red Sift. Over the last year, we have been working on our vision to Automate the Planet. We feel there is too much of a disparity between what technology can accomplish today and what normal human beings experience. We spend too much of our time and energy searching through data and performing robotic tasks, all the while knowing that today’s technology holds within it the power to make our lives fundamentally better. If only there was a way for us to create automations and agents that could monitor the data we care about and perform sophisticated work on it so we don’t have to. Work that helps us prioritize better or get new insight into the choices we make. This platform is Red Sift, and the agents of change are the apps that run on our platform — we call them Sifts.

But we can’t do it alone. There are too many things to build and we don’t believe one organization can do them all. That is why we built Red Sift first and foremost as an open platform. We want to empower developers to build solutions to complex problems and make them available to the world. As a Red Sift developer, you can use our platform for free to develop Sifts and publish them publicly if you wish. Sifts can compute on any data they can get access to: think public or OAuth-provisioned APIs, bot channels or even your email Inbox. When we first started on this journey, we realized that the Inbox holds a large percentage of our digital footprint. But today it is a dumb database hiding behind an archaic API; we can’t really get at this data and do anything meaningful with it. We set out to fix this and made a first-class integration between email and our computation cloud.

Rather than just talk about it, let’s cut some code. Why don’t we start by building a relatively simple Sift that runs on our Inbox and extracts some insight? Email isn’t everything but it is a pretty good place to start.

Like most people, I need help triaging my Inbox. Right now, I don’t have a good overview of which messages are brief enough that I can probably plough through them and which ones I need to set aside reasonable time for a thoughtful response. While there are a number of techniques I could apply to extract nuance from text, for the purposes of this introduction let’s start with something simple: counting the words in a message and giving me an indication integrated with my Inbox list view.

When it comes to email, we currently have first class support for the Gmail Web App in the Chrome browser. Let us know what platform you want us to tackle next.

I am Sift

On Red Sift, this packaged application is a Sift. Think of it like an app for data. With one click, a user can securely deploy it against their own data and get your agent automating or augmenting their life. We love GitHub and use it as our ‘store’. You can work on your Sifts in private or public GitHub repos, individually or collaboratively, and when you are ready, share them with the world. Let’s build this Sift from the ground up.

Email, like most real-world event streams, is pretty messy. We provide first-class integration with IMAP that makes it easy to ‘subscribe’ to messages and compute on them live as they happen. We don’t just limit you to new stuff though; you may want a sample of messages from before your Sift was installed, let’s say to train a classifier or calculate historic metrics. However, doing this increases complexity: what happens when we are streaming old stuff and a new email arrives? Do we wait for the archive messages to drain through your DAG before we serve you a new one? What happens if an error occurs, let’s say you were calling an external API in your node and something went pop? We also care a lot about privacy, so Red Sift does not store a copy of our users’ Inboxes. Everything is ephemeral and computed on demand. Great for our users, but as developers we need to invest the effort to keep true to our vision of a cloud that puts privacy first.

DAG (short for Directed Acyclic Graph) is the paradigm that we use to describe workloads for Dagger, our computation platform.

To simplify all these requirements, we do not provide ordering guarantees for an email stream. We promise to send a message at least once, but may repeat the message if we need to retry the graph. As your Sifts become more sophisticated, you will need to consider this in the design but thankfully we have some simple patterns that just do the right thing and promote a clean, resilient implementation.

Ready? Let’s build it.

Our SDK helps you get started, so let’s download it.
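If you have Node.js installed, npm is the quickest route. The package name below is illustrative rather than canonical, so check our documentation for the exact install command:

npm install -g @redsift/sift-sdk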

run the above command in a new shell

We currently support OS X and Linux distributions. If you are a Cloud9 user, you can also follow this tutorial and run our SDK in the cloud (just remember to add the -p 8080 argument when running your Sift to point to the correct port). If you use Docker, we can support local development in multiple programming languages, but let’s start with Javascript.

The Red Sift SDK includes simulation versions of the components in our cloud so you can develop a Sift entirely locally.

installing the SDK

Give this a few minutes to install our SDK. We can then create the scaffolding for a new Sift; let’s call it counter.

redsift create counter

Choose the minimal-sift template and pick the defaults for the Sift metadata. This just creates a few files in a folder to get us started.

cd counter

creating the counter sift

Now just do redsift run

At this point the SDK will set up an environment that simulates a local and limited version of the services we provide in our production cloud. In a few moments, the SDK will serve our empty Sift in your Chrome browser.

view of the SDK

We now have some placeholder content in what we call our ‘Summary View’. This white card is what a user sees in their mail client when they install Red Sift for Gmail, or on their home screen at redsift.cloud. It is meant to provide an at-a-glance overview of the data your Sift has processed and the relevant insights you want presented. It is a sandboxed HTML canvas for our developers, so if you want to show your Sift’s users a picture of a lolcat, do feel free to do so.

Open your favourite code editor and let’s take a look at the contents of our counter folder. The main items of interest are the sift.json file and the frontend & server folders.

Abandon hope all ye who continue here. Just joking, but a basic understanding of Javascript, ES2015 syntax and HTML5 features is required to make the most out of this example.

A manifest called sift.json

The sift.json is what we like to call our manifest. It describes the top level structure of your Sift, the data it wants to consume and the data your Sift is going to export. Right now, not much of interest is happening. We have a single javascript node at server/node1.js that will be scheduled to run every 15 minutes (900 seconds) in our cloud on behalf of the user. We can simulate this locally by running this node via the run menu on the left.

And not much happens (besides DAG complaining that there is no input). We are not selecting any data or doing anything interesting in the node so let’s fix that.

First off, we want to process all the important email a user gets. We can select this data set in the sift.json. Let’s replace the empty emails entry in the inputs section of the DAG with the following.
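The exact input schema lives in our docs; as a rough sketch (the field names below are illustrative, not the literal schema), the entry looks something like this:

"inputs": {
  "emails": {
    "gmailEmails": {
      "filter": {
        "conditions": [{ "folder": "Important" }]
      },
      "wants": "textBody",
      "archive": "1 week"
    }
  }
}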

Breaking this down,

  1. We want emails and have labelled the input gmailEmails.
  2. We want 1 week of the user’s archive in addition to new emails as we want to show a new user some of our results.
  3. We want the text of the message; we don’t care about attachments in this use case.
  4. We want everything from the Important Gmail folder as we only want to count messages that have been flagged as important.

As we are not using any other streaming data source, we can delete the slack-bot and webhooks inputs.

Now let’s wire this up to our node by adding it as an input.
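Sketching it in the same illustrative shape as before, node1 now consumes the email stream rather than running on a schedule:

"nodes": {
  "node1": {
    "implementation": { "javascript": "server/node1.js" },
    "input": { "bucket": "gmailEmails" }
  }
}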

We also want this to run in real time against the messages a user gets, so we have removed the when clause in our implementation. Now our node gets events as new emails come in. In the SDK, however, we need to pull this data manually while we are developing.

Select the inputs menu on the left and you can see the gmailEmails input that is currently in a ‘Disconnected’ state. Click on the ‘Google’ button to connect an email account you want to test with via an OAuth login. This flow gives your local SDK access to your account so it can pull down emails for your development nodes to process. You can now click the ‘Download’ button to get a snapshot of the emails in your inbox that matched your subscription criteria.

Your console will now be filled with a dump of what your node was called with. This is a JSON data structure called JMAP and is what Red Sift uses when providing your node with email messages. There is nothing terribly special about the data we pass through; if it was a Slack message, the event would be identical to the contract for the Slack API.

Crack open server/node1.js and let’s have a look.

The default implementation is essentially just logging the messages and returning a simple data structure. The messages are provided as an array of events as our platform may batch up messages if multiple events are available for processing. In the next node we write, we will also see why this proves helpful for doing real work.

I only want YOUR emails

Let’s change this implementation to achieve our word counts. First, we are only interested in emails we receive — we don’t want to compute metrics for the messages we sent, so let’s filter them out. Go ahead and add the following check after the declaration of the json variable.
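As a sketch, with the user and from.email field names being assumptions (the console dump from the previous step shows the exact names in your data):

if (json.user === json.from.email) {
  return null; // skip messages the account owner sent
}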

We simply check that the user (i.e. the owner of the account) is not the person originating the email.

Next we need to get at the text content of the message and clean it up before tokenizing it and counting the words.

We can preprocess the text any number of ways but given that these are pretty common tasks when processing text in emails, we made a tiny library for this. You can import it into your Sift with npm. On the command line:

cd server && npm install --save @redsift/text-utilities

Then ‘require’ it and use it in your code as you would in any NodeJS application.
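Here is a sketch of how node1.js might look at this point. The got input shape follows the scaffold’s own comments, and trimEmailThreads is a stand-in name for whichever cleanup helper you pick from the library:

// server/node1.js
const textUtils = require('@redsift/text-utilities');

module.exports = function (got) {
  // got.in.data is the batch of email events for this invocation
  return got.in.data.map(function (datum) {
    const json = JSON.parse(datum.value);
    if (json.user === json.from.email) {
      return null; // skip messages the account owner sent
    }
    // Clean the body before tokenizing; the helper name is a stand-in
    const text = textUtils.trimEmailThreads(json.textBody || '');
    const count = text.split(/\s+/).filter(function (word) {
      return word.length > 0;
    }).length;
    console.log('word count:', count);
    return count;
  });
};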

We can now run this and we’ll see a bunch of word counts as expected. So straightforward! However, this is not useful to anyone who is not inspecting the console. Let’s get this data out of the server and to the frontend so we can line up their email messages with these counts. First we need to define an output in our sift.json.

An export is data that we want to leave the confines of the backend. Anything sent here will be available for API query and synchronized to the frontend in key/value form. We are calling our export count and we define a key schema that is a single string. All this means is that the system will expect a flat key for the data in this bucket. Let’s also go back to the node definition and make it reference this count.
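As an illustrative sketch, the top-level outputs section gains an exports entry:

"outputs": {
  "exports": {
    "count": {
      "key$schema": "string"
    }
  }
}

and the node definition lists count among its outputs:

"outputs": { "count": {} }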

full picture of the current DAG state

Now email data is set up to be processed through server/node1.js and exported out via count. Let’s modify our code in node1.js to emit something to count.
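Inside the map from before, each message now returns an object addressed at the count export; the emit shape (a name for the target bucket plus a key and a value) follows the scaffold’s comments:

return {
  name: 'count',      // the export bucket to write to
  key: 'word_count',  // the same key for every message, for now
  value: count
};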

Face of a Sift

Now let’s work on our presentation. The frontend of a Sift is essentially a progressive web application, with the Red Sift infrastructure taking care of synchronizing data from your backend and making the result of your computation immediately available on the frontend. The platform also provides a security framework that isolates access, so your frontend only has access to the data your backend has produced. Issues such as offline or partially connected clients, streaming data updates and security are automatically handled, but bear some explanation.

The Sift framework provides an MVC pattern for the frontend. This post covers the pipeline to build a Sift with ES2015 for your code and the Stylus preprocessor for your CSS. Rollup is used to package up the Sift so that it can be run on our platform. We use this setup to build our Sifts and we provide npm libraries (redsift-bundler, ui-rs-core) for you to use if you wish to adopt our setup. However, Red Sift is not opinionated about the framework choices you wish to make for your frontend. Future posts will cover building a frontend as a data-driven React application and data presentation via D3.js.

So far we have been successful in processing a set of emails and counting the number of words in the body, however we haven’t done anything useful with it yet. Let’s step through building a frontend for our Sift and presenting our data in the Summary View first. In the last step we exported data out to the ‘count’ bucket. The special thing about exported buckets is that they will be automatically synchronized with the frontend client (and stored using IndexedDB on our Chrome Extension and Web App redsift.cloud). The frontend can then query these exported buckets and display them in a meaningful way.

Here’s a general structure of the frontend folder.

A bit more info about each file.

  • package.json defines dependencies on the npm modules mentioned above
  • bundle.config.js declares the files that will get packaged up by the redsift-bundler
  • gulpfile.js defines the gulp tasks that will get executed to complete the bundling stage. The SDK is configured to always run the ‘default’ gulp task, so all custom gulp tasks should become subtasks of ‘default’.

The src folder contains the source scripts and stylus pre-processor files. The public folder is the final destination for the Sift frontend files. You can add assets and html files here and they will be accessible relative to the public folder.

For instance the file public/assets/blueprint.svg should be referenced as assets/blueprint.svg in your html files.

The redsift-bundler uses the public/dist folder to write all its destination files. Since the public folder is the only destination folder the Sift recognizes, you should ensure that any custom gulp tasks you add use it as their destination folder too.

The Controller

The Controller has access to the stored data, and we provide convenience methods for getting at it. You’ll never need to query the client data-store directly. The loadView() method of the controller is called when the Sift is ready to be loaded. You’ll need to return two important things from the loadView() method:

1) the path to the html file you would like the Sift to display to the user (as mentioned above relative to the public directory)

2) The data to be passed on to the View when it is displayed.

The Sift then prepares your View and calls presentView() when ready. The data that you returned from loadView() is now available inside your View and you can now display it where appropriate.

Data flow from Controller to View

Controller(.js) ⇒ loadView() ⇒ *.html ⇒ View(.js) ⇒ presentView()

We’ll focus on the summary.html file since this is the one that we’ll get the Sift to display in our Summary View. Near its end, summary.html references a script called dist/js/view.umd-es2015.min.js. This is the output script after the bundler builds the src/scripts/view.js script.

The great purge

Changing the outputs section in the sift.json from output1 to count is equivalent to a schema change in our IndexedDB, hence some house cleaning is required before proceeding.

Purging all our frontend data is as simple as clicking the ‘Delete’ button in the top right corner of the SDK. This will delete the data in IndexedDB and drop the previously created ObjectStores (~ tables). Once the SDK reloads, it will create the new schema based on the outputs we defined.

What about updates?

The frontend also receives updates when changes happen in the exported buckets. The Controller can subscribe for events from Storage on an array of buckets (here we’re interested in count) with the help of the internal storage object like this:

this.storage.subscribe(['count'], this._suHandler);

The last argument is a callback function that will be called once an event is received. The implementation in controller.js with the relevant sections uncommented and updated will look something like this:
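What follows is a sketch rather than the verbatim scaffold; the base class, registration call and storage query shapes are assumptions, so lean on the comments in the generated controller.js itself:

// frontend/src/scripts/controller.js
import { SiftController, registerSiftController } from '@redsift/sift-sdk-web';

export default class CounterController extends SiftController {
  constructor() {
    super();
    // Callback for storage events, wired up in loadView() below
    this._suHandler = this.onStorageUpdate.bind(this);
  }

  loadView(state) {
    // Subscribe for changes to the exported 'count' bucket
    this.storage.subscribe(['count'], this._suHandler);
    return {
      html: 'summary.html',
      data: this.getCount() // populates the view when it loads
    };
  }

  // Wraps a query against Storage for the value(s) in 'count'
  getCount() {
    return this.storage.get({ bucket: 'count', keys: ['word_count'] })
      .then(function (values) {
        return values[0] ? values[0].value : 0;
      });
  }

  onStorageUpdate(updates) {
    // Storage events arrive at bucket granularity, so re-fetch
    // and publish the fresh value to the View
    return this.getCount().then((count) => {
      this.publish('counts', count);
    });
  }
}

registerSiftController(new CounterController());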

So here we have a few new things going for us. The subscription to events from storage that we mentioned earlier happens in loadView(), and the callback we are using is set up in the constructor, where it provides a reference to the onStorageUpdate() method. That’s all that’s needed, but it’s just glue code; here are the important bits:

  • Events from storage arrive at a bucket level only; that’s why we need the getCount() method to fetch the values for us, which is basically wrapping a call to Storage. We can see it’s used in two places: the first populates the view when it loads, the second fetches the new data after an update.
  • The publish() method, which publishes the results of the getCount() call to the View. Well, that’s the intention; at the moment nobody has subscribed for "counts" events.

The View

The View can subscribe to events from the Controller in a similar fashion to the Controller subscribing to Storage with code similar to this:

this.controller.subscribe('counts', this.onCounts.bind(this));

All we are saying here is: when we have some data, use the onCounts() method. Data will arrive either when the Sift loads and the presentView() method is called, or when an update takes place.
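Fleshed out as a sketch (the base class and registration names are assumptions again, and the word-count element id is one we are introducing for this example), view.js looks something like this:

// frontend/src/scripts/view.js
import { SiftView, registerSiftView } from '@redsift/sift-sdk-web';

export default class CounterView extends SiftView {
  constructor() {
    super();
    // When the Controller publishes 'counts', render them
    this.controller.subscribe('counts', this.onCounts.bind(this));
  }

  // Called when the Sift loads, with the data returned by loadView()
  presentView(data) {
    this.onCounts(data);
  }

  onCounts(count) {
    // Write the running total into the summary card
    document.getElementById('word-count').textContent = count;
  }
}

registerSiftView(new CounterView(window));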

To complete the picture and bubble the data up to our Summary View, we need to add one more line in summary.html.
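An element for the View to write into is enough; the id just has to match whatever your view code targets (the hypothetical word-count from the sketch above):

<span id="word-count"></span>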

If you would like to use the CRUD paradigm as a mental map of the data operations in a Sift, it would go like this:

Create: exported data from outputs section of the DAG

Read: from Controller through the convenience methods of Storage

Update: events from Storage when changes happen on a bucket level

Delete: destroy with the UI button on the top right corner

Let’s run it now!

It’s alive!

Nothing happened? Did you delete your frontend DB?!!

We now have our display updating in real time as we export updated values from our Sift’s backend. But the number is not terribly useful; it is a random entry, as each data element we emitted had the same key name (word_count) and overwrote the last value we spat out. What we actually want to do is sum up each of the message word counts so we can have a running total. Ideally, we want to do this on the backend so we can scale our counting and limit the amount of data we have to export, and we can do this quite simply with another node.

Time for a change

First, let’s go back to server/node1.js and emit a key that is unique for each message.
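For instance, keeping the json variable from earlier:

return {
  name: 'count',
  key: json.id, // unique per message: this is the value.id of the JMAP event
  value: count
};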

value.id is a unique hash for each message. Next, as we don’t want all of these going out to the frontend individually, we rewire the output of this node to go to a ‘store’. A store is similar to an export but is not synchronized to the frontend. It also allows us to do much more sophisticated things, like applying grouping operators on the data. Our minimal sift.json already included a store; let’s rename it to messageCounts and modify our node to write to it. Next we add a second node that selects from messageCounts and writes to our export count.
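Sketching the relevant sift.json fragments in the same illustrative shape as before, the store:

"stores": {
  "messageCounts": {
    "key$schema": "string"
  }
}

and the two nodes wired through it:

"node1": {
  "implementation": { "javascript": "server/node1.js" },
  "input": { "bucket": "gmailEmails" },
  "outputs": { "messageCounts": {} }
},
"node2": {
  "implementation": { "javascript": "server/node2.js" },
  "input": { "bucket": "messageCounts", "select": "*" },
  "outputs": { "count": {} }
}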

For those of you familiar with functional programming, you can see we are building a functional pipeline, with the first node acting like a map and the second node acting like a reduce. The magic that turns the second node into a reduce is the fact that we have "select":"*" in the input section. As mentioned, stores can do fancy operations on the key space and this is one example: it asks for a grouping of all the keys in the bucket. Our implementation of node2.js is then simply this (a sketch, over the same got shape as before):
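// server/node2.js: a sketch over the same got shape as node1
module.exports = function (got) {
  // One invocation receives the whole group selected from messageCounts
  const total = got.in.data.reduce(function (sum, datum) {
    return sum + JSON.parse(datum.value);
  }, 0);
  return {
    name: 'count',
    key: 'word_count',
    value: total
  };
};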

We are just summing up the count values the previous node emitted.

Now when we run this we see our summary view counting up the number of words. This is nice to look at but isn’t exactly what we wanted. We want a summary on the list view against each message. Our deep integration with Gmail allows you to present your data in this list view by exporting data to a special export that our Chrome extension knows about. It will then merge and index your data so that we can quickly decorate a long list of entries with your badges and values in a performant manner. Let’s bring this special bucket into our Sift.

Gmail and Threads

Gmail, like many other clients, presents a threaded interface. Wiring up to threads on the server will let us decorate an email thread in a list view. Ideally we want to badge the word count of the last message in the thread that is not mine. While this sounds tricky, it is actually relatively straightforward once we add an additional concept into the mix.

I mentioned stores can do fancy things with keys. The major feature they support is key hierarchies. Let’s modify our store to accept a key hierarchy.
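Sketched as before, the key schema grows a second component, threadId first and message id second:

"stores": {
  "messageCounts": {
    "key$schema": "string/string"
  }
}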

We then modify node1.js to emit a hierarchical key and the date of the message.
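Keeping to our sketch, and assuming the JMAP payload exposes threadId and date fields (the console dump will confirm the exact names), the return from node1.js becomes:

return {
  name: 'messageCounts',
  key: json.threadId + '/' + json.id, // thread first, then message
  value: {
    count: count,   // the word count from before
    date: json.date // when the message arrived, for sorting later
  }
};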

And now modify the selection that node2.js is interested in to group all of the threadIds together and emit to the threads bucket.
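Sketched out, node2.js now receives one group per threadId and forwards only what the list view needs under the list attribute (threads being the special export our Chrome extension knows about):

// server/node2.js
module.exports = function (got) {
  // Every entry in this invocation shares the same threadId,
  // which is the first component of the hierarchical key
  const threadId = got.in.data[0].key.split('/')[0];
  const entries = got.in.data
    .map(function (datum) { return JSON.parse(datum.value); })
    .sort(function (a, b) { return new Date(b.date) - new Date(a.date); });
  // node1.js already skipped messages we sent, so the newest
  // entry is the latest message in the thread that is not ours
  const newest = entries[0];
  return {
    name: 'threads',
    key: threadId,
    value: {
      list: { count: newest.count } // only `list` reaches the list view
    }
  };
};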

These small tweaks have changed the functionality of the DAG significantly. We are grouping by each threadId and node2.js will now be able to sort the thread entries and emit the right value based on our requirements.

It also neatly solves incremental computation. Each new email message will trigger a mapping of the message and an incremental recomputation of the threadId it was assigned to. If the message is replayed through the graph, the recomputation will happen but the end result will be the same. All of this works because our nodes are side-effect free. This is a pure data flow, and keeping the relevant state of the system purely in the outputs, as a strict function of the inputs, will ensure your DAG is always correct and scales well.

Thread badges need some love too

We have all the information that we need indexed on the threadId in our database, so now it’s time to hook up the Controller to send data to the View, in this case the list view. The list view is a special case because it augments the existing UI of Gmail so design freedom is restricted. We have a few templates ready so there is no need for an HTML file like before, we just need to implement the special controller for this case called email-client-controller.js.
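Sketched in the same spirit as the other frontend files (the class and registration names are assumptions; the count field matches what node2.js put under list):

// frontend/src/scripts/email-client-controller.js
import { EmailClientController, registerEmailClientController } from '@redsift/sift-sdk-web';

export default class CounterEmailClientController extends EmailClientController {
  // Called per thread; listInfo is the content of the `list`
  // attribute node2.js exported for this threadId
  loadThreadListView(listInfo) {
    return {
      subtitle: listInfo.count + ' words'
    };
  }
}

registerEmailClientController(new CounterEmailClientController());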

If you uncomment the relevant sections, the only other change we made was to update the subtitle property to the value that we wanted. A small side note here about the loadThreadListView() method: your first assumption might be that everything exported to the threads bucket will arrive here. That’s only half true; pay attention to the list attribute we used above in the return statement of the node2.js implementation. Only the data under the list attribute will arrive as the content of the listInfo argument.

Once you run it and switch to the email-client view in the SDK, you should see something similar to the following.

email list view in the SDK

Hitchhiker’s guide…

There are many features we have not touched upon, such as joining data from multiple buckets, accessing user input and creating data that expires. Our documentation and open-source Sifts are a good place to start digging.

We are also really proud of our polyglot streaming compute solution. If you don’t want to manipulate messages in Javascript, how about Python? You can then use all the great Natural Language Processing and Machine Learning libraries in Python in a few lines of code. We automatically detect the language a node is written in and spin up the environment to support it. Not a pythonista but really want to use Keras? How about a mix & match: write one node in Python and the rest in the language of your choice. Today we provide first-class support for nodes written in Javascript, Python, Julia, Java, Scala and Clojure. Missing your favourite? Let us know by adding it to our feature-request list and we can add it.

Use it and share it

The final code for this blog post is available as a live Sift — you can install it on your Gmail Inbox here and browse the repository here. Feel free to fork it and deploy your own version instead. Found an opportunity for improvement or want to add cool new functionality? Send us a pull request or share your version of the Sift by publishing it on our catalogue. For example, we could start to personalize the reading metric for a user by making them take a Words Per Minute test and using that to scale the calculations instead of simply presenting a word count.

This is a fairly bare-bones Sift, but it provides useful functionality and an overview of what a Sift can do. You can find a polished version of the same functionality in our TL;DR Sift. Browse our catalogue and you can see other examples that build on the core capabilities we touched on and turn them into smart agents for your Inbox.

At Red Sift we are not just about email. We tackled email first because a lot of our data lives in the mess that is the average Inbox, but our vision is to Automate the Planet. Want a Slack bot that does more than hello world? We have a Sift for that, and like all our Sifts today it is also open source. Want to compute on your IoT stream? We have some great capabilities coming.

Credits: Amazon

Looking for inspiration? Have a look at our bounty board. Build something from there and share it with the world. If you let us know about it, we will send you a surprise box containing a second-generation Amazon Echo Dot!

Above all, we value the feedback of our fellow developers. Like, love or hate something we are doing? Let us know, it would be great to hear from you.
