Humanising Data

Back in 1999 it was estimated that we produced 1.5 billion gigabytes of data over the entire year. Fast forward to 2016, and it is estimated that we now produce upwards of 2.3 trillion gigabytes every single day!

Searches for ‘big data’ over time, ‘content marketing’ included to give a sense of the search volumes.

Big Data is no longer a buzzword; it has become mainstream. Gartner says 75% of businesses across all industry sectors are expected to invest in big data in the next two years. The data scientist role was not mainstream a decade ago, yet it was recently voted the sexiest job of the 21st century. A quick search for ‘big data’ on Google Trends confirms what we already know: interest has grown steadily and big data has truly gone mainstream.

In our previous series of blog posts we introduced our platform and explored how we used it for AI & NN experiments. In this blog post we’d like to describe the Redsift platform in more detail and explain our motivations.

Motivation

The Big Data Industry has matured to the point where there are several vendors providing products and services across most industry sectors. You have the Palantirs for Governments, Hadoops & Sparks for large enterprises and several point solutions for SMEs.

However, despite all the advances in big data, you and I still can’t do much with our own data today. The tools are fragmented and take non-trivial effort to set up. We call this class of data human-sized data, and we have built a cloud solution where you can securely process your data, produce meaning & insights and have it presented back to you beautifully.

We want to make data science accessible to everyone who wants to extract meaningful insights from their data. The ability to take data driven decisions shouldn’t be an exclusive club. Taking advantage of data in your everyday life is nothing short of exciting!

We address the end-to-end integrations necessary for data science so that users can focus on the business logic instead of being concerned with the setup steps. In practice, this means you have a few less things to worry about. Think of Redsift as a PaaS for your data.

The Redsift Solution

Using our SDK, developers can build and test a Sift, our term for a computational unit or application. The same Sift will then work seamlessly within our Chrome extension. A Sift has two components: a server component that runs in our cloud and a frontend component that runs inside a Chrome extension. A Sift is written in isomorphic JavaScript, but since our platform is polyglot you can write the server component in any language that best suits your computational needs: Python and Julia today, with Java, Scala, Go and R in the works. The frontend is written in HTML, CSS and JavaScript.

Our architecture is meticulously built on top of Docker and we take security very seriously. All your data is encrypted at rest using rotating AES 128-bit keys and the Sift runs in a secure sandbox to prevent data leakage and malicious use.

The computational unit is a Directed Acyclic Graph (DAG) defined as a JSON structure. It consists of an array of nodes with data flowing through them. Data enters the graph through input nodes and is exported from output nodes. Nodes typically have an implementation associated with them; this is the code you would normally write for a server.
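Stripped to its skeleton, the structure looks roughly like this (the comments are annotations for illustration and are not valid JSON; the full, working version appears later in this post):

```json
{
  "dag": {
    "inputs": { },           // where data enters the graph
    "nodes": [{
      "implementation": { }, // code to run for this node
      "input": { },          // bucket(s) the node reads from
      "outputs": { }         // bucket(s) the node writes to
    }],
    "outputs": {
      "exports": { }         // data synchronised out of the graph
    }
  }
}
```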

With Redsift, the data computation process reduces to downloading our SDK, reading our documentation, firing up a command to create a template Sift and then modifying that Sift to suit your data computation needs. Once you run the Sift you have a beautiful frontend to display all your results. We provide some d3.js visualisation components for you to use, but you are free to add to or replace them with any other visualisation tool of your choice. We synchronise all of your computed data to the frontend, allowing you to focus entirely on the interface and functionality of your Sift.

Redsift provides first class support for email data. We started with email because it is a personal data-store that contains troves of interesting data but is usually inaccessible because mail apps don’t provide a data computation layer. We abstract away all the messy details of IMAP and MIME parsing and provide your emails in the JMAP format, which exposes all the interesting attributes of a message: from, to, subject, plain & HTML body, headers and mailboxes.
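As a rough illustration of what that access looks like, here is a small helper that pulls a few of those attributes out of a JMAP-style message object. The field names below (from, to, subject) follow the JMAP draft; the exact shape delivered by the platform may differ.

```javascript
// Sketch: extracting common fields from a JMAP-style message object.
// The field shapes here are assumptions based on the JMAP draft.
function summarise(message) {
  return {
    sender: message.from && message.from[0] ? message.from[0].email : null,
    recipients: (message.to || []).map(function (r) { return r.email; }),
    subject: message.subject || '(no subject)'
  };
}

// Hypothetical message, for illustration only
var example = {
  from: [{ name: 'Alice', email: 'alice@example.com' }],
  to: [{ name: 'Bob', email: 'bob@example.com' }],
  subject: 'Weekly report'
};

var summary = summarise(example);
```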

Our stats-e Sift summarises email statistics for your Inbox.

More than email

You can mix in more data through our webhooks. This allows for a whole new class of computation whereby data can be sourced from other channels, such as IoT devices, Slack and other services, letting you perform computation on data from as many sources as you need. You’ve always wanted those Slack commands and bots to do magical things, right? Now you can also process data instead of merely piping it.

Sift Store on our Chrome Extension

Collaboration and Publishing

Building things and seeing them work is very rewarding. Being able to collaborate on or share your work with others is even better! Finding and installing Sifts on our platform follows the GitHub paradigm. You can work with your peers on a Sift the same way you would on a regular coding project with a repository on GitHub.

It’s also a great conversation starter: “What does your Sift do?”

A Deeper Dive

Let us take a deeper dive into how a Sift is built. We start with a very simple example that takes in as input a set of emails and then outputs a count of the number of emails that were processed.

We start by defining a DAG which takes as input all emails from the past week; the date range is specified as a filter attribute and the input port is named ‘gmailEmails’. The DAG is made up of an array of nodes. The first node, ‘Map’, takes the input ‘gmailEmails’ and produces an output named ‘messages’. It is implemented in JavaScript, processes incoming emails and emits an empty value for each email processed. The next node in the graph, ‘Count’, takes ‘messages’ as input and selects all keys written to it. Its implementation, again in JavaScript, counts the number of keys contained in ‘messages’ and emits the result to the output ‘count’. The ‘count’ output is exported, which means it will be synchronised and made available for the frontend to query. As emails are fetched and sent to the ‘Map’ node the count keeps updating, and it stops when all the emails have been processed.

Visualisation of how data flows through the DAG we constructed.

Here is a snippet from the sift.json file which defines everything necessary for our platform to run the Sift:

"dag": {
"inputs":{
"emails":{
"gmailEmails":{
"filter":{
"conditions":[{
"date": "between now and 1 week before now"
}],
"operator": "AND",
"wants": ["archive", "htmlBody"]
}
}
}
},
"nodes":[{
"#": "Node #1: Map",
"implementation": {
"javascript": "server/map.js"
},
"input": {
"bucket": "gmailEmails"
},
"outputs": {
"messages": {}
}
},
{
"#": "Node #2: Count",
"implementation": {
"javascript": "server/count.js"
},
"input": {
"bucket": "messages",
"select": "*"
},
"outputs": {
"count": {}
}
}],
"stores":{
"messages" : {
"key$schema":"string"
}
},
"outputs":{
"exports":{
"count" : {
"key$schema":"string"
}
}
}
}

map.js code snippet:

module.exports = function(got) {
  // inData contains the key/value pairs that match the given query
  const inData = got['in'];
  return inData.data.map(function(datum) {
    const message = JSON.parse(datum.value); // JMAP message
    // Emit an empty object for each message so the count can be
    // calculated in the next node
    return {
      name: 'messages',
      key: datum.key,
      value: {}
    };
  });
};

count.js code snippet:

module.exports = function(got) {
  // inData contains the key/value pairs that match the given query
  const inData = got['in'];
  // Count every key emitted by the Map node
  const total = inData.data.length;
  // Return the total count of processed emails under the key 'TOTAL'
  return { name: 'count', key: 'TOTAL', value: total };
};

You can build Sifts on our platform across a whole spectrum of data computation, all the way from simple aggregation like the count illustrated above, to joining data across two event streams, to implementing complex Machine Learning algorithms. Want to add in human-assisted workflows like Amazon Mechanical Turk to process that odd receipt which couldn’t be recognised? Our platform supports that too.
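To give a flavour of the join end of that spectrum, here is a plain-JavaScript sketch of a keyed join across two event streams. This is illustrative logic only, showing the kind of computation a multi-input node could perform; it is not platform API.

```javascript
// Illustrative only: join two arrays of { key, value } records on key.
function joinByKey(left, right) {
  var rightIndex = {};
  right.forEach(function (r) { rightIndex[r.key] = r.value; });
  // Keep only left records whose key also appears on the right,
  // pairing the two values under each shared key
  return left
    .filter(function (l) { return rightIndex.hasOwnProperty(l.key); })
    .map(function (l) {
      return { key: l.key, value: { left: l.value, right: rightIndex[l.key] } };
    });
}

var joined = joinByKey(
  [{ key: 'a', value: 1 }, { key: 'b', value: 2 }],
  [{ key: 'a', value: 10 }]
);
```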

Redsift vs Others

Compared to other big data technologies, Redsift is purpose-built for processing a large volume of small events. This focus on size is evident in the limits we set on the platform, but the trade-off buys lower overhead latencies: single-digit millisecond latencies for complex joins and filters. In many ways, Redsift provides a simpler set of primitives than other stream processing solutions; however, each primitive is much more expressive.

In addition, security is a primary design goal of our graph engine. Our process model is unique in that it provides isolation from the underlying platform as well as containerisation and isolation of the data itself. Processes on the graph cannot accidentally or maliciously crosstalk without specific provisioning. The data units themselves are either entirely ephemeral or encrypted at rest with per-user AES 128-bit keys.

These trade-offs fit the domain we are addressing and map well onto email and other message event sources, since each quantum of data is typically much smaller than in other ‘big data’ domains. Interactive applications demand low, predictable latencies, and this is what Redsift delivers.

Intrigued?

We are handing out a new batch of invites for our early access program in the coming months and we would love to see the interesting Sifts you can come up with. If hacking your own data and potentially creating solutions that can help thousands (or millions) of other people sounds interesting to you, why not register here for early access or email us?