The SPARK that fires intelligent search and more: The Cloudtenna Data Graph and real-time data modeling

In a recent post, we discussed Cloudtenna’s new approach to Identity Management (IDM). But now that you have an identity, what happens next?

Identity management opens the door to the many file repositories a user accesses every day. Files are scattered across these disparate data silos, and it has become virtually impossible to keep track of the resulting mess. Cloudtenna catalogs this file activity — we call this the data graph.
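As a rough illustration (the names below are ours, not Cloudtenna's actual schema), each entry in such a catalog can be thought of as a file-activity event: who did what to which file, in which silo, and when:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class FileEvent:
    """One entry in a file-activity catalog (illustrative field names)."""
    silo: str        # e.g. "google-drive", "salesforce", "on-prem-nfs"
    path: str        # file path or ID within that silo
    user: str        # identity resolved by IDM
    action: str      # "create", "edit", "share", "delete", ...
    timestamp: datetime

# A time-ordered stream of events like this is the raw input to the data graph.
event = FileEvent(
    silo="google-drive",
    path="/q3/forecast.xlsx",
    user="alice@example.com",
    action="edit",
    timestamp=datetime.now(timezone.utc),
)
```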

The data graph is the foundation for the intelligent search, audit, and governance functions offered by Cloudtenna. Cloudtenna performs real-time data modeling on this data set to (1) return insights about the data and (2) filter the data so that employees can only access files they have permission to view.

It is critical that this data modeling happens in real-time. If it doesn’t, even something as simple as a file search is rendered ineffective. Imagine if you performed a search, but the files you worked on earlier today didn’t show up. Or worse, imagine if you performed a search and saw files that you are no longer supposed to have permission to read. File management services require access to “hot” data — the data most recently updated.

To ensure that Cloudtenna is working off the latest set of data, our machine learning performs data modeling in real-time. Crunching that much data at the necessary speed used to be a pipe dream, but it has recently become a reality with in-memory processing and “Fast Data.”

In order to look forward, we must first take a step back and look at what came before the advent of Fast Data — an emerging term for real-time analytics on data sets, mining data in order to take action that affects business outcomes.

Hadoop: the good and the bad
Prior to Fast Data applications, real-time analytics on large data sets were essentially impossible. Hadoop and other big data services pioneered data modeling on top of massive data sets, but did so at less than breakneck speeds. Big data is well suited to use cases that require one-off reports, but fails to return results fast enough for the file search, audit, and governance functions Cloudtenna provides.

Think about it: across a large enterprise, employees create and edit thousands of files each day. Each employee, of course, can see a different subset of these files based on file permissions. Every data model must take into account the latest file changes, and do so individually for each user. That’s a lot of data modeling!
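The per-user nature of the problem can be sketched in a few lines. This is a toy model, not Cloudtenna's implementation: one shared event stream, filtered through each user's permission set, yields a different view for every user — and every new event changes every affected user's view.

```python
# One shared stream of (actor, file, action) events across the company.
events = [
    ("alice", "/finance/q3.xlsx", "edit"),
    ("bob",   "/eng/design.doc",  "create"),
    ("carol", "/finance/q3.xlsx", "share"),
]

# Per-user permission sets (in reality derived from each silo's ACLs).
permissions = {
    "alice": {"/finance/q3.xlsx"},
    "bob":   {"/eng/design.doc", "/finance/q3.xlsx"},
}

def visible_events(user):
    """Return only the events touching files this user may see."""
    allowed = permissions.get(user, set())
    return [e for e in events if e[1] in allowed]
```

Here `visible_events("bob")` returns all three events, `visible_events("alice")` returns two, and a user with no permissions sees nothing — the same data, remodeled per user.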

Data modeling needs to be fast enough to keep up with this demand. Hadoop, a common big data platform, has inherent limitations that prevent it from returning the speeds required. Hadoop is a framework traditionally used to analyze huge datasets by virtualizing data storage and distributing the data across multiple compute nodes. The problem is that when workloads become too heavy for in-memory processing, Hadoop sends overflow data out to slower storage. While Hadoop is extremely effective at dealing with large data sets, it suffers latency hits when it must access storage across the network as memory caches fill.

Imagine an Excel table: you can add as many rows as you want. It can grow to be immense, but there is always a map of where everything is — row 156 in column 78, for example. SQL has been the standard language for querying these large, structured datasets.

Hadoop has been great for interrogating ridiculously large sets of data, and is commonly used for advertising, reporting, or product research on large datasets of customers. For example, you could use it to very efficiently and accurately crunch the numbers to discover how many cookies consumers purchased between 11pm and 2am on April 20th.
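That kind of one-off report is exactly what a batch query excels at. A minimal sketch of the cookie example, using an in-memory SQLite table with made-up data:

```python
import sqlite3

# Illustrative batch-style report: purchases of one item in a time window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (item TEXT, qty INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("cookies", 2, "2018-04-20 23:15"),
        ("cookies", 1, "2018-04-21 01:40"),
        ("milk",    1, "2018-04-20 23:30"),
        ("cookies", 3, "2018-04-20 14:00"),  # outside the window
    ],
)
(total,) = conn.execute(
    """SELECT COALESCE(SUM(qty), 0) FROM purchases
       WHERE item = 'cookies'
         AND ts BETWEEN '2018-04-20 23:00' AND '2018-04-21 02:00'"""
).fetchone()
# Accurate, but the answer is allowed to take minutes -- fine for a report,
# not for an interactive search box.
```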

The problem is that, eventually, you have a huge number of data rows to sort and analyze, and network storage can’t feed the much faster memory quickly enough to deliver intelligent real-time analytics with instantaneous results.

Spark makes Cloudtenna’s Global Data Graph fly
This makes Hadoop far less effective for the constant real-time data analysis and sorting required for intelligent enterprise search and other data-intensive auditing and governance activities. Because our vision for what we wanted to accomplish was clear from the beginning, we chose to build the Cloudtenna platform on Apache’s cutting-edge Spark platform. In-memory processing like Spark’s has only recently become feasible, as RAM prices have dropped below critical thresholds.

Rather than extending jobs out to slower data storage across the network in the fashion of Hadoop, Spark works on those datasets in-memory, sidestepping both higher-level system-management overhead and the latency introduced by storage. This makes it dramatically faster, and a perfect tool for what we set out to build.

For example, to perform what we call “user reconciliation” — the process of determining which subset of files each individual user has access to — we must understand the many proprietary file-permission paradigms present across every data silo, whether users keep files in Google Docs, Salesforce, or on-prem storage. We generate a file tree for each individual user in real-time, reflecting every action taken on files by all users in the company, with live data sets, across every silo where data is stored.
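A toy sketch of the reconciliation idea (the silo names, ACL shapes, and `reconcile` helper are ours, invented for illustration): each silo exposes permissions in its own format, and they get normalized into one per-user view of accessible files.

```python
# Each silo reports permissions in its own shape (greatly simplified here):
# a mapping from file identifier to the users allowed to see it.
gdrive_acls = {"/q3/forecast": ["alice", "bob"]}
onprem_acls = {"\\\\share\\specs\\design.doc": ["bob"]}

def reconcile(user):
    """Collect every (silo, file) pair the user can access into one tree."""
    tree = {}
    for silo, acls in (("google-drive", gdrive_acls), ("on-prem", onprem_acls)):
        tree[silo] = sorted(f for f, users in acls.items() if user in users)
    return tree
```

In a real system each silo has a genuinely different permission model (Drive sharing links, Salesforce roles, NTFS ACLs), so the normalization step is where most of the work lives; the per-user tree must also be recomputed as events arrive, not nightly.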

As mentioned, a Walmart-sized data analysis using Hadoop will filter through the data and return a very accurate report based on the parameters of the query and an existing dataset, however large it may be. But for search, results must be instant and relevant to the user at hand. Every time you search for a file, a colleague could have opened it just before you, perhaps in a different data silo, changing its relevance ranking by time, content, file location, and more. Intelligent search must engage all these variables and treat the data as a living, breathing organism.

Instead of real-time data modeling, traditional enterprise search companies cache data models on a daily or weekly basis to reduce the amount of data processing necessary. The result is stale answers. Cloudtenna uses Fast Data to return results from the latest data sets and to ensure those results are bound by the latest set of file permissions.

We’ll go deeper next week when we discuss how the intelligence gathered from the Data Graph is augmented by a second set of data that we call the User Graph. These graphs work together to deliver industry-leading intelligent features for enterprise information management.