GRAKN.AI at Data Day Texas 2017

Reporting the highlights of #ddtx17

Borislav Iordanov
Vaticle
7 min read · Jan 18, 2017

--

By Erik A. Ellison [CC BY-SA 3.0], via Wikimedia Commons

This past weekend, I had the good fortune to attend this year’s Data Day Texas conference in Austin, alongside my young and energetic boss/CEO, Haikal Pribadi. The conference had several parallel, somewhat related tracks: graph database technology, natural language processing, and data management in health care. It was an intellectually stimulating, cordial, and professional environment.

What follows is an overview of some of the highlights, for me at least, from the conference.

Both NLP and health care are domains that epitomize the problems knowledge graphs solve. In fact, the term knowledge graph is increasingly used instead of database to refer to application data storage. Gabor Melli, from OpenGov, talked about building knowledge graphs interlinked with text corpora. Various predictive modelling techniques can then be applied to these knowledge graphs for tasks such as concept mention discovery.

In a similar vein, Sanghamitra Deb, from Accenture, presented a tool to extract knowledge bases from text in a semi-supervised way, using rules. The idea is to bootstrap a training set by running rules over text with minimal human validation, and then use that training set as input to machine learning models.

Finally, in the NLP category, a very nice review-type talk on the current state of the art of NLP was given by Jonathan Mugan of Deep Grammar. Jonathan made the point that currently successful NLP software does not exhibit anything close to actual understanding. Rather, true AI requires common sense, which can be achieved only through embodied intelligence, a point I suspect is well understood by anyone taking AI seriously.

On the graph database side, Corey Lanum, from Cambridge Intelligence, showed us some nice examples of graph data visualization on a geospatial map with time-based information. I found the "geo" part useful, but the time component less compelling. In all fairness, it’s a tough problem: how many dimensions can you put on a flat screen without butchering the visualization?

We saw Luca Garulli, from OrientDB, give his perspective on the graph database market. Lack of standardization (modulo TinkerPop) is a pain point for everybody. Of course, standardization is difficult because different vendors propose different data models, so it will be a long road before a standardized, i.e. vendor-agnostic, query language takes hold at the level of SQL. Fortunately for us at GRAKN.AI, while standardization may take time, we already have a powerful, functional, and versatile query language, our beloved Graql, that developers can quickly learn and apply to their data problems.

Speaking of which, Juan Sequeda presented some recent work from the Linked Data Benchmark Council (LDBC), where a task force was formed to analyze both existing languages such as Gremlin, SPARQL, Cypher, and SQL, as well as the design space specifically for property graphs. One pertinent insight was the idea of composability: if one sees the semantics of a query as an operation from data to a result, that operation is not closed in the algebraic sense, since the result is not of the same kind as the original data: it is not a graph. In SQL, one queries tables and gets back tables. This makes the language composable (hence the "structured" part). None of the current graph query languages has that property.
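To make the composability point concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and names are invented for illustration): because a SQL query over a table yields another table, its result can be dropped straight into an outer query.

```python
import sqlite3

# In-memory database with a tiny 'friends' edge table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (person TEXT, friend TEXT)")
conn.executemany("INSERT INTO friends VALUES (?, ?)",
                 [("alice", "bob"), ("bob", "carol"), ("alice", "carol")])

# The inner SELECT produces a table, so it can serve as the FROM clause
# of an outer SELECT: the language is closed under querying.
composed = conn.execute("""
    SELECT person, COUNT(*) AS n
    FROM (SELECT person, friend FROM friends WHERE friend != 'bob')
    GROUP BY person
""").fetchall()
print(sorted(composed))  # per-person friend counts, 'bob' edges excluded
```

A graph query, by contrast, typically returns bindings, paths, or tabular projections rather than a graph, so the same query-over-a-query nesting is not available in general.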

On the theoretical side, there was an amazing talk by Professor Alexandros Dimakis (University of Texas) on graph analytics. He explained the notion of k-profiles, which are global properties of a graph characterizing the frequencies of certain k-node subgraphs that occur as motifs. Such profiles capture the structural characteristics of graphs at large scale, and there is research that relies on these characteristics to classify types of graphs.
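As a toy illustration of the idea (my own sketch, not from the talk), the simplest non-trivial entry of a 3-profile is the triangle count, which can be computed on a small graph by brute force:

```python
from itertools import combinations

# Undirected toy graph as an edge set; triangles (3-cliques) are the
# simplest non-trivial 3-node motif counted by a graph's 3-profile.
edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def triangle_count(adj):
    """Count 3-cliques by checking every node triple (fine for tiny graphs)."""
    return sum(1 for u, v, w in combinations(sorted(adj), 3)
               if v in adj[u] and w in adj[u] and w in adj[v])

print(triangle_count(adj))  # 2: {a, b, c} and {c, d, e}
```

The full k-profile also tracks the other k-node motifs (paths, stars, and so on); on real graphs one uses far more efficient counting schemes than this cubic scan.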

One practical application of Prof. Dimakis’s talk would seem to be anomaly detection: fake online users or fraudulent behavior. Another interesting theoretical takeaway was the notion of sparsification: sparse graphs or matrices can be manipulated with efficient algorithms. When a graph is not sparse, one can randomly delete edges to make it sparse, compute the global profiles on the sparsified version efficiently, and then estimate the actual profile of the original graph. Techniques like these have enabled Prof. Dimakis to design an algorithm that returns the top elements according to PageRank with high probability, without computing the actual ranks. Pretty cool stuff!

Finally, the highlight for GRAKN.AI was — of course — that Haikal and I had the opportunity to present our work, for which we are very thankful to the organizers and Lynn Bender in particular. I presented our work on analytics.

The algorithms I described are due to our amazing engineer-researchers Jason Liu and Sheldon Hall.

Haikal did an amazing job explaining why there is currently a gap in the graph database space and how GRAKN.AI fills it.

Funnily enough, before the talk I tried to encourage him to be more succinct in explaining some of the basics, thinking the audience was already exposed to them. I couldn’t have been more wrong: not only did people appreciate and connect with our technology, but the cameraman ended up praising the talk to the organizers as the one talk that allowed him to understand what the conference was about. As a result, Haikal gave an interview, to be published on YouTube alongside the presentations and interviews of other, already established players. We’ll post a link when we have it.

Data Day Health was a no less productive conference. It was smaller, but that can be an advantage. Particularly relevant was that the theme of knowledge graphs showed up again and again. Many startups model medical information in a knowledge graph and then build a traditional expert-system-style application on top, sometimes combined with machine learning, but often taking a purely logical AI approach. Denise Gosnell presented PokitDok, which offers a platform with APIs to handle health care transactions, such as determining insurance eligibility.

Stephen Bar, of the Seattle Cancer Care Alliance, showed a really cool (I mean geeky cool!) system that models complex cancer treatments as Haskell programs stored in a graph database, which are then matched against patient profiles to verify that a treatment is correctly applied (i.e., it type-checks in Haskell, if I understood correctly). So the Hindley-Milner type system is now used to cure cancer. That should shut up any skeptics of pure functional programming!

Of course, there had to be an “app” from Silicon Valley, too. Nikhil Buduma from Remedy Health started by promising to re-think the US healthcare industry from the ground up, because it has so many problems, and he is doing it the Valley way: with an app. The concept of the app is fairly unoriginal IMHO, but the execution is incredibly hard to pull off, and Remedy Health appears to be doing an amazing job. The app is built around a human-in-the-loop self-learning concept: an expert system collects symptoms and is trained to ask the right questions before forwarding the information to a doctor, who then makes a diagnosis and prescribes medication. All for $30 and no paperwork whatsoever. Good luck to those guys!

Cambridge Semantics showed their all-in-one, high-performance system that can store lots of medical data in a ginormous triplestore. Finally, Juan Sequeda, mentioned above on graph query languages, pitched his company, which virtualizes RDBMSs as triplestores, adds some OWL inference into the mix, and allows doctors to perform semantic searches over their otherwise impenetrable hospital legacy databases.

Finally, I don’t want to leave without mentioning one of the newcomers in database technology: ScyllaDB, a drop-in replacement for Cassandra written in C++. We shall put it through its paces at GRAKN.AI and report back, but having had the pleasure of talking about it with one of its founders, Dor Laor, I’m highly optimistic.

Overall, we had a fantastic time in Austin, and met some wonderful people pushing forward graph database and related technologies. The organization of the conference was excellent, and I would encourage you to consider attending in the future. The one and only criticism I would submit to the organizers is that the number (7–8) of concurrent, high-quality talks made it difficult to pick which sessions to attend. This sentiment was shared by many attendees. Hopefully, future conferences will somehow remedy the situation.

Based on what I observed, it’s fair to say that knowledge graphs will form the foundation for many innovative apps; yet, so far, there isn’t any established product or set of practices on how to implement them. Moreover, analytics over knowledge graphs (especially when inference is at play) is a largely untouched field, and a great research opportunity. Here at GRAKN.AI, we’re working to provide an analytically powerful, developer-friendly platform for knowledge graphs, so we’re very optimistic about the future.

I am told the talks should be available on the Global Data Geeks YouTube channel in about three weeks’ time.

May the force be with you,

Boris
