Can Graph Integrate Data At Scale? Hint: Yes, but the answer isn’t what you think.

At Cambridge Semantics, we have built our company on the vision that the semantic graph data model uniquely accommodates the speed and complexity of modern data integration initiatives, and therefore, scaling graphs has become imperative for us and our customers. However, this capability is much in question in the marketplace.

  1. How Semantic Graph Models uniquely address these requirements
  2. Anzo’s capabilities enabling graph-based data integration at enterprise scale
  3. The role of Kubernetes in our approach
  4. Real-world use cases

Executive Summary

Leaders have expectations of speed, flexibility and scale for their digital transformation initiatives. Central to meeting those expectations is the ability to integrate data across the business rapidly, while dealing with complexity and uncertainty. A modern data management strategy, often referred to as “The Enterprise Data Fabric”, provides this capability by meeting the following requirements:

  • Presenting data sets and data products using common business data models
  • Addressing unanticipated questions and requirements.
  • Semantic graph models fundamentally support the expression of models in business terms.
  • Semantic Graph models were designed for unanticipated questions.
  • Integration of clinical and research data within BioPharma.
  • Customer 360 across industries

Integration Requirements for The Enterprise Data Fabric

The Enterprise Data Fabric is an architecture for modern data management that anticipates the need to connect data across the enterprise at speed and scale.

Enterprise Data Fabric Architecture
  1. The data fabric presents data based on common models described with business concepts — the way people think about the data itself, as well as the connections between data sources and data elements. Data consumers use these common models to access data sets or combinations of data most important to their analytic tasks. Applications tie into feeds based on these common models as required.
  2. The data fabric must address unanticipated questions and requirements. Traditional data integration approaches, including warehouses, were designed to accommodate pre-defined sets of business questions based on known data sources. The data fabric, which integrates all the data in the business, adapts to new questions, and to new models and ways of looking at the data on the fly. The data fabric accommodates new analytic use cases and initiatives without starting from scratch. The data fabric “inverses” the traditional application development process and yields an ad hoc “question-and-answer” layer, which empowers users to “know what the enterprise knows.” Each time a user has a new question, IT is no longer required to prepare the data to answer the question.

Why Semantic Graph for Integration?

The Cambridge Semantics team has been applying graph data models to data integration for almost two decades, beginning with our early research at IBM leading up to our founding in 2007. As research firms including Gartner and Forrester have identified graph as a central capability in the data fabric, the centrality of graph technology for data integration is growing in prominence. As our name suggests, we specifically mean semantic graph data models based on the W3C Semantic Technology standards of RDF, OWL, and SPARQL.

Semantic Graph Data Model
  1. Semantic graph models fundamentally support the expression of models in business terms. Concepts, called Classes define what each node in the graph represents, and Properties define the meaning of edges as well as the attributes of nodes. Collections of related Classes and Properties and their characteristics make up an Ontology that provides meaning defined outside of the graph itself. Ontologies, therefore, support inheritance, linkages, and reuse to facilitate a connected business level definition of all an enterprise’s data assets.
  2. Semantic Graph models were designed for unanticipated questions. Unlike relational schema or LPG definitions constructed for a known universe of questions, semantic graph models allow users to pivot their questions by following relationships in the graph. The semantic models allow users to make sense of these relationships in asking their questions. When requirements change drastically, semantic models also allow dynamic remodeling of data to support new types of questions, analytic roll-ups, or simplifying the context for a particular audience. When multiple data sources are connected in the semantic graph, dead-ends in the data vanish; the unanticipated becomes the intuitive. Semantic Graph models embrace uncertainty. Whereas relational models implicitly assume the model is “correct,” semantic graph models assume they never have all the information, so they expect the unexpected. Just as data changes, so do our models whenwe learn new information, crises appear, and new requirements emerge.

How does Cambridge Semantics make this work at scale?

Most analyst research focuses on the use of knowledge graphs for metadata management and data cataloging in the data fabric. We believe that is an important foundation, but it misses graph’s true calling which is integrating the data itself. Based on the data fabric requirements and capabilities above, graph must scale for both metadata and data.

Metadata

Our Anzo product uses semantic graphs to manage and model all metadata throughout the data fabric including source system metadata, business concepts, ontologies, transformations, rules, analytics, access control and even underlying cloud infrastructure. Apart from a convenient, self-similar architecture and programming model for our development teams, the use of a semantic graph model for the metadata catalog affords an abundance of external benefits.

Metadata-driven Workflow

Data

As important as effective use of metadata is for the data fabric, the data fabric is all about, well, the data itself. Solutions that use graph only to catalog data sources, manage models and vocabularies, or virtualize queries fall well short of realizing the true value of graph.

Complex Data Model

The Graphmart

The graphmart is a metadata-driven structure in Anzo where users organize, combine, connect, and transform data from different sources with a novel construct called “Data Layers.” Each data layer in a graphmart contributes a logical sub-graph to the overall graphmart. Each layer is individually secured and may be turned on and off dynamically. Anzo uses the set of layers in the graphmart to load data into AnzoGraph or run queries to transform the graph, creating new RDF triples in memory. The power of the underlying AnzoGraph engine allows the graphmart to manage 10s to 100s of billions of RDF triples across as many sub-graphs as required while allowing users to iterate quickly on their graphmart design and data model.

Spark-based ETL Data Onboarding
MPP ELT Data Loading
Graphmart sourced through three options.
Seamless access to the Graphmart

Kubernetes: Why “How many triples can you store?” isn’t the right question.

At this point, you might be wondering how we manage all the compute required to support all of this loading and integration across data sources, users and use cases in an enterprise data fabric. This question is often expressed by prospects and customers in variants of “How many triples can you store in Anzo”? Such a lens is natural. The semantic technology and graph database markets have typically oriented people around the notion of a graph database or a triple store. How much can it hold? How much can it query? As it turns out, AnzoGraph does both better than anything else. With horizontal scale, we have turned up the dial to 1 trillion triples in memory and queried it orders of magnitude faster than other approaches. Typical customers today are in the 10’s of billions and growing for interactive query. But even with AnzoGraph’s breakout load and query capability, the traditional “single cluster or database” mindset breaks down for enterprise data fabric scale integration.

  • loads the data from onboarded files or direct from source (options above)
  • runs data layers to connect, integrate and transform the data
  • notifies the user their graphmart is ready
Kubernetes Architecture on AWS

A few examples

So, how is this applied in the real world? Cambridge Semantics has proven graph capable of integrating data at scale. I’ll share a few examples here to conclude.

Insider Trading Surveillance

Financial services firms, including hedge funds, use the data fabric to integrate and model data coming from a variety of sources including:

  • pricing data
  • electronic communications (chats, emails)
  • login data
  • badge swipes
  • call logs

Clinical Data Unification

Several of our BioPharmaceutical customers use the data fabric to unify patient data across 100’s or even 1000’s of clinical trials. This helps them analyze data across trials for all kinds of interesting use cases beyond the intended design of the data. Repurposing or retargeting existing drugs is one such use case. The challenge lies in the fact that data collected for each study follows a set of submission standards designed to allow standardized FDA review of the given study. These standards do not, on their own, ensure the data is suitable for integrated analytics across studies. Further unification is required.

Chief Revenue Officer and Co-founder, Cambridge Semantics Inc

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store