Use Case: Using InfiniteGraph for Deep Link Analysis
Introduction
Graph technologies have been around for a number of years, but adoption of graph databases was slow until the last few years. One reason was that end users did not understand the power of graphs; another was that early adopters ran into performance issues as the size of the graph exploded. With incumbent graph databases, performance also degrades as the graph models get more complex.
Why would you want to use a graph database? One of the key advantages of a graph database is the ability to model real world situations in which there is a lot of data and there are many complex connections between the data. Those connections include not only the ones reflected in the original graph model but also new connections as they are discovered.
Many current database solutions, whether relational, NoSQL, Big Data or graph, sacrifice either speed or scale when confronted with extremely large volumes of complex, highly connected data that require results in operational real time. Traditional databases have addressed this problem with a scale up approach (rather than scale out). But, of course, with a scale up approach you sooner or later reach a limit on memory, processing power or disk storage.
As the name suggests, databases are built to store and query data as efficiently as possible, and a great deal of effort has gone into optimizing them for that workload. There have been attempts to support a different type of query, one that navigates through the data searching for connections. In traditional databases this typically results in poor navigation performance because of a phenomenon known as the recursive join: each step of the navigation is a lookup by data value, so every additional level of connection requires another join. This becomes very painful when searching highly connected data for deeply nested connections. Graph databases tend to overcome the recursive join problem by building connections directly into the graph model and the data. In graph database terminology these are known as Nodes (data) and Edges (connections), and together they support direct navigation of the graph, which scales and supports high performance queries no matter how complex or large the graph becomes.
InfiniteGraph, built on proven database technologies that use a federated, distributed and massively parallel architecture, allows complex graph models to be implemented that operate at speed and scale, even when the graph exceeds memory limitations. InfiniteGraph is built on technology that allows the graph to grow continuously, with new data added even while queries are running and new connections are being discovered.
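The difference between navigating by value lookup and navigating along stored edges can be illustrated with a minimal Python sketch. It is not InfiniteGraph code; the toy `calls` data and the function names are hypothetical and exist only to show the idea.

```python
from collections import defaultdict

# Hypothetical toy data: (caller, callee) pairs standing in for call records.
calls = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]

def reachable_by_join(calls, start, depth):
    """Navigate by value lookup: every hop rescans the whole table,
    much like one more self-join per level of connection."""
    frontier, seen = {start}, {start}
    for _ in range(depth):
        found = {callee for caller, callee in calls if caller in frontier}
        frontier = found - seen
        seen |= frontier
    return seen - {start}

def build_adjacency(calls):
    """Store the connections once, as edges attached to each node."""
    adjacency = defaultdict(list)
    for caller, callee in calls:
        adjacency[caller].append(callee)
    return adjacency

def reachable_by_edges(adjacency, start, depth):
    """Navigate directly: each hop only touches the edges of the frontier nodes."""
    frontier, seen = {start}, {start}
    for _ in range(depth):
        found = {n for node in frontier for n in adjacency.get(node, ())}
        frontier = found - seen
        seen |= frontier
    return seen - {start}
```

Both functions return the same set of nodes. The first does work proportional to the size of the whole table at every hop, while the second does work proportional only to the edges it actually follows.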
InfiniteGraph supports many different operational deployment environments from single workstation/server to very large clusters of workstations/servers in a distributed network. InfiniteGraph can be deployed in the cloud, on premise, or a mixture of both, on commodity hardware or specialized High Performance Computing systems.
With InfiniteGraph it is just as easy to scale out as it is to scale up. With a scale out approach it is easy to add resources incrementally as the workload and volume of data increase.
Deep Link Analysis of Call Detail Records
Use Case
Many telecommunication companies (Telcos) generate and collect Call Detail Records (CDRs) for analysis. Typical analyses include, but are not limited to, fraud detection by clustering user profiles, reducing customer churn by tracking usage activity, and targeting profitable customers by using RFM (recency, frequency, monetary value) analysis. Historically, CDRs have been stored in a relational database, a simple NoSQL database or Big Data storage. However, this type of storage is not suitable for deep link analysis.
InfiniteGraph was contacted by one such Telco that had been trying Big Data and NoSQL tools to perform deep link analysis, tracking and uncovering the social interactions of people in order to discover new intelligence. Collecting and analyzing CDR data is a continuous process. This led to the following requirements:
- The graph needs to be updated even while running real-time analysis on existing data.
- Data queries need to be executed using the same tools as the deep link analysis.
The Telco built a Proof of Concept (PoC) to demonstrate that InfiniteGraph could support both their data queries and their deep link analysis.
Problem
Using more traditional technologies, the Telco had not been able to load its more than 10 billion CDRs, let alone query them. Even at 1 billion CDRs, its existing database technology had not delivered satisfactory query performance. The success criteria for the PoC included a specified number of CDRs (approximately 10 billion), a time frame to load the initial CDRs, a rate at which to load newly arriving data, and approximately 30 data and deep link analysis queries.
The initial data set included 5 million Persons, 20 million Phones and 10.025 billion Calls.
Solution
The PoC was performed using InfiniteGraph in a standard Amazon Web Services (AWS) environment using commodity hardware. InfiniteGraph was able to meet or exceed all of the requirements for load, data queries and deep link analysis. InfiniteGraph is the only graph database that can dynamically load/ingest massive, complex, distributed data while simultaneously building the graph, which makes it possible to query the data in real time.
Results
The data ingest was performed on AWS using one large “master” server and 32 “slave” servers in order to take advantage of InfiniteGraph’s patented feature called “Pipelining”, which enables the ingest of vast amounts of highly connected data in parallel. With this configuration the ingest rate was maintained at 37 million CDRs per minute. Indexing was performed during the pipeline ingest, which eliminated the separate step of building and maintaining indexes that is common with other graph databases.
To execute queries, a single server on AWS was used. The queries were parallelized across the threads available on that server. InfiniteGraph’s DO Query Language was used to perform the queries. DO supports both data (SQL-like) queries and graph (Cypher-like) queries. The syntax of DO will be familiar to any data scientist who is familiar with SQL and Cypher.
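To put the ingest rate in perspective, the short Python calculation below estimates how long the initial load would take at that rate. It is a back-of-the-envelope sketch, not a reported result: it assumes each of the 10.025 billion Calls corresponds to one CDR and that the 37 million CDRs per minute rate held for the entire load. The source does not state the total load time.

```python
# Figures taken from the text; everything derived from them is an estimate.
TOTAL_CDRS = 10_025_000_000        # 10.025 billion Calls, assumed one CDR each
CDRS_PER_MINUTE = 37_000_000       # sustained ingest rate reported for the PoC
INGEST_SERVERS = 32                # "slave" servers used for pipelined ingest

minutes = TOTAL_CDRS / CDRS_PER_MINUTE
hours = minutes / 60
per_server_per_minute = CDRS_PER_MINUTE / INGEST_SERVERS

print(f"Estimated initial load: {minutes:.0f} minutes (about {hours:.1f} hours)")
print(f"Average rate per ingest server: {per_server_per_minute:,.0f} CDRs per minute")
```

Under these assumptions the initial load works out to roughly four and a half hours, with each ingest server averaging a little over one million CDRs per minute.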
Typical data queries included:
Exact Match Queries
“Given a callee number, retrieve the 10 caller numbers with the longest call durations”
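The logic of this exact match query can be sketched in Python as follows. This is not DO syntax, which is not shown in this document; the `(caller, callee, seconds)` record layout and the function name `top_callers` are hypothetical, and the sketch assumes that "longest call duration" means the total duration summed per caller.

```python
import heapq
from collections import defaultdict

def top_callers(cdrs, callee, k=10):
    """Return the k callers with the largest total call duration to `callee`.

    `cdrs` is an iterable of (caller, callee, seconds) tuples, a hypothetical
    simplified CDR layout used only for this illustration.
    """
    totals = defaultdict(int)
    for caller, called, seconds in cdrs:
        if called == callee:              # exact match on the callee number
            totals[caller] += seconds     # assumption: durations are summed per caller
    # nlargest returns (caller, total_seconds) pairs, longest first.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])
```

For example, with records `[("111", "999", 60), ("222", "999", 300), ("111", "999", 400)]`, calling `top_callers(records, "999")` ranks caller 111 first with 460 seconds, followed by caller 222 with 300 seconds.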
Fuzzy Match Queries
“Given a caller’s id, retrieve the total call durations and frequencies (number of calls) of the 20 most-called callees whose age falls within a given range, before a given date and time”
Graph Queries
“Given 2 people, verify their linkage” and “Given a phone number, retrieve the people linked to it, through up to 5 degrees of separation, within a date range”
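The second graph query is essentially a depth-limited breadth-first traversal. The Python sketch below illustrates the idea under stated assumptions: it is not DO and not InfiniteGraph's implementation, calls are treated as undirected links between phones, and the `(phone_a, phone_b, day)` call layout, the `owner` mapping and the function name `linked_people` are all hypothetical.

```python
from collections import defaultdict, deque
from datetime import date

def linked_people(calls, owner, start_phone, first_day, last_day, max_degrees=5):
    """Return {person: degrees of separation} for people reachable from start_phone.

    calls: iterable of (phone_a, phone_b, day) tuples (hypothetical layout)
    owner: dict mapping phone number -> person (hypothetical)
    """
    # Keep only calls inside the date range and treat each as an undirected edge.
    adjacency = defaultdict(set)
    for a, b, day in calls:
        if first_day <= day <= last_day:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Breadth-first search, so the first visit to a phone is its shortest distance.
    degrees = {start_phone: 0}
    queue = deque([start_phone])
    while queue:
        phone = queue.popleft()
        if degrees[phone] == max_degrees:
            continue                      # do not expand beyond the degree limit
        for other in adjacency[phone]:
            if other not in degrees:
                degrees[other] = degrees[phone] + 1
                queue.append(other)

    # Phones were discovered in non-decreasing distance order, so setdefault
    # keeps the smallest degree for a person who owns several phones.
    people = {}
    for phone, degree in degrees.items():
        if phone != start_phone and phone in owner:
            people.setdefault(owner[phone], degree)
    return people
```

The first query, verifying the linkage between two people, could reuse the same traversal: start from one person's phone and check whether the other person appears in the result.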
Most exact match queries returned results in 1 to 2 seconds; most fuzzy match queries returned results in about 5 seconds. Most graph queries returned results in up to 5 seconds depending on the number of degrees of separation.
Benefits
InfiniteGraph’s DO query language supports both data queries and graph queries in a simple, easy-to-use query language. Using InfiniteGraph’s web-based browser, “Studio”, it is easy to compose and execute queries and display the result either as a table for data queries or as a graph for graph (navigation) queries.
Most query languages require that what you are looking for is known, whether it’s by value of data or knowing a navigation path through the connected data (joins in relational databases).
DO provides the ability to specify a query such as “given these two things of interest, how might they be connected?”, where the connection may run through many different types of connections and many different types of things.
This is the power of InfiniteGraph. If you’re using old database technology how would you handle this type of query?
Deep link analysis can also help discover clusters and chains of connection between phones and people, and it can reveal anomalies in call patterns. Most calls are local, within an area, to friends, family and work colleagues. The frequency of such calls should be fairly stable: for example, calling a brother once a week or calling home once a month. An anomaly could be a sudden increase in calls, perhaps to a new number, or an increase in calls out of the area. This sort of behavior could indicate a gang setting up cells to carry out a robbery or some sort of fraud.
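One simple way to flag a "sudden increase in calls" is to compare the most recent period against a phone's own history. The Python sketch below is only an illustration of that idea, not the method used in the PoC; the weekly bucketing, the threshold of three standard deviations and the function name `is_call_spike` are all assumptions.

```python
from statistics import mean, pstdev

def is_call_spike(weekly_counts, threshold=3.0):
    """Return True if the latest week's call count is far above the earlier weeks.

    weekly_counts: call counts per week for one phone, oldest first (hypothetical input).
    threshold: how many standard deviations above the baseline counts as a spike.
    """
    if len(weekly_counts) < 3:
        return False                      # too little history to define a baseline
    *history, latest = weekly_counts
    baseline = mean(history)
    # A perfectly stable history has zero spread; fall back to 1 call
    # so that the comparison below stays well defined.
    spread = pstdev(history) or 1.0
    return (latest - baseline) / spread > threshold
```

A phone that makes about one call a week and then makes nine in the latest week would be flagged, while a phone whose counts drift between four and five would not. The same comparison could be applied separately to calls to new numbers or to calls out of the area.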
Conclusion
InfiniteGraph can perform deep link analysis queries on over 10 billion CDRs, whereas relational databases and Big Data/NoSQL tools could not achieve acceptable results (and in some instances could not achieve any results) with 1 billion CDRs, 10% of the CDR volume managed with InfiniteGraph.
InfiniteGraph is the only graph database that can continuously load/ingest massive amounts of complex connected data while simultaneously performing queries in operational environments.