Statistical Graph Analysis with KgBase

Matias Morant
Knowledge Graph Digest
3 min readAug 21, 2020

We developed a statistical graph analysis tool for KgBase. Throughout development, we used Gephi as a benchmark.

In this article, we compare the results and performance of KgBase to Gephi, for one of the projects hosted on KgBase

To reproduce these results, you can read how to compute statistics on KgBase and how to import data from KgBase into Gephi.

Global Statistics

These are the global statistics we compute on KgBase:

  • Nodes
    The number of nodes |V|
  • Edges
    The number of edges/connections |E|
  • Mean Degree
    The Degree of a node is the number of edges incident to it. The mean degree is the mean of all nodes degree. This is equivalent to 2|E|/|V| or |E|/|V| depending if the graph is directed or undirected.
  • Network Diameter
    The Network Diameter of a graph is the shortest distance between the two most distant nodes. In other words, once the shortest path length from every node to all other nodes is calculated, the diameter is the longest of all the calculated path lengths.
  • Graph Density
    The Density of a graph is the ratio of the number of edges |E| to the maximum possible edges.
  • Connected Components
    A Component is a subgraph in which any two nodes are connected to each other by paths, and which is connected to no additional nodes in the supergraph.
    This is the number of components in the graph.
  • Mean Path Length
    The Mean Path Length of a graph is the mean distance between pairs of nodes. In other words, once the shortest path length from every node to all other nodes is calculated, the mean path length is the mean of all the calculated path lengths.
  • Av. Clustering Coefficient
    The Average Clustering Coefficient is a measure of the degree to which nodes tend to cluster together.
    The neighborhood of a node is the set of all nodes connected to it. A neighborhood edge is an edge between two neighborhood nodes.
    The Local Clustering Coefficient of a node is the ratio of the number of neighborhood edges, to the maximum possible neighborhood edges.
  • Modularity
    Modularity measures the strength of division of the graph into clusters. High modularity indicates dense connections between the nodes within clusters but sparse connections between nodes in different clusters.

These are the results we obtain for these statistics on KgBase, compared to Gephi:

On both, most of the time was spent computing Network Diameter (undirected) and Mean Path Length, which are computed simultaneously for performance reasons. In Gephi, these 2 statistics took 178 seconds, while on KgBase they took 1455 seconds. For the remaining 10 statistics, it took 108 seconds to compute on Gephi, while it took 98 seconds to compute them on KgBase.

Local Statistics

These are the local statistics we compute on KgBase:

  • PageRank
    Ranks nodes (“pages”) according to how often someone following random links will reach a node.
  • HITS authorities
    A node HITS authority score is the sum of the HITS hub score of all nodes that point to it. It is a measure of how valuable information stored at that node is.
  • HITS hubs
    A node HITS hub score is the sum of the HITS authority score of all nodes that it points to. It is a measure of the quality of the links of a node.
  • Eigenvector Centrality
    Eigenvector Centrality is a measure of the influence of a node in a network. A high eigenvector score means that a node is connected to many nodes who themselves have high scores.
    The score of a node corresponds to its component in the eigenvector with largest eigenvalue of the adjacency matrix.

As a test node for local statistics, we chose the ‘Lyft’ node. These are the results we obtain for these statistics on KgBase, compared to Gephi:

It took 18 seconds to compute all these statistics on KgBase.
On Gephi, It took us 81 seconds to compute these statistics except for Eigenvector Centrality (undirected). This last statistic, which depends on the number of iterations, took

--

--