Summary: Name Disambiguation in Anonymized Graphs using Network Embedding (CIKM 2017)

Pouya Pezeshkpour
UCI NLP
Published in
4 min readDec 16, 2018

Authors: Baichuan Zhang, Mohammad Al Hasan

In this paper, the authors consider the problem of name disambiguation in anonymized graphs and provide a novel solution using only relational data. In contrast to the majority of existing solutions, the authors tackle the task in a privacy-preserving setting, i.e., do not leveraging biographical features such as name, address, institutional affiliation, and also, contextual features such as collaborator, community affiliation, and external data source.

To incorporate the link data, for a given name reference “a”, their method pre-processes the input data as three graphs: 1) person-person graph representing a collaboration between a pair of persons. They construct this graph by firstly identify all the documents (set D_a) which “a” is associated with them. Then finding all the collaborators (set A_a) in these documents and considering them as the nodes of the graph, they link two persons together if they coauthoring a document together. 2) person-document graph representing the association of a person with a document. This graph is a bipartite network where the nodes are from two sets of D_a and A_a and links represent the authorships relation between a person and a document. 3) document-document similarity graph. They define the similarity between documents through a combination of person-person and person-document relationships. Two documents are similar if the intersection of their collaborator-sets is large (by using person-document relationship) or if the intersection of one-hop neighbors of their collaborator- sets is large (by using both person-document and person-person relationships).

Providing the topological information through the three defined graphs, they find a vector representation for each document by learning these graph simultaneously and embedding them in the same vector space. They define the affinity score between two nodes as the inner product of their corresponding embedding representations. Then, they learn a negative log-likelihood objective function over the three graphs by applying a sigmoid function on subtraction of the score of each sample with a random negative sample. After finding the vector representation for every document associated with “a”, they solve the disambiguation problem by hierarchical clustering on the embeddings assuming that the number of clusters is known. The pseudo-code of the proposed method is provided as follows:

As the benchmarks, they consider two citation datasets namely as Arnetminer and CitSeer. The statistics of these datasets, which for each name reference contain the number of documents, and the number of distinct authors associated with that name reference, is provided below:

To evaluate their method, they consider several baselines differing only in their documents embedding procedure and sharing the same clustering technique as before. The comparison results of Macro-F1 is provided as:

Furthermore, they conjecture that the relatively good performance of their proposed method across all the name references is due to the fact that the method is able to learn document embedding, which is particularly suited for the name disambiguation task by facilitating information exchange among the three networks. In addition to Macro-F1, they study the effects of considering different values for the number of clusters on their performance in Figure 3. As it shows, their method provides a more robust performance comparing to the baselines.

This paper is one of the first and few papers which goes beyond the biographical and contextual features to solve the disambiguation problem and incorporate the graph topological information instead to solve this problem. Although they achieve very interesting and compelling results through their experiments, I believe the inability of their baselines to capture information from the three defined graphs play a major role in their poor performance. As a result, providing a more fair environment by either increasing the baselines capability to process all three graphs or defining these graphs according to the baselines requirements can shed more lights on the performance of these methods. Furthermore, as the next step, it would be very interesting to combine the external features with graph topological information together to achieve even better performance.

--

--