Summary: Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop (KDD 2018)

Pouya Pezeshkpour
6 min read · Dec 17, 2018


Authors: Yutao Zhang, Fanjin Zhang, Peiran Yao, Jie Tang

In this paper, the authors consider the problem of name disambiguation in AMiner, a free online academic search and mining system. Dealing with name ambiguity in a large online system, they address four challenges: 1) How to quantify the similarity between entities from different data sources? 2) How to determine the number of persons with the same name (usually a prespecified parameter in existing name disambiguation algorithms)? 3) How to integrate data continuously? 4) How to involve human efforts in the loop to achieve high disambiguation accuracy? To address these challenges, the authors design an end-to-end framework, depicted in Figure 1.

For a given name reference “a”, they define the candidate set of “a” (set D_a) as all the documents associated with “a”. Furthermore, they represent each document D_a^i ∈ D_a by a set of features D_a^i = {x_1, x_2, ...}, including the title, abstract, coauthor names, venue names, etc. They use I(D_a^i) to denote the identity (the corresponding real-world person) of D_a^i. Thus, if D_a^i and D_a^j are authored by the same person, we have I(D_a^i) = I(D_a^j). The end-to-end algorithm proposed in this paper consists of five steps:
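To make the notation concrete, here is a minimal Python sketch of a document record; the class and field names are mine, not the paper's:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """One candidate document D_a^i for a name reference "a".

    `features` collects the attributes x_1, x_2, ... (title words,
    coauthor names, venue names, ...); `identity` is I(D_a^i), the
    real-world person, known only for labeled data.
    """
    doc_id: str
    features: list[str] = field(default_factory=list)
    identity: str | None = None  # I(D_a^i); None when unlabeled

def same_author(d1: Document, d2: Document) -> bool:
    """Two documents belong to the same author iff their identities match."""
    return d1.identity is not None and d1.identity == d2.identity
```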

1. Global Metric Learning

In this step, they first transform the feature set of each document D_i into a continuous low-dimensional space using Word2Vec: each feature x_n is mapped to a vector x_n, and the feature embedding of document D_i is computed as the weighted sum X_i = Σ_n α_n · x_n, where α_n is the inverse document frequency of feature x_n. X_i thus captures the correlations between features based on their co-occurrence statistics within each individual document. Using these feature vectors, the authors learn an embedding for each document by global metric learning, optimizing the following triplet (contrastive) loss:

L = Σ_i max(0, δ(y_i, y_i^+) − δ(y_i, y_i^−) + m)

where δ is the Euclidean distance in the embedding space, y_i is the new embedding of D_i (X_i passed through a linear layer), y_i^+ is the embedding of a positive sample D_i^+ with I(D_i) = I(D_i^+) (sampled from the labeled data), y_i^− is the embedding of a negative sample D_i^− with I(D_i) ≠ I(D_i^−) (sampled randomly), and m is a margin. The architecture of global metric learning with triplet loss is shown in Figure 2.
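A rough sketch of these two computations in NumPy; the margin value and the exact hinge form of the loss are my assumptions, since the post only specifies a triplet (D_i, D_i^+, D_i^−) and Euclidean distance:

```python
import numpy as np

def doc_embedding(features, word_vecs, idf, dim=100):
    """X_i = Σ_n α_n · x_n: IDF-weighted sum of the Word2Vec vectors
    of the document's features (α_n = inverse document frequency)."""
    X = np.zeros(dim)
    for f in features:
        if f in word_vecs:
            X += idf.get(f, 1.0) * word_vecs[f]
    return X

def triplet_loss(y, y_pos, y_neg, margin=1.0):
    """Margin-based triplet loss with Euclidean distance δ:
    max(0, δ(y, y⁺) − δ(y, y⁻) + m). The margin is illustrative."""
    d_pos = np.linalg.norm(y - y_pos)
    d_neg = np.linalg.norm(y - y_neg)
    return max(0.0, d_pos - d_neg + margin)
```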

2. Local Linkage Learning

In this step, the goal is to combine the global embeddings learned in the previous step with local topological information for a given name reference “a”. To do so, they construct a local linkage graph G_a = (D_a, E_a), where D_a = {D_a^i} is the set of documents associated with the name “a”, and E_a = {(D_a^i, D_a^j)} is a set of edges capturing the similarity between documents. They measure the similarity of two documents by the features the documents share, and add an edge between two documents in the local linkage graph if they share more than 10 features (a threshold found empirically), as sketched below. Then, to combine the global embeddings with the local graph topology, they define a variational graph auto-encoder that receives both as inputs, and use its encoder component to produce the final embedding of each D_i. The architecture of this auto-encoder, which uses a two-layer graph convolutional network as its encoder, is depicted in Figure 3.
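A minimal sketch of the edge construction, assuming each document is given as a list of features (the function name is mine):

```python
from itertools import combinations

def build_linkage_graph(doc_features, threshold=10):
    """Return the edge set E_a: an edge (i, j) whenever documents i
    and j share more than `threshold` features (the paper's
    empirically chosen cutoff)."""
    feats = [set(f) for f in doc_features]
    return [(i, j)
            for i, j in combinations(range(len(feats)), 2)
            if len(feats[i] & feats[j]) > threshold]
```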

3. Cluster Size Estimation

In this paper, the authors consider a setting in which the number of persons sharing a name is not known, so a method for estimating the number of clusters is essential. To this end, they propose a sampling strategy to construct a pseudo-training set, and then use a bi-directional LSTM as the encoder and a one-dimensional fully-connected layer as the decoder to predict the number of clusters. As the loss function, they optimize the Mean Squared Logarithmic Error:

L = Σ_t (log K̂(D_t) − log K_t)^2

where K_t is the true number of clusters for the sampled document set D_t, and K̂(D_t) is the model's prediction.
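A sketch of such an estimator in PyTorch; the hidden sizes, mean pooling, and the choice to regress log K directly are illustrative assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class ClusterSizeEstimator(nn.Module):
    """BiLSTM encoder over a sampled sequence of document embeddings,
    followed by a linear decoder. Regressing log K directly turns the
    MSLE objective into an ordinary MSE in log space."""
    def __init__(self, emb_dim=100, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.Linear(2 * hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        out, _ = self.encoder(x)          # (batch, seq_len, 2 * hidden)
        pooled = out.mean(dim=1)          # pool over the sampled documents
        return self.decoder(pooled).squeeze(-1)  # predicted log K

def msle_loss(pred_log_k, true_k):
    """Mean Squared Logarithmic Error against the true count K_t."""
    return ((pred_log_k - torch.log(true_k.float())) ** 2).mean()
```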

4. Clustering

After estimating the number of clusters for a given name reference “a”, the authors run a hierarchical agglomerative clustering algorithm on the output of step (2) to solve the disambiguation problem.
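With scikit-learn this step is essentially a one-liner; the linkage criterion here is sklearn's default, since the paper only names hierarchical agglomerative clustering:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def disambiguate(Y: np.ndarray, k_hat: int) -> np.ndarray:
    """Cluster the document embeddings Y (rows = documents, from step 2)
    into k_hat groups (the estimate from step 3); each label is one person."""
    return AgglomerativeClustering(n_clusters=k_hat).fit_predict(Y)

# Toy example: 6 documents in a 2-D embedding space, 2 estimated persons.
Y = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(disambiguate(Y, 2))   # e.g. [1 1 1 0 0 0]
```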

5. Human in the Loop

To collect labeled data, they allow both users and professional annotators to provide feedback on the clustering results. The supported feedback operations include (1) Delete: removing a document D_i from a profile; (2) Insert: adding a document D_i to a profile; (3) Split: marking a profile as over-merged and requesting re-clustering within it; (4) Merge: merging all the documents of two profiles; (5) Create: creating an empty profile; and (6) Confirm: labeling a profile as a clean cluster.
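As a compact reference, the six operations can be encoded as follows; the enum is my representation, while the operation names and semantics are the paper's:

```python
from enum import Enum

class Feedback(Enum):
    """The six feedback operations supported by the AMiner interface."""
    DELETE = "remove a document from a profile"
    INSERT = "add a document into a profile"
    SPLIT = "re-cluster an over-merged profile"
    MERGE = "merge all documents of two profiles"
    CREATE = "create an empty profile"
    CONFIRM = "mark a profile as a clean cluster"
```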

To evaluate their method, the authors construct a benchmark based on AMiner by sampling 100 author names from a well-labeled subset of the AMiner database. The benchmark consists of 70,258 documents from 12,798 authors. As baselines, they consider GHOST [1], Zhang et al. [2], and Louppe et al. [3]. The results of the name disambiguation task are shown below:

Moreover, the evaluation of the cluster size estimation is shown below:

As the results show, the proposed method outperforms the baselines by a considerable margin. Although this work provides a novel end-to-end approach to the task and introduces many interesting techniques to address its different challenges, Zhang et al. [2] achieve very similar results to the presented method. Considering that Zhang et al. [2] use only local graph topological information (in an unsupervised setting), it seems the second learning step of this work (Local Linkage Learning) would need to be carried out in a more elaborate way to achieve higher accuracy. Furthermore, based on the cluster size estimation table, although the predictions outperform the baselines by a considerable margin, they are still very far from the true values. As a result, comparing these methods in a setting where the number of clusters is prespecified could shed more light on their evaluation.

[1] Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, and Bing Lv. 2011. On graph-based name disambiguation. JDIQ 2, 2 (2011), 10.

[2] Baichuan Zhang and Mohammad Al Hasan. 2017. Name disambiguation in anonymized graphs using network embedding. In CIKM’17. 1239–1248.

[3] Gilles Louppe, Hussein T Al-Natsheh, Mateusz Susik, and Eamonn James Maguire. 2016. Ethnicity sensitive author disambiguation using semi-supervised learning. In KESW’16. 272–287.
