Pluto Open Project (1)

Introducing Author Name Disambiguation

Changbae Ahn
Pluto Labs
6 min read · Nov 22, 2018


Hi, it’s Pluto Network’s data mining team.

In previous posts, we’ve shared our ideas and the hurdles we face in building a decentralized scholarly communication platform that breaks down the obstacles in today’s research environment.
Before implementing some of these ideas, the team concluded that the past records and achievements of academics must be managed and measured properly. Currently, academic databases are used mainly for searching scholarly articles, so they are organized primarily around individual articles rather than authors and fall short of high systematic standards. Many challenges arise from the fact that publication information is collected from tens of thousands of different journals and publishers, each with its own policies for handling this information.

In practice, publications from a single researcher are often split across multiple author identifiers (published in different journals and therefore coming from different data sources), while publications by different authors are sometimes merged into a single author identifier. It is difficult to distinguish research outcomes by authors with similar names, since most of the data are keyed by name without any standard, universal identification system. We’ve identified several other problems in our database, such as name changes after marriage, multiple name representations, abbreviated names, inconsistent formatting, and so forth.

Pluto Network is utilizing data mining techniques to find a breakthrough in matching past academic objects (i.e., papers) to the appropriate individual researchers, and to apply the same methodology to future inputs. We call this problem “Author Name Disambiguation”, and in an upcoming series of posts we will describe in detail the challenges we’re facing and the approaches we take to them.

Before we explore more

The following are some of the challenges in Author Name Disambiguation, along with some of our concerns:

  • There are not enough cases where the “true value” is known (i.e., cases where we have 100% assurance that a given pair of author records refers to the same person). Thus we are starting with unsupervised learning, and later, when we have enough training data with proper labels, we will try supervised learning on that dataset.
  • The data is more sensitive to false positive errors than to false negative errors. That is, incorrectly merging different persons into the same identifier could be more critical than missing split identities that should be merged. Therefore, we may need to adopt conservative criteria with high precision requirements when merging authors.
  • There are too many author identities (over 100 million) to naively compare every pair, so we are making “blocks” of authors by their surnames (see the sketch after this list).
  • Unlike typical Kaggle problems, the inputs and outputs are not predefined. The objective is not to predict the values of a certain column, but to identify identical objects that are stored as different ones and to distinguish different objects that are stored as one, which makes the task highly complex. Prior studies have applied generic machine learning models such as Random Forest to specific datasets. While referring to these past attempts, we will also embrace further techniques such as blocking, clustering, and link analysis.
  • We cannot solve every problem at once. Rather than splitting mis-merged authors, we will focus on correctly merging split authors, where relatively more data are available.
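To make the surname-blocking idea concrete, here is a minimal sketch in Python. The record fields, names, and normalization below are hypothetical illustrations, not our actual schema or pipeline:

```python
from collections import defaultdict

def normalize_surname(surname):
    """Lower-case and strip a surname so trivially different spellings land in the same block."""
    return surname.strip().lower()

def build_surname_blocks(authors):
    """Group author records into blocks keyed by normalized surname.

    `authors` is an iterable of dicts like {"id": ..., "surname": ..., "name": ...};
    pairwise comparisons later happen only inside a single block.
    """
    blocks = defaultdict(list)
    for author in authors:
        blocks[normalize_surname(author["surname"])].append(author)
    return blocks

# Hypothetical toy records, just to show the shape of the output.
authors = [
    {"id": 1, "surname": "Swift", "name": "Taylor Swift"},
    {"id": 2, "surname": "swift", "name": "T. Swift"},
    {"id": 3, "surname": "Smith", "name": "Adam Smith"},
]
blocks = build_surname_blocks(authors)
# blocks["swift"] now holds both "Swift" records; blocks["smith"] holds one.
```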

Attempts made

As mentioned above, we’ve blocked authors by their surnames and tried the following within those “surname blocks.”

Criterion 1: Self-Citation
- Citation counts are among the main measures used to assess the impact of individual articles. Many academics, for this and other reasons, often cite their own past studies. With this background, we believed that if the authors of a citing article and a cited article have highly similar names, there is a high probability that they are the same person.
- For example, if a paper authored by “Taylor Swift” cites a paper authored by “T. Swift”, it is highly likely that “T. Swift” is an abbreviation of “Taylor Swift” and that both papers are authored by the same person, Taylor Swift.
- Based on this idea, for each surname block we structured our database into a network using the Python NetworkX library, with authors as nodes and citations as edges (typical citation graphs would use papers as nodes). Looking at the subgraphs for each surname, we found several identical authors.
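As a minimal sketch of this graph construction with NetworkX (the citation records and the crude name-compatibility check below are hypothetical simplifications, not our production logic):

```python
import networkx as nx

def names_compatible(name_a, name_b):
    """Hypothetical, crude name-similarity check: same surname and a
    compatible first name/initial (e.g. "T. Swift" vs. "Taylor Swift")."""
    first_a, last_a = name_a.split()[0].rstrip(".").lower(), name_a.split()[-1].lower()
    first_b, last_b = name_b.split()[0].rstrip(".").lower(), name_b.split()[-1].lower()
    return last_a == last_b and (first_a.startswith(first_b) or first_b.startswith(first_a))

# Hypothetical citation records within a single surname block:
# (citing author id, citing author name, cited author id, cited author name)
citations = [
    ("A1", "Taylor Swift", "A2", "T. Swift"),
    ("A1", "Taylor Swift", "A3", "Tom Swift"),
]

graph = nx.Graph()
for citing_id, citing_name, cited_id, cited_name in citations:
    graph.add_node(citing_id, name=citing_name)
    graph.add_node(cited_id, name=cited_name)
    # Authors are nodes; an edge marks a self-citation candidate between similar names.
    if names_compatible(citing_name, cited_name):
        graph.add_edge(citing_id, cited_id)

# Each connected component with more than one node is a candidate group to merge.
for component in nx.connected_components(graph):
    if len(component) > 1:
        print(sorted(component))  # e.g. ['A1', 'A2']
```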

Criterion 2: Co-Authors
- Similar to the reasoning in the self-citation example above, different author identities with similar names and similar co-author profiles have a high probability of being the same person.
- For example, if one paper by Adam Smith was co-authored with Taylor Swift and another paper by Adam Smith was co-authored with T. J. Swift, we would consider it highly likely that T. J. Swift and Taylor Swift are the same person.
- To investigate this inference, we created a co-author list for each author, calculated their pairwise similarities, and found several cases where the identities indeed appeared to be the same author.
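Here is a minimal sketch of such a comparison. The Jaccard index used below is just one possible similarity measure (not necessarily the one we settle on), and the records and threshold are hypothetical:

```python
def jaccard_similarity(set_a, set_b):
    """Jaccard index: overlap of two co-author sets relative to their union."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical co-author lists for two "Adam Smith" identities within one surname block.
# (A real pipeline would also need fuzzy matching on the co-author names themselves,
#  e.g. "Taylor Swift" vs. "T. J. Swift", which this toy example ignores.)
coauthors = {
    "smith_1": {"Taylor Swift", "Jane Doe", "John Roe"},
    "smith_2": {"T. J. Swift", "Jane Doe", "John Roe"},
}

ids = list(coauthors)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        score = jaccard_similarity(coauthors[ids[i]], coauthors[ids[j]])
        # A conservative (high) threshold keeps precision up, since false merges are costly.
        if score >= 0.5:
            print(ids[i], ids[j], round(score, 2))
```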

Limitations

Although we’ve found many successful cases in the above trials, we also ran into several limitations. These include not only problems with the methodologies themselves but also issues arising from the need for data pre-processing.

1. We Never Know
Even after we have checked that two author identities have the same surname, similar name representations, shared co-authors, and mutual citation relations, we still cannot be SURE that they represent the same person. It is even worse when their names are abbreviated to initials. We’re trying to come up with our own criteria for deciding whether two identities are the same (again, this problem is very sensitive to false positives).

We can’t be sure that they are the same person

2. Malformed Data
A lot of records (articles) were found to be missing their references (over 10 million). Looking at some random samples, many of them should in fact have references. We’re working on solutions.
We use word counts in abstracts to filter out malformed data, but word counts hardly work for papers written in Chinese, and for papers indexed by scanning their documents, the spacing is often broken (typically due to failures in capturing line breaks). We are exploring more types of malformed data and solutions to each.
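A rough sketch of such a word-count filter might look like the following. The thresholds, field handling, and example text are hypothetical, and as noted above this heuristic breaks down for Chinese text and scan-damaged spacing:

```python
def looks_malformed(abstract, min_words=30, max_avg_word_len=15):
    """Flag abstracts that are suspiciously short or contain implausibly long 'words'
    (a common symptom of broken spacing in scanned documents).
    Thresholds are hypothetical and for illustration only."""
    words = abstract.split()
    if len(words) < min_words:
        return True
    avg_len = sum(len(w) for w in words) / len(words)
    return avg_len > max_avg_word_len

# Example: a scan-damaged abstract where line breaks and spaces were lost.
broken = "Thispaperinvestigatestheeffectof temperatureon superconductingmaterials"
print(looks_malformed(broken))  # True: too few, overly long "words"
```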

Examples of malformed data

3. Non-research Articles
Several tens of millions of records were identified as non-academic content (or at least content requiring a different indexing structure). These include patents, mails, audio records, and so on. We will come up with more patterns to recognize as many objects of these kinds as possible (a rough sketch follows the examples below).
- ex) Caribbean Report (audio news by BBC)
- ex) Dictionnaire historique du Japon
- ex) Audio record of classical music concert
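A pattern-based filter along these lines might look roughly like the sketch below. The keyword patterns and record fields are hypothetical illustrations, not our actual rules:

```python
import re

# Hypothetical keyword patterns hinting that a record is not a research article.
NON_RESEARCH_PATTERNS = [
    re.compile(r"\bpatent\b", re.IGNORECASE),
    re.compile(r"\b(audio|sound) record(ing)?\b", re.IGNORECASE),
    re.compile(r"\bdictionnaire\b", re.IGNORECASE),
]

def looks_non_research(title, venue=""):
    """Return True if the title or venue matches any known non-research pattern."""
    text = f"{title} {venue}"
    return any(pattern.search(text) for pattern in NON_RESEARCH_PATTERNS)

print(looks_non_research("Caribbean Report", "BBC audio recording"))  # True
print(looks_non_research("Observation of a new boson at the LHC"))    # False
```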

4. Edge Cases
Typical papers from the European Organization for Nuclear Research (a.k.a. CERN) can have anywhere from several tens to several thousand co-authors.
- ex) Physics paper sets record with more than 5,000 authors

Upcoming

To sum up, we’ve produced several meaningful analyses based on self-citation and co-author profiles, but each comes with its own limitations. We will put more effort into data pre-processing to obtain better-quality data, repeat the earlier attempts on the processed dataset, and evaluate the results in order to improve those methods or come up with novel approaches.

Thank you.

Pluto Network
Homepage / Github / Facebook / Twitter / Telegram / Medium
Scinapse: Academic search engine
Email: team@pluto.network
