Pluto Data Team’s progress in November

Yoonji Kim · Published in Pluto Labs · Dec 12, 2018

In the last post, Changbae, a data scientist at Pluto, introduced our Open project: author name disambiguation. This project is a major part of Pluto’s Data Team. Today, I want to share what our Data Team does and its progress in November.

There are two projects the Data Team focused on:

  1. <Author Name Disambiguation> project
    This project, introduced in the last post, is designed to find a breakthrough in matching past academic objects (i.e. papers) to the appropriate individual researchers and to apply the same methodology to future inputs.
    Initially, this project was carried out only by team members, but from Q4 we have been working to make it an Open project. We believe the traits of an Open project, such as collaboration, transparency, and inclusivity, will lead us to a solution much faster. We have also continued to feel that we need insights from a variety of people in the academic community. So during November we prepared the preliminary work needed to open the project.
  2. <Exploration of metrics for good research>
    When looking for papers, researchers usually refer to abstracts and reference lists. But they still can’t be sure how valuable these papers are, so they spend a lot of time seeking “good papers”. To solve this problem, the Data Team began this project.
    This project explores 1. criteria for judging “good research” and 2. how to measure those criteria.
    This is similar to our work on finding quantitative measures of scientific output to replace the indexes academia currently uses (e.g. Impact Factor). We thought that if we found metrics for “good research”, they could solve that problem as well. The metrics we are exploring should be quantitative, based on robust data, rapidly updated and retrospective, normalized by discipline, and resistant to manipulation.

As we proceeded with these projects, we found that we not only lacked a deep understanding of our whole database but also needed quality control of the data. So we focused on improving both our understanding and the quality of our data.

* This post summarizes the overall progress of the Data Team. More information on the Open project can be found in the next post in this series.

Exploratory Data Analysis (EDA)

First of all, we conducted EDA to improve our understanding of the whole database. EDA is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods. Usually, EDA is done to see what the data can tell us before formal modeling or hypothesis testing. With EDA, we obtained distributions for the following items: reference count, citation count, document type, number of words in the abstract, number of available URLs, number of co-authors, number of publications per author, publication year, and the journal each paper is published in.
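
As a rough illustration, here is a minimal sketch of this kind of EDA pass, assuming the records are loaded into a pandas DataFrame. The file name and column names (reference_count, doc_type, abstract, ...) are hypothetical stand-ins, not our actual schema:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema; the file and column names are illustrative.
papers = pd.read_json("papers.jsonl", lines=True)

# Summary distributions for a few of the items listed above.
print(papers["reference_count"].describe())
print(papers["citation_count"].describe())
print(papers["doc_type"].value_counts(normalize=True))

# Distribution of the number of words in each abstract
# (missing abstracts count as 0 words).
abstract_words = papers["abstract"].fillna("").str.split().str.len()
abstract_words.hist(bins=50)
plt.xlabel("words in abstract")
plt.ylabel("papers")
plt.show()
```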

Filtering

After increasing our understanding of the whole dataset, we prepared the filtering work.
Here is a description of the filtering rules we used to improve the quality of the data.

  • Using the document type (doc-type)
    Through EDA we obtained the doc-type of every record in the database. From this, we found that about 20 percent of the whole dataset is patent data, not papers. We ran a pilot test to check whether the doc-type information from EDA is accurate, and the result told us it is reliable. So we used the doc-type field to filter out patent records.
  • Using the length of the abstract
    This filtering criterion splits into two cases based on the length of the abstract. In the first case the abstract is very short, under 20 words; these records are filtered out with no further conditions. In the second case the abstract is slightly short, 20 to 50 words; these records are filtered out only when additional conditions also hold. Based on this idea, we could identify records that are not papers, such as letters, dictionary entries, or audio (see the sketch after this list).
  • Removing data from indexes
    We filtered out records that come from other academic index sites rather than from original sources. These records are prone to be duplicates in our database, since the database itself comes from a crawler-index service, Microsoft Academic; data coming from other indexes is therefore “double indexed”.
  • Removing data without links to other data
    We filtered out records without any references or citations. We are confident in pruning these records because they are not important in the database at the moment, and we can restore them at any time if we need to.
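
The sketch below combines the doc-type and abstract-length rules. The 20- and 50-word thresholds come from the description above, while the column names and the looks_like_non_paper helper are hypothetical stand-ins for the additional conditions applied to slightly short abstracts:

```python
import pandas as pd

def looks_like_non_paper(row: pd.Series) -> bool:
    """Hypothetical placeholder for the extra conditions applied to
    slightly short abstracts (e.g. letters, dictionary entries, audio)."""
    return row["doc_type"] in {"letter", "dictionary", "audio"}

def filter_records(papers: pd.DataFrame) -> pd.DataFrame:
    # Rule 1: drop patents identified by doc-type.
    papers = papers[papers["doc_type"] != "patent"]

    words = papers["abstract"].fillna("").str.split().str.len()

    # Rule 2a: very short abstracts (< 20 words) are dropped unconditionally.
    very_short = words < 20

    # Rule 2b: slightly short abstracts (20-50 words) are dropped only
    # when the additional conditions also hold.
    slightly_short = words.between(20, 50) & papers.apply(looks_like_non_paper, axis=1)

    return papers[~(very_short | slightly_short)]
```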

Verifying

After filtering the data, we tested our self-citation idea.
This test was conducted to verify 1. that the filtering was successful, by comparing the original data and the filtered data, and 2. that the self-citation idea is valid on higher-quality data.

We blocked on the surname Cruz, which has a suitable block size, and built a network graph from the block (node: author, edge: reference). To keep only the cases needed to validate the idea, we excluded cases where the names matched 100%, because our idea only matters when there is ambiguity in the author’s name.
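
As a rough sketch of this step, assuming each author record carries a name, a paper id, and a list of referenced paper ids (illustrative fields, not our actual schema), the blocking and graph construction could look like this with networkx:

```python
import networkx as nx

def build_block_graph(records: list[dict], surname: str = "Cruz") -> nx.Graph:
    """Build an author graph for one surname block.

    Each record is assumed (hypothetically) to look like:
    {"author_id": ..., "author_name": ..., "paper_id": ..., "references": [...]}
    """
    block = [r for r in records if r["author_name"].split()[-1] == surname]
    by_paper = {r["paper_id"]: r for r in block}

    graph = nx.Graph()
    for record in block:
        graph.add_node(record["author_id"], name=record["author_name"])
        # Draw an edge when one author in the block references a paper
        # written by another author in the same block.
        for ref in record["references"]:
            if ref in by_paper:
                graph.add_edge(record["author_id"], by_paper[ref]["author_id"])
    return graph
```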

We then focused on the cases where names and references were linked, taking the distribution of the name data into account. We referred to the name distribution because the decision to merge two authors with similar names is highly sensitive to how common their names are. Suppose there are 1,000 identities named David Cruz and 2 of them are linked by a reference; then it is hard to be sure they are the same person. But if there are only 2 David Cruzes and they are linked by a reference, there is a high probability that they are the same person. That is, the distribution of the name data determines how reliable a merge is.
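
A minimal sketch of this intuition, with a hypothetical confidence score that shrinks as the name becomes more common in the block:

```python
def merge_confidence(name_count: int) -> float:
    """Hypothetical heuristic: a reference link between two authors with
    the same name is strong evidence when the name is rare in the block
    and weak evidence when it is common."""
    return 2.0 / name_count if name_count >= 2 else 0.0

print(merge_confidence(2))     # 1.0   -> very likely the same person
print(merge_confidence(1000))  # 0.002 -> hard to be sure
```

In practice, the merge decision would combine this kind of prior with the reference-link evidence itself; the point is only that the same link carries very different weight at the two ends of the name distribution.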

Results

Example of network graphs between authors
  • Test on data without filtering
    Based on data covering 47,585 authors, 100,370 papers, and 6,925 references, we got results for 118 cases and could eliminate the name ambiguity of 227 authors.
  • Test on data with filtering
    Based on data covering 34,202 authors, 82,069 papers, and 6,663 references, we got results for 139 cases and could eliminate the name ambiguity of 321 authors.
  • Sum up
    - papers: declined 18.23%
    - authors: declined 28.12%
    - subgraphs: declined 5.34%
    - merged authors: increased 23%
    Through these test results, we were able to verify that the filtering was successful and our ideas worked. We think these results came out because meaningless data was filtered out.

As a result, we had successful outcomes in November, and we are now trying to apply the approach to the whole Scinapse dataset based on November’s work.

Our Data Team always plays an important role behind our services. Through this post, I hope users will learn about the Data Team’s efforts.

And the Pluto team always welcomes anyone who wants to join the Open project with our Data Team.

Pluto Network
Homepage / Github / Facebook / Twitter / Telegram / Medium
Scinapse: Academic search engine
Email: team@pluto.network
