China to Overtake US in AI Research

By Field Cady and Oren Etzioni | Allen Institute for Artificial Intelligence

Abstract

In 2017, China announced plans to become the world leader in AI by 2030. In response, the Semantic Scholar project has analyzed over two million academic AI papers published through the end of 2018. Our analysis shows that China has already surpassed the US in published AI papers. If current trends continue, China is poised to overtake the US in the most-cited 50% of papers this year, in the most-cited 10% of papers next year, and in the most-cited 1% of papers by 2025. Citation counts are a lagging indicator of impact, so our results may understate the rising impact of AI research originating in China.

Results

We found that China overtook the US in the number of AI research papers in 2006 due to a surge in published research that began around 2001 and peaked in 2010 (long before the Chinese government’s announcement). Overall, the number of AI papers published worldwide increased from just shy of 5,000 in 1985 to over 143,000 by 2018.

However, not all papers are created equal. Chinese researchers are sometimes stereotyped as making incremental research contributions, such as those tracked by various leaderboards (e.g., leaderboard.allenai.org). To better assess the quality of the research, we ranked all papers published in a given year by the number of citations each paper received and examined how many of the most-cited papers came from each country.
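The ranking step described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function name `top_share_by_country`, the tuple layout, and the "US"/"CN" labels are assumptions made for the example, and the toy records are invented.

```python
from collections import defaultdict

def top_share_by_country(papers, top_fraction):
    """For each year, return each country's share of the most-cited papers.

    papers: iterable of (year, citations, countries), where countries is a
    set of labels such as {"US"}, {"CN"}, {"US", "CN"}, or an empty set.
    """
    by_year = defaultdict(list)
    for year, citations, countries in papers:
        by_year[year].append((citations, countries))

    shares = {}
    for year, rows in by_year.items():
        # Rank this year's papers from most to least cited.
        rows.sort(key=lambda row: row[0], reverse=True)
        # Keep at least one paper so that sparse years are not empty.
        k = max(1, int(len(rows) * top_fraction))
        counts = defaultdict(int)
        for _, countries in rows[:k]:
            for country in countries:
                counts[country] += 1
        shares[year] = {country: n / k for country, n in counts.items()}
    return shares

# Invented toy records: (year, citation count, country labels).
toy_papers = [
    (2018, 90, {"US"}),
    (2018, 80, {"US", "CN"}),
    (2018, 10, {"CN"}),
    (2018, 5, set()),
]
toy_shares = top_share_by_country(toy_papers, top_fraction=0.5)
```

Because a paper can belong to both countries, the shares for a given year need not sum to one. Ties at the cutoff are broken arbitrarily in this sketch.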

This focus on high-impact papers shows a clear trend of Chinese ascendance in the field of AI. Looking at the top 10% of papers, we see the US’s share has declined gradually from a high of 47% in 1982 to a low of 29% in 2018. China, on the other hand, has been rising steeply with a peak of 26.5% in 2018 and every indication of this trend continuing.

If we fit a line to the trends of the last 5 years, we can see that China and the US are set to converge in early 2020 for papers in the top 10% and in 2025 for papers in the top 1%, as shown in the accompanying figure.
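A linear extrapolation of this kind can be sketched with an ordinary least-squares fit per country and the intersection of the two fitted lines. The helper names below are hypothetical, and the share values are made-up placeholders chosen for easy arithmetic; they are not the series from our data and will not reproduce the dates above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def crossover_year(years, us_share, cn_share):
    """Year at which the two fitted lines meet, or None if they are parallel."""
    us_slope, us_intercept = fit_line(years, us_share)
    cn_slope, cn_intercept = fit_line(years, cn_share)
    if us_slope == cn_slope:
        return None
    # Solve us_slope * x + us_intercept == cn_slope * x + cn_intercept for x.
    return (cn_intercept - us_intercept) / (us_slope - cn_slope)

# Placeholder percentages for five consecutive years (NOT the study's data).
years = [2014, 2015, 2016, 2017, 2018]
us_placeholder = [40.0, 39.0, 38.0, 37.0, 36.0]
cn_placeholder = [30.0, 31.0, 32.0, 33.0, 34.0]
crossing = crossover_year(years, us_placeholder, cn_placeholder)
```

A straight-line fit over five points is sensitive to the choice of window, so the projected crossover should be read as a rough indication rather than a forecast.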

Methodological Details

For this analysis, we defined an AI paper as a journal or conference paper associated with the “artificial intelligence” field of study in the Microsoft Academic Graph. The citation numbers we used were the estimated citation counts from the data source. Results were very similar if, instead, we examined only the number of known citations.

Associating papers with countries is somewhat more complicated since papers can have multiple authors with multiple affiliations. Therefore, we focused on classifying affiliations as being US, Chinese, or other.

As a heuristic, we classified an institution as being from the US if its website ended in .com or .edu. An institution was classified as Chinese if:

  • Its website ended in .cn or .hk
  • The name of the institution contained the word “China” or “Chinese”
  • The name of the institution contained one of the following place or university names: Beijing, Shanghai, Tsinghua, Tianjin, Wuhan, Huazhong, Zhejiang, Xidian, Nanjing, Shandong, Shenzhen
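The affiliation heuristic above might be expressed roughly as follows. This is a sketch, not the released code: the function names and the "US"/"CN"/"other" labels are illustrative, and the order in which the rules are applied (Chinese rules first, then the US rule) is an assumption, since the text does not state a precedence.

```python
from urllib.parse import urlparse

# Keywords taken from the rules listed above, lower-cased for matching.
CHINESE_NAME_KEYWORDS = (
    "china", "chinese", "beijing", "shanghai", "tsinghua", "tianjin", "wuhan",
    "huazhong", "zhejiang", "xidian", "nanjing", "shandong", "shenzhen",
)

def _hostname(url):
    """Lower-cased host of a URL; tolerates a missing scheme or a missing URL."""
    if not url:
        return ""
    if "//" not in url:
        url = "//" + url
    return (urlparse(url).hostname or "").lower()

def classify_affiliation(official_page, name):
    """Return "CN", "US", or "other" for one affiliation record."""
    host = _hostname(official_page)
    lowered = (name or "").lower()
    # Assumption: the Chinese rules are checked before the US rule.
    if host.endswith((".cn", ".hk")):
        return "CN"
    if any(keyword in lowered for keyword in CHINESE_NAME_KEYWORDS):
        return "CN"
    if host.endswith((".com", ".edu")):
        return "US"
    return "other"
```

Under this sketch, `classify_affiliation("https://www.alibaba.com", "alibaba group")` returns "US", which is the kind of edge case discussed under Limitations.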

We classified a paper as a US paper if any of its author affiliations were from the US. A paper was considered Chinese in origin if any of its author affiliations were Chinese or if the paper itself was written in Simplified Chinese. The top Chinese research institutions, sorted by the number of citations in our data, are shown below.¹

Top Chinese research institutions by number of citations

Of course, it is possible for a paper to be associated with both the US and China, usually because the paper has several authors. This was relatively rare; only 5.5% of AI papers from the US were also from China and 5.6% vice versa.
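The paper-level rule and the overlap figures can be sketched as below. The function names are hypothetical, the language code "zh_chs" for Simplified Chinese is an assumption about how the data source encodes it, and the toy papers are invented.

```python
def paper_countries(affiliation_labels, language_codes=(),
                    simplified_chinese_code="zh_chs"):
    """Country labels for one paper; a paper may carry both "US" and "CN".

    affiliation_labels: labels ("US", "CN", "other") of the paper's affiliations.
    language_codes: language codes recorded for the paper.
    simplified_chinese_code: assumed code for Simplified Chinese.
    """
    countries = set()
    if "US" in affiliation_labels:
        countries.add("US")
    if "CN" in affiliation_labels or simplified_chinese_code in language_codes:
        countries.add("CN")
    return countries

def overlap_rates(papers):
    """Return (share of US papers that are also CN, share of CN papers also US)."""
    us = sum(1 for countries in papers if "US" in countries)
    cn = sum(1 for countries in papers if "CN" in countries)
    both = sum(1 for countries in papers if {"US", "CN"} <= countries)
    return (both / us if us else 0.0, both / cn if cn else 0.0)

# Invented toy papers, each described by its affiliation labels and languages.
toy = [
    paper_countries(["US", "other"]),
    paper_countries(["US"]),
    paper_countries(["US", "CN"]),
    paper_countries(["other"], language_codes=["zh_chs"]),
    paper_countries(["other"]),
]
us_also_cn, cn_also_us = overlap_rates(toy)
```

The two rates differ only because they share a numerator but use different denominators, which is why the reported percentages are close but not identical.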

Data and Code

This analysis uses data from the Microsoft Academic Graph (Feb 21, 2019 version). We use the following tables and their fields:

  • Papers.txt: estimated citation count, year
  • PaperAuthorAffiliations: Paper ID, Author ID, AuthorSequenceNumber, Affiliation ID
  • Affiliations: Affiliation ID, OfficialPage, NormalizedName
  • PaperLanguages: Paper ID, Language Code
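As a rough illustration of how these tables fit together, the sketch below joins author-affiliation links to affiliation records by Affiliation ID. It assumes header-less, tab-separated input; the column keys (`PaperId`, `AffiliationId`, and so on) are shorthand for the fields listed above, the reduced column layouts are simplifications of the real files, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

def read_rows(text, columns):
    """Parse header-less tab-separated text into dicts keyed by `columns`."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]

def affiliations_per_paper(paper_author_affiliations, affiliations):
    """Map each Paper ID to the distinct affiliation records of its authors."""
    by_id = {aff["AffiliationId"]: aff for aff in affiliations}
    result = defaultdict(list)
    for link in paper_author_affiliations:
        aff = by_id.get(link["AffiliationId"])
        # Skip links to unknown affiliations and avoid duplicate entries.
        if aff is not None and aff not in result[link["PaperId"]]:
            result[link["PaperId"]].append(aff)
    return dict(result)

# Invented sample rows in the assumed (simplified) column order.
links = read_rows(
    "p1\ta1\t1\tf1\np1\ta2\t2\tf2\np2\ta3\t1\tf1\n",
    ["PaperId", "AuthorId", "AuthorSequenceNumber", "AffiliationId"],
)
affs = read_rows(
    "f1\thttp://www.example.edu\texample university\n"
    "f2\thttp://www.example.cn\texample institute\n",
    ["AffiliationId", "OfficialPage", "NormalizedName"],
)
joined = affiliations_per_paper(links, affs)
```

Each joined record exposes the OfficialPage and NormalizedName fields that the country heuristic relies on.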

The source code for this work can be found on GitHub.

Limitations

Our methodology for associating author affiliations with countries required some approximations. There are, for example, some edge cases where a Chinese company, like Alibaba, has a website ending in .com and accordingly gets counted toward the US. This effect is limited, however, since .com affiliations were relatively rare; less than 2% of Chinese papers had any .com affiliations and only 5.2% of US papers had one. Of course, raw citation counts are a noisy measure of impact. We plan to replicate this study using Semantic Scholar’s Highly Influential Citations metric to mitigate the noise associated with raw citation counts.

Related Work

This work is not the first to examine the scientific literature to study the research contributions of different countries, but ours is the most up-to-date study examining papers published through the end of 2018. Our study is also unique in focusing on the most impactful papers.

Dong et al. (2017) studied scientific research as a single entity, cataloging the shift from Europe and America to a more international and global research community. However, they did not break out specific fields, like artificial intelligence.

Shoham et al. (2018) is much closer to the present work in that it focuses on AI specifically. It contains many similar findings to our own, including the spike in Chinese research output around 2009 and the increasing rate at which Chinese papers are being cited. However, they did not specifically look at top-tier papers. Their findings about citation rate could be equally well explained by China producing more medium-quality or fewer low-quality papers. In addition, their data for citations goes only through 2016.

Tsinghua University produced a similar report with data going through 2017. Their approach is more similar to ours in that they define a notion of high-value papers and look at how many of them are produced by different countries over the years. However, they define “highly-cited” papers by comparing them against all papers in their ESI field (which includes all of computer science) and look only at the top 1%. This is more stringent than our exploration of AI papers in the top 10% and is susceptible to interference from other areas of computer science. Like us, they found that the US has had a stable output of highly-cited papers, demonstrating the sustained high impact of US papers alongside the growth of Chinese papers. We are uncertain of their methodology, so it’s difficult to draw more definitive conclusions.

Conclusions

Our data shows that impactful Chinese investment in AI research pre-dates their 2017 announcement regarding AI supremacy by more than a decade. By most measures, China is overtaking the US not just in papers submitted and published, but also in the production of high-impact papers as measured by the top 50%, top 10%, and top 1% most-cited papers. By projecting current trends, we see that China is likely to have more top-10% papers by 2020 and more top-1% papers by 2025.

When we look at Best Paper Awards (whose choice-by-committee is somewhat idiosyncratic), we do see the US as firmly ahead. However, even at the very top of the field, we see outstanding research by Chinese authors, including Best Papers at AAAI 2012, ACL 2012, and CVPR 2017, among others.

Future work from the Semantic Scholar team will examine whether authors are more likely to cite authors of the same nationality and whether that will accelerate the observed trends as Chinese authors become increasingly prolific and prominent. Recent US actions that place obstacles to recruiting and retaining foreign students and scholars are likely to exacerbate the trend towards Chinese supremacy in AI research.

Acknowledgements

We are grateful to Iris Shen at Microsoft for helpful discussions about data and methodology, and to Microsoft Research for creating, updating, and sharing the Microsoft Academic Graph. Thanks to Nicole DeCario and Carissa Schoenick for their feedback on an earlier draft.


¹The absence of Microsoft Research Asia is conspicuous and may be due to a defect in the underlying data set.


References

Dong, Yuxiao, et al. (2017). A Century of Science: Globalization of Scientific Collaborations, Citations, and Innovations. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1437–1446.

Yoav Shoham, Raymond Perrault, Erik Brynjolfsson, Jack Clark, James Manyika, Juan Carlos Niebles, Terah Lyons, John Etchemendy, Barbara Grosz and Zoe Bauer (2018). The AI Index 2018 Annual Report. AI Index Steering Committee, Human-Centered AI Initiative, Stanford University, Stanford, CA.

China Institute for Science and Technology Policy at Tsinghua University (2018). China Development Report 2018.

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang (2015). An Overview of Microsoft Academic Service (MA) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243–246.