Which encoding mechanism is best for Chinese, English, Japanese, and Korean?

CDS’s founding director Yann LeCun & Ph.D. student Xiang Zhang produce first systematic study of 473 encoding models for text classification on 14 multilingual data sets

As Silicon Valley quarrels over its diversity policies in the wake of the Google Diversity Memo fiasco, CDS’s founding director and director of Facebook AI Research, Yann LeCun, and his doctoral student, Xiang Zhang (NYU Courant), are taking another approach to making the tech community a more inclusive space, at least in terms of its languages.

In “Which Encoding is the Best for Text Classification in Chinese, English, Japanese, and Korean?”, Zhang and LeCun produce the first systematic study of 37 existing encoding methods on 14 large-scale data sets (473 models in total) to find out which one best handles Western and non-Western languages.

Drawing from sites like the Chinese online restaurant review website dianping.com, Japanese online shopping website rakuten.co.jp, Korean online shopping website 11st.co.kr, and The New York Times, they collected 14 multilingual datasets and used over 10 million samples for testing and training. “The model [that] achieved the best consistent performance,” they concluded, “is the character level 5-gram fastText.”

fastText is a popular method developed at Facebook AI Research and is available as open source.
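To make "character level 5-gram" concrete, here is a minimal Python sketch of character n-gram extraction. It assumes that "5-gram" means every character sequence of length 1 through 5, and it is only an illustration of the idea, not the feature pipeline Zhang and LeCun used or fastText's own implementation. The function names `char_ngrams` and `ngram_counts` are made up for this example.

```python
from collections import Counter


def char_ngrams(text, max_n=5):
    """Return character n-grams of length 1..max_n, grouped by length, then position."""
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n characters across the text.
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def ngram_counts(text, max_n=5):
    """Count how often each character n-gram occurs (a simple bag-of-n-grams)."""
    return Counter(char_ngrams(text, max_n))


if __name__ == "__main__":
    # Python strings are sequences of Unicode code points, so the same
    # function works for English and for Chinese, Japanese, or Korean text
    # without any word segmentation step.
    for sample in ["sushi", "寿司", "초밥"]:
        print(sample, char_ngrams(sample, max_n=3))
```

Because the n-grams are taken over characters rather than words, no language-specific tokenizer is needed, which is one reason this kind of encoding is attractive for multilingual text.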

Their study not only helps computer scientists and data scientists perform multilingual text processing more efficiently, but also encourages the development of research projects with a more global dimension. Zhang and LeCun will also release all the code used in the study under an open source license for the community.

Well, data scientists? What are you waiting for? It’s time to buck up and embrace the globe more fully. Learn more about their work here.

By Cherrie Kwok


NYU Center for Data Science

