Which encoding mechanism is best for Chinese, English, Japanese, and Korean?
CDS’s founding director Yann LeCun & Ph.D. student Xiang Zhang produce the first systematic study of text encodings for classification, comparing 473 models on 14 multilingual datasets
As Silicon Valley quarrels over its diversity policies in the wake of the Google Diversity Memo fiasco, CDS’s founding director and director of Facebook AI Research, Yann LeCun, and his doctoral student, Xiang Zhang (NYU Courant), are taking a different approach to making the tech community a more inclusive space, at least in terms of its languages.
In “Which Encoding is the Best for Text Classification in Chinese, English, Japanese, and Korean?”, Zhang and LeCun present the first systematic study of 37 existing encoding methods on 14 large-scale datasets (473 models in total) to find out which one best handles Western and non-Western languages.
Drawing from sites like the Chinese restaurant-review website dianping.com, the Japanese online shopping website rakuten.co.jp, the Korean online shopping website 11st.co.kr, and The New York Times, they collected 14 multilingual datasets and used over 10 million samples for training and testing. “The model [that] achieved the best consistent performance,” they concluded, “is the character level 5-gram fastText.”
fastText is a popular text classification method developed at Facebook AI Research and released as open-source software.
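For the curious, here is a minimal sketch of what the winning approach might look like in practice, using the open-source fastText Python bindings. The file name, labels, and hyperparameters below are illustrative assumptions, not the exact configuration from Zhang and LeCun’s study. The idea is to split each document into space-separated characters so that fastText’s word n-grams become character n-grams, which conveniently sidesteps word segmentation for Chinese, Japanese, and Korean text.

```python
# A minimal sketch of character-level 5-gram fastText classification,
# using the open-source `fasttext` Python bindings (pip install fasttext).
# File name, labels, and hyperparameters are illustrative assumptions,
# not the exact configuration from Zhang and LeCun's study.
import fasttext

def to_char_tokens(text: str) -> str:
    """Split text into space-separated characters so that fastText's
    word n-grams act as character n-grams (no word segmentation needed)."""
    return " ".join(ch for ch in text if not ch.isspace())

# fastText's supervised format: one example per line, label prefix first.
samples = [
    ("pos", "The food was wonderful"),  # English
    ("neg", "服务太慢了"),               # Chinese: "the service was too slow"
]
with open("train.txt", "w", encoding="utf-8") as f:
    for label, text in samples:
        f.write(f"__label__{label} {to_char_tokens(text)}\n")

# With character tokens as input, wordNgrams=5 yields character 5-grams.
model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=5,  # 5-grams over character tokens
    minCount=1,    # keep every character (the demo corpus is tiny)
    epoch=25,
    dim=50,
)

labels, probs = model.predict(to_char_tokens("The dessert was wonderful"))
print(labels, probs)
```

Working directly at the character level matters here because Chinese, Japanese, and Korean scripts do not mark word boundaries with whitespace, so a character-based model travels across all four languages without a language-specific tokenizer.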
Their study not only helps computer scientists and data scientists process multilingual text more efficiently, but also encourages research projects with a more global dimension. Zhang and LeCun will also release all of the code from their study under an open-source license for the community.
Well, data scientists, what are you waiting for? It’s time to buck up and embrace the globe more fully. Learn more about their work here.
By Cherrie Kwok