Language identification in Python using fastText

C CHAITANYA
2 min read · May 23, 2020


What is fastText?

fastText is a library created by Facebook’s AI Research lab for efficient learning of word representations and sentence classification. It lets us train both supervised models (text classifiers) and unsupervised models (vector representations of words). fastText also provides pre-trained word vectors for 157 languages, learned with unsupervised algorithms. Read more about them at https://fasttext.cc/docs/en/crawl-vectors.html

fastText for language identification

The fastText team has also published a fast and accurate tool for text-based language identification that recognizes 176 languages. The tool is open source and free for anyone to use. Two versions of the model are available, trading a little accuracy against file size and memory use.

  • lid.176.bin, which is faster and slightly more accurate, but has a file size of 126 MB.
  • lid.176.ftz, which is the compressed version of the model, with a file size of 917 kB. The smaller file size comes at the cost of slightly lower accuracy.
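
As a rough sketch, either file could be fetched from Python with the standard library. The base URL below is an assumption (it is the download location linked from the fastText language identification page at the time of writing and may move), and model_url and download_model are hypothetical helper names, not part of fastText.

```python
import urllib.request
from pathlib import Path

# Assumed download location for the language identification models;
# check the fastText documentation if this has moved.
BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/"

# The two published variants described above.
MODEL_FILES = ("lid.176.bin", "lid.176.ftz")


def model_url(filename):
    """Build the download URL for one of the two model files."""
    if filename not in MODEL_FILES:
        raise ValueError(f"unknown model file: {filename}")
    return BASE_URL + filename


def download_model(filename="lid.176.ftz", target_dir="."):
    """Download the model if it is not already present and return its path."""
    path = Path(target_dir) / filename
    if not path.exists():  # skip the download when the file is already there
        urllib.request.urlretrieve(model_url(filename), str(path))
    return path
```

Calling download_model("lid.176.bin") would fetch the larger model instead; the default picks the compressed one because it is the quicker download.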

Diving into the code

  • First, install the fasttext library using pip install fasttext.
  • Second, download one of the pre-trained models, lid.176.bin (126 MB) or lid.176.ftz (917 kB), depending on your use case. If memory is not a constraint and you need the highest accuracy, use lid.176.bin; if memory is tight, use lid.176.ftz.
  • Third, load the model and call its predict method on the text.
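
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the repository: it assumes the model file has already been downloaded to the working directory, and load_language_model and detect_languages are hypothetical helper names. The only fastText calls used are fasttext.load_model and the model's predict method with its k argument.

```python
def load_language_model(path="lid.176.ftz"):
    """Load a downloaded language identification model from disk."""
    import fasttext  # imported here so the helper below has no hard dependency

    return fasttext.load_model(path)


def detect_languages(model, text, k=2):
    """Return the k most likely (language code, confidence) pairs for text."""
    # predict() handles one line at a time, so newlines are replaced first.
    labels, confidences = model.predict(text.replace("\n", " "), k=k)
    return [
        (label.replace("__label__", ""), float(confidence))
        for label, confidence in zip(labels, confidences)
    ]
```

With a model file in place, detect_languages(load_language_model(), "Hej") returns the two best guesses as plain language codes with their confidences, instead of the raw __label__ strings.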

A few observations to consider

  • For the input Hej, the object returned by the model is a tuple of the form (('__label__pl', '__label__sv'), array([0.40688798, 0.23321952])), where pl and sv are the ISO 639 codes for Polish and Swedish. Both predictions are correct, as Hej means hello in both languages. The second element gives the model's confidence for each label, in the same order.
  • The longer the sentence, the more accurate the predictions tend to be.
  • The higher the confidence returned by the model, the higher the probability that the sentence belongs to that language.
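
One way to act on these observations is to accept a prediction only when its confidence clears a cutoff. The sketch below works on the tuple shape described above; best_language is a hypothetical helper, and the default cutoff of 0.5 is an arbitrary assumption that would need tuning on real data.

```python
def best_language(prediction, min_confidence=0.5):
    """Return the top language code, or None if the model is not confident enough."""
    labels, confidences = prediction
    if len(labels) == 0:
        return None
    # Pair each label with its confidence and pick the highest-scoring one.
    label, confidence = max(zip(labels, confidences), key=lambda pair: pair[1])
    if confidence < min_confidence:
        return None
    return label.replace("__label__", "")


# The prediction reported above for the input "Hej".
hej = (("__label__pl", "__label__sv"), [0.40688798, 0.23321952])
print(best_language(hej))                      # None: 0.41 is below the 0.5 cutoff
print(best_language(hej, min_confidence=0.3))  # pl
```

Returning None for short, ambiguous inputs such as a single word lets the caller fall back to a default language or ask for more text.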

Code

The same code can be found in my GitHub repo at https://github.com/c-chaitanya/language-identification

References

https://fasttext.cc/docs/en/language-identification.html

https://fasttext.cc/blog/2017/10/02/blog-post.html

Conclusion

These four lines of code can be wrapped in a Flask application, served as a RESTful API, and containerized using Docker as well. Do let me know your feedback.
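
A minimal sketch of what such a Flask wrapper might look like is given here. It assumes Flask and fasttext are installed and the model file sits next to the script; the /identify route, the JSON field names, and the identify and create_app helpers are all hypothetical choices made for illustration.

```python
def identify(model, payload, k=1):
    """Turn a request payload into a (response body, HTTP status) pair."""
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return {"error": "expected a non-empty 'text' field"}, 400
    # predict() handles one line at a time, so newlines are replaced first.
    labels, confidences = model.predict(text.replace("\n", " "), k=k)
    languages = [
        {"language": label.replace("__label__", ""), "confidence": float(conf)}
        for label, conf in zip(labels, confidences)
    ]
    return {"languages": languages}, 200


def create_app(model_path="lid.176.ftz"):
    """Build a Flask app exposing POST /identify (a hypothetical route name)."""
    # Imported lazily so identify() can be tested without Flask or fastText.
    import fasttext
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = fasttext.load_model(model_path)  # load once, reuse for every request

    @app.route("/identify", methods=["POST"])
    def identify_route():
        body, status = identify(model, request.get_json(silent=True))
        return jsonify(body), status

    return app
```

Calling create_app().run() starts Flask's development server for local testing. Keeping the request handling in a plain function means it can be checked with a stub model before Flask or Docker are involved.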

Feel free to connect with me on LinkedIn at https://www.linkedin.com/in/c-chaitanya/
