Language identification in Python using fastText
What is fastText?
fastText is a library created by Facebook’s AI Research lab for efficient learning of word representations and sentence classification. It supports both supervised and unsupervised training for obtaining vector representations of words. fastText also provides pre-trained word vectors for 157 languages, trained with unsupervised algorithms. Read more about them at https://fasttext.cc/docs/en/crawl-vectors.html
fastText for language identification
fastText has also published a fast and accurate tool for text-based language identification that recognizes 176 languages. The tool is open source and free for anyone to use. Two versions of the model are available, trading a little accuracy for a much smaller file size.
- lid.176.bin, which is faster and slightly more accurate, but has a file size of 126MB.
- lid.176.ftz, which is the compressed version of the model, with a file size of 917kB. The smaller file size comes at a slight cost in accuracy.
Diving into the code
- First, install the fasttext library using pip install fasttext
- Second, download one of the pre-trained models, lid.176.bin (126MB) or lid.176.ftz (917kB), depending on your use case. If memory is not a constraint and high accuracy is needed, use lid.176.bin; if memory is tight, use lid.176.ftz.
- Third, load the model and call its predict method on the text you want to identify.
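The steps above can be sketched roughly as follows. This is a minimal illustration, not the original snippet: it assumes the model file has already been downloaded into the working directory, and the helper names load_language_model and identify_language are made up for this example. Only fasttext.load_model and model.predict come from the fastText library itself.

```python
from pathlib import Path

# Assumption: the model file sits in the working directory.
# Swap in "lid.176.bin" if memory is not a concern.
MODEL_PATH = "lid.176.ftz"


def load_language_model(model_path=MODEL_PATH):
    """Load a pre-trained fastText language identification model."""
    # Imported lazily so this module can be loaded without fasttext installed.
    import fasttext

    return fasttext.load_model(str(model_path))


def identify_language(model, text, k=2):
    """Return the top-k (labels, confidences) pair for one piece of text."""
    # predict() processes one line at a time and raises on newline characters,
    # so newlines are replaced with spaces first.
    single_line = text.replace("\n", " ")
    return model.predict(single_line, k=k)


if __name__ == "__main__":
    if Path(MODEL_PATH).exists():
        model = load_language_model()
        print(identify_language(model, "Hej"))
    else:
        print(f"Download {MODEL_PATH} first, then run this script again.")
```

The k argument controls how many candidate languages are returned; k=2 matches the two-label result discussed in the observations.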
A few observations to consider
- The object returned by the model is a tuple (<class 'tuple'>) of the form (('__label__pl', '__label__sv'), array([0.40688798, 0.23321952])), where pl and sv are the ISO 639 codes for Polish and Swedish. The prediction for both languages is correct, as Hej means Hello in both languages. The second element gives the model's confidence that the sentence belongs to each of those languages.
- The longer the sentence, the more accurate the predictions.
- The higher the confidence returned by the model, the higher the probability that the sentence belongs to that language.
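Since the labels carry a __label__ prefix and the confidences arrive as a separate array, it can help to turn the result into plain (language code, confidence) pairs. The sketch below is one way to do that; parse_prediction and its min_confidence parameter are hypothetical names introduced here, not part of fastText.

```python
LABEL_PREFIX = "__label__"


def parse_prediction(prediction, min_confidence=0.0):
    """Convert a (labels, confidences) result into (iso_code, confidence) pairs.

    Pairs whose confidence is below min_confidence are dropped.
    """
    labels, confidences = prediction
    results = []
    for label, confidence in zip(labels, confidences):
        # Strip the "__label__" prefix to leave only the ISO 639 code.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        if confidence >= min_confidence:
            results.append((code, float(confidence)))
    return results


# The result quoted above, with a plain list standing in for the NumPy array.
example = (("__label__pl", "__label__sv"), [0.40688798, 0.23321952])
print(parse_prediction(example))
# → [('pl', 0.40688798), ('sv', 0.23321952)]
print(parse_prediction(example, min_confidence=0.3))
# → [('pl', 0.40688798)]
```

A threshold like min_confidence is a design choice for the caller: short inputs such as Hej tend to yield low, split confidences, so a strict cutoff may discard every candidate.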
Code
The same code can be found in my GitHub repository at https://github.com/c-chaitanya/language-identification
Reference
https://fasttext.cc/docs/en/language-identification.html
https://fasttext.cc/blog/2017/10/02/blog-post.html
Conclusion
These four lines of code can be wrapped in a Flask application, served as a RESTful API, and containerized with Docker as well. Do let me know your feedback.
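As a rough idea of what the Flask wrapping could look like, here is a minimal sketch. The /detect route, the JSON field names, and the function names detect_languages and create_app are all assumptions made for this example; the model object is expected to be one returned by fasttext.load_model.

```python
def detect_languages(model, text, k=1):
    """Run the model on one piece of text and return JSON-friendly results."""
    # predict() rejects newline characters, so they are replaced with spaces.
    labels, confidences = model.predict(text.replace("\n", " "), k=k)
    return [
        {"language": label.replace("__label__", "", 1), "confidence": float(score)}
        for label, score in zip(labels, confidences)
    ]


def create_app(model):
    """Build a Flask app exposing the model at POST /detect (hypothetical route)."""
    # Imported lazily so detect_languages stays usable without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/detect", methods=["POST"])
    def detect_route():
        payload = request.get_json(silent=True) or {}
        text = payload.get("text")
        if not isinstance(text, str) or not text.strip():
            return jsonify(error="JSON field 'text' must be a non-empty string"), 400
        return jsonify(predictions=detect_languages(model, text, k=2))

    return app
```

To try it locally, one could load the model once at startup and call create_app(model).run(); keeping the prediction logic in detect_languages, separate from the route, makes it easy to test without starting a server.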
Feel free to connect with me on LinkedIn at https://www.linkedin.com/in/c-chaitanya/