Bahasa Indonesia Open Sourced NLP Resources
Oct 8, 2016
Few people know about the open-source resources for Bahasa Indonesia NLP, since they are scattered across GitHub. Here are the ones I know of; I hope they help others get started on their NLP projects:
Negative and Positive Unigrams
Stopword List
- https://github.com/pebbie/pebahasa/blob/master/indonesian
- https://github.com/aliakbars/bilp/blob/master/stoplist
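As a rough illustration of how a stopword list like the ones above might be used, here is a minimal Python sketch. It assumes the list is a plain text file with one word per line, which may not match the exact layout of either repository; the sample words are an inline stand-in rather than the contents of those files, and the function names are hypothetical.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines.

    Assumption: one stopword per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


# Inline stand-in for a downloaded stoplist file (not the real file contents).
sample_stoplist = ["yang", "di", "dan", "", "itu"]
stopwords = load_stopwords(sample_stoplist)

tokens = "Buku yang bagus itu ada di meja".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

With a real file, the same function could be fed an open file handle, for example `load_stopwords(open(path, encoding="utf-8"))`, where `path` points to wherever you saved the list.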
POS-Taggers
MWE (Multi Word Expression) Lists
- https://github.com/andryluthfi/indonesian-postag (see the resources folder)
Twitter Sample Corpus
- https://github.com/aliakbars/bilp/tree/master/sample (see the CSV files)
Slang Words Dictionary (Kamus Alay)
- https://raw.githubusercontent.com/nasalsabila/kamus-alay/master/colloquial-indonesian-lexicon.csv (IALP 2018: Colloquial Indonesian Lexicon, http://inacl.id/conferences/ialp2018/accepted-papers/)
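To give an idea of how a slang lexicon like this could be applied, here is a small Python sketch that maps colloquial tokens to their formal forms. The column names `slang` and `formal` are an assumption about the CSV header and should be checked against the actual file; the two sample rows are made up for illustration and are not copied from the lexicon.

```python
import csv
import io


def load_slang_map(csv_file, slang_col="slang", formal_col="formal"):
    """Read a CSV file object into a {slang: formal} dictionary.

    Assumption: the header contains the columns named by
    slang_col and formal_col; adjust them if the real file differs.
    """
    reader = csv.DictReader(csv_file)
    return {
        row[slang_col].strip().lower(): row[formal_col].strip()
        for row in reader
    }


def normalize(tokens, slang_map):
    """Replace each known slang token with its formal form; keep others as-is."""
    return [slang_map.get(t.lower(), t) for t in tokens]


# Illustrative in-memory CSV standing in for the downloaded lexicon.
sample_csv = io.StringIO("slang,formal\ngk,tidak\nbgt,banget\n")
slang_map = load_slang_map(sample_csv)

normalized = normalize("aku gk suka bgt".split(), slang_map)
print(normalized)
```

A plain dictionary lookup like this ignores context, so a slang word with more than one formal reading will always map to whichever row was read last.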
Named Entity Recognition (NER)
- https://github.com/yohanesgultom/nlp-experiments/blob/master/data/ner/training_data.txt
- https://github.com/yusufsyaifudin/indonesia-ner/blob/master/resources/ner/data_train.txt
Universal Dependencies for Bahasa Indonesia
can be found here.
Feel free to comment if there are any other open-source resources that aren't listed here, and I will update the list.