Bahasa Indonesia Open Sourced NLP Resources

Arie Pratama Sutiono


Few people know about the open-source resources available for Bahasa Indonesia NLP, since they are scattered across GitHub. Here are the ones I know of; I hope they help others get started on their NLP projects:

Negative and Positive Unigrams

  1. https://github.com/masdevid/ID-OpinionWords/

Stopword List

  1. https://github.com/pebbie/pebahasa/blob/master/indonesian
  2. https://github.com/aliakbars/bilp/blob/master/stoplist
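As a minimal sketch of how a stopword list like these might be used, the snippet below filters stopwords out of a tokenized sentence. It assumes the file is plain text with one word per line, which should be checked against the actual files. The function names `load_stopwords` and `remove_stopwords` are hypothetical, and the sample words are a small in-memory stand-in rather than the contents of either repository.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines.

    Assumption: one stopword per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stopwords, preserving order."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Tiny in-memory sample standing in for a downloaded stopword file.
sample_lines = ["yang\n", "di\n", "dan\n", "\n", "itu\n"]
stopwords = load_stopwords(sample_lines)

# Naive whitespace tokenization, good enough for illustration.
tokens = "Buku yang ada di meja itu sangat bagus".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

With a real file, the same function could be fed an open file handle, for example `load_stopwords(open(path, encoding="utf-8"))`, where `path` points at a local copy of one of the lists above.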

POS-Taggers

  1. https://github.com/pebbie/pebahasa (python)
  2. https://github.com/andryluthfi/indonesian-postag (java)

MWE (Multi Word Expression) Lists

  1. https://github.com/andryluthfi/indonesian-postag (see the resources folder)

Twitter Sample Corpus

  1. https://github.com/aliakbars/bilp/tree/master/sample (see the CSV files)

Slang Words Dictionary (Kamus Alay)

  1. https://raw.githubusercontent.com/nasalsabila/kamus-alay/master/colloquial-indonesian-lexicon.csv (IALP 2018: Colloquial Indonesian Lexicon, http://inacl.id/conferences/ialp2018/accepted-papers/)
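A slang dictionary like this is typically used to normalize informal text before further processing. The sketch below shows one way that could look. It assumes the CSV has a header row with one column for the slang form and one for the formal form; the column names `slang` and `formal` are assumptions that should be verified against the actual file, and the sample rows are made up for illustration. `load_lexicon` and `normalize` are hypothetical helper names.

```python
import csv
import io


def load_lexicon(csv_file, slang_col="slang", formal_col="formal"):
    """Read a slang-to-formal mapping from a CSV file object.

    Assumption: the file has a header row containing the two column
    names passed in; adjust them to match the real file.
    """
    reader = csv.DictReader(csv_file)
    return {
        row[slang_col].strip().lower(): row[formal_col].strip()
        for row in reader
    }


def normalize(text, lexicon):
    """Replace each whitespace-separated token that has a lexicon entry."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in text.split())


# Made-up sample rows standing in for the downloaded CSV.
sample_csv = io.StringIO("slang,formal\ngw,saya\nga,tidak\nbgt,banget\n")
lexicon = load_lexicon(sample_csv)

normalized = normalize("gw ga suka bgt", lexicon)
print(normalized)
```

This lookup is deliberately simple: it ignores punctuation attached to tokens and keeps only one formal form per slang word, so a real pipeline would likely need a proper tokenizer and a strategy for ambiguous entries.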

Named Entity Recognition (NER)

  1. https://github.com/yohanesgultom/nlp-experiments/blob/master/data/ner/training_data.txt
  2. https://github.com/yusufsyaifudin/indonesia-ner/blob/master/resources/ner/data_train.txt

Universal Dependencies for Bahasa Indonesia

can be found here.

Feel free to comment if there are any other open-source resources that aren't listed here, and I will update the list.
