Latin to Thaana Transliteration with a Bidirectional LSTM

We already have a few good Latin to Thaana/Dhivehi transliteration tools, but none of them is built on a machine learning model.

I think a good ML model could give better results than the present rule-based transliteration tools. Rule-based tools depend on consistent spelling, and it is hard to get people to follow strict transliteration rules, let alone enforce them. Although it is not given much priority, Latin-script Dhivehi is now widely used on social media and in text messaging services.

To get good results we need a lot of data to train such a model, as with any ML-based problem. Thaana/Dhivehi datasets are hard to find. To test the idea I had to write a crawler that collected the data from a few local news websites. Most of the sites had the news headline in Latin in the <title> tag, and it matched the Thaana headline 80% of the time.

Code for the model and dataset used is here

Results of the test run:

Basic code for the data scraper:
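
A minimal sketch of how one page could be turned into a training pair, using only the Python standard library. It assumes the Latin headline sits in <title> (as described above) and, purely as an assumption, that the Thaana headline is the first <h1> on the page; real sites will need their own selectors. The names `HeadlineParser`, `extract_pair` and `fetch` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class HeadlineParser(HTMLParser):
    """Collects the text of the first <title> and the first <h1>."""

    def __init__(self):
        super().__init__()
        self._current = None                  # tracked tag we are inside, if any
        self._done = set()                    # tracked tags already closed once
        self.text = {"title": "", "h1": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.text and tag not in self._done:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._done.add(tag)
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.text[self._current] += data


def extract_pair(html):
    """Return (latin_headline, thaana_headline) with whitespace collapsed."""
    parser = HeadlineParser()
    parser.feed(html)
    latin = " ".join(parser.text["title"].split())
    thaana = " ".join(parser.text["h1"].split())   # assumption: headline is in <h1>
    return latin, thaana


def fetch(url):
    """Download a page as text (assumes the site serves UTF-8)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A crawler would call `extract_pair(fetch(url))` for each article URL and keep only the pairs where both parts are non-empty.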

Process the text:
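
A rough sketch of the kind of cleanup and encoding a character-level model needs before training. The choices here are assumptions, not the settings behind the results above: Latin text is lowercased and reduced to a-z, apostrophe and space; Thaana text is reduced to the range U+0780 to U+07B1 plus space; index 0 is reserved for padding. All function names are hypothetical.

```python
import re
import unicodedata

NOT_LATIN = re.compile(r"[^a-z' ]")             # anything outside a-z, apostrophe, space
NOT_THAANA = re.compile(r"[^\u0780-\u07B1 ]")   # anything outside Thaana letters/vowel signs


def clean_latin(text):
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(NOT_LATIN.sub(" ", text).split())


def clean_thaana(text):
    """Keep only Thaana characters and spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(NOT_THAANA.sub(" ", text).split())


def build_vocab(texts):
    """Map every character seen in texts to an integer id starting at 1."""
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i + 1 for i, ch in enumerate(chars)}   # 0 is reserved for padding


def encode(text, vocab, max_len):
    """Turn text into a fixed-length list of ids, truncating or padding with 0."""
    ids = [vocab.get(ch, 0) for ch in text][:max_len]  # unknown characters also map to 0
    return ids + [0] * (max_len - len(ids))
```

Separate vocabularies would be built for the Latin side and the Thaana side, and the encoded sequences become the model's input and target.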

Code to test the model:
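
One way to test a trained model is to score it on held-out headline pairs. The sketch below is an assumption about how that could be done, not the test script used for the results above. `transliterate` is a hypothetical callable that wraps the model's prediction step and maps a Latin string to a Thaana string. It reports the exact-match rate and a character error rate based on edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def evaluate(transliterate, pairs):
    """Score transliterate() on (latin, expected_thaana) pairs."""
    exact = errors = total_chars = 0
    for latin, expected in pairs:
        predicted = transliterate(latin)
        exact += predicted == expected
        errors += edit_distance(predicted, expected)
        total_chars += len(expected)
    return {
        "exact_match": exact / max(len(pairs), 1),
        "cer": errors / max(total_chars, 1),   # character error rate
    }
```

The character error rate is useful here because a headline with one wrong vowel sign still counts as mostly correct, which the exact-match rate alone would hide.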