Focusing on the Basics

Excellence is achieved by the mastery of the fundamentals

Well, in any project that we dive into, it is important to set a basic layout or structure first and then try to follow it. This is easier said than done, and trust me when I say this: I am not that good at it. (** HAHA **).

In the case of developing the Language Model and the Acoustic Model, the CMU Sphinx documentation has already set that out for us, or at least the way to organise things.

Here is the plan in my case.

  1. Setting up the environment for development. ( ✔ )
  2. Get source text from Malayalam subtitles, extract the words and sentences, and thank them. ( ✔ )
  3. Build a phonetic dictionary ( see sample here ) that contains all the words transcribed to their corresponding phonetic representations, and build a matching phone-set file ( see sample here ) containing all the phones present in the phonetic dictionary. ( I am here )
  4. Build the Language Model using the Language Modelling Toolkit.
  5. Prepare data for Acoustic Model training and build the model.
  6. Training phase.
  7. Testing….
  8. More testing….
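
Step 3 is easier to picture with a small sketch. This assumes the phonetic dictionary is plain text with one entry per line in the form `word phone1 phone2 ...`, and that the phone-set file is just the distinct phones, one per line, plus a silence phone `SIL`. The function name and the sample entries are hypothetical placeholders (English ones, for readability), not my actual Malayalam files.

```python
def phones_from_dictionary(lines):
    """Collect the distinct phones used in dictionary lines.

    Assumed line format: 'word phone1 phone2 ...'.
    """
    phones = set()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words without a transcription
        phones.update(parts[1:])  # everything after the word is a phone
    phones.add("SIL")  # assumption: the phone set also lists a silence phone
    return sorted(phones)


# Hypothetical sample entries, only to show the shape of the data.
sample_dictionary = [
    "hello HH AH L OW",
    "world W ER L D",
    "",
]

phone_set = phones_from_dictionary(sample_dictionary)
print("\n".join(phone_set))  # one phone per line, as in a phone-set file
```

Deriving the phone set from the dictionary, rather than typing it by hand, should keep the two files from drifting apart as more words get transcribed.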

This week has been a busy one, getting the source files downloaded. One thing that always troubles me is that I do all my work on my desktop, and unfortunately there is no backup power supply. So, in a place like Kerala, where there are erratic power failures during the rainy season (** err.. summer of code? **), things get a bit interesting. Nonetheless, it has not been much of an issue yet, and things are going somewhat smoothly.

Ohh, and I just received a call from FedEx. The Google goodies are on their way! (** giggles **)

Now, let’s go a bit into the extracted words and sentences. Here is the script I used to extract the words and sentences. The script is not complete and is buggy for this purpose. It was just a quick tweak to an already existing script that I had used for some other purpose. But it gets the job done with a little help from the TextFX plugin of Notepad++.

OOPs … hmmm … … I mean Object Oriented Programming…

The output file produced by the script had a significant number of blank lines between the sentences (the bug that I referred to). But this was fixed in a jiffy by the TextFX plugin of Notepad++. The sentence file has around 5322 lines of dialogue and the word file contains around 7500 words.

Geez, those are some serious numbers!

The reason behind such large numbers (which are still nowhere near what a real system would need) is to get a better-trained model from the outset.
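
For the curious, the extraction step described above can be sketched roughly like this. It is not the script I actually used, just a minimal version, assuming the subtitles are in the usual SRT layout (a counter line, a timestamp line, the dialogue lines, then a blank line). The function names and the tiny sample are made up for illustration. Skipping empty lines up front also avoids the blank-line problem mentioned earlier.

```python
import re

# Assumed SRT timestamp line, e.g. "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def extract_sentences(srt_text):
    """Return the dialogue lines, dropping counters, timestamps and blanks."""
    sentences = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Limitation: a dialogue line made only of digits is dropped too.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        sentences.append(line)
    return sentences


def extract_words(sentences):
    """Return the sorted set of distinct words, with edge punctuation removed."""
    words = set()
    for sentence in sentences:
        for token in sentence.split():
            token = token.strip(".,!?;:\"'()-")
            if token:
                words.add(token)
    return sorted(words)


# Hypothetical two-cue subtitle, only to show the shape of the input.
sample_srt = """1
00:00:01,000 --> 00:00:02,500
നമസ്കാരം

2
00:00:03,000 --> 00:00:04,000
നന്ദി, നമസ്കാരം!
"""

sentences = extract_sentences(sample_srt)
words = extract_words(sentences)
print(len(sentences), "sentences,", len(words), "distinct words")
```

The word list is kept as a set so that a word repeated across many dialogues is counted once, which is what the phonetic dictionary needs.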

Hugo (Hollywood), Interstellar (Hollywood) and Queen (Bollywood) were the three films whose subtitles I went for and downloaded. The reason is simply coz the dialogues from these movies seem to be very close to our natural way of speaking. I could not find many Malayalam film subtitles as such, but if you have suggestions for where I can get them, shoot me the links.

Here is a glimpse of what the sentence file looks like after some further work on the extraction script.

This is a transcription file (more on that in upcoming blog updates).

So, that’s how the week has been.

Looking forward to more interesting facts and numbers! (-_-) (** kidding with the character face expression - Facebook sarcastic habit. Giggles **)
puts "will soon share the current work"
puts "till then, ciao"