Demo available here.
While testing the ASR systems DeepSpeech and Kaldi as part of the deep learning team at Reckonsys, I realised that neither of them supports automatic punctuation. This is, of course, expected: including punctuation symbols during training increases the total number of decoding tokens and lowers accuracy.
Although punctuation isn’t really necessary for most use cases, such as sentiment analysis or NER (it helps, but isn’t essential), it is of utmost importance for transcription services. Imagine sending an email to your client with no punctuation or capitalisation. So we tested the big guy’s (Google’s) Cloud Speech API, and it does indeed offer an auto punctuation option. This post covers my initial implementation of auto punctuation, built in under 7 hours.
Just like in my previous post about sentence segmentation, my most important prerequisite was that I had to implement and test the entire thing in under a day over the weekend.
Choosing the model for this project was very simple. Since this is a classic sequence-to-sequence task, I decided to put together a seq2seq model in Keras. Since LSTM is the go-to for most NLP tasks, I went with an LSTM encoder and decoder. In Keras, seq2seq can be implemented very easily without manually handling encoder and decoder states. The problem with that approach is that the decoding cannot be customised. So I decided to define and handle the encoder and decoder states, and the encoding and decoding steps, manually, similar to the simple example seq2seq model by the Keras team.
from keras.layers import Input, Embedding, LSTM, Bidirectional

# Defining the encoder. Max input length is set as 202 for this model
encoder_input = Input(shape=(max_input_length,))
# The text input is encoded into one hot vectors beforehand and the embedding layer is used to create embeddings.
encoder = Embedding(input_dict_size, 128, input_length=max_input_length, mask_zero=True)(encoder_input)
encoder = Bidirectional(LSTM(128, return_sequences=True, unroll=True), merge_mode='concat')(encoder)
# Last time step of the concatenated forward/backward outputs (2 x 128 = 256 units).
encoder_last = encoder[:, -1, :]

# Defining the decoder. Max output length is also set as 202
# Using different encodings for input and output, though not needed in this case.
decoder_input = Input(shape=(max_output_length,))
decoder = Embedding(output_dict_size, 256, input_length=max_output_length, mask_zero=True)(decoder_input)
# The 256-unit decoder LSTM is initialised with encoder_last as both its hidden and cell state.
decoder = LSTM(256, return_sequences=True, unroll=True)(decoder, initial_state=[encoder_last, encoder_last])
Although we don’t expect the model to output a perfectly punctuated sentence for every input, we do expect it to at least learn to copy the input correctly. For this reason (and other obvious reasons), I decided to add attention to the model, so that it learns to copy the input over easily and copes better with longer inputs. I went with Luong attention instead of Bahdanau (the choice won’t affect the final performance by much); the differences between the two are explained excellently here.
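from keras.layers import Activation, concatenate, dot

# Alignment scores: dot product of every decoder state with every encoder state.
attention = dot([decoder, encoder], axes=[2, 2])
# Normalise the scores over the encoder time steps.
attention = Activation('softmax', name='attention')(attention)
# Context vectors: attention-weighted sum of the encoder states.
context = dot([attention, encoder], axes=[2, 1])
decoder_combined_context = concatenate([context, decoder])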
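To make the shapes in the attention step easier to follow, here is a minimal pure-Python sketch of the same dot-product (Luong-style) computation for a single sequence. It is only an illustration, not the code used for this model; the function names and the toy vectors are made up for the example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def luong_dot_attention(decoder_states, encoder_states):
    """Return (weights, combined) for one sequence.

    decoder_states: list of T_dec vectors, encoder_states: list of T_enc vectors,
    all of the same dimensionality.
    """
    weights, combined = [], []
    for d in decoder_states:
        # One score per encoder time step, normalised with softmax.
        w = softmax([dot_product(d, e) for e in encoder_states])
        # Context vector: attention-weighted sum of the encoder states.
        context = [sum(w_j * e[k] for w_j, e in zip(w, encoder_states))
                   for k in range(len(d))]
        weights.append(w)
        # Mirrors concatenate([context, decoder]) in the Keras snippet.
        combined.append(context + list(d))
    return weights, combined

# Toy example: 2 decoder steps, 3 encoder steps, 2-dimensional states.
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, combined = luong_dot_attention(decoder_states, encoder_states)
print(len(weights), len(weights[0]), len(combined[0]))
```

Each decoder step gets one weight per encoder step, and the combined vector is twice the state size because the context is concatenated with the decoder state.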
For DeepSegment, I collected and cleaned 1 million sentences from Tatoeba. Since DeepSegment uses GloVe vectors, I was able to get excellent results with just 1 million sentences. Here, since I am using a sequence-to-sequence model with one-hot encodings of characters as inputs, the model needs to learn everything from scratch. After downloading and cleaning some corpora, I had 1,448,012 sentences, which was enough to train the initial version of this model. Currently, I am working on adding a lot more correctly segmented sentences with correct grammar and punctuation to this dataset. I am trying to create and open-source a curated dataset of at least 10 million English sentences, which I hope will be useful to NLP enthusiasts like me.
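The post does not spell out how clean sentences become training examples, so the following is purely an assumption: one plausible way to get (input, target) pairs is to strip punctuation and lowercase each clean sentence, roughly mimicking raw ASR output. The `PUNCTUATION` set and the function name are hypothetical.

```python
# Hypothetical set of marks to strip; the real preprocessing may differ.
PUNCTUATION = ".,?!;:"

def make_training_pair(sentence):
    """Return (unpunctuated lowercase source, original target)."""
    target = sentence.strip()
    stripped = target.translate(str.maketrans("", "", PUNCTUATION))
    # Lowercase and collapse any repeated whitespace left behind.
    source = " ".join(stripped.lower().split())
    return source, target

source, target = make_training_pair("Hello, how are you?")
print(source)
print(target)
```

Apostrophes are deliberately left alone here, since an ASR system would normally still emit contractions such as "don't".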
Since I am using one-hot character-level encoding with a max length of 200, I wasn’t able to use model.fit, as the entire dataset cannot be loaded into memory at once. This is where Keras’s fit_generator function came to the rescue: with a mini-epoch size of 40000, I was able to start training on a machine with 8GB of memory and an excellent 1080Ti.
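The idea behind training from a generator is that only the raw strings stay in memory and each batch is encoded on demand. Below is a rough, library-free sketch of such a generator. The reserved token ids, the helper names and the choice of 202 as 200 characters plus start and end markers are all assumptions for illustration, not the actual training code.

```python
import itertools

# Hypothetical reserved ids; 0 is used for padding to match mask_zero=True.
PAD, START, END = 0, 1, 2

def build_vocab(texts):
    """Map each character to an integer id, leaving room for the reserved ids."""
    chars = sorted(set("".join(texts)))
    return {ch: i + 3 for i, ch in enumerate(chars)}

def encode(text, vocab, max_length):
    """Encode text as START + character ids + END, padded to max_length."""
    ids = [START] + [vocab[ch] for ch in text[: max_length - 2]] + [END]
    return ids + [PAD] * (max_length - len(ids))

def batch_generator(pairs, in_vocab, out_vocab, max_length, batch_size):
    """Yield (encoder_input, decoder_input, decoder_target) batches forever."""
    for start in itertools.cycle(range(0, len(pairs), batch_size)):
        chunk = pairs[start:start + batch_size]
        enc = [encode(src, in_vocab, max_length) for src, _ in chunk]
        dec = [encode(tgt, out_vocab, max_length) for _, tgt in chunk]
        # Teacher forcing: the target is the decoder input shifted left by one step.
        target = [row[1:] + [PAD] for row in dec]
        yield enc, dec, target

pairs = [("hello how are you", "Hello, how are you?"), ("i am fine", "I am fine.")]
in_vocab = build_vocab(src for src, _ in pairs)
out_vocab = build_vocab(tgt for _, tgt in pairs)
gen = batch_generator(pairs, in_vocab, out_vocab, max_length=202, batch_size=2)
enc, dec, target = next(gen)
print(len(enc), len(enc[0]), len(target[0]))
```

Before handing these batches to Keras, the lists would typically be converted to NumPy arrays and yielded as an (inputs, targets) tuple, with the targets one-hot encoded if the loss requires it.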
The model converged after 16 epochs at a validation accuracy of 99.993%. Since validation accuracy alone doesn’t tell us much about the model’s actual punctuation correction capability, I used the following methodology to measure its performance.
Absolute accuracy is defined as the number of perfectly corrected sentences divided by the total number of sentences in the test data. I tested the model on 3000 sentences that were randomly selected from the original data and held out during training. The model achieved 72.3% absolute accuracy on this test data. This was far better than I originally expected, considering the ok-ish amount of training data (yay!).
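As a concrete reading of that definition, absolute accuracy is just an exact-match rate over whole sentences. A minimal sketch, with a made-up function name and toy data:

```python
def absolute_accuracy(predictions, references):
    """Fraction of predictions that match their reference sentence exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    exact = sum(p == r for p, r in zip(predictions, references))
    return exact / len(references)

references = ["Hello, how are you?", "I am fine.", "See you tomorrow!"]
predictions = ["Hello, how are you?", "I am fine.", "See you tomorrow."]
score = absolute_accuracy(predictions, references)
print(round(score, 3))
```

A single wrong or missing mark makes the whole sentence count as incorrect, which is why this metric is much stricter than the per-token validation accuracy.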
The links for the test data, training data, code and pre-trained models will be available at https://github.com/bedapudi6788/deeppunct by 26-Nov-2018. I am currently working on training DeepPunct on a lot more data, along with some minor changes to the model. Once this is done, I will release the latest code and data on GitHub.
Meanwhile, the initial version of the model can be tested at http://bpraneeth.com/projects along with DeepSegment, which was covered in my previous post.