AI POS (parts-of-speech) tagger #PART-3

Manish Pawar
Nov 29, 2018


Voilà, you are here again!

Well, this is the last part of the trilogy. If you are new to POS (part-of-speech) tagging, make sure you read my PART-1 first, then PART-2, which I wrote a while ago. This article is more of an enhancement of the work done there, so if you've really absorbed the prior parts, this one will be easier to understand.

Here, we will be building our POS tagger with Keras. For non-machine-learning folks, Keras is one of the most popular high-level APIs for deep learning, and it runs on top of TensorFlow. Plus, it's an efficient way of coding (40 lines can often be reduced to almost 10).

Having seen part-1 & part-2, you've probably realized that we need a more appropriate approach to POS tagging.

Earlier, in parts 1 & 2, we used the Penn Treebank corpus. Let's do the same here in order to compare the outcomes.

import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents()
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))

On printing the first tagged sentence, along with the sentence and word counts, we get

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences: 3914
Tagged words: 100676

Now, we do some feature engineering. Don't get bored, because that's what data scientists do…

We separate words from tags as follows…

import numpy as np

sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*tagged_sentence)
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))

Let’s verify by printing.

['Lorillard' 'Inc.' ',' 'the' 'unit' 'of' 'New' 'York-based' 'Loews'
 'Corp.' 'that' '*T*-2' 'makes' 'Kent' 'cigarettes' ',' 'stopped' 'using'
 'crocidolite' 'in' 'its' 'Micronite' 'cigarette' 'filters' 'in' '1956'
 '.']
['NNP' 'NNP' ',' 'DT' 'NN' 'IN' 'JJ' 'JJ' 'NNP' 'NNP' 'WDT' '-NONE-' 'VBZ'
 'NNP' 'NNS' ',' 'VBD' 'VBG' 'NN' 'IN' 'PRP$' 'NN' 'NN' 'NNS' 'IN' 'CD'
 '.']

Now, same as before, we need to split the dataset into training and test sets.
As usual, let's go with the train_test_split function from sklearn.

from sklearn.model_selection import train_test_split

(train_sentences, test_sentences,
 train_tags, test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)  # 20% for test

Now, our frameworks understand numbers, not words. So, we need to assign each word (and each tag) a unique number, i.e. an index.
We compute the set of unique words (and tags), turn it into a list, and index it in a dictionary. We also reserve special indices for padding (-PAD-) and for out-of-vocabulary words (-OOV-).

words, tags = set([]), set([])

for s in train_sentences:
    for w in s:
        words.add(w.lower())

for ts in train_tags:
    for t in ts:
        tags.add(t)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # index 0 is reserved for padding
word2index['-OOV-'] = 1  # index 1 is reserved for out-of-vocabulary words

tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0   # index 0 is reserved for padding

# convert the word dataset to an integer dataset
train_sentences_X, test_sentences_X, train_tags_y, test_tags_y = [], [], [], []

for s in train_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    train_sentences_X.append(s_int)

for s in test_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    test_sentences_X.append(s_int)

for s in train_tags:
    train_tags_y.append([tag2index[t] for t in s])

for s in test_tags:
    test_tags_y.append([tag2index[t] for t in s])

And at each step, it's crucial to check our modelling by printing a sample.

print(train_sentences_X[0])
print(train_tags_y[0])

# [2385, 9167, 860, 4989, 6805, 6349, 9078, 3938, 862, 1092, 4799, 860, 1198, 1131, 879, 5014, 7870, 704, 4415, 8049, 9444, 8175, 8172, 10058, 10034, 9890, 1516, 8311, 7870, 1489, 7967, 6458, 8859, 9720, 6754, 5402, 9254, 2663]
# [11, 35, 39, 3, 7, 9, 20, 42, 42, 3, 35, 39, 35, 35, 22, 7, 10, 16, 32, 35, 31, 17, 3, 11, 42, 7, 9, 3, 10, 16, 6, 25, 12, 11, 42, 17, 6, 44]

Now, we need to make all sentences the same length, so we add a series of 0s at the end of the shorter ones. That's called padding.

from keras.preprocessing.sequence import pad_sequences

# we need the max sentence length to pad to
MAX_LENGTH = len(max(train_sentences_X, key=len))

# padding
train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

Now comes the main deep learning stuff…

LSTM stands for Long Short-Term Memory. An LSTM network can be thought of as multiple copies of the same cell, each passing a message to its successor and remembering previous results. It is designed to avoid the long-term dependency problem: remembering information for long periods of time is practically its default behaviour, not something it struggles to learn!
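To make the shapes concrete, here is a minimal sketch (my own illustration, not from the original post, with a made-up toy vocabulary) of why we'll use return_sequences=True: it keeps one output vector per timestep, which is exactly what per-word tagging needs.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM

demo = Sequential()
demo.add(Embedding(input_dim=1000, output_dim=128, input_length=50))  # toy vocab of 1000 words, 50-token sentences
demo.add(Bidirectional(LSTM(256, return_sequences=True)))
print(demo.output_shape)
# (None, 50, 512): one 512-dim vector (256 per direction) for each of the 50 words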

Be sure to visit here & here to learn more about LSTMs.

I won't be able to explain all the layers we will use in our model, or this post would get too long. So, refer here to learn about each layer. You can bet on me on this.

from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))

# we compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
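Before training, it's worth a quick sanity check of the architecture (the exact parameter counts depend on your vocabulary size and MAX_LENGTH, so treat the expected shapes below as an illustration, not output from the original post).

model.summary()
# expect the Bidirectional LSTM to report output shape (None, MAX_LENGTH, 512)
# and the TimeDistributed Dense to report (None, MAX_LENGTH, len(tag2index))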

Now, because we compiled the model with categorical_crossentropy, the targets have to be one-hot encoded.
We transform the sequences of tag indices into sequences of one-hot encoded tags, which matches what our Dense layer outputs (I hope you've read that Keras documentation).

# one-hot encoding the tags...
from keras.utils import to_categorical

cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
print(cat_train_tags_y[0])

Training time guys…

model.fit(train_sentences_X,
          to_categorical(train_tags_y, len(tag2index)),
          batch_size=128,
          epochs=40,
          validation_split=0.2)

Now, we test it out on a couple of fresh sentences, running the same preprocessing as above.

test_samples = [
    "running is very important for me .".split(),
    "I was running every day for a month .".split()
]

# turn the samples into padded sequences of word ids
test_samples_X = []
for s in test_samples:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    test_samples_X.append(s_int)

test_samples_X = pad_sequences(test_samples_X, maxlen=MAX_LENGTH, padding='post')

# the reverse of to_categorical: map each softmax vector back to a tag index
def logits_to_tokens(sequences, index):
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for categorical in categorical_sequence:
            token_sequence.append(index[np.argmax(categorical)])
        token_sequences.append(token_sequence)
    return token_sequences

Let’s predict now…

predictions = model.predict(test_samples_X)
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))

# ['JJ', 'NNS', 'NN', 'NNP', 'NNP', 'NNS', '-NONE-', '-PAD-', ...
# ['VBP', 'CD', 'JJ', 'CD', 'NNS', 'NNP', 'POS', 'NN', '-NONE-', '-PAD-', ...
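Since the predictions are MAX_LENGTH long, most of each printed sequence is just '-PAD-'. A small helper like the sketch below (my own addition, not from the original post; trim_predictions is just a name I picked) trims each prediction back to the length of its input sentence so the output is easier to read.

tag_names = logits_to_tokens(predictions, {i: t for t, i in tag2index.items()})

def trim_predictions(tag_sequences, samples):
    # keep only as many predicted tags as there are words in each sample
    return [tags[:len(words)] for tags, words in zip(tag_sequences, samples)]

for words, tags in zip(test_samples, trim_predictions(tag_names, test_samples)):
    print(list(zip(words, tags)))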

Checking our model’s accuracy

scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(scores[1])  # scores is [loss, accuracy]

On printing the accuracy, we get

0.9909751977804825

Close to 100%. Astonishing, right? It's definitely better than the previous two parts.

But with a great score, there's always a catch. The catch here is that the padded positions are included in the accuracy, and padding is really easy to predict.
Those positions are trivially guessed, hence the inflated score.
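One way to get a more honest number is to leave the padding class out of the accuracy. Below is a rough sketch (my own addition, not part of the original post; ignore_class_accuracy is just a name I chose) of a custom Keras metric that masks out the positions whose true tag is the padding index 0.

from keras import backend as K

def ignore_class_accuracy(to_ignore=0):
    # accuracy computed only over positions whose true class is not `to_ignore`
    def ignore_accuracy(y_true, y_pred):
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)
        mask = K.cast(K.not_equal(y_true_class, to_ignore), 'float32')
        matches = K.cast(K.equal(y_true_class, y_pred_class), 'float32') * mask
        return K.sum(matches) / K.maximum(K.sum(mask), 1.0)
    return ignore_accuracy

# pass it at compile time alongside the usual accuracy, for example:
# model.compile(loss='categorical_crossentropy',
#               optimizer=Adam(0.001),
#               metrics=['accuracy', ignore_class_accuracy(0)])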

That's it for now. But do try it out on sentences that are long enough to need little or no padding and see how the score changes. That will be good practice.

P.S.: Accuracy improves with a careful blend of hyperparameter tuning, layer choices and so on… C ya. Have a good day!
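If you want to play with that tuning, one hedged starting point (my suggestion, not from the original post) is to let Keras stop training when the validation loss stops improving, instead of hard-coding 40 epochs:

from keras.callbacks import EarlyStopping

# stop when val_loss hasn't improved for 3 epochs and keep the best weights seen
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

model.fit(train_sentences_X,
          to_categorical(train_tags_y, len(tag2index)),
          batch_size=128,
          epochs=40,
          validation_split=0.2,
          callbacks=[early_stop])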

Originally published at blog.lipishala.com on November 15, 2018.
