Stories by Heet Sankesara on Medium

U-Net

Heet Sankesara — Wed, 23 Jan 2019 06:08:44 GMT

UNet

Introducing Symmetry in Segmentation

Introduction

Vision is one of the most important senses humans possess. But have you ever wondered about the complexity of the task? Capturing reflected light rays and extracting meaning from them is a very convoluted task, and yet we do it effortlessly. We developed this ability over millions of years of evolution. So how can we give machines the same ability in a very short period of time? For computers, images are nothing but matrices, and understanding the nuances behind these matrices has been an obsession for many mathematicians for years. But since the emergence of artificial intelligence, and particularly CNN architectures, research has progressed like never before. Many problems that were previously considered untouchable are now showing astounding results.

One such problem is image segmentation. In image segmentation, the machine has to partition the image into different segments, each of them representing a different entity.

Image Segmentation Example

As you can see above, the image is turned into two segments: one represents the cat and the other the background. Image segmentation is useful in many fields, from self-driving cars to satellites. Perhaps the most important of them all is medical imaging. The subtleties in medical images are quite complex and sometimes challenging even for trained physicians. A machine that can understand these nuances and identify the relevant areas can have a profound impact on medical care.

Convolutional neural networks gave decent results on easier image segmentation problems, but they had not made much progress on complex ones. That’s where UNet comes into the picture. UNet was first designed specifically for medical image segmentation. It showed such good results that it was subsequently used in many other fields. In this article, we’ll talk about why and how UNet works. If you don’t know the intuition behind CNNs, please read this first. You can check out UNet in action here.

The Intuition Behind UNet

The main idea behind a CNN is to learn the feature mapping of an image and exploit it to make more nuanced feature mappings. This works well in classification problems, as the image is converted into a vector which is then used for classification. But in image segmentation, we not only need to convert the feature map into a vector but also reconstruct an image from this vector. This is a mammoth task because it’s a lot tougher to convert a vector into an image than vice versa. The whole idea of UNet revolves around this problem.

While converting an image into a vector, we have already learned the feature mapping of the image, so why not use the same mapping to convert it back into an image? This is the recipe behind UNet: use the same feature maps that were used for contraction to expand a vector into a segmented image. This preserves the structural integrity of the image, which reduces distortion enormously. Let’s look at the architecture in more detail.

How UNet Works

UNet Architecture

The architecture looks like a ‘U’, which justifies its name. It consists of three sections: the contraction, the bottleneck, and the expansion section. The contraction section is made of many contraction blocks. Each block takes an input and applies two 3x3 convolution layers followed by a 2x2 max pooling. The number of kernels or feature maps doubles after each block so that the architecture can learn complex structures effectively. The bottommost layer mediates between the contraction section and the expansion section. It uses two 3x3 convolution layers followed by a 2x2 up-convolution layer.
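To make the shapes concrete, here is a minimal PyTorch sketch of two contraction blocks. It is not the module linked in this article: the helper name contraction_block_sketch, the toy 64x64 input and the use of unpadded convolutions with ReLU are assumptions made for illustration.

```python
import torch
from torch import nn


def contraction_block_sketch(in_channels: int, out_channels: int) -> nn.Sequential:
    """Two unpadded 3x3 convolutions, each followed by a ReLU (illustrative helper)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
    )


pool = nn.MaxPool2d(kernel_size=2)            # 2x2 max pooling halves height and width

x = torch.randn(1, 1, 64, 64)                 # one single-channel 64x64 toy image
block1 = contraction_block_sketch(1, 64)
block2 = contraction_block_sketch(64, 128)    # the number of feature maps doubles

f1 = block1(x)                                # each unpadded 3x3 conv trims 2 pixels: 64 -> 60
f2 = block2(pool(f1))                         # pooling: 60 -> 30, two convs: 30 -> 26

print(f1.shape, f2.shape)
```

Each block trades spatial resolution for more feature maps, which is the "contraction" the section name refers to.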

But the heart of this architecture lies in the expansion section. Similar to the contraction section, it also consists of several expansion blocks. Each block passes its input to two 3x3 convolution layers followed by a 2x2 upsampling layer. After each block, the number of feature maps used by the convolutional layers is halved to maintain symmetry. However, every time, the input is also concatenated with the feature maps of the corresponding contraction block. This ensures that the features learned while contracting the image are used to reconstruct it. The number of expansion blocks is the same as the number of contraction blocks. After that, the resultant mapping passes through another 3x3 convolution layer with the number of feature maps equal to the number of segments desired (the original paper uses a 1x1 convolution for this final layer).
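The skip connection is easier to see in code. The following is a rough PyTorch sketch of one expansion step, grouped as upsample, concatenate, then convolve. The helper center_crop and the tensor sizes are assumptions for illustration and are not taken from the implementation linked in this article.

```python
import torch
from torch import nn


def center_crop(feature: torch.Tensor, target_hw) -> torch.Tensor:
    """Crop the spatial centre of `feature` to the (height, width) given in target_hw."""
    _, _, h, w = feature.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return feature[:, :, top:top + th, left:left + tw]


# 2x2 up-convolution: halves the channels and doubles the resolution
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
# after concatenation the convolutions see 64 (upsampled) + 64 (skip) = 128 channels
convs = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(inplace=True),
)

skip = torch.randn(1, 64, 60, 60)     # feature map saved from the contraction path
deep = torch.randn(1, 128, 26, 26)    # feature map coming up from the level below

upsampled = up(deep)                                        # 1 x 64 x 52 x 52
cropped = center_crop(skip, upsampled.shape[2:])            # 1 x 64 x 52 x 52
merged = torch.cat([cropped, upsampled], dim=1)             # 1 x 128 x 52 x 52
out = convs(merged)                                         # 1 x 64 x 48 x 48

print(merged.shape, out.shape)
```

The crop is only needed because unpadded convolutions make the contraction feature map slightly larger than the upsampled one; with padded convolutions the two would already match.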

Loss calculation in UNet

What kind of loss would one use in such an intricate image segmentation task? Well, it is defined simply in the paper itself.

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross-entropy loss function

UNet uses a rather novel loss weighting scheme for each pixel such that there is a higher weight at the border of segmented objects. This loss weighting scheme helped the U-Net model segment cells in biomedical images in a discontinuous fashion such that individual cells may be easily identified within the binary segmentation map.
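For reference, the softmax, the weighted loss and the weight map can be written out roughly as below, following the notation of the original U-Net paper. Here a_k(x) is the activation of channel k at pixel x, K is the number of classes, ℓ(x) is the true label of the pixel, w_c is a class-balancing weight map, and d_1 and d_2 are the distances to the border of the nearest and second-nearest cell. The loss is written with an explicit minus sign so that it reads as a quantity to minimise; the paper reports w_0 = 10 and σ ≈ 5 pixels for its experiments.

```latex
p_k(\mathbf{x}) = \frac{\exp\big(a_k(\mathbf{x})\big)}{\sum_{k'=1}^{K} \exp\big(a_{k'}(\mathbf{x})\big)},
\qquad
\mathcal{L} = -\sum_{\mathbf{x} \in \Omega} w(\mathbf{x}) \, \log p_{\ell(\mathbf{x})}(\mathbf{x})

w(\mathbf{x}) = w_c(\mathbf{x}) + w_0 \cdot \exp\!\left( -\frac{\big(d_1(\mathbf{x}) + d_2(\mathbf{x})\big)^2}{2\sigma^2} \right)
```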

First of all, a pixel-wise softmax is applied to the resultant image, followed by the cross-entropy loss function. So we are classifying each pixel into one of the classes. The idea is that even in segmentation every pixel has to lie in some category, and we just need to make sure that it does. So we have converted a segmentation problem into a multiclass classification one, and it performed very well compared to the traditional loss functions.
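As a small sanity check of the "every pixel is a classification" idea, the sketch below computes the per-pixel cross-entropy in PyTorch in two ways and then applies a weight map. The tensor sizes, the toy weight map (one row of pixels counted five times as much) and the normalisation by the sum of the weights are assumptions for illustration, not the scheme used in the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 3, 4, 5)             # batch=2, classes=3, height=4, width=5
labels = torch.randint(0, 3, (2, 4, 5))      # one class index per pixel

# Per-pixel softmax cross-entropy, left un-averaged so that it can be weighted.
per_pixel = F.cross_entropy(logits, labels, reduction="none")    # shape 2 x 4 x 5

# The same quantity by hand: log-softmax over the class axis, then pick the true class.
log_probs = F.log_softmax(logits, dim=1)
by_hand = -log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)

# Hypothetical weight map: the first row of every image counts 5x as much.
weights = torch.ones(2, 4, 5)
weights[:, 0, :] = 5.0
weighted_loss = (weights * per_pixel).sum() / weights.sum()

print(torch.allclose(per_pixel, by_hand), weighted_loss.item())
```

Passing the 4-dimensional logits directly to the loss avoids the permute-and-flatten step, but the result is the same per-pixel classification loss.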

UNet Implementation

I implemented the UNet model using the PyTorch framework. You can check out the UNet module here. The data consists of optical coherence tomography images with diabetic macular edema, prepared for segmentation. You can check out UNet in action here.

https://medium.com/media/c862cc68c602a368ec45012a01baf2ff/href

The UNet module in the above code represents the whole architecture of UNet. contraction_block and expansive_block are used to create the contraction section and the expansion section respectively. The function crop_and_concat appends the output of a contraction layer to the input of the corresponding expansion layer. The training part can be written as

import torch

# Unet is the module defined in the code above; inputs, labels, width_out and
# height_out come from the data pipeline
unet = Unet(in_channel=1, out_channel=2)
# out_channel represents the number of segments desired
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(unet.parameters(), lr=0.01, momentum=0.99)
optimizer.zero_grad()
outputs = unet(inputs)
# permute so that the scores for the desired segments are on the 4th dimension
outputs = outputs.permute(0, 2, 3, 1)
m = outputs.shape[0]
# flatten outputs and labels to calculate the pixel-wise softmax loss;
# reshape is used because the permuted tensor is no longer contiguous
outputs = outputs.reshape(m*width_out*height_out, 2)
labels = labels.reshape(m*width_out*height_out)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

Conclusion

Image segmentation is an important problem, and new research papers are published every day. UNet contributed significantly to this research, and many new architectures are inspired by it. But still, there is so much to explore. There are many variants of this architecture in the industry, and hence it is necessary to understand the first one in order to understand them better. If you have any doubts, please comment below or refer to the resources page.

Resources

Author’s Note

This tutorial is the second article in my series of DeepResearch articles. If you like this tutorial, please let me know in the comments, and if you don’t, please let me know in the comments in more detail. If you have any doubts or any criticism, just flood the comments with it. I’ll reply as soon as I can. If you like this tutorial, please share it with your peers.

U-Net was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hierarchical Attention Networks

Heet Sankesara — Fri, 24 Aug 2018 05:34:31 GMT

The most human way to classify text

What’s all this hype about text classification?

Since the rise of Artificial Intelligence, text classification has become one of the most challenging tasks to accomplish. In layman’s terms, we can say Artificial Intelligence is the field which tries to build models with human-like intelligence to ease the jobs of all of us. We humans have an astounding proficiency in text classification, but even many sophisticated NLP models fail to achieve proficiency close to it. So the question that arises is: what do we humans do differently? How do we classify text?

First of all, we understand words: not each and every word, but many of them, and we can guess even unknown words just from the structure of a sentence. Then we understand the message that a series of words (a sentence) conveys. Then, from a series of sentences, we understand the meaning of a paragraph or an article. A similar approach is used in the Hierarchical Attention model.

So what’s so special about this hierarchical thing?

Well, to put it in a “too complicated to comprehend even for a techie” way: it uses stacked recurrent neural networks at the word level, followed by an attention model, to extract the words that are important to the meaning of the sentence and aggregate the representations of those informative words to form a sentence vector. Then the same procedure is applied to the derived sentence vectors, which generates a vector that captures the meaning of the given document, and that vector can be passed on for text classification.

Wait….What?

HAN Structure

The idea behind the paper is “Words make sentences and sentences make documents”. The intent is to derive sentence meaning from the words and then derive the meaning of the document from those sentences. But not all words are equally important. Some of them characterize a sentence more than others. Therefore we use the attention model so that the sentence vector pays more attention to “important” words. The attention model consists of two parts: a bidirectional RNN and an attention network. While the bidirectional RNN learns the meaning behind the sequence of words and returns a vector corresponding to each word, the attention network computes a weight for each word vector using its own shallow neural network. Then it aggregates the representations of those words to form a sentence vector, i.e. it calculates the weighted sum of the word vectors. This weighted sum embodies the whole sentence. The same procedure applies to the sentence vectors, so that the final vector embodies the gist of the whole document. Since it has two levels of attention model, it is called a hierarchical attention network.

Enough talking… just show me the code

We used the News Category Dataset to classify news categories. You can see the whole implementation here. Now the first question that comes to mind is: what the hell is attention?

Attention Model

The vectors from the bidirectional RNN pass through a shallow neural network that decides the weight corresponding to each vector. The weighted sum of the vectors embodies their combined meaning. To understand it in more detail, just go to the code.
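Before looking at the Keras model, here is a bare-bones NumPy sketch of the attention step itself: a one-layer network scores each word vector against a context vector, the scores are turned into weights with a softmax, and the weights are used for the weighted sum. The function name attention_pool, the dimensions and the random parameters are assumptions for illustration; in the real model these parameters are learned.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pool(h, W, b, u_context):
    """h: (T, d) word vectors from the RNN. Returns (weights (T,), pooled vector (d,))."""
    u = np.tanh(h @ W + b)          # (T, a) hidden representation of each word
    scores = u @ u_context          # (T,) similarity of each word with the context vector
    alpha = softmax(scores)         # attention weights, positive and summing to 1
    return alpha, alpha @ h         # weighted sum of the word vectors


rng = np.random.default_rng(0)
T, d, a = 6, 8, 4                   # 6 words, 8-dim word vectors, 4-dim attention space
h = rng.normal(size=(T, d))
W = rng.normal(size=(d, a))
b = np.zeros(a)
u_context = rng.normal(size=a)

alpha, sentence_vector = attention_pool(h, W, b, u_context)
print(alpha.round(3), sentence_vector.shape)
```

Running the same function over the sentence vectors of a document would give the document vector, which is the second level of the hierarchy.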

Data preprocessing

To process the data we need to convert it into a suitable form.

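import numpy as np
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

# texts holds the raw documents and paras holds each document split into sentences
tokenizer = Tokenizer(num_words=max_features, oov_token=True)
tokenizer.fit_on_texts(texts)
data = np.zeros((len(texts), max_senten_num, max_senten_len), dtype='int32')
for i, sentences in enumerate(paras):
    for j, sent in enumerate(sentences):
        if j < max_senten_num:
            wordTokens = text_to_word_sequence(sent)
            k = 0
            for word in wordTokens:
                try:
                    # keep only in-vocabulary words that fit within the sentence length limit
                    if k < max_senten_len and tokenizer.word_index[word] < max_features:
                        data[i, j, k] = tokenizer.word_index[word]
                        k = k + 1
                except KeyError:
                    # the tokenizer has never seen this word
                    print(word)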

We used the above code to convert the training dataset into a 3-dimensional array: the first dimension represents the total number of documents, the second one represents each sentence in a document, and the last one represents each word in a sentence. However, we have to set some upper limits in order to create a static graph, which in this case are max_senten_num (max number of sentences in a document), max_senten_len (max number of words in a sentence) and max_features (max number of words the Tokenizer can have).

Now isn’t it unfair to the model if we randomly initialize all the words? Hence we use pretrained embedding vectors, which give the model an extra edge in terms of performance and yield better results.
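A toy example may help to picture this array. The sketch below fills it for two tiny documents with plain NumPy; the vocabulary, the documents and the small limits are made up for illustration, and the documents are already split into sentences and words.

```python
import numpy as np

max_senten_num, max_senten_len = 3, 4     # toy limits
# hypothetical vocabulary; index 0 is reserved for padding
word_index = {"the": 1, "cat": 2, "sat": 3, "dogs": 4, "bark": 5, "loudly": 6}

docs = [
    [["the", "cat", "sat"], ["dogs", "bark"]],     # document 0: two sentences
    [["dogs", "bark", "loudly", "the", "cat"]],    # document 1: one sentence with 5 words
]

data = np.zeros((len(docs), max_senten_num, max_senten_len), dtype="int32")
for i, sentences in enumerate(docs):
    for j, words in enumerate(sentences[:max_senten_num]):
        # map words to indices, drop unknown words, truncate to the sentence length limit
        ids = [word_index[w] for w in words if w in word_index][:max_senten_len]
        data[i, j, :len(ids)] = ids

print(data.shape)
print(data[1, 0])
```

Short sentences and missing sentences are padded with zeros, while the fifth word of the long sentence is cut off.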

GLOVE_DIR = "../input/glove6b/glove.6B.100d.txt"
# map each GloVe word to its pretrained vector
embeddings_index = {}
with open(GLOVE_DIR, encoding='utf8') as f:
    for line in f:
        try:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
        except (ValueError, IndexError):
            # report lines that cannot be parsed as a word followed by floats
            print(word)
word_index = tokenizer.word_index
# row i holds the vector of the word with index i; row 0 is left for padding
embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
absent_words = 0
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        # words not found in the embedding index stay all-zeros
        absent_words += 1
print('Total absent words are', absent_words, 'which is', "%0.2f" % (absent_words * 100 / len(word_index)), '% of total words')

We replaced the known words with their corresponding vectors in embedding_matrix. Indeed, some words will be missing, but our model has to learn to cope with that.

Now it’s time for the HAN model

from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, TimeDistributed, Dropout
from keras.models import Model

# AttentionWithContext (a custom attention layer) and l2_reg (an L2 regularizer)
# are defined in the full implementation
embedding_layer = Embedding(len(word_index) + 1, embed_size, weights=[embedding_matrix], input_length=max_senten_len, trainable=False)

# Word level attention model: encodes one sentence into a single vector
word_input = Input(shape=(max_senten_len,), dtype='int32')
word_sequences = embedding_layer(word_input)
word_lstm = Bidirectional(LSTM(150, return_sequences=True, kernel_regularizer=l2_reg))(word_sequences)
word_dense = TimeDistributed(Dense(200, kernel_regularizer=l2_reg))(word_lstm)
word_att = AttentionWithContext()(word_dense)
wordEncoder = Model(word_input, word_att)

# Sentence level attention model: encodes the sentence vectors into a document vector
sent_input = Input(shape=(max_senten_num, max_senten_len), dtype='int32')
sent_encoder = TimeDistributed(wordEncoder)(sent_input)
sent_lstm = Bidirectional(LSTM(150, return_sequences=True, kernel_regularizer=l2_reg))(sent_encoder)
sent_dense = TimeDistributed(Dense(200, kernel_regularizer=l2_reg))(sent_lstm)
sent_att = Dropout(0.5)(AttentionWithContext()(sent_dense))
# 30 output units, one per news category
preds = Dense(30, activation='softmax')(sent_att)
model = Model(sent_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

For the word embeddings, we used the Keras Embedding layer initialized with the GloVe matrix built above. The TimeDistributed wrapper is used to apply a Dense layer to each of the time steps independently. We used Dropout and the l2_reg regularizer to reduce overfitting.
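What "applied to each time step independently" means can be checked with a few lines of NumPy. This is only an illustration of the idea, not how Keras implements the wrapper; the random weights stand in for a trained Dense layer, and the input width of 300 assumes the two directions of the 150-unit LSTM are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, in_dim, out_dim = 5, 300, 200      # 5 time steps, 300 inputs, 200 Dense units
x = rng.normal(size=(steps, in_dim))      # one sequence of RNN outputs
W = rng.normal(size=(in_dim, out_dim))    # the single shared weight matrix
b = np.zeros(out_dim)

# Applying the same dense layer one time step at a time ...
per_step = np.stack([x[t] @ W + b for t in range(steps)])
# ... gives the same result as one matrix product over the whole sequence.
all_at_once = x @ W + b

print(np.allclose(per_step, all_at_once), all_at_once.shape)
```

The important point is that the weights are shared: every word (or sentence) is transformed by the same layer, whatever its position in the sequence.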

Conclusion

You must be very much impressed or very much confused by now. Sometimes these things can go over one’s head, but text classification is a trending field, and despite the many new and prolific research efforts there is still a lot of scope for improvement. So do not despair just yet, because you’ll have many more disappointments later 😅. Just kidding: if you have any doubts, please comment below or refer to the resources page.

Resources

Go here to check out the code.
Go here to check out the implementation.
Hierarchical attention networks for information extraction from cancer pathology reports.
Hierarchical Attention Networks for Document Classification

Author’s Note

This tutorial is the first article in my series of DeepResearch articles. If you like this tutorial, please let me know in the comments, and if you don’t, please let me know in the comments in more detail. If you have any doubts or any criticism, just flood the comments with it. I’ll reply as soon as I can. If you like this tutorial, please share it with your peers.

Hierarchical Attention Networks was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.