<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Bibek Chaudhary on Medium]]></title>
        <description><![CDATA[Stories by Bibek Chaudhary on Medium]]></description>
        <link>https://medium.com/@bibekchaudhary?source=rss-84a0db77e00e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*7TdjpjQwDNMvcXOqyNX5gg.jpeg</url>
            <title>Stories by Bibek Chaudhary on Medium</title>
            <link>https://medium.com/@bibekchaudhary?source=rss-84a0db77e00e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 19:15:19 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@bibekchaudhary/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Pelee: Real-time Object Detection System on Mobile Devices]]></title>
            <link>https://medium.com/@bibekchaudhary/pelee-real-time-object-detection-system-on-mobile-devices-f565947c04c4?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/f565947c04c4</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai-on-device]]></category>
            <category><![CDATA[object-detection]]></category>
            <category><![CDATA[edge-computing]]></category>
            <category><![CDATA[realtime]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Mon, 19 Aug 2019 06:31:32 GMT</pubDate>
            <atom:updated>2019-08-19T16:41:44.263Z</atom:updated>
<content:encoded><![CDATA[<p>The rise of deep learning in the past decade has been astronomical, especially after the introduction of the <a href="http://cs231n.github.io/convolutional-networks/">CNN (Convolutional Neural Network)</a>. But this rise has been accompanied by ever bigger models and a growing need for compute power. These large (and compute-heavy) models are often hard to deploy in real-life applications, especially on edge devices. This is why <a href="https://venturebeat.com/2019/03/21/the-rise-of-on-device-ai-and-why-its-so-important-vb-live/">on-device AI</a> is gaining popularity and has become an active area of research. On-device AI requires deep learning models to be lightweight, power-efficient and accurate.</p><p>One such model is <a href="https://arxiv.org/abs/1804.06882">Pelee: A Real-Time Object Detection System on Mobile Devices</a>, which I will review in this post. The post is divided into three parts:</p><ol><li>SSD: Single Shot MultiBox Detector</li><li>PeleeNet for Classification</li><li>Pelee: Real-time Object Detection for tiny devices</li></ol><h3>SSD: Single Shot MultiBox Detector</h3><p>Pelee is based on <a href="https://arxiv.org/abs/1512.02325">SSD</a>, but targets resource-constrained devices. So in order to fully understand Pelee, we first have to understand the architecture and working mechanism of SSD.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7ydKUHXvFKz1RrjAzRwGqw.png" /><figcaption>SSD architecture (pic taken from <a href="https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html">here</a>)</figcaption></figure><p>SSD uses VGG-16 as its base network to extract high-level feature maps (38x38 and 19x19) from the input image. There are many variants of SSD that use other architectures, such as MobileNet (v1 and v2) or SqueezeNet, as the base network. SSD performs detection (classification + localization) on multi-scale feature maps: it uses the 38x38, 19x19, 10x10, 5x5, 3x3 and 1x1 feature maps to predict object classes and bounding-box coordinates. If you want to learn more about SSD and its implementation, please go through this excellent <a href="https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection">tutorial</a>.</p><p>Pelee uses PeleeNet, a variant of <a href="https://arxiv.org/abs/1608.06993">DenseNet</a> designed for mobile devices, as its base network to extract high-level feature maps.</p><h3>PeleeNet for Classification</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/981/1*NeEI1nA11P-HELxVvtrz8A.png" /><figcaption>Overall architecture of PeleeNet</figcaption></figure><p>The architecture of PeleeNet is designed around the limited computing power and memory of mobile devices. It contains three main parts: the stem block, the dense block, and the transition layer. Let’s discuss these blocks one by one; a combined code sketch follows the transition-layer description below.</p><p><strong>Stem Block</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/1*TEMU_QD9InRwOWlPJstScg.png" /><figcaption>Structure of the stem block</figcaption></figure><p>The stem block is designed to increase the expressive power of the features without adding much computational cost.</p><p><strong>Dense Block</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/764/1*Dp6maVHvBH4ekCLePhBq2g.png" /><figcaption>Structure of the two-way dense block</figcaption></figure><p>Inspired by GoogLeNet, the original dense layer of DenseNet is modified into a two-way dense layer, so that the network also learns visual patterns for large objects.</p><p><strong>Transition Layer</strong></p><p>It contains a 1x1 convolution along with a max-pooling layer; in PeleeNet’s transition layers the number of output channels is kept equal to the number of input channels.</p>
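<p><em>Below is a minimal PyTorch sketch of these three building blocks, written from the figures above. The channel widths, growth rate and bottleneck sizes are illustrative assumptions, not the authors’ exact settings; see their repo, linked below, for the reference implementation.</em></p><pre>import torch
import torch.nn as nn


class ConvBNReLU(nn.Module):
    """Convolution followed by batch norm and ReLU, the basic unit used here."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


class StemBlock(nn.Module):
    """Strided 3x3 conv, then a conv branch and a pooling branch, then a 1x1 fuse."""
    def __init__(self, out_ch=32):
        super().__init__()
        self.stem = ConvBNReLU(3, out_ch, 3, stride=2, padding=1)
        self.branch_conv = nn.Sequential(             # 1x1 bottleneck + strided 3x3
            ConvBNReLU(out_ch, out_ch // 2, 1),
            ConvBNReLU(out_ch // 2, out_ch, 3, stride=2, padding=1),
        )
        self.branch_pool = nn.MaxPool2d(2, stride=2)  # cheap parallel path
        self.fuse = ConvBNReLU(2 * out_ch, out_ch, 1)

    def forward(self, x):
        x = self.stem(x)
        x = torch.cat([self.branch_conv(x), self.branch_pool(x)], dim=1)
        return self.fuse(x)


class TwoWayDenseLayer(nn.Module):
    """Each branch adds growth_rate // 2 channels; the second branch stacks two
    3x3 convs to enlarge the receptive field for bigger objects."""
    def __init__(self, in_ch, growth_rate=32):
        super().__init__()
        half = growth_rate // 2
        self.branch1 = nn.Sequential(
            ConvBNReLU(in_ch, 2 * half, 1),
            ConvBNReLU(2 * half, half, 3, padding=1),
        )
        self.branch2 = nn.Sequential(
            ConvBNReLU(in_ch, 2 * half, 1),
            ConvBNReLU(2 * half, half, 3, padding=1),
            ConvBNReLU(half, half, 3, padding=1),
        )

    def forward(self, x):
        # dense connectivity: concatenate the input with both new branches
        return torch.cat([x, self.branch1(x), self.branch2(x)], dim=1)


class TransitionLayer(nn.Module):
    """1x1 conv keeping the channel count, followed by 2x2 pooling."""
    def __init__(self, channels):
        super().__init__()
        self.conv = ConvBNReLU(channels, channels, 1)
        self.pool = nn.MaxPool2d(2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))</pre>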
<p>These are the three main building blocks of PeleeNet; the authors have open-sourced their PyTorch code, so please refer to their <a href="https://github.com/Robert-JunWang/PeleeNet">repo</a> to learn more. For Keras lovers, please refer to this <a href="https://gist.github.com/imbibekk/fde87f0a57f351d28e182828d118c98b">implementation</a>.</p><p>Now that we have an overview of how SSD and PeleeNet work, we can discuss Pelee in the next section.</p><h3>Pelee: Real-time Object Detection for tiny devices</h3><p>The architecture of Pelee is similar to that of SSD, except that Pelee uses PeleeNet as its base network whereas SSD uses VGG-16. Another main difference is that Pelee uses only 5 scales of feature maps for prediction, whereas SSD uses 6.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/532/1*FOofHfrufqPosHVuT3JsOA.png" /><figcaption>5 scales of feature maps used in Pelee for prediction</figcaption></figure><p>The 38x38 feature map is dropped to balance speed against accuracy on edge devices; each of the five remaining feature maps passes through a residual block before classification and regression.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HmYnXS_jepw_uJ7JH-xF7Q.png" /><figcaption>Residual Block prediction</figcaption></figure><p>The residual block helps extract better features from each feature map, and can be sketched as follows:</p>
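<p><em>Again a hedged sketch, reusing the imports and the ConvBNReLU helper from the PeleeNet sketch above; the 128/256 channel widths follow the paper’s figure:</em></p><pre>class ResBlock(nn.Module):
    """Residual prediction block applied to each feature map before the heads."""
    def __init__(self, in_ch, mid_ch=128, out_ch=256):
        super().__init__()
        self.body = nn.Sequential(                    # 1x1 -> 3x3 -> 1x1 main path
            ConvBNReLU(in_ch, mid_ch, 1),
            ConvBNReLU(mid_ch, mid_ch, 3, padding=1),
            ConvBNReLU(mid_ch, out_ch, 1),
        )
        self.shortcut = ConvBNReLU(in_ch, out_ch, 1)  # 1x1 projection shortcut

    def forward(self, x):
        # element-wise sum of the main path and the shortcut
        return self.body(x) + self.shortcut(x)</pre>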
<p>Pelee outperformed other approaches, including <a href="https://arxiv.org/abs/1612.08242">YOLOv2</a> and SSD+MobileNet, in every metric: speed, model size and accuracy.</p><p>The following table shows its performance on <a href="http://host.robots.ox.ac.uk/pascal/VOC/">PASCAL VOC</a> 2007.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1013/1*UXquRISjyVyiyiXuxULrBA.png" /><figcaption>Pelee performance on PASCAL VOC 2007</figcaption></figure><p>In terms of speed, Pelee is significantly faster than SSD+MobileNet on iPhone and on the <a href="https://developer.nvidia.com/embedded/jetson-tx2">Jetson TX2</a> in <a href="https://www.reddit.com/r/NintendoSwitch/comments/5urua3/explanation_of_flops_and_fp32_and_fp16/">FP32</a> mode.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/922/1*VYWx12x3bVurL_j8Hw4ZwA.png" /><figcaption>Speed on real devices</figcaption></figure><p>With its efficient architecture design, Pelee achieved state-of-the-art performance for object detection on mobile devices, but was later surpassed by <a href="https://arxiv.org/abs/1807.11013">Tiny-DSOD</a>, which I will review in my next post.</p><p>References:</p><ol><li><a href="https://arxiv.org/abs/1804.06882"><em>Pelee: A Real-Time Object Detection System on Mobile Devices</em></a></li><li><a href="https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection"><em>PyTorch SSD tutorial</em></a></li></ol><p><strong>PS: I wrote this post based on my understanding of </strong><a href="https://arxiv.org/abs/1804.06882"><strong>Pelee</strong></a><strong>. Any suggestion/improvement about the content and/or style of writing will be appreciated.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f565947c04c4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Segmentation in Robotic Surgery]]></title>
            <link>https://medium.com/@bibekchaudhary/segmentation-in-robotic-surgery-7e806bbc38dd?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/7e806bbc38dd</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[robotics]]></category>
            <category><![CDATA[segmentation]]></category>
            <category><![CDATA[surgery]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Sun, 11 Nov 2018 17:42:22 GMT</pubDate>
            <atom:updated>2018-11-11T17:44:30.035Z</atom:updated>
<content:encoded><![CDATA[<p><em>Application of Deep Learning in Robotic Surgery</em></p><p>This is my second blog post based on the <a href="https://www.fast.ai/">fastai</a> lessons. I wanted to apply what I learned to a different dataset with binary labels, so I decided to use the dataset from the <a href="https://endovissub2017-roboticinstrumentsegmentation.grand-challenge.org/">Endoscopic Vision Challenge</a> and focused only on segmenting binary classes.</p><p>After downloading the dataset, the black borders were cropped from the images and masks using this <a href="https://github.com/ternaus/robot-surgery-segmentation/blob/master/prepare_data.py">data-preparation</a> code.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/648/1*8mQK00X6LIaEcC8BYI8jQg.png" /><figcaption>Sample image (left) and corresponding mask (right)</figcaption></figure><p>The popular <a href="https://arxiv.org/abs/1505.04597">U-Net</a> architecture was used for this segmentation task, with a slight tweak: the left side (the encoder) of the U-Net was a pre-trained ResNet34. fastai v1 lets you build a U-Net with a pre-trained model as the encoder.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/531/1*CxxAZ_pKZqEV2b0LMXD5nA.png" /><figcaption>Create a learner: U-Net architecture with a ResNet34 encoder</figcaption></figure><p>The <strong>dice</strong> metric, which measures the similarity between two sets (here, the real mask and the predicted mask), also comes with the library. The fastai folks have implemented pretty much everything we need to get state-of-the-art results.</p><p>The first stage of training was done with images at <strong>one-eighth of their original size</strong>, due to computational constraints and the large original resolution of 1024*1280 pixels.</p><p>The training followed this pipeline (sketched in code below):</p><p><strong>train (with frozen encoder) → find learning rate → unfreeze layers → train</strong></p>
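<p><em>In fastai v1 this whole loop is only a few lines. A minimal sketch, assuming the v1 API (<code>unet_learner</code>) and illustrative epoch counts; <code>data</code> is the segmentation data bunch built above:</em></p><pre>from fastai.vision import *   # provides unet_learner, models and the dice metric

learn = unet_learner(data, models.resnet34, metrics=dice)

learn.fit_one_cycle(8)        # stage 1: encoder frozen, only the decoder trains
learn.lr_find()               # sweep learning rates to pick a sensible range
learn.unfreeze()              # unfreeze the pre-trained ResNet34 encoder
learn.fit_one_cycle(8, max_lr=slice(1e-5, 1e-3))  # train the whole U-Net</pre>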
<p>This resulted in a <strong>dice coefficient of 0.85</strong>, with nice predictions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/525/1*cCYxP5FiZB2zvpBrJIKPgA.png" /><figcaption>Results from first-stage training with one-eighth image size</figcaption></figure><p>In the second stage, the image size was <strong>increased to 512*640</strong>, half of the original size. Using the full 1024*1280 resolution resulted in repeated CUDA runtime errors, so I decided to continue with 512*640 and updated the data block with the new size and batch size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/561/1*PhDsSFbEVpUcfSXUrfTOBQ.png" /><figcaption>Data block for the updated size and batch size</figcaption></figure><p>Training resumed from where it stopped in the first stage and followed the same pipeline. This resulted in a <strong>dice coefficient of 0.941522</strong>, with sharp-edged masks on the validation set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*4RvBV0ELEEAaI2YLIOyJ1w.png" /><figcaption>Results from second-stage training with 512*640 image size</figcaption></figure><p>The predicted masks are very similar to the ground truths, but I believe the results can be improved further by using the full image size in the second stage.</p><p>This was a nice learning experience; one thing that amazed me is that the model did well even though the <strong>input was rectangular, not square</strong>.</p><p>This is fastai for you!!!</p><p><em>Code for this post can be found </em><a href="https://nbviewer.jupyter.org/gist/imbibekk/a2ffb143086ef3d161278f340d2e2b2e"><em>here</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e806bbc38dd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Are you Chinese, Japanese or Korean?]]></title>
            <link>https://medium.com/@bibekchaudhary/are-you-chinese-japanese-or-korean-93e4bf270a5?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/93e4bf270a5</guid>
            <category><![CDATA[chinese]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[fastai]]></category>
            <category><![CDATA[korean]]></category>
            <category><![CDATA[japanese]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Wed, 31 Oct 2018 19:38:17 GMT</pubDate>
            <atom:updated>2018-11-01T19:35:40.183Z</atom:updated>
<content:encoded><![CDATA[<p><em>Image classifier based on a fast.ai lecture</em></p><p>One of the challenges I face living in S. Korea is telling the difference between Chinese, Japanese and Korean people. The similarity in their appearance has led to many awkward moments during my stay.</p><p>I wish to avoid those awkward moments by building a classifier that will differentiate between them for me.</p><p>First, we need a dataset to train an image classifier. Since I did not find any public dataset for this task, I created my own: images of Chinese, Japanese and Korean people of both genders (and of all ages) were scraped from the internet. I ended up with a <strong>dataset of 171 images of Chinese, 168 images of Japanese, and 167 images of Korean people</strong>. Samples from the dataset are shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/652/1*djJMayjDK4un1Uq6kcgZog.png" /><figcaption>Female samples from the dataset</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/669/1*IWyjNePIjeNsbUC3Tifp0Q.png" /><figcaption>Male samples from the dataset</figcaption></figure><p>Now that we have the dataset, we can build a model to train the image classifier. The architecture used was <strong>ResNet50, pre-trained on the </strong><a href="http://www.image-net.org/"><strong>ImageNet</strong></a><strong> dataset</strong>. You can learn more about this architecture in this <a href="http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006">post</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/483/1*hvmKMZt2Sv40mcaIZKL3-w.png" /><figcaption>Learner for the image classifier</figcaption></figure><p>At first, only the last layer of the ResNet50 was trained, freezing the weights of all other layers. The accuracy after 20 epochs was around 70%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/466/1*ggHnH9rzjtLGd61GCIKhgw.png" /><figcaption>Training details for the frozen ResNet50</figcaption></figure><p>70%? Not bad, huh?</p><p>Now let’s fine-tune to see if it makes the model better.</p><p>We unfreeze and train all the layers. We will be using a learning rate of 1e-6 for the first layer and 1e-5 for the last layer; all the other layers will be trained with learning rates in the range [1e-6, 1e-5]. <strong>The first and other early layers are trained with smaller learning rates because they learn generic features, whereas deeper layers learn task-specific features.</strong> To learn more about this, read <a href="https://arxiv.org/abs/1311.2901">this</a>.</p>
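<p><em>A minimal sketch of this fine-tuning step, assuming the fastai v1 API of the time (<code>create_cnn</code>); <code>data</code> stands for the image data bunch built from the scraped dataset:</em></p><pre>from fastai.vision import *   # create_cnn, models, accuracy

learn = create_cnn(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(20)       # head only: the pre-trained layers stay frozen

learn.unfreeze()              # now every layer is trainable
# discriminative learning rates: 1e-6 for the earliest layers, 1e-5 for the
# head, and the layers in between spread across that range
learn.fit_one_cycle(20, max_lr=slice(1e-6, 1e-5))</pre>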
<p>So after unfreezing and training for 20 more epochs, the accuracy was still around 70%; the performance did not improve with fine-tuning. One reason could be that our dataset is similar to ImageNet, which contains human images as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/492/1*nKmaJKEl_83J7ukJzZDjXA.png" /><figcaption>Training details for the unfrozen ResNet50</figcaption></figure><p>Now, let’s interpret the results. We will start by plotting a confusion matrix, which compares the classifier’s predictions with the actual labels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*M9I3rS5xHZiaSmuepX-jYw.png" /><figcaption>Confusion matrix</figcaption></figure><p>Let’s take the first column of the matrix and interpret its meaning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/291/1*cPfj7vla3jB8FTGANDR3vg.png" /></figure><p><strong>17 images of Chinese people were correctly predicted as Chinese, but 5 images of Japanese people were incorrectly predicted as Chinese, and 8 Korean images were also misclassified as Chinese.</strong></p><p>This is clearer if we look at samples of the confused cases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/909/1*FLvw_M2ffDxyRnPvsVWmhQ.png" /><figcaption>Samples of confused cases</figcaption></figure><p>We can also see the cases in which the classifier was most confused.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/353/1*-6WoPTU9AfiWMJk1ovBmJw.png" /><figcaption>Most confused cases</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/193/1*CS10RfpxHT-c8igpBituIQ.png" /><figcaption>The classifier was most confused between Korean and Chinese: it misclassified Koreans as Chinese 8 times.</figcaption></figure><p><strong>Some thoughts:</strong></p><ol><li>Telling Chinese, Japanese and Korean people apart is not an easy task, even for a trained classifier.</li><li>The accuracy of the classifier might improve if it is trained longer.</li><li><a href="http://www.fast.ai/">fastai</a> and <a href="https://twitter.com/jeremyphoward?lang=en">@jeremy</a> will help me create things that make my (and hopefully others’) life easier.</li></ol><p><em>Code and data for this post can be found in this </em><a href="https://github.com/imbibekk/ethnicClassifier"><em>repo</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=93e4bf270a5" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Solving Markov Decision Process]]></title>
            <link>https://medium.com/@bibekchaudhary/solving-markov-decision-process-917e233a2985?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/917e233a2985</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[dynamic-programming]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[markov-chains]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Thu, 27 Sep 2018 21:31:24 GMT</pubDate>
            <atom:updated>2018-09-27T21:31:24.188Z</atom:updated>
<content:encoded><![CDATA[<p>Policy Iteration + Value Iteration</p><p>In the <a href="https://medium.com/@bibekchaudhary/markov-decision-process-mdp-simplified-1ae44cf53cc1">last post</a>, I wrote about the Markov Decision Process (MDP); this time I will summarize my understanding of how to solve an MDP by policy iteration and value iteration.</p><p>So what are policy iteration and value iteration?</p><p>They are Dynamic Programming algorithms used to solve finite MDPs. Dynamic Programming lets you solve complex problems by breaking them into simpler sub-problems; solving those sub-problems gives you the solution to the main problem. It relies on two properties:</p><ol><li><strong>Divide and conquer:</strong> dividing a big, complex problem into smaller and simpler sub-problems. Intuition: 4 is the sum of two 2s (4 = 2 + 2).</li><li><strong>Information reuse:</strong> using information that is already available to solve recurring sub-problems. Intuition: the solutions of the simpler sub-problems can be reused while solving the complex problem.</li></ol><p>Policy iteration and value iteration exploit these properties of MDPs to find the optimal policy.</p><p><strong>Policy Iteration:</strong> it has two parts, policy evaluation and policy improvement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/262/1*xyZ3mZ-_FQjWfk1HRO_Irw.png" /></figure><p>In policy evaluation, given a policy π, we evaluate how much future reward we can get by following this policy starting from state s. This is done by evaluating the state-value function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/340/1*gn4B_Kl1hq7ZrNqIlw-9yA.png" /><figcaption>State-value function</figcaption></figure><p>In the policy improvement step, the policy is improved by taking greedy actions with respect to the state-value function (shown in the figure above).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/154/1*-15eEaZ_rVxV2gF3fR8Usw.png" /><figcaption>Policy improvement via greedy action</figcaption></figure><p>So, what does being greedy mean?</p><p>Here it means selecting the action that maximizes the future reward we can get if we take action a in state s and follow policy π thereafter.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/245/1*o13oYAx3C0A2I6qogUoMNA.png" /><figcaption>Policy improvement via greedy action</figcaption></figure><p>Now we want to know whether following this new, greedified policy from state s will give us more or less future reward than just following the previous policy π from that state. It turns out that starting from state s and taking action a according to the new greedified policy is at least as good as, or better than, following the previous policy π for one step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/470/1*zdWcknUt72qHaJSmZmESjw.png" /></figure><p>So we can say that the new greedified policy improves our chances of getting more future reward starting from state s.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/146/1*yPlILuhLlF2i3Rz3Kh0nSA.png" /></figure><p>If we apply this notion to all the successive steps, we can show that the new policy is at least as good as, if not better than, the previous policy for the whole trajectory.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/678/1*qrXIJ8H2GRsS5XwQap86Bg.png" /></figure><p>Policy evaluation and improvement are done iteratively until the optimal policy is obtained; the optimal policy is reached when the policy stops improving, and thus the Bellman optimality equation is satisfied.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/561/1*AU08s-bTrDnNnEn8rFy7oA.png" /></figure><p>Let’s take an example to apply these concepts and make our understanding more concrete.</p><p>Given a 4x4 grid-world, we need to find, via policy iteration, the optimal policy to reach the goal. It is an undiscounted episodic MDP where, under the initial policy, every action is equally likely. The states are {1, 2, …, 14}.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/534/1*0EihhiCeQUNQjd87No-gnQ.png" /></figure><p>Policy iteration starts with a random policy and then improves it by taking greedy actions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/551/1*oixkWHUFlOEjxdLWv3rGNg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*TconHgoNK0uOj9j-jxT9lg.png" /></figure><p>After a few iterations (k=3 in this case), the policy stops improving, and the optimal policy is obtained.</p>
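<p><em>A minimal NumPy sketch of policy iteration on this grid-world, assuming Sutton and Barto’s setup: the two shaded corners are terminal and every move gets a reward of -1. The details are illustrative rather than read off the figures above:</em></p><pre>import numpy as np

N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINAL = {(0, 0), (N - 1, N - 1)}            # the two shaded corner states

def step(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    return (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))

def evaluate(policy, theta=1e-4):
    """Iterative policy evaluation: sweep until the values stop changing."""
    V = np.zeros((N, N))
    delta = theta + 1.0
    while delta > theta:
        delta = 0.0
        for s in np.ndindex(N, N):
            if s in TERMINAL:
                continue
            v = sum(p * (V[step(s, a)] - 1.0) for a, p in policy[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
    return V

def greedy(V):
    """Policy improvement: act greedily with respect to V."""
    return {s: {max(ACTIONS, key=lambda a: V[step(s, a)]): 1.0}
            for s in np.ndindex(N, N) if s not in TERMINAL}

# start from the equiprobable random policy and iterate
policy = {s: {a: 0.25 for a in ACTIONS}
          for s in np.ndindex(N, N) if s not in TERMINAL}
for _ in range(3):               # k=3 sweeps suffice on this grid
    policy = greedy(evaluate(policy))</pre>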
<p>One major drawback of policy iteration is the computational cost involved in evaluating policies. This cost is reduced in value iteration, which stops policy evaluation after k=1 and updates the policy at every step thereafter.</p><p><strong>Value Iteration:</strong> unlike policy iteration, it merges the policy evaluation and improvement steps into one and performs an iterative update using the Bellman optimality equation for the value function.</p><p>In value iteration, we evaluate the value function once per sweep and continuously improve it via the iterative Bellman optimality update.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/272/1*1ELESflpfZpQV-fY1oPTVA.png" /></figure><p>The iterative version of the Bellman optimality equation can be written as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/470/1*s3YSpcLBtRBaJade3_hzww.png" /></figure><p>The corresponding backup diagram is shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/364/1*h3I9aPNCqyhB3YWt562z3g.png" /></figure><p>This is based on the intuition that if the value function of the successor states is known, then the value function of the current state can be found by a one-step lookahead.</p>
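<p><em>A sketch of value iteration, reusing the grid helpers from the policy iteration sketch above; note the max over actions in place of the expectation under a fixed policy:</em></p><pre>def value_iteration(theta=1e-4):
    """Merged evaluation and improvement: back up with a max over actions."""
    V = np.zeros((N, N))
    delta = theta + 1.0
    while delta > theta:
        delta = 0.0
        for s in np.ndindex(N, N):
            if s in TERMINAL:
                continue
            v = max(V[step(s, a)] - 1.0 for a in ACTIONS)   # one-step lookahead
            delta = max(delta, abs(v - V[s]))
            V[s] = v
    return V

V_star = value_iteration()       # optimal values; greedy(V_star) is optimal</pre>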
<p>References:</p><ol><li><a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">An Introduction to Reinforcement Learning, Sutton and Barto</a></li><li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">David Silver’s course on Reinforcement Learning</a></li></ol><p><strong>PS: I wrote this post based on my understanding of Reinforcement Learning. Any suggestion/improvement about the content and/or style of writing will be appreciated.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=917e233a2985" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Gradient Descent: Stochastic vs Batch]]></title>
            <link>https://medium.com/@bibekchaudhary/gradient-descent-stochastic-vs-batch-517e092b083f?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/517e092b083f</guid>
            <category><![CDATA[gradient-descent]]></category>
            <category><![CDATA[epoch]]></category>
            <category><![CDATA[batch-processing]]></category>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Mon, 24 Sep 2018 15:55:13 GMT</pubDate>
            <atom:updated>2018-09-24T15:57:04.559Z</atom:updated>
<content:encoded><![CDATA[<p><em>The difference is in the weight-update pattern</em></p><p>This is my first post of the <a href="https://twitter.com/sirajraval/status/1014758160572141568?lang=en">#100DaysofMLCode</a> challenge; every day I plan to read (and blog to check my understanding) and/or code.</p><p>In this post, I will talk about Stochastic Gradient Descent and how it differs from Batch Gradient Descent. I will also explain mini-batches and epochs. I am assuming that you are familiar with gradient descent; if you are not, read this <a href="https://hackernoon.com/gradient-descent-aynk-7cbe95a778da">blog</a> to get the intuition behind it.</p><p>Machine Learning is all about finding meaningful patterns and relations in data. As a result of learning, parameters, the weights (plus biases), are obtained and then used for prediction.</p><p>Gradient descent is a popular optimization algorithm that minimizes the loss function and updates the parameters (weights + biases); if an update takes place for every sample, it is called Stochastic Gradient Descent.</p><p><strong><em>For 100 samples, SGD updates the weights 100 times per epoch.</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/520/1*LgO5vUevQqCnDMdEOSdVOQ.png" /><figcaption>SGD algorithm</figcaption></figure><p>In Batch Gradient Descent (BGD), the update takes place only once for the whole set of training samples.</p><p><strong><em>For 100 samples, BGD updates the weights only once per epoch.</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/489/1*BwY7oLCTrRUiONm8p6ff8g.png" /><figcaption>BGD algorithm</figcaption></figure><p>Training on the whole set at once is rare; mini-batches of samples are commonly used to train machine learning models.</p><p><strong><em>For 100 samples and a batch size of 20, mini-batch gradient descent updates the weights five times per epoch.</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/507/1*MVcd6jzZ-jJ59-jyBUi7jg.png" /><figcaption>Mini-BGD algorithm</figcaption></figure><p>In the Machine Learning literature you will often hear the word “epoch”; it means one full pass of training over the entire set of training samples. With mini-BGD, one epoch of training is done when the update for the fifth mini-batch of 20 samples is completed; with SGD, it happens when the update for the last sample (the 100th in this case) is completed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/633/1*hz27NLPFQrcM0tyi7WOdyQ.png" /><figcaption>One epoch using mini-BGD</figcaption></figure>
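<p><em>A toy NumPy sketch to make the update-count arithmetic concrete; the linear-regression loss here is just a stand-in:</em></p><pre>import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

def grad(w, idx):
    """Gradient of the mean squared error on the selected samples."""
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

def train(batch_size, lr=0.1, epochs=1):
    w, updates = np.zeros(3), 0
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            idx = np.arange(start, start + batch_size)
            w -= lr * grad(w, idx)           # one parameter update per batch
            updates += 1
    return updates

print(train(batch_size=1))     # SGD: 100 updates per epoch
print(train(batch_size=100))   # BGD: 1 update per epoch
print(train(batch_size=20))    # mini-batch: 5 updates per epoch</pre><p><strong>PS: I wrote this post based on my understanding of Stochastic and Batch Gradient Descent. Any suggestion/improvement about the content and/or style of writing will be appreciated.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=517e092b083f" width="1" height="1" alt="">]]></content:encoded>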
        </item>
        <item>
            <title><![CDATA[Markov Decision Process(MDP) Simplified]]></title>
            <link>https://medium.com/@bibekchaudhary/markov-decision-process-mdp-simplified-1ae44cf53cc1?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ae44cf53cc1</guid>
            <category><![CDATA[markov-chains]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Tue, 18 Sep 2018 17:38:45 GMT</pubDate>
            <atom:updated>2018-09-27T21:25:20.623Z</atom:updated>
<content:encoded><![CDATA[<p><em>MDP gives the mathematical formulation of the Reinforcement Learning problem</em></p><p>A Markov Decision Process (MDP) is an environment with <strong>Markov</strong> states; Markov states satisfy the <strong>Markov property</strong>: the state contains all the relevant information from the past needed to predict the future. Mathematically,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/623/1*tormRXstV4DIKuDX8Cr0VA.png" /><figcaption>pic taken from <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf">David Silver’s lecture slides</a></figcaption></figure><p>So if I say that the state <strong>S&lt;t&gt;</strong> is Markov, that means it carries all the important information about the environment from the previous states (which means you can throw the previous states away). Think of it this way: once you have your boarding pass, you do not need your ticket anymore to board the plane; the boarding pass already contains all the necessary boarding information.</p><p>An MDP is formally defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/509/1*ArjVZTDhGWoaubzh68jMuA.png" /><figcaption>MDP tuple</figcaption></figure><p>Let’s take an example to develop intuition about MDPs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/751/1*ioi8f7jSzmpWzkSu-XwsJw.png" /><figcaption>Student MDP example from <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf">David Silver’s lecture slides</a></figcaption></figure><p>Suppose that you are a student and the figure above portrays one of your days at school. The circles and the square represent the states you can be in, and the words in red are the actions you can take depending on the state you are in; for example, in the state Class 1 you can choose whether to study or to check Facebook, and depending on the action you take, a numerical reward is given. There is also an action node (the black dot in the figure) from which you can end up in different states depending on the transition probabilities; for example, after you decide to go to the Pub from Class 3, you have a 0.2 probability of ending up in Class 1. This node shows the randomness of the environment, over which you have no control. In all other cases the transition probability is 1, and if the discount factor is 1, then the MDP can be defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/491/1*FJea_1T2Ysu_ZyJSgwurCQ.png" /><figcaption>MDP example</figcaption></figure><p>Now that we have an MDP, we need to solve it to find the best path that maximizes the sum of rewards, which is the goal of solving reinforcement learning problems. Formally, we need to find an optimal policy that maximizes the overall reward the agent can get.</p><p>To solve an MDP, we first have to know about the policy and the value function.</p><p>In simple terms, the policy tells you which actions to take. It is defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/837/1*XgaunP0ogQLPGL33SWBu9g.png" /><figcaption>Policy definition taken from <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf">David Silver’s lecture slides</a></figcaption></figure><p>For MDPs, the policy depends only on the current state.</p><p>The value function can be defined in two ways: as the state-value function or as the action-value function. 
The state-value function tells you “how good” the state you are in is, whereas the action-value function tells you “how good” it is to take a particular action in a particular state. The “how good” of a state (or state-action pair) is defined in terms of expected future rewards.</p><p>The state-value function is defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*FxIoBsuZTOOuw4p9csqhhQ.png" /><figcaption>State-value function definition taken from <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf">David Silver’s lecture slides</a></figcaption></figure><p>Similarly, the action-value function is defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*qDB3RleLRTe2NKmIPQJtGg.png" /><figcaption>Action-value function taken from <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf">David Silver’s lecture slides</a></figcaption></figure><p>If we take the maximum of the value function over all policies, we get the optimal value function. Once we know the optimal value function, we can solve the MDP to find the best policy.</p><p>The value functions defined above satisfy the <strong>Bellman equation</strong>, which states: “the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/253/1*1Tr17ZqADuXvJBwyrEPq0Q.png" /></figure><p>For example, if we take the path from Class 1 to Class 2, we can write the Bellman equation in the following way:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/332/1*HPwUhHl_n5hA-AHyjhVwFA.png" /><figcaption>Bellman equation for the value function</figcaption></figure><p>The Bellman optimality equation can be written in a similar way:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/336/1*Pq3d9xAL47SWj01weffK8w.png" /><figcaption>Bellman optimality equation for the value function</figcaption></figure><p>These concepts extend easily to multiple paths, with different actions leading to different states. In that case, the Bellman optimality equation is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*cG8DP0vr-mTlGdCaiKzYGw.png" /><figcaption>Optimal state-value function</figcaption></figure><p>Using the above equation, we can find the optimal value function for each state in our student MDP example.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/579/1*VxpjQjKe43sqRAAY_UpUwQ.png" /><figcaption>Optimal state-value function</figcaption></figure><p>The optimal action-value function can be expressed in similar fashion as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*fFRhDWp_BeWNuFhwjHDQ9Q.png" /><figcaption>Optimal action-value function</figcaption></figure><p>This equation gives the following result in our student MDP example.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/701/1*VbI8ssdx6VaZbs4IvWWKow.png" /><figcaption>Optimal action-value function</figcaption></figure>
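<p><em>To make the one-step lookahead concrete, here is a tiny sketch of the Pub-vs-Study decision at Class 3; the optimal state values (6, 8 and 10 for the three class states) are read off David Silver’s slides and taken as given here:</em></p><pre># one-step lookahead at state Class 3 of the student MDP (discount factor 1)
v_star = {"class1": 6.0, "class2": 8.0, "class3": 10.0}

# q*(Class 3, Study): reward +10, then the episode ends (terminal value 0)
q_study = 10.0

# q*(Class 3, Pub): reward +1, then land in Class 1/2/3 with p = 0.2/0.4/0.4
q_pub = 1.0 + 0.2 * v_star["class1"] + 0.4 * v_star["class2"] + 0.4 * v_star["class3"]

print(q_study, q_pub)   # 10.0 and 9.4, so the optimal policy is to study</pre>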
<p>Once we have the optimal action-value function, we can find the optimal policy by taking its maximum. Formally:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/502/1*Ae3H-_VUhTZUdQsqD9Fjcw.png" /><figcaption>Optimal policy</figcaption></figure><p>The optimal policy, which maximizes the reward for our student, is shown by the red arcs in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/707/1*YuH2hGJlJODqUrelYCRmsg.png" /><figcaption>Optimal policy</figcaption></figure><p>Summary:</p><p>An MDP represents the reinforcement learning problem mathematically, and the goal of solving an MDP is to find an optimal policy that maximizes the sum of expected rewards. Finding an optimal policy becomes easy once we have the optimal action-value function, and the intuition behind the Bellman equation simplifies the process of finding that action-value function.</p><p>References:</p><ol><li><a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">An Introduction to Reinforcement Learning, Sutton and Barto</a></li><li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">David Silver’s course on Reinforcement Learning</a></li></ol><p><strong>PS: I wrote this post based on my understanding of Reinforcement Learning. Any suggestion/improvement about the content and/or style of writing will be appreciated.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ae44cf53cc1" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Spine Segmentation using U-Net]]></title>
            <link>https://medium.com/@bibekchaudhary/spine-segmentation-using-u-net-14bc5ab22b78?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/14bc5ab22b78</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[segmentation]]></category>
            <category><![CDATA[biomedical]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[spine-surgery]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Wed, 12 Sep 2018 07:04:51 GMT</pubDate>
            <atom:updated>2018-09-12T07:07:14.923Z</atom:updated>
<content:encoded><![CDATA[<p>This post is based on my internship experience, where I worked on segmentation of the spine using the <a href="https://arxiv.org/abs/1505.04597">U-Net</a> architecture.</p><p><strong>Dataset:</strong> <a href="https://www.emedicinehealth.com/ct_scan/article_em.htm">CT</a> scans of 11 patients collected from the institution-affiliated hospital. The data were in <a href="https://en.wikipedia.org/wiki/DICOM">DICOM</a> format with no labels.</p><p><strong>Image pre-processing:</strong> since the data had no labels, I had to generate labels manually. I used <a href="https://www.slicer.org/">3D Slicer</a>’s automatic segmentation feature to generate labels and save them as DICOM files.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rHt_wOA_qwZMnk0LKWnmuQ.png" /><figcaption>Automatic segmentation in 3D Slicer</figcaption></figure><p>The figure above shows how automatic segmentation can be used to generate labels (masks). You can move the slider to adjust the noise in the mask.</p><p>To save the mask as DICOM files:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*afkrMBokP2-yQieieCecbA.png" /><figcaption>Creating a DICOM series of masks</figcaption></figure><p>The DICOM files can then be read, cropped and saved as .png files using the Python package <a href="https://pydicom.github.io/pydicom/stable/getting_started.html">pydicom</a>.</p><p>An image and label obtained after pre-processing are shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/690/1*QFnYIxOs6AVImW-KkdiGGQ.png" /><figcaption>Image (left) and label (right)</figcaption></figure><p>There is still noise (the white dots) in the label; this was acceptable in my case. You can manually reduce the noise in 3D Slicer if you want.</p><p><strong>Training:</strong> the images and labels were split into train and test sets and trained with the U-Net architecture. Training ran for 100 epochs using the Adam optimizer with a learning rate of 0.001.</p><p><strong>Evaluation metric:</strong> the Jaccard index, also known as Intersection over Union (IoU), was used as the evaluation metric during training. For two sets A and B, the Jaccard index is defined as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*xpwWkZsoyMWzG-RtqZK6_w.gif" /><figcaption>Jaccard index (IoU)</figcaption></figure><p><strong>Loss function:</strong> the loss function used for optimization can be defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/1*ZT-AQnFCTpNlWu6CRdGHIg.png" /><figcaption>Loss function: L</figcaption></figure>
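<p><em>For reference, a PyTorch sketch of the soft Jaccard metric and a common loss built on it (binary cross-entropy minus the log of the soft Jaccard index); treat it as illustrative rather than the project’s exact code:</em></p><pre>import torch
import torch.nn.functional as F

def jaccard(probs, target, eps=1e-7):
    """Soft Jaccard index (IoU), averaged over a batch of (N, H, W) masks."""
    inter = (probs * target).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) - inter
    return ((inter + eps) / (union + eps)).mean()

def loss(probs, target):
    """BCE pushes per-pixel accuracy; -log(jaccard) pushes mask overlap."""
    return F.binary_cross_entropy(probs, target) - torch.log(jaccard(probs, target))</pre>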
<p><strong>Result:</strong> the Jaccard index obtained after training the U-Net for 100 epochs was 0.7; this could be improved by training longer and using data augmentation, neither of which was used in this project.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/664/1*G3SsRnq4_YMZKeNUqYu-Fg.png" /><figcaption>Image with real label (left) and image with predicted label (right)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/663/1*RIm4bGiBHziVpsvh2cMNpA.png" /><figcaption>Image with real label (left) and image with predicted label (right)</figcaption></figure><p><strong>PS: I cannot publicly share the dataset and code for this project, as it was not a personal project. However, I edited and improved on </strong><a href="https://github.com/jocicmarko/ultrasound-nerve-segmentation"><strong>this</strong></a><strong> code.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=14bc5ab22b78" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reinforcement Learning Simplified]]></title>
            <link>https://medium.com/@bibekchaudhary/reinforcement-learning-simplified-1cf40285f05d?source=rss-84a0db77e00e------2</link>
            <guid isPermaLink="false">https://medium.com/p/1cf40285f05d</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[robotics]]></category>
            <dc:creator><![CDATA[Bibek Chaudhary]]></dc:creator>
            <pubDate>Tue, 11 Sep 2018 21:50:11 GMT</pubDate>
            <atom:updated>2018-09-11T21:59:14.455Z</atom:updated>
<content:encoded><![CDATA[<p><em>In simple terms, Reinforcement Learning is learning from experience</em></p><p>Just like humans, machines can also learn from their interaction with the environment; Reinforcement Learning is how they do it. It is the branch of Machine Learning in which the learner is not explicitly trained (as in other Machine Learning domains); rather, it is supposed to learn from experience by interacting with the environment. The interaction includes taking actions through trial-and-error search and getting feedback (positive or negative) from the environment. It has the following elements:</p><ol><li><strong>Agent:</strong> it learns and makes decisions by interacting with its environment.</li><li><strong>Environment:</strong> everything that is outside of the agent and cannot be directly controlled by the agent. It responds to the agent’s actions by giving feedback and presenting a new state to the agent.</li><li><strong>Reward function:</strong> it defines the reward of the agent depending on its action. It tells the agent what kind of reward it will get if it takes a particular action.</li><li><strong>Policy:</strong> the behavior of the agent is defined by the policy. It tells the agent which actions to take, and which to avoid, to achieve its goal.</li><li><strong>Value function:</strong> it evaluates the action the agent takes in a particular state, considering future rewards. It gives the agent information about the long-term consequences of its actions.</li><li><strong>Model of the environment (optional):</strong> a representation of the environment, based on which the environment gives feedback and presents new states to the agent.</li></ol><p>I will illustrate the idea behind each element through the popular childhood game of tic-tac-toe.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/195/1*JJBmGqvcAPr_QKNoG_jsEg.png" /><figcaption>Tic-tac-toe game</figcaption></figure><p>Tic-tac-toe is a 3x3 board game for two players, and the player who places three Os or Xs in consecutive places, horizontally, vertically or diagonally, wins the game. Otherwise the game is a draw. The figure above shows Xs in three consecutive places diagonally.</p><p>Now consider two players, Player A and Player B, playing against each other; Player A is an imperfect player (semi-skilled, and can make mistakes at times) and Player B is the one who can learn from experience. In this case the elements are:</p><p><strong>Agent:</strong> Player B, because it learns and makes decisions based on its interaction with the environment.</p><p><strong>Environment:</strong> everything else (including Player A), as it gives feedback and presents new states to Player B.</p><p><strong>Reward signal:</strong> the goal of Player B; in this case, to win the game.</p><p><strong>Policy:</strong> what move to make when going from one state to another.</p><p><strong>Value function:</strong> which moves are good or bad for Player B in the long term.</p><p><strong>Model of the environment:</strong> the representation of the environment that is used to give rewards to Player B.</p><p>Now that we have an overview of the elements of reinforcement learning, let me explain the interaction between them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/769/1*rAi2PI4oELUqGjcj9MzOmA.png" /><figcaption>Agent-Environment Interaction</figcaption></figure><p>At each time step t, the environment sends the agent some information about its state s&lt;t&gt;; in the example above, s&lt;t&gt; is the column/row configuration of the board. The agent then takes an action a&lt;t&gt; depending on s&lt;t&gt;. In the tic-tac-toe game, a&lt;t&gt; is the move Player B makes after learning its state. As a consequence of the agent’s action, the environment then sends a numerical reward r&lt;t+1&gt; at time step t+1. This interaction continues until the agent achieves its goal.</p>
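<p><em>A minimal sketch of this loop in code; <code>env</code> and <code>policy</code> are hypothetical stand-ins for a real environment and a learned policy:</em></p><pre>def run_episode(env, policy):
    """Run one agent-environment episode and return the total reward."""
    state = env.reset()                 # environment presents the initial state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                   # agent picks a_t based on s_t
        state, reward, done = env.step(action)   # env returns r_(t+1), s_(t+1)
        total_reward += reward
    return total_reward                 # the loop ends when the goal is reached</pre>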
<p>References:</p><ol><li><a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">An Introduction to Reinforcement Learning, Sutton and Barto</a></li><li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">David Silver’s course on Reinforcement Learning</a></li></ol><p><strong>PS: This is my first online post. I wrote it based on my understanding of Reinforcement Learning. Any suggestion/improvement about the content and/or style of writing will be appreciated.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1cf40285f05d" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>