Simplest Neural Network & Activation


  1. Intro: the brain network
  2. Neural Network
  3. Activation

1. Intro: the brain network

Let me introduce some basic concepts:

Deep learning (DL) is one of the most successful approaches to machine learning (ML) so far, and the neural network is the main method used to implement DL.

A neural network is inspired by the structure of the brain:

Brain neurons

There are on the order of a hundred billion neurons in the human brain. Each neuron transmits information through its axon: when one or more neurons receive a signal, they process it and decide which axons should pass the processed information on to the next neurons, and so on.

Transfer information between neurons

2. Neural Network

Let’s see how these neurons work in DL. A neural network looks like the picture below:

There are three types of layers: the input layer, hidden layers, and the output layer.

The input layer is the source of the input data; we need to normalize the data into a specific type and set of dimensions. The hidden layers are like a dark-magic region: data flows through it and, boom, a bird flies out (kidding ^^). The output layer is the final result we want, e.g. we want the network to predict whether a picture shows a cat or a dog.

OK, this time we implement the simplest neural network: it consists of one input layer, one hidden layer, and one output layer.

In the previous blog we used a linear formula, y = mx + b, where every variable is a single number. In this blog’s demo we will do matrix operations; if you are unfamiliar with matrices, click this URL.

Before we start the demo, let’s look at it at a high level:

input layer * weights_input_to_hidden = hidden layer

hidden layer * weights_hidden_to_output = output layer

The dataset consists of input data and output data; what we need to calculate are the two weight matrices. This is the core code of the neural network:

core code of neural network
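The original post embeds the code as a gist, which isn’t reproduced here. Below is a minimal sketch of a network with the structure described in this post (the class and method names NeuralNetwork, train, and run, the linear output layer, and the weight-initialization scale are my assumptions, not necessarily the author’s exact code):

```python
import numpy as np

class NeuralNetwork(object):
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # number of nodes in each layer and the step size for weight updates
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        self.lr = learning_rate

        # weights drawn from a normal distribution via np.random.normal
        self.weights_input_to_hidden = np.random.normal(
            0.0, self.input_nodes ** -0.5, (self.input_nodes, self.hidden_nodes))
        self.weights_hidden_to_output = np.random.normal(
            0.0, self.hidden_nodes ** -0.5, (self.hidden_nodes, self.output_nodes))

        # sigmoid activation and its derivative (written in terms of the sigmoid output)
        self.activation = lambda x: 1.0 / (1.0 + np.exp(-x))
        self.activation_derivative = lambda y: y * (1.0 - y)

    def train(self, features, targets):
        # forward pass
        hidden_outputs = self.activation(np.dot(features, self.weights_input_to_hidden))
        final_outputs = np.dot(hidden_outputs, self.weights_hidden_to_output)  # linear output

        # backward pass: propagate the error back through the two weight matrices
        error = targets - final_outputs
        hidden_error = np.dot(error, self.weights_hidden_to_output.T)
        hidden_grad = hidden_error * self.activation_derivative(hidden_outputs)

        # update the two weight matrices, averaged over the batch
        n = features.shape[0]
        self.weights_hidden_to_output += self.lr * np.dot(hidden_outputs.T, error) / n
        self.weights_input_to_hidden += self.lr * np.dot(features.T, hidden_grad) / n

    def run(self, features):
        # forward pass only, the same computation as the first half of train()
        hidden_outputs = self.activation(np.dot(features, self.weights_input_to_hidden))
        return np.dot(hidden_outputs, self.weights_hidden_to_output)


# toy usage: 3 inputs, 4 hidden nodes, 1 output
net = NeuralNetwork(3, 4, 1, learning_rate=0.1)
X = np.array([[0.1, 0.2, 0.3], [0.5, 0.1, 0.9]])
y = np.array([[0.4], [0.8]])
for _ in range(500):
    net.train(X, y)
print(net.run(X))  # should move toward [[0.4], [0.8]]
```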

We use the numpy library in the code to simplify the matrix operations.

Here is a quick reference (a short snippet follows the list):

  1. np.random.normal(): generate random values drawn from a normal distribution
  2. np.array(): create an array
  3. np.dot(): matrix multiplication
  4. np.mean(): compute the arithmetic mean along the specified axis
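To keep this self-contained, here is a quick illustration of those four calls (the values are arbitrary examples of mine):

```python
import numpy as np

np.random.seed(0)                         # only so the example is repeatable
w = np.random.normal(0.0, 0.1, (2, 3))    # random values: mean 0.0, std 0.1, shape (2, 3)
x = np.array([[1.0, 2.0]])                # a 1x2 matrix built with np.array
h = np.dot(x, w)                          # matrix multiplication -> shape (1, 3)
print(np.mean(h))                         # arithmetic mean of all entries of h
print(np.mean(w, axis=0))                 # mean along a specified axis (here, per column)
```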

Code explanation:

In this demo every matrix is two-dimensional.

__init__:
input_nodes: number of nodes in the input layer (the rows of the input matrix)
hidden_nodes: number of nodes in the hidden layer (a dimension of weights_input_to_hidden)
output_nodes: number of nodes in the output layer (a dimension of weights_hidden_to_output)
weights_input_to_hidden: the weights between the input and hidden layers
weights_hidden_to_output: the weights between the hidden and output layers
learning_rate: the size of each step taken toward the optimal value
activation function: the sigmoid function (introduced in detail later)
activation derivative: the derivative of the sigmoid function

As we mentioned before, our goal is to calculate the two weight matrices.

train:
A for loop trains the model. We first compute the forward pass, then use those values for backpropagation:
lines 24–35 of the original gist: the forward pass
lines 36–42: the backward pass
lines 44–45: the weight update

Updating the weights requires differentiating the forward pass; if we used TensorFlow we would not need to calculate this manually. If you don’t fully understand the code, that’s OK, just grasp the process at a high level.

run:
The forward pass only, the same as in the train function.

3. Activation

Why use activation?

Because an activation is non-linear. Linear functions are simple but limited; we want the network to be able to approximate any function, and an arbitrary function cannot be represented by a linear map, so we must use something non-linear. If we only use linear functions, then no matter how many layers we stack, we still get a linear function: f(w1 * x1 + w2 * x2 + w3 * x3).
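A quick way to see this collapse with numpy (random weights, my own illustration):

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(5, 3)     # a batch of 5 inputs with 3 features
W1 = np.random.randn(3, 4)    # "hidden" layer weights
W2 = np.random.randn(4, 2)    # "output" layer weights

two_layers = np.dot(np.dot(x, W1), W2)   # two stacked linear layers, no activation
one_layer = np.dot(x, np.dot(W1, W2))    # a single linear layer with the combined weights

print(np.allclose(two_layers, one_layer))  # True: stacking linear layers is just another linear layer
```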

What is activation?

As mentioned, an activation function is a non-linear function.

When we train the model, we don’t want every value to pass through to the next layer: some values are useless and, to some extent, hurt the accuracy of the model. Filtering out negative values and amplifying positive ones not only accelerates training but also improves accuracy.

Commonly used activation functions

Sigmoid

The function is σ(x) = 1/(1+exp(-x)). It squashes a real value into the range [0, 1]; in particular, large negative numbers become 0 and large positive numbers become 1.
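A small sketch of that squashing behaviour (not from the original post):

```python
import numpy as np

def sigmoid(x):
    # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
print(s)             # large negatives -> ~0, large positives -> ~1
print(s * (1 - s))   # the derivative: nearly zero at both extremes, which is what "kills" gradients
```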

But sigmoid has two major drawbacks:

  1. It kills gradients: when the output saturates near 0 or 1, the local gradient is nearly zero. During backpropagation this local gradient is multiplied by the gradient of the gate’s output for the whole objective, so if it is very small it effectively “kills” the gradient and almost no signal flows through the neuron to its weights and, recursively, to its data.
  2. Its outputs are not zero-centered.
sigmoid

Softmax

The softmax function also squashes values into [0, 1]; the difference from sigmoid is that all the values produced by softmax sum to 1, so it is often used for classification problems.

Mathematically, the softmax function is softmax(x_i) = exp(x_i) / Σ_j exp(x_j).

Here is a question: why don’t we use the raw values directly?

E.g. 1.2 / (1.2 + 0.9 + 0.4) also squashes the value into [0, 1], and the results sum to 1.

I guess there are two reasons (see the comparison below):

1) it avoids problems with negative values;

2) the exponential gives a convenient derivative for backpropagation.
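A small comparison of softmax against the naive normalization above, using the same numbers plus a negative score (my illustration, not the original code):

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e)

scores = np.array([1.2, 0.9, 0.4])
print(scores / np.sum(scores))   # naive normalization: [0.48, 0.36, 0.16]
print(softmax(scores))           # softmax: still sums to 1, but via exponentials

neg = np.array([1.2, -0.9, 0.4])
print(neg / np.sum(neg))         # naive version breaks: negative "probabilities" and values > 1
print(softmax(neg))              # softmax keeps every value in [0, 1]
```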

Tanh

The function is tanh(x) = 2σ(2x) - 1; it squashes values into the range [-1, 1]. Its drawback (saturation) is the same as sigmoid’s, but it is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.
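The identity tanh(x) = 2σ(2x) - 1 is easy to verify numerically (a quick sanity check of mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True: tanh is a scaled, shifted sigmoid
print(np.tanh(x).min(), np.tanh(x).max())               # outputs stay in [-1, 1] and are zero-centered
```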

tanh

ReLU

The function is f(x) = max(0, x); it is simply thresholded at zero.

Here is the drawback:

ReLU units can be fragile during training and can “die”: e.g. a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any datapoint again (does a large gradient update push the weights negative?). With a proper setting of the learning rate this is less frequently an issue.

ReLU

Leaky ReLU

The function is f(x) = max(ax, x). This activation attempts to fix the “dying ReLU” problem: instead of the function being zero when x is negative, we give it a very small slope a.
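Both functions are one-liners with numpy; the slope a = 0.01 below is a common choice and my assumption, not a value from the post:

```python
import numpy as np

def relu(x):
    # thresholds at zero: negative inputs produce exactly 0 (and a zero gradient)
    return np.maximum(0, x)

def leaky_relu(x, a=0.01):
    # negative inputs keep a small slope a instead of being zeroed out
    return np.maximum(a * x, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.    0.    0.    0.5   3.  ]
print(leaky_relu(x))  # [-0.03  -0.005  0.     0.5    3.   ]
```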

Leaky ReLU
