- Intro brain net
- Neural Network
- Activation
1. Intro brain net
Let me introduce some basic concepts first.
Deep learning (DL) is one of the most successful implementations of machine learning (ML) so far, and the neural network is the best-known method for implementing DL.
Neural networks are inspired by the structure of the brain:
there are hundreds of billions of neurons in the human brain, and each neuron transfers information through its axon. When one or more neurons receive a signal, they process it and decide which axons should send the processed information on to the next neurons, and so on.
2. Neural Network
Let's see how these neurons work in DL. A neural network looks like the picture below shows:
There are three types of layers: the input layer, hidden layers, and the output layer.
The input layer is the source of the input data; we need to normalize the data into a specific type or dimension. The hidden layers are like a dark-magic region: data flows through the region and, boom~, a bird flies out (kidding ^^). The output layer is the final result we want to get, e.g. when we want the neural network to predict whether a picture shows a cat or a dog.
OK, this time we will implement the simplest neural network: it consists of one input layer, one hidden layer, and one output layer.
In the previous blog we used a linear formula, y = mx + b, where every variable is a single number. In this blog's demo we will do matrix operations; if you are unfamiliar with matrices, click this URL.
Before we get started with our demo, let's see it at a high level:
input_layer * weights_input_to_hidden = hidden_layer
hidden_layer * weights_hidden_to_output = output_layer
The dataset consists of input data and output data; what we need to calculate is the two weight matrices. This is the core code of the neural network:
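The two multiplications above can be sketched with numpy (the layer sizes here, 3 input features, 4 hidden units, and 1 output, are arbitrary assumptions for illustration):

```python
import numpy as np

np.random.seed(0)  # fix the seed so the random weights are reproducible

input_layer = np.array([[0.5, -0.2, 0.1]])                     # shape (1, 3)
weights_input_to_hidden = np.random.normal(0.0, 0.1, (3, 4))   # shape (3, 4)
weights_hidden_to_output = np.random.normal(0.0, 0.1, (4, 1))  # shape (4, 1)

hidden_layer = np.dot(input_layer, weights_input_to_hidden)    # shape (1, 4)
output_layer = np.dot(hidden_layer, weights_hidden_to_output)  # shape (1, 1)
print(hidden_layer.shape, output_layer.shape)
```

Note how the inner dimensions must match for each matrix multiplication: (1, 3) x (3, 4) gives (1, 4), then (1, 4) x (4, 1) gives (1, 1).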
We use the numpy library in the code to simplify the matrix operations.
Here are some references:
- np.random.normal(): generate random values drawn from a normal distribution
- np.array(): create an array
- np.dot(): matrix multiplication
- np.mean(): compute the arithmetic mean along the specified axis
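A quick demo of the four calls above (the shapes and values are made up for illustration):

```python
import numpy as np

np.random.seed(42)                       # fix the seed so the draw is repeatable
w = np.random.normal(0.0, 1.0, (2, 3))   # 2x3 matrix of normally distributed values
a = np.array([[1.0, 2.0]])               # 1x2 matrix created from a Python list
product = np.dot(a, w)                   # matrix multiplication, shape (1, 3)
avg = np.mean(product)                   # arithmetic mean over all entries
print(product.shape, avg)
```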
Code explanation:
In this demo, every matrix is two-dimensional.

__init__:
- input_nodes: number of rows of the input layer (matrix)
- hidden_nodes: number of rows of the input-to-hidden weight (matrix)
- output_nodes: number of rows of the hidden-to-output weight (matrix)
- weights_input_to_hidden: the weights between the input and hidden layers
- weights_hidden_to_output: the weights between the hidden and output layers
- learning_rate: the size of each step taken toward the optimal value
- activation function: the sigmoid function (introduced in detail later)
- derivative of activation: the derivative of the sigmoid function

As we mentioned before, our goal is to calculate the two weight matrices.

train:
- a for loop trains the model
- we first calculate the forward values, then use them for backpropagation
- lines 24 ~ 35: forward pass
- lines 36 ~ 42: backward pass
- lines 44 ~ 45: weight update

Updating the weights requires differentiating the forward pass; if we used TensorFlow, we would not need to calculate the derivatives manually. If you don't understand the code, that's OK; just understand the process at a high level.

run:
- the forward pass only, the same as in the train function.
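The original listing is embedded in the post and not reproduced here, so below is only a sketch of a network matching the description above: one hidden layer, sigmoid activation on the hidden layer, manual backpropagation. All names and initialization choices are assumptions, not the author's exact code.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # weights drawn from a normal distribution, scaled by layer size
        self.weights_input_to_hidden = np.random.normal(
            0.0, input_nodes ** -0.5, (input_nodes, hidden_nodes))
        self.weights_hidden_to_output = np.random.normal(
            0.0, hidden_nodes ** -0.5, (hidden_nodes, output_nodes))
        self.lr = learning_rate
        self.activation = lambda x: 1.0 / (1.0 + np.exp(-x))  # sigmoid

    def train(self, features, targets):
        # --- forward pass ---
        hidden_outputs = self.activation(
            np.dot(features, self.weights_input_to_hidden))
        final_outputs = np.dot(hidden_outputs, self.weights_hidden_to_output)

        # --- backward pass ---
        error = targets - final_outputs
        hidden_error = np.dot(error, self.weights_hidden_to_output.T)
        # derivative of sigmoid: s * (1 - s)
        hidden_grad = hidden_error * hidden_outputs * (1.0 - hidden_outputs)

        # --- weight update (gradient descent step) ---
        self.weights_hidden_to_output += self.lr * np.dot(hidden_outputs.T, error)
        self.weights_input_to_hidden += self.lr * np.dot(features.T, hidden_grad)

    def run(self, features):
        # forward pass only, the same computation as in train
        hidden_outputs = self.activation(
            np.dot(features, self.weights_input_to_hidden))
        return np.dot(hidden_outputs, self.weights_hidden_to_output)
```

Training on a small dataset for a few hundred iterations should steadily reduce the squared error; with real data you would also normalize the inputs, as noted earlier.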
3. Activation
Why use an activation function?
Because activation functions are non-linear. Linear functions are simple but limited: we want the network to be able to approximate any function, and an arbitrary function cannot be represented by a linear one, so we must use non-linear functions. If we only use linear functions, then no matter how many layers we stack, the result is still linear: f(w1 * x1 + w2 * x2 + w3 * x3).
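A quick numerical check of this claim: two stacked linear layers with no activation in between collapse into a single linear layer.

```python
import numpy as np

np.random.seed(1)
x = np.random.rand(1, 3)
W1 = np.random.rand(3, 4)
W2 = np.random.rand(4, 2)

# two "layers" with no activation in between...
two_layers = np.dot(np.dot(x, W1), W2)
# ...are equivalent to one linear layer whose weight matrix is W1 @ W2
one_layer = np.dot(x, np.dot(W1, W2))

print(np.allclose(two_layers, one_layer))  # True
```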
What is an activation function?
As mentioned above, an activation function is a non-linear function.
When we train the model, we don't want every value to pass into the next layer, because some values are useless and, to a certain extent, hurt the accuracy of the model. Filtering out the negative values and amplifying the positive ones not only accelerates training but also improves accuracy.
Commonly used activation functions
Sigmoid
The function is σ(x) = 1/(1+exp(-x)). It squashes real values into the range [0, 1]; in particular, large negative numbers become 0 and large positive numbers become 1.
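A small sketch of sigmoid's squashing behavior, together with its derivative s * (1 - s), which matters for the drawbacks discussed next:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
grad = s * (1.0 - s)  # derivative of sigmoid
print(s)     # squashed into (0, 1); the ends are close to 0 and 1
print(grad)  # near zero at both ends, which "kills" the gradient
```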
But sigmoid has two major drawbacks:
- It kills gradients: when the output saturates, the local gradient is near zero. During backpropagation this local gradient is multiplied by the gradient of the gate's output with respect to the whole objective, so if the local gradient is very small it effectively "kills" the gradient, and almost no signal flows through the neuron to its weights and, recursively, to its data.
- Its outputs are not zero-centered.
Softmax
The softmax function also squashes values into [0, 1]; the difference from sigmoid is that all the values produced by softmax add up to 1, so it is often used for classification problems.
Mathematically, the softmax function is softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
Here is a question: why not use the raw values directly?
e.g. 1.2 / (1.2 + 0.9 + 0.4) also squashes into [0, 1] with a sum of 1.
I guess there are two reasons:
1) it avoids negative values
2) the exponential makes the derivative convenient for backpropagation
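To illustrate the first reason, here is a small softmax sketch; subtracting the max before exponentiating is a standard numerical-stability trick, not something from the post:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / np.sum(e)

scores = np.array([1.2, 0.9, 0.4])
print(softmax(scores), softmax(scores).sum())  # a distribution summing to 1

# raw normalization breaks once a score is negative:
neg = np.array([1.2, -0.9, 0.4])
print(neg / neg.sum())  # contains values outside [0, 1]
print(softmax(neg))     # still a valid probability distribution
```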
Tanh
The function is tanh(x) = 2σ(2x) - 1; it squashes values into the range [-1, 1]. Its drawbacks are the same as sigmoid's, but it is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.
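The identity tanh(x) = 2σ(2x) - 1 can be verified numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
# tanh equals the rescaled, shifted sigmoid, so its outputs are zero-centered in [-1, 1]
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```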
ReLU
The function is f(x) = max(0, x); the input is simply thresholded at zero.
Here is the drawback:
ReLU units can be fragile during training and can "die": e.g. a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any datapoint again (presumably because the large update pushes the weighted input negative for every datapoint). With a proper setting of the learning rate this is less frequently an issue.
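A minimal ReLU sketch; the second printed line is its gradient, which is exactly zero for negative inputs, which is why a neuron stuck there receives no further updates:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # thresholded at zero

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(x)
grad = (x > 0).astype(float)  # ReLU gradient: 0 for negative inputs, 1 otherwise
print(out)   # [0.  0.  0.  0.5 2. ]
print(grad)  # [0. 0. 0. 1. 1.]
```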
Leaky ReLU
The function is f(x) = max(ax, x). This activation is an attempt to fix the "dying ReLU" problem: instead of the function being zero when x is negative, we give it a very small slope a.
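A sketch of Leaky ReLU; the post leaves the slope a unspecified, so a = 0.01 here is an assumed "very small" value:

```python
import numpy as np

def leaky_relu(x, a=0.01):  # a = 0.01 is an assumed small slope for negative inputs
    return np.maximum(a * x, x)

x = np.array([-2.0, 0.0, 2.0])
print(leaky_relu(x))  # [-0.02  0.    2.  ]
# negative inputs keep a small non-zero gradient, so the neuron can recover
```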
Reference
Code:
Course:
Article: