# Non Linear Model — Solve using combination of Linear Models

In this article we will see how a model learns nonlinear boundaries on complex data sets.

Example of data that is not linearly separable. Source: Udacity

Background

Real-world data sets usually share one problem: the data cannot be separated by a single straight line. So what comes after a line? Maybe a circle, maybe two lines, or maybe some curve. This is where neural networks can show their full potential.

Let’s start.

In a nutshell, for data that cannot be separated by a line, we are going to create a probability function in which points in the blue region are more likely to be blue and points in the red region are more likely to be red. The curve that separates the two regions is the set of points that are equally likely to be blue or red.

Combination

We are going to use a very simple trick: combine two linear models into one nonlinear model, as shown in the figure below.

Combining two linear models. Source: Udacity

Before combining the linear models, let’s look at the mathematics behind it. A linear model, as we know, is a whole probability space: for every point, it gives us the probability of that point being blue. Let’s pick any one point and check its probability of being blue under both linear models. As shown in the figure below, the first linear model gives the point a probability of 0.7 and the second linear model gives it a probability of 0.8.

Probability of a point under the two linear models. Source: Udacity
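To make "a linear model is a whole probability space" concrete, here is a minimal sketch. It assumes the model turns its linear score into a probability with the sigmoid function, and the weights and bias are hypothetical values chosen for illustration, not the ones behind the figure.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_model_probability(point, weights, bias):
    """Probability that `point` is blue under one linear model."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return sigmoid(score)

# Hypothetical model whose boundary is the line 2*x1 - 1*x2 + 0.5 = 0
weights, bias = (2.0, -1.0), 0.5

p_on_line = linear_model_probability((0.0, 0.5), weights, bias)    # exactly on the line
p_blue_side = linear_model_probability((1.0, 0.0), weights, bias)  # positive side of the line
p_red_side = linear_model_probability((-1.0, 0.0), weights, bias)  # negative side of the line
print(p_on_line, p_blue_side, p_red_side)
```

A point on the line gets probability 0.5, and the probability moves toward 1 or 0 as the point moves away from the line on either side.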

Now the question is: how do we combine these two? Well, the simplest way to combine two numbers is to add them, right? Here that gives 0.7 + 0.8 = 1.5.

After merging the two linear models (sum of probabilities). Source: Udacity

Sigmoid

But now this doesn’t look like a probability anymore, since it’s bigger than one (1.5 > 1), and probabilities need to be between 0 and 1. So what can we do? How do we turn a number that is larger than 1 into something between 0 and 1? We already have a pretty good tool that turns every number into something between 0 and 1: the sigmoid function. We apply the sigmoid function to 1.5 and get about 0.82, and that is the probability of this point being blue in the resulting probability space.
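The arithmetic above can be checked in a few lines. This is only a sketch of the calculation; the variable names are illustrative.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p_first, p_second = 0.7, 0.8           # probabilities from the two linear models
combined = sigmoid(p_first + p_second)  # sigmoid(1.5)
print(round(combined, 2))  # 0.82
```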

Weighting in Combination

Now, what if we wanted to weight this sum? What if, say, we wanted the model on the left to have more of a say in the resulting probability than the model on the right? Well, we can add weights. For example, we can say "seven times the left one plus five times the right one." To combine the models, we take the first probability and multiply it by seven, then take the second one and multiply it by five. We can even add a bias if we want. Say the bias is minus 6; then we add it to the whole expression.

7*0.7 + 5*0.8 - 6 = 2.9, and the sigmoid of 2.9 is about 0.95.
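The weighted combination can be sketched as a small helper. The function name `combine` is made up for this example; the numbers are the ones used in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combine(probabilities, weights, bias):
    """Weighted sum of the models' probabilities plus a bias, squashed by sigmoid."""
    z = sum(w * p for w, p in zip(weights, probabilities)) + bias
    return sigmoid(z)

weighted = combine((0.7, 0.8), (7.0, 5.0), -6.0)  # sigmoid(7*0.7 + 5*0.8 - 6)
print(round(weighted, 2))  # 0.95
```

The plain sum from the previous section is the special case with weights (1, 1) and bias 0.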

Get weights via Neural Network

Now let’s see how these weights show up in a neural network. Say the linear model on the left has the linear equation 5x1 - 2x2 + 8, and the one on the right has 7x1 - 3x2 - 1. The image below shows the perceptron representation of both linear models.

Perceptrons for both linear models. Source: Udacity

Now let’s use another perceptron to combine these two models using the linear equation from before: seven times the first model plus five times the second model minus six.

Perceptron for combining the linear models. Source: Udacity

Now we join all of these together and get a neural network.
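Putting the pieces together, a forward pass through this small network might look like the sketch below. The weights and biases are the ones given in the text. The helper names and the sample input point are assumptions for illustration, as is the use of sigmoid in every perceptron.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    """One node: weighted sum of its inputs plus a bias, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def network(x1, x2):
    # Hidden layer: the two linear models
    h1 = perceptron((x1, x2), (5.0, -2.0), 8.0)   # 5*x1 - 2*x2 + 8
    h2 = perceptron((x1, x2), (7.0, -3.0), -1.0)  # 7*x1 - 3*x2 - 1
    # Output layer: 7*h1 + 5*h2 - 6
    return perceptron((h1, h2), (7.0, 5.0), -6.0)

p = network(0.0, 0.0)  # arbitrary sample point, not taken from the figures
print(p)               # probability that the point is blue
```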

Also, note that the diagrams above draw the bias in two ways: inside the node (left and middle diagrams) and as a separate node (rightmost diagram). To avoid confusion: the middle diagram is just a cleaned-up version of the leftmost one.

So here the neural network learns two types of weights: one set for each linear model (including its bias), and another set for combining the linear models.

Different Neural Network Architectures

CASE 1 — Larger Hidden Layer

Now we are combining three linear models, i.e. we added one more node to the hidden layer, to obtain a triangular boundary in the output layer. This relates to the universal approximation theorem, which says that a neural network with a single hidden layer, given enough hidden nodes and a suitable nonlinear activation, can approximate any continuous function of x on a bounded input region to any degree of precision. The theorem guarantees that such a network exists; it does not say how to find its weights.
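As a rough illustration of how three hidden nodes can carve out a triangle, the sketch below hand-picks weights instead of learning them. Everything here is an assumption for the example: the triangle with corners (0, 0), (1, 0), and (0, 1), the steepness factor `k`, and the output weights. The resulting boundary is only approximately triangular, with rounded corners.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def triangle_network(x1, x2, k=20.0):
    """Probability that (x1, x2) lies inside a hand-built triangular region."""
    # Hidden layer: one steep linear model per edge of the triangle
    h1 = sigmoid(k * x1)               # close to 1 when x1 > 0
    h2 = sigmoid(k * x2)               # close to 1 when x2 > 0
    h3 = sigmoid(k * (1.0 - x1 - x2))  # close to 1 when x1 + x2 < 1
    # Output layer: fires only when all three hidden nodes agree (sum near 3)
    return sigmoid(10.0 * (h1 + h2 + h3) - 25.0)

inside = triangle_network(0.25, 0.25)
outside_a = triangle_network(1.0, 1.0)    # violates x1 + x2 < 1
outside_b = triangle_network(-0.5, 0.25)  # violates x1 > 0
print(inside, outside_a, outside_b)
```

In a trained network these weights would be found by the learning procedure; they are fixed by hand here only to show that three linear models are enough for this shape.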

CASE 2 — Output layer with more nodes

This just means that we have more outputs, which gives us a multiclass classification model. If our model is telling us whether an image is a cat, a dog, or a bird, then each node in the output layer simply outputs a score for one of the classes (dog, cat, bird).

More nodes in the output layer. Source: Udacity
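One common way to turn such per-class scores into probabilities is the softmax function. The text does not say which function is used, so treat softmax here as an assumption; the scores are made-up values.

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ("dog", "cat", "bird")
scores = (2.0, 1.0, 0.1)  # hypothetical output-layer scores
probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(dict(zip(classes, probs)), predicted)
```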

CASE 3 — More Layers

This simply means we have a deep neural network. Here our linear models combine to create nonlinear models, and these in turn combine to create even more nonlinear models. In general, we can repeat this many times and obtain highly complex models with many hidden layers.
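The idea of stacking layers can be sketched as a loop in which each layer combines the outputs of the previous one. The layer sizes and all weights below are hypothetical, and sigmoid is assumed as the activation throughout.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_rows, biases):
    """Apply one layer: each row of weights defines one node."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through every layer in order."""
    activations = list(inputs)
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations

# Hypothetical 2 -> 3 -> 2 -> 1 network
layers = [
    ([[5.0, -2.0], [7.0, -3.0], [-1.0, 4.0]], [8.0, -1.0, 0.5]),
    ([[7.0, 5.0, -3.0], [-2.0, 1.0, 6.0]], [-6.0, 0.0]),
    ([[4.0, -4.0]], [1.0]),
]
output = forward((0.5, -0.5), layers)
print(output)  # a single probability from the final layer
```

Adding depth only means appending more entries to `layers`; the forward pass itself does not change.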