How do neural networks learn nonlinear functions and classify linearly non-separable data?

Neural networks are very good at classifying data points into different regions, even when the data are not linearly separable. Linearly separable data can be split into classes by simply drawing a line (or a hyperplane) through the data. When the data are not linearly separable, the kernel trick can be applied: the data are transformed using some nonlinear function so that the transformed points become linearly separable. A simple example is shown below, where the objective is to classify red and blue points into different classes. It is not possible to use a linear separator on the original variables, but after transforming the variables it becomes possible.

Here, I show a simple example to illustrate how neural network learning can be seen as a special case of the kernel trick, which allows networks to learn nonlinear functions and classify linearly non-separable data. I will use the same example from above.

Example of linearly inseparable data
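To make the idea of transforming variables concrete, here is a minimal sketch. The data are made up for illustration (blue points near the origin, red points on a circle around them) and are not the points from the figure; the transformation z = X1^2 + X2^2 is likewise just one assumed choice of nonlinear function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: blue points near the origin, red points on a ring around them.
n = 100
angles = rng.uniform(0, 2 * np.pi, size=n)
blue = rng.uniform(-0.5, 0.5, size=(n, 2))                      # small square at the origin
red = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])   # circle of radius 2

def squared_radius(points):
    """Nonlinear feature: z = X1^2 + X2^2 for each row of points."""
    return np.sum(points ** 2, axis=1)

z_blue = squared_radius(blue)
z_red = squared_radius(red)

# In the transformed variable z, a single threshold separates the two classes,
# even though no straight line separates them in the original (X1, X2) plane.
threshold = 1.0
print(np.all(z_blue < threshold), np.all(z_red > threshold))
```

No line in the (X1, X2) plane can separate a cluster from the ring surrounding it, but after the transformation one threshold on z is enough.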

A neural network with one hidden layer can be represented as y = W2 phi(W1 X + B1) + B2. The classification problem can then be seen as a two-part problem: learning W1 (and B1), and learning W2 (and B2). Changes in W1 result in different functional transformations of the data via phi(W1 X + B1), and because the underlying function phi is nonlinear, phi(W1 X + B1) is a nonlinear transformation of the data X. These nonlinear functions are then combined using linear neurons via W2 and B2. Let's go through this process in steps.
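The expression for y can be written out directly in NumPy. This is only a sketch: the layer sizes and all parameter values below are hypothetical, and phi is taken to be ReLU, the activation used in the rest of this article.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def forward(x, W1, B1, W2, B2, phi=relu):
    """Compute y = W2 phi(W1 x + B1) + B2 for a single input vector x."""
    hidden = phi(W1 @ x + B1)   # nonlinear transformation of the input
    return W2 @ hidden + B2     # linear combination of the transformed features

# Hypothetical parameters: 2 inputs, 3 hidden units, 1 output.
W1 = np.array([[1.0, 1.0], [-1.0, 1.0], [0.5, -1.0]])
B1 = np.array([0.0, 0.5, -0.5])
W2 = np.array([[1.0, -1.0, 2.0]])
B2 = np.array([0.1])

y = forward(np.array([1.0, 2.0]), W1, B1, W2, B2)
print(y)
```

The first line of `forward` is the part controlled by W1 and B1 (the nonlinear transformation), and the second line is the part controlled by W2 and B2 (the linear combination).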

Define a nonlinear function.

Although any nonlinear function can work, a good candidate is ReLU. ReLU is a function that is 0 for X < 0 and the identity for X >= 0, that is, relu(X) = max(0, X).

ReLU activation function
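In code, ReLU is a one-liner. A minimal NumPy version, with a few arbitrary sample inputs:

```python
import numpy as np

def relu(x):
    """ReLU: 0 for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # negative entries become 0, the rest pass through unchanged
```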

Effect of changing weights and bias

We would typically compute the weights of the neurons using backpropagation, but as the objective is only to illustrate how nonlinear functions transform data, I will set these weights by hand. Consider the case where there are 2 features, X1 and X2, and the input to the ReLU is W1X1+X2 (here W1 is a single scalar weight). The weight on the second feature is set to 1 and the bias to zero for illustration. The figure below shows the effect of changing the weight W1: it rotates the dividing line W1X1+X2 = 0, and so changes the region where values are retained. The white region is where the ReLU output is zero.

The red region is where W1X1+X2 > 0, for different values of W1.
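The effect of the weight can also be checked numerically. The sketch below evaluates relu(W1X1+X2) on a small grid and counts where the output is retained; the grid range and the W1 values are arbitrary choices, not the ones used for the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def active_region(w1, x1, x2):
    """True where relu(w1 * x1 + x2) > 0, i.e. where the value is retained."""
    return relu(w1 * x1 + x2) > 0

# Grid over the (X1, X2) plane; the range [-2, 2] is an arbitrary choice.
x1, x2 = np.meshgrid(np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))

for w1 in (-2.0, 0.0, 1.0, 2.0):   # hypothetical weights
    active = active_region(w1, x1, x2)
    # The boundary of the active region is the line X2 = -w1 * X1.
    print(f"W1 = {w1:+.1f}: {active.sum()} of {active.size} grid points active")
```

The same point can fall on either side of the boundary depending on W1: for example, (X1, X2) = (1, 0) is in the active region for W1 = 1 but not for W1 = -1.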

Now we add a bias to the special case where the input to the ReLU is X1+X2+B. Changing B changes the intercept, that is, the location of the dividing line X1+X2+B = 0.

Effect of changing B in X1+X2+B
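A short sketch of the same effect in code, with hypothetical bias values. The dividing line X1+X2+B = 0 crosses the X2 axis at X2 = -B, so the snippet probes the neuron just above and just below that point.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x1, x2, b):
    """Output of a ReLU neuron with both weights fixed to 1 and bias b."""
    return relu(x1 + x2 + b)

for b in (-1.0, 0.0, 1.0):   # hypothetical bias values
    intercept = -b           # where X1 + X2 + B = 0 crosses the X2 axis
    just_above = neuron(0.0, intercept + 0.1, b)
    just_below = neuron(0.0, intercept - 0.1, b)
    print(f"B = {b:+.1f}: intercept at X2 = {intercept:+.1f}, "
          f"above = {just_above:.1f}, below = {just_below:.1f}")
```

For every B the output is positive just above the line and zero just below it; only the position of the line moves.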

The figures above show that changing B changes the intercept of the line. Therefore, by changing B and W and combining multiple such regions, different regions of the space can be carved out to separate the red points from the blue points above. This is the primary mechanism by which neural networks are able to learn complex nonlinear functions and perform complex nonlinear transformations. In fact, if the activation function is a simple linear function, a neural network loses its nonlinear function approximation capability, because a composition of linear layers is itself just a linear function.
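The point about linear activations can be verified numerically. In the sketch below, the parameters are random and the layer sizes are arbitrary assumptions; with the identity as activation, the two-layer network gives the same output as a single linear layer with weights W2 W1 and bias W2 B1 + B2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical random parameters: 2 inputs, 4 hidden units, 1 output.
W1, B1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, B2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def linear_network(x):
    """Two-layer network whose activation is the identity function."""
    return W2 @ (W1 @ x + B1) + B2

# The same function written as a single linear layer.
W_eq = W2 @ W1
B_eq = W2 @ B1 + B2

x = rng.normal(size=2)
print(np.allclose(linear_network(x), W_eq @ x + B_eq))
```

However many hidden units are added, such a network can only ever draw a single straight dividing line.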

By changing the weights and biases, a region can be carved out such that w2 relu(W1X+b1)+0.1 > 0 for all blue points. This is shown in the figure below.

Combining the regions of different nonlinear (ReLU) functions allows classification of linearly inseparable data
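As a sketch of how such a combination might look, the snippet below hand-sets four ReLU units so that w2 relu(W1X+b1)+0.1 is positive only inside a square around the origin. The weights, biases, and sample points are all hypothetical and chosen for illustration; they are not the values behind the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical hand-set parameters: four ReLU units, each switching on
# beyond one side of the square |X1| <= 1, |X2| <= 1.
W1 = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b1 = np.array([-1.0, -1.0, -1.0, -1.0])
w2 = np.array([-1.0, -1.0, -1.0, -1.0])   # each active unit pushes the score down

def score(X):
    """w2 relu(W1 X + b1) + 0.1 for each row of X."""
    return relu(X @ W1.T + b1) @ w2 + 0.1

blue = np.array([[0.0, 0.0], [0.5, -0.5], [-0.8, 0.3]])   # made-up inner points
red = np.array([[2.0, 0.0], [0.0, -2.0], [-1.5, 1.5]])    # made-up outer points

print(score(blue) > 0)   # all True: no unit is active inside the square
print(score(red) > 0)    # all False: at least one unit is strongly active
```

Each unit on its own only splits the plane with a straight line; it is the sum of the four ReLU outputs that produces a closed region around the blue points.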