Visualizing function approximation using dense neural networks in 1D, Part I

Neural networks (NN) are at the heart of modern artificial intelligence and their inner workings might be mysterious to everyone but the savviest of AI practitioners and researchers. But it doesn’t have to be that way.

M.F.Benedetto
Sep 26, 2020 · 6 min read

A neural network is just a composition of linear transformations and non-linear functions. Armed with these non-linearities, a NN can approximate any function given enough hidden units. This is the universal approximation theorem, which has been proven under relatively weak assumptions. Intuitively, if a NN is able to represent a step function, i.e., a function made up of finitely many piecewise-constant components, then it is able to approximate any other function with arbitrary accuracy given enough of these step components.
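To make that intuition concrete, here is a minimal NumPy sketch (not taken from the article) that approximates an arbitrary smooth function with an increasing number of piecewise-constant components; the maximum error shrinks as more components are added.

```python
import numpy as np

# Illustrative target; any reasonably smooth 1D function works here.
g = lambda t: np.sin(2 * np.pi * t)
grid = np.linspace(0.0, 1.0, 1000)

def step_approximation(t, func, n_steps):
    """Approximate func on [t.min(), t.max()] with n_steps constant pieces."""
    edges = np.linspace(t.min(), t.max(), n_steps + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    idx = np.clip(np.digitize(t, edges) - 1, 0, n_steps - 1)
    return func(centers)[idx]

for n in (5, 20, 100):
    err = np.max(np.abs(step_approximation(grid, g, n) - g(grid)))
    print(f"{n:4d} steps -> max error {err:.4f}")  # error shrinks as n grows
```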

This seemingly innocent statement has very strong implications. It basically means that, given enough model size, training data and computational time, any problem that can be reduced to computing a function from a set of inputs to a set of outputs can be solved by a NN. These range from recognizing animals to self-driving helicopters, from speech recognition to automatic music composition. Thus, a wide enough (or, more efficiently, deep enough) network can seemingly solve any problem.

Not everything is as good as it seems, though. Training neural networks to perform complex tasks can be painfully difficult, time-consuming and sometimes not even feasible. In this article we’ll concentrate on the simplest of cases, i.e., a 1D function that we wish to approximate. We’ll compare a traditional machine learning approach with a more modern NN approach using TensorFlow (any deep learning library works just as well for such a simple endeavor). The complete code for this article can be found here.

The target function that we want to approximate is a fifth-order polynomial with some added Gaussian noise, namely:

from which we sample the training set and the holdout set of test points that we’ll use to compare results.
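The exact coefficients and noise level of the target polynomial are shown only in the figures, so the values in the following sketch are placeholders; the sketch just mirrors the described setup: a fifth-order polynomial, additive Gaussian noise, and a random train/holdout split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder coefficients for a fifth-order polynomial (constant term first);
# the article's exact values are not reproduced here.
coeffs = [-1.0, 2.0, 0.5, -3.0, 0.0, 1.5]

def target(x):
    return sum(c * x**k for k, c in enumerate(coeffs))

x = rng.uniform(-2.0, 2.0, size=1000)
y = target(x) + rng.normal(scale=1.0, size=x.shape)  # additive Gaussian noise

# Random split into a training set and a holdout (test) set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
```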

Left: Original function and noisy counterpart. Right: Train and test

A classical machine learning approximation

We begin by solving this very simple problem using a classical machine learning technique based on polynomial features. We assume that the data is not simply 1D, but was instead generated by a linear combination of the features {1, x, x², x³, …}. This is a linear method, since we are seeking the coefficients of a linear combination of the polynomial features that minimize the loss, which in this case is the mean squared error (MSE).

Using the sklearn library, it is very easy to expand the input data into its first n polynomial features with PolynomialFeatures(degree=n). Then, with LinearRegression(), we can fit a polynomial of a given degree to the training data by minimizing the loss, and afterwards compare results on the holdout set. Since the total number of parameters is only degree + 1, the method is computationally very lightweight.
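A minimal sketch of this procedure, reusing the variable names from the data-generation snippet above (the complete code linked by the author may differ in details):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

X_train = x_train.reshape(-1, 1)  # sklearn expects 2D inputs
X_test = x_test.reshape(-1, 1)

# Fit polynomials of increasing degree and compare train loss vs. holdout MSE.
for degree in range(7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```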

The next table shows the results for polynomial approximations of varying degree, along with the loss on the training set and the MSE on the holdout set.

Results for the polynomial fits
Approximations based on polynomial features of increasing degrees

The conclusion that can be drawn from this small experiment is that the method works extremely well. This is expected, since we know that the original data was indeed generated by a noisy fifth-order polynomial. We’ll use the test MSE of the degree-5 model as a baseline against which to compare the values obtained with NNs.

The constant and linear fits give a very biased approximation to the data. The approximations of degrees 3 and 4 are, unsurprisingly, almost the same, since the generating function has no fourth-order component, so any contribution of that term comes from fitting the noise.

Note that a degree-6 fit exceeds the complexity required for the problem, yet there is no overfitting thanks to the abundance of data points. It is also important to realize that the best low-degree polynomial approximation is not given by the generating function truncated to that degree.

Approximations with Neural networks

Modern libraries like TensorFlow, PyTorch, Theano, etc., have made it extremely easy to define and train neural networks. However, there is a (growing) list of things that one needs to consider to properly define and train a NN. The worst part is that the default values usually won’t work. We’ll delve deeper into this in Part II. For now, we’ll concentrate on solving the problem at hand.

An initial model

Initial linear model with weights

We’ll begin with the most basic form of a neural network, which consists of a single output unit with no hidden layer.

It has only 2 parameters, and the prediction (ŷ) is simply a linear function of the input (x), given by the weight (w) and the bias (b): ŷ = wx + b.
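A minimal Keras sketch of this baseline model; the optimizer, learning rate and number of epochs are illustrative choices, not necessarily those used in the article:

```python
import tensorflow as tf

# The baseline: a single Dense unit, no hidden layer, 2 trainable parameters.
linear_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
linear_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                     loss="mse")
linear_model.fit(x_train.reshape(-1, 1), y_train, epochs=500, verbose=0)

w, b = (v.numpy().squeeze() for v in linear_model.weights)
print(f"prediction ≈ {b:.3f} + {w:.3f}·x")
```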

It is not at all surprising that the answer is very close to the one obtained with sklearn (prediction = -0.677 + 2.03x). In fact, the sklearn fit was computed exactly, while this one is just an approximation of the minimum of the loss function obtained by gradient descent.

Nevertheless, it is usually a good idea to have a simple and trustworthy baseline model to make sure that everything is working as intended, especially regarding data loading, loss computation and the optimization method. Furthermore, this is a convex problem and, as such, it has an easily obtainable unique solution that minimizes the loss.

Left: Obtained fit — Right: Loss surface with minimizing parameters

One hidden unit

A single hidden unit model with weights

We’ll start adding complexity to the model, introducing the first nonlinearity in the form of a ReLU activation. In the end though, the approximation takes the shape of a single ReLU function, which is only a slight improvement in expressiveness over a single line. Namely,

ŷ = w₂·ReLU(w₁x + b₁) + b₂
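A possible Keras definition of this single-hidden-unit model (again with illustrative training settings):

```python
import tensorflow as tf

# One hidden ReLU unit followed by a linear output unit: 4 parameters in total,
# ŷ = w2·ReLU(w1·x + b1) + b2.
one_unit_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation="relu"),
    tf.keras.layers.Dense(1),
])
one_unit_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
                       loss="mse")
one_unit_model.fit(x_train.reshape(-1, 1), y_train, epochs=1000, verbose=0)
```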

The result is that the optimization simply ignores the extra complexity that the ReLU provides, and the solution coincides with the previous one, as seen in the next figure.

Approximation and decomposition into components

Two hidden units

A 2 hidden unit model with weights

Things get interesting when a second hidden unit is introduced and the total number of parameters increases to 7.

In this case there are two ReLU components available to approximate the function and, as the solution shows, they both contribute to lowering the loss. The formula below shows how to graphically split the solution into its two ReLU components and the constant bias term:

ŷ = w₃·ReLU(w₁x + b₁) + w₄·ReLU(w₂x + b₂) + b₃
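A sketch of the two-hidden-unit model together with the extraction of its weights to plot the two ReLU components and the bias term separately; training settings are illustrative:

```python
import numpy as np
import tensorflow as tf

# Two hidden ReLU units and a linear output unit: 7 parameters in total.
two_unit_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(2, activation="relu"),
    tf.keras.layers.Dense(1),
])
two_unit_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
                       loss="mse")
two_unit_model.fit(x_train.reshape(-1, 1), y_train, epochs=2000, verbose=0)

# Decompose the prediction into its two ReLU components plus the output bias.
w1, b1 = two_unit_model.layers[-2].get_weights()  # hidden layer: (1, 2), (2,)
w2, b2 = two_unit_model.layers[-1].get_weights()  # output layer: (2, 1), (1,)
x_grid = np.linspace(x.min(), x.max(), 200)[:, None]
components = np.maximum(x_grid @ w1 + b1, 0.0) * w2.T  # one column per ReLU unit
prediction = components.sum(axis=1) + b2               # matches two_unit_model(x_grid)
```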

Approximation and decomposition into components

In this case the minimization of the loss arrives at a point where the two ReLU components, along with the bias terms, work together to provide a good fit to the data. The algorithm chose to fit the left and center data points more closely, since a constant approximation on the right side does not result in a large MSE.

To be continued in Part II (Here)

  • 3 hidden units
  • An increasing number of parameters
  • Networks with 2 hidden layers
  • Key takeaways from this exercise
