Neural Networks using Python
When I was following Andrew Ng's online Machine Learning course on Coursera.com, I was astonished when I arrived at the Neural Network chapter. Up to that moment, I had only read a few things about the topic and had never gone deep into the details. In the beginning, a few things were still a little confusing and hard to understand. However, thanks to some exercises and extra lessons found online (see, for example, the IBM courses about Data Science and Machine Learning), I started gaining a deeper understanding of Neural Networks and how they work, and I arrived at the point where I wanted to write code myself to build a Neural Network from scratch, using Python. In this article, I will give some very basic concepts before moving on to the final algorithm, which will be explained step by step.
What is a Neural Network?
A Neural Network (also known as ANN, Artificial Neural Network) is a Machine Learning algorithm for Classification and Regression problems whose structure was inspired by neural cells and their connections in the brain; it is sometimes even faster and more accurate than other algorithms.
The structure is very simple: the data used in a Neural Network (i.e., the data we want to study, extract information from, or recognize patterns in) form the so-called input layer. This section of the Neural Network is connected to one or more hidden layers, intermediate elements formed by “nodes” all interconnected with each other. Finally, the last layer, called the output layer, corresponds to the result we are looking for.
A “simplified” way to see this structure is the following: data (input layer) are fed to a system (hidden layers) able to find an approximate description of patterns, from the simplest to the most complex, up to a final representation (output layer) where all the information is combined. For example: if our problem is to recognize numbers, the input layer is the matrix containing information about the number images (e.g., 1s and 0s, corresponding to filled and empty pixels, respectively), which is transformed in the hidden layers where patterns are recognized (e.g., closed loops, straight lines, …) and finally incorporated into an output (i.e., a number).
Each node of one layer is connected to each node of the next (creating a net), and each node applies a so-called activation function: it transforms the data into new machine-readable values (i.e., numbers, typically between 0 and 1) that are used, in each layer, for finding patterns and structures. To do this, weights (basically numbers contained in matrices) are assigned to the connections; they translate the information from one layer to the next.
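As a small illustration (the names and values below are made up), this is what a single node computes: a weighted sum of the incoming values, plus a bias, passed through an activation function such as the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# Three values coming from the previous layer, entering one node of the next layer
x = np.array([0.2, 0.7, 0.1])     # outputs of the previous layer
w = np.array([0.5, -1.3, 0.8])    # one weight per connection
b = 0.1                           # bias term (more on this later)

node_output = sigmoid(np.dot(w, x) + b)   # weighted sum, then activation
print(node_output)
```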
The structure of a Neural Network (i.e., how many hidden layers, how many nodes in each hidden layer, which activation function to use,…) is not fixed, and one should try different combinations to find the best set.
Forward and Back Propagation algorithms
Like all Machine Learning algorithms, Neural Networks need an optimization procedure to get the best results: this translates into finding the best weights, assigned to each connection of the net, that give results as close as possible to the actual ones, i.e. minimizing the final error represented by the so-called Cost function (in the simplest case, the sum of the squared differences between the predicted values and the actual ones).
The optimization algorithm is called Gradient Descent (more details here); it uses the gradient of the Cost function (a map of how the function behaves) to update the weights. The algorithm relies on a learning rate, a number telling how strongly the weights are updated at each step: if the value is too small, the algorithm may be very slow to adapt and “learn”; if it is too large, the algorithm may overshoot the minimum and end up with very wrong results.
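In code, the update rule is just one line; the toy sketch below (with illustrative names) shows how the learning rate controls the step size:

```python
import numpy as np

def gradient_descent_step(weights, gradient, learning_rate):
    # Move the weights a small step against the gradient of the Cost function
    return weights - learning_rate * gradient

# Toy example: minimize J(w) = w^2, whose gradient is 2w
w = np.array([5.0])
for _ in range(50):
    grad = 2 * w                                           # gradient at the current w
    w = gradient_descent_step(w, grad, learning_rate=0.1)  # too small = slow, too large = overshoot
print(w)   # ends up close to 0, the minimum of J
```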
For Neural Networks, the Gradient Descent algorithm comprises two parts that must be performed iteratively:
- Forward propagation: the first part, where input data are transformed from one layer to the next, up to the output layer. This phase corresponds to the part of the algorithm where patterns and particular features inside the data are found and encoded. In the first iteration, this phase will hardly return good results: the obtained values will differ from the actual ones, and there must be a way to correct the errors.
- Back propagation: the second part works in the opposite direction to Forward propagation. Instead of going from the input to the output layer, the starting point is the result obtained with Forward propagation: it is used for computing the error with respect to the actual values, and the process is iterated back to the input layer. This phase is a sort of inspection of how good the Neural Network weights are, and it is extremely important because it computes the gradient of the Cost function, used for correcting the weights according to the errors.
This part may be quite complicated (mathematically as well), but the take-away message is that the Forward and Back propagation algorithms form a strong self-correcting procedure: at each iteration, the two procedures are repeated and the weights are updated to minimize the error. Therefore, the more iterations, the higher the chances of finding the best weights that optimize the Neural Network.
Neural Network Python class for Classification and Regression
And now the Python algorithm! It contains all the procedures and calculations (not shown in this article) needed to build and optimize a Neural Network. For details, this incredible YouTube series, starting with this video, is extremely useful!
The class NeuralNet builds a Neural Network for Classification and Regression problems and has several attributes and modules. It also includes the possibility of using regularization (i.e., a way to simplify the initial modeling to reduce overfitting; for some details, here), momentum (i.e., a factor sometimes useful to speed up learning), and of specifying the type of Gradient Descent algorithm (Batch, when the entire sample is used, or Adam; see below) along with the size of the batch used at each update (Mini-batch algorithm). The first part of the Python script comprises the description of all the modules and attributes:
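Since the original gist is not reproduced here, the skeleton below is only a sketch of what such a class could look like; the attribute names and default values are illustrative, not the actual code:

```python
import numpy as np

class NeuralNet:
    """Sketch of a Neural Network class for Classification and Regression."""

    def __init__(self, hidden_layers=(5,), activation='tanh',
                 problem='classification', regularization=0.0,
                 momentum=0.0, optimizer='batch', batch_size=None,
                 learning_rate=0.01, random_state=None):
        self.hidden_layers = hidden_layers    # e.g. (5,) -> one hidden layer with 5 nodes
        self.activation = activation          # activation used in the hidden layers
        self.problem = problem                # 'classification' or 'regression'
        self.regularization = regularization  # L2 penalty to reduce overfitting
        self.momentum = momentum              # factor sometimes useful to speed up learning
        self.optimizer = optimizer            # 'batch' or 'adam'
        self.batch_size = batch_size          # mini-batch size (None = full batch)
        self.learning_rate = learning_rate
        self.rng = np.random.default_rng(random_state)
        self.weights = None                   # filled in during training
```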
The following part comprises the definition of the different activation functions and their derivatives, which can be specified when setting up the model:
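As an illustration (the actual gist is not embedded here), a minimal NumPy version of these functions could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

# Lookup used when the model is set up, e.g. activation='tanh'
ACTIVATIONS = {'sigmoid': (sigmoid, sigmoid_prime),
               'tanh': (tanh, tanh_prime),
               'relu': (relu, relu_prime)}
```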
Finally, before moving on to the Forward and Back propagation algorithms, some other pieces must be defined: a function to initialize the weights randomly (this is important to avoid issues when the weights are updated, especially in the first iterations; take a look here), and a function to add the so-called bias term (i.e., a constant term used for adjusting the results obtained with the activation functions).
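A possible sketch of these two helpers (the initialization range below is one common heuristic, not necessarily the one used in the original script):

```python
import numpy as np

def init_weights(n_in, n_out, rng=None):
    """Small random weights break the symmetry between nodes; the extra row
    accounts for the bias term added to the layer's input."""
    if rng is None:
        rng = np.random.default_rng()
    eps = np.sqrt(6.0 / (n_in + n_out))   # a common heuristic for the range
    return rng.uniform(-eps, eps, size=(n_in + 1, n_out))

def add_bias(X):
    # Prepend a column of ones so the bias is learned like any other weight
    return np.hstack([np.ones((X.shape[0], 1)), X])
```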
For the Forward and Back propagation algorithms, I defined two functions: the first one applies only the Forward algorithm, while the second one applies the same steps plus the Back propagation part. The first function returns the values obtained in the last layer, while the second one returns the values obtained in each layer along with the errors (in tuples). One important thing: between the input and the last hidden layer, the activation function is the one specified when setting up the model (i.e., the same function if only one is specified, or a different function for each layer if a tuple is specified). Between the last hidden layer and the output, the Sigmoid function is used for binary Classification problems, while the Softmax function is used for multi-class problems (details here). For Regression, no activation function is applied (i.e., the output is an unconstrained number).
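The original functions are not reproduced here; the following is a compact sketch of the same idea for the binary Classification case, assuming a single activation function for all hidden layers and a Sigmoid output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

def forward(X, weights, act, out_act=sigmoid):
    """Return the activations of every layer; the last one is the network output."""
    activations = [X]
    for i, W in enumerate(weights):
        z = add_bias(activations[-1]) @ W
        # hidden layers use the chosen activation, the output layer uses out_act
        activations.append(act(z) if i < len(weights) - 1 else out_act(z))
    return activations

def forward_backward(X, y, weights, act, act_prime):
    """Forward pass plus the per-layer errors (deltas) used by Back propagation.

    act_prime must be the derivative written as a function of the activation
    value itself (e.g. lambda a: 1.0 - a**2 for the hyperbolic tangent).
    """
    activations = forward(X, weights, act)
    deltas = [activations[-1] - y]   # error at the output layer
    for i in range(len(weights) - 1, 0, -1):
        # propagate the error backwards, skipping the bias row of the weights
        deltas.insert(0, (deltas[0] @ weights[i][1:].T) * act_prime(activations[i]))
    return activations, deltas
```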
At this point, it is possible to define the Cost function, i.e. the error that must be minimized to find the best set of weights, and its gradient. The following function does the job, returning a number (the Cost function) and a tuple (its gradient).
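The actual J_grad function is not shown here; as an illustration, a binary Classification version using cross-entropy (a common choice with a Sigmoid output; the squared-error form mentioned earlier is the Regression counterpart) and optional L2 regularization, fed with the quantities returned by the previous sketch, could look like this:

```python
import numpy as np

def cost_and_gradient(activations, deltas, weights, y, lam=0.0):
    """Cross-entropy cost and its gradient, given the activations and deltas
    returned by the forward/back propagation sketch above."""
    m = y.shape[0]
    y_hat = np.clip(activations[-1], 1e-12, 1 - 1e-12)   # avoid log(0)
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # L2 regularization on all weights except the bias rows
    cost += lam / (2 * m) * sum(np.sum(W[1:] ** 2) for W in weights)

    grads = []
    for a, delta, W in zip(activations[:-1], deltas, weights):
        a_bias = np.hstack([np.ones((a.shape[0], 1)), a])
        grad = a_bias.T @ delta / m
        grad[1:] += lam / m * W[1:]   # regularize the non-bias weights only
        grads.append(grad)
    return cost, tuple(grads)
```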
Now, to understand whether the model is doing well, we need to define a function to make predictions (using the weights found with Forward and Back propagation) and a function to compute different metrics to evaluate the performance, depending on the nature of the problem (these two functions will be used during the training of the model). These metrics tell how good the model is by comparing the results obtained with the Neural Network to the actual values.
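For illustration only, minimal versions of these helpers (assuming the network output y_hat is already available from a forward pass) might look like:

```python
import numpy as np

def to_class(y_hat):
    """Turn network outputs (probabilities) into class labels."""
    y_hat = np.asarray(y_hat)
    if y_hat.ndim == 1 or y_hat.shape[1] == 1:   # binary classification
        return (y_hat >= 0.5).astype(int).ravel()
    return np.argmax(y_hat, axis=1)              # multi-class classification

def accuracy(y_true, y_pred):
    # Fraction of correctly classified samples
    return float(np.mean(np.ravel(y_true) == np.ravel(y_pred)))

def rmse(y_true, y_pred):
    # Root mean squared error, a typical choice for Regression problems
    return float(np.sqrt(np.mean((np.ravel(y_true) - np.ravel(y_pred)) ** 2)))
```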
Now it is time for the training session, where the Forward and Back propagation algorithms are applied iteratively to adjust the weights. The function Training uses two sets of data: a training set (to train the model) and a test set (to evaluate it). At each iteration, the function J_grad computes the Cost function and its gradient; the latter is needed to update the weights used in the next iteration (a sketch of this loop is shown after the list below). The function also includes:
- The possibility to choose the Gradient Descent algorithm between Batch (when the whole set of data is used for updating the weights; it is the classic Gradient Descent algorithm) and Adam (a more sophisticated and usually faster variant; details here).
- The possibility to choose the size of the subset of training data used at each update, defining a mini-batch (if the size is 1, the algorithm is Stochastic; for details, here).
- The possibility to apply so-called early stopping: when the performance on the test set does not improve with more iterations, the algorithm stops (meaning that it has reached a sort of saturation and no further improvement can be achieved).
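As anticipated above, here is a minimal sketch of such a training loop, restricted to plain Batch Gradient Descent with optional momentum and early stopping (the function and parameter names are illustrative, and J_grad is assumed to return the cost and the gradients as in the previous sketch):

```python
import numpy as np

def train(X_train, y_train, X_test, y_test, weights, J_grad,
          learning_rate=0.01, momentum=0.0, epochs=500, patience=20):
    """Batch Gradient Descent with optional momentum and early stopping.

    J_grad(X, y, weights) is assumed to return (cost, gradients); the Adam
    and mini-batch variants only change how the update step is computed.
    """
    velocity = [np.zeros_like(W) for W in weights]
    best_cost, best_weights, stale = np.inf, weights, 0

    for epoch in range(epochs):
        cost, grads = J_grad(X_train, y_train, weights)
        # momentum accumulates a fraction of the previous update
        velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grads)]
        weights = [W + v for W, v in zip(weights, velocity)]

        test_cost, _ = J_grad(X_test, y_test, weights)
        if test_cost < best_cost:
            best_cost, best_weights, stale = test_cost, weights, 0
        else:
            stale += 1
            if stale >= patience:   # early stopping: no improvement on the test set
                break
    return best_weights
```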
The final part comprises two modules that can be used once the training is over: one to make predictions using the best set of weights found during training, and one to compute the probability associated with the predicted class (for Classification).
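For the second module, a small sketch (with an illustrative name) of how the probability of the predicted class can be extracted from the network output:

```python
import numpy as np

def predicted_class_probability(y_hat):
    """Probability associated with the predicted class, given the network output."""
    y_hat = np.asarray(y_hat)
    if y_hat.ndim == 1 or y_hat.shape[1] == 1:   # binary case: Sigmoid output
        p = y_hat.ravel()
        return np.where(p >= 0.5, p, 1.0 - p)    # probability of the chosen class
    return np.max(y_hat, axis=1)                 # multi-class case: Softmax output
```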
Application
Now it is possible to try the algorithm! Let’s use the “Pima Indians Diabetes” dataset (from here; see also here) to classify the outcome (1 if positive for diabetes, 0 otherwise). Before proceeding, let’s transform the sample by balancing it (i.e., keeping an equal number of negative and positive cases), just for convenience, and by scaling it with the MinMaxScaler function.
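This preprocessing could look like the sketch below, assuming a local copy of the dataset saved as diabetes.csv with the usual Outcome column (the file name is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumes a local copy of the Pima Indians Diabetes dataset with an 'Outcome' column
df = pd.read_csv('diabetes.csv')

# Balance the sample: keep as many negative cases as there are positive ones
positives = df[df['Outcome'] == 1]
negatives = df[df['Outcome'] == 0].sample(len(positives), random_state=42)
balanced = pd.concat([positives, negatives]).sample(frac=1, random_state=42)

# Scale every feature into the [0, 1] range
X = MinMaxScaler().fit_transform(balanced.drop(columns='Outcome').values)
y = balanced['Outcome'].values.reshape(-1, 1)
```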
Let’s compare different Neural Network models: the activation function will be the hyperbolic tangent, and only one hidden layer of 5 nodes will be used, combined with several Gradient Descent algorithms.
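The exact call signature of the class is not shown in this article; based on the sketch given earlier, the comparison could be set up roughly like this (NeuralNet arguments and the Training call are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split

# Split the balanced, scaled sample into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same architecture (one hidden layer of 5 nodes, tanh), different optimizers
configurations = [('batch', 0.0),   # classic Batch Gradient Descent
                  ('batch', 0.9),   # Batch + Momentum
                  ('adam', 0.0)]    # Adam
for optimizer, momentum in configurations:
    model = NeuralNet(hidden_layers=(5,), activation='tanh',
                      optimizer=optimizer, momentum=momentum)
    # model.Training(X_train, y_train, X_test, y_test, epochs=500)
    # would then run up to 500 iterations with early stopping,
    # printing cost and accuracy along the way
```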
And finally, the results are:
Batch GD algorithm
Iteration: 293/500 ----- Training cost: 0.44587 - Validation cost: 0.44592 --- Training accuracy: 0.76000 - Validation accuracy: 0.73913

Batch + Momentum GD algorithm
Iteration: 74/500 ----- Training cost: 0.44432 - Validation cost: 0.44466 --- Training accuracy: 0.76000 - Validation accuracy: 0.73913

Adam GD algorithm
Iteration: 34/500 ----- Training cost: 0.44500 - Validation cost: 0.45056 --- Training accuracy: 0.74933 - Validation accuracy: 0.75155
Well, the Neural Network reached an accuracy of about 75% on the test set (with the Adam algorithm). Notice how the Adam algorithm is much faster than the others (only 34 iterations to reach an optimum value!). Let’s take a look at the learning curves, i.e. the Cost function for the training and test sets at each epoch.
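Assuming the cost per epoch was stored during training in two lists (here called train_costs and test_costs; the actual names depend on the implementation), the curves can be plotted with matplotlib:

```python
import matplotlib.pyplot as plt

# Assumes the training loop stored the cost per epoch in two lists
plt.plot(train_costs, label='Training cost')
plt.plot(test_costs, label='Validation cost')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.legend()
plt.show()
```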
The two curves are similar; therefore, the model is not overfitting. Great! This is just a quick and simple example of how to define and use a Neural Network, and clearly there is vast space for exploration and improvement!
Conclusion
That’s it! Of course, the Python script can be improved, even simplified, but I’m pretty happy with the final result and pretty happy with myself for writing it :D