# Hyper-parameters used in Deep Learning

When building an artificial intelligence model that learns from data, we must choose certain parameters that decide which algorithms and techniques the model uses. For example, the programmer decides what the k value will be in the KNN classification algorithm; likewise, the programmer decides which kernel function to use in the SVM algorithm. At the beginning of a problem there is no obvious rule for setting these parameters: the best choices depend on the data set, the problem, and the domain in which the model will be used. Once these factors are clear, the programmer searches for the most appropriate range for each parameter.

*1. Size and Variety of the Data Set*

We all know that the size and diversity of the data set are the most important factors for learning in deep learning applications. In general, the larger our data set, the better the learning will be. Size alone is not enough for a good model, however: diversity also matters. As the diversity of the data increases, the performance of the model will also increase.

If we have a very small data set, deep learning is usually not a suitable way to solve the problem. Either the data set should be enlarged with synthetic data, features should be transferred from a pre-trained network ("transfer learning"), or models such as SVM, which handle non-linear problems well on small data, should be used instead. The data set can also be augmented with synthetic data generation techniques (data augmentation), which improves performance up to a point.
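As a concrete illustration of the data augmentation mentioned above, here is a minimal sketch that doubles a toy data set by horizontally flipping each image (represented as nested lists); the function names and the 2x2 example are my own, not from the article.

```python
# Minimal data-augmentation sketch: create extra training samples
# by mirroring images left-to-right.
def horizontal_flip(image):
    """Return a left-right mirrored copy of a 2D image."""
    return [row[::-1] for row in image]

def augment(images):
    """Double the data set by adding a flipped copy of every image."""
    return images + [horizontal_flip(img) for img in images]

original = [[[1, 2], [3, 4]]]   # one 2x2 "image"
augmented = augment(original)
print(len(augmented))           # 2 samples after augmentation
print(augmented[1])             # [[2, 1], [4, 3]]
```

Real pipelines add many more transformations (rotations, crops, color jitter), but the idea is the same: each transformed copy is a new, slightly different training sample.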

*2. “Mini-Batch” Size*

In deep learning applications, processing all the data in the data set at the same time is costly in terms of both time and memory, because every iteration of learning runs backpropagation to compute gradients and update the weight values, and the cost of that computation grows in direct proportion to the number of samples. As a solution, the data set is divided into small groups and the learning process is carried out over these groups. Processing the input in parts this way is called "mini-batch" training; the value assigned to the mini-batch parameter in the model design decides how many samples the model processes at the same time. Studies have found that processing the data in groups (mini-batches) can increase the loss value somewhat, but considerable time is gained.
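The grouping described above can be sketched in a few lines; the batch size of 32 below is just an illustrative choice, not a value prescribed by the article.

```python
# Split a data set into consecutive mini-batches.
def make_minibatches(data, batch_size):
    """Return chunks of at most batch_size items each."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

dataset = list(range(100))          # 100 toy samples
batches = make_minibatches(dataset, 32)
print([len(b) for b in batches])    # [32, 32, 32, 4]
```

In practice the data set is also shuffled before each pass so that every mini-batch sees a different mix of samples.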

*3. Learning Rate and Momentum Coefficient*

Updating the parameters in deep learning is done by backpropagation. During backpropagation, gradients are computed backwards through the network using the "chain rule"; each gradient is multiplied by the "learning rate" parameter, and the result is subtracted from the current weight to obtain the new weight value. The learning rate used in this process can be a fixed value, a step-wise schedule (for example, 0.001 until a certain training step and 0.01 after it), a value shaped by the momentum coefficient, or a value adjusted during training by adaptive algorithms.
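The update rule and the step-wise schedule described above can be sketched as follows; the boundary epoch and the rates echo the 0.001 / 0.01 example in the text (real schedules usually decay the rate over time rather than increase it).

```python
# Gradient-descent weight update with a step-wise learning-rate schedule.
def step_lr(epoch, boundary=10, before=0.001, after=0.01):
    """Return the learning rate to use at the given epoch."""
    return before if epoch < boundary else after

def sgd_update(weight, gradient, lr):
    """w_new = w - lr * dL/dw"""
    return weight - lr * gradient

w = 1.0
w = sgd_update(w, gradient=2.0, lr=step_lr(epoch=3))  # uses 0.001
print(w)  # 0.998
```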

Noisy methods such as stochastic gradient descent are smoothed with techniques like the exponentially weighted average, which reduces their oscillations. This smoothing is controlled by the momentum coefficient beta: instead of taking the newly produced value as it is, the new value is computed by blending in the previous value in proportion to the beta coefficient. Reducing the noise and oscillations in this way yields a faster method.
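A minimal sketch of that exponentially weighted average, using the common beta = 0.9 (my assumption; the article does not fix a value):

```python
# Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * g_t
def smooth(values, beta=0.9):
    """Return the exponentially weighted average of a noisy sequence."""
    v, out = 0.0, []
    for g in values:
        v = beta * v + (1 - beta) * g
        out.append(v)
    return out

noisy = [1.0, -1.0, 1.0, -1.0]      # oscillates with amplitude 1.0
print(smooth(noisy))                # amplitude drops to about 0.1
```

Momentum-based optimizers apply exactly this averaging to the gradients before taking a step, which is why they move more steadily toward the minimum.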

*4. Number of Epochs*

While the model is being trained, the data are not processed in a single pass and then set aside; training proceeds in repeated passes. In each pass, the model is trained on the training set, its performance is measured, and the weights are updated according to that performance through backpropagation; the model is then trained again with the updated weights. This cycle is repeated to search for the most appropriate weight values for the model. Each complete pass over the training data is called an "epoch".

Since the weight values that best solve the problem are computed step by step, performance will be low in the first epochs and will rise as the number of epochs increases. After a certain point, however, the rate at which our model learns drops considerably.

The model usually takes a long time to train; there are models that take days or even months, which is common in deep learning. For this reason, the other hyperparameters are tuned to shorten the training process as much as possible.

The appropriate number of epochs also varies with the type of problem. For example, RNNs (Recurrent Neural Networks), which learn sequential patterns, generally need more epochs than other models. Performance improves significantly as the number of epochs increases, but after a certain epoch the gains become very small, and training can be stopped at that point.
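The stopping idea above can be sketched with a simple rule: end training once the epoch-to-epoch improvement in the loss falls below a tolerance. The loss curve below is synthetic, purely for illustration.

```python
# Stop when the per-epoch improvement drops below a tolerance.
def stopping_epoch(losses_per_epoch, tol=0.01):
    """Return the epoch at which improvement first drops below tol."""
    for epoch in range(1, len(losses_per_epoch)):
        if losses_per_epoch[epoch - 1] - losses_per_epoch[epoch] < tol:
            return epoch
    return len(losses_per_epoch)

losses = [1.0, 0.5, 0.3, 0.295, 0.294]  # gains shrink after epoch 2
print(stopping_epoch(losses))           # 3
```

Real "early stopping" implementations watch the validation loss rather than the training loss, and usually wait a few patience epochs before stopping.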

*5. Determination of Weight Initial Values*

How the weights are determined affects the learning and speed of the model.

If the W weights are initialized to 0 at the beginning of the model, every neuron produces the same output: since y = f(x, w) = w·x, and every entry of the w matrix is zero, the product is zero regardless of the input. Worse, all neurons then receive identical gradient updates, so the symmetry between them is never broken and the network cannot learn distinct features. Therefore the W weights should not be initialized to 0.

If the W weights are initialized with small random numbers at the beginning of the model, the model will work in small networks, but in deeper networks this leads to an uneven distribution of activations across the layers.
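The contrast between the two initializations can be sketched as follows; the scale of 0.01 for the random case is an illustrative assumption.

```python
# Zero initialization vs. small random initialization.
import random

def init_zero(n):
    return [0.0] * n

def init_small_random(n, scale=0.01):
    """Small random values break the symmetry between neurons."""
    return [random.gauss(0.0, scale) for _ in range(n)]

def neuron_output(weights, x):
    """y = w . x  (no bias term, for illustration)."""
    return sum(w * xi for w, xi in zip(weights, x))

x = [1.0, 2.0, 3.0]
print(neuron_output(init_zero(3), x))  # 0.0 regardless of the input
```

Schemes such as Xavier/Glorot or He initialization choose the random scale from the layer sizes precisely to keep the activation distribution even across deep networks.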

*6. Dilution (Dropout) Value and the Layers to Be Diluted*

It has been shown that diluting (dropping) nodes in fully connected layers, below a certain threshold value, increases performance. In other words, forgetting weak information contributes positively to learning.

The dropout value is generally 0.5, though it varies with the problem and the data set. When used as the threshold value, the dropout value is defined in the range [0, 1]. Finally, it is not mandatory to use the same dropout value in all layers.
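A minimal dropout sketch, using the commonly cited p = 0.5 from the text as the default: each activation is zeroed with probability p, and the survivors are rescaled by 1/(1-p) (the "inverted dropout" convention, my choice here).

```python
# Inverted dropout applied to a layer's activations during training.
import random

def dropout(activations, p=0.5):
    """Randomly silence activations; rescale the survivors."""
    kept = []
    for a in activations:
        if random.random() < p:
            kept.append(0.0)            # node is dropped
        else:
            kept.append(a / (1.0 - p))  # survivor is rescaled
    return kept

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0])
print(out)  # each entry is either 0.0 or double the original
```

At inference time dropout is switched off entirely; the rescaling during training is what lets the same weights be used unchanged at test time.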

*7. Convolutional Neural Network (CNN) Kernel Size*

In convolutional neural networks, each layer applies kernels that operate on the input matrix; these resemble simple Gabor filters. The size of these kernels strongly affects learning, because the kernel size determines how wide a region of the data each output value sees. Generally, 3x3, 5x5, or 7x7 kernels are used. A large kernel shrinks the image more after convolution is applied and therefore causes more loss of information, so small kernels such as 3x3 are generally preferred. In edge detection, odd-sized filters are used so that the filter has a center pixel with neighbors to its right, left, top, and bottom.
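The shrinkage described above follows the standard convolution output-size formula, sketched here (no padding, stride 1, unless given otherwise):

```python
# Output size of a convolution: (in - kernel + 2*padding) / stride + 1
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of one dimension after a convolution."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3))  # 30: a 3x3 kernel trims only 2 pixels
print(conv_output_size(32, 7))  # 26: a 7x7 kernel trims 6 pixels
```

This is also why "same" padding (padding = kernel_size // 2 for odd kernels) is often used to keep the spatial size unchanged.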

*8. Selection of the Optimization Algorithm*

In deep learning applications, the learning process is essentially an optimization problem, and optimization methods are used to find the optimum in the solution of non-linear problems. Commonly used optimization algorithms in deep learning include stochastic gradient descent, Adagrad, Adadelta, Adam, RMSProp, and Adamax. These algorithms differ in both performance and speed.

*9. Activation Function*

Activation functions are used for non-linear transformation operations in multilayer artificial neural networks.

Activation functions add non-linearity to the model. In the hidden layers, the linear function y = f(x, w) performs a matrix multiplication to compute the neuron outputs, and that output is then converted into a non-linear value. This matters because the problems tackled with deep learning methods are generally non-linear, and deep learning is more effective than other methods precisely on such problems. The conversion of the value obtained from the matrix multiplication into a non-linear one is done by the activation functions.
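Two of the most common activation functions, sketched to show how the linear output w·x is turned into a non-linear value:

```python
# Common activation functions applied to a neuron's linear output.
import math

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through, zeroes out negatives."""
    return max(0.0, z)

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(round(sigmoid(0.0), 2))  # 0.5
```

Which function to use is itself a hyperparameter: ReLU is the usual default in hidden layers, while sigmoid or softmax typically appear at the output for classification.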

**Conclusion**

1) The larger and more diverse the data set, the better the learning.

2) The more mini-batch groups there are, the higher the loss value, but time is gained.

3) A high learning rate causes oscillation; a learning rate that is too small makes learning take far too long.

4) As the number of epochs increases, so does the performance.

5) Using large kernels may cause information loss.

The parameters described in this article are the hyperparameters most frequently encountered in deep learning applications. Beyond these, there are also parameters specific to different network architectures.

If you want to follow me on LinkedIn: linkedin.com/in/özden-özyurt-87b2301a9