Stories by Anar Abiyev on Medium

Missing Value Imputation in Python

Anar Abiyev — Sun, 04 Feb 2024 17:49:35 GMT

This blog will teach you how to deal with missing values in Python

My previous blog was about theoretical information on the topic:

Complete Guide to Missing Value Imputation

In this one, you will learn the Python implementation of those tips.

Without further ado, let’s get started!

Imports

First, we will import the necessary libraries and load the dataset into a pandas DataFrame.

import pandas as pd
import numpy as np
import missingno as msno

df = pd.read_csv('sample_dataset.csv')
df.head()

Overview of Missing Values

After that, we can use the missingno matrix function to look at how the missing values are distributed.

msno.matrix(df)

Fig 1. Missingno Matrix.

To check the number of missing values in each column, we can chain the .isnull() and .sum() methods:

df.isnull().sum()

Fig 2. Pandas .isnull().sum() Output.

For better insights, I have written this function to print the percentage of missing values in every column.

def column_missing_value_percentiles(df):
  # Percentage per column: missing count divided by the number of rows, times 100
  values = df.isnull().sum().values / df.shape[0] * 100
  columns = df.columns
  for idx in range(len(columns)):
    print(f"{columns[idx]}: {values[idx].round()}%")

column_missing_value_percentiles(df)

Fig 3. Output for Percentiles.

Solutions

Now you will learn which solution method is suitable for the missing values.

Dropping Rows

If you check the percentages again, you can see that some columns have only a small share of missing values, less than 5%.

For such columns, we can drop the rows that have missing values in these columns.

def drop_rows(df, columns):
  df.dropna(subset=columns, inplace=True)

drop_rows(df, ['enrolled_university', 'education_level', 'last_new_job', 'experience'])

Dropping Columns

For the column called “company_type”, the share of missing values is 67%. In this case, we should drop the column, because after filling it most of its values would be synthetic, which creates bias and does not help the analysis.

def drop_columns(df, columns):
  df.drop(columns, axis = 1, inplace = True)

drop_columns(df, ['company_type'])

Mean, Median, Mode methods

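These methods are similar to each other. I have written a function for each one and applied one of them as an example. Mean and median apply to numerical columns, while mode also works for categorical columns such as “gender”.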

def fill_mode(df, column):
  mode = df[column].mode()[0]
  df[column] = df[column].fillna(mode)

def fill_mean(df, column):
  mean = df[column].mean()
  df[column] = df[column].fillna(mean)

def fill_median(df, column):
  median = df[column].median()
  df[column] = df[column].fillna(median)

fill_mode(df, 'gender')

Divide and Conquer

In this method, I use another column to get better insights for the target column (the one that is going to be filled).

I use two names for the columns here:

column to conquer: the column that is going to be filled.
column to divide: the column that is used to split the rows into groups.

If we apply the previous mode method, then the mode of the whole column will be used to fill all the NAs.

To apply a more advanced method, the rows are divided into groups and a separate mode value is found for each group.

In this example, I assume that a person’s experience might be related to the company size; if you have more experience, you are likely to work in a bigger company.

In the first loop, the mode of each group is found and stored.

In the second loop, the rows of each group are selected and their NAs are filled with the mode of that group.

df['company_size'].unique()
df['experience'].unique()

def divide_and_conquer(df, column_to_conquer, column_to_divide):
  # First loop: find the mode of the target column within each group
  modes = {}
  for group in df[column_to_divide].dropna().unique():
    group_modes = df.loc[df[column_to_divide] == group, column_to_conquer].mode()
    # A group whose target values are all missing has no mode, so skip it
    if not group_modes.empty:
      modes[group] = group_modes[0]

  # Second loop: fill the NAs of each group with the mode of that group
  for group, mode_value in modes.items():
    mask = df[column_to_divide] == group
    df.loc[mask, column_to_conquer] = df.loc[mask, column_to_conquer].fillna(mode_value)

column_to_conquer = 'company_size'
column_to_divide = 'experience'

divide_and_conquer(df, column_to_conquer, column_to_divide)

Random Imputation

Here, we use randomly chosen observed values of the column to fill the missing values.

def random_imputation(df, column):
  # Draw from all observed values (not only the unique ones) so frequent values stay frequent
  options = df[column].dropna().values
  df[column] = df[column].apply(lambda x: np.random.choice(options) if pd.isna(x) else x)

random_imputation(df, 'major_discipline')

Model-based Methods

In this method, a model is trained to fill in missing values.

The target column is the one to be filled.

The train set is the rows without missing values.

The test set is the rows with missing values.

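column_to_fill = 'gender'

# Train on the complete rows; predict for the rows where the target is missing
df_train = df.dropna()
df_test = df[df[column_to_fill].isna()]

X_train = df_train.drop(column_to_fill, axis = 1)
y_train = df_train[column_to_fill]

# KNN cannot handle NaN, so the other columns of the test rows must already be filled
X_test = df_test.drop(column_to_fill, axis = 1)

# One-hot encode the categorical feature columns
X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes(include = 'object').columns, drop_first=True)
X_test = pd.get_dummies(X_test, columns=X_test.select_dtypes(include = 'object').columns, drop_first=True)

# Give the test set exactly the same columns, in the same order, as the train set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)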

After we get the sets, the model can be defined and trained.

The predictions are the values that are used to fill the NAs.

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=3)

knn_classifier.fit(X_train, y_train)

predictions = knn_classifier.predict(X_test)
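
The predictions still have to be written back into the data frame. Here is a minimal sketch on a toy data frame, assuming the predictions come back in the same row order as the rows with missing values. The column values are made up, and `toy_predictions` is a hard-coded stand-in for the output of `knn_classifier.predict(X_test)`:

```python
import numpy as np
import pandas as pd

# Toy data frame (hypothetical values) with two missing genders
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", None],
    "experience": [5, 3, 10, 1],
})

# Rows whose target value is missing, in their original order
missing_mask = toy["gender"].isna()

# Stand-in for knn_classifier.predict(X_test)
toy_predictions = np.array(["Female", "Male"])

# Assign the predictions to exactly the rows that were missing
toy.loc[missing_mask, "gender"] = toy_predictions
```

Using the same boolean mask for selecting the test rows and for the assignment keeps each prediction attached to the row it was made for.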

Check the theoretical explanation of each solution shown here:

Complete Guide to Missing Value Imputation

Clap and Follow for support!

Thank you for reading!

Complete Guide to Missing Value Imputation

Anar Abiyev — Sun, 04 Feb 2024 17:48:57 GMT

You will learn all the essential knowledge to deal with missing values in the dataset!

In this blog, I will go through different scenarios of missing data problems and their solutions.

You will know how to approach each case as a data scientist.

After you read this guide, you can also check my blog about Python implementation of methods explained in this blog!

Missing Value Imputation in Python

Outline

What is missing data?
What are the reasons for missing data?
Solutions.

What is missing data?

In data science, missing data refers to the absence of values or information in a dataset. Dealing with missing data is a crucial aspect of the data cleaning and preprocessing stage, as it can impact the quality and accuracy of analyses and machine learning models. It reduces the effective sample size, potentially reducing the power of statistical tests and the generalizability of models.

There are two general ways to deal with missing data:

Deletion. Removing rows or columns with missing values. This can lead to the loss of valuable information and may introduce bias.
Imputation. Filling missing values with estimated or predicted values.

In the next paragraphs, I will explain the best methods to use for missing data imputation in different situations.

What are the reasons for missing data?

Prior to starting missing data imputation, it is a good practice to analyze the reasons behind the missing data problem.

The way I prefer to do it is to use the missingno library in Python.

import pandas as pd
import missingno as msno

df = pd.read_csv('sample_dataset.csv')
msno.matrix(df)

The code produces a matrix like below. Here, the white lines are missing values. With such a tool, you can easily get a view of your dataset in terms of missing values.

Fig 1. Missing Value Matrix with missingno Library.

Before proceeding to the code section, let’s go through the reasons that might cause missing data.

There can be various reasons for missing data in a dataset. Understanding these reasons is crucial for handling missing data appropriately and making informed decisions in data analysis or modeling.

Here are some common reasons for missing data:

Non-response. Individuals or entities may choose not to respond to certain survey questions or provide specific information, leading to missing values.
Instrumentation Issues. Problems with measurement instruments or data collection tools can lead to missing values.
Technological Limitations. Technical constraints or limitations in data capture methods can result in missing data.
Unavailability of Historical Data. In longitudinal studies or time-series data, historical records may be missing due to various reasons such as system upgrades, changes in data collection methods, or data storage issues.

Solutions

In this section, I will go through various imputation strategies and explain which one you have to use for certain scenarios.

Dropping rows

This method involves removing entire rows from the dataset that contain missing values. It is simple and easy to implement, but it is only appropriate when a small share of the rows (up to 10%) contain missing values. Otherwise, too much data is lost.

For example, you have a dataset of 10,000 rows and 50 of them have missing values. You can drop those rows and continue with the remaining dataset, as they are a very small proportion (0.5%).

Dropping columns

In the previous method, I talked about rows. However, if the missing values are concentrated in one column, then you can drop that column.

If more than 60–70% of a column is missing, then you can make a case for dropping the entire column. Otherwise, if you try to fill the missing rows, most of the values in the column will be synthetic data and this can create bias.

Mean, Median, and Mode methods.

These are the most commonly used and most straightforward methods in missing data imputation.

Mean is the average of a numerical column. If there are some missing values in a numerical column, then you can use the mean of the column to fill.
Median is the middle value in a numerical column.

Bonus Tip: If the data has many outliers, use Median imputation, otherwise use mean imputation.

Mode is the most frequent value of a categorical column. This method can be used to fill categorical columns. Here, you have to pay attention to the balance between different classes.
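
The bonus tip about outliers can be illustrated with a small sketch. The salary values below are made up; the last one is an outlier:

```python
import statistics

# Hypothetical salaries; the last value is an outlier
salaries = [30, 32, 35, 31, 34, 500]

mean_value = statistics.mean(salaries)      # pulled far up by the outlier
median_value = statistics.median(salaries)  # stays near the typical values
```

Here the median is 33, while the mean is above 100, so filling a missing salary with the mean would insert a value that no typical row has.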

Divide and Conquer

The dataset is divided into subsets based on observed variables, and imputation is performed separately on each subset. This method addresses missing data based on related subsets, potentially capturing more nuanced patterns, but it requires careful consideration of how to divide the data. Complexity increases with multiple variables.

Let’s see the example below.

You have “age” and “marriage status” columns and the latter has some missing values. Instead of filling all the missing values with the mode of the column, you can divide rows based on age, because we can assume that more people get married when they get older.

So, you divide data into classes according to the age column: young, mature, old. After that, you fill in the missing values with the mode of each group separately.
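
The age example above can be sketched with pandas as follows. The data frame, the column names and the `group_mode` helper are hypothetical and only serve as an illustration:

```python
import pandas as pd

# Toy data: the age column has already been turned into classes
people = pd.DataFrame({
    "age_group": ["young", "young", "young", "old", "old", "old"],
    "marriage_status": ["single", "single", None, "married", None, "married"],
})

def group_mode(values):
    """Most frequent observed value of a group, or None if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else None

# One mode per age group
modes_per_group = people.groupby("age_group")["marriage_status"].agg(group_mode)

# For every row, look up the mode of its own group and use it to fill the gap
fill_values = people["age_group"].map(modes_per_group)
people["marriage_status"] = people["marriage_status"].fillna(fill_values)
```

The missing value in the young group becomes "single" and the one in the old group becomes "married", instead of both receiving the mode of the whole column.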

Random imputation / hot deck.

Random imputation, also known as hot deck imputation, is a method for handling missing data by replacing missing values with randomly selected observed values from the same variable.

The term “hot deck” refers to a metaphorical deck of cards, where each card (or observation) is available to be selected to fill in the missing value.

Identify the variables with missing values in the dataset.
Create a pool or deck of observed values from the variable containing missing values.
Randomly select values from the pool and use them to replace the missing values.

Random imputation helps preserve the variability in the dataset by introducing randomness into the imputed values. It is a relatively simple method to implement, requiring minimal computational resources.

But,

Since values are selected randomly, there’s a possibility of imputing values that do not accurately represent the overall distribution or patterns in the data.

So,

Careful consideration should be given to how the imputation pool is created to ensure that it is representative of the variable’s distribution.

Random imputation is often more suitable for continuous variables rather than categorical ones.

To account for uncertainty introduced by randomness, multiple imputations can be performed, creating several datasets with different imputed values for each missing entry.
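
The steps above, including the idea of multiple imputations, can be sketched as below. This is a minimal illustration, not a full multiple-imputation procedure; the `heights` values, the seeds and the `hot_deck_impute` helper are assumptions made for the example:

```python
import numpy as np
import pandas as pd

def hot_deck_impute(series, rng):
    """Fill missing entries with values drawn at random from the observed ones."""
    pool = series.dropna().to_numpy()   # the "deck" of observed values
    missing = series.isna()
    filled = series.copy()
    filled[missing] = rng.choice(pool, size=int(missing.sum()))
    return filled

heights = pd.Series([170.0, np.nan, 165.0, 180.0, np.nan])

# Several completed copies, each with a different random draw
completed = [hot_deck_impute(heights, np.random.default_rng(seed)) for seed in range(3)]
```

Because the pool keeps every observed value, including repeats, frequent values are drawn more often, which is how the method preserves the distribution of the column.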

Model-based methods.

Model-based imputation is an advanced technique for handling missing data by using predictive models to estimate and impute missing values. Instead of relying solely on summary statistics like mean or median, this method leverages relationships within the dataset to make informed predictions. The choice of the model depends on the characteristics of the data and the relationships between variables.

Let’s see the example below.

There is a dataset in which one column has missing values in some rows and present values in the others.

Fig 2. Dataset with missing and full values.

In order to use a model-based approach to impute missing values, the strategy in the following image will be applied. The present or full rows will be used as a train set while missing rows will be the test set. The results of the model will be used to fill in the missing values.

After the imputation process is done, the dataset will be divided into dependent and independent columns according to the task. But for the imputation itself, the dependent column has to be the one that has missing values.

Fig 3. Train and Test set for Model Imputation.

Model-based imputation takes into account relationships between variables, allowing for more accurate imputations compared to simple statistical measures. This method can capture non-linear relationships, making it suitable for datasets with complex patterns. Model-based imputation can be computationally intensive, especially when using complex models or dealing with large datasets.

Converting NA into a feature

This is a method for user form data. When you have a question that can be answered or left blank by the user, then you will have missing data for blanks.

This column can be converted to a binary column with values of true and false; true when the user answers, false when the user does not answer the question.

Here, the assumption we make is that the user didn’t answer the question because of a reason. Thus this is a feature itself.
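
A minimal sketch of this conversion with pandas; the `income` column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical form data: two users left the income question blank
survey = pd.DataFrame({"income": [52000.0, None, 61000.0, None]})

# True when the user answered, False when the question was left blank
survey["income_answered"] = survey["income"].notna()
```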

Check the Python implementation of each solution explained here:

Missing Value Imputation in Python

Clap and Follow for support!

Thank you for reading!

What is Dropout Regularization method?

Anar Abiyev — Fri, 05 Jan 2024 06:17:26 GMT

Does dropout really work? See the results of the experiment with the CNN model and CIFAR10 dataset!

In this article, you will learn about a regularisation method called Dropout.

The blog will be in two parts. In the first section, I will explain the idea behind the technique.

In the second part, you will see the results of the experiment I have carried out.

I have run the model 10 times and noted accuracies for each of the four dropout settings:

Without Dropout.
Dropout p = 0.1.
Dropout p = 0.3.
Dropout p = 0.5.

Part 1.

Dropout is a regularization technique commonly used in deep learning models to prevent overfitting. Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, to the extent that it performs poorly on test data.

Dropout helps address this issue by introducing randomness during training.

Dropout involves randomly “dropping out” (i.e., setting to zero) a certain percentage of neurons in a layer during each forward and backward training pass.

This means that, during training, some neurons do not contribute to the computation. The dropout rate is a hyperparameter determining the fraction of neurons to drop out.

The idea behind dropout is to prevent the co-adaptation of neurons. When dropout is applied, the network cannot rely too heavily on any particular set of neurons because they may be turned off at any moment. This forces the network to learn more robust and generalized features from the data.

During the testing or inference phase, dropout is usually turned off, and all neurons are active. This ensures that the model utilizes the full capacity it has learned during training.
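
The training and inference behavior described above can be sketched in NumPy. This is an illustration, not the code used in the experiment. It assumes the common "inverted dropout" variant, which rescales the surviving activations by 1 / (1 - p) during training so that nothing has to change at inference time:

```python
import numpy as np

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training."""
    if not training or p == 0.0:
        return activations          # inference: all neurons stay active
    keep_mask = rng.random(activations.shape) >= p
    # Rescale the kept activations so their expected sum is unchanged
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(1000)

train_out = dropout(activations, p=0.3, training=True, rng=rng)
test_out = dropout(activations, p=0.3, training=False, rng=rng)
```

In training mode roughly 30% of the values become zero, while in inference mode the input passes through unchanged.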

Before moving to the second part, let’s set our expectations from the experiment.

The dropout method is expected to lower training accuracy and raise the test accuracy.

Because some neurons will be set to zero (definition of dropout) during the training phase, the model will not learn the training set as well as without dropout.

As regularization techniques aim to help the model generalize better, the model is expected to perform better on the test set with dropout.

Part 2.

Dataset

The experiment has been carried out using the CIFAR10 dataset. The dataset contains 50000 training and 10000 testing color images of 32x32 pixels organized in 10 classes.

Model

The model used is a CNN with two convolutional layers. The results of the convolutional layers are fed into a fully connected neural network with one hidden layer. The output layer has 10 neurons, one neuron per class.

Fig 1. Convolutional Layers Architecture.

Fig 2. Fully connected Layers Architecture.

Experiment methodology

I have run the model 10 times and noted accuracies for each of the four dropout settings:

Without Dropout.
Dropout p = 0.1.
Dropout p = 0.3.
Dropout p = 0.5.

Each training has been carried out with 10 epochs. You will see both train and test accuracy for all cases, their averages, and different analyses.

Results

Firstly, let’s see the results in the table. I have plotted line graphs below.

There are four sections:

no dropout.
p=0.1.
p=0.3.
p = 0.5.

Each section has train and test columns where you can find corresponding accuracies for each model run.

Table 1. Experiment Results.

I have plotted line graphs for train and test sets separately for easier comparison.

Let’s continue with the train set. We mentioned that dropout will cause the train accuracy to drop.

The experiment results confirm our claim. The highest accuracy has been achieved by “No Dropout”, while the lowest accuracy is by “Dropout p = 0.5”.

Fig 3. Train Set Accuracies with and without Dropout.

If we look at the test set, we can see the reverse behavior. As there is more dropout, the test set accuracy increases.

Fig 4. Test Set Accuracies with and without Dropout.

Summary

Overall, in this blog, you learned what the dropout regularization technique is and observed the experiment results.

The experiment results confirmed our expectations of the dropout method. Train accuracy dropped, while test accuracy increased.

I have attached the Python code at the link below; you can run the code yourself and check the results.

Medium-Youtube/4. Dropout.ipynb at master · anarabiyev/Medium-Youtube

Thank you for reading. Clap and follow if you learned anything new.

PlainEnglish.io 🚀

Thank you for being a part of the In Plain English community! Before you go:

Be sure to clap and follow the writer️
Learn how you can also write for In Plain English️
Follow us: X | LinkedIn | YouTube | Discord | Newsletter
Visit our other platforms: Stackademic | CoFeed | Venture

What is Dropout Regularization method? was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

Convolutional Neural Network Terminology for Beginners

Anar Abiyev — Wed, 27 Dec 2023 06:16:54 GMT

Learn what kernel, stride, pooling, and many other terms mean for CNN. Easy explanation with images!

The blog explains what each of the terms below means:

Kernel or Filter.
Channel.
Stride.
Pooling.
Padding.
Dropout.

Kernel or Filter

Kernels or filters are the core part of the convolution process. Each kernel can also be called an “information detector”. For example, the kernel on the left is used to detect vertical lines, whilst the one on the right is used for horizontal lines.

Fig 1. Vertical and Horizontal Kernels.

If a kernel is convolved over the image, then the resultant layer will contain the information that the kernel detects in the original image.

I have tested an image with these kernels; let’s see the results:

Fig 2. Convolution of Vertical and Horizontal Kernels in Python.

To sum up, a kernel or filter is a matrix to extract information from an image with the help of a convolution operation.
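
A minimal NumPy sketch of this idea follows. The kernel values are an assumption (a simple edge detector, not necessarily the kernels shown in Fig 1), and the operation is written the way CNN layers apply it, sliding the kernel over the image without flipping it:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# 5x5 image with one vertical edge: dark columns (0) on the left, bright (1) on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

vertical_kernel = np.array([[-1, 0, 1],
                            [-1, 0, 1],
                            [-1, 0, 1]], dtype=float)
horizontal_kernel = vertical_kernel.T

vertical_response = convolve2d(image, vertical_kernel)
horizontal_response = convolve2d(image, horizontal_kernel)
```

The vertical kernel produces large values where the edge is and zero in the flat bright region, while the horizontal kernel produces only zeros because this image has no horizontal lines.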

Channel

The channel is a layer on the image. Initially, the image has one layer or three layers: one layer for grayscale images and three layers for RGB images.

After a convolution layer, the image is separated into multiple layers with the help of kernels. Each new layer contains kernel results. For example, if the vertical line kernel is used and the image has a lot of vertical lines, then its layer will have larger positive values.

In the example below, the input image has one layer; after the convolution operation, 6 new layers are derived.

Fig 3. Convolution Layers.

Stride

The stride determines the movement or step size of the kernel. If stride is 1, the kernel moves like below:

one pixel right until the end of the image,
one pixel down,
one pixel right until the end of the image,
one pixel down,
and so on …

The first image shows stride = 1, while the second image illustrates stride = 2.

Fig 4. CNN Stride 1.

Fig 5. CNN Stride 2.
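
The effect of the stride on the output size can be computed with the standard formula floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride and p the padding. A small sketch follows; the function name is hypothetical:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of kernel positions along one dimension of the image."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

size_stride_1 = conv_output_size(7, 3, stride=1)  # the kernel moves one pixel at a time
size_stride_2 = conv_output_size(7, 3, stride=2)  # the kernel skips every other position
```

For a 7-pixel-wide image and a 3x3 kernel, stride 1 gives 5 positions per row and stride 2 gives 3.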

Pooling

Pooling is a method to reduce the size of an image. The most frequently used pooling methods are average and max pooling.

As the animation shows, for the max pooling of size 2x2 and stride 2, only the maximum value of the four pixels is used in the resultant image. In other words, the maximum pixel represents the four pixels in the new image.

Fig 6. CNN Maxpool.

In this way, each dimension of the image is halved, so the new image has four times fewer pixels.

The goal of pooling is to reduce the number of pixels to have a lighter neural network with fewer parameters and prevent overfitting.
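
Max pooling of size 2x2 with stride 2 can be sketched in NumPy as below. The pixel values are made up, and the reshape trick is just one convenient way to group the pixels into 2x2 blocks:

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 (odd trailing rows or columns are dropped)."""
    h, w = image.shape
    blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    # Take the maximum inside every 2x2 block
    return blocks.max(axis=(1, 3))

image = np.array([[1, 3, 2, 4],
                  [5, 6, 1, 0],
                  [7, 2, 9, 8],
                  [3, 4, 6, 5]])

pooled = max_pool_2x2(image)
```

The 4x4 image becomes a 2x2 image in which each pixel is the maximum of one block: 6, 4, 7 and 9.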

Padding

Padding means adding an extra layer of zeros around the image.

The primary purpose of padding is to preserve spatial information at the edges of the input, preventing a reduction in the spatial dimensions of the feature maps. This is crucial in maintaining accurate boundary information and preventing the loss of important details during the convolutional process.

The animation below illustrates one layer of padding.

Fig 7. Padding.

Padding also lets the kernel be centered on the border pixels, so the convolutional operation is applied more uniformly across the entire input, and it keeps the feature maps from shrinking after every convolutional layer.

CNN padding plays a vital role in enhancing the performance and effectiveness of convolutional neural networks by addressing edge-related challenges and preserving spatial information during feature extraction.
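
One layer of zero padding can be sketched with NumPy's `np.pad`; the 3x3 image below is made up:

```python
import numpy as np

image = np.arange(1, 10).reshape(3, 3)

# Add one layer of zeros on every side: 3x3 becomes 5x5
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
```

With this padding, a 3x3 kernel with stride 1 produces an output of the same 3x3 size as the original image.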

Dropout

Dropout is a step during the training phase. In dropout, a random fraction of the neuron outputs is replaced by zero. The fraction is a hyperparameter; values from 0.1 to 0.5 are common.

By doing so, the model will be more general and will not overfit.

Thank you for reading! If I added value to your learning, please don’t forget to clap and follow!

P.S.

The images without reference belong to the author, while the images of other people have been indicated by showing corresponding reference links on image names.

Convolutional Neural Network Terminology for Beginners was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

Neural Network Terminology for Beginners

Anar Abiyev — Sun, 24 Dec 2023 13:37:23 GMT

Learn what neurons, layers, weights, biases, activation functions, epochs, forward & backward propagation, and other terms mean in deep learning!

The blog explains what each of the terms below means:

Neuron
Layer
Weight & Bias.
Activation Function
Forward & Backward Propagation
Epoch
Batch & Batch size.

Neuron

The image below illustrates a simple neural network. Every yellow circle you see in the image is a neuron. In other words, every node in the architecture is a neuron.

Fig 1. The simple architecture of a Neural Network.

Every node possesses a value. The combination of nodes creates a layer.

Layer

A layer is a group of nodes. There are three types of layers:

Input layer.
Hidden layer.
Output layer.

The first layer is the input layer, while the last one is the output layer. The other layers between them are called hidden layers.

I have a blog that explains the purpose of each layer in simple terms; check it from the link below before continuing:

Neural Network Layers Explained for Beginners

Weight & Bias

In a neural network, calculation means multiplying each neuron value by its weight, summing the results, and adding the bias.

Weights are illustrated on the lines that connect neurons. In the example below, 0.54 and 0.48 are the values of neurons in the input layer. There is also a bias with the value of 0.06. 0.2 and 0.1 are the weights of the lines connecting the neurons of the input layer with the first neuron of the hidden layer.

Fig 2. Weights and Biases.

The calculation is like below:

0.54 × 0.2 + 0.48 × 0.1 + 0.06 = 0.108 + 0.048 + 0.06 = 0.216

Activation function

In reality, while calculating the value of a neuron, there is one more step, which is the activation function.

Continuing with our example, the value of 0.216 is not directly assigned to the neuron. Before that, an activation function takes 0.216 as the input, and its output is assigned to the neuron. For example, if the activation function is the sigmoid function, the neuron gets the value sigmoid(0.216) = 1 / (1 + e^(-0.216)) ≈ 0.554.
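
The two steps for this neuron, the weighted sum and the activation, can be sketched in plain Python with the numbers from the example; the variable names are my own:

```python
import math

inputs = [0.54, 0.48]   # values of the input-layer neurons
weights = [0.2, 0.1]    # weights on the connecting lines
bias = 0.06

# Step 1: multiply each input by its weight, sum up, add the bias
weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias

# Step 2: pass the result through the sigmoid activation function
activated = 1.0 / (1.0 + math.exp(-weighted_sum))
```

The weighted sum is 0.216 and the sigmoid output is about 0.554, which is the value that gets assigned to the hidden neuron.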

I have a blog that is a complete guide for activation functions, check it from the link below:

Complete Guide to Activation Functions in Deep Learning

Forward & Backward Propagation

In the neural network, there are two directions of calculations:

Forward

In the forward direction, input data is fed into the neural network. This data travels through the network layer by layer, where each layer consists of neurons connected by weighted edges.

At each node, the weighted sum of inputs is computed, and an activation function is applied to introduce non-linearity to the model. This transformed output becomes the input for the next layer.

The final layer produces the network’s output, which is compared to the desired or target output. This comparison helps evaluate the performance of the network and determine the error.

Backward

In the backward direction, the calculated error (the difference between the predicted and target outputs) is propagated backward through the network.

The key objective is to minimize the error. This is achieved by adjusting the weights and biases of the connections between neurons. The adjustments are proportional to the gradient of the error with respect to the weights and biases.

Backpropagation employs optimization algorithms like gradient descent to iteratively update the weights and biases, moving the network towards a configuration that reduces the overall error.
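
To make the forward and backward directions concrete, here is a minimal sketch for the smallest possible case. It assumes a single neuron with one weight, one bias, no activation function and a squared-error loss; the input, target and learning rate are made up:

```python
def train_step(w, b, x, target, learning_rate):
    # Forward: compute the prediction and the error
    prediction = w * x + b
    error = prediction - target
    loss = error ** 2

    # Backward: gradients of the loss with respect to w and b
    grad_w = 2 * error * x
    grad_b = 2 * error

    # Gradient descent: move against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    return w, b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(20):
    w, b, loss = train_step(w, b, x=2.0, target=4.0, learning_rate=0.05)
    losses.append(loss)
```

The recorded loss shrinks at every step, which is the "moving towards a configuration that reduces the overall error" part in action.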

Epoch

The forward and backward processes are repeated over multiple iterations until the neural network converges to a state where the error is minimized and the model performs well on the training data.

One full pass of the training dataset through the forward and backward processes is called an epoch. The number of epochs is set before the training process according to the resources available, but the progress is observed closely. If the accuracy of the model stops improving while there are still some epochs left, the training process is stopped early.

Batch & Batch size

The dataset is divided into several parts before feeding the neural network. The batch size determines how many data points each part of the dataset will have. If the batch size is 32, then each section of the dataset will have 32 data points.
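
A minimal sketch of splitting a dataset into batches; the dataset of 100 numbered items and the `make_batches` helper are hypothetical:

```python
def make_batches(data, batch_size):
    """Split the data into consecutive parts of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

dataset = list(range(100))          # stand-in for 100 data points
batches = make_batches(dataset, batch_size=32)
batch_sizes = [len(batch) for batch in batches]
```

With 100 data points and a batch size of 32, there are four batches: three full ones and a last one with the remaining 4 data points.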

Neural Network Terminology for Beginners was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

Neural Network Layers Explained for Beginners

Anar Abiyev — Sat, 23 Dec 2023 15:52:56 GMT

How to know the number of layers and neurons in a Neural Network.

In a Neural Network, there are three types of layers:

Input
Hidden
Output

I will explain what they are and how many neurons each should have.

Input Layer

The input layer of your neural network depends on the dataset you are going to use for the task. For example, if the dataset consists of 28x28 pixel images, then your input layer needs to have 784 (28x28) neurons. Each pixel value will correspond to a neuron in the input layer.

For the input layer, you need to analyze the dataset and see how many neurons you need to feed that data into the model.

Hidden Layer

The number of hidden layers in a neural network is a hyperparameter.

There is no rule like you need two hidden layers for this or three hidden layers for that.

It is determined by trial and error.

But,

Some guidelines will help you to find the answer more efficiently.

Start with simple architecture and increase complexity gradually.
If the dataset is more complex, more hidden layers will help.
Consider domain knowledge, if there is a solution to a similar problem, refer to that architecture.

Output Layer

The output layer of a neural network depends on the task. If it is a regression problem, one neuron is enough. In a classification problem, on the other hand, the number of neurons is determined by the number of classes.

For example, when you predict which digit is in the picture, an output layer of 10 neurons will be used, one neuron for the probability of each digit.

Thank you for reading, don’t forget to check this tutorial to learn about Activation Functions.

Complete Guide to Activation Functions in Deep Learning

Standardization and Normalization — Clearly Explained!

Anar Abiyev — Thu, 21 Dec 2023 15:18:23 GMT

Standardization and Normalization, Feature Scaling — Clearly Explained!

This story will answer all your questions about standardization and normalization, and you will never search this topic again!

You probably have many questions about standardization and normalization, and you have searched many articles and watched some videos on YouTube.

After reading this blog until the end, I assure you that you will never search for standardization or normalization again!

In this blog, you will learn:

· What is feature scaling and why do we need it?

· Which models need scaling and which ones don’t?

· What is normalization?

· What is standardization?

· When to use normalization or standardization?

Firstly, let’s state that both normalization and standardization are types of feature scaling. They have different formulas and use cases, but both are used to change the scale of data.

What is even feature scaling?

Let’s say you have a column in your dataset that looks like the histogram on the left. The range of the column is between around 20 and 65. If we want to change the scale of the column, all we must do is divide the values by some constant, for example, 2. The histogram on the right shows the distribution after this scaling.

Fig 1. Feature Scaling by dividing with a constant.

Another method of scaling the data is by subtracting a constant. For instance, if you want the data to start from zero, you can subtract 20 from the column values.

Fig 2. Feature Scaling by subtracting a constant.

Please note that multiplication and addition can be used as well, but usually the aim is to bring the values close to zero, so subtraction and division are applied.

Why apply Feature Scaling?

Check out the dataset below: all three columns have different scales. If you feed this dataset to a model as it is, the model will give more importance to the column with the higher values, the “income” column in this example. However, we want the model to treat all columns as equals and to calculate the corresponding weights through optimization, not because of the scales.

Fig 3. Example dataset with different scales.

The models I will mention below are the ones that benefit from feature scaling the most:

Gradient-based optimization algorithms. Models that use gradient descent for optimization, such as linear regression, logistic regression, and neural networks. Scaling will help to converge faster.
Distance-based algorithms. Models that use distances between data points, such as k-Nearest Neighbors (KNN) and Support Vector Machines (SVM), can benefit from feature scaling because it ensures that all features contribute equally to the distance computation.
PCA (Principal Component Analysis). PCA is a dimensionality reduction technique that involves finding the principal components of the data. Feature scaling is important for PCA because it ensures that all features have equal importance in determining principal components.
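To make the point about distance-based algorithms concrete, here is a minimal sketch in plain Python. The three people, the feature ranges, and the helper names are all made up for illustration. It compares Euclidean distances before and after scaling both features to the 0–1 range.

```python
import math

# Assumed (hypothetical) feature ranges used for scaling.
AGE_RANGE = (18, 80)              # years
INCOME_RANGE = (20_000, 120_000)  # dollars

def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def scale(person):
    """Map (age, income) to the 0-1 range using the assumed ranges above."""
    age, income = person
    return (
        (age - AGE_RANGE[0]) / (AGE_RANGE[1] - AGE_RANGE[0]),
        (income - INCOME_RANGE[0]) / (INCOME_RANGE[1] - INCOME_RANGE[0]),
    )

a = (25, 50_000)
b = (60, 50_500)  # 35 years older than a, almost the same income
c = (26, 51_000)  # 1 year older than a, slightly higher income

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab, scaled_ac = euclidean(scale(a), scale(b)), euclidean(scale(a), scale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, the income column dominates the distance, so person a looks closer to b even though their ages differ by 35 years. After scaling, a is closer to c, which matches intuition.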

The models that do not benefit from feature scaling are the ones that do not depend on the magnitude of the numerical values themselves:

Tree-based models. Decision trees, Random Forests, and Gradient Boosted Trees. These models make decisions based on feature thresholds and are invariant to monotonic transformations of the features.
Naive Bayes. Naive Bayes classifiers are probabilistic models that assume independence between features given the class. They are generally not sensitive to the scale of individual features.

What is Normalization?

Normalization is moving the scale of data into the range between 0 and 1. It is done with the following formula:

Formula 1. Equation of Normalization.

Let’s apply normalization to the sample data we plotted above:

Fig 4. Normalized data.

As you can see, the shape of the histogram remains the same, but the range has been changed to 0–1.
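As a rough sketch of the idea, min-max normalization can be written in a few lines of plain Python. The formula used here, (x - min) / (max - min), is the standard min-max formula and is assumed to be the one shown in Formula 1; the function name and the sample values are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values to the 0-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 35, 50, 65]  # hypothetical column values
normalized = min_max_normalize(ages)
print(normalized)  # the smallest value maps to 0.0 and the largest to 1.0
```

The order of the values and their relative spacing stay the same, which is why the histogram keeps its shape.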

This is all the theoretical background needed for normalization, changing the scale of the dataset to the range between zero and one. Let’s move to standardization.

What is Standardization?

The purpose of standardization is the same as normalization — changing the scale of the data. However, it achieves this by a different method. Instead of altering the range into a fixed range, the mean and variance of the data are changed.

It may seem complicated, but I will explain all the terms one by one.

Let’s continue with the formula to have a clear view of what it means “to apply standardization”:

Formula 2. Equation of Standardization.

Here,

- µ is the mean, which is the average of data.

- σ is the standard deviation.

Simply put, the mean of data is subtracted, and the result is divided by the standard deviation. After this operation, the mean of the resultant data will be equal to zero and the variance to one.
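The operation can be sketched in plain Python as follows. The function name and the sample values are made up, and the population standard deviation is used here, which is also what scikit-learn's StandardScaler does.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [20, 35, 50, 65]  # hypothetical column values
z = standardize(ages)

print(z)
print("mean:", statistics.fmean(z))          # approximately 0
print("variance:", statistics.pvariance(z))  # approximately 1
```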

The best way to observe this is with two-dimensional data. See how the values and the separation of the points change when standardization is applied. The points are now centered around zero, so the mean is zero, and the distances between them have been rescaled (decreased in this example), so the variance is one.

Fig 5. Data points before standardization.

Fig 6. Data points after standardization.

Now that you have an intuition for what changing the mean and variance looks like, let’s see our example data after standardization:

Fig 7. Standardized data.

An important point to underline here: there is a common misconception that standardization turns the data into a normal distribution. This is wrong. Yes, the mean and the variance are 0 and 1, respectively, both for the standard normal distribution and for standardized data, but that does not make the data normally distributed: the shape of the distribution stays the same. You can observe this in our example as well.
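One way to check this yourself is to measure the skewness of clearly non-normal data before and after standardization. The sketch below uses only the standard library; the exponential sample and the helper names are made up for illustration.

```python
import random
import statistics

def standardize(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def skewness(values):
    """Third standardized moment: 0 for symmetric data such as a normal distribution."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sigma) ** 3 for v in values])

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(5000)]  # strongly right-skewed sample

skew_before = skewness(data)
skew_after = skewness(standardize(data))
print(skew_before, skew_after)  # the two values match: the shape is unchanged
```

The mean and variance change, but the skewness does not, so the standardized data is exactly as far from a normal distribution as it was before.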

When to use normalization or standardization?

In general, the best approach is to try both methods and see which result is better.

If we dive more into the use cases:

- Normalization is preferred for neural networks, especially when working with images, where the pixel values are scaled from the 0–255 range to the 0–1 range.

- Standardization is preferred when there are outliers in the data because outliers can negatively affect normalization by shrinking other values.
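The shrinking effect mentioned in the second point is easy to reproduce. The sketch below, with made-up values and a hypothetical helper, normalizes the same four values with and without a single outlier.

```python
def min_max_normalize(values):
    """Rescale values to the 0-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

typical = [20, 30, 40, 50]
with_outlier = typical + [1000]  # one extreme value added

norm_typical = min_max_normalize(typical)
norm_with_outlier = min_max_normalize(with_outlier)

print(norm_typical)       # the four values spread over the whole 0-1 range
print(norm_with_outlier)  # the same four values are squeezed close to 0
```

Because the outlier defines the maximum, the other values end up using only a tiny part of the 0–1 range.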

You can check out the source code from the link below.

Medium-Youtube/Standardization_Normalization.ipynb at master · anarabiyev/Medium-Youtube

Thank you for reading, hope I added value to your journey in mastering data science / AI. If so, do not forget to clap and follow!

Check out the latest story about Activation Functions as well:

Complete Guide to Activation Functions in Deep Learning

How to Use Optuna? Step-by-step Beginner Guide for Hyperparameter Tuning!

Anar Abiyev — Thu, 21 Dec 2023 10:16:52 GMT

Learn how to use Optuna for hyperparameter tuning. This is a complete step-by-step guide for beginners.

When I was searching for tutorials about Optuna, I could not find an easy-to-understand, step-by-step guide. I decided to write this blog to help anyone who wants to learn Optuna from scratch.

According to their GitHub page, Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning.

In this blog post, you will learn how to use Optuna in your projects!

Without further ado, let’s get started!

Step 1.

As you might already know, hyperparameter tuning means training the model with different parameter combinations to find the best set of parameters.

Thus, we need a measurement to compare different models; it can be an accuracy metric or a loss metric. If it is an accuracy metric, we will select the model with the highest result; otherwise, we will choose the model with the lowest loss.

The first step is to have a model with a metric to measure its success.

The example below is a simple sklearn Random Forest Regression model which will be used to show how to apply Optuna.

# Import libraries
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

import optuna

# Data Preparation
df = pd.read_csv('optuna_dataset.csv')
df = pd.get_dummies(df, columns=df.select_dtypes(include = 'object').columns, drop_first=True)

X = df.drop('charges', axis = 1)
y = df.charges

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Modelling
rf_reg = RandomForestRegressor().fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
print(mean_absolute_error(y_test, y_pred))

Step 1.5.

This section is to build an intuition for the second step.

As it might be confusing to run into the code directly, I have prepared a diagram that will help you understand the working principle of Optuna.

Fig 1. Diagram of Optuna workflow.

The tuning process starts with telling Optuna which parameters to try. Sklearn models have many parameters which can be found in sklearn documentation. The link below is documentation for Random Forest Regressor.

RandomForestRegressor

After the parameter suggestions are set, Optuna selects a parameter combination for each trial (by default with the TPE sampler). Then, for each parameter combination, a new model is trained and an error is calculated.

In the final step, the parameter combination that produced the lowest error is selected as the best parameters.
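To build intuition, the loop described above can be sketched in plain Python. Please note that this is not Optuna’s API and not the TPE algorithm: it samples parameters uniformly at random, and `toy_error` is a made-up stand-in for “train a model and measure its error”. It only illustrates the suggest, evaluate, and keep-the-best cycle.

```python
import random

def toy_error(params):
    # Hypothetical stand-in for training a model and computing its error.
    return abs(params["n_estimators"] - 100) * 0.1 + abs(params["max_depth"] - 12)

# Parameter names and their (low, high) integer ranges to try.
search_space = {"n_estimators": (90, 110), "max_depth": (5, 30)}

rng = random.Random(0)
trials = []
for _ in range(50):
    # 1. Pick a parameter combination for this trial (uniform sampling here).
    params = {name: rng.randint(lo, hi) for name, (lo, hi) in search_space.items()}
    # 2. Evaluate it and remember the result.
    trials.append((toy_error(params), params))

# 3. The combination with the lowest error wins.
best_error, best_params = min(trials, key=lambda t: t[0])
print(best_params, best_error)
```

Optuna follows the same cycle, but its sampler uses the results of earlier trials to decide which combinations to try next.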

Step 2.

Now let’s move to Python and see how all these work in coding.

Optuna works by having you define a function, conventionally called “objective”, that takes one parameter named “trial”. As shown in the diagram below, the function contains the parameter suggestions and the model, and it returns the error (or accuracy, depending on how you define it).

It is pretty straightforward; you suggest some parameters for the model and Optuna tests them and gives you the best parameters.

Fig 2. Diagram of “Objective” function.

To suggest parameters, the Optuna framework provides some options. In our example, we will use several of them, so you can get familiar with them.

Fig 3. Suggest functions of Optuna.

Now, let’s move to the code in Python.

Firstly, the suggestions are defined with the help of the functions shown above. Note that the name of each parameter is passed as a string to the “suggest_***” functions. Optuna treats this name as a label, so it only has to be unique within the study, but using the same name as in the model’s documentation keeps the results easy to map back to the model.

The second section defines the model. Here, you pass each parameter you want to tune to the model and set it to the corresponding variable defined in the previous section.

The third section has nothing special or new: it fits the model and calculates the error.

In the end, the error is returned.

def objective(trial):
    
    #1 Define hyperparameters to be tuned
    n_estimators = trial.suggest_int('n_estimators', 90, 110)
    max_depth = trial.suggest_int('max_depth', 5, 30)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 6)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 6)

    #2 Create a Random Forest Regressor with the suggested hyperparameters
    rf = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42
    )

    #3 Fit the model and calculate error
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)

    return mae

To summarize, you need to:

build a model without Optuna as usual.
determine which parameters you want to tune.
define “objective” function.
add suggestions for the parameter you want to tune.
define how to measure error.

Step 3.

After defining the “objective” function, you need to create an Optuna study and run the optimization with the code below.

The important point here is direction. You have to choose “minimize” or “maximize”:

if you defined an error to be returned in the “objective” function, then you need to use “minimize”.
if you defined accuracy to be returned, then you need to use “maximize”.

As we defined MAE (mean absolute error), the direction will be “minimize”.

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200)

After running the code above, the study will hold the best parameters it found.

With the code below, you can print the best parameters and use them to train the final model. Then you can evaluate it with predictions on the test set.

best_params = study.best_params
print("Best Hyperparameters:", best_params)

# Train the final model with the best hyperparameters
best_rf = RandomForestRegressor(
    n_estimators=best_params['n_estimators'],
    min_samples_split = best_params['min_samples_split'],
    min_samples_leaf = best_params['min_samples_leaf'],
    max_depth = best_params['max_depth'],
    random_state=42
)
best_rf.fit(X_train, y_train)

y_pred_best = best_rf.predict(X_test)
print(mean_absolute_error(y_test, y_pred_best))

Find the whole code and dataset from the links below:

Thank you for reading, if I added some value to your learning, don’t forget to clap and follow!

How to Use Optune? Step-by-step Beginner Guide for Hyperparameter Tuning! was originally published in Python in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

Complete Guide to Activation Functions in Deep Learning

Anar Abiyev — Sun, 17 Dec 2023 01:31:46 GMT

This article will answer all of your questions about activation functions: why we need them, what they are, and which one to use!

An activation function is the last step before you assign the value to the neuron in the Neural Network. After multiplying the values of the previous layer’s neurons with corresponding weights, the results are summed up and fed into the activation function. The return of the activation function is assigned to the current neuron.

But why don’t we just use the sum itself instead of increasing the computation cost with activation functions? The short answer is that activation functions make neural networks capable of learning non-linear features of the dataset.

Let’s break down what it means.

When an activation function is present, the value z, which is calculated from the neurons of the previous layer, the weights, and the bias, is fed into the activation function f.

Formula 1. Calculation of a neuron value.

Fig 1. Calculation of a neuron value.

If we don’t use an activation function, the formula is identical to that of linear regression, but we aim to build a model more powerful than linear regression. This is why neural networks with activation functions are much stronger than linear models. Activation functions introduce non-linearities into the network, allowing it to capture complex patterns and relationships in the data.
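A small sketch in plain Python can make this tangible. It assumes the usual neuron formula, z = sum of (weight × input) + bias followed by f(z); the weights, biases, and function names are made up for illustration. It stacks two single-input neurons, once without and once with an activation function in between.

```python
def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=None):
    """Weighted sum plus bias, optionally passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z) if activation is not None else z

def two_layer(x, activation=None):
    """Two stacked neurons; the activation (if any) is applied after the first."""
    hidden = neuron([x], [2.0], 0.0, activation)
    return neuron([hidden], [3.0], 1.0)

xs = (-1.0, 0.0, 1.0)
linear_outputs = [two_layer(x) for x in xs]      # [-5.0, 1.0, 7.0]
relu_outputs = [two_layer(x, relu) for x in xs]  # [1.0, 1.0, 7.0]
print(linear_outputs, relu_outputs)
```

Without an activation, the two layers collapse into the single straight line 6x + 1, so the extra layer adds nothing. With ReLU in between, the outputs no longer lie on one line.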

The ability to model non-linear features is crucial for deep learning models to effectively learn from complex datasets and solve tasks such as image recognition, natural language processing, and other pattern recognition problems. The hierarchical structure of deep neural networks, with multiple layers of non-linear transformations, allows them to automatically learn and extract hierarchical representations of features from the input data. This enables deep learning models to handle tasks that involve non-linear relationships within the data.

By now it must be clear why activation functions are necessary for neural networks. The next questions that may come up are the following:

· Why are there different types of activation functions?

· What are the differences between them?

· Which one is the best?

· Usage in Python.

In the next paragraphs, I will go through the different types of activation functions individually, and break down all of them for a simple explanation. After going through all of them you will have a clear view of the differences and comparisons between various activation functions.

Sigmoid or logistic function

Formula and graph:

Formula 2. Calculation of Sigmoid.

Fig 2. Graph of Sigmoid

According to Wikipedia, a sigmoid function is any mathematical function having a characteristic “S”-shaped curve or sigmoid curve. It is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point.

The sigmoid function maps any value to the range between 0 and 1. You can imagine it as converting values into probabilities; thus, it is very common to apply this activation function in the output layer of classification models. Keep in mind that sigmoid can also be used in the hidden layers of the NN and it was a common practice in the early deep-learning architectures. Nowadays, it is still useful in certain scenarios, specifically when you want the output of the neurons to be between 0 and 1.

The reason sigmoid is not commonly used anymore is its most important drawback — the vanishing gradient problem.

What does this mean?

When the backpropagation algorithm is applied during optimization, the derivative of the activation function is also calculated. In the case of the sigmoid, this derivative is small (at most 0.25), and backpropagation multiplies the gradient by such small values once per layer; thus, in deep networks the gradient reaching the early layers becomes extremely small and barely updates their weights. In other words, the gradient vanishes.

This problem was later addressed by a new activation function, which is the subject of the next section.

ReLU (Rectified Linear Unit)

Formula and graph:

Formula 3. Calculation of ReLU.

Fig 3. Graph of ReLU.

ReLU maps positive values as they are, and negative values as zero. It is the most popular activation function in deep learning. Its popularity comes from simplicity, efficiency, and the ability to mitigate the vanishing gradient problem.

As seen from the formula, ReLU requires no expensive computation, only a max operation. This reduces computational cost, which matters when training large-scale neural networks, where millions or even billions of parameters are updated during each iteration of the optimization process.

Traditional activation functions like sigmoid and tanh can saturate for extreme values, leading to vanishing gradients. ReLU, on the other hand, does not saturate for positive inputs, allowing gradients to flow more freely during backpropagation. For sigmoid, the derivative is at most 0.25, while the derivative of ReLU is either 0 or 1. When there are multiple layers, the sigmoid makes the gradient very small because its derivative is always smaller than 1, whereas ReLU either keeps the gradient the same or makes it zero. For deep learning, it is better to have a gradient factor of zero or one than a tiny number. Keep in mind that for every neuron with a positive input the factor is exactly one.
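The numbers above can be checked with a few lines of plain Python. This is a simplified sketch: it only multiplies the activation derivatives across layers and ignores the weights, and the 10-layer depth is an arbitrary choice for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

layers = 10  # arbitrary depth for illustration

# Best case for sigmoid: every layer sits at z = 0, where the derivative is 0.25.
sigmoid_chain = sigmoid_grad(0.0) ** layers
# ReLU with positive inputs in every layer.
relu_chain = relu_grad(1.0) ** layers

print(sigmoid_grad(0.0))  # 0.25
print(sigmoid_chain)      # below 1e-6 even in the best case
print(relu_chain)         # 1.0
```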

In addition to addressing the vanishing gradient problem, ReLU introduces sparsity in the network. Since ReLU sets negative values to zero, some neurons in the network become inactive, leading to sparse activation patterns. Sparsity can be advantageous in terms of reducing overfitting, computational efficiency, and memory utilization.

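As always, if there is no problem, there is no development. ReLU has a problem of its own, called “dying ReLU”: a neuron whose input is always negative outputs zero and receives zero gradient, so it stops learning. Leaky ReLU addresses this.

Leaky ReLU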

Formula and graph:

Formula 4. Calculation of Leaky ReLU.

Fig 4. Graph of Leaky ReLU.

The update to the classic ReLU is to multiply negative values by a small coefficient rather than making them zero (you can see that the negative side of the graph is not exactly zero). This keeps a small, non-zero gradient for negative inputs and solves the “dying ReLU” problem. The graph above represents Leaky ReLU with the alpha coefficient equal to 0.01. This is a common default (for example, in PyTorch), but it can be altered.

By allowing a controlled leak of information for negative inputs, Leaky ReLU promotes a more robust flow of gradients during backpropagation, addressing issues associated with the vanishing gradient problem. This characteristic makes Leaky ReLU a popular choice in deep learning architectures, offering a good balance between the linearity of traditional ReLU and the avoidance of complete inactivity in certain neurons, which can enhance the learning capabilities of neural networks.
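As a minimal sketch, Leaky ReLU and its derivative can be written in plain Python like this. The function names are made up, and alpha = 0.01 is the value used in the graph above.

```python
def leaky_relu(z, alpha=0.01):
    """Positive inputs pass through; negative inputs are scaled by alpha."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha (not 0) for negative inputs."""
    return 1.0 if z > 0 else alpha

for z in (-5.0, -1.0, 0.5, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

The important part is the derivative: for negative inputs it is alpha instead of zero, so the neuron still receives a small gradient and can recover.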

Tanh

Formula and graph:

Formula 5. Calculation of tanh.

Fig 5. Graph of tanh.

The hyperbolic tangent function, commonly abbreviated as tanh, is a widely used activation function in neural networks. It is similar to sigmoid, but its range is between -1 and 1. One significant advantage of the tanh function is that its output is zero-centered. This contrasts with the sigmoid activation function, which outputs values in the range (0,1) and is not zero-centered. The zero-centeredness of tanh can be beneficial during the training of neural networks.
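Zero-centeredness is easy to see on a set of inputs that is symmetric around zero. The sketch below uses only the standard library, and the input values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric around zero

tanh_mean = sum(math.tanh(z) for z in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(z) for z in inputs) / len(inputs)

print(tanh_mean)     # approximately 0: outputs are centered around zero
print(sigmoid_mean)  # approximately 0.5: outputs are all positive
```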

The tanh function squashes its inputs to the range of (-1,1). This bounded output range is advantageous in scenarios where it is desirable to constrain the outputs within specific bounds. In tasks such as image processing or text generation, where the intensity or relevance of features should be well-regulated, the bounded nature of tanh can be valuable.

The tanh function became preferred over the sigmoid function as it gave a better performance for multi-layer neural networks. However, it did not solve the vanishing gradient problem that sigmoid suffered.

Like sigmoid, tanh is useful in certain scenarios such as classification. Its advantages include the ability to generate non-linearities and to capture both negative and positive values. Another great feature of tanh activation lies in its ability to avoid overfitting during training periods if regularization parameters are carefully tuned. Tanh smooths out output values, unlike ReLU which can lead to overfitting if not managed properly. This makes the learning process much more stable during long training periods and allows for better generalization of a dataset.

Softmax

Softmax is another activation function to discuss. Somewhat similar to sigmoid, it is used in the output layer of a neural network for multi-class classification problems. It takes an input vector and transforms it into a probability distribution. The output of the softmax function is a vector of probabilities that sums to 1. You can think of softmax as a multiclass version of sigmoid.

The softmax function normalizes the input values to produce a probability distribution. The class with the highest probability is then typically chosen as the predicted class. The softmax activation is useful for converting raw scores or logits into probabilities, making it suitable for the final layer of a neural network used for classification tasks.
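A minimal sketch of softmax in plain Python is shown below. The logits are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(logits)
predicted_class = probs.index(max(probs))

print(probs)            # three probabilities that sum to 1
print(predicted_class)  # 0, the class with the highest score
```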

These are the most commonly used activation functions, but there are many more types as well. The majority of them are modifications of the ones we discussed. The following paragraphs will show their usage in Python.

Python

PyTorch provides a variety of activation functions that can be easily integrated into neural network architectures. The available functions can be checked from the link below:

torch.nn - PyTorch 2.4 documentation

From a practical point of view, an activation function is selected at the start of a deep learning project according to the characteristics of the problem being solved.

In the code sample below, a simple Keras (TensorFlow) layer is constructed; PyTorch offers the same activation functions. Here, instead of “relu”, you can use “sigmoid”, “tanh”, “elu”, “softmax”, or other functions.

import tensorflow as tf
tf.keras.layers.Dense(128, activation='relu')

When you build a neural network, try to refer to the documentation and solutions to similar problems to find out which activation functions might work well for your case.

You can access the code from the link below to plot and analyze different activation functions:

Medium-Youtube/Activation_Functions.ipynb at master · anarabiyev/Medium-Youtube

I hope I added value to your deep learning journey.

If so, please don’t forget to subscribe for more tutorials to come and clap the story!

Complete Guide to Activation Functions in Deep Learning was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to visualize loss and accuracy for Deep Learning models using TensorBoard (Part 2)

Anar Abiyev — Wed, 13 Dec 2023 08:17:19 GMT

How to Visualize Loss and Accuracy for Deep Learning Models Using TensorBoard (Part 2)

You will learn how to visualize loss and accuracy easily using the TensorBoard feature of TensorFlow.

This is Part 2 of the TensorBoard tutorial. Check Part 1 to learn how to:

set up TensorBoard.
visualize graphs, images and hyperparameter tuning.

The hyperparameter tuning part is definitely worth reading because TensorBoard visualizes the process in a great way!

In this part, you will learn how to make small modifications to your deep learning code to visualize the loss and accuracy of a model by epoch.

Photo generated by Leonardo.Ai.

Introduction

The example will be based on the MNIST dataset. I will not talk about this dataset, because I am sure you are aware of it if you are looking for TensorBoard.

The code below is the example given by TensorFlow. Check the reference for more:

Training a neural network on MNIST with Keras | TensorFlow Datasets

import tensorflow as tf
import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'], shuffle_files=True, as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)


model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train, epochs=6, validation_data=ds_test,
)

How to do it

The changes we will make are:

1. Create TensorBoard callback:

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

2. Change “model.fit” to add callback:

model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
    callbacks=[tensorboard_callback]  # Add TensorBoard callback
)

Results

Start TensorBoard with the code below:

%load_ext tensorboard
%tensorboard --logdir logs

Navigate to the Scalars tab to see the graphs:

loss

accuracy

This is your tutorial for TensorBoard, don’t forget to check Part 1!

Step-by-Step Guide to TensorBoard: Game Changer Visualization tool- Part 1

How to visualize loss and accuracy for Deep Learning models using TensorBoard (Part 2) was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.
