<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Anar Abiyev on Medium]]></title>
        <description><![CDATA[Stories by Anar Abiyev on Medium]]></description>
        <link>https://medium.com/@anar-abiyev?source=rss-15151ff5820e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*ZJMUyatTaw1R4sA5dWRFkQ.jpeg</url>
            <title>Stories by Anar Abiyev on Medium</title>
            <link>https://medium.com/@anar-abiyev?source=rss-15151ff5820e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 19 May 2026 14:59:46 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@anar-abiyev/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Missing Value Imputation in Python]]></title>
            <link>https://anar-abiyev.medium.com/missing-value-imputation-in-python-4e64d64ac43c?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/4e64d64ac43c</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Sun, 04 Feb 2024 17:49:35 GMT</pubDate>
            <atom:updated>2024-02-04T17:49:35.868Z</atom:updated>
            <content:encoded><![CDATA[<h4>This blog will teach you how to deal with missing values in Python</h4><p>My previous blog covered the theory behind the topic:</p><p><a href="https://anar-abiyev.medium.com/complete-guide-to-missing-value-imputation-1e75cb2a6dca">Complete Guide to Missing Value Imputation</a></p><p>In this one, you will learn how to implement those methods in Python.</p><p>Without further ado, let’s get started!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dRNxctzwDdQnQdjbbIENdg.png" /></figure><h3>Imports</h3><p>First, we import the necessary libraries and load the dataset into a pandas DataFrame.</p><pre>import pandas as pd<br>import numpy as np<br>import missingno as msno<br><br>df = pd.read_csv(&#39;sample_dataset.csv&#39;)<br>df.head()</pre><h3>Overview of Missing Values</h3><p>After that, we can use the missingno matrix function to look at how the missing values are distributed across the dataset.</p><pre>msno.matrix(df)</pre><figure><img alt="Missingno Matrix." src="https://cdn-images-1.medium.com/max/1024/1*F47nV6c0Ca4k_Bo_ESGr-g.png" /><figcaption>Fig 1. <strong>Missingno Matrix.</strong></figcaption></figure><p>To count the missing values in each column, we can chain .isnull() and .sum():</p><pre>df.isnull().sum()</pre><figure><img alt="Pandas .isnull().sum() Output." src="https://cdn-images-1.medium.com/max/218/1*WxCZ-wtv-yU7WV8EeuNYOA.png" /><figcaption>Fig 2. 
<strong>Pandas .isnull().sum() Output.</strong></figcaption></figure><p>For better insight, I have written this function to print the percentage of missing values in every column.</p><pre>def column_missing_value_percentiles(df):<br>  # share of missing rows per column, as a percentage of the row count<br>  values = df.isnull().sum().values / df.shape[0] * 100<br>  columns = df.columns<br>  for idx in range(len(columns)):<br>    print(f&quot;{columns[idx]}: {values[idx].round()}%&quot;)<br><br>column_missing_value_percentiles(df)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*7tiuUOYpFs9dlfK0bkCfDw.png" /><figcaption>Fig 3. <strong>Output for Percentages.</strong></figcaption></figure><h3>Solutions</h3><p>Now you will learn which method is suitable for which kind of missing values.</p><h3>Dropping Rows</h3><p>If you check the percentages again, you can see that some columns have only a small share of missing values (less than 5%).</p><p>For such columns, we can drop the rows in which these columns are missing.</p><pre>def drop_rows(df, columns):<br>  df.dropna(subset=columns, inplace=True)<br><br>drop_rows(df, [&#39;enrolled_university&#39;, &#39;education_level&#39;, &#39;last_new_job&#39;, &#39;experience&#39;])</pre><h3>Dropping Columns</h3><p>For the column called “company_type”, the percentage of missing values is 67%. In this case, we should drop the column, because filling it would make most of its values synthetic, which creates bias and does not help the analysis.</p><pre>def drop_columns(df, columns):<br>  df.drop(columns, axis = 1, inplace = True)<br><br>drop_columns(df, [&#39;company_type&#39;])</pre><h3>Mean, Median, Mode methods</h3><p>These methods are similar to each other. 
I have shown a function for each one and used one of them as an example.</p><pre>def fill_mode(df, column):<br>  mode = df[column].mode()[0]<br>  df[column] = df[column].fillna(mode)<br><br>def fill_mean(df, column):<br>  mean = df[column].mean()<br>  df[column] = df[column].fillna(mean)<br><br>def fill_median(df, column):<br>  median = df[column].median()<br>  df[column] = df[column].fillna(median)<br><br>fill_mode(df, &#39;gender&#39;)</pre><h3>Divide and Conquer</h3><p>In this method, I use another column to get better insight into the target column (the one that is going to be filled).</p><p>I use two names for the columns here:</p><ul><li>column to conquer: the column that is going to be filled.</li><li>column to divide: the column that is used to split the rows into groups.</li></ul><p>If we apply the previous mode method, then the mode of the whole column will be used to fill all the NAs.</p><p>To apply a more advanced method, the column is divided into groups and the mode of each group is found separately.</p><p>In this example, I assume that a person’s experience might be related to the company size; if you have more experience, you are likely to work in a bigger company.</p><p>In the first loop, the mode of each group is found and stored in a list.</p><p>In the second loop, the rows of each group are selected and their NAs are filled with that group’s mode.</p><pre>df[&#39;company_size&#39;].unique()<br>df[&#39;experience&#39;].unique()<br><br>def divide_and_conquer(df, column_to_conquer, column_to_divide):<br>  # first loop: find the mode of the target column within each group<br>  groups = df[column_to_divide].dropna().unique()<br>  modes = []<br>  for group in groups:<br>    group_modes = df.loc[df[column_to_divide] == group, column_to_conquer].mode()<br>    # a group with no observed values has no mode, so store None for it<br>    modes.append(group_modes[0] if not group_modes.empty else None)<br><br>  # second loop: fill the NAs of each group with that group&#39;s mode<br>  for group, mode_value in zip(groups, modes):<br>    if mode_value is None:<br>      continue<br>    mask = df[column_to_divide] == group<br>    df.loc[mask, column_to_conquer] = df.loc[mask, column_to_conquer].fillna(mode_value)<br><br>column_to_conquer = &#39;company_size&#39;<br>column_to_divide = &#39;experience&#39;<br><br>divide_and_conquer(df, column_to_conquer, column_to_divide)</pre><h3>Random Imputation</h3><p>Here, each missing value is filled with a value chosen at random from the distinct values observed in the same column.</p><pre>def random_imputation(df, column):<br>  options = df[column].dropna().unique()<br>  df[column] = df[column].apply(lambda x: np.random.choice(options) if pd.isna(x) else x)<br><br>random_imputation(df, &#39;major_discipline&#39;)</pre><h3>Model-based Methods</h3><p>In this method, a model is trained to fill in missing values.</p><p>The target column is the one to be filled.</p><p>The train set consists of the rows without missing values.</p><p>The test set consists of the rows where the target column is missing.</p><pre>column_to_fill = &#39;gender&#39;<br><br>df_train = df.dropna()<br>df_test = df[df[column_to_fill].isna()]<br><br>X_train = df_train.drop(column_to_fill, axis = 1)<br>y_train = df_train[column_to_fill]<br><br>X_test = df_test.drop(column_to_fill, axis = 1)<br><br># encode train and test together so that both end up with the same dummy columns<br>X_all = pd.concat([X_train, X_test])<br>X_all = pd.get_dummies(X_all, columns=X_all.select_dtypes(include = &#39;object&#39;).columns, drop_first=True)<br>X_train = X_all.loc[X_train.index]<br>X_test = X_all.loc[X_test.index]</pre><p>After we get the sets, the model can be defined and trained.</p><p>The predictions are the values that are used to fill the NAs.</p><pre>from sklearn.neighbors import KNeighborsClassifier<br><br>knn_classifier = KNeighborsClassifier(n_neighbors=3)<br><br>knn_classifier.fit(X_train, y_train)<br><br>predictions = knn_classifier.predict(X_test)<br><br># write the predictions back into the rows that were missing<br>df.loc[df_test.index, column_to_fill] = predictions</pre><p>Check the theoretical explanation of each solution shown here:</p><p><a href="https://anar-abiyev.medium.com/complete-guide-to-missing-value-imputation-1e75cb2a6dca">Complete Guide to Missing Value Imputation</a></p><h3>Clap and Follow for support!</h3><h3>Thank you for reading!</h3>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Complete Guide to Missing Value Imputation]]></title>
            <link>https://anar-abiyev.medium.com/complete-guide-to-missing-value-imputation-1e75cb2a6dca?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/1e75cb2a6dca</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[missing-values]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Sun, 04 Feb 2024 17:48:57 GMT</pubDate>
            <atom:updated>2024-02-04T17:50:13.106Z</atom:updated>
            <content:encoded><![CDATA[<h4>You will learn all the essential knowledge to deal with missing values in the dataset!</h4><p>In this blog, I will go through different scenarios of missing data problems and their solutions.</p><h4>You will know how to approach each case as a data scientist.</h4><p>After you read this guide, you can also check my blog about the Python implementation of the methods explained here!</p><p><a href="https://anar-abiyev.medium.com/missing-value-imputation-in-python-4e64d64ac43c">Missing Value Imputation in Python</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_E1T3h9dIWZZBQicQgtGAw.png" /></figure><p>Outline</p><ul><li>What is missing data?</li><li>What are the reasons for missing data?</li><li>Solutions.</li></ul><h3>What is missing data?</h3><p>In data science, missing data refers to <strong>the absence of values or information</strong> in a dataset. Dealing with missing data is a crucial aspect of the data cleaning and preprocessing stage, as it can impact the quality and accuracy of analyses and machine learning models. Missing data reduces the effective sample size, potentially reducing the power of statistical tests and the generalizability of models.</p><p>There are two general ways to deal with missing data:</p><ul><li><strong>Deletion.</strong> Removing rows or columns with missing values. 
This can lead to the loss of valuable information and may introduce bias.</li><li><strong>Imputation.</strong> Filling missing values with estimated or predicted values.</li></ul><p>In the next paragraphs, I will explain which <strong>methods are best</strong> for missing data imputation in different situations.</p><h3>What are the reasons for missing data?</h3><p>Prior to starting missing data imputation, it is good practice to analyze the reasons behind the missing data problem.</p><p>The way I prefer to do it is to use the <strong><em>missingno</em></strong> library in Python.</p><pre>import pandas as pd<br>import missingno as msno<br><br>df = pd.read_csv(&#39;sample_dataset.csv&#39;)<br>msno.matrix(df)</pre><p>The code produces a matrix like the one below. Here, the white lines are missing values. With such a tool, you can easily get a view of your dataset in terms of missing values.</p><figure><img alt="Missing Value Matrix with missingno Library." src="https://cdn-images-1.medium.com/max/1024/1*dVfMpt6K8z8VyL2AnYKT6g.png" /><figcaption>Fig 1. <strong>Missing Value Matrix with missingno Library.</strong></figcaption></figure><h4>Before proceeding to the code section, let’s go through the reasons that might cause missing data.</h4><p>There can be various reasons for missing data in a dataset. 
Understanding these reasons is crucial for handling missing data appropriately and making informed decisions in data analysis or modeling.</p><p>Here are some common reasons for missing data:</p><ul><li><strong>Non-response.</strong> Individuals or entities may choose not to respond to certain survey questions or provide specific information, leading to missing values.</li><li><strong>Instrumentation Issues.</strong> Problems with measurement instruments or data collection tools can lead to missing values.</li><li><strong>Technological Limitations.</strong> Technical constraints or limitations in data capture methods can result in missing data.</li><li><strong>Unavailability of Historical Data.</strong> In longitudinal studies or time-series data, historical records may be missing due to various reasons such as system upgrades, changes in data collection methods, or data storage issues.</li></ul><h3>Solutions</h3><p>In this section, I will go through various imputation strategies and explain which one to use in each scenario.</p><h4>Dropping rows</h4><p>This method involves removing entire rows that contain missing values from the dataset. It is simple and easy to implement, but it is only helpful when a small share of the rows (up to 10%) have missing values. Otherwise, it discards too much data.</p><p>For example, suppose your dataset has 10,000 rows and 50 of them have missing values. That is only 0.5% of the data, so you can drop those rows and continue with the remaining dataset.</p><h4>Dropping columns</h4><p>In the previous method, I talked about rows. However, if the missing values are concentrated in a single column, then you can drop that column.</p><p>If more than 60–70% of a column is missing, then you can make a case for dropping the entire column. 
Otherwise, if you try to fill the missing rows, most of the values in the column will be synthetic data and this can create bias.</p><h4>Mean, Median, and Mode methods.</h4><p>These are the most commonly used and most straightforward methods in missing data imputation.</p><ul><li><strong>Mean </strong>is the average of a numerical column. If there are some missing values in a numerical column, then you can use the mean of the column to fill them.</li><li><strong>Median </strong>is the middle value of a sorted numerical column.</li></ul><p><strong>Bonus Tip:</strong> If the data has many outliers, use median imputation; otherwise, use mean imputation.</p><ul><li><strong>Mode </strong>is the most frequent value of a column. This method can be used to fill categorical columns. Here, you have to pay attention to the balance between different classes, because filling with the mode makes the most frequent class even more dominant.</li></ul><h4>Divide and Conquer</h4><p>The dataset is divided into subsets based on observed variables, and imputation is performed separately on each subset. This method addresses missing data based on related subsets, potentially capturing more nuanced patterns, but it requires careful consideration of how to divide the data. Complexity increases with multiple variables.</p><p>Let’s see the example below.</p><p>You have “age” and “marriage status” columns and the latter has some missing values. Instead of filling all the missing values with the mode of the whole column, you can divide the rows based on age, because we can assume that people are more likely to be married as they get older.</p><p>So, you divide data into classes according to the age column: young, mature, old. 
After that, you fill in the missing values with the mode of each group separately.</p><h4>Random imputation / hot deck.</h4><p>Random imputation, also known as hot deck imputation, is a method for handling missing data by replacing missing values with randomly selected observed values from the same variable.</p><p>The term “hot deck” refers to a metaphorical deck of cards, where each card (or observation) is available to be selected to fill in the missing value.</p><ul><li>Identify the variables with missing values in the dataset.</li><li>Create a pool or deck of observed values from the variable containing missing values.</li><li>Randomly select values from the pool and use them to replace the missing values.</li></ul><p>Random imputation helps preserve the variability in the dataset by introducing randomness into the imputed values. It is a relatively simple method to implement, requiring minimal computational resources.</p><p><strong>But,</strong></p><p>Since values are selected randomly, there’s a possibility of imputing values that do not accurately represent the overall distribution or patterns in the data.</p><p><strong>So,</strong></p><p>Careful consideration should be given to how the imputation pool is created to ensure that it is representative of the variable’s distribution.</p><p>Random imputation is often more suitable for continuous variables rather than categorical ones.</p><p>To account for uncertainty introduced by randomness, multiple imputations can be performed, creating several datasets with different imputed values for each missing entry.</p><h4>Model-based methods.</h4><p>Model-based imputation is an advanced technique for handling missing data by using predictive models to estimate and impute missing values. Instead of relying solely on summary statistics like mean or median, this method leverages relationships within the dataset to make informed predictions. 
The choice of the model depends on the characteristics of the data and the relationships between variables.</p><p>Let’s see the example below.</p><p>There is a dataset in which one column has values in some rows and is missing in others.</p><figure><img alt="Dataset with missing and full values." src="https://cdn-images-1.medium.com/max/364/1*J7QFenI9OdeESCtFdRiW3A.png" /><figcaption>Fig 2. <strong>Dataset with missing and full values.</strong></figcaption></figure><p>In order to use a model-based approach to impute missing values, the strategy in the following image will be applied. The complete rows are used as the train set, while the rows with missing values form the test set. The model’s predictions are then used to fill in the missing values.</p><p>After the imputation process is done, the dataset will be divided into dependent and independent columns according to the task. But for the imputation itself, the dependent column has to be the one that has missing values.</p><figure><img alt="Train and Test set for Model Imputation." src="https://cdn-images-1.medium.com/max/488/1*idpCut8W4yY3q4PK_pjn6Q.png" /><figcaption>Fig 3. <strong>Train and Test set for Model Imputation.</strong></figcaption></figure><p>Model-based imputation takes into account relationships between variables, allowing for more accurate imputations compared to simple statistical measures. This method can capture non-linear relationships, making it suitable for datasets with complex patterns. However, model-based imputation can be computationally intensive, especially when using complex models or dealing with large datasets.</p><h4>Converting NA into a feature</h4><p>This method suits data collected through user forms. 
When a question can be either answered or left blank by the user, the blanks show up as missing data.</p><p>This column can be converted to a binary column with values of true and false; true when the user answers, false when the user does not answer the question.</p><p>Here, the assumption is that the user left the question blank for a reason, so the absence of an answer is itself a feature.</p><p>Check the Python implementation of each solution explained here:</p><p><a href="https://anar-abiyev.medium.com/missing-value-imputation-in-python-4e64d64ac43c">Missing Value Imputation in Python</a></p><h3>Clap and Follow for support!</h3><h3>Thank you for reading!</h3>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is Dropout Regularization method?]]></title>
            <link>https://ai.plainenglish.io/what-is-dropout-regularization-method-1eae267411ef?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/1eae267411ef</guid>
            <category><![CDATA[dropout-regularization]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[dropout]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Fri, 05 Jan 2024 06:17:26 GMT</pubDate>
            <atom:updated>2024-01-05T06:17:26.953Z</atom:updated>
            <content:encoded><![CDATA[<h4>Does dropout really work? See the results of the experiment with the CNN model and CIFAR10 dataset!</h4><p>In this article, you will learn about a regularization method called <strong>Dropout.</strong></p><p>This blog has two parts. In the first part, I will explain the <strong>idea behind the technique.</strong></p><p>In the second part, you will see the <strong>results of the experiment I have carried out.</strong></p><p>I have run the model 10 times and recorded the accuracies for each of four dropout settings:</p><ul><li>Without Dropout.</li><li>Dropout p = 0.1.</li><li>Dropout p = 0.3.</li><li>Dropout p = 0.5.</li></ul><h3>You will love to see the results!</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dJ-RV32O77idLJHsSDvgQg.png" /></figure><h3>Part 1.</h3><p>Dropout is a regularization technique commonly used in deep learning models <strong>to prevent overfitting.</strong> Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, to the extent that it performs poorly on test data.</p><p>Dropout helps address this issue by introducing <strong>randomness </strong>during training.</p><p>Dropout involves randomly “dropping out” (i.e., setting to zero) the outputs of a certain fraction of neurons in a layer during each forward and backward training pass.</p><p>This means that, during training, some neurons do not contribute to the computation. The dropout rate is a hyperparameter determining the fraction of neurons to drop out.</p><p>The idea behind dropout is to prevent the co-adaptation of neurons. When dropout is applied, the network cannot rely too heavily on any particular set of neurons because they may be turned off at any moment. This forces the network to learn more robust and generalized features from the data.</p><p>During the testing or inference phase, dropout is usually turned off, and all neurons are active. 
This ensures that the model utilizes the full capacity it has learned during training.</p><h4>Before moving to the second part, let’s set our <strong>expectations</strong> from the experiment.</h4><p>The dropout method is expected <strong>to lower training accuracy</strong> and <strong>raise the test accuracy.</strong></p><p>Because some neurons will be set to zero (definition of dropout) during the training phase, the model will not learn the training set as well as without dropout.</p><p>As regularization techniques aim to help the model generalize better, the test accuracy is expected to be higher with dropout.</p><h3>Part 2.</h3><h4>Dataset</h4><p>The experiment has been carried out using the CIFAR10 dataset. The dataset contains 50000 training and 10000 testing images of 32x32 pixels organized in 10 classes.</p><h4>Model</h4><p>The model is a CNN with two convolutional layers. The output of the convolutional layers is fed into a fully connected neural network with one hidden layer. The output layer has 10 neurons, one per class.</p><figure><img alt="Convolutional Layers Architecture for CIFAR10" src="https://cdn-images-1.medium.com/max/1024/1*lKki9DSFvucDc8YG-t_8Ng.png" /><figcaption>Fig 1. <strong>Convolutional Layers Architecture.</strong></figcaption></figure><figure><img alt="Fully connected Layers architecture." src="https://cdn-images-1.medium.com/max/528/1*AZ-jehnWa0wc2jKOiGwcng.png" /><figcaption>Fig 2. <strong>Fully connected Layers Architecture.</strong></figcaption></figure><h4>Experiment methodology</h4><p>I have run the model 10 times and recorded the accuracies for each of four dropout settings:</p><ul><li>Without Dropout.</li><li>Dropout p = 0.1.</li><li>Dropout p = 0.3.</li><li>Dropout p = 0.5.</li></ul><p>Each training run lasted 10 epochs. 
You will see both train and test accuracy for all cases, their averages, and different analyses.</p><h4>Results</h4><p>Firstly, let’s see the results in the table. The <strong>line graphs</strong> follow below.</p><p>The table has four sections:</p><ul><li><strong>no dropout.</strong></li><li><strong>p=0.1.</strong></li><li><strong>p=0.3.</strong></li><li><strong>p = 0.5.</strong></li></ul><p>Each section has train and test columns where you can find the corresponding accuracies for each model run.</p><figure><img alt="Model accuracies with and without dropout" src="https://cdn-images-1.medium.com/max/690/1*IJCLPU9zompwHhE7Wjyy1w.png" /><figcaption>Table 1. <strong>Experiment Results.</strong></figcaption></figure><p>I have plotted line graphs for the train and test sets separately for easier comparison.</p><p>Let’s start with the train set. We expected dropout to lower the train accuracy.</p><p>The experiment results confirm this expectation. The highest accuracy has been achieved by <strong>“No Dropout”</strong>, while the lowest accuracy is by <strong>“Dropout p = 0.5”.</strong></p><figure><img alt="Train Set Accuracies with and without Dropout." src="https://cdn-images-1.medium.com/max/876/1*dEN8fckGAFkNPEWSUgybIQ.png" /><figcaption>Fig 3. <strong>Train Set Accuracies with and without Dropout.</strong></figcaption></figure><p>If we look at the test set, we can see the reverse behavior. As the dropout rate increases, the test set accuracy increases.</p><figure><img alt="Test Set Accuracies with and without Dropout." src="https://cdn-images-1.medium.com/max/883/1*MjTrXGSItn8_7l-t3bgHzQ.png" /><figcaption>Fig 4. <strong>Test Set Accuracies with and without Dropout.</strong></figcaption></figure><h3>Summary</h3><p>Overall, in this blog, you learned what the dropout regularization technique is and observed the experiment results.</p><p>The experiment results confirmed our expectations of the dropout method. 
Train accuracy dropped, while test accuracy increased.</p><p>The Python code is available at the link below, so you can run it yourself and check the results.</p><p><a href="https://github.com/anarabiyev/Medium-Youtube/blob/master/4.%20Dropout.ipynb">Medium-Youtube/4. Dropout.ipynb at master · anarabiyev/Medium-Youtube</a></p><h3>Thank you for reading. Clap and follow if you learned anything new.</h3><hr><p><a href="https://ai.plainenglish.io/what-is-dropout-regularization-method-1eae267411ef">What is Dropout Regularization method?</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Convolutional Neural Network Terminology for Beginners]]></title>
            <link>https://ai.plainenglish.io/convolutional-neural-network-terminology-for-beginners-defbf51e8974?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/defbf51e8974</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[convolutional-neural-net]]></category>
            <category><![CDATA[convolutional-network]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Wed, 27 Dec 2023 06:16:54 GMT</pubDate>
            <atom:updated>2023-12-27T06:16:54.010Z</atom:updated>
            <content:encoded><![CDATA[<h4>Learn what kernel, stride, pooling, and many other terms mean for CNN. Easy explanation with images!</h4><p>The blog explains what each of the terms below means:</p><ul><li><strong>Kernel or Filter.</strong></li><li><strong>Channel.</strong></li><li><strong>Stride.</strong></li><li><strong>Pooling.</strong></li><li><strong>Padding.</strong></li><li><strong>Dropout.</strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3qnsphIR2nA_gldhRNcl-A.png" /></figure><h3>Kernel or Filter</h3><p>Kernels, or filters, are the core part of the convolution process. Each kernel can also be called an <strong>“information detector”.</strong> For example, the kernel on the left is used to detect vertical lines, whilst the one on the right is used for horizontal lines.</p><figure><img alt="Vertical and Horizontal Kernels." src="https://cdn-images-1.medium.com/max/1024/1*C3yO_nFFR-vvfLzOOtl32w.jpeg" /><figcaption>Fig 1. <a href="https://wttech.blog/blog/2022/edge-detection-and-processing-using-canny-edge-detector-and-hough-transform/"><strong>Vertical and Horizontal Kernels.</strong></a></figcaption></figure><p>If a kernel is convolved over the image, then the resulting layer shows where the pattern that the kernel detects appears in the original image.</p><p>I have tested an image with these kernels. Let’s see the results:</p><figure><img alt="Convolution of Vertical and Horizontal Kernels in Python." src="https://cdn-images-1.medium.com/max/1024/1*CtcAgIk-GDvTjFdN8L_Z0g.png" /><figcaption>Fig 2. <strong>Convolution of Vertical and Horizontal Kernels in Python.</strong></figcaption></figure><p>To sum up, a kernel or filter is a matrix that extracts information from an image with the help of a convolution operation.</p><h3>Channel</h3><p>A channel is a layer of the image. Initially, the image has one layer or three layers. 
<strong>One layer for grayscale, and three layers for RGB images.</strong></p><p>After a convolution layer, the image is separated into multiple layers with the help of kernels. <strong>Each new layer contains the result of one kernel.</strong> For example, if the vertical line kernel is used and the image has a lot of vertical lines, then its layer will have larger positive values.</p><p>In the example below, the input image has one layer; after the convolution operation, 6 new layers are produced.</p><figure><img alt="Convolution Layers." src="https://cdn-images-1.medium.com/max/585/1*sUxC0Xt_XRnHDUQx0m2K_w.png" /><figcaption>Fig 3. <a href="https://www.upgrad.com/blog/basic-cnn-architecture/"><strong>Convolution Layers.</strong></a></figcaption></figure><h3>Stride</h3><p>The stride determines <strong>the movement or step size of the kernel.</strong> If the stride is 1, the kernel moves as follows:</p><ul><li>one pixel right until the end of the image,</li><li>one pixel down,</li><li>one pixel right until the end of the image,</li><li>one pixel down,</li><li>and so on …</li></ul><p>The first image shows stride = 1, while the second image illustrates stride = 2.</p><figure><img alt="CNN Stride 1." src="https://cdn-images-1.medium.com/max/244/1*cFMF_uWgUFdVRMZAZ0Bfzg.gif" /><figcaption>Fig 4. <a href="https://hannibunny.github.io/mlbook/neuralnetworks/convolutionDemos.html"><strong>CNN Stride 1.</strong></a></figcaption></figure><figure><img alt="CNN Stride 2." src="https://cdn-images-1.medium.com/max/294/1*BMngs93_rm2_BpJFH2mS0Q.gif" /><figcaption>Fig 5. 
<a href="https://hannibunny.github.io/mlbook/neuralnetworks/convolutionDemos.html"><strong>CNN Stride 2.</strong></a></figcaption></figure><h3>Pooling</h3><p>Pooling is a method to <strong>reduce the size of an image.</strong> The most frequently used pooling methods are <strong>average</strong> and <strong>max pooling.</strong></p><p>As the animation shows, for max pooling with a 2x2 window and stride 2, only the maximum value of each group of four pixels is used in the resultant image. In other words, the maximum pixel represents the four pixels in the new image.</p><figure><img alt="CNN Maxpool." src="https://cdn-images-1.medium.com/max/728/1*WvHC5bKyrHa7Wm3ca-pXtg.gif" /><figcaption>Fig 6. <a href="https://nico-curti.github.io/NumPyNet/NumPyNet/layers/maxpool_layer.html"><strong>CNN Maxpool.</strong></a></figcaption></figure><p>In this way, the width and the height of the image are each halved.</p><p>The goal of pooling is to reduce the number of pixels, which gives a lighter neural network with fewer parameters in the later layers and helps prevent overfitting.</p><h3>Padding</h3><p>Padding means adding <strong>an extra layer of zeros</strong> around the image.</p><p>The primary purpose of padding is to preserve spatial information at the edges of the input and to prevent a reduction in the spatial dimensions of the feature maps. This is crucial in maintaining accurate boundary information and preventing the loss of important details during the convolutional process.</p><p>The animation below illustrates one layer of padding.</p><figure><img alt="CNN Padding" src="https://cdn-images-1.medium.com/max/395/1*1okwhewf5KCtIPaFib4XaA.gif" /><figcaption>Fig 7. 
<a href="https://hannibunny.github.io/mlbook/neuralnetworks/convolutionDemos.html"><strong>Padding.</strong></a></figcaption></figure><p>Padding also ensures that the convolution operation is applied more uniformly across the entire input: without it, pixels at the edges are covered by fewer kernel positions than pixels in the center, and the feature maps shrink with every layer.</p><p>CNN padding plays a vital role in enhancing the performance and effectiveness of convolutional neural networks by addressing edge-related challenges and preserving spatial information during feature extraction.</p><h3>Dropout</h3><p>Dropout is a technique applied during the training phase. In dropout, a random fraction of the neuron outputs (for example, 10%) is <strong>replaced by zero</strong> at each training step.</p><p>By doing so, the model generalizes better and is less likely to overfit.</p><p>Thank you for reading! If I added value to your learning, please don’t forget <strong>to clap and follow!</strong></p><h3>P.S.</h3><p>The images without a reference belong to the author, while images from other sources are credited through the reference links on the image names.</p><h3><a href="https://plainenglish.io/">PlainEnglish.io</a> 🚀</h3><p><em>Thank you for being a part of the In Plain English community! 
Before you go:</em></p><ul><li><em>Be sure to </em><strong><em>clap</em></strong><em> and </em><strong><em>follow</em></strong><em> the writer</em><strong>️</strong></li><li><em>Learn how you can also </em><a href="https://plainenglish.io/blog/how-to-write-for-in-plain-english"><strong><em>write for In Plain English</em></strong></a><em>️</em></li><li><em>Follow us: </em><a href="https://twitter.com/inPlainEngHQ"><strong><em>X</em></strong></a><strong><em> | </em></strong><a href="https://www.linkedin.com/company/inplainenglish/"><strong><em>LinkedIn</em></strong></a><strong><em> | </em></strong><a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong><em>YouTube</em></strong></a><strong><em> | </em></strong><a href="https://discord.gg/in-plain-english-709094664682340443"><strong><em>Discord</em></strong></a><strong><em> | </em></strong><a href="https://newsletter.plainenglish.io/"><strong><em>Newsletter</em></strong></a></li><li><em>Visit our other platforms: </em><a href="https://stackademic.com/"><strong><em>Stackademic</em></strong></a><strong><em> | </em></strong><a href="https://cofeed.app/"><strong><em>CoFeed</em></strong></a><strong><em> | </em></strong><a href="https://venturemagazine.net/"><strong><em>Venture</em></strong></a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=defbf51e8974" width="1" height="1" alt=""><hr><p><a href="https://ai.plainenglish.io/convolutional-neural-network-terminology-for-beginners-defbf51e8974">Convolutional Neural Network Terminology for Beginners</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Neural Network Terminology for Beginners]]></title>
            <link>https://ai.plainenglish.io/neural-network-terminology-for-beginners-781b653c4d1b?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/781b653c4d1b</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[terminology]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Sun, 24 Dec 2023 13:37:23 GMT</pubDate>
            <atom:updated>2023-12-24T13:37:23.350Z</atom:updated>
            <content:encoded><![CDATA[<h4>Learn what neurons, layers, weights, biases, activation functions, epochs, forward &amp; backward propagation, and other terms mean in deep learning!</h4><p>The blog explains what each of the terms below means:</p><ul><li><strong>Neuron</strong></li><li><strong>Layer</strong></li><li><strong>Weight &amp; Bias</strong></li><li><strong>Activation Function</strong></li><li><strong>Forward &amp; Backward Propagation</strong></li><li><strong>Epoch</strong></li><li><strong>Batch &amp; Batch size</strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xFTbKdkYlA8Cg1Atf7F92w.png" /></figure><h3>Neuron</h3><p>The image below illustrates a simple neural network. Every yellow circle you see in the image is <strong>a neuron.</strong> In other words, every node in the architecture is a neuron.</p><figure><img alt="The simple architecture of a Neural Network." src="https://cdn-images-1.medium.com/max/478/1*2rzqP3vUtbWGIEgGdtdnyA.png" /><figcaption>Fig 1. <strong>The simple architecture of a Neural Network.</strong></figcaption></figure><p>Every node holds a value. A group of nodes forms a layer.</p><h3>Layer</h3><p>A layer is a group of nodes. There are three types of layers:</p><ul><li>Input layer.</li><li>Hidden layer.</li><li>Output layer.</li></ul><p>The first layer is the <strong>input layer</strong>, while the last one is the <strong>output layer</strong>. 
The other layers between them are called <strong>hidden layers.</strong></p><p>I have a blog post that explains the purpose of each layer in simple terms; check it out at the link below before continuing:</p><p><a href="https://abiyevanar.medium.com/neural-network-layers-explained-for-beginners-bd8603c3dd5f">Neural Network Layers Explained for Beginners</a></p><h3>Weight &amp; Bias</h3><p>In a neural network, the basic calculation is to multiply each neuron value by <strong>its weight</strong>, sum the results, and add <strong>the bias.</strong></p><p>Weights are illustrated on the lines that connect neurons. In the example below, <strong>0.54</strong> and <strong>0.48</strong> are the values of neurons in the input layer. There is also <strong>a bias</strong> with the value of <strong>0.06.</strong> <strong>0.2</strong> and <strong>0.1</strong> are <strong>the weights</strong> of the lines connecting the neurons of the input layer with the first neuron of the hidden layer.</p><figure><img alt="Weights and Biases." src="https://cdn-images-1.medium.com/max/516/1*V27fQSAc2u9ep2RHOGAg2A.png" /><figcaption>Fig 2. <strong>Weights and Biases.</strong></figcaption></figure><p>The calculation is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/412/1*9wFMvtuTlh6Lbsz9gswbYg.png" /></figure><h3>Activation Function</h3><p>In practice, calculating the value of a neuron involves one more step, which is <strong>the activation function.</strong></p><p>Continuing with our example, the value of 0.216 is not directly assigned to the neuron. Before that, an activation function takes 0.216 as the input, and its output is assigned to the neuron. 
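</p><p>As a rough sketch, the weighted sum and the activation step can be reproduced in a few lines of plain Python. The numbers come from the example above; the variable names are illustrative assumptions, and sigmoid is used only as one possible activation function:</p><pre># A minimal sketch; variable names are illustrative assumptions<br>import math<br><br># Values taken from the example above<br>inputs = [0.54, 0.48]<br>weights = [0.2, 0.1]<br>bias = 0.06<br><br># Weighted sum: multiply each input by its weight, add the results, then add the bias<br>weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias<br><br># Sigmoid (one possible activation) squashes the sum into the range (0, 1)<br>neuron_value = 1 / (1 + math.exp(-weighted_sum))<br><br>print(round(weighted_sum, 3))  # 0.216<br>print(round(neuron_value, 3))  # about 0.554</pre><p>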
For example, if the activation function is the sigmoid function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/415/1*vBZzvnCpGImk7GH9k31Fqg.png" /></figure><p>I have written a complete guide to activation functions; check it out at the link below:</p><p><a href="https://ai.plainenglish.io/complete-guide-to-activation-functions-in-deep-learning-fb65aca121f9">Complete Guide to Activation Functions in Deep Learning</a></p><h3>Forward &amp; Backward Propagation</h3><p>In a neural network, calculations run in two directions:</p><h4>Forward</h4><p>In the forward direction, input data is fed into the neural network. This data travels through the network layer by layer, where each layer consists of neurons connected by weighted edges.</p><p>At each node, the weighted sum of inputs is computed, and an activation function is applied to introduce non-linearity to the model. This transformed output becomes the input for the next layer.</p><p>The final layer produces the network’s output, which is compared to the desired or target output. This comparison helps evaluate the performance of the network and determine the error.</p><h4>Backward</h4><p>In the backward direction, the calculated error (the difference between the predicted and target outputs) is propagated backward through the network.</p><p>The key objective is to minimize the error. This is achieved by adjusting the weights and biases of the connections between neurons. 
The adjustments are proportional to the gradient of the error with respect to the weights and biases.</p><p>Backpropagation computes these gradients, and an optimization algorithm such as gradient descent uses them to iteratively update the weights and biases, moving the network towards a configuration that reduces the overall error.</p><h3>Epoch</h3><p>The forward and backward processes are repeated over many iterations until the neural network converges to a state where the error is minimized and the model performs well on the training data.</p><p>One full pass over the entire training dataset is called an epoch. The number of epochs is set before training according to the available resources, but the progress is observed closely. If the accuracy of the model stops improving while some epochs are still left, the training process can be stopped early.</p><h3>Batch &amp; Batch size</h3><p>The dataset is divided into several parts, called batches, before it is fed to the neural network. The batch size determines how many data points each batch will have. If the batch size is 32, then each batch will have 32 data points (except possibly the last one).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=781b653c4d1b" width="1" height="1" alt=""><hr><p><a href="https://ai.plainenglish.io/neural-network-terminology-for-beginners-781b653c4d1b">Neural Network Terminology for Beginners</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Neural Network Layers Explained for Beginners]]></title>
            <link>https://anar-abiyev.medium.com/neural-network-layers-explained-for-beginners-bd8603c3dd5f?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/bd8603c3dd5f</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[beginner]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[neural-networks]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Sat, 23 Dec 2023 15:52:56 GMT</pubDate>
            <atom:updated>2023-12-23T15:52:56.348Z</atom:updated>
            <content:encoded><![CDATA[<h4>How to know the number of layers and neurons in a Neural Network.</h4><p>In a Neural Network, there are <strong>three types of layers:</strong></p><ul><li><strong>Input</strong></li><li><strong>Hidden</strong></li><li><strong>Output</strong></li></ul><p>I will explain what they are and how many neurons each should have.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jOF0uDQ845ckvSgTfDZlsw.png" /></figure><h3>Input Layer</h3><p>The input layer of your neural network <strong>depends on the dataset</strong> you are going to use for the task. For example, if the dataset consists of 28x28 pixel images, then your input layer needs to have 784 (28x28) neurons. Each pixel value will correspond to a neuron in the input layer.</p><p>For the input layer, you need to <strong>analyze the dataset</strong> and see how many neurons you need to feed that data into the model.</p><h3>Hidden Layer</h3><p>The number of hidden layers in a neural network is a hyperparameter.</p><p>There is no rule saying that you need two hidden layers for one task or three hidden layers for another.</p><p>It is determined by trial and error.</p><p>However, <strong>some guidelines</strong> will help you find the answer more efficiently.</p><ul><li>Start with a simple architecture and increase complexity gradually.</li><li>If the dataset is more complex, more hidden layers may help.</li><li>Consider domain knowledge: if there is a known solution to a similar problem, refer to its architecture.</li></ul><h3>Output Layer</h3><p>The output layer of a neural network <strong>depends on the task.</strong> If it is a regression problem that predicts a single value, one neuron is enough. 
In a classification problem, on the other hand, the number of neurons is determined by the number of classes.</p><p>For example, when you predict which digit a picture shows, an output layer with 10 neurons is used, one neuron for the probability of each digit.</p><p>Thank you for reading! Don’t forget to check this tutorial to learn about <strong>Activation Functions.</strong></p><p><a href="https://ai.plainenglish.io/complete-guide-to-activation-functions-in-deep-learning-fb65aca121f9">Complete Guide to Activation Functions in Deep Learning</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bd8603c3dd5f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Standardization and Normalization — Clearly Explained!]]></title>
            <link>https://anar-abiyev.medium.com/standardization-and-normalization-clearly-explained-984db5e778f6?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/984db5e778f6</guid>
            <category><![CDATA[standardization]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[normalization]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Thu, 21 Dec 2023 15:18:23 GMT</pubDate>
            <atom:updated>2023-12-21T15:18:23.320Z</atom:updated>
            <content:encoded><![CDATA[<h3>Standardization and Normalization, Feature Scaling: Clearly Explained!</h3><h4>This story will answer all your questions about standardization and normalization, so you will never need to search for this topic again!</h4><p>You probably have many questions about <strong>standardization</strong> and <strong>normalization</strong>, and you have already searched through many articles and watched some videos on YouTube.</p><blockquote>After reading this blog until the end, I assure you that you will <strong>never search for</strong> standardization or normalization again!</blockquote><p>In this blog, you will learn:</p><ul><li><strong>What</strong> is <strong>feature scaling</strong> and <strong>why</strong> do we need it?</li><li><strong>Which models</strong> need scaling and which ones don’t?</li><li>What is <strong>normalization</strong>?</li><li>What is <strong>standardization</strong>?</li><li><strong>When to use</strong> normalization or standardization?</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JMk9Rymp0sbUe6M9DLO1-Q.png" /></figure><p>Firstly, let’s state that both normalization and standardization are types of <strong>feature scaling.</strong> They have different formulas and use cases, but both are used to change the scale of data.</p><h3>What is feature scaling?</h3><p>Let’s say you have a column in your dataset that looks like the histogram on the left. The values in the column range from roughly 20 to 65. If we want to change the scale of the column, all we need to do is <strong>divide</strong> the values by some constant, for example, <em>2</em>. The histogram on the right shows the distribution after this scaling.</p><figure><img alt="Feature Scaling by dividing with a constant." 
src="https://cdn-images-1.medium.com/max/1024/1*llZUhMyTnPs8_WFiNkN9ew.png" /><figcaption>Fig 1. <strong>Feature Scaling by dividing with a constant.</strong></figcaption></figure><p>Another method of scaling the data is by <strong>subtracting</strong> a constant. For instance, if you want the data to start from zero, you can subtract 20 from the column values.</p><figure><img alt="Feature Scaling by subtracting a constant." src="https://cdn-images-1.medium.com/max/828/1*9l7pYccDtsc_ZQA5hfvsRQ.png" /><figcaption>Fig 2. <strong>Feature Scaling by subtracting a constant.</strong></figcaption></figure><p>Note that multiplication and addition can be used as well, but the usual goal is to bring the values close to zero, so subtraction and division are applied.</p><h3>Why apply Feature Scaling?</h3><p>Check out the dataset below: all three columns have different scales. If you feed this dataset as it is, the model may give <strong>more importance</strong> to the column with the <strong>higher values</strong>, the “income” column in this example. However, we want the model to treat <strong>each column equally</strong> and learn the corresponding weights through optimization, not because of the scales.</p><figure><img alt="Example dataset with different scales." src="https://cdn-images-1.medium.com/max/350/1*uslbWF5gxXYrcThw-DW-Xg.png" /><figcaption>Fig 3. <strong>Example dataset with different scales.</strong></figcaption></figure><p>The <strong>models</strong> I will mention below are the ones that <strong>benefit from feature scaling the most:</strong></p><ul><li><strong>Gradient-based optimization algorithms.</strong> Models that use gradient descent for optimization, such as linear regression, logistic regression, and neural networks. 
Scaling helps them converge faster.</li><li><strong>Distance-based algorithms.</strong> Models that use distances between data points, such as k-Nearest Neighbors (KNN) and Support Vector Machines (SVM), can benefit from feature scaling because it ensures that all features contribute equally to the distance computation.</li><li><strong>PCA (Principal Component Analysis).</strong> PCA is a dimensionality reduction technique that involves finding the principal components of the data. Feature scaling is important for PCA because it ensures that all features have equal importance in determining principal components.</li></ul><p>The models that <strong>do not benefit from feature scaling</strong> are the ones that rely not on the magnitudes of the numerical values themselves but on how these values compare:</p><ul><li><strong>Tree-based models.</strong> Decision trees, Random Forests, and Gradient Boosted Trees. These models make decisions based on feature thresholds and are invariant to monotonic transformations of the features.</li><li><strong>Naive Bayes.</strong> Naive Bayes classifiers are probabilistic models that assume independence between features given the class. They are generally not sensitive to the scale of individual features.</li></ul><h3>What is Normalization?</h3><p>Normalization rescales the data into the range between 0 and 1. It is done with the following formula:</p><figure><img alt="Equation of Normalization." src="https://cdn-images-1.medium.com/max/408/1*kTpARcHihholw1T47-9xiw.png" /><figcaption>Formula 1. <strong>Equation of Normalization.</strong></figcaption></figure><p>Let’s apply normalization to the sample data we plotted above:</p><figure><img alt="Normalized data." src="https://cdn-images-1.medium.com/max/828/1*66SXbLxBgpCvNq5GjwDKeQ.png" /><figcaption>Fig 4. 
<strong>Normalized data.</strong></figcaption></figure><p>As you can see, the shape of the histogram remains the same, but the range has been changed to 0–1.</p><p>This is all the theoretical background needed for normalization: changing the scale of the dataset to the range between zero and one. Let’s move to standardization.</p><h3>What is Standardization?</h3><p>The <strong>purpose</strong> of standardization is the same as that of normalization: changing the scale of the data. However, it achieves this by a different method. Instead of mapping the data into a fixed range, it changes <strong>the mean</strong> and <strong>variance</strong> of the data.</p><p>It may seem complicated, but I will explain all the terms one by one.</p><p>Let’s continue with the <strong>formula</strong> to have a clear view of what it means “to apply standardization”:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/274/1*vSIM03hdodlkwxWLQe4chg.png" /><figcaption>Formula 2. <strong>Equation of Standardization.</strong></figcaption></figure><p>Here,</p><ul><li>µ is the mean, which is the average of the data.</li><li>σ is the standard deviation.</li></ul><p>Simply put, <strong>the mean of the data is subtracted, and the result is divided by the standard deviation.</strong> After this operation, <strong>the mean of the resultant data will be equal to zero and the variance to one.</strong></p><p>The best way to observe this is with two-dimensional data. See how the values and separation of points have changed when standardization is applied. The values are now centered around zero, so the mean is zero, and the spread of the values has been rescaled so that the variance is one.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/1*A_urC1-PKTPZkofGkhQBXA.png" /><figcaption>Fig 5. <strong>Data points before standardization.</strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/1*zjsZ5c6WXJBNB9tj7FxbVg.png" /><figcaption>Fig 6. 
<strong>Data points after standardization.</strong></figcaption></figure><p>Now that you have an intuition for what changing the mean and variance looks like, let’s see our example data after standardization:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/530/1*tJBchfpW3dnYnSr_voEYTA.png" /><figcaption>Fig 7. <strong>Standardized data.</strong></figcaption></figure><p><strong>An important point to underline here:</strong> there is a <strong>misconception</strong> that standardization turns the distribution of the data into a <strong>normal distribution.</strong> However, <strong>this is a wrong conclusion</strong> about standardization. Yes, the mean and the variance are equal to 0 and 1 respectively in both the standard normal distribution and the result of standardization, but this does not mean that the distribution of the data becomes a normal distribution. The shape of the distribution stays the same. You can observe this in our example as well.</p><h3>When to use normalization or standardization?</h3><p>In general, the best approach is to try both methods and see which result is better.</p><p>If we dive more into the use cases:</p><ul><li><strong>Normalization</strong> is often preferred for neural networks, especially when working with images, where the pixel values are scaled from the 0–255 range to the 0–1 range.</li><li><strong>Standardization</strong> is preferred when there are outliers in the data, because outliers can negatively affect normalization by squeezing the other values into a narrow part of the range.</li></ul><p>You can check out the source code at the link below.</p><p><a href="https://github.com/anarabiyev/Medium-Youtube/blob/master/Standardization_Normalization.ipynb">Medium-Youtube/Standardization_Normalization.ipynb at master · anarabiyev/Medium-Youtube</a></p><h4>Thank you for reading! I hope I added value to your journey in mastering data science / AI. 
If so, do not forget to clap and follow!</h4><p>Check out the latest story about <strong>Activation Functions </strong>as well:</p><p><a href="https://ai.plainenglish.io/complete-guide-to-activation-functions-in-deep-learning-fb65aca121f9">Complete Guide to Activation Functions in Deep Learning</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=984db5e778f6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Use Optuna? Step-by-step Beginner Guide for Hyperparameter Tuning!]]></title>
            <link>https://python.plainenglish.io/how-to-use-optune-step-by-step-beginner-guide-for-hyperparameter-tuning-b07650168f0a?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b07650168f0a</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[optuna]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[hyperparameter-tuning]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Thu, 21 Dec 2023 10:16:52 GMT</pubDate>
            <atom:updated>2023-12-21T10:16:52.043Z</atom:updated>
            <content:encoded><![CDATA[<h4>Learn how to use Optuna for hyperparameter tuning. This is a complete step-by-step guide for beginners.</h4><p>When I was searching for tutorials about Optuna, <strong>I could not find an easy-to-understand, step-by-step guide.</strong> I decided to write this blog to help anyone who wants to learn Optuna from scratch.</p><p>According to their GitHub page, <strong>Optuna</strong> is an automatic <strong>hyperparameter optimization software framework</strong>, particularly designed for machine learning.</p><p>In this blog post, you will learn how to use Optuna in your projects!</p><h4>Without further ado, let’s get started!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8GB0FTo6UJxfZgyPNTf7aA.png" /></figure><h3>Step 1.</h3><p>As you might already know, hyperparameter tuning means running the model with different parameter combinations to find <strong>the optimal set of parameters.</strong></p><p>Thus, we need a metric to compare different models; it can be an accuracy metric or a loss metric. 
If it is an accuracy metric, then we will select the model with the highest result; otherwise, we will choose the model with the lowest loss.</p><p>The first step is to have a model with a metric to measure its success.</p><p>The example below is a simple <strong>sklearn Random Forest Regression</strong> model which will be used to show how to apply Optuna.</p><pre># Import libraries<br>import pandas as pd<br><br>from sklearn.model_selection import train_test_split<br>from sklearn.ensemble import RandomForestRegressor<br>from sklearn.metrics import mean_absolute_error<br><br>import optuna<br><br># Data Preparation<br>df = pd.read_csv(&#39;optuna_dataset.csv&#39;)<br>df = pd.get_dummies(df, columns=df.select_dtypes(include = &#39;object&#39;).columns, drop_first=True)<br><br>X = df.drop(&#39;charges&#39;, axis = 1)<br>y = df.charges<br><br>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)<br><br># Modelling<br>rf_reg = RandomForestRegressor().fit(X_train, y_train)<br>y_pred = rf_reg.predict(X_test)<br>print(mean_absolute_error(y_test, y_pred))</pre><h3>Step 1.5.</h3><p>This section is to build <strong>an intuition</strong> for the second step.</p><p>As it might be confusing to jump into the code directly, I have prepared a diagram that will help you understand the working principle of Optuna.</p><figure><img alt="Diagram of Optuna workflow." src="https://cdn-images-1.medium.com/max/947/1*fr0xgcj1hnuFvFVfVcVH0w.png" /><figcaption>Fig 1. <strong>Diagram of Optuna workflow.</strong></figcaption></figure><p>The tuning process starts with telling Optuna <strong>which parameters to try.</strong> Sklearn models have many parameters, which can be found in the sklearn documentation. 
The link below leads to the documentation for Random Forest Regressor.</p><p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">RandomForestRegressor</a></p><p>After the parameter suggestions are set and ready, Optuna will select a parameter combination for each trial (using the TPE sampler, its default). Then for each parameter combination, a new model is trained, and an error is calculated.</p><p>In the final step, the parameter combination that produced the lowest error will be selected as <strong>the best parameters.</strong></p><h3>Step 2.</h3><p>Now let’s move to <strong>Python </strong>and see how all of this works in code.</p><p>To use Optuna, you define a function called<em> </em><strong><em>“objective”</em></strong> with one parameter named<strong><em> “trial”</em></strong>. As shown in the diagram below, the function contains <strong>parameter suggestions</strong> and the <strong>model</strong>, and <strong>returns the error</strong> (or accuracy, depending on how you define it).</p><p>It is pretty straightforward: you suggest ranges for some parameters of the model, and Optuna tests them and gives you the best parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*Pbcf7-liCfV9x6sgzA_jRw.png" /><figcaption>Fig 2. <strong>Diagram of “Objective” function.</strong></figcaption></figure><p>To suggest parameters, the Optuna framework provides some options. In our example, we only need suggest_int, because all four tuned parameters are integers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/1*NQEARPBG8OvknPUkrGzRUA.png" /><figcaption>Fig 3. <strong>Suggest functions of Optuna.</strong></figcaption></figure><p>Now, let’s move to the code in Python.</p><p>Firstly, <strong>the suggestions</strong> are defined with the help of the functions shown above. 
Note that the name of each parameter is specified as a string inside the “suggest_***” functions, and it should be the same as shown in the documentation of the model so that the results map directly back to the model’s arguments.</p><p>The second section is <strong>to define the model</strong>. Here, you pass each parameter you want to tune to the model and set it equal to the corresponding variable defined in the previous section.</p><p>The third section has nothing special or new: it fits the model and calculates the error.</p><p><strong>In the end, the error is returned.</strong></p><pre>def objective(trial):<br>    #1 Define hyperparameters to be tuned<br>    n_estimators = trial.suggest_int(&#39;n_estimators&#39;, 90, 110)<br>    max_depth = trial.suggest_int(&#39;max_depth&#39;, 5, 30)<br>    min_samples_split = trial.suggest_int(&#39;min_samples_split&#39;, 2, 6)<br>    min_samples_leaf = trial.suggest_int(&#39;min_samples_leaf&#39;, 1, 6)<br><br>    #2 Create a Random Forest Regressor with the suggested hyperparameters<br>    rf = RandomForestRegressor(<br>        n_estimators=n_estimators,<br>        max_depth=max_depth,<br>        min_samples_split=min_samples_split,<br>        min_samples_leaf=min_samples_leaf,<br>        random_state=42<br>    )<br><br>    #3 Fit the model and calculate error<br>    rf.fit(X_train, y_train)<br>    y_pred = rf.predict(X_test)<br>    mae = mean_absolute_error(y_test, y_pred)<br><br>    return mae</pre><p>To summarize, you need to:</p><ul><li>build a model without Optuna as usual.</li><li>determine which parameters you want to tune.</li><li>define the “objective” function.</li><li>add suggestions for the parameters you want to tune.</li><li>define how to measure error.</li></ul><h3>Step 3.</h3><p>After defining the “objective” function, you need to create a study for Optuna and use the code below to run the optimization.</p><p>The important point here is <strong>direction</strong>. 
You have to choose <strong><em>“minimize”</em></strong> or <strong><em>“maximize”</em></strong>:</p><ul><li>if you defined an error to be returned in the <strong><em>“objective” </em></strong>function, then you need to use <strong><em>“minimize”</em></strong>.</li><li>if you defined accuracy to be returned, then you need to use <strong><em>“maximize”</em></strong>.</li></ul><p>As we return MAE (mean absolute error), the direction will be <strong><em>“minimize”</em></strong>.</p><pre>study = optuna.create_study(direction=&#39;minimize&#39;)<br>study.optimize(objective, n_trials=200)</pre><p>After running the code above, the study holds <strong>the best parameters</strong> found across all trials.</p><p>With the code below, you can print the best parameters and use them to fit the final model. Then you can test it with predictions on the test set.</p><pre>best_params = study.best_params<br>print(&quot;Best Hyperparameters:&quot;, best_params)<br><br># Train the final model with the best hyperparameters<br>best_rf = RandomForestRegressor(<br>    n_estimators=best_params[&#39;n_estimators&#39;],<br>    min_samples_split=best_params[&#39;min_samples_split&#39;],<br>    min_samples_leaf=best_params[&#39;min_samples_leaf&#39;],<br>    max_depth=best_params[&#39;max_depth&#39;],<br>    random_state=42<br>)<br>best_rf.fit(X_train, y_train)<br><br>y_pred_best = best_rf.predict(X_test)<br>print(mean_absolute_error(y_test, y_pred_best))</pre><p>You can find the whole code and dataset at the links below:</p><ul><li><a href="https://github.com/anarabiyev/Medium-Youtube/blob/master/3.%20Optuna.ipynb">Medium-Youtube/3. Optuna.ipynb at master · anarabiyev/Medium-Youtube</a></li><li><a href="https://github.com/anarabiyev/Medium-Youtube/blob/master/3.%20optuna_dataset.csv">Medium-Youtube/3. 
optuna_dataset.csv at master · anarabiyev/Medium-Youtube</a></li></ul><h3>Thank you for reading, if I added some value to your learning, don’t forget to clap and follow!</h3><h3><a href="https://plainenglish.io/">PlainEnglish.io</a> 🚀</h3><p><em>Thank you for being a part of the In Plain English community! Before you go:</em></p><ul><li><em>Be sure to </em><strong><em>clap</em></strong><em> and </em><strong><em>follow</em></strong><em> the writer</em><strong>️</strong></li><li><em>Learn how you can also </em><a href="https://plainenglish.io/blog/how-to-write-for-in-plain-english"><strong><em>write for In Plain English</em></strong></a><em>️</em></li><li><em>Follow us: </em><a href="https://twitter.com/inPlainEngHQ"><strong><em>X</em></strong></a><strong><em> | </em></strong><a href="https://www.linkedin.com/company/inplainenglish/"><strong><em>LinkedIn</em></strong></a><strong><em> | </em></strong><a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong><em>YouTube</em></strong></a><strong><em> | </em></strong><a href="https://discord.gg/in-plain-english-709094664682340443"><strong><em>Discord</em></strong></a><strong><em> | </em></strong><a href="https://newsletter.plainenglish.io/"><strong><em>Newsletter</em></strong></a></li><li><em>Visit our other platforms: </em><a href="https://stackademic.com/"><strong><em>Stackademic</em></strong></a><strong><em> | </em></strong><a href="https://cofeed.app/"><strong><em>CoFeed</em></strong></a><strong><em> | </em></strong><a href="https://venturemagazine.net/"><strong><em>Venture</em></strong></a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b07650168f0a" width="1" height="1" alt=""><hr><p><a href="https://python.plainenglish.io/how-to-use-optune-step-by-step-beginner-guide-for-hyperparameter-tuning-b07650168f0a">How to Use Optune? 
Step-by-step Beginner Guide for Hyperparameter Tuning!</a> was originally published in <a href="https://python.plainenglish.io">Python in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Complete Guide to Activation Functions in Deep Learning]]></title>
            <link>https://ai.plainenglish.io/complete-guide-to-activation-functions-in-deep-learning-fb65aca121f9?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/fb65aca121f9</guid>
            <category><![CDATA[tutorial]]></category>
            <category><![CDATA[activation-functions]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Sun, 17 Dec 2023 01:31:46 GMT</pubDate>
            <atom:updated>2023-12-18T21:47:35.384Z</atom:updated>
            <content:encoded><![CDATA[<h4>This post will answer your questions about activation functions: <strong>why </strong>we need them, <strong>what </strong>they are, and <strong>which </strong>one to use!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pqLynNnq0QfIDjXxjhZwrw.png" /></figure><p>An activation function is the last step before you assign the value to the neuron in the Neural Network. After the values of the previous layer’s neurons are multiplied by the corresponding weights, the results are summed up (together with a bias) and fed into <strong>the activation function.</strong> The output of the activation function is assigned to the current neuron.</p><p>But <strong>why don’t we just use the sum itself</strong> instead of increasing the computation cost with activation functions? The short answer is that activation functions make neural networks capable of learning <strong>non-linear features</strong> of the dataset.</p><h4>Let’s break down what this means.</h4><p>When an activation function is present, the z value, which is calculated from the neurons of the previous layer, the weights, and the bias, is fed into the activation function <strong><em>f.</em></strong></p><figure><img alt="Formula of a neuron calculation in neural network" src="https://cdn-images-1.medium.com/max/384/1*fLYCQYA-wBbml6JQpPXkrQ.png" /><figcaption>Formula 1. <strong>Calculation of a neuron value.</strong></figcaption></figure><figure><img alt="Graph of neural network calculation" src="https://cdn-images-1.medium.com/max/586/1*xzVBwwv9L15CHLJE9iAJdQ.png" /><figcaption>Fig 1. <strong>Calculation of a neuron value.</strong></figcaption></figure><p>If we don’t use an activation function, the formula is identical to <strong>linear regression, </strong>but we aim to build a model more powerful than linear regression. That is why neural networks with activation functions are much more powerful than linear models such as linear regression. 
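</p><p>To make this concrete, here is a minimal sketch in plain Python. The weights, biases, and function names are made up for illustration and do not come from any library. It stacks two layers without an activation function and checks that the result equals a single linear layer, then puts a ReLU (covered later in this post) between the same two layers.</p><pre>
```python
# Each "layer" here is a single neuron: w * x + b (illustrative values only).
def linear(x, w, b):
    return w * x + b


def two_layers_no_activation(x):
    hidden = linear(x, w=2.0, b=1.0)       # layer 1
    return linear(hidden, w=3.0, b=-4.0)   # layer 2


def single_equivalent_layer(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1, which is still one straight line.
    return linear(x, w=6.0, b=-1.0)


def two_layers_with_relu(x):
    hidden = max(0.0, linear(x, w=2.0, b=1.0))  # ReLU between the layers
    return linear(hidden, w=3.0, b=-4.0)


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
stacked = [two_layers_no_activation(x) for x in inputs]
collapsed = [single_equivalent_layer(x) for x in inputs]
with_relu = [two_layers_with_relu(x) for x in inputs]

print(stacked == collapsed)  # True: the two linear layers collapse into one
print(with_relu)             # [-4.0, -4.0, -1.0, 5.0, 11.0]
```
</pre><p>With the ReLU in between, the outputs no longer lie on one straight line, which is the extra expressive power an activation function adds.</p><p>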
Activation functions <strong>introduce non-linearities</strong> into the network, allowing it to capture complex patterns and relationships in the data.</p><p>The ability to model non-linear features is crucial for deep learning models to effectively learn from complex datasets and solve tasks such as image recognition, natural language processing, and other pattern recognition problems. The hierarchical structure of deep neural networks, with multiple layers of non-linear transformations, allows them to automatically learn and extract hierarchical representations of features from the input data. This enables deep learning models to handle tasks that involve non-linear relationships within the data.</p><p>By now it should be clear why activation functions are necessary for neural networks. The next questions that naturally arise are:</p><ul><li><strong>Why</strong> are there <strong>different types</strong> of activation functions?</li><li><strong>What </strong>are the <strong>differences</strong> between them?</li><li><strong>Which </strong>one is <strong>the best?</strong></li><li>How are they used in <strong>Python?</strong></li></ul><p>In the next paragraphs, I will go through the different types of activation functions individually and break each of them down with a simple explanation. After going through all of them, you will have a clear view of how the various activation functions differ and compare.</p><h3>Sigmoid or logistic function</h3><p>Formula and graph:</p><figure><img alt="sigmoid activation function formula" src="https://cdn-images-1.medium.com/max/264/1*-E18ntBeKzNkTvMY16vzAA.png" /><figcaption>Formula 2. <strong>Calculation of Sigmoid.</strong></figcaption></figure><figure><img alt="sigmoid activation function graph" src="https://cdn-images-1.medium.com/max/544/1*goDEUQjnlX-gaGiogMc-0w.png" /><figcaption>Fig 2. 
<strong>Graph of Sigmoid.</strong></figcaption></figure><p>According to Wikipedia, <a href="https://en.wikipedia.org/wiki/Sigmoid_function">a <strong>sigmoid function</strong></a> is any <a href="https://en.wikipedia.org/wiki/Mathematical_function">mathematical function</a> having a characteristic “S”-shaped curve or <strong>sigmoid curve.</strong> It is a <a href="https://en.wikipedia.org/wiki/Bounded_function">bounded</a>, <a href="https://en.wikipedia.org/wiki/Differentiable_function">differentiable</a>, real function that is defined for all real input values and has a non-negative derivative at each point.</p><p>The sigmoid function maps any value to the <strong>range between 0 and 1.</strong> You can imagine it as converting values into <strong>probabilities</strong>; thus, it is very common to apply this activation function in the output layer of classification models. Keep in mind that sigmoid can also be used in the <strong>hidden layers of the NN</strong>, and this was a common practice in the early deep-learning architectures. Nowadays, it is still useful in certain scenarios, specifically when you want the output of the neurons to be between 0 and 1.</p><p>The reason sigmoid is <strong>not </strong>commonly used anymore is its most important drawback: <strong>the vanishing gradient problem.</strong></p><p>What does this mean?</p><p>When the backpropagation algorithm is applied during optimization, the derivative of the activation function is also calculated. In the case of the sigmoid, the resulting gradient becomes <strong>extremely small (</strong>as you multiply it by quite small derivative values several times, once per layer<strong>)</strong>; thus, it barely contributes to updating the weights of the network. 
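</p><p>A rough numeric sketch of this effect in plain Python: it only multiplies sigmoid derivatives together and ignores the weights that also appear in a real backpropagation step, so treat it as an illustration under that assumption rather than a full gradient computation. The depth of 10 layers is an arbitrary choice.</p><pre>
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    # The derivative of sigmoid can be written with sigmoid itself: s * (1 - s).
    s = sigmoid(z)
    return s * (1.0 - s)


peak = sigmoid_derivative(0.0)        # the derivative is largest at z = 0
saturated = sigmoid_derivative(10.0)  # far from zero the curve is almost flat

# Best case for sigmoid: all 10 layers contribute the peak derivative.
best_case_product = peak ** 10

print(peak)               # 0.25
print(saturated)          # roughly 4.5e-05
print(best_case_product)  # roughly 9.5e-07
```
</pre><p>Even in the best case, the factor that reaches the first of the 10 layers is below one millionth.</p><p>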
In other words, <strong>the gradient vanishes.</strong></p><p>This problem was later mitigated by a new activation function, which is the subject of the next section.</p><h3>ReLU: Rectified Linear Unit</h3><p>Formula and graph:</p><figure><img alt="ReLU activation function formula" src="https://cdn-images-1.medium.com/max/255/1*239ZK1FIplAzoL6FTbKIuw.png" /><figcaption>Formula 3. <strong>Calculation of ReLU.</strong></figcaption></figure><figure><img alt="ReLU activation function graph" src="https://cdn-images-1.medium.com/max/831/1*9AcivlBo0HWx_XCYp4cDiA.png" /><figcaption>Fig 3. <strong>Graph of ReLU.</strong></figcaption></figure><p><strong>ReLU</strong> passes positive values through unchanged and maps negative values to zero. It is the <strong>most popular activation function</strong> in deep learning. Its popularity comes from simplicity, efficiency, and the ability to mitigate the vanishing gradient problem.</p><p>As seen from the formula, ReLU requires nothing more than a max operation. This contributes to the reduction of computational costs. This efficiency is crucial in the training of large-scale neural networks, where millions or even billions of parameters need to be updated during each iteration of the optimization process.</p><p>Traditional activation functions like sigmoid and tanh can saturate for extreme values, leading to vanishing gradients. ReLU, on the other hand, does not saturate for positive inputs, allowing gradients to flow more freely during backpropagation. For sigmoid, the derivative is at most 0.25, while the derivative of ReLU is either 0 or 1. When there are multiple layers, the sigmoid shrinks the gradient to a very small value because its derivative is always smaller than 1, but ReLU keeps the value the same or makes it zero. For deep learning, it is better to have a derivative of zero or one than a tiny number. 
Keep in mind that for every active neuron (one with a positive input), the derivative is exactly one.</p><p>In addition to addressing the vanishing gradient problem, ReLU introduces <strong>sparsity </strong>in the network. Since ReLU sets negative values to zero, some neurons in the network become inactive, leading to sparse activation patterns. Sparsity can be advantageous in terms of <strong>reducing overfitting, computational efficiency, and memory utilization.</strong></p><p>As always, if there is no problem, there is no development. ReLU also has a problem, called <strong>“dying ReLU”</strong>: a neuron whose input stays negative always outputs zero and receives zero gradient, so it stops learning. This is addressed by Leaky ReLU.</p><h3>Leaky ReLU</h3><p>Formula and graph:</p><figure><img alt="leaky relu activation function formula" src="https://cdn-images-1.medium.com/max/264/1*NZiUuFNsZBQ2CzvKAFweMw.png" /><figcaption>Formula 4. <strong>Calculation of Leaky ReLU.</strong></figcaption></figure><figure><img alt="leaky relu activation function graph" src="https://cdn-images-1.medium.com/max/831/1*JBxydvfKDw26o1bxMka6GQ.png" /><figcaption>Fig 4. <strong>Graph of Leaky ReLU.</strong></figcaption></figure><p>The update to the classic ReLU is to multiply negative values by a small coefficient (you can see the negative side of the graph is not exactly zero), rather than making them zero. This gives neurons with negative inputs <strong>small non-zero outputs and gradients</strong> and solves the <strong>“dying ReLU”</strong> problem. The graph above represents Leaky ReLU with the <strong>alpha </strong>coefficient equal to 0.01, which is a common default (for example in PyTorch) but can be changed.</p><p>By allowing a controlled leak of information for negative inputs, Leaky ReLU promotes a more robust flow of gradients during backpropagation, addressing issues associated with the vanishing gradient problem. 
This characteristic makes Leaky ReLU a popular choice in deep learning architectures, offering a good balance between the linearity of traditional ReLU and the avoidance of complete inactivity in certain neurons, which can enhance the learning capabilities of neural networks.</p><h3>Tanh</h3><p>Formula and graph:</p><figure><img alt="tanh activation function formula" src="https://cdn-images-1.medium.com/max/202/1*cNar2h4kACYnYYoQB6t4oA.png" /><figcaption>Formula 5. <strong>Calculation of tanh.</strong></figcaption></figure><figure><img alt="tanh activation function graph" src="https://cdn-images-1.medium.com/max/881/1*65RYy_QPdyJVL86AkjXAuw.png" /><figcaption>Fig 5. <strong>Graph of tanh.</strong></figcaption></figure><p>The hyperbolic tangent function, commonly abbreviated as <strong>tanh</strong>, is a widely used activation function in neural networks. It is similar to sigmoid, but its <strong>range is between -1 and 1.</strong> One significant advantage of the tanh function is that its output is <strong>zero-centered.</strong> This zero-centered property contrasts with the sigmoid activation function, which outputs values in the range (0,1) and is not zero-centered. The zero-centeredness of tanh can be beneficial during the training of neural networks.</p><p>The tanh function squashes its inputs to the range of (-1,1). This bounded output range is advantageous in scenarios where it is desirable to constrain the outputs within specific bounds. In tasks such as image processing or text generation, where the intensity or relevance of features should be well-regulated, the bounded nature of tanh can be valuable.</p><p>The tanh function became preferred over the sigmoid function as it gave better performance for <strong>multi-layer neural networks.</strong> However, it did not solve the vanishing gradient problem that sigmoid suffered from.</p><p>Like sigmoid, tanh is useful in certain scenarios such as classification. 
Its ability to introduce non-linearity and to output both negative and positive values is among its advantages. Another useful feature of tanh is that its bounded, smooth output can help keep training stable, although avoiding overfitting still depends on carefully tuned regularization. Unlike ReLU, whose output is unbounded for positive inputs, tanh smooths out output values, which can make the learning process more stable during long training periods.</p><h3>Softmax</h3><p><strong>Softmax </strong>is another activation function to discuss. Being somewhat similar to sigmoid, it is used in the output layer of a neural network for <strong>multi-class classification</strong> problems. It takes an input vector and transforms it into a <strong>probability distribution.</strong> The output of the softmax function is a vector of probabilities that sums to 1. You can think of Softmax as a multiclass version of sigmoid.</p><p>The softmax function normalizes the input values to produce a probability distribution. The class with <strong>the highest probability</strong> is then typically chosen as <strong>the predicted class. </strong>The softmax activation is useful for converting raw scores or logits into probabilities, making it suitable for the final layer of a neural network used for classification tasks.</p><p>These are the most commonly used activation functions; however, there are many more types as well. The majority of them are modifications of the ones we discussed. The following paragraphs will show their usage in <strong>Python.</strong></p><h3>Python</h3><p>Deep learning frameworks such as PyTorch and Keras provide a variety of activation functions that can be easily integrated into neural network architectures. 
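</p><p>Before looking at the built-in versions, it can help to see how little code the functions from this post need. The following is a standard-library sketch for intuition only; the function names are chosen here for illustration, and framework implementations are vectorized and handle numerical edge cases more carefully.</p><pre>
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    # Positive inputs pass through; negative inputs are scaled by alpha.
    return max(0.0, z) + alpha * min(0.0, z)


def softmax(scores):
    # Subtracting the largest score keeps exp() from overflowing
    # and does not change the resulting probabilities.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))      # 0.5
print(relu(-3.0))        # 0.0
print(leaky_relu(-3.0))  # a small negative value instead of zero
print(math.tanh(0.0))    # 0.0

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the first class gets the highest probability
print(sum(probs))  # approximately 1
```
</pre><p>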
The activation functions available in PyTorch are listed at the link below:</p><p><a href="https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity">torch.nn - PyTorch 2.4 documentation</a></p><p>From a practical point of view, an activation function is selected at the start of a deep learning project according to the characteristics of the problem being solved.</p><p>In the code sample below, a simple Keras (TensorFlow) dense layer is constructed. Here, instead of “relu”, you can use “sigmoid”, “tanh”, “elu”, “softmax”, or other functions.</p><pre>import tensorflow as tf<br>tf.keras.layers.Dense(128, activation=&#39;relu&#39;)</pre><p>When you build a neural network, try to refer to the documentation and solutions to similar problems to find out which activation functions might work well for your case.</p><p>You can access the code from the link below to plot and analyze different activation functions:</p><p><a href="https://github.com/anarabiyev/Medium-Youtube/blob/master/Activation_Functions.ipynb">Medium-Youtube/Activation_Functions.ipynb at master · anarabiyev/Medium-Youtube</a></p><h4>I hope I added value to your deep learning journey.</h4><h3>If so, please don’t forget to subscribe for more tutorials to come and clap the story!</h3>
<hr><p><a href="https://ai.plainenglish.io/complete-guide-to-activation-functions-in-deep-learning-fb65aca121f9">Complete Guide to Activation Functions in Deep Learning</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to visualize loss and accuracy for Deep Learning models using TensorBoard (Part 2)]]></title>
            <link>https://ai.plainenglish.io/how-to-visualize-loss-and-accuracy-for-deep-learning-models-using-tensorboard-part-2-02f8de1db460?source=rss-15151ff5820e------2</link>
            <guid isPermaLink="false">https://medium.com/p/02f8de1db460</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[tutorial]]></category>
            <category><![CDATA[tensorflow]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Anar Abiyev]]></dc:creator>
            <pubDate>Wed, 13 Dec 2023 08:17:19 GMT</pubDate>
            <atom:updated>2023-12-13T08:17:19.736Z</atom:updated>
            <content:encoded><![CDATA[<h3>How to Visualize Loss and Accuracy for Deep Learning Models Using TensorBoard (Part 2)</h3><h4>You will learn how to visualize loss and accuracy easily by using the TensorBoard feature of TensorFlow.</h4><p>This is Part 2 of the TensorBoard tutorial; <a href="https://medium.com/ai-in-plain-english/step-by-step-guide-to-tensorboard-game-changer-visualization-tool-part-1-90b74663fd74"><strong>check Part 1</strong></a> to learn how to</p><ul><li><strong>set up TensorBoard.</strong></li><li><strong>visualize graphs, images and hyperparameter tuning.</strong></li></ul><p>The hyperparameter tuning part is definitely worth reading, because TensorBoard visualizes the process in a great way!</p><p>In this section, you will learn how to make small modifications to your deep learning code to visualize the loss and accuracy of models by epoch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SOLN6B1oBkbDUpbja0Jjqw.jpeg" /><figcaption>Photo generated by <a href="https://app.leonardo.ai/">Leonardo.Ai</a>.</figcaption></figure><h3>Introduction</h3><p>The example will be based on the MNIST dataset. 
I will not describe this dataset, because I am sure you are familiar with it if you are reading about TensorBoard.</p><p>The code below is the example given by TensorFlow; check the reference for more:</p><p><a href="https://www.tensorflow.org/datasets/keras_example">Training a neural network on MNIST with Keras | TensorFlow Datasets</a></p><pre>import tensorflow as tf<br>import tensorflow_datasets as tfds<br><br>(ds_train, ds_test), ds_info = tfds.load(<br>    &#39;mnist&#39;,<br>    split=[&#39;train&#39;, &#39;test&#39;], shuffle_files=True, as_supervised=True,<br>    with_info=True,<br>)<br><br>def normalize_img(image, label):<br>  return tf.cast(image, tf.float32) / 255., label<br><br>ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)<br>ds_train = ds_train.cache()<br>ds_train = ds_train.shuffle(ds_info.splits[&#39;train&#39;].num_examples)<br>ds_train = ds_train.batch(128)<br>ds_train = ds_train.prefetch(tf.data.AUTOTUNE)<br><br>ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)<br>ds_test = ds_test.batch(128)<br>ds_test = ds_test.cache()<br>ds_test = ds_test.prefetch(tf.data.AUTOTUNE)<br><br><br>model = tf.keras.models.Sequential([<br>  tf.keras.layers.Flatten(input_shape=(28, 28)),<br>  tf.keras.layers.Dense(128, activation=&#39;relu&#39;),<br>  tf.keras.layers.Dense(10)<br>])<br>model.compile(<br>    optimizer=tf.keras.optimizers.Adam(0.001),<br>    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),<br>    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],<br>)<br><br>model.fit(<br>    ds_train, epochs=6, validation_data=ds_test,<br>)</pre><h3>How to do it</h3><p>The changes we will make are:</p><ol><li>Create the TensorBoard callback:</li></ol><pre>tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=&quot;logs&quot;)</pre><p>2. 
Change “model.fit” to add the callback:</p><pre>model.fit(<br>    ds_train,<br>    epochs=6,<br>    validation_data=ds_test,<br>    callbacks=[tensorboard_callback]  # Add TensorBoard callback<br>)</pre><h3>Results</h3><p>Start TensorBoard from a notebook (Jupyter or Colab) with the code below:</p><pre>%load_ext tensorboard<br>%tensorboard --logdir logs</pre><p>Navigate to the Scalars tab to see the graphs:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/266/1*Uzwk2fw8SGv71ZzI-hKTWw.png" /><figcaption>loss</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/252/1*8pZmKpuzg8bq1aUh6Dipzg.png" /><figcaption>accuracy</figcaption></figure><p>That is it for this TensorBoard tutorial; don’t forget to check Part 1!</p><p><a href="https://ai.plainenglish.io/step-by-step-guide-to-tensorboard-game-changer-visualization-tool-part-1-90b74663fd74">Step-by-Step Guide to TensorBoard: Game Changer Visualization tool- Part 1</a></p>
<hr><p><a href="https://ai.plainenglish.io/how-to-visualize-loss-and-accuracy-for-deep-learning-models-using-tensorboard-part-2-02f8de1db460">How to visualize loss and accuracy for Deep Learning models using TensorBoard (Part 2)</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>