<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Babatunde Oreoluwa on Medium]]></title>
        <description><![CDATA[Stories by Babatunde Oreoluwa on Medium]]></description>
        <link>https://medium.com/@babatundeoreoluwa35?source=rss-34103bd9ac9b------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*9q6Tt9K4XegkxMIBSFvcDg.jpeg</url>
            <title>Stories by Babatunde Oreoluwa on Medium</title>
            <link>https://medium.com/@babatundeoreoluwa35?source=rss-34103bd9ac9b------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:15:22 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@babatundeoreoluwa35/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Image Classification Using Transfer Learning: Crop Disease Classification]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/image-classification-using-transfer-learning-crop-disease-classification-dc4dedab17cb?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/dc4dedab17cb</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[image-classification]]></category>
            <category><![CDATA[computer-vision]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Sat, 25 Jun 2022 09:32:01 GMT</pubDate>
            <atom:updated>2022-06-25T09:32:01.018Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*090S-tkAAx5SoI4jodfoRw.jpeg" /><figcaption>Image by<a href="https://analyticsindiamag.com/7-types-classification-algorithms/"> analyticsindia</a></figcaption></figure><p>Artificial intelligence has numerous interesting branches, we will be discussing a branch of AI which is called Computer Vision.</p><p>Computer vision (CV) is a branch of artificial intelligence (AI) that allows computers and systems to extract useful information from digital photos, videos, and other visual inputs, as well as to make meaningful decisions based on those data.</p><p>Computer vision has numerous subfields which include Image classification, Image segmentation, scene reconstruction, object detection, event detection, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, visual servoing, and 3D scene modeling, and image restoration.</p><p>In this piece, we shall be focusing on a subfield in CV known as Image classification. Image classification is the task of assigning a label to an image from a predefined set of categories. In practice, this implies that we must analyze an input image and then produce a label that categorizes it. The label is always chosen from a predetermined set of options.</p><p>In this article, we would be building a model using transfer learning(pre-trained models)to classify if a plant has been affected by a fall armyworm using the images of the plant. The data source for this task is the <a href="https://zindi.africa/competitions/makerere-fall-armyworm-crop-challenge/data">Makerere Fall Armyworm Crop Challenge data</a> on Zindi.</p><p>The data for this project has a train.CSV file that contains the 1,619train images name and Labels, a test.CSV file that contains the 1,080 image names only, and the Images folder which contains the 2,699 images for the train CSV and test CSV. I would be using Google colab as the IDE for this project. After downloading and uploading the data on your drive, you mount the drive.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ed6a9b61be06db87ff35c12c6d153da8/href">https://medium.com/media/ed6a9b61be06db87ff35c12c6d153da8/href</a></iframe><p>Then import all the necessary libraries. For this project, we would be using the Tensorflow and Keras framework. I choose this framework because it is easy to use, flexible and both have simpler APIs.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/195566b63efe6569e450beac56015701/href">https://medium.com/media/195566b63efe6569e450beac56015701/href</a></iframe><p>The libraries are now imported, it’s time to load the data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b1a02f2ed50b9e251104f806d5a74ebb/href">https://medium.com/media/b1a02f2ed50b9e251104f806d5a74ebb/href</a></iframe><p>Let’s check what the data looks like;</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3b7a1db7983aeb1800b91020a4ad95d/href">https://medium.com/media/a3b7a1db7983aeb1800b91020a4ad95d/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/318/1*pRMHS2OKF25BF9KPV4jPgA.png" /></figure><p>We can see that the train images CSV file contains the Image_id and the labels. 
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4844dc6e8b795f730dee75922577cf3e/href">https://medium.com/media/4844dc6e8b795f730dee75922577cf3e/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/239/1*2OZnYpf-nqluwxLE_iSbiw.png" /></figure><p>The test data does not have a Label column because that is what we will be classifying after building the model. Let’s give a variable name to the path of the image directory.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2760f6d57e9f0be731810c28378f0acb/href">https://medium.com/media/2760f6d57e9f0be731810c28378f0acb/href</a></iframe><p>Now, we will merge images_path with train['Image_id'] so that the train['Image_id'] column in the CSV file holds the full image path.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5cabb30c66c6f6f57752a8bbd345921a/href">https://medium.com/media/5cabb30c66c6f6f57752a8bbd345921a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*LqxrVdXhyDHsdy2fX_t_Sg.png" /></figure><p>The train['Image_id'] column of the train CSV file now holds the full paths of all 1,619 train images. We do the same for the test CSV.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/935a39c6827aaa544e5b8b81342f7df1/href">https://medium.com/media/935a39c6827aaa544e5b8b81342f7df1/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/394/1*eAiwbW_4I-ZLoXnjVlPt4g.png" /></figure><p>Let’s now preprocess the data for modelling. As I said earlier, we will be using the tf.keras framework for this project, and it offers several ways of loading and augmenting data for a model. Here, we will use the <a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator"><strong>ImageDataGenerator</strong></a> to augment the images and <strong>flow_from_dataframe</strong> to load the data. After rescaling all the images to a fixed size, we then augment them.</p><p>Note:<strong> It is not compulsory to augment image data; we only do so to improve the performance and outcome of the model by adding new and different examples to the train and validation datasets.</strong></p><p>When you use the <strong>flow_from_dataframe</strong> data loader, pass in the DataFrame, the image ID column as x_col, the target as y_col, and the target_size of the images, along with class_mode, subset, seed, and batch_size. There are other arguments you can pass in depending on what you want to do.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0c441676044bfa770fa16f5a7b6d4041/href">https://medium.com/media/0c441676044bfa770fa16f5a7b6d4041/href</a></iframe><p>For the test data loader we set y_col to <strong>None</strong> because that is what we are predicting.</p>
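<p>For readers who cannot see the gist, the loaders just described could look roughly like this. The augmentation settings, the 20% validation split, and the seed are assumptions for illustration.</p><pre><code>from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 'train' and 'test' are the DataFrames loaded earlier, with full image paths
train['Label'] = train['Label'].astype(str)  # categorical class_mode expects strings

train_datagen = ImageDataGenerator(rescale=1./255,
                                   horizontal_flip=True,
                                   zoom_range=0.2,
                                   validation_split=0.2)

train_gen = train_datagen.flow_from_dataframe(
    train, x_col='Image_id', y_col='Label', target_size=(224, 224),
    class_mode='categorical', subset='training', seed=42, batch_size=32)

val_gen = train_datagen.flow_from_dataframe(
    train, x_col='Image_id', y_col='Label', target_size=(224, 224),
    class_mode='categorical', subset='validation', seed=42, batch_size=32)

# The test loader has no labels, so y_col=None and class_mode=None
test_datagen = ImageDataGenerator(rescale=1./255)
test_gen = test_datagen.flow_from_dataframe(
    test, x_col='Image_id', y_col=None, target_size=(224, 224),
    class_mode=None, shuffle=False, batch_size=32)
</code></pre>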
<p>Let’s build our model!</p><p>In image classification, training a deep learning model from scratch is computationally expensive and requires a large amount of data to achieve high performance. Using a pre-trained model instead is what we call <strong>Transfer Learning</strong>. Transfer Learning is a method in machine learning where a model developed for one task is reused as the starting point for a model on another, similar task. It is computationally efficient and helps achieve better results with a small amount of data. For this project, we are using a pre-trained model to classify each plant image as infected or uninfected.</p><p>We have several pre-trained models for image classification; examples are VGG16, VGG19, InceptionV3, ResNet50, and ResNetV2. These models have been trained on millions of images, which helps them perform well on our data.</p><p>Let’s import the pre-trained model we are using.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b965f39e8a3fe19456418a2e40761730/href">https://medium.com/media/b965f39e8a3fe19456418a2e40761730/href</a></iframe><p>We will be using VGG19 as our pre-trained model. VGG19 is a variant of the VGG model with 19 weight layers (16 convolutional layers and 3 fully connected layers), along with 5 max-pooling layers and a softmax output layer.</p><p>Now that the pre-trained model is imported, we will not load its output layer, because VGG19 was originally trained on the ImageNet database, which contains over a million images across 1,000 classes. Since we are working on binary image classification, we discard that 1,000-class head by setting the include_top argument to False and add our own output layer. VGG19 takes input images of size 224x224.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0f033357a8e54ff4b683cf2f2273526f/href">https://medium.com/media/0f033357a8e54ff4b683cf2f2273526f/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d13cb8b848e6a9dd3e34f10e5c8aea98/href">https://medium.com/media/d13cb8b848e6a9dd3e34f10e5c8aea98/href</a></iframe><p>In the code above we set the trainable attribute to False because we don’t want training on our data to overwrite the VGG19 weights, so we freeze the weights of the pre-trained model.</p><p>Let’s add our output layer.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8521110e8ca0060afdefdf1c1e278821/href">https://medium.com/media/8521110e8ca0060afdefdf1c1e278821/href</a></iframe><p>Let’s compile the model using <strong>Adam</strong> as the optimizer, <strong>binary_crossentropy</strong> as the loss (because this is binary image classification), and <strong>accuracy</strong> as the metric.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e2470a7744be7b09eeabbc60c52b8d3d/href">https://medium.com/media/e2470a7744be7b09eeabbc60c52b8d3d/href</a></iframe><p>Let’s check the model summary.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d09aaa3c5ece49677750e1d7e2752f4f/href">https://medium.com/media/d09aaa3c5ece49677750e1d7e2752f4f/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/315/1*7JHEXNwrnbCu7__P0hkX5A.png" /></figure><p>The image above is the output of <strong>model.summary()</strong>; it lists the layers of the model. We can also see that the model has 26,447,682 parameters in total, of which 6,423,298 are trainable. The remaining 20,024,384 are non-trainable parameters, which are the weights we froze.</p>
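<p>Here is a compact sketch of the model definition described above. The head (a Flatten layer, a 256-unit ReLU layer, and a 2-unit softmax output) is a reconstruction chosen so that the trainable parameter count matches the summary shown above; the exact head lives in the embedded gist.</p><pre><code>from tensorflow.keras.applications import VGG19
from tensorflow.keras import layers, models

# Load VGG19 without its 1000-class ImageNet head and freeze its weights
base_model = VGG19(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))
base_model.trainable = False

# Attach a small classification head: 2 outputs (uninfected / infected)
model = models.Sequential([
    base_model,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(2, activation='softmax'),
])

# binary_crossentropy loss and accuracy metric, as described in the article
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
</code></pre>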
<p>Let’s now train the model!</p><p>To train the model we pass in the train data, the validation data, the number of epochs, steps_per_epoch (the number of unique samples in your dataset divided by the batch size), and verbose, which controls how much output the model shows while training.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/cca251c0398759af6752a8a82b8f4fa0/href">https://medium.com/media/cca251c0398759af6752a8a82b8f4fa0/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/518/1*D_A-J0qVaSHXBlqwhEiSLQ.png" /></figure><p>Wow! The accuracy of our model is 98.4%. Let’s now create a submission CSV file by using the model to classify the images in the test data set.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/340ec034eca611a57cb368cd281a0b82/href">https://medium.com/media/340ec034eca611a57cb368cd281a0b82/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/387/1*kLNP8OH7Fb6ZPf0ZnnY57Q.png" /></figure><p>When we print out the predictions, we get two values per image, the probability of each label being the actual value. This can be quite confusing; to avoid it we use the argmax() function from the NumPy library. The argmax function returns the index of the maximum value: if the first value is the maximum it returns 0, and if the second value is the maximum it returns 1.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d31a8508d100967cec1adf47eb2edf49/href">https://medium.com/media/d31a8508d100967cec1adf47eb2edf49/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/987/1*LibVJOKn2xA_0Zn9U5ac6Q.png" /></figure><p>Now we have converted the prediction probabilities to target labels.</p><p>The submission CSV contains the image IDs from the test CSV file; let’s read the submission CSV file and create a Label column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0f78f6dc990d0cc6bd053dbc0ff50c21/href">https://medium.com/media/0f78f6dc990d0cc6bd053dbc0ff50c21/href</a></iframe><p>When this code is done executing, a new CSV file called my_submission.csv will be created on your drive. This is what you will download and upload as your submission.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/180/1*WBv_8lXAJIUCVAwv7NW_4Q.png" /></figure><p>Finally, we have a model that can predict whether a plant is infected or uninfected. I recommend you try out other pre-trained models and compare their accuracy scores.</p><p>For more clarification on this project, check out my <a href="https://github.com/Oreoluwa1234/Zindi-Makerere-crop-disease-Prediction">GitHub portfolio</a>.</p><p>Thank you for reading all the way through; I hope this article is useful and of benefit to you. Don’t forget to read, practice, learn, clap and share.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dc4dedab17cb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A step-by-step approach to building a machine learning model]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/a-step-by-step-approach-to-building-a-machine-learning-model-396f07dc571c?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/396f07dc571c</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[improvement]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Thu, 28 Apr 2022 11:08:37 GMT</pubDate>
            <atom:updated>2022-04-28T12:56:32.028Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/553/1*mVwkX0wCJQkNlmc38mLN1w.jpeg" /><figcaption>Image by <a href="https://unsplash.com/s/photos/machine-learning">Unsplash</a></figcaption></figure><p>Machine learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing systems that learn from the data they consume to improve their performance. Most AI applications run on ML models, and many beginners are unaware that there is a step-by-step process for creating an ML model.</p><p>We’ll go over the process for creating a machine learning model in this article. The following is a step-by-step approach to building one:</p><p><strong>(1) Understanding the problem: </strong>Understanding the problem is the first step in creating a machine learning model; once the problem is understood, it gives us a structured way to solve it.</p><p><strong>(2) Data collection: </strong>The practice of collecting and acquiring data from a variety of sources is known as data collection. Data must be collected and kept in a form that makes sense for the problem at hand to be used to develop viable artificial intelligence (AI) and machine learning solutions. A machine learning model is only as good as the data used to train it. Several websites serve as sources of data for ML projects, e.g. <a href="https://www.kaggle.com/datasets">Kaggle</a>, <a href="https://zindi.africa/competitions">Zindi</a>, and the <a href="https://archive.ics.uci.edu/ml/datasets.php">UCI Machine Learning Repository</a>.</p><p><strong>(3) Data preprocessing: </strong>Since we can’t work with raw data, we must transform it into an understandable format by preprocessing it. There are different preprocessing methods for different data types. Data preprocessing is considered one of the crucial phases in developing a machine learning model because it prepares the data in the most meaningful way for the subsequent data modeling.</p><p><strong>(4) Data modeling: </strong>The data is ready for training and testing at this point, so we can now select a model and train it on the data. When it comes to selecting a model, you can choose from a variety of model families based on your data: classification, regression, clustering, and other methods. Several algorithms and techniques are used in the stage of training and testing the model (a brief code sketch follows this list).</p><p><strong>(5) Model evaluation: </strong>The outcome of the model can be used to evaluate it. Model evaluation is done using metrics such as accuracy score, Root Mean Square Error (RMSE), the confusion matrix, the classification report, Mean Square Error (MSE), and so on, to check the quality of the model. This stage ensures that the machine learning model performs well.</p><p><strong>(6) Model improvement: </strong>After evaluating your model’s performance with some metrics, there is room for improvement if the model is not performing as expected. This stage is therefore an optional one.</p><p><strong>(7) Model deployment: </strong>The model is now ready to be put into production to see how it performs in the real world. It could be deployed as a web app or in any other form you wish.</p><p>Finally, these are the steps needed to build a machine learning model.</p>
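<p>To make these steps concrete, here is a minimal, illustrative sketch of steps (3) to (5) using scikit-learn. The file name, target column, and choice of algorithm are placeholders, not a prescription.</p><pre><code>import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 3: preprocess (here, just drop rows with missing values)
df = pd.read_csv('data.csv').dropna()             # placeholder file
X, y = df.drop(columns=['target']), df['target']  # placeholder target column

# Step 4: model the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate the model
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))
</code></pre>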
<p>Thank you for reading this far; I hope you now have a clear understanding of how to build a machine learning model. Remember to read, learn, practice, clap, and share what you’ve learned.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=396f07dc571c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Breast Cancer Classification with Deep learning]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/breast-cancer-classification-with-deep-learning-43aea8127ac8?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/43aea8127ac8</guid>
            <category><![CDATA[science-and-technology]]></category>
            <category><![CDATA[healthcare]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Wed, 23 Mar 2022 14:22:01 GMT</pubDate>
            <atom:updated>2022-03-23T14:22:01.033Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/690/1*brjYKZ2j0YMWYv1DvTA7FA.jpeg" /><figcaption>Image by <a href="https://healthitanalytics.com/features/what-is-deep-learning-and-how-will-it-change-healthcare">Health IT Analytics</a></figcaption></figure><p>Breast cancer is a disease in which the cells of the breast grow out of control. A breast cancer tumor can be benign (meaning it is not harmful to one’s health) or malignant (meaning it is harmful to one’s health) (has the potential to be dangerous). Benign tumors are not cancerous because their cells have a similar appearance to normal cells, they develop slowly, and do not invade neighboring tissues or spread to other parts of the body. Malignant tumors are cancerous. Malignant cells can eventually expand beyond the original tumor to other regions of the body if left untreated.</p><p>Deep learning is a subset of machine learning that is essentially a neural network with three or more layers. These neural networks aim to imitate the activity of the human brain by allowing it to “learn” from large amounts of data. Deep learning is a machine learning technique that allows computers to learn by example in the same way that humans do. Deep learning is a critical component of self-driving automobiles, allowing them to detect a stop sign or discriminate between pedestrians and lamppost. It enables voice control in consumer electronics such as phones, tablets, televisions, and hands-free speakers. Deep learning has gotten a lot of press recently, and with good cause. It’s accomplishing previously unattainable accomplishments.</p><p>In deep learning, a computer model learns to perform classification tasks directly from images, text, or Sundeep learning models can attain state-of-the-art accuracy, even surpassing human performance in some cases. Models are trained using a huge quantity of labeled data and multilayer neural network architectures.</p><p>In this article, I would be walking you through how to classify with deep learning whether a breast cancer tumor is benign or malignant.</p><p>The whole process is broken down into 4 stages;</p><ul><li>Data Collection.</li><li>Data cleaning and preprocessing</li><li>Building Neural Network</li><li>Making a predictive system</li></ul><p><strong>Data collection: </strong>The data used for this project is the publicly available dataset from Kaggle titled ‘Breast Cancer Wisconsin (Diagnostic) Data Set’. Here is the<a href="https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data"> link</a>.</p><p><strong>Data cleaning and preprocessing: </strong>Importing libraries and datasets is the initial step in data cleaning and preprocessing. A Python library is a group of related modules that may be called and used together. Pandas (for data analysis), Numpy (for numerical operations), Seaborn (for data visualization and exploratory data analysis), and matplotlib.py plot( for data visualization and graphical plotting). 
These libraries can be accessed and used with the help of the “import” keyword.</p><p>Import all the necessary libraries.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0dcc811e3d6749f04bb0a9d4135a4164/href">https://medium.com/media/0dcc811e3d6749f04bb0a9d4135a4164/href</a></iframe><p>To be able to access the dataset from the drive, we must first mount Google Drive.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/35fc2520eb24de5c90bb3f675209b657/href">https://medium.com/media/35fc2520eb24de5c90bb3f675209b657/href</a></iframe><p>After mounting the drive, load and read the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0c5555c29c5baf5bee421957cc1dcd98/href">https://medium.com/media/0c5555c29c5baf5bee421957cc1dcd98/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9LnADmN5YOtp98Nt2ZMcYQ.png" /></figure><p>The above shows a sample of the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/efcd86e5c9f4fe8c835faf2fdca8ab6a/href">https://medium.com/media/efcd86e5c9f4fe8c835faf2fdca8ab6a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/782/1*MuylLZQN9pTqc2g0mYW-Lg.png" /></figure><p>From the above code, we can see that the data contains 569 rows and 33 columns.</p><p>Let’s drop the columns which aren’t needed for the prediction.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9f76e3f39cff405c48c4663752f7b426/href">https://medium.com/media/9f76e3f39cff405c48c4663752f7b426/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ncjBfeescPmpAtxCRtB6bg.png" /></figure><p>After dropping the unneeded columns, we can see that we now have 31 columns.</p><p>Let’s take a look at the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3d21dca656ff45d1469f07feceeed74/href">https://medium.com/media/a3d21dca656ff45d1469f07feceeed74/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/289/1*cemhX0wRsK23kAyTXDC9-g.png" /></figure><p>The above shows that we have 569 data entries.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f73701da0b1ef63bf8f37e4419c6c54a/href">https://medium.com/media/f73701da0b1ef63bf8f37e4419c6c54a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hXYCmxdM8rJORtbUzhXO1A.png" /></figure><p>The above shows the statistical measures of the dataset.</p><p>Let’s check for categorical features.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/376a2c32394f17099ad90abb73a2e028/href">https://medium.com/media/376a2c32394f17099ad90abb73a2e028/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/141/1*B2lZzjI1L6Jn1UW0EfYp4g.png" /></figure><p>Here we see that the target column, the diagnosis, is the only categorical column in the dataset.</p>
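<p>Because the gists above may not display everywhere, this is a rough consolidation of the loading and inspection steps. The drive path is an assumption; the two dropped columns (‘id’ and ‘Unnamed: 32’) are the usual extra columns in this Kaggle file.</p><pre><code>import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

# Assumed location of the Kaggle CSV on the drive
df = pd.read_csv('/content/drive/MyDrive/data.csv')
print(df.shape)                          # (569, 33)

# Drop the columns not needed for the prediction
df = df.drop(columns=['id', 'Unnamed: 32'])
df.info()                                # 569 entries, 31 columns
print(df.describe())                     # statistical measures

# 'diagnosis' is the only categorical (object) column
print(df.select_dtypes(include='object').columns)
</code></pre>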
<p>Let’s check the distribution of the target column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6798a9bd56292166ce8c0172b0d1102f/href">https://medium.com/media/6798a9bd56292166ce8c0172b0d1102f/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ALt7bkI8MGIX5EHdhnZFfA.png" /></figure><p>The image above shows the value counts for the target column.</p><p>In my previous <a href="https://medium.com/@babatundeoreoluwa35/salary-prediction-with-machine-learning-part-1-d88364ed7d6b">article</a>, I talked about how to transform categorical features into numerical features using LabelEncoder. So we are going to encode the values of the target column as numbers.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6e36acad23fccf037b1cfd278338133a/href">https://medium.com/media/6e36acad23fccf037b1cfd278338133a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/402/1*_5UmSSBqf8OgpAJPluoyiA.png" /></figure><p>Now the label encoder has given each target value a unique integer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jkwaWhyFan9a_BpO6AZ4NQ.png" /></figure><p>0-Benign</p><p>1-Malignant</p><p>Let’s take a look at the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fMWxsWLepgg1dnZbzPcOCQ.png" /></figure><p>Our data is ready for modeling!</p><p>Now we will split our data into X and y: X will contain the features and y will contain the target, which is the diagnosis.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6435a144e93866d4dcd23683bcf3939e/href">https://medium.com/media/6435a144e93866d4dcd23683bcf3939e/href</a></iframe><p>Splitting the data into train and test sets</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3f1c56ffc34488eac19910b2a80a9b32/href">https://medium.com/media/3f1c56ffc34488eac19910b2a80a9b32/href</a></iframe><p>To ensure that the data is internally consistent, we will use the StandardScaler class from the sklearn.preprocessing library. Data standardization helps to increase the quality of your data and improve the accuracy of the model; you can read more about <a href="https://www.journaldev.com/45025/standardscaler-function-in-python#:~:text=Python%20sklearn%20library%20offers%20us,values%20into%20a%20standard%20format.&amp;text=According%20to%20the%20above%20syntax,the%20data%20and%20standardize%20it.">data standardization</a>.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ffaa6da1f5aed6afd9f5c71106c71c7b/href">https://medium.com/media/ffaa6da1f5aed6afd9f5c71106c71c7b/href</a></iframe>
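<p>In sketch form, the encoding, splitting, and scaling steps above might look like this; the split ratio and the seed are assumptions.</p><pre><code>from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Encode the target: LabelEncoder assigns Benign (B) 0 and Malignant (M) 1
le = LabelEncoder()
df['diagnosis'] = le.fit_transform(df['diagnosis'])

X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Standardize the features so they share a common scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
</code></pre>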
<p>Now that the data has been cleaned and preprocessed, the next stage is building the neural network.</p><ul><li>Building the neural network: to train on this data, we will build a three-layer network.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/263/1*7Uc5jpmOWyrcwQPgZkKW0g.png" /><figcaption>Image by <a href="https://healthitanalytics.com/features/what-is-deep-learning-and-how-will-it-change-healthcare">ResearchGate</a></figcaption></figure><p>Hence, we will import the TensorFlow library, set a random seed, and import Keras from TensorFlow. TensorFlow is a deep learning library created by Google for building neural networks; here is the <a href="https://www.tensorflow.org/guide/basics">documentation</a>. Keras is an open-source software library for artificial neural networks that serves as a user interface for TensorFlow; here is the <a href="https://keras.io/getting_started/">documentation</a>.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5a723503cebaffaa3d42b8f460f223dd/href">https://medium.com/media/5a723503cebaffaa3d42b8f460f223dd/href</a></iframe><p>Let’s create the neural network by calling the keras.Sequential() function. <a href="https://www.tensorflow.org/api_docs/python/tf/keras/Sequential"><strong>Sequential</strong></a> groups a linear stack of layers into a <strong>tf</strong>.keras.Model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d4569c4ab8804d381312a6d1eb84ca64/href">https://medium.com/media/d4569c4ab8804d381312a6d1eb84ca64/href</a></iframe><p><strong>keras.layers.Flatten</strong>: this layer is the input layer; it is responsible for converting the data into a one-dimensional array so that it may be passed on to the next layer. All of the feature columns are taken in by this layer.</p><p><strong>keras.layers.Dense</strong>: this is the hidden layer, in which every neuron is connected to every neuron in the previous and next layers. It sits between the input and output layers, and it has a given number of neurons and an activation function.</p><p><strong>keras.layers.Dense</strong>: this is the output layer; it contains one neuron per target value and an activation function.</p><p>An activation function is a very important feature of an artificial neural network; it decides whether a neuron should be activated or not.</p><p>We then compile the neural network after it has been created. Compilation converts the previously created basic sequence of layers into a highly efficient series of matrix transformations; it can be thought of as a stage before training that allows the computer to train the model efficiently. The compilation is performed using one single method, which is shown below.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6e1bce676d17b87ab067be0bd1dd3f44/href">https://medium.com/media/6e1bce676d17b87ab067be0bd1dd3f44/href</a></iframe><p>Optimizer: an optimizer is a function or algorithm that modifies the characteristics of a neural network, such as its weights and learning rate. As a result, it aids in the reduction of total loss and the improvement of accuracy.</p><p>Loss: the loss function in a neural network quantifies the difference between the expected outcome and the outcome produced by the machine learning model.</p><p>Metrics: a metric is a function that can be used to assess your model’s performance. Metric functions are similar to loss functions, except that the outcomes of assessing a metric are not used to train the model.</p><p>The next stage is to train the neural network.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/26797a6c76fff0bdf5bbcd0c4908ce6c/href">https://medium.com/media/26797a6c76fff0bdf5bbcd0c4908ce6c/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m7St9rJ8Wvb8QSsN5s-cVg.png" /></figure><p>From this, we can see that the loss and the accuracy are inversely related: the lower the loss, the higher the accuracy of the neural network, and vice versa.</p>
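<p>For readers without access to the gists, an end-to-end sketch of a network like the one described could look as follows. The layer sizes, seed, loss choice, and epoch count are stand-ins, not the notebook’s exact values.</p><pre><code>import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(3)

# Input layer flattens the 30 features; one hidden layer; 2-unit output
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(30,)),
    keras.layers.Dense(20, activation='relu'),
    keras.layers.Dense(2, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train, holding out 10% of the training data for validation
history = model.fit(X_train, y_train, validation_split=0.1, epochs=10)

# Accuracy on the held-out test set
loss, accuracy = model.evaluate(X_test, y_test)
print(accuracy)
</code></pre>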
<p>Let’s check the accuracy on the test data set.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/95dc7bdde906db02c402f66368fd3d77/href">https://medium.com/media/95dc7bdde906db02c402f66368fd3d77/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/1*4xMA6Dn10wcCsq57ZFlbsg.png" /></figure><p><strong>Building a predictive system:</strong> This is the most interesting part of this project. Now we are going to build a predictive system.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f4f7aa05e925172e156c1444abf00e07/href">https://medium.com/media/f4f7aa05e925172e156c1444abf00e07/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eZztoadlXx9H3SbXMVx4Wg.png" /></figure><p>The above code and output show the predicted probability of each label for the data point: here the model is 64% sure that the output is 0 (Benign) and 43% sure that the output is 1 (Malignant). So when we print out y_pred we get two values, the probability of each label being the actual value. This can be quite confusing; to avoid it we use the argmax() function from the NumPy library. The argmax function returns the index of the maximum value: if the first value is the maximum it returns 0, and if the second value is the maximum it returns 1.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/76aba72430231480c2275dfdc1ca59d4/href">https://medium.com/media/76aba72430231480c2275dfdc1ca59d4/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5eJk97hOxhxgVYQP-ORUA.png" /></figure><p>Now we have converted the prediction probabilities to target labels.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4091cf11965e9722d0102e347d8d33ee/href">https://medium.com/media/4091cf11965e9722d0102e347d8d33ee/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8l2qvtfjF-d7EKKT89_Z_g.png" /></figure><p>Now we have successfully built a deep learning model that can predict whether a breast cancer tumor is benign or malignant.</p><p>To test whether your model is accurate, you could copy a data point from your dataset, pass it into input(), and run the code. You can view the code on my <a href="https://github.com/Oreoluwa1234/Breast-cancer-prediction-with-deep-learning">GitHub</a> portfolio.</p><p>Deep learning models may not be the best approach for this project, given their complexity.<br>It is recommended practice in machine learning to experiment with basic models before moving on to more complicated approaches like neural networks, which are the foundation of deep learning. So I recommend you try out some classical machine learning models with this dataset.</p><p>Thank you for reading all the way through; I hope you now have a clear understanding of this project. Don’t forget to read, learn, practice, clap and share.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=43aea8127ac8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Salary Prediction with Machine Learning (Part 2)]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/salary-prediction-with-machine-learning-part-2-cb707d8b8567?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/cb707d8b8567</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[salary-negotiations]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Sat, 12 Feb 2022 04:54:04 GMT</pubDate>
            <atom:updated>2022-02-12T05:24:19.689Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*AcZgjPN8B0J98CABXLqj9A.jpeg" /></figure><p>In my last <a href="https://gist.github.com/Oreoluwa1234/78ce0ac55aa72e3a808a109808faddfa">article</a>, I built a model that can predict the annual salaries of data scientists. In this article, we will be deploying that model to create a Machine Learning web app that can predict the annual salaries of data scientists.</p><p>Before creating the ML web app, we must save the model, and we do this by using the library called pickle. Pickle is the standard way of serializing objects in Python. You can use the pickle operation to serialize your machine learning algorithms and save the serialized format to a file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1798dbf448546d9b270b6aefc53da594/href">https://medium.com/media/1798dbf448546d9b270b6aefc53da594/href</a></iframe><p>After importing the pickle, we save the model, the label encoder for the country, and edlevel which is lb_country, lb_edlevel saved inside a dictionary. We then open a pickle file in the write binary mode “wb”, then dump the data into the file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1b838786f9b5391f7455f49a3fd43360/href">https://medium.com/media/1b838786f9b5391f7455f49a3fd43360/href</a></iframe><p>After running the code above, the pickle file(Saved_step.pkl) will be saved automatically on our google drive directory. We can check it again by loading it again in the read binary format “ rb”. We can access the model, the label encoder for the country, and edlevel which is lb_country, lb_edlevel by giving them a key.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/542e3e695ee429b8e75aa0fd2fa0a42f/href">https://medium.com/media/542e3e695ee429b8e75aa0fd2fa0a42f/href</a></iframe><p>Now that we are through with saving the model. Next thing to do is to deploy this model into a web app using streamlit.</p><p><strong>Streamlit is an open-source python framework for creating and sharing web apps and interactive dashboards for data science and machine learning projects</strong></p><p>Open the editor of your choice, I would be using visual studio code. Inside the Vscode, open a folder containing the dataset, colab notebook, and the pickle file we created (Saved_step.pkl).</p><p>Create three new files called app.py, predict_page.py for the prediction page, and explore_page.py. 
<p>Now that we are through with saving the model, the next thing to do is to deploy it as a web app using Streamlit.</p><p><strong>Streamlit is an open-source Python framework for creating and sharing web apps and interactive dashboards for data science and machine learning projects.</strong></p><p>Open the editor of your choice; I will be using Visual Studio Code. Inside VS Code, open a folder containing the dataset, the Colab notebook, and the pickle file we created (Saved_step.pkl).</p><p>Create three new files: app.py, predict_page.py for the prediction page, and explore_page.py for the explore page.</p><p>On the predict_page.py page, we import all the libraries used, which are streamlit, numpy, and pickle.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/68f464f11d00d170aa55e7bf1e463793/href">https://medium.com/media/68f464f11d00d170aa55e7bf1e463793/href</a></iframe><p>After running the above code, we write a function that loads the saved pickle file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/849679f368ce302693f94e7a39bc934a/href">https://medium.com/media/849679f368ce302693f94e7a39bc934a/href</a></iframe><p>Now we want to access the model and the label encoders for country and edlevel by their keys.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3aec6a54c1fd64bc3268ab1ff07aa673/href">https://medium.com/media/3aec6a54c1fd64bc3268ab1ff07aa673/href</a></iframe><p>Now, let’s create a function containing the Streamlit widgets.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/403fc2b27d3563622590890b47ed5f8b/href">https://medium.com/media/403fc2b27d3563622590890b47ed5f8b/href</a></iframe><p>In order to run this code, we go to our app file, import streamlit as st, and import show_predict_page.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0fde1265f5b8dce79bac6fc996801ca6/href">https://medium.com/media/0fde1265f5b8dce79bac6fc996801ca6/href</a></iframe><p>After doing all this, input “streamlit run app.py” in the VS Code terminal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/885/1*ZOxEWjPj4CgPFX27dJnzPw.png" /></figure><p>The above output is what is going to show in the browser.</p><p>The next things we are going to add to the predict page are two select boxes, for the countries and the education levels.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7f868ab7f7694516d7d5a68f6b39eaec/href">https://medium.com/media/7f868ab7f7694516d7d5a68f6b39eaec/href</a></iframe><p>The select box can take a list or a tuple; we will be using tuples here since the countries and education levels are stored as tuples.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a106f3111cf82c3fa19d134bb1fb9aa6/href">https://medium.com/media/a106f3111cf82c3fa19d134bb1fb9aa6/href</a></iframe><p>After doing all this, click save, go back to your browser, and click rerun.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/836/1*u6dRnMZxj4nYJ_3P7hzbdQ.png" /></figure><p>The above is the output it shows.</p><p>When you click on the country, it brings out the list of countries:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/770/1*o78tvQkwjfiwcd_-UgRZJw.png" /></figure><p>When you click on education, it brings out the list of education levels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/757/1*d19tOP_sYYEX2dwDU_I4dQ.png" /></figure><p>For the years of experience, we create a slider by calling the slider method and giving it a min_value, a max_value, and a default value.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d72b4e954f49dcee7b247455821bcec2/href">https://medium.com/media/d72b4e954f49dcee7b247455821bcec2/href</a></iframe>
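<p>Putting the widgets together, predict_page.py might look roughly like the sketch below. The tuple contents are shortened and the feature order passed to the model is an assumption; the gists carry the real code.</p><pre><code>import streamlit as st
import pickle
import numpy as np

def load_model():
    with open('Saved_step.pkl', 'rb') as file:
        return pickle.load(file)

data = load_model()
model = data['model']
lb_country = data['lb_country']
lb_edlevel = data['lb_edlevel']

def show_predict_page():
    st.title('Data Scientist Salary Prediction')

    countries = ('United States', 'Canada', 'Germany', 'Other')  # shortened
    education = ("Bachelor's degree", "Master's degree",
                 'Post grad', 'Less than a Bachelors')

    country = st.selectbox('Country', countries)
    edlevel = st.selectbox('Education Level', education)
    experience = st.slider('Years of Experience', 0, 50, 3)

    if st.button('Calculate Salary'):
        X = np.array([[country, edlevel, experience]])
        X[:, 0] = lb_country.transform(X[:, 0])
        X[:, 1] = lb_edlevel.transform(X[:, 1])
        X = X.astype(float)
        salary = model.predict(X)
        st.subheader(f'The estimated salary is ${salary[0]:,.2f}')
</code></pre>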
<p>Click save and rerun in your browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*JHjIJjJoovAA7JYkDK_VVA.png" /></figure><p>This is what it looks like when run.</p><p>Now let’s add a button to calculate the salary after all the required information has been filled in. We call the button method and assign it to a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/da531365cef539c7201f21e37c9d76e6/href">https://medium.com/media/da531365cef539c7201f21e37c9d76e6/href</a></iframe><p>Click save and rerun in the browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/327/1*PWy7YkTCJd41I1J4nMIFOg.png" /></figure><p>The above is the output after rerunning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/715/1*jC-YGPJUYsH_hVSQv8Ab7g.png" /></figure><p>You get the estimated salary after clicking the “Calculate Salary” button.</p><p>Now that we are done with the prediction page, let’s take an example where the country is Canada, the education level is a master’s degree, and there are 10 years of experience.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/817/1*Jx7wqq500oY27kxd-47aMQ.png" /></figure><p>When we click the calculate salary button we have:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*O1-oVi-gCGSzOPHJchI6wQ.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6ed498fc9b9f40bcca404af4888ebbe2/href">https://medium.com/media/6ed498fc9b9f40bcca404af4888ebbe2/href</a></iframe><p>If you followed all the steps above, your prediction page should have these contents in your browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*O1-oVi-gCGSzOPHJchI6wQ.png" /></figure><p>Now that we have the prediction page ready, we are going to create a sidebar and add the second page, called the explore page.</p><p>To create a sidebar, go to the app.py file and call the sidebar selectbox method, passing in a label as the first argument and the page options as the second.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bc6580959f364dfeb67480a95af119fc/href">https://medium.com/media/bc6580959f364dfeb67480a95af119fc/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/314/1*qpC6BXf9sVyXy567SMeMXw.png" /></figure><p>The above is the result.</p><p>The only thing left to do is to implement the explore page.</p><p>For that, import all the libraries.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61d9190a414223d21a4080435b8901a6/href">https://medium.com/media/61d9190a414223d21a4080435b8901a6/href</a></iframe><p>The reason we import the zipfile module is that our dataset is in ZIP format.</p><p>Now we are going to clean and load the data the same way we did in the notebook in the previous article. To do this, we will copy over all the functions we used to clean the data.</p><p>After cleaning and loading this data, we will apply all the transformations we did.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a7ac4e79eea77a45d84716d2ce2362ee/href">https://medium.com/media/a7ac4e79eea77a45d84716d2ce2362ee/href</a></iframe><p>To avoid having the data reloaded on every interaction, we are going to use a function decorator called st.cache. st.cache is a function decorator that helps to improve the speed and memory consumption of the app by caching the function’s result.</p>
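<p>A skeletal version of the cached loader and the first chart follows; the ZIP file name is assumed, and the cleaning helpers are the ones copied over from Part 1.</p><pre><code>import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

@st.cache
def load_data():
    # File name assumed; pandas can read a single-file ZIP directly
    df = pd.read_csv('survey_results_public.zip')
    # ...apply the same cleaning and transformations as in Part 1...
    return df

df = load_data()

def show_explore_page():
    st.title('Explore Data Scientist Salaries')

    # Pie chart of the number of data points per country
    counts = df['Country'].value_counts()
    fig, ax = plt.subplots()
    ax.pie(counts, labels=counts.index, autopct='%1.1f%%')
    ax.axis('equal')
    st.pyplot(fig)
</code></pre>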
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/86783bc6644477f04c8d05bbe47cf998/href">https://medium.com/media/86783bc6644477f04c8d05bbe47cf998/href</a></iframe><p>On the explore page we will be displaying three charts: a pie chart, a bar chart, and a line chart.</p><p>For the pie chart, we will plot the value counts of the countries. We do this by calling the value_counts() method and plotting the result as a pie chart using the matplotlib.pyplot library.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bce101a56d074251980b7484898499f5/href">https://medium.com/media/bce101a56d074251980b7484898499f5/href</a></iframe><p>To show everything we have done on the explore page in the web app, we add some changes to the app.py file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/825fd16fc6823411f5c925dfac61e2e1/href">https://medium.com/media/825fd16fc6823411f5c925dfac61e2e1/href</a></iframe><p>Click save and rerun the browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/854/1*KpMWx-KG4dQWQLjmZSJ_Iw.png" /></figure><p>Let’s plot the next chart, the bar chart.</p><p>For the bar chart, we are going to plot the mean salary by country.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/02f08033f13f153db68a24423dfca7d4/href">https://medium.com/media/02f08033f13f153db68a24423dfca7d4/href</a></iframe><p>Click save and rerun.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/717/1*KRgEnYZqEoZ9XkLNr5l6fg.png" /></figure><p>Now we see the mean salary for each country.</p><p>The last chart is the line chart.</p><p>We will plot the mean salary by years of experience.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/802be67d74e863cf17de08c8d4b3f07a/href">https://medium.com/media/802be67d74e863cf17de08c8d4b3f07a/href</a></iframe><p>Save and rerun the explore page.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/760/1*btFrim4ffs1i28C5cGq-cw.png" /></figure><p>Now we are through deploying our model.</p><p>At the end of the deployment, this is how the web app should look:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4in7mPBsh1B2F3Xdj5lIgA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*Knx3rHmoZX_AJC4GUIBkJQ.png" /></figure><p>This is the link to my <a href="https://share.streamlit.io/oreoluwa1234/salary-prediction-project/main/app.py">web app</a>, and you can also view the code on my <a href="https://github.com/Oreoluwa1234/Salary-Prediction-Project">GitHub</a> portfolio.</p><p>Thank you for reading all the way through; I hope you now have a clear understanding of this project. Don’t forget to read, learn, practice, and clap under the article.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cb707d8b8567" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Salary Prediction with Machine Learning (Part 1).]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/salary-prediction-with-machine-learning-part-1-d88364ed7d6b?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/d88364ed7d6b</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[salary-negotiations]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Sat, 05 Feb 2022 00:02:59 GMT</pubDate>
            <atom:updated>2022-02-05T08:55:32.736Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*AcZgjPN8B0J98CABXLqj9A.jpeg" /><figcaption>Image by <a href="https://www.itsecurityguru.org/2018/05/09/can-machine-learning-complement-existing-security-solutions/">IT security guru</a></figcaption></figure><p>Data Science is a very broad industry that has birthed many other recent data roles such as data analysis, machine learning engineering, data engineering, analytics engineering, and a few others. While some people have these roles well defined, others work across many of these branches without even knowing.</p><p>I recently stumbled upon a dataset that contains details of data scientists’ earnings/salaries across some countries, based on their education level and years of experience, so I thought it would be interesting to explore.</p><p>This article will be giving details of the project on data scientists’ annual salary predictions, which I worked on.</p><p>Prerequisites to understand this project include :</p><ul><li>Basic knowledge in Python programming</li><li>An understanding of data science</li></ul><p>The whole process is broken down into 4 stages;</p><ul><li>Data Collection.</li><li>Data Preprocessing</li><li>Model Building</li><li>Model Deployment</li></ul><p><strong>Data Collection</strong>: Data salaries are not easily available as HR personnel claims they are proprietary. Therefore, we resorted to using the publicly available data from Stack Overflow Annual Developer Survey. Here is the link</p><p><a href="https://insights.stackoverflow.com/survey">Stack Overflow</a></p><p><strong>Data Cleaning and preprocessing</strong>: The first step in data cleaning and preprocessing is importing the libraries and dataset. A python library is a collection of related modules that can be called and used. I would be using four main libraries which are<strong> pandas </strong>(for data analysis), <strong>Numpy (for </strong>numerical operations),<strong> seaborn (</strong>for data visualization and exploratory<strong> </strong>data analysis), and <strong>matplotlib.pyplot </strong>( for data visualization and graphical<strong> </strong>plotting). 
These libraries can be called and used with the help of the “import” keyword.</p><p>Importing all the necessary libraries</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ad82678358e785279b4541d82d55863d/href">https://medium.com/media/ad82678358e785279b4541d82d55863d/href</a></iframe><p>Import and load the dataset from the drive: since I used Google Colab, I had to import and load the dataset from the drive.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8fc6d03faa5153a756da96d70d5148a6/href">https://medium.com/media/8fc6d03faa5153a756da96d70d5148a6/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VORaJeIL64NqYjexz0MsmQ.png" /></figure><p>The above shows a sample of the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/eafa84981a5e482dcef04b8d396f3e02/href">https://medium.com/media/eafa84981a5e482dcef04b8d396f3e02/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e4b69f771bf5c844194f61e34b46eeb9/href">https://medium.com/media/e4b69f771bf5c844194f61e34b46eeb9/href</a></iframe><p>The dataset above contains 64,461 rows and 61 columns.</p><p>Let’s start cleaning!</p><p><strong>Selecting and keeping the columns/features needed for the prediction</strong>: When building a machine learning model in real life, feature selection is very important, because it is almost never the case that all the features in the dataset are useful for building the model. So we select only the few columns needed for the prediction, in order not to bother the user with having to fill in too much unnecessary information. The columns are Country, EdLevel (the education level), YearsCodePro (the number of years of professional experience), Employment (full-time or part-time), and ConvertedComp (the annual salary in dollars); this last feature is still going to be renamed.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c56454c7326eaaaae7176be82e488fef/href">https://medium.com/media/c56454c7326eaaaae7176be82e488fef/href</a></iframe><p><strong>Dealing with missing values</strong>: I will only be using the rows where the salary is available, so I will drop the rows with a NaN salary.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/283958c77da09b2deaed20bb41cf9394/href">https://medium.com/media/283958c77da09b2deaed20bb41cf9394/href</a></iframe><p>Let’s take a quick look at the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/24caf532aedd6e6b1e96cb9796f2dad7/href">https://medium.com/media/24caf532aedd6e6b1e96cb9796f2dad7/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/308/1*Q7UdIvMrAAfTUGzvsaQV6Q.png" /></figure><p>Here we see that we have 34,025 data entries; three columns are objects, which means they are strings, and only the salary column is a float. So we drop the remaining rows that still contain missing values.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/22ff758c8b1f9c30a474b981260d4546/href">https://medium.com/media/22ff758c8b1f9c30a474b981260d4546/href</a></iframe><p>I dropped the Employment column since it wasn’t really needed for the prediction.</p>
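<p>In sketch form (the gists carry the exact code), the selection and missing-value steps look roughly like this; the renamed column name is assumed from the description:</p><pre><code># Keep only the features needed for the prediction
df = df[['Country', 'EdLevel', 'YearsCodePro', 'Employment', 'ConvertedComp']]
df = df.rename(columns={'ConvertedComp': 'Salary'})

# Keep the rows where a salary was reported, then drop remaining gaps
df = df[df['Salary'].notnull()]
df = df.dropna()

# The Employment column is not needed for the prediction
df = df.drop('Employment', axis=1)
</code></pre>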
<p>Let’s take a quick look at our dataset again</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/10f89dd2b5a1cbbaf66276105daae149/href">https://medium.com/media/10f89dd2b5a1cbbaf66276105daae149/href</a></iframe><p>Now we will clean each of the columns, starting with the country data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/97af417d08d78ba67476679dd13c430d/href">https://medium.com/media/97af417d08d78ba67476679dd13c430d/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c8ec47cea08620e5100a7b87a42ef6b4/href">https://medium.com/media/c8ec47cea08620e5100a7b87a42ef6b4/href</a></iframe><p>The value_counts() function in pandas returns a series containing the counts of unique values, in descending order, so that the first element is the most frequently occurring one. Here we see that the U.S.A. has the most data, and that some countries have only one data point each; we will get rid of those because our model cannot learn from a single data point.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9de7062736af7456e3b7248120d43993/href">https://medium.com/media/9de7062736af7456e3b7248120d43993/href</a></iframe><p>We will clean the country column with the function above, named “shorten_categories”, after fixing a cut-off value. If the number of data points for a country is greater than the cut-off value, we keep it; otherwise we combine it into a new category called “Other”.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c3504e2c24f7ce45a4b2a4dc6dd89e42/href">https://medium.com/media/c3504e2c24f7ce45a4b2a4dc6dd89e42/href</a></iframe><p>After running the above, we discover that the new category we created now has the most data points.</p><p>I would like to look at the relationship between the salary column and the country column by plotting a boxplot.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/099b2e2106e0464177e1b673b8b7ca85/href">https://medium.com/media/099b2e2106e0464177e1b673b8b7ca85/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/803/1*cgIXTeqE2llQqB7URWV1DA.png" /></figure><p>From the plot we can see that we have a lot of outliers. So we will keep the data where we have the most information, by keeping salaries that are less than or equal to $250,000 and greater than or equal to $10,000, and dropping the “Other” category.</p><p>Let’s plot it again</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b50b1deff6d5f754d8a9bc3e1645fe69/href">https://medium.com/media/b50b1deff6d5f754d8a9bc3e1645fe69/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/771/1*6J5SBBrNlKVGJblN6hwf3g.png" /></figure><p>We can see that the outliers have been reduced.</p>
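<p>As a recap, a sketch of the country and salary cleaning described above (the salary thresholds come from the steps above, but the cut-off of 400 data points is an assumption):</p><pre>
def shorten_categories(categories, cutoff):
    # keep a country if it has more data points than the cutoff,
    # otherwise fold it into a new 'Other' category
    category_map = {}
    for i in range(len(categories)):
        if categories.values[i] &gt; cutoff:
            category_map[categories.index[i]] = categories.index[i]
        else:
            category_map[categories.index[i]] = "Other"
    return category_map

# the cut-off value of 400 is an assumption; the article only fixes "a cut-off value"
country_map = shorten_categories(df["Country"].value_counts(), 400)
df["Country"] = df["Country"].map(country_map)

# keep salaries between $10,000 and $250,000 (inclusive) and drop 'Other'
df = df[df["Salary"].between(10000, 250000)]
df = df[df["Country"] != "Other"]
</pre>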
<h4>Cleaning the YearsCodePro feature</h4><p>The unique() function in pandas returns the unique values of a series, in order of appearance.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/af845a88c03e2879af1b11c157398118/href">https://medium.com/media/af845a88c03e2879af1b11c157398118/href</a></iframe><p>After running this, we discover that all the values come out as strings. For the computer to understand these, we convert them to floats: if a value says less than a year, it returns 0.5; if it says more than 50 years, we ascribe 50; otherwise, we convert it to a float directly. We do this by creating a function called clean_experience.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/88445b67e3625c86692d3a9576e1263f/href">https://medium.com/media/88445b67e3625c86692d3a9576e1263f/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ac40e33b1c26a02deac6f746c37b4698/href">https://medium.com/media/ac40e33b1c26a02deac6f746c37b4698/href</a></iframe><p>After running the above, we see that the values now come out as numbers.</p><h4>Cleaning the EdLevel feature</h4><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5e6d99e6138ddea95a35a82df0b295fd/href">https://medium.com/media/5e6d99e6138ddea95a35a82df0b295fd/href</a></iframe><p>Here, we have many different education levels. We will be focusing on Bachelor’s, Master’s, and other postgraduate degrees; anything apart from these will be called “Less than a Bachelor’s”.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f333bc35af4657e45da07a366897fd38/href">https://medium.com/media/f333bc35af4657e45da07a366897fd38/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/69a03234010861512c8a36aadd6d5b53/href">https://medium.com/media/69a03234010861512c8a36aadd6d5b53/href</a></iframe><p>After running the code above, we see that we have only five outputs.</p>
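<p>A sketch of both cleaning functions (the exact answer strings matched, such as “More than 50 years”, are assumptions based on the survey’s wording):</p><pre>
def clean_experience(x):
    # the survey stores experience as strings such as 'Less than 1 year'
    if x == "More than 50 years":
        return 50
    if x == "Less than 1 year":
        return 0.5
    return float(x)

df["YearsCodePro"] = df["YearsCodePro"].apply(clean_experience)

def clean_education(x):
    # keep Bachelor's, Master's and other postgraduate degrees;
    # everything else becomes 'Less than a Bachelor's'
    if "Bachelor" in x:
        return "Bachelor’s degree"
    if "Master" in x:
        return "Master’s degree"
    if "Professional degree" in x or "doctoral" in x:
        return "Post grad"
    return "Less than a Bachelor’s"

df["EdLevel"] = df["EdLevel"].apply(clean_education)
</pre>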
<p>Now, we are almost done with the data cleaning.</p><p>As we know, the model does not understand strings, and we still have columns containing strings! It is therefore necessary to transform the string values into unique numbers. To do this, we will be using LabelEncoder. Label encoding is part of data preprocessing, so we will use the preprocessing module from the sklearn package and import LabelEncoder from it.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/33a80945ea7978acdc4c7f923dc4c13f/href">https://medium.com/media/33a80945ea7978acdc4c7f923dc4c13f/href</a></iframe><p>Create an instance of LabelEncoder() and store it in a variable, lb_edlevel.</p><p>Apply fit_transform, which does the trick of assigning a numerical value to each categorical value, and store the result back in the “EdLevel” column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/aadafbe6c893e80ee4c7695678744da6/href">https://medium.com/media/aadafbe6c893e80ee4c7695678744da6/href</a></iframe><p>Let’s take a look at the EdLevel</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/226329fd2ea29511ede2a93b14c08ae4/href">https://medium.com/media/226329fd2ea29511ede2a93b14c08ae4/href</a></iframe><p>We no longer have strings here; the LabelEncoder has transformed the EdLevel column into integers, which the model can now understand. We will do the same for the country column.</p><p>Create an instance of <strong>LabelEncoder()</strong> and store it in a variable, lb_country.</p><p>Apply fit_transform to assign a numerical value to each categorical value, and store the result back in the “Country” column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/cfbd446bae8fc4d1a9290ebec4126db5/href">https://medium.com/media/cfbd446bae8fc4d1a9290ebec4126db5/href</a></iframe><p>Let’s have a look at the unique values of the country column</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/95c2244ba7eb7822cf88cb252c9c60e3/href">https://medium.com/media/95c2244ba7eb7822cf88cb252c9c60e3/href</a></iframe><p>Now the label encoder has given each country a unique integer value.</p><p>Let’s check the dataset</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/92187f9110be256fdcad416fc3bbb079/href">https://medium.com/media/92187f9110be256fdcad416fc3bbb079/href</a></iframe><p>Now, our data is ready for training and testing.</p><p><strong>Data Splitting</strong>: Data splitting is commonly used in machine learning to divide data into train, test, or validation sets. This approach allows us to estimate the model’s performance. Here, we will only be using a train set and a test set.</p><p>We will split our data into X and y: X will contain the features, and y will contain the target, the salary, which is dropped from the features.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c711dca19248e492212a61259b02cabe/href">https://medium.com/media/c711dca19248e492212a61259b02cabe/href</a></iframe>
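<p>A sketch of the encoding and the feature/target split (the variable names follow the ones mentioned above):</p><pre>
from sklearn import preprocessing

# encode the education levels as integers
lb_edlevel = preprocessing.LabelEncoder()
df["EdLevel"] = lb_edlevel.fit_transform(df["EdLevel"])

# encode the countries as integers
lb_country = preprocessing.LabelEncoder()
df["Country"] = lb_country.fit_transform(df["Country"])

# X holds the features; y holds the target (the salary)
X = df.drop("Salary", axis=1)
y = df["Salary"]
</pre>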
<p>After doing that, we split the X and y datasets into train and test sets. To do this, we use the train_test_split function.</p><p>train_test_split is <strong>a function in sklearn.model_selection for splitting data arrays into two subsets</strong>: one for training and one for testing. With this function, you don’t need to divide the dataset manually; by default, it makes random partitions for the two subsets.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/488cd6b6cc8864f27c5bafe1993487ac/href">https://medium.com/media/488cd6b6cc8864f27c5bafe1993487ac/href</a></iframe><p>We will train on 70% of the dataset and test on 30%, with the random_state set to 42. The random state ensures that the same split is generated every time the code is run.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1ff10fc1f56fe0a03efad5c11d42eb1f/href">https://medium.com/media/1ff10fc1f56fe0a03efad5c11d42eb1f/href</a></iframe><p>So it’s time to build our model!!!</p><p>Three different algorithms will be used to build the model, and we will pick the one with the least error.</p><p>We start with linear regression, a basic and commonly used algorithm for predictive analysis, by importing it from sklearn.linear_model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/24b4bd66bb5abcb83443260ce0137cf1/href">https://medium.com/media/24b4bd66bb5abcb83443260ce0137cf1/href</a></iframe><p>Create an instance of <strong>LinearRegression()</strong>, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/662ee76d6b7450a05dbcc1278da98510/href">https://medium.com/media/662ee76d6b7450a05dbcc1278da98510/href</a></iframe><p>In regression predictive modeling, we use error metrics to measure model performance. The error metric we will be using is the RMSE, the root mean squared error. It expresses the typical difference between the predicted values and the actual values.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/816db2c8415e2c3f33dbe34863e0d095/href">https://medium.com/media/816db2c8415e2c3f33dbe34863e0d095/href</a></iframe><p>Our output is shown below</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1e715fd27f2d4c3927b16d8c9453e642/href">https://medium.com/media/1e715fd27f2d4c3927b16d8c9453e642/href</a></iframe><p>We can see that the error between the actual and predicted values using the LinearRegression algorithm is $39,558.79, which is very high.</p>
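<p>Putting the split, the model, and the metric together, a sketch along these lines:</p><pre>
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 70% for training, 30% for testing, with a fixed random_state of 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)

# RMSE: the square root of the mean squared error
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"${error:,.2f}")
</pre>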
<p>Let’s try the DecisionTreeRegressor algorithm.</p><p>Import DecisionTreeRegressor from sklearn.tree, create an instance of it, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a9995cab4e855af2389b003608054a91/href">https://medium.com/media/a9995cab4e855af2389b003608054a91/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f29f3b5609e73cca05b8a044ff2fbb4c/href">https://medium.com/media/f29f3b5609e73cca05b8a044ff2fbb4c/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b56335fd62fa12ea32ac080dd116b6b2/href">https://medium.com/media/b56335fd62fa12ea32ac080dd116b6b2/href</a></iframe><p>The error between the actual and predicted values using DecisionTreeRegressor is $33,962.56, which is still a little high.</p><p>Let’s try the random forest regression algorithm.</p><p>Import RandomForestRegressor from sklearn.ensemble, create an instance of it, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/70182692ecede19892e328ef78918e7e/href">https://medium.com/media/70182692ecede19892e328ef78918e7e/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/80f3c012652c7271db981061458c43a0/href">https://medium.com/media/80f3c012652c7271db981061458c43a0/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/58dacc9a949e1f241a10d6cd8d866095/href">https://medium.com/media/58dacc9a949e1f241a10d6cd8d866095/href</a></iframe><p>Finally, the RandomForestRegressor algorithm gives us the least error. Now we want to find the best parameters for our model using GridSearchCV.</p><p>Grid search is the process of performing hyperparameter tuning to determine the optimal values for a given model. This matters because the performance of the entire model depends on the hyperparameter values specified, and grid search is a useful tool for fine-tuning them.</p><p>The way it works: import GridSearchCV from sklearn.model_selection, define the set of parameter values to try, and create a parameter dictionary keyed by the keyword arguments of RandomForestRegressor (you can check the documentation for those). Then create an instance of the regressor algorithm being used, and an instance of GridSearchCV containing the regressor, the parameter dictionary, and the scoring. Lastly, fit the GridSearchCV instance to the training dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7a94a9a4780c16e3dc939d8be7eea05e/href">https://medium.com/media/7a94a9a4780c16e3dc939d8be7eea05e/href</a></iframe>
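<p>A sketch of that grid search (the parameter grid here, tuning only max_depth, is an assumption; any RandomForestRegressor keyword argument could be searched):</p><pre>
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# hypothetical parameter grid; the exact values searched are assumptions
parameters = {"max_depth": [None, 2, 4, 6, 8, 10, 12]}

regressor = RandomForestRegressor(random_state=42)
gs = GridSearchCV(regressor, parameters, scoring="neg_mean_squared_error")
gs.fit(X_train, y_train)

model = gs.best_estimator_
</pre>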
<p>After running the above, we get the best estimator and store it in a variable called model. Then we fit it on our training dataset and use it to predict on our test dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1b03159e1f62580eb983ee33350938d4/href">https://medium.com/media/1b03159e1f62580eb983ee33350938d4/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/46b022df7bc64f709b81d867164e94fc/href">https://medium.com/media/46b022df7bc64f709b81d867164e94fc/href</a></iframe><p>Following this, the error has reduced a bit, from $33,617.45 to $32,911.09, which is fair.</p><p><strong>Making a predictive system</strong></p><p>For instance, suppose a user inputs their country as United States, their EdLevel as Master’s degree, and their YearsCodePro as 15 years.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f53216c17062cc2cdce0231f3e6cd700/href">https://medium.com/media/f53216c17062cc2cdce0231f3e6cd700/href</a></iframe><p>Below is the result for the user’s annual salary.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7b7ded95f410b472fb54f83e59764fb2/href">https://medium.com/media/7b7ded95f410b472fb54f83e59764fb2/href</a></iframe><p>In conclusion, we have seen a step-by-step approach to building the model for our salary prediction web app. In my next article, I will be sharing how to deploy this model.</p>]]></content:encoded>
        </item>
    </channel>
</rss>