<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Babatunde Oreoluwa on Medium]]></title>
        <description><![CDATA[Stories by Babatunde Oreoluwa on Medium]]></description>
        <link>https://medium.com/@babatundeoreoluwa35?source=rss-34103bd9ac9b------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*9q6Tt9K4XegkxMIBSFvcDg.jpeg</url>
            <title>Stories by Babatunde Oreoluwa on Medium</title>
            <link>https://medium.com/@babatundeoreoluwa35?source=rss-34103bd9ac9b------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:15:22 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@babatundeoreoluwa35/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Image Classification Using Transfer Learning: Crop Disease Classification]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/image-classification-using-transfer-learning-crop-disease-classification-dc4dedab17cb?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/dc4dedab17cb</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[image-classification]]></category>
            <category><![CDATA[computer-vision]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Sat, 25 Jun 2022 09:32:01 GMT</pubDate>
            <atom:updated>2022-06-25T09:32:01.018Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*090S-tkAAx5SoI4jodfoRw.jpeg" /><figcaption>Image by<a href="https://analyticsindiamag.com/7-types-classification-algorithms/"> analyticsindia</a></figcaption></figure><p>Artificial intelligence has numerous interesting branches, we will be discussing a branch of AI which is called Computer Vision.</p><p>Computer vision (CV) is a branch of artificial intelligence (AI) that allows computers and systems to extract useful information from digital photos, videos, and other visual inputs, as well as to make meaningful decisions based on those data.</p><p>Computer vision has numerous subfields which include Image classification, Image segmentation, scene reconstruction, object detection, event detection, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, visual servoing, and 3D scene modeling, and image restoration.</p><p>In this piece, we shall be focusing on a subfield in CV known as Image classification. Image classification is the task of assigning a label to an image from a predefined set of categories. In practice, this implies that we must analyze an input image and then produce a label that categorizes it. The label is always chosen from a predetermined set of options.</p><p>In this article, we would be building a model using transfer learning(pre-trained models)to classify if a plant has been affected by a fall armyworm using the images of the plant. The data source for this task is the <a href="https://zindi.africa/competitions/makerere-fall-armyworm-crop-challenge/data">Makerere Fall Armyworm Crop Challenge data</a> on Zindi.</p><p>The data for this project has a train.CSV file that contains the 1,619train images name and Labels, a test.CSV file that contains the 1,080 image names only, and the Images folder which contains the 2,699 images for the train CSV and test CSV. I would be using Google colab as the IDE for this project. After downloading and uploading the data on your drive, you mount the drive.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ed6a9b61be06db87ff35c12c6d153da8/href">https://medium.com/media/ed6a9b61be06db87ff35c12c6d153da8/href</a></iframe><p>Then import all the necessary libraries. For this project, we would be using the Tensorflow and Keras framework. I choose this framework because it is easy to use, flexible and both have simpler APIs.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/195566b63efe6569e450beac56015701/href">https://medium.com/media/195566b63efe6569e450beac56015701/href</a></iframe><p>The libraries are now imported, it’s time to load the data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b1a02f2ed50b9e251104f806d5a74ebb/href">https://medium.com/media/b1a02f2ed50b9e251104f806d5a74ebb/href</a></iframe><p>Let’s check what the data looks like;</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3b7a1db7983aeb1800b91020a4ad95d/href">https://medium.com/media/a3b7a1db7983aeb1800b91020a4ad95d/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/318/1*pRMHS2OKF25BF9KPV4jPgA.png" /></figure><p>We can see that the train images CSV file contains the Image_id and the labels. 
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4844dc6e8b795f730dee75922577cf3e/href">https://medium.com/media/4844dc6e8b795f730dee75922577cf3e/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/239/1*2OZnYpf-nqluwxLE_iSbiw.png" /></figure><p>The test data does not have a Label column because that is what we will be classifying after building the model. Let’s give a variable name to the path of the image directory.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2760f6d57e9f0be731810c28378f0acb/href">https://medium.com/media/2760f6d57e9f0be731810c28378f0acb/href</a></iframe><p>Now, we will merge images_path with train['Image_id'] so that the train['Image_id'] column in the CSV file holds the full image path.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5cabb30c66c6f6f57752a8bbd345921a/href">https://medium.com/media/5cabb30c66c6f6f57752a8bbd345921a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*LqxrVdXhyDHsdy2fX_t_Sg.png" /></figure><p>The train['Image_id'] column of the train CSV file now holds the full paths of all 1,619 train images. We do the same for the test CSV.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/935a39c6827aaa544e5b8b81342f7df1/href">https://medium.com/media/935a39c6827aaa544e5b8b81342f7df1/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/394/1*eAiwbW_4I-ZLoXnjVlPt4g.png" /></figure><p>Let’s now preprocess the data for modelling. As I said earlier, we will be using the tf.keras framework for this project, and it offers several ways of loading and augmenting data for a model. Here, we will use the <a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator"><strong>ImageDataGenerator</strong></a> to augment the images and <strong>flow_from_dataframe</strong> to load the data. After rescaling all the images to a fixed size, we then augment them.</p><p>Note:<strong> It is not compulsory to augment image data; we only do so to improve the performance and outcome of the model by adding new and different examples to the train and validation datasets.</strong></p><p>When you use the <strong>flow_from_dataframe</strong> data loader, pass in the DataFrame, the image ID column as x_col, the target as y_col, and the target_size of the images, along with class_mode, subset, seed, and batch_size. There are other arguments you can pass in depending on what you want to do.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0c441676044bfa770fa16f5a7b6d4041/href">https://medium.com/media/0c441676044bfa770fa16f5a7b6d4041/href</a></iframe><p>For the test data loader we set y_col to <strong>None</strong> because that is what we are predicting.</p>
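<p>For readers who cannot see the gist, the loaders just described could look roughly like this. The augmentation settings, the 20% validation split, and the seed are assumptions for illustration.</p><pre><code>from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 'train' and 'test' are the DataFrames loaded earlier, with full image paths
train['Label'] = train['Label'].astype(str)  # categorical class_mode expects strings

train_datagen = ImageDataGenerator(rescale=1./255,
                                   horizontal_flip=True,
                                   zoom_range=0.2,
                                   validation_split=0.2)

train_gen = train_datagen.flow_from_dataframe(
    train, x_col='Image_id', y_col='Label', target_size=(224, 224),
    class_mode='categorical', subset='training', seed=42, batch_size=32)

val_gen = train_datagen.flow_from_dataframe(
    train, x_col='Image_id', y_col='Label', target_size=(224, 224),
    class_mode='categorical', subset='validation', seed=42, batch_size=32)

# The test loader has no labels, so y_col=None and class_mode=None
test_datagen = ImageDataGenerator(rescale=1./255)
test_gen = test_datagen.flow_from_dataframe(
    test, x_col='Image_id', y_col=None, target_size=(224, 224),
    class_mode=None, shuffle=False, batch_size=32)
</code></pre>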
<p>Let’s build our model!</p><p>In image classification, training a deep learning model from scratch is computationally expensive and requires a large amount of data to achieve high performance. Using a pre-trained model instead is what we call <strong>Transfer Learning</strong>. Transfer Learning is a method in machine learning where a model developed for one task is reused as the starting point for a model on another, similar task. It is computationally efficient and helps achieve better results with a small amount of data. For this project, we are using a pre-trained model to classify each plant image as infected or uninfected.</p><p>We have several pre-trained models for image classification; examples are VGG16, VGG19, InceptionV3, ResNet50, and ResNetV2. These models have been trained on millions of images, which helps them perform well on our data.</p><p>Let’s import the pre-trained model we are using.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b965f39e8a3fe19456418a2e40761730/href">https://medium.com/media/b965f39e8a3fe19456418a2e40761730/href</a></iframe><p>We will be using VGG19 as our pre-trained model. VGG19 is a variant of the VGG model with 19 weight layers (16 convolutional layers and 3 fully connected layers), along with 5 max-pooling layers and a softmax output layer.</p><p>Now that the pre-trained model is imported, we will not load its output layer, because VGG19 was originally trained on the ImageNet database, which contains over a million images across 1,000 classes. Since we are working on binary image classification, we discard that 1,000-class head by setting the include_top argument to False and add our own output layer. VGG19 takes input images of size 224x224.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0f033357a8e54ff4b683cf2f2273526f/href">https://medium.com/media/0f033357a8e54ff4b683cf2f2273526f/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d13cb8b848e6a9dd3e34f10e5c8aea98/href">https://medium.com/media/d13cb8b848e6a9dd3e34f10e5c8aea98/href</a></iframe><p>In the code above we set the trainable attribute to False because we don’t want training on our data to overwrite the VGG19 weights, so we freeze the weights of the pre-trained model.</p><p>Let’s add our output layer.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8521110e8ca0060afdefdf1c1e278821/href">https://medium.com/media/8521110e8ca0060afdefdf1c1e278821/href</a></iframe><p>Let’s compile the model using <strong>Adam</strong> as the optimizer, <strong>binary_crossentropy</strong> as the loss (because this is binary image classification), and <strong>accuracy</strong> as the metric.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e2470a7744be7b09eeabbc60c52b8d3d/href">https://medium.com/media/e2470a7744be7b09eeabbc60c52b8d3d/href</a></iframe><p>Let’s check the model summary.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d09aaa3c5ece49677750e1d7e2752f4f/href">https://medium.com/media/d09aaa3c5ece49677750e1d7e2752f4f/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/315/1*7JHEXNwrnbCu7__P0hkX5A.png" /></figure><p>The image above is the output of <strong>model.summary()</strong>; it lists the layers of the model. We can also see that the model has 26,447,682 parameters in total, of which 6,423,298 are trainable. The remaining 20,024,384 are non-trainable parameters, which are the weights we froze.</p>
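<p>Here is a compact sketch of the model definition described above. The head (a Flatten layer, a 256-unit ReLU layer, and a 2-unit softmax output) is a reconstruction chosen so that the trainable parameter count matches the summary shown above; the exact head lives in the embedded gist.</p><pre><code>from tensorflow.keras.applications import VGG19
from tensorflow.keras import layers, models

# Load VGG19 without its 1000-class ImageNet head and freeze its weights
base_model = VGG19(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))
base_model.trainable = False

# Attach a small classification head: 2 outputs (uninfected / infected)
model = models.Sequential([
    base_model,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(2, activation='softmax'),
])

# binary_crossentropy loss and accuracy metric, as described in the article
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
</code></pre>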
<p>Let’s now train the model!</p><p>To train the model we pass in the train data, the validation data, the number of epochs, steps_per_epoch (the number of unique samples in your dataset divided by the batch size), and verbose, which controls how much output the model shows while training.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/cca251c0398759af6752a8a82b8f4fa0/href">https://medium.com/media/cca251c0398759af6752a8a82b8f4fa0/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/518/1*D_A-J0qVaSHXBlqwhEiSLQ.png" /></figure><p>Wow! The accuracy of our model is 98.4%. Let’s now create a submission CSV file by using the model to classify the images in the test data set.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/340ec034eca611a57cb368cd281a0b82/href">https://medium.com/media/340ec034eca611a57cb368cd281a0b82/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/387/1*kLNP8OH7Fb6ZPf0ZnnY57Q.png" /></figure><p>When we print out the predictions, we get two values per image, the probability of each label being the actual value. This can be quite confusing; to avoid it we use the argmax() function from the NumPy library. The argmax function returns the index of the maximum value: if the first value is the maximum it returns 0, and if the second value is the maximum it returns 1.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d31a8508d100967cec1adf47eb2edf49/href">https://medium.com/media/d31a8508d100967cec1adf47eb2edf49/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/987/1*LibVJOKn2xA_0Zn9U5ac6Q.png" /></figure><p>Now we have converted the prediction probabilities to target labels.</p><p>The submission CSV contains the image IDs from the test CSV file; let’s read the submission CSV file and create a Label column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0f78f6dc990d0cc6bd053dbc0ff50c21/href">https://medium.com/media/0f78f6dc990d0cc6bd053dbc0ff50c21/href</a></iframe><p>When this code is done executing, a new CSV file called my_submission.csv will be created on your drive. This is what you will download and upload as your submission.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/180/1*WBv_8lXAJIUCVAwv7NW_4Q.png" /></figure><p>Finally, we have a model that can predict whether a plant is infected or uninfected. I recommend you try out other pre-trained models and compare their accuracy scores.</p><p>For more clarification on this project, check out my <a href="https://github.com/Oreoluwa1234/Zindi-Makerere-crop-disease-Prediction">GitHub portfolio</a>.</p><p>Thank you for reading all the way through; I hope this article is useful and of benefit to you. Don’t forget to read, practice, learn, clap and share.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dc4dedab17cb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A step-by-step approach to building a machine learning model]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/a-step-by-step-approach-to-building-a-machine-learning-model-396f07dc571c?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/396f07dc571c</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[improvement]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Thu, 28 Apr 2022 11:08:37 GMT</pubDate>
            <atom:updated>2022-04-28T12:56:32.028Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/553/1*mVwkX0wCJQkNlmc38mLN1w.jpeg" /><figcaption>Image by <a href="https://unsplash.com/s/photos/machine-learning">Unsplash</a></figcaption></figure><p>Machine learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing systems that learn from the data they consume to improve their performance. Most AI applications run on ML models, and many beginners are unaware that there is a step-by-step process for creating an ML model.</p><p>We’ll go over the process for creating a machine learning model in this article. The following is a step-by-step approach to building one:</p><p><strong>(1) Understanding the problem: </strong>Understanding the problem is the first step in creating a machine learning model; once the problem is understood, it gives us a structured way to solve it.</p><p><strong>(2) Data collection: </strong>The practice of collecting and acquiring data from a variety of sources is known as data collection. Data must be collected and kept in a form that makes sense for the problem at hand to be used to develop viable artificial intelligence (AI) and machine learning solutions. A machine learning model is only as good as the data used to train it. Several websites serve as sources of data for ML projects, e.g. <a href="https://www.kaggle.com/datasets">Kaggle</a>, <a href="https://zindi.africa/competitions">Zindi</a>, and the <a href="https://archive.ics.uci.edu/ml/datasets.php">UCI Machine Learning Repository</a>.</p><p><strong>(3) Data preprocessing: </strong>Since we can’t work with raw data, we must transform it into an understandable format by preprocessing it. There are different preprocessing methods for different data types. Data preprocessing is considered one of the crucial phases in developing a machine learning model because it prepares the data in the most meaningful way for the subsequent data modeling.</p><p><strong>(4) Data modeling: </strong>The data is ready for training and testing at this point, so we can now select a model and train it on the data. When it comes to selecting a model, you can choose from a variety of model families based on your data: classification, regression, clustering, and other methods. Several algorithms and techniques are used in the stage of training and testing the model (a brief code sketch follows this list).</p><p><strong>(5) Model evaluation: </strong>The outcome of the model can be used to evaluate it. Model evaluation is done using metrics such as accuracy score, Root Mean Square Error (RMSE), the confusion matrix, the classification report, Mean Square Error (MSE), and so on, to check the quality of the model. This stage ensures that the machine learning model performs well.</p><p><strong>(6) Model improvement: </strong>After evaluating your model’s performance with some metrics, there is room for improvement if the model is not performing as expected. This stage is therefore an optional one.</p><p><strong>(7) Model deployment: </strong>The model is now ready to be put into production to see how it performs in the real world. It could be deployed as a web app or in any other form you wish.</p><p>Finally, these are the steps needed to build a machine learning model.</p>
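<p>To make these steps concrete, here is a minimal, illustrative sketch of steps (3) to (5) using scikit-learn. The file name, target column, and choice of algorithm are placeholders, not a prescription.</p><pre><code>import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 3: preprocess (here, just drop rows with missing values)
df = pd.read_csv('data.csv').dropna()             # placeholder file
X, y = df.drop(columns=['target']), df['target']  # placeholder target column

# Step 4: model the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate the model
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))
</code></pre>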
<p>Thank you for reading this far; I hope you now have a clear understanding of how to build a machine learning model. Remember to read, learn, practice, clap, and share what you’ve learned.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=396f07dc571c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Breast Cancer Classification with Deep learning]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/breast-cancer-classification-with-deep-learning-43aea8127ac8?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/43aea8127ac8</guid>
            <category><![CDATA[science-and-technology]]></category>
            <category><![CDATA[healthcare]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Wed, 23 Mar 2022 14:22:01 GMT</pubDate>
            <atom:updated>2022-03-23T14:22:01.033Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/690/1*brjYKZ2j0YMWYv1DvTA7FA.jpeg" /><figcaption>Image by <a href="https://healthitanalytics.com/features/what-is-deep-learning-and-how-will-it-change-healthcare">Health IT Analytics</a></figcaption></figure><p>Breast cancer is a disease in which the cells of the breast grow out of control. A breast cancer tumor can be benign (meaning it is not harmful to one’s health) or malignant (meaning it is harmful to one’s health) (has the potential to be dangerous). Benign tumors are not cancerous because their cells have a similar appearance to normal cells, they develop slowly, and do not invade neighboring tissues or spread to other parts of the body. Malignant tumors are cancerous. Malignant cells can eventually expand beyond the original tumor to other regions of the body if left untreated.</p><p>Deep learning is a subset of machine learning that is essentially a neural network with three or more layers. These neural networks aim to imitate the activity of the human brain by allowing it to “learn” from large amounts of data. Deep learning is a machine learning technique that allows computers to learn by example in the same way that humans do. Deep learning is a critical component of self-driving automobiles, allowing them to detect a stop sign or discriminate between pedestrians and lamppost. It enables voice control in consumer electronics such as phones, tablets, televisions, and hands-free speakers. Deep learning has gotten a lot of press recently, and with good cause. It’s accomplishing previously unattainable accomplishments.</p><p>In deep learning, a computer model learns to perform classification tasks directly from images, text, or Sundeep learning models can attain state-of-the-art accuracy, even surpassing human performance in some cases. Models are trained using a huge quantity of labeled data and multilayer neural network architectures.</p><p>In this article, I would be walking you through how to classify with deep learning whether a breast cancer tumor is benign or malignant.</p><p>The whole process is broken down into 4 stages;</p><ul><li>Data Collection.</li><li>Data cleaning and preprocessing</li><li>Building Neural Network</li><li>Making a predictive system</li></ul><p><strong>Data collection: </strong>The data used for this project is the publicly available dataset from Kaggle titled ‘Breast Cancer Wisconsin (Diagnostic) Data Set’. Here is the<a href="https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data"> link</a>.</p><p><strong>Data cleaning and preprocessing: </strong>Importing libraries and datasets is the initial step in data cleaning and preprocessing. A Python library is a group of related modules that may be called and used together. Pandas (for data analysis), Numpy (for numerical operations), Seaborn (for data visualization and exploratory data analysis), and matplotlib.py plot( for data visualization and graphical plotting). 
These libraries can be accessed and used with the help of the “import” keyword.</p><p>Import all the necessary libraries.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0dcc811e3d6749f04bb0a9d4135a4164/href">https://medium.com/media/0dcc811e3d6749f04bb0a9d4135a4164/href</a></iframe><p>To be able to access the dataset from the drive, we must first mount Google Drive.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/35fc2520eb24de5c90bb3f675209b657/href">https://medium.com/media/35fc2520eb24de5c90bb3f675209b657/href</a></iframe><p>After mounting the drive, load and read the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0c5555c29c5baf5bee421957cc1dcd98/href">https://medium.com/media/0c5555c29c5baf5bee421957cc1dcd98/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9LnADmN5YOtp98Nt2ZMcYQ.png" /></figure><p>The above shows a sample of the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/efcd86e5c9f4fe8c835faf2fdca8ab6a/href">https://medium.com/media/efcd86e5c9f4fe8c835faf2fdca8ab6a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/782/1*MuylLZQN9pTqc2g0mYW-Lg.png" /></figure><p>From the above code, we can see that the data contains 569 rows and 33 columns.</p><p>Let’s drop the columns which aren’t needed for the prediction.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9f76e3f39cff405c48c4663752f7b426/href">https://medium.com/media/9f76e3f39cff405c48c4663752f7b426/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ncjBfeescPmpAtxCRtB6bg.png" /></figure><p>After dropping the unneeded columns, we can see that we now have 31 columns.</p><p>Let’s take a look at the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3d21dca656ff45d1469f07feceeed74/href">https://medium.com/media/a3d21dca656ff45d1469f07feceeed74/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/289/1*cemhX0wRsK23kAyTXDC9-g.png" /></figure><p>The above shows that we have 569 data entries.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f73701da0b1ef63bf8f37e4419c6c54a/href">https://medium.com/media/f73701da0b1ef63bf8f37e4419c6c54a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hXYCmxdM8rJORtbUzhXO1A.png" /></figure><p>The above shows the statistical measures of the dataset.</p><p>Let’s check for categorical features.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/376a2c32394f17099ad90abb73a2e028/href">https://medium.com/media/376a2c32394f17099ad90abb73a2e028/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/141/1*B2lZzjI1L6Jn1UW0EfYp4g.png" /></figure><p>Here we see that the target column, the diagnosis, is the only categorical column in the dataset.</p>
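<p>Because the gists above may not display everywhere, this is a rough consolidation of the loading and inspection steps. The drive path is an assumption; the two dropped columns (‘id’ and ‘Unnamed: 32’) are the usual extra columns in this Kaggle file.</p><pre><code>import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

# Assumed location of the Kaggle CSV on the drive
df = pd.read_csv('/content/drive/MyDrive/data.csv')
print(df.shape)                          # (569, 33)

# Drop the columns not needed for the prediction
df = df.drop(columns=['id', 'Unnamed: 32'])
df.info()                                # 569 entries, 31 columns
print(df.describe())                     # statistical measures

# 'diagnosis' is the only categorical (object) column
print(df.select_dtypes(include='object').columns)
</code></pre>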
<p>Let’s check the distribution of the target column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6798a9bd56292166ce8c0172b0d1102f/href">https://medium.com/media/6798a9bd56292166ce8c0172b0d1102f/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ALt7bkI8MGIX5EHdhnZFfA.png" /></figure><p>The image above shows the value counts for the target column.</p><p>In my previous <a href="https://medium.com/@babatundeoreoluwa35/salary-prediction-with-machine-learning-part-1-d88364ed7d6b">article</a>, I talked about how to transform categorical features into numerical features using LabelEncoder. So we are going to encode the values of the target column as numbers.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6e36acad23fccf037b1cfd278338133a/href">https://medium.com/media/6e36acad23fccf037b1cfd278338133a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/402/1*_5UmSSBqf8OgpAJPluoyiA.png" /></figure><p>Now the label encoder has given each target value a unique integer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jkwaWhyFan9a_BpO6AZ4NQ.png" /></figure><p>0-Benign</p><p>1-Malignant</p><p>Let’s take a look at the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fMWxsWLepgg1dnZbzPcOCQ.png" /></figure><p>Our data is ready for modeling!</p><p>Now we will split our data into X and y: X will contain the features and y will contain the target, which is the diagnosis.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6435a144e93866d4dcd23683bcf3939e/href">https://medium.com/media/6435a144e93866d4dcd23683bcf3939e/href</a></iframe><p>Splitting the data into train and test sets</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3f1c56ffc34488eac19910b2a80a9b32/href">https://medium.com/media/3f1c56ffc34488eac19910b2a80a9b32/href</a></iframe><p>To ensure that the data is internally consistent, we will use the StandardScaler class from the sklearn.preprocessing library. Data standardization helps to increase the quality of your data and improve the accuracy of the model; you can read more about <a href="https://www.journaldev.com/45025/standardscaler-function-in-python#:~:text=Python%20sklearn%20library%20offers%20us,values%20into%20a%20standard%20format.&amp;text=According%20to%20the%20above%20syntax,the%20data%20and%20standardize%20it.">data standardization</a>.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ffaa6da1f5aed6afd9f5c71106c71c7b/href">https://medium.com/media/ffaa6da1f5aed6afd9f5c71106c71c7b/href</a></iframe>
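<p>In sketch form, the encoding, splitting, and scaling steps above might look like this; the split ratio and the seed are assumptions.</p><pre><code>from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Encode the target: LabelEncoder assigns Benign (B) 0 and Malignant (M) 1
le = LabelEncoder()
df['diagnosis'] = le.fit_transform(df['diagnosis'])

X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Standardize the features so they share a common scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
</code></pre>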
<p>Now that the data has been cleaned and preprocessed, the next stage is building the neural network.</p><ul><li>Building the neural network: to train on this data, we will build a three-layer network.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/263/1*7Uc5jpmOWyrcwQPgZkKW0g.png" /><figcaption>Image by <a href="https://healthitanalytics.com/features/what-is-deep-learning-and-how-will-it-change-healthcare">ResearchGate</a></figcaption></figure><p>Hence, we will import the TensorFlow library, set a random seed, and import Keras from TensorFlow. TensorFlow is a deep learning library created by Google for building neural networks; here is the <a href="https://www.tensorflow.org/guide/basics">documentation</a>. Keras is an open-source software library for artificial neural networks that serves as a user interface for TensorFlow; here is the <a href="https://keras.io/getting_started/">documentation</a>.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5a723503cebaffaa3d42b8f460f223dd/href">https://medium.com/media/5a723503cebaffaa3d42b8f460f223dd/href</a></iframe><p>Let’s create the neural network by calling the keras.Sequential() function. <a href="https://www.tensorflow.org/api_docs/python/tf/keras/Sequential"><strong>Sequential</strong></a> groups a linear stack of layers into a <strong>tf</strong>.keras.Model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d4569c4ab8804d381312a6d1eb84ca64/href">https://medium.com/media/d4569c4ab8804d381312a6d1eb84ca64/href</a></iframe><p><strong>keras.layers.Flatten</strong>: this layer is the input layer; it is responsible for converting the data into a one-dimensional array so that it may be passed on to the next layer. All of the feature columns are taken in by this layer.</p><p><strong>keras.layers.Dense</strong>: this is the hidden layer, in which every neuron is connected to every neuron in the previous and next layers. It sits between the input and output layers, and it has a given number of neurons and an activation function.</p><p><strong>keras.layers.Dense</strong>: this is the output layer; it contains one neuron per target value and an activation function.</p><p>An activation function is a very important feature of an artificial neural network; it decides whether a neuron should be activated or not.</p><p>We then compile the neural network after it has been created. Compilation converts the previously created basic sequence of layers into a highly efficient series of matrix transformations; it can be thought of as a stage before training that allows the computer to train the model efficiently. The compilation is performed using one single method, which is shown below.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6e1bce676d17b87ab067be0bd1dd3f44/href">https://medium.com/media/6e1bce676d17b87ab067be0bd1dd3f44/href</a></iframe><p>Optimizer: an optimizer is a function or algorithm that modifies the characteristics of a neural network, such as its weights and learning rate. As a result, it aids in the reduction of total loss and the improvement of accuracy.</p><p>Loss: the loss function in a neural network quantifies the difference between the expected outcome and the outcome produced by the machine learning model.</p><p>Metrics: a metric is a function that can be used to assess your model’s performance. Metric functions are similar to loss functions, except that the outcomes of assessing a metric are not used to train the model.</p><p>The next stage is to train the neural network.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/26797a6c76fff0bdf5bbcd0c4908ce6c/href">https://medium.com/media/26797a6c76fff0bdf5bbcd0c4908ce6c/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m7St9rJ8Wvb8QSsN5s-cVg.png" /></figure><p>From this, we can see that the loss and the accuracy are inversely related: the lower the loss, the higher the accuracy of the neural network, and vice versa.</p>
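<p>For readers without access to the gists, an end-to-end sketch of a network like the one described could look as follows. The layer sizes, seed, loss choice, and epoch count are stand-ins, not the notebook’s exact values.</p><pre><code>import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(3)

# Input layer flattens the 30 features; one hidden layer; 2-unit output
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(30,)),
    keras.layers.Dense(20, activation='relu'),
    keras.layers.Dense(2, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train, holding out 10% of the training data for validation
history = model.fit(X_train, y_train, validation_split=0.1, epochs=10)

# Accuracy on the held-out test set
loss, accuracy = model.evaluate(X_test, y_test)
print(accuracy)
</code></pre>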
<p>Let’s check the accuracy on the test data set.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/95dc7bdde906db02c402f66368fd3d77/href">https://medium.com/media/95dc7bdde906db02c402f66368fd3d77/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/1*4xMA6Dn10wcCsq57ZFlbsg.png" /></figure><p><strong>Building a predictive system:</strong> This is the most interesting part of this project. Now we are going to build a predictive system.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f4f7aa05e925172e156c1444abf00e07/href">https://medium.com/media/f4f7aa05e925172e156c1444abf00e07/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eZztoadlXx9H3SbXMVx4Wg.png" /></figure><p>The above code and output show the predicted probability of each label for the data point: here the model is 64% sure that the output is 0 (Benign) and 43% sure that the output is 1 (Malignant). So when we print out y_pred we get two values, the probability of each label being the actual value. This can be quite confusing; to avoid it we use the argmax() function from the NumPy library. The argmax function returns the index of the maximum value: if the first value is the maximum it returns 0, and if the second value is the maximum it returns 1.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/76aba72430231480c2275dfdc1ca59d4/href">https://medium.com/media/76aba72430231480c2275dfdc1ca59d4/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5eJk97hOxhxgVYQP-ORUA.png" /></figure><p>Now we have converted the prediction probabilities to target labels.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4091cf11965e9722d0102e347d8d33ee/href">https://medium.com/media/4091cf11965e9722d0102e347d8d33ee/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8l2qvtfjF-d7EKKT89_Z_g.png" /></figure><p>Now we have successfully built a deep learning model that can predict whether a breast cancer tumor is benign or malignant.</p><p>To test whether your model is accurate, you could copy a data point from your dataset, pass it into input(), and run the code. You can view the code on my <a href="https://github.com/Oreoluwa1234/Breast-cancer-prediction-with-deep-learning">GitHub</a> portfolio.</p><p>Deep learning models may not be the best approach for this project, given their complexity.<br>It is recommended practice in machine learning to experiment with basic models before moving on to more complicated approaches like neural networks, which are the foundation of deep learning. So I recommend you try out some classical machine learning models with this dataset.</p><p>Thank you for reading all the way through; I hope you now have a clear understanding of this project. Don’t forget to read, learn, practice, clap and share.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=43aea8127ac8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Salary Prediction with Machine Learning (Part 2)]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/salary-prediction-with-machine-learning-part-2-cb707d8b8567?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/cb707d8b8567</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[salary-negotiations]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Sat, 12 Feb 2022 04:54:04 GMT</pubDate>
            <atom:updated>2022-02-12T05:24:19.689Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*AcZgjPN8B0J98CABXLqj9A.jpeg" /></figure><p>In my last <a href="https://gist.github.com/Oreoluwa1234/78ce0ac55aa72e3a808a109808faddfa">article</a>, I built a model that can predict the annual salaries of data scientists. In this article, we will be deploying that model to create a Machine Learning web app that can predict the annual salaries of data scientists.</p><p>Before creating the ML web app, we must save the model, and we do this by using the library called pickle. Pickle is the standard way of serializing objects in Python. You can use the pickle operation to serialize your machine learning algorithms and save the serialized format to a file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1798dbf448546d9b270b6aefc53da594/href">https://medium.com/media/1798dbf448546d9b270b6aefc53da594/href</a></iframe><p>After importing the pickle, we save the model, the label encoder for the country, and edlevel which is lb_country, lb_edlevel saved inside a dictionary. We then open a pickle file in the write binary mode “wb”, then dump the data into the file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1b838786f9b5391f7455f49a3fd43360/href">https://medium.com/media/1b838786f9b5391f7455f49a3fd43360/href</a></iframe><p>After running the code above, the pickle file(Saved_step.pkl) will be saved automatically on our google drive directory. We can check it again by loading it again in the read binary format “ rb”. We can access the model, the label encoder for the country, and edlevel which is lb_country, lb_edlevel by giving them a key.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/542e3e695ee429b8e75aa0fd2fa0a42f/href">https://medium.com/media/542e3e695ee429b8e75aa0fd2fa0a42f/href</a></iframe><p>Now that we are through with saving the model. Next thing to do is to deploy this model into a web app using streamlit.</p><p><strong>Streamlit is an open-source python framework for creating and sharing web apps and interactive dashboards for data science and machine learning projects</strong></p><p>Open the editor of your choice, I would be using visual studio code. Inside the Vscode, open a folder containing the dataset, colab notebook, and the pickle file we created (Saved_step.pkl).</p><p>Create three new files called app.py, predict_page.py for the prediction page, and explore_page.py. 
<p>Now that we are through with saving the model, the next thing to do is to deploy it as a web app using Streamlit.</p><p><strong>Streamlit is an open-source Python framework for creating and sharing web apps and interactive dashboards for data science and machine learning projects.</strong></p><p>Open the editor of your choice; I will be using Visual Studio Code. Inside VS Code, open a folder containing the dataset, the Colab notebook, and the pickle file we created (Saved_step.pkl).</p><p>Create three new files: app.py, predict_page.py for the prediction page, and explore_page.py for the explore page.</p><p>On the predict_page.py page, we import all the libraries used, which are streamlit, numpy, and pickle.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/68f464f11d00d170aa55e7bf1e463793/href">https://medium.com/media/68f464f11d00d170aa55e7bf1e463793/href</a></iframe><p>After running the above code, we write a function that loads the saved pickle file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/849679f368ce302693f94e7a39bc934a/href">https://medium.com/media/849679f368ce302693f94e7a39bc934a/href</a></iframe><p>Now we want to access the model and the label encoders for country and edlevel by their keys.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3aec6a54c1fd64bc3268ab1ff07aa673/href">https://medium.com/media/3aec6a54c1fd64bc3268ab1ff07aa673/href</a></iframe><p>Now, let’s create a function containing the Streamlit widgets.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/403fc2b27d3563622590890b47ed5f8b/href">https://medium.com/media/403fc2b27d3563622590890b47ed5f8b/href</a></iframe><p>In order to run this code, we go to our app file, import streamlit as st, and import show_predict_page.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0fde1265f5b8dce79bac6fc996801ca6/href">https://medium.com/media/0fde1265f5b8dce79bac6fc996801ca6/href</a></iframe><p>After doing all this, input “streamlit run app.py” in the VS Code terminal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/885/1*ZOxEWjPj4CgPFX27dJnzPw.png" /></figure><p>The above output is what is going to show in the browser.</p><p>The next things we are going to add to the predict page are two select boxes, for the countries and the education levels.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7f868ab7f7694516d7d5a68f6b39eaec/href">https://medium.com/media/7f868ab7f7694516d7d5a68f6b39eaec/href</a></iframe><p>The select box can take a list or a tuple; we will be using tuples here since the countries and education levels are stored as tuples.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a106f3111cf82c3fa19d134bb1fb9aa6/href">https://medium.com/media/a106f3111cf82c3fa19d134bb1fb9aa6/href</a></iframe><p>After doing all this, click save, go back to your browser, and click rerun.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/836/1*u6dRnMZxj4nYJ_3P7hzbdQ.png" /></figure><p>The above is the output it shows.</p><p>When you click on the country, it brings out the list of countries:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/770/1*o78tvQkwjfiwcd_-UgRZJw.png" /></figure><p>When you click on education, it brings out the list of education levels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/757/1*d19tOP_sYYEX2dwDU_I4dQ.png" /></figure><p>For the years of experience, we create a slider by calling the slider method and giving it a min_value, a max_value, and a default value.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d72b4e954f49dcee7b247455821bcec2/href">https://medium.com/media/d72b4e954f49dcee7b247455821bcec2/href</a></iframe>
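<p>Putting the widgets together, predict_page.py might look roughly like the sketch below. The tuple contents are shortened and the feature order passed to the model is an assumption; the gists carry the real code.</p><pre><code>import streamlit as st
import pickle
import numpy as np

def load_model():
    with open('Saved_step.pkl', 'rb') as file:
        return pickle.load(file)

data = load_model()
model = data['model']
lb_country = data['lb_country']
lb_edlevel = data['lb_edlevel']

def show_predict_page():
    st.title('Data Scientist Salary Prediction')

    countries = ('United States', 'Canada', 'Germany', 'Other')  # shortened
    education = ("Bachelor's degree", "Master's degree",
                 'Post grad', 'Less than a Bachelors')

    country = st.selectbox('Country', countries)
    edlevel = st.selectbox('Education Level', education)
    experience = st.slider('Years of Experience', 0, 50, 3)

    if st.button('Calculate Salary'):
        X = np.array([[country, edlevel, experience]])
        X[:, 0] = lb_country.transform(X[:, 0])
        X[:, 1] = lb_edlevel.transform(X[:, 1])
        X = X.astype(float)
        salary = model.predict(X)
        st.subheader(f'The estimated salary is ${salary[0]:,.2f}')
</code></pre>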
<p>Click save and rerun in your browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*JHjIJjJoovAA7JYkDK_VVA.png" /></figure><p>This is what it looks like when run.</p><p>Now let’s add a button to calculate the salary after all the required information has been filled in. We call the button method and assign it to a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/da531365cef539c7201f21e37c9d76e6/href">https://medium.com/media/da531365cef539c7201f21e37c9d76e6/href</a></iframe><p>Click save and rerun in the browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/327/1*PWy7YkTCJd41I1J4nMIFOg.png" /></figure><p>The above is the output after rerunning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/715/1*jC-YGPJUYsH_hVSQv8Ab7g.png" /></figure><p>You get the estimated salary after clicking the “Calculate Salary” button.</p><p>Now that we are done with the prediction page, let’s take an example where the country is Canada, the education level is a master’s degree, and there are 10 years of experience.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/817/1*Jx7wqq500oY27kxd-47aMQ.png" /></figure><p>When we click the calculate salary button we have:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*O1-oVi-gCGSzOPHJchI6wQ.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6ed498fc9b9f40bcca404af4888ebbe2/href">https://medium.com/media/6ed498fc9b9f40bcca404af4888ebbe2/href</a></iframe><p>If you followed all the steps above, your prediction page should have these contents in your browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*O1-oVi-gCGSzOPHJchI6wQ.png" /></figure><p>Now that we have the prediction page ready, we are going to create a sidebar and add the second page, called the explore page.</p><p>To create a sidebar, go to the app.py file and call the sidebar selectbox method, passing in a label as the first argument and the page options as the second.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bc6580959f364dfeb67480a95af119fc/href">https://medium.com/media/bc6580959f364dfeb67480a95af119fc/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/314/1*qpC6BXf9sVyXy567SMeMXw.png" /></figure><p>The above is the result.</p><p>The only thing left to do is to implement the explore page.</p><p>For that, import all the libraries.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61d9190a414223d21a4080435b8901a6/href">https://medium.com/media/61d9190a414223d21a4080435b8901a6/href</a></iframe><p>The reason we import the zipfile module is that our dataset is in ZIP format.</p><p>Now we are going to clean and load the data the same way we did in the notebook in the previous article. To do this, we will copy over all the functions we used to clean the data.</p><p>After cleaning and loading this data, we will apply all the transformations we did.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a7ac4e79eea77a45d84716d2ce2362ee/href">https://medium.com/media/a7ac4e79eea77a45d84716d2ce2362ee/href</a></iframe><p>To avoid having the data reloaded on every interaction, we are going to use a function decorator called st.cache. st.cache is a function decorator that helps to improve the speed and memory consumption of the app by caching the function’s result.</p>
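<p>A skeletal version of the cached loader and the first chart follows; the ZIP file name is assumed, and the cleaning helpers are the ones copied over from Part 1.</p><pre><code>import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

@st.cache
def load_data():
    # File name assumed; pandas can read a single-file ZIP directly
    df = pd.read_csv('survey_results_public.zip')
    # ...apply the same cleaning and transformations as in Part 1...
    return df

df = load_data()

def show_explore_page():
    st.title('Explore Data Scientist Salaries')

    # Pie chart of the number of data points per country
    counts = df['Country'].value_counts()
    fig, ax = plt.subplots()
    ax.pie(counts, labels=counts.index, autopct='%1.1f%%')
    ax.axis('equal')
    st.pyplot(fig)
</code></pre>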
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/86783bc6644477f04c8d05bbe47cf998/href">https://medium.com/media/86783bc6644477f04c8d05bbe47cf998/href</a></iframe><p>On the explore page we will be displaying three charts: a pie chart, a bar chart, and a line chart.</p><p>For the pie chart, we will plot the value counts of the countries. We do this by calling the value_counts() method and plotting the result as a pie chart using the matplotlib.pyplot library.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bce101a56d074251980b7484898499f5/href">https://medium.com/media/bce101a56d074251980b7484898499f5/href</a></iframe><p>To show everything we have done on the explore page in the web app, we add some changes to the app.py file.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/825fd16fc6823411f5c925dfac61e2e1/href">https://medium.com/media/825fd16fc6823411f5c925dfac61e2e1/href</a></iframe><p>Click save and rerun the browser.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/854/1*KpMWx-KG4dQWQLjmZSJ_Iw.png" /></figure><p>Let’s plot the next chart, the bar chart.</p><p>For the bar chart, we are going to plot the mean salary by country.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/02f08033f13f153db68a24423dfca7d4/href">https://medium.com/media/02f08033f13f153db68a24423dfca7d4/href</a></iframe><p>Click save and rerun.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/717/1*KRgEnYZqEoZ9XkLNr5l6fg.png" /></figure><p>Now we see the mean salary for each country.</p><p>The last chart is the line chart.</p><p>We will plot the mean salary by years of experience.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/802be67d74e863cf17de08c8d4b3f07a/href">https://medium.com/media/802be67d74e863cf17de08c8d4b3f07a/href</a></iframe><p>Save and rerun the explore page.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/760/1*btFrim4ffs1i28C5cGq-cw.png" /></figure><p>Now we are through deploying our model.</p><p>At the end of the deployment, this is how the web app should look:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4in7mPBsh1B2F3Xdj5lIgA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*Knx3rHmoZX_AJC4GUIBkJQ.png" /></figure><p>This is the link to my <a href="https://share.streamlit.io/oreoluwa1234/salary-prediction-project/main/app.py">web app</a>, and you can also view the code on my <a href="https://github.com/Oreoluwa1234/Salary-Prediction-Project">GitHub</a> portfolio.</p><p>Thank you for reading all the way through; I hope you now have a clear understanding of this project. Don’t forget to read, learn, practice, and clap under the article.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cb707d8b8567" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Salary Prediction with Machine Learning (Part 1).]]></title>
            <link>https://medium.com/@babatundeoreoluwa35/salary-prediction-with-machine-learning-part-1-d88364ed7d6b?source=rss-34103bd9ac9b------2</link>
            <guid isPermaLink="false">https://medium.com/p/d88364ed7d6b</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[salary-negotiations]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Babatunde Oreoluwa]]></dc:creator>
            <pubDate>Sat, 05 Feb 2022 00:02:59 GMT</pubDate>
            <atom:updated>2022-02-05T08:55:32.736Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*AcZgjPN8B0J98CABXLqj9A.jpeg" /><figcaption>Image by <a href="https://www.itsecurityguru.org/2018/05/09/can-machine-learning-complement-existing-security-solutions/">IT security guru</a></figcaption></figure><p>Data Science is a very broad industry that has birthed many other recent data roles such as data analysis, machine learning engineering, data engineering, analytics engineering, and a few others. While some people have these roles well defined, others work across many of these branches without even knowing.</p><p>I recently stumbled upon a dataset that contains details of data scientists’ earnings/salaries across some countries, based on their education level and years of experience, so I thought it would be interesting to explore.</p><p>This article will be giving details of the project on data scientists’ annual salary predictions, which I worked on.</p><p>Prerequisites to understand this project include :</p><ul><li>Basic knowledge in Python programming</li><li>An understanding of data science</li></ul><p>The whole process is broken down into 4 stages;</p><ul><li>Data Collection.</li><li>Data Preprocessing</li><li>Model Building</li><li>Model Deployment</li></ul><p><strong>Data Collection</strong>: Data salaries are not easily available as HR personnel claims they are proprietary. Therefore, we resorted to using the publicly available data from Stack Overflow Annual Developer Survey. Here is the link</p><p><a href="https://insights.stackoverflow.com/survey">Stack Overflow</a></p><p><strong>Data Cleaning and preprocessing</strong>: The first step in data cleaning and preprocessing is importing the libraries and dataset. A python library is a collection of related modules that can be called and used. I would be using four main libraries which are<strong> pandas </strong>(for data analysis), <strong>Numpy (for </strong>numerical operations),<strong> seaborn (</strong>for data visualization and exploratory<strong> </strong>data analysis), and <strong>matplotlib.pyplot </strong>( for data visualization and graphical<strong> </strong>plotting). 
These libraries can be called and used with the help of the “import” keyword.</p><p>Importing all the necessary libraries</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ad82678358e785279b4541d82d55863d/href">https://medium.com/media/ad82678358e785279b4541d82d55863d/href</a></iframe><p>Import and load the dataset from the drive: since I used Google Colab, I had to import and load the dataset from the drive.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8fc6d03faa5153a756da96d70d5148a6/href">https://medium.com/media/8fc6d03faa5153a756da96d70d5148a6/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VORaJeIL64NqYjexz0MsmQ.png" /></figure><p>The above shows a sample of the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/eafa84981a5e482dcef04b8d396f3e02/href">https://medium.com/media/eafa84981a5e482dcef04b8d396f3e02/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e4b69f771bf5c844194f61e34b46eeb9/href">https://medium.com/media/e4b69f771bf5c844194f61e34b46eeb9/href</a></iframe><p>The dataset above contains 64,461 rows and 61 columns.</p><p>Let’s start cleaning!</p><p><strong>Selecting and keeping the columns/features needed for the prediction</strong>: When building a machine learning model in real life, feature selection is very important, because it is almost never the case that all the features in the dataset are useful for building the model. So we select only the few columns needed for the prediction, in order not to bother the user with having to fill in too much unnecessary information. The columns are Country, EdLevel (the education level), YearsCodePro (the number of years of professional experience), Employment (full-time or part-time), and ConvertedComp (the annual salary in dollars); this last feature is still going to be renamed.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c56454c7326eaaaae7176be82e488fef/href">https://medium.com/media/c56454c7326eaaaae7176be82e488fef/href</a></iframe><p><strong>Dealing with missing values</strong>: I will only be using the rows where the salary is available, so I will drop the rows with a NaN salary.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/283958c77da09b2deaed20bb41cf9394/href">https://medium.com/media/283958c77da09b2deaed20bb41cf9394/href</a></iframe><p>Let’s take a quick look at the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/24caf532aedd6e6b1e96cb9796f2dad7/href">https://medium.com/media/24caf532aedd6e6b1e96cb9796f2dad7/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/308/1*Q7UdIvMrAAfTUGzvsaQV6Q.png" /></figure><p>Here we see that we have 34,025 data entries; three columns are objects, which means they are strings, and only the salary column is a float. So we drop the remaining rows that still contain missing values.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/22ff758c8b1f9c30a474b981260d4546/href">https://medium.com/media/22ff758c8b1f9c30a474b981260d4546/href</a></iframe><p>I dropped the Employment column since it wasn’t really needed for the prediction.</p>
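<p>In sketch form (the gists carry the exact code), the selection and missing-value steps look roughly like this; the renamed column name is assumed from the description:</p><pre><code># Keep only the features needed for the prediction
df = df[['Country', 'EdLevel', 'YearsCodePro', 'Employment', 'ConvertedComp']]
df = df.rename(columns={'ConvertedComp': 'Salary'})

# Keep the rows where a salary was reported, then drop remaining gaps
df = df[df['Salary'].notnull()]
df = df.dropna()

# The Employment column is not needed for the prediction
df = df.drop('Employment', axis=1)
</code></pre>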
<p>Let’s take a quick look at our dataset again</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/10f89dd2b5a1cbbaf66276105daae149/href">https://medium.com/media/10f89dd2b5a1cbbaf66276105daae149/href</a></iframe><p>Now we will clean each of the columns, starting with the country data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/97af417d08d78ba67476679dd13c430d/href">https://medium.com/media/97af417d08d78ba67476679dd13c430d/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c8ec47cea08620e5100a7b87a42ef6b4/href">https://medium.com/media/c8ec47cea08620e5100a7b87a42ef6b4/href</a></iframe><p>The value_counts() function in pandas returns a series containing the counts of unique values, in descending order, so that the first element is the most frequently occurring one. Here we see that the U.S.A. has the most data, and that some countries have only one data point each; we will get rid of those because our model cannot learn from a single data point.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9de7062736af7456e3b7248120d43993/href">https://medium.com/media/9de7062736af7456e3b7248120d43993/href</a></iframe><p>We will clean the country column with the function above, named “shorten_categories”, after fixing a cut-off value. If the number of data points for a country is greater than the cut-off value, we keep it; otherwise we combine it into a new category called “Other”.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c3504e2c24f7ce45a4b2a4dc6dd89e42/href">https://medium.com/media/c3504e2c24f7ce45a4b2a4dc6dd89e42/href</a></iframe><p>After running the above, we discover that the new category we created now has the most data points.</p><p>I would like to look at the relationship between the salary column and the country column by plotting a boxplot.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/099b2e2106e0464177e1b673b8b7ca85/href">https://medium.com/media/099b2e2106e0464177e1b673b8b7ca85/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/803/1*cgIXTeqE2llQqB7URWV1DA.png" /></figure><p>From the plot we can see that we have a lot of outliers. So we will keep the data where we have the most information, by keeping salaries that are less than or equal to $250,000 and greater than or equal to $10,000, and dropping the “Other” category.</p><p>Let’s plot it again</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b50b1deff6d5f754d8a9bc3e1645fe69/href">https://medium.com/media/b50b1deff6d5f754d8a9bc3e1645fe69/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/771/1*6J5SBBrNlKVGJblN6hwf3g.png" /></figure><p>We can see that the outliers have been reduced.</p>
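<p>As a recap, a sketch of the country and salary cleaning described above (the salary thresholds come from the steps above, but the cut-off of 400 data points is an assumption):</p><pre>
def shorten_categories(categories, cutoff):
    # keep a country if it has more data points than the cutoff,
    # otherwise fold it into a new 'Other' category
    category_map = {}
    for i in range(len(categories)):
        if categories.values[i] &gt; cutoff:
            category_map[categories.index[i]] = categories.index[i]
        else:
            category_map[categories.index[i]] = "Other"
    return category_map

# the cut-off value of 400 is an assumption; the article only fixes "a cut-off value"
country_map = shorten_categories(df["Country"].value_counts(), 400)
df["Country"] = df["Country"].map(country_map)

# keep salaries between $10,000 and $250,000 (inclusive) and drop 'Other'
df = df[df["Salary"].between(10000, 250000)]
df = df[df["Country"] != "Other"]
</pre>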
<h4>Cleaning the YearsCodePro feature</h4><p>The unique() function in pandas returns the unique values of a series, in order of appearance.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/af845a88c03e2879af1b11c157398118/href">https://medium.com/media/af845a88c03e2879af1b11c157398118/href</a></iframe><p>After running this, we discover that all the values come out as strings. For the computer to understand these, we convert them to floats: if a value says less than a year, it returns 0.5; if it says more than 50 years, we ascribe 50; otherwise, we convert it to a float directly. We do this by creating a function called clean_experience.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/88445b67e3625c86692d3a9576e1263f/href">https://medium.com/media/88445b67e3625c86692d3a9576e1263f/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ac40e33b1c26a02deac6f746c37b4698/href">https://medium.com/media/ac40e33b1c26a02deac6f746c37b4698/href</a></iframe><p>After running the above, we see that the values now come out as numbers.</p><h4>Cleaning the EdLevel feature</h4><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5e6d99e6138ddea95a35a82df0b295fd/href">https://medium.com/media/5e6d99e6138ddea95a35a82df0b295fd/href</a></iframe><p>Here, we have many different education levels. We will be focusing on Bachelor’s, Master’s, and other postgraduate degrees; anything apart from these will be called “Less than a Bachelor’s”.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f333bc35af4657e45da07a366897fd38/href">https://medium.com/media/f333bc35af4657e45da07a366897fd38/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/69a03234010861512c8a36aadd6d5b53/href">https://medium.com/media/69a03234010861512c8a36aadd6d5b53/href</a></iframe><p>After running the code above, we see that we have only five outputs.</p>
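<p>A sketch of both cleaning functions (the exact answer strings matched, such as “More than 50 years”, are assumptions based on the survey’s wording):</p><pre>
def clean_experience(x):
    # the survey stores experience as strings such as 'Less than 1 year'
    if x == "More than 50 years":
        return 50
    if x == "Less than 1 year":
        return 0.5
    return float(x)

df["YearsCodePro"] = df["YearsCodePro"].apply(clean_experience)

def clean_education(x):
    # keep Bachelor's, Master's and other postgraduate degrees;
    # everything else becomes 'Less than a Bachelor's'
    if "Bachelor" in x:
        return "Bachelor’s degree"
    if "Master" in x:
        return "Master’s degree"
    if "Professional degree" in x or "doctoral" in x:
        return "Post grad"
    return "Less than a Bachelor’s"

df["EdLevel"] = df["EdLevel"].apply(clean_education)
</pre>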
<p>Now, we are almost done with the data cleaning.</p><p>As we know, the model does not understand strings, and we still have columns containing strings! It is therefore necessary to transform the string values into unique numbers. To do this, we will be using LabelEncoder. Label encoding is part of data preprocessing, so we will use the preprocessing module from the sklearn package and import LabelEncoder from it.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/33a80945ea7978acdc4c7f923dc4c13f/href">https://medium.com/media/33a80945ea7978acdc4c7f923dc4c13f/href</a></iframe><p>Create an instance of LabelEncoder() and store it in a variable, lb_edlevel.</p><p>Apply fit_transform, which does the trick of assigning a numerical value to each categorical value, and store the result back in the “EdLevel” column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/aadafbe6c893e80ee4c7695678744da6/href">https://medium.com/media/aadafbe6c893e80ee4c7695678744da6/href</a></iframe><p>Let’s take a look at the EdLevel</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/226329fd2ea29511ede2a93b14c08ae4/href">https://medium.com/media/226329fd2ea29511ede2a93b14c08ae4/href</a></iframe><p>We no longer have strings here; the LabelEncoder has transformed the EdLevel column into integers, which the model can now understand. We will do the same for the country column.</p><p>Create an instance of <strong>LabelEncoder()</strong> and store it in a variable, lb_country.</p><p>Apply fit_transform to assign a numerical value to each categorical value, and store the result back in the “Country” column.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/cfbd446bae8fc4d1a9290ebec4126db5/href">https://medium.com/media/cfbd446bae8fc4d1a9290ebec4126db5/href</a></iframe><p>Let’s have a look at the unique values of the country column</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/95c2244ba7eb7822cf88cb252c9c60e3/href">https://medium.com/media/95c2244ba7eb7822cf88cb252c9c60e3/href</a></iframe><p>Now the label encoder has given each country a unique integer value.</p><p>Let’s check the dataset</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/92187f9110be256fdcad416fc3bbb079/href">https://medium.com/media/92187f9110be256fdcad416fc3bbb079/href</a></iframe><p>Now, our data is ready for training and testing.</p><p><strong>Data Splitting</strong>: Data splitting is commonly used in machine learning to divide data into train, test, or validation sets. This approach allows us to estimate the model’s performance. Here, we will only be using a train set and a test set.</p><p>We will split our data into X and y: X will contain the features, and y will contain the target, the salary, which is dropped from the features.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c711dca19248e492212a61259b02cabe/href">https://medium.com/media/c711dca19248e492212a61259b02cabe/href</a></iframe>
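<p>A sketch of the encoding and the feature/target split (the variable names follow the ones mentioned above):</p><pre>
from sklearn import preprocessing

# encode the education levels as integers
lb_edlevel = preprocessing.LabelEncoder()
df["EdLevel"] = lb_edlevel.fit_transform(df["EdLevel"])

# encode the countries as integers
lb_country = preprocessing.LabelEncoder()
df["Country"] = lb_country.fit_transform(df["Country"])

# X holds the features; y holds the target (the salary)
X = df.drop("Salary", axis=1)
y = df["Salary"]
</pre>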
<p>After doing that, we split the X and y datasets into train and test sets. To do this, we use the train_test_split function.</p><p>train_test_split is <strong>a function in sklearn.model_selection for splitting data arrays into two subsets</strong>: one for training and one for testing. With this function, you don’t need to divide the dataset manually; by default, it makes random partitions for the two subsets.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/488cd6b6cc8864f27c5bafe1993487ac/href">https://medium.com/media/488cd6b6cc8864f27c5bafe1993487ac/href</a></iframe><p>We will train on 70% of the dataset and test on 30%, with the random_state set to 42. The random state ensures that the same split is generated every time the code is run.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1ff10fc1f56fe0a03efad5c11d42eb1f/href">https://medium.com/media/1ff10fc1f56fe0a03efad5c11d42eb1f/href</a></iframe><p>So it’s time to build our model!!!</p><p>Three different algorithms will be used to build the model, and we will pick the one with the least error.</p><p>We start with linear regression, a basic and commonly used algorithm for predictive analysis, by importing it from sklearn.linear_model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/24b4bd66bb5abcb83443260ce0137cf1/href">https://medium.com/media/24b4bd66bb5abcb83443260ce0137cf1/href</a></iframe><p>Create an instance of <strong>LinearRegression()</strong>, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/662ee76d6b7450a05dbcc1278da98510/href">https://medium.com/media/662ee76d6b7450a05dbcc1278da98510/href</a></iframe><p>In regression predictive modeling, we use error metrics to measure model performance. The error metric we will be using is the RMSE, the root mean squared error. It expresses the typical difference between the predicted values and the actual values.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/816db2c8415e2c3f33dbe34863e0d095/href">https://medium.com/media/816db2c8415e2c3f33dbe34863e0d095/href</a></iframe><p>Our output is shown below</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1e715fd27f2d4c3927b16d8c9453e642/href">https://medium.com/media/1e715fd27f2d4c3927b16d8c9453e642/href</a></iframe><p>We can see that the error between the actual and predicted values using the LinearRegression algorithm is $39,558.79, which is very high.</p>
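<p>Putting the split, the model, and the metric together, a sketch along these lines:</p><pre>
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 70% for training, 30% for testing, with a fixed random_state of 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)

# RMSE: the square root of the mean squared error
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"${error:,.2f}")
</pre>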
<p>Let’s try the DecisionTreeRegressor algorithm.</p><p>Import DecisionTreeRegressor from sklearn.tree, create an instance of it, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a9995cab4e855af2389b003608054a91/href">https://medium.com/media/a9995cab4e855af2389b003608054a91/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f29f3b5609e73cca05b8a044ff2fbb4c/href">https://medium.com/media/f29f3b5609e73cca05b8a044ff2fbb4c/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b56335fd62fa12ea32ac080dd116b6b2/href">https://medium.com/media/b56335fd62fa12ea32ac080dd116b6b2/href</a></iframe><p>The error between the actual and predicted values using DecisionTreeRegressor is $33,962.56, which is still a little high.</p><p>Let’s try the random forest regression algorithm.</p><p>Import RandomForestRegressor from sklearn.ensemble, create an instance of it, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/70182692ecede19892e328ef78918e7e/href">https://medium.com/media/70182692ecede19892e328ef78918e7e/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/80f3c012652c7271db981061458c43a0/href">https://medium.com/media/80f3c012652c7271db981061458c43a0/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/58dacc9a949e1f241a10d6cd8d866095/href">https://medium.com/media/58dacc9a949e1f241a10d6cd8d866095/href</a></iframe><p>Finally, the RandomForestRegressor algorithm gives us the least error. Now we want to find the best parameters for our model using GridSearchCV.</p><p>Grid search is the process of performing hyperparameter tuning to determine the optimal values for a given model. This matters because the performance of the entire model depends on the hyperparameter values specified, and grid search is a useful tool for fine-tuning them.</p><p>The way it works: import GridSearchCV from sklearn.model_selection, define the set of parameter values to try, and create a parameter dictionary keyed by the keyword arguments of RandomForestRegressor (you can check the documentation for those). Then create an instance of the regressor algorithm being used, and an instance of GridSearchCV containing the regressor, the parameter dictionary, and the scoring. Lastly, fit the GridSearchCV instance to the training dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7a94a9a4780c16e3dc939d8be7eea05e/href">https://medium.com/media/7a94a9a4780c16e3dc939d8be7eea05e/href</a></iframe>
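<p>A sketch of that grid search (the parameter grid here, tuning only max_depth, is an assumption; any RandomForestRegressor keyword argument could be searched):</p><pre>
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# hypothetical parameter grid; the exact values searched are assumptions
parameters = {"max_depth": [None, 2, 4, 6, 8, 10, 12]}

regressor = RandomForestRegressor(random_state=42)
gs = GridSearchCV(regressor, parameters, scoring="neg_mean_squared_error")
gs.fit(X_train, y_train)

model = gs.best_estimator_
</pre>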
<p>After running the above, we get the best estimator and store it in a variable called model. Then we fit it on our training dataset and use it to predict on our test dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1b03159e1f62580eb983ee33350938d4/href">https://medium.com/media/1b03159e1f62580eb983ee33350938d4/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/46b022df7bc64f709b81d867164e94fc/href">https://medium.com/media/46b022df7bc64f709b81d867164e94fc/href</a></iframe><p>Following this, the error has reduced a bit, from $33,617.45 to $32,911.09, which is fair.</p><p><strong>Making a predictive system</strong></p><p>For instance, suppose a user inputs their country as United States, their EdLevel as Master’s degree, and their YearsCodePro as 15 years.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f53216c17062cc2cdce0231f3e6cd700/href">https://medium.com/media/f53216c17062cc2cdce0231f3e6cd700/href</a></iframe><p>Below is the result for the user’s annual salary.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7b7ded95f410b472fb54f83e59764fb2/href">https://medium.com/media/7b7ded95f410b472fb54f83e59764fb2/href</a></iframe><p>In conclusion, we have seen a step-by-step approach to building the model for our salary prediction web app. In my next article, I will be sharing how to deploy this model.</p>]]></content:encoded>
        </item>
    </channel>
</rss>