Machine Learning for Software Developers Part II

Shalinda Suresh
Published in Arimac
Apr 4, 2021

This is the second part of my earlier article, Machine Learning for Software Developers. My intention there was to make machine learning approachable for people in the software development field, so to avoid confusion on a first reading I included very little theory.

Today, we’re going to look further into machine learning with the same task we practiced before, but this time in more detail. Hence this article will be lengthier.

“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed”
Arthur Samuel

As per Wikipedia, machine learning is the study of computer algorithms that improve through experience and the use of data. Machine learning, ML for short, is a subset of artificial intelligence. Compared with artificial intelligence as a whole, machine learning is heavily influenced by the study of statistics.

(Image source: Stackexchange)

Types of Machine Learning

Supervised Learning

Supervised learning is an easy to understand and commonly used type of machine learning. It learns the relationship between the feature variables and the target label from examples whose labels are already known.

Features are the independent variables fed into the machine learning process.

And

Labels are the dependent variables, that is, the answers/outputs.

Unsupervised Learning

Unlike supervised learning, the dataset contains no labels, which means the algorithm has to figure out the structure of the data on its own. To do that, it takes advantage of techniques such as grouping/clustering to categorize the data. K-means clustering is one of the popular algorithms for unsupervised learning.
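To make the idea concrete, here is a minimal sketch of K-means clustering on the iris measurements. The choice of three clusters and the n_init and random_state values are my assumptions for illustration, not something the rest of this article depends on.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()

# Only the measurements are used; the species labels are deliberately ignored.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(iris.data)

# Each flower now carries a cluster id (0, 1 or 2) chosen by the algorithm.
print(cluster_ids[:10])
```

Note that the cluster ids are arbitrary group numbers. They are not guaranteed to line up with the species order in the dataset.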

Reinforcement Learning

Reinforcement learning does not use an existing dataset. Instead we create an agent and an environment. The agent performs actions in the environment in order to receive reward points or penalties. This type of machine learning is commonly used in gaming platforms and the robotics industry.

Let’s build our machine learning classifier.

At the top of our script we import the sklearn library. scikit-learn is an open source machine learning library written for Python. sklearn comes with a number of classification and regression algorithms and a few standard datasets.

In the next line we import the NumPy library, which is used to perform operations on arrays and matrices. Python's built-in lists are not designed for numerical array operations, so NumPy fills that gap.

Note: You may need to install these libraries on your computer before importing them.

Next, we import “Pandas”, which is useful for managing our dataset as a table with named columns. Later we convert our iris dataset into a dataframe.
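The three imports described above might look like the sketch below. The aliases np and pd are the usual community conventions, and the pip command in the comment assumes a standard Python setup.

```python
# Imports assumed by the rest of this walkthrough.
from sklearn import datasets   # bundled sample datasets, including iris
import numpy as np             # array and matrix operations
import pandas as pd            # tabular data handling (dataframes)

# If an import fails, install the libraries first:
#   pip install scikit-learn numpy pandas
print(np.__version__, pd.__version__)
```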

A dataframe is just like a table in a relational database.

The next step creates a dataframe from the sklearn iris dataset.
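A minimal sketch of that step, assuming the variable names iris and df (they are my choice for illustration):

```python
from sklearn import datasets
import pandas as pd

# Load the bundled iris dataset (150 flowers, 4 measurements each).
iris = datasets.load_iris()

# Build a dataframe from the measurements, using the dataset's feature names
# as column headers.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.shape)           # (150, 4)
print(list(df.columns))
```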

We have created a dataframe with the feature columns; now we have to create the target column and attach it to the dataframe.
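Attaching the target can be sketched like this. The column name target is an assumption that matches how the column is referred to later in this article.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds one integer label per flower; add it as a new column.
df["target"] = iris.target

print(df["target"].unique())
```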

Howdy coders,

Remember the “head” command, available in Linux distributions, that displays the first lines of a file? Let’s give it a shot the pandas way.

head() outputs the first five rows of our dataframe. You may also pass a number to the pandas head() function to see more rows.
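Here is a small sketch of both forms, rebuilt from scratch so it runs on its own:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Default: the first five rows.
print(df.head())

# Pass a number to see more (here, the first ten rows).
first_ten = df.head(10)
print(first_ten)
```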

As you can see, the target column has numerical values (0, 1, 2 for the three species). This is because machine learning algorithms work with numbers rather than names. Let’s map these numbers to species names in a brand new column.

We can then display the updated dataframe. Notice the new column ‘species’.
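One way to do the mapping is with a plain dictionary and the pandas map() function, as in this sketch. The dictionary name species_names is hypothetical; the name order follows the sklearn iris dataset.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Numeric label -> species name, in the order used by the iris dataset.
species_names = {0: "setosa", 1: "versicolor", 2: "virginica"}
df["species"] = df["target"].map(species_names)

# The dataframe now ends with the 'target' and 'species' columns.
print(df.head())
```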

You can also print the schema of the dataframe. The output of the info() function is pretty similar to that of the SQL command “describe table <Table_Name>”.

The output lists the columns available in the dataframe, and the data types of those columns are summarized at the end.
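A self-contained sketch of the call:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

# Prints one line per column (name, non-null count, dtype),
# followed by a dtype summary and the memory usage.
df.info()
```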

Now let’s find out how many iris flowers are available for each species. This is equivalent to an SQL query with a GROUP BY clause and the count() aggregate function.

As you can see, each species has 50 iris flowers, which adds up to 150 flowers in total.
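In pandas the same idea can be sketched with groupby(). Counting the target column is my choice here; any column without missing values gives the same numbers.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

# Roughly: SELECT species, COUNT(target) FROM iris GROUP BY species
counts = df.groupby("species")["target"].count()
print(counts)
```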

You can also display a basic statistical summary of the dataframe.

Measures including the mean, standard deviation, percentiles, and min and max values can be obtained using the describe() function. All these functions are available in the pandas library.
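A short sketch of describe() on the four measurement columns:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One column of statistics per feature:
# count, mean, std, min, 25%, 50%, 75%, max
summary = df.describe()
print(summary)
```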

Matplotlib is a cross-platform data visualization and graphical plotting library for Python.

Visualization is a vital part of any data analysis process. For your web/software projects you might use D3.js, JSCharting, Google Charts and so on… You name it, fellas!

In Python there is a popular plotting library called Matplotlib. Let’s import matplotlib into our script.

Now we can explore the distribution of the iris measurements on scatter plots. First we take sepal width and length.
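A sketch of the sepal plot follows. Coloring the points by species and the axis choices are my assumptions. In Colab the figure renders inline; in a plain script, call plt.show() at the end.

```python
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

fig, ax = plt.subplots()

# Draw one scatter series per species so each gets its own color and legend entry.
for name, group in df.groupby("species"):
    ax.scatter(group["sepal length (cm)"], group["sepal width (cm)"], label=name)

ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("sepal width (cm)")
ax.set_title("Iris sepal measurements")
ax.legend()
```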

Then we take petal width and length, which are the other features of the iris dataset.
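The petal plot can be sketched the same way, again assuming one color per species. As before, add plt.show() when running outside a notebook.

```python
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

fig, ax = plt.subplots()

# Same approach as for the sepals, but with the two petal columns.
for name, group in df.groupby("species"):
    ax.scatter(group["petal length (cm)"], group["petal width (cm)"], label=name)

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.set_title("Iris petal measurements")
ax.legend()
```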

Alright, you now have a basic visual representation of the feature variables. Next we are going to separate the feature columns and the target variable for our machine learning model. We drop the species and target columns to pick the feature variables from the dataframe and assign them to the variable “X”, and we assign the target (label) column to the variable “y”.

In practice, you will often see uppercase “X” (matrix) for feature variables and lowercase “y” (vector) for the target/label.
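A sketch of the separation. I am assuming here that y holds the numeric target column rather than the species names; either would work as a label.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

# Features: everything except the two label-related columns.
X = df.drop(columns=["target", "species"])

# Label: the numeric target column.
y = df["target"]

print(X.shape, y.shape)   # (150, 4) (150,)
```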

With the features and label assigned to variables, the next thing we have to do is separate the train and test data from the entire sample.

We set “test_size” to 0.3, which means that we hold out thirty percent of the entire dataset to test our machine learning model. We also set random_state to 999. This parameter ensures that the “train_test_split” function produces the same split every time we run it.
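The split described above can be sketched as follows. With 150 rows and test_size=0.3, this leaves 105 rows for training and 45 for testing.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="target")

# Hold out 30% of the rows for testing; random_state fixes the shuffle
# so the same rows land in each set on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=999
)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
```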

We are now ready to select our machine learning model. Let’s instantiate a K-Nearest Neighbors classifier.

The next task is to fit the model. In simple words, “fit” means training the machine learning model with the given data.
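Creating and fitting the classifier might look like this sketch. The value n_neighbors=5 is an assumption (it is also the sklearn default), and to keep the example short it takes the features and labels straight from the sklearn dataset instead of the dataframe.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=999
)

# Create the classifier; each prediction is a vote among the 5 nearest
# training samples (n_neighbors=5 is an assumed value).
knn = KNeighborsClassifier(n_neighbors=5)

# "Fit" = train the model on the training portion only.
knn.fit(X_train, y_train)

print(knn.classes_)
```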

Our machine learning model is now ready. Before we make predictions with the model, we can check its score.

This will output approximately “0.97” as the score (the mean accuracy) of our trained model.
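A sketch of the scoring step, assuming the score is computed on the held-out test data and that the classifier uses n_neighbors=5. The exact number you get may differ slightly from the figure quoted above if your settings differ.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=999
)

knn = KNeighborsClassifier(n_neighbors=5)   # assumed setting
knn.fit(X_train, y_train)

# score() returns the mean accuracy: the fraction of test samples
# whose predicted label matches the true label.
accuracy = knn.score(X_test, y_test)
print(round(accuracy, 2))
```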

Let’s get an answer from our model using new values. We first create a numpy array and assign it to the variable “X_new”.

When you run the prediction it will display ‘virginica’ as the predicted species. That’s the answer our machine learning model gave us.
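Here is a sketch of the prediction step. The measurements in X_new are hypothetical values I picked to resemble a typical virginica flower, and the classifier settings are assumed as before. Since the model is trained on numeric labels in this sketch, the result is translated back to a name through iris.target_names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=999
)

knn = KNeighborsClassifier(n_neighbors=5)   # assumed setting
knn.fit(X_train, y_train)

# One new flower: sepal length, sepal width, petal length, petal width (cm).
# These values are made up for illustration.
X_new = np.array([[6.5, 3.0, 5.5, 2.0]])

prediction = knn.predict(X_new)             # numeric label
species = iris.target_names[prediction]     # label -> species name
print(species)                              # ['virginica']
```

Note that predict() expects a two-dimensional array (one row per sample), which is why X_new has double brackets.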

Awesome, we have almost arrived at the end of the article. Finally, I want to conclude that Machine Learning/Artificial Intelligence will not be a replacement for software development or other industries; they won’t go anywhere. But as the amount of data grows, these technologies will empower other industries for a brighter future.

Here’s the full code for this article. You can try it out yourself by clicking the “Open in colab” button.

https://github.com/shalindasuresh/machine_learning_python/blob/main/iris_knn_complete.ipynb

Skilled Software Engineer with deep expertise and hands on experience in Web Development and Data Engineering.