Understand Machine Learning With An Example In R

Sahiti Kappagantula
Edureka
10 min read · Jun 14, 2018

Machine Learning with R — Edureka

Machine learning is the present and the future! From Netflix’s recommendation engine to Google’s self-driving car, it’s all machine learning. This article on Machine Learning with R covers the core concepts of machine learning, the different types of machine learning algorithms, and how to implement some of those algorithms in R.

This article comprises these sections:

  • Understanding Machine Learning
  • Types of Machine Learning Algorithms
  • Implementing Machine Learning Algorithms with R

Understanding Machine Learning

How do you know all of those are fish?

As a kid, you probably came across a picture of a fish, and your kindergarten teachers or parents told you that it was a fish with specific features: fins, gills, a pair of eyes, a tail and so on. Now, whenever your brain comes across an image with that set of features, it automatically registers it as a fish, because it has learned what a fish looks like.

That’s how our brain functions but what about a machine? If the same image is fed to a machine, how will the machine identify it to be a fish?

This is where Machine Learning comes in. We’ll keep on feeding images of a fish to a computer with the tag “fish” until the machine learns all the features associated with a fish.

Once the machine has learned all the features associated with a fish, we feed it new data to determine how much it has learned.

In other words, Raw Data/Training Data is given to the machine so that it learns all the features associated with the Training Data. Once the learning is done, it is given New Data/Test Data to determine how well the machine has learned.

Let us move ahead in this Machine Learning with R article and look at the types of Machine Learning.
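
The train/test idea can be sketched in a few lines of base R. This is only an illustration: the built-in iris data-set stands in for the fish images, and the 65/35 ratio, the seed and the object names are arbitrary assumptions.

```r
# Illustration only: iris stands in for any labeled data-set.
set.seed(42)                       # arbitrary seed, for reproducibility

n <- nrow(iris)                    # number of observations
train_rows <- sample(n, size = floor(0.65 * n))  # pick 65% of the row numbers at random

train_set <- iris[train_rows, ]    # data the machine learns from
test_set  <- iris[-train_rows, ]   # unseen data used to check what it has learned

nrow(train_set)
nrow(test_set)
```

Every row ends up in exactly one of the two sets, so the model is never evaluated on data it has already seen.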

Types of Machine Learning

Supervised Learning:

A supervised learning algorithm learns from a known, labeled data-set (Training Data) in order to make predictions.

Regression and Classification are examples of Supervised Learning.

Classification:

Classification determines which of a set of categories a new observation belongs to. A classification algorithm learns the features and labels of the training data, and when it is given new data, it assigns labels to the new observations based on what it has learned from the training data.

For this example, if the first observation is given the label “Man” then it is rightly classified but if it is given the label “Woman”, the classification is wrong. Similarly for the second observation, if the label given is “Woman”, it is rightly classified, else the classification is wrong.

Regression:

Regression is a supervised learning technique that helps determine how one variable influences another.

Here, “living_area” is the independent variable and “price” is the dependent variable, i.e. we are determining how “price” varies with respect to “living_area”.
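
As a minimal sketch of this relationship, the snippet below fits a straight line with lm(). The numbers are made up for illustration (they are not from any real housing data), and the object names are hypothetical.

```r
# Hypothetical toy data: living area and price in arbitrary units.
houses <- data.frame(
  living_area = c(50, 70, 90, 110, 130),
  price       = c(2050, 2380, 2810, 3190, 3620)
)

# price is the dependent variable, living_area the independent variable.
fit <- lm(price ~ living_area, data = houses)

# Intercept and slope: the slope is the change in price per extra unit of living_area.
coef(fit)

# Predict the price of a new, unseen house.
predict(fit, newdata = data.frame(living_area = 100))
```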

Unsupervised Learning:

An unsupervised learning algorithm draws inferences from data that does not have labels.

Clustering is an example of unsupervised learning. “K-means”, “Hierarchical”, “Fuzzy C-Means” are some examples of clustering algorithms.

In this example, the set of observations is divided into two clusters. Clustering is done on the basis of similarity between the observations. There is a high intra-cluster similarity and low inter-cluster similarity i.e. there is a very high similarity between all the buses but low similarity between the buses and cars.
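
The bus/car grouping can be sketched with hierarchical clustering from base R. The measurements below are invented for illustration; note that no labels are passed to the algorithm, so the two groups emerge purely from the distances between observations.

```r
# Hypothetical measurements: three buses followed by three cars.
vehicles <- data.frame(
  length_m = c(12.0, 11.5, 12.5, 4.2, 4.5, 4.0),
  weight_t = c(11.0, 12.0, 11.5, 1.3, 1.5, 1.2),
  row.names = c("bus1", "bus2", "bus3", "car1", "car2", "car3")
)

# No labels are supplied: the grouping comes only from pairwise distances.
distances <- dist(vehicles)        # Euclidean distance between every pair
tree      <- hclust(distances)     # agglomerative hierarchical clustering
clusters  <- cutree(tree, k = 2)   # cut the tree into two clusters

clusters
```

Observations that land in the same cluster are close to each other (high intra-cluster similarity), while the two clusters are far apart (low inter-cluster similarity).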

Reinforcement Learning:

Reinforcement Learning is a type of machine learning in which a machine/agent in an environment learns the ideal behavior in order to maximize its performance. The agent only needs simple reward feedback to learn its behavior; this feedback is known as the reinforcement signal.

Let’s take Pac-Man as an example. As long as Pac-Man keeps eating food, it earns points, but when it crashes into a monster, it loses a life. Thus Pac-Man learns that it needs to eat more food and avoid monsters in order to improve its performance.
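
A very rough sketch of this reward loop is shown below. It is not a Pac-Man implementation: it assumes a toy agent with only two possible actions, made-up rewards (+1 for eating food, -1 for hitting a monster) and a simple epsilon-greedy rule for choosing actions. All names and numbers are hypothetical.

```r
# Toy sketch: an agent learns by trial and error which action pays off.
set.seed(1)                                   # arbitrary seed
rewards <- c(hit_monster = -1, eat_food = 1)  # the reinforcement signal (made up)
q       <- c(hit_monster = 0, eat_food = 0)   # estimated value of each action
tries   <- c(hit_monster = 0, eat_food = 0)   # how often each action was taken
epsilon <- 0.1                                # share of steps spent exploring

for (step in 1:200) {
  if (runif(1) < epsilon) {
    action <- sample(names(q), 1)             # explore: pick a random action
  } else {
    action <- names(q)[which.max(q)]          # exploit: pick the best-looking action
  }
  r <- rewards[[action]]
  tries[action] <- tries[action] + 1
  # Move the estimate towards the observed reward (running average).
  q[action] <- q[action] + (r - q[action]) / tries[action]
}

q       # learned value of each action
tries   # how often each action was chosen
```

The agent is never told which action is correct; it only sees the reward, and its value estimates steer it towards eating food.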

Implementing Machine Learning with R:

Linear Regression:

We’ll be working with the “diamonds” data-set, which ships with the “ggplot2” package, to implement the linear regression algorithm:

Description of the data-set:

Before building any model on the data, we need to split the data into “train” and “test” sets. The model will be built on the “train” set and its accuracy will be checked on the “test” set.

We need to load the “caTools” package to split the data into two sets. The “diamonds” data-set itself comes from the “ggplot2” package, so that package is loaded as well.

library(caTools)
library(ggplot2)

The “caTools” package provides the function “sample.split()”, which helps in splitting the data.

sample.split(diamonds$price,SplitRatio = 0.65)->split_index

“sample.split()” returns a logical vector with one entry per observation: about 65% of the observations have been assigned the label TRUE and the remaining 35% have been assigned the label FALSE.

subset(diamonds,split_index==TRUE)->train
subset(diamonds,split_index==FALSE)->test

All the observations labeled TRUE have been stored in the “train” object, and the observations labeled FALSE have been assigned to the “test” set.

Now that the splitting is done and we have our “train” and “test” sets, it’s time to build the linear regression model on the training set.

We’ll be using the “lm()” function to build the linear regression model on the “train” data. We are determining the price of the diamonds with respect to all other variables of the data-set. The built model is stored in the object “mod_regress”.

lm(price~.,data = train)->mod_regress

Now that we have built the model, we need to make predictions on the “test” set. The “predict()” function is used to get predictions. It takes two arguments: the built model and the test set. The predicted results are stored in the “result_regress” object.

predict(mod_regress,test)->result_regress

Let’s bind the actual price values from the “test” data-set and the predicted values into a single data-set using the “cbind()” function. The new data-frame is stored in “Final_Data”:

cbind(Actual=test$price,Predicted=result_regress)->Final_Data
as.data.frame(Final_Data)->Final_Data

A glance at the “Final_Data”, which comprises the actual values and the predicted values:

Let’s find the error by subtracting the predicted values from the actual values and add this error as a new column to the “Final_Data”:

(Final_Data$Actual- Final_Data$Predicted)->error
cbind(Final_Data,error)->Final_Data

A glance at the “Final_Data”, which now also includes the error in prediction:

Now, we’ll go ahead and calculate the “Root Mean Square Error”, which gives an aggregate error for all the predictions:

rmse1<-sqrt(mean(Final_Data$error^2))
rmse1
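
The same calculation can be wrapped in a small helper so that it can be reused for any model. The function name “rmse” and the example numbers below are illustrative assumptions, not part of the original walkthrough.

```r
# Small helper: root mean square error between actual and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical numbers: the errors are 3 and -4, so the squared errors are 9 and 16.
rmse(c(10, 20), c(7, 24))   # sqrt((9 + 16) / 2) = sqrt(12.5)
```

Because the errors are squared before averaging, positive and negative errors cannot cancel out, and large errors weigh more heavily than small ones.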

Going ahead, let’s build another model, so that we can compare the accuracy of both these models and determine which is a better one.

We’ll build a new linear regression model on the “train” set, but this time we’ll drop the ‘y’ and ‘z’ columns from the independent variables, i.e. the “price” of the diamonds is determined by all the columns except ‘y’ and ‘z’.

The model built is stored in “mod_regress2”:

lm(price~.-y-z,data = train)->mod_regress2

The predicted results are stored in “result_regress2”:

predict(mod_regress2,test)->result_regress2

Actual and Predicted values are combined and stored in “Final_Data2”:

cbind(Actual=test$price,Predicted=result_regress2)->Final_Data2
as.data.frame(Final_Data2)->Final_Data2

Let’s also add the error in prediction to “Final_Data2”:

(Final_Data2$Actual- Final_Data2$Predicted)->error2
cbind(Final_Data2,error2)->Final_Data2

A glance at “Final_Data2”:

Finding Root Mean Square Error to get the aggregate error:

rmse2<-sqrt(mean(Final_Data2$error2^2))
rmse2

We see that “rmse2” is marginally less than “rmse1”. Since a lower RMSE means smaller prediction errors, the second model is marginally better than the first model.

Classification:

We’ll be working with the “car_purchase” data-set to implement recursive partitioning, which is a classification algorithm.

Let’s split the data into “train” and “test” sets using the “sample.split()” function from the “caTools” package.

library(caTools)

65% of the observations from ‘Purchased’ column will be assigned “TRUE” labels and the rest will be assigned “FALSE” labels.

sample.split(car_purchase$Purchased,SplitRatio = 0.65)->split_values

All those observations which have “TRUE” label will be stored into ‘train’ data and those observations having “FALSE” label will be assigned to ‘test’ data.

subset(car_purchase,split_values==TRUE)->train_data
subset(car_purchase,split_values==FALSE)->test_data

Time to build the Recursive Partitioning algorithm:

We’ll start off by loading the ‘rpart’ package:

library(rpart)

The “Purchased” column will be the dependent variable and all other columns are the independent variables, i.e. we are determining whether the person has bought the car or not with respect to all other columns. The model is built on the “train_data” and the result is stored in “mod1”.

rpart(Purchased~.,data = train_data)->mod1

Let’s plot the result:

plot(mod1,margin = 0.1)
text(mod1,pretty = TRUE,cex=0.8)

Now, let’s go ahead and predict the results on “test_data”. We are giving the built rpart model “mod1” as the first argument, the test set “test_data” as the second argument and prediction type as “class” for the third argument. The result is stored in ‘result1’ object.

predict(mod1,test_data,type = "class")->result1
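
Since “car_purchase” is not bundled with R, the sketch below repeats the same steps on the built-in iris data-set, with “Species” standing in for “Purchased”. It assumes the “rpart” package is installed; the seed and object names are arbitrary. It also shows that “predict()” can return class probabilities instead of labels when type = "prob" is used.

```r
library(rpart)   # assumed to be installed; it ships with most R distributions

set.seed(7)                                        # arbitrary seed
rows      <- sample(nrow(iris), floor(0.65 * nrow(iris)))
train_set <- iris[rows, ]
test_set  <- iris[-rows, ]

# Species plays the role of the dependent variable here.
# A factor response makes rpart() build a classification tree.
tree_fit <- rpart(Species ~ ., data = train_set)

pred_class <- predict(tree_fit, test_set, type = "class")  # one predicted label per row
pred_prob  <- predict(tree_fit, test_set, type = "prob")   # one probability per class per row

head(pred_class)
head(pred_prob)
```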

Let’s evaluate the accuracy of the model using the “confusionMatrix()” function from the caret package.

library(caret)
confusionMatrix(table(test_data$Purchased,result1))

The confusion matrix tells us that out of the 90 observations where the person did not buy the car, 79 observations have been rightly classified as “No” and 11 have been wrongly classified as “YES”. Similarly, out of the 50 observations where the person actually bought the car, 47 have been rightly classified as “YES” and 3 have been wrongly classified as “NO”.

We can find the accuracy of the model by dividing the number of correct predictions by the total number of predictions, i.e. (79+47)/(79+47+11+3) = 126/140 = 0.9.
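
The arithmetic can be checked directly in R by rebuilding the confusion matrix from the four counts quoted above. The object names and the “No”/“Yes” labels are illustrative.

```r
# Counts taken from the confusion matrix described above.
# Rows are the actual outcomes, columns the predicted ones.
cm <- matrix(c(79, 3, 11, 47), nrow = 2,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))
cm

correct  <- sum(diag(cm))     # 79 + 47 = 126 correct predictions
total    <- sum(cm)           # 140 predictions in all
accuracy <- correct / total   # 126 / 140 = 0.9
accuracy
```

The diagonal of the matrix holds the correctly classified observations, so the accuracy is the diagonal sum divided by the grand total.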

K-Means Clustering:

We’ll work with the “iris” data-set to implement k-means clustering:

Let’s remove the “Species” column and create a new data-set which comprises only the first four columns from the ‘iris’ data-set.

iris[1:4]->iris_k

Let us take the number of clusters to be 3. The “kmeans()” function takes the input data and the number of clusters into which the data is to be grouped. The syntax is: kmeans(data, k), where k is the number of cluster centers.

kmeans(iris_k,3)->k1

Analyzing the clustering:

str(k1)

The str() function shows the structure of the object returned by kmeans(), which includes components such as withinss and betweenss that you can use to judge the quality of the clustering.

betweenss : Between-cluster sum of squares, i.e. how far apart the clusters are (inter-cluster separation)

withinss : Within-cluster sum of squares, one value per cluster, i.e. how tightly packed each cluster is (intra-cluster similarity)

tot.withinss : Sum of the withinss of all the clusters, i.e. the total intra-cluster compactness

A good clustering will have a lower value of “tot.withinss” and a higher value of “betweenss”; both depend on the number of clusters ‘k’ chosen initially.
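
To see how these values depend on ‘k’, the sketch below fits k-means for several values of k and compares “tot.withinss”. The range 2 to 6, the seed and nstart = 20 are arbitrary choices made for illustration, not recommendations from the original walkthrough.

```r
iris_k <- iris[1:4]            # the four numeric columns of iris
set.seed(123)                  # arbitrary seed: kmeans() starts from random centers

# Fit k-means for k = 2..6; nstart = 20 keeps the best of 20 random starts.
ks   <- 2:6
fits <- lapply(ks, function(k) kmeans(iris_k, centers = k, nstart = 20))
names(fits) <- paste0("k", ks)

# Total within-cluster sum of squares for each k (lower means tighter clusters).
tot_within <- sapply(fits, function(f) f$tot.withinss)
tot_within

# For any single fit, the total sum of squares splits into the two parts.
k3 <- fits[["k3"]]
all.equal(k3$totss, k3$tot.withinss + k3$betweenss)
```

Because “totss” is fixed for a given data-set, a lower “tot.withinss” automatically means a higher “betweenss”. Adding more clusters tends to keep lowering “tot.withinss”, so a common heuristic is to look for the value of k after which the improvement becomes small.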

The time is ripe to become an expert in Machine Learning and take advantage of the new opportunities that come your way. This brings us to the end of this “Machine Learning with R” article. I hope this article was informative and fruitful.

Originally published at www.edureka.co on June 14, 2018.

Sahiti Kappagantula
Edureka

A Data Science and Robotic Process Automation Enthusiast. Technical Writer.