Understanding Machine Learning in R

Alparslan Kapani
5 min readJun 12, 2018


Machine Learning has been a very popular and trendy subject for the last few years. Even though the algorithms have been around for decades, it is only recently that computational power and the amount of available data have converged with them.

Obviously, the following years will bring many breakthroughs in this area, but a basic understanding of this subject is very important. In this post I'd like to cover basic Machine Learning concepts such as models, training, model selection, and scaling in R.

Machine Learning diverges into two approaches: 1. Supervised Learning and 2. Unsupervised Learning.

In supervised learning, you first train a model on your dataset and then predict results based on the model you selected. In unsupervised learning, such as clustering, you have a dataset but you do not know how best to cut or categorize it, so based on the model you select, the machine categorizes it for you.

Before applying any algorithm, data pre-processing is very important in order to get accurate results.

Data Pre-Processing: One of the most common issues in datasets is missing values. There are several ways to handle them, such as removing rows or filling in values, but a common approach is to take the average of the affected column and assign that value to the missing rows. In R, a code snippet like the one below will do the job:
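A minimal sketch of mean imputation (the dataset values here are hypothetical):

```r
# Hypothetical dataset with a missing Salary value
dataset <- data.frame(
  YearsExperience = c(1, 3, 5, 7),
  Salary = c(40000, 55000, NA, 90000)
)

# Replace each missing Salary with the column mean (ignoring NAs)
dataset$Salary <- ifelse(
  is.na(dataset$Salary),
  mean(dataset$Salary, na.rm = TRUE),
  dataset$Salary
)
```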

The type of each variable, such as text, numeric, or factor, is another important thing to adjust in your dataset.
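For example, a text column can be encoded as a factor with numeric labels (the column name and values here are hypothetical):

```r
# Hypothetical dataset with a text column
dataset <- data.frame(State = c("New York", "California", "New York"))

# Encode the text column as a factor with numeric labels
dataset$State <- factor(dataset$State,
                        levels = c("New York", "California"),
                        labels = c(1, 2))
```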

Most algorithms need feature scaling if your dataset's variables are on disproportionate scales. For example, say the X axis is "Years of Experience" and the Y axis is "Salary". One year of experience can mean a couple of thousand dollars in salary, so the Euclidean distance is going to be dominated by the Y axis, and your model's predictions will not be accurate.

In R, sample code like the following will do the work:
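A sketch using base R's `scale()`, which standardizes each column to mean 0 and standard deviation 1 (the data is hypothetical):

```r
# Hypothetical dataset with variables on very different scales
dataset <- data.frame(
  YearsExperience = c(1, 2, 3, 4, 5),
  Salary = c(40000, 50000, 62000, 71000, 80000)
)

# Standardize both columns so neither dominates distance calculations
dataset$YearsExperience <- scale(dataset$YearsExperience)
dataset$Salary <- scale(dataset$Salary)
```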

Once your dataset is ready, you can split it into training and test sets.

Splitting Dataset: As mentioned above, for supervised learning we need to train our algorithm with the data. In this example, we split our dataset 80% in favor of the training set and 20% for the test set.

In R, we can install the “caTools” package to enable easy splitting.
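A sketch using `sample.split` from caTools; the dataset here is hypothetical:

```r
# install.packages("caTools")  # run once if the package is not installed
library(caTools)

# Hypothetical dataset of ten observations
dataset <- data.frame(
  YearsExperience = 1:10,
  Salary = c(40, 45, 52, 58, 63, 69, 74, 80, 86, 91) * 1000
)

set.seed(123)  # so the split is reproducible
split <- sample.split(dataset$Salary, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
```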

After the split process comes the fun part: applying our selected algorithm and predicting the result. But which algorithm best fits our purpose?

The answer depends on what your purpose is and what kind of data you have.

Selecting the Model: If your purpose is predicting real values, then regression analysis will probably do the work. If your purpose is predicting a class or a probability, then classification may be the choice. And if you don't know what kind of dataset you have, then starting with clustering will be better.

If you select regression analysis to predict values, then what kind of regression analysis are you going to pick? Linear, Multiple Linear, Polynomial, Support Vector, Decision Tree, or Random Forest?

The answer depends on what kind of dataset you have. If you have one dependent and one independent variable, then linear regression may do the work. If you have more than one independent variable, then multiple linear regression may do the work. But the important thing here is that your independent variables need to be strongly correlated with the dependent variable in order to make a good prediction.

The regression line's slope is determined by minimizing the sum of squared distances between the actual data points and the predicted values.
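In R, `lm()` fits exactly this least-squares line. A minimal sketch with hypothetical experience/salary data:

```r
# Hypothetical dataset: salary grows roughly linearly with experience
dataset <- data.frame(
  YearsExperience = c(1, 2, 3, 4, 5),
  Salary = c(40000, 48000, 55000, 63000, 70000)
)

# lm() chooses the slope and intercept that minimize the sum of squared residuals
regressor <- lm(Salary ~ YearsExperience, data = dataset)
coef(regressor)
```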

But in some datasets the data points are not that well fitted by a straight line; they follow a strong curve instead. Polynomial Regression may better suit such datasets.
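A sketch of polynomial regression in R via `poly()` inside the `lm` formula, with hypothetical curved data:

```r
# Hypothetical data that curves upward rather than following a straight line
dataset <- data.frame(
  Level = 1:6,
  Salary = c(45000, 50000, 60000, 80000, 110000, 150000)
)

# Second-degree polynomial: Salary ~ Level + Level^2
poly_reg <- lm(Salary ~ poly(Level, 2, raw = TRUE), data = dataset)

# The quadratic fit explains more variance than a straight line on this data
summary(poly_reg)$r.squared
```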

If you have more than one independent variable, you may choose multiple linear regression. But it is important to eliminate any independent variables that are not strongly correlated with the dependent variable, for example with the backward elimination method (there are also other methods).

Let's do it with an example in R. First, we need to fit our multiple linear regression to our training set.
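A sketch of the fit; the Profit, R&D, Administration, and Marketing columns match the example in the text, but the numbers below are synthesized for illustration:

```r
# Synthesized stand-in for the startups data (column names as in the text)
set.seed(42)
n <- 50
RD.Spend        <- runif(n, 0, 170000)
Administration  <- runif(n, 50000, 200000)
Marketing.Spend <- runif(n, 0, 500000)
Profit <- 50000 + 0.8 * RD.Spend + 0.05 * Marketing.Spend + rnorm(n, sd = 5000)
training_set <- data.frame(RD.Spend, Administration, Marketing.Spend, Profit)

# Fit Profit against all other columns
regressor <- lm(Profit ~ ., data = training_set)
summary(regressor)  # coefficient table with p-values and significance stars
```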

Our dependent variable is Profit, and the independent variables are R&D, Marketing, and Administrative Spend. If we look at the statistical significance between the dependent and independent variables:

R&D Spend is strongly statistically significant with respect to the dependent variable Profit.

In the backward elimination technique, you eliminate variables starting from the least significant one, which here is State2. So first we eliminate State2.

To speed things up, I eliminated a few independent variables step by step in advance and came up with this:
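The elimination rounds can be sketched like this, with synthesized data and hypothetical effect sizes: at each step, refit without the predictor whose p-value is highest and above the threshold.

```r
# Synthesized stand-in data: State and Administration have no real effect here
set.seed(1)
n <- 50
RD.Spend        <- runif(n, 0, 170000)
Administration  <- runif(n, 50000, 200000)
Marketing.Spend <- runif(n, 0, 500000)
State  <- factor(sample(c("New York", "California", "Florida"), n, replace = TRUE))
Profit <- 50000 + 0.8 * RD.Spend + 0.05 * Marketing.Spend + rnorm(n, sd = 5000)
dataset <- data.frame(RD.Spend, Administration, Marketing.Spend, State, Profit)

# Round 1: all predictors in
regressor <- lm(Profit ~ RD.Spend + Administration + Marketing.Spend + State,
                data = dataset)
# Round 2: drop the State dummies (highest p-values)
regressor <- lm(Profit ~ RD.Spend + Administration + Marketing.Spend, data = dataset)
# Round 3: drop Administration, leaving only significant predictors
regressor <- lm(Profit ~ RD.Spend + Marketing.Spend, data = dataset)
summary(regressor)
```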

As we said, R&D Spend is strongly correlated, but Marketing Spend is only somewhat correlated. At this point you need to set a statistical significance threshold; most commonly it is set to 5%. So it is your decision whether to include the Marketing Spend variable when making a prediction.

In R, to make a prediction, the following code will do the work:
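A sketch, assuming a fitted regressor and a held-out test set like those above (the data is synthesized):

```r
# Synthesized training/test data for illustration
set.seed(7)
dataset <- data.frame(RD.Spend = runif(50, 0, 170000))
dataset$Profit <- 50000 + 0.8 * dataset$RD.Spend + rnorm(50, sd = 5000)
training_set <- dataset[1:40, ]
test_set <- dataset[41:50, ]

regressor <- lm(Profit ~ RD.Spend, data = training_set)

# Predict Profit for the unseen test observations
y_pred <- predict(regressor, newdata = test_set)
```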

Then you may plot the results for visualization purposes.
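A base-R sketch (synthesized data) that scatters the actual points and draws the fitted regression line:

```r
# Synthesized data for illustration
set.seed(7)
dataset <- data.frame(YearsExperience = runif(30, 1, 10))
dataset$Salary <- 30000 + 9000 * dataset$YearsExperience + rnorm(30, sd = 3000)
regressor <- lm(Salary ~ YearsExperience, data = dataset)

# Scatter the actual points and overlay the fitted line
plot(dataset$YearsExperience, dataset$Salary,
     col = "red", xlab = "Years of Experience", ylab = "Salary",
     main = "Salary vs Experience")
abline(regressor, col = "blue")
```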

For now, I'd like to hold off on the other models, because it would be too much information for one post. In the next post, I will cover Classification and Clustering methods. Till then,

Best Regards,

Alparslan Kapani
