Machine Learning A-Z with Python and R

Sharon Woo
Jul 24, 2017 · 4 min read

Function cheat sheet

Status update: new to R, figuring out RStudio. Revising Python and learning new functions.

Data pre-processing

Here we replace missing values in datasets with the mean of other values.

Python — sklearn Imputer

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
# function replaces NaN values with what you tell it to
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[:, 1:3]) # .fit() learns the column means from (n_samples, n_features)
X[:, 1:3] = imputer.transform(X[:, 1:3]) # applies the changes
# encoding categorical data with dummy variables
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0]) # creates dummy variables - 3 columns to replace country
X = onehotencoder.fit_transform(X).toarray() # toarray() as there are now 3 columns in place of 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# function replaces text labels with category indices 0,1,2...
# labelencoder_X.fit_transform(X[:,0])
# to fit the labelencoder_X object to the first column of X
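As a side note, in current scikit-learn versions Imputer has been removed and replaced by SimpleImputer in sklearn.impute. A minimal sketch (the toy array is mine, just for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)
# the NaNs are replaced by the column means: 4.0 in column 0, 3.0 in column 1
```

The interface is the same fit/transform pattern, only the axis argument is gone (SimpleImputer always works column-wise).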

R

# each index has to have missing values replaced separately
# note TRUE has to be written this way, else it becomes an object and you get this error - Error in mean.default(x, na.rm = True) : object 'True' not found
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
# encoding categorical data with factor function
# if the levels/labels arguments are left empty you get an error, so check them
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain','Germany'),
labels = c(1,2,3))
dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0,1))

Splitting into training and test sets

Python

from sklearn.model_selection import train_test_split # older sklearn versions used sklearn.cross_validation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0) # train_size = 1 - test_size; random_state fixes the random sampling

R

# install.packages("caTools")
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Feature scaling

Python

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) # transform only - reuse the scaling fitted on the training set
# do you need to scale dummy variables? it depends on the context and how much interpretability you want to keep
# do you need to scale the dependent variable? no for classification problems; yes for regression problems

R

training_set[,2:3] = scale(training_set[,2:3]) 
test_set[,2:3] = scale(test_set[,2:3])
# a factor in R is not a numeric instance, so columns 1 and 4 are not numeric

Simple linear regression

Python

# fit regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train) # fits the regressor to the training data
# predicting test set results
Y_pred = regressor.predict(X_test)
# visualisation of training set results - uses matplotlib
plt.scatter(X_train, Y_train, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue") # regression line from the training fit
plt.title("Salary vs Experience (Training Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
# visualisation of test set results - uses matplotlib
plt.scatter(X_test, Y_test, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue") # same regression line - the model was fit on the training set
plt.title("Salary vs Experience (Test Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

R

# R is a lot stricter on typos. Check for additional spaces etc., else the code will break!
dataset = read.csv("Salary_Data.csv")
# splitting dataset into training and test sets
# install.packages("caTools")
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# salary proportional (~) to years experience
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)
# optional - shows the P values as *s
# summary(regressor)
Y_pred = predict(regressor, newdata = test_set)
# install.packages("ggplot2")
library(ggplot2)
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = "red") +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = "blue") +
ggtitle("Salary vs Experience (Training Set)") +
xlab("Years of experience") +
ylab("Salary")
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = "red") +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = "blue") +
ggtitle("Salary vs Experience (Test Set)") +
xlab("Years of experience") +
ylab("Salary")

Building a model

  1. Use all variables
  2. Backward elimination
  3. Forward selection
  4. Bidirectional elimination
  5. Score comparison

Python

Revision: use OneHotEncoder to create dummy variables to replace the state text strings (New York etc.).

# code here is largely repeated from simple linear regression
# only new code is put in
# avoid the dummy variable trap
X = X[:, 1:]
# building the optimal model using backward elimination
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)
# adds a matrix of ones as the first column to existing matrix

Polynomial Regression

R

  1. Create first quadratic variable — e.g. dataset$Level2 = dataset$Level^2
  2. Fit polynomial regression to the dataset — poly_reg = lm(formula = Salary ~ ., data = dataset)
  3. Visualise results with ggplot2: geom_point for the actual points, geom_line for the prediction (aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)))
  4. Predict new result with polynomial regression — y_pred = predict(poly_reg, data.frame(Level = … insert levels here corresponding to number of corresponding quadratic variables))
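The same workflow in Python can use PolynomialFeatures to generate the quadratic (and higher) terms instead of creating them by hand. A sketch on toy quadratic data (the data and the 6.5 query point are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# toy data: y = x^2 exactly
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = (X ** 2).ravel()

poly = PolynomialFeatures(degree=2)       # adds x^2 (and a bias column)
X_poly = poly.fit_transform(X)
regressor = LinearRegression().fit(X_poly, y)

# predicting a new result: transform the query the same way as the training data
y_pred = regressor.predict(poly.transform([[6.5]]))  # ~ 6.5^2 = 42.25
```

This mirrors step 4 in R: the query must carry the same polynomial columns the model was trained on, which poly.transform handles automatically.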

Support Vector Regression

Python

  1. New: import SVR from scikit-learn. Kernel options include linear, poly, rbf, sigmoid.
  2. Perform feature scaling (StandardScaler in scikit-learn) on both X and y, and transform them.
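A minimal Python sketch of those two steps on toy data (illustrative values; the key point is that sklearn's SVR does not scale internally, so y must be scaled too and the prediction inverse-transformed back):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = (X ** 2).ravel()

# scale X and y separately - SVR is sensitive to feature scale
sc_X, sc_y = StandardScaler(), StandardScaler()
X_s = sc_X.fit_transform(X)
y_s = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel="rbf")
regressor.fit(X_s, y_s)

# predict on the original scale: transform the input, inverse-transform the output
y_pred = sc_y.inverse_transform(
    regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1)
).ravel()
```

Forgetting either the transform on the query or the inverse_transform on the prediction is a common mistake; both are needed because the model only ever sees scaled values.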
