Machine Learning A-Z with Python and R
Jul 24, 2017
Function cheat sheet
Status update — New to R and figuring out RStudio. Revising Python and learning new functions.
Data pre-processing
Here we replace missing values in the dataset with the mean of the other values in the same column.
Python — sklearn Imputer
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

# Imputer replaces NaN values with what you tell it to
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[:, 1:3]) # .fit() learns the column means from (n_samples, n_features)
X[:, 1:3] = imputer.transform(X[:, 1:3]) # applies the changes

# encoding categorical data with dummy variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) # replaces text labels with category indices 0, 1, 2...
onehotencoder = OneHotEncoder(categorical_features = [0]) # creates dummy variables - 3 columns to replace country
X = onehotencoder.fit_transform(X).toarray() # .toarray() as there are now 3 columns in place of 1
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
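Note that later scikit-learn releases removed `Imputer` and the `categorical_features` argument. Assuming scikit-learn ≥ 0.22 is installed, a sketch of the equivalent with `SimpleImputer` and `ColumnTransformer` (using a small toy frame in place of Data.csv):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# toy stand-in for Data.csv: country, age, salary, purchased
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No"],
})
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

# SimpleImputer replaces the old Imputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

# ColumnTransformer + OneHotEncoder replace categorical_features=[0];
# the dummy columns come first, the remaining columns are passed through
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough")
X = ct.fit_transform(X)

Y = LabelEncoder().fit_transform(Y)
```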
R
# each column has to have its missing values replaced separately
# note TRUE has to be written this way, else it becomes an object and you get this error:
#   Error in mean.default(x, na.rm = True) : object 'True' not found
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

# encoding categorical data with the factor function
# if arguments are left empty you get an error, so check them
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
Splitting into training and test sets
Python
from sklearn.cross_validation import train_test_split # in newer versions: from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
# train_size = 1 - test_size; random_state fixes the seed so the random sampling is reproducible
R
# install.packages("caTools")
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Feature scaling
Python
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) # transform only - reuse the scaler already fitted on the training set

# do you need to scale dummy variables? depends on the context
# do you need to scale the dependent variable? no for classification problems, yes for regression problems
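Why `transform` rather than `fit_transform` on the test set: the test data must be scaled with the mean and standard deviation learned from the training set. A numpy-only sketch of what StandardScaler does under the hood (toy numbers for illustration):

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

# "fit": learn mean and standard deviation from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "transform": apply the training statistics to both sets
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
```

Refitting on the test set would give it a different scale from the one the model was trained on.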
R
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
# a factor in R is not a numeric instance, so columns 1 and 4 are not numeric
Simple linear regression
Python
# fit regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train) # fits the regressor to the training data

# predicting test set results
Y_pred = regressor.predict(X_test)

# visualisation of training set results - uses matplotlib
plt.scatter(X_train, Y_train, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue") # regression line needs training predictions, not Y_pred
plt.title("Salary vs Experience (Training Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

# visualisation of test set results - uses matplotlib
plt.scatter(X_test, Y_test, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue") # same line - the fitted model is unchanged
plt.title("Salary vs Experience (Test Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
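What `LinearRegression` fits can be written down directly: the least-squares slope is cov(x, y) / var(x). A numpy-only sketch with made-up salary data:

```python
import numpy as np

# toy data: salary exactly linear in years of experience
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 45000.0, 50000.0, 55000.0, 60000.0])

# ordinary least squares in closed form
slope = np.cov(years, salary, bias=True)[0, 1] / np.var(years)
intercept = salary.mean() - slope * years.mean()

# predict salary for 6 years of experience
y_pred = intercept + slope * 6.0
```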
R
# R is a lot stricter on typos. Check for additional spaces etc., else the code will break!
dataset = read.csv("Salary_Data.csv")

# splitting dataset into training and test sets
# install.packages("caTools")
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# salary proportional (~) to years experience
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)
# optional - summary(regressor) shows the P values as *s
Y_pred = predict(regressor, newdata = test_set)

# install.packages("ggplot2")
library(ggplot2)
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = "red") +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = "blue") +
ggtitle("Salary vs Experience (Training Set)") +
xlab("Years of experience") +
ylab("Salary")

ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = "red") +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = "blue") +
ggtitle("Salary vs Experience (Test Set)") +
xlab("Years of experience") +
ylab("Salary")
Building a model
- Use all variables
- Backward elimination
- Forward selection
- Bidirectional elimination
- Score comparison
Python
Revision — use LabelEncoder and OneHotEncoder to create dummy variables replacing the state text strings (New York etc.).
# code here is largely repeated from simple linear regression; only the new code is shown

# avoid the dummy variable trap
X = X[:, 1:]

# building the optimal model using backward elimination
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)
# adds a column of ones as the first column of the matrix (the intercept term)
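The elimination loop itself can be sketched without statsmodels. This is a simplified numpy-only version, using |t| > 2 as a rough stand-in for the usual p < 0.05 cut-off and synthetic data in place of the real dataset:

```python
import numpy as np

def backward_elimination(X, y, t_threshold=2.0):
    # repeatedly drop the predictor with the smallest |t| statistic until
    # every remaining |t| exceeds the threshold (|t| > 2 ~ p < 0.05)
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        Xc = X[:, cols].astype(float)
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ beta
        dof = Xc.shape[0] - Xc.shape[1]
        sigma2 = resid @ resid / dof
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
        t = np.abs(beta / se)
        worst = int(np.argmin(t))
        if t[worst] >= t_threshold:
            break
        del cols[worst]
    return cols

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                   # irrelevant predictor
X = np.column_stack([np.ones(n), x1, x_noise])  # column 0 is the intercept
y = 3.0 + 5.0 * x1 + rng.normal(scale=0.5, size=n)
kept = backward_elimination(X, y)               # the noise column should tend to be dropped
```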
Polynomial Regression
R
- Create the first quadratic variable — e.g. dataset$Level2 = dataset$Level^2
- Fit polynomial regression to the dataset — poly_reg = lm(formula = Salary ~ ., data = dataset)
- Visualise results with ggplot2 — geom_point for the actual points, geom_line for the prediction (aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)))
- Predict a new result with polynomial regression — y_pred = predict(poly_reg, data.frame(Level = … one value per polynomial variable))
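The same steps in Python boil down to building a design matrix with the power columns and fitting by least squares. A numpy-only sketch with a made-up exact quadratic:

```python
import numpy as np

# toy data following an exact quadratic: salary = 1000 + 200*level + 50*level^2
level = np.arange(1.0, 11.0)
salary = 1000 + 200 * level + 50 * level ** 2

# design matrix with the quadratic variable added: [1, level, level^2]
X_poly = np.column_stack([np.ones_like(level), level, level ** 2])
coef, *_ = np.linalg.lstsq(X_poly, salary, rcond=None)

def predict(lvl):
    # supply the level and its square, matching the fitted columns
    return coef[0] + coef[1] * lvl + coef[2] * lvl ** 2

y_pred = predict(6.5)
```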
Support Vector Regression
Python
- New: import SVR from scikit-learn. Kernel — linear/poly/rbf/sigmoid etc.
- Perform feature scaling (StandardScaler in scikit-learn) on both the features and the dependent variable, and transform them
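A minimal sketch of those two bullets, assuming scikit-learn is available and using made-up level/salary data: scale both X and y, fit an RBF-kernel SVR, then invert the scaling to read the prediction in the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# toy data: position level vs salary (SVR does no scaling internally)
X = np.arange(1.0, 11.0).reshape(-1, 1)
y = (1000.0 * X.ravel() ** 2).reshape(-1, 1)

sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y).ravel()

regressor = SVR(kernel="rbf")
regressor.fit(X_scaled, y_scaled)

# predict for level 6.5, then transform back to the original salary scale
pred_scaled = regressor.predict(sc_X.transform([[6.5]]))
y_pred = sc_y.inverse_transform(pred_scaled.reshape(-1, 1))[0, 0]
```

Without the scaling step, the RBF kernel sees salary values thousands of times larger than the levels and the fit degenerates.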
