Machine Learning — 3 (Python Models)
Scikit Learn
This library contains most of the algorithms that are required for machine learning.
- TSNE — It takes high-dimensional data and reduces it to 2 dimensions, so that it can be visualized.
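As a minimal sketch (using the built-in iris dataset as stand-in data), t-SNE can be run like this:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data  # 150 samples, 4 dimensions

# reduce 4 dimensions down to 2 for visualization
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)  # (150, 2)
```

The 2-D embedding can then be passed to any scatter-plot routine, coloring points by class label.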
Summarizing Numerical and Categorical variables
For categorical variables, if the number of distinct categories for a variable is more than the square root of N (the number of samples), then the variable is not useful. Some people use N/2 as the threshold instead.
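A quick sketch of the sqrt(N) rule on a toy DataFrame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "C"] * 3,             # 3 distinct categories
    "user_id": [f"u{i}" for i in range(9)],  # distinct for every row
})

threshold = np.sqrt(len(df))  # sqrt(N) rule of thumb
useful = {col: df[col].nunique() <= threshold
          for col in df.select_dtypes(include="object").columns}
print(useful)  # city passes the check, user_id does not
```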
Profiling of data using pandas
pandas_profiling.ProfileReport(data)
This is a powerful tool for quick EDA. (Note: the pandas_profiling package has since been renamed to ydata-profiling; the report API is the same.)
Most people create their own library of functions that takes raw data and automates some of the manual steps.
Custom EDA Tools
- identifies which variables are numerical and which are categorical
- decides what plot is best for each variable (scatter vs. other plot types)
- builds bivariate plots between variables and exports them as PDF
- plots X against X, and X against Y
- applies transformations and plots the transformed variables
- eventually generates 6 PDF files which can be analyzed offline
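A minimal sketch of the first step of such a custom tool: splitting variables by type and suggesting a plot for each (the function name and plot choices here are hypothetical):

```python
import pandas as pd

def classify_variables(df):
    """Hypothetical helper: label each column numerical/categorical
    and suggest a univariate plot for it."""
    plan = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            plan[col] = {"type": "numerical", "plot": "histogram"}
        else:
            plan[col] = {"type": "categorical", "plot": "bar chart"}
    return plan

toy = pd.DataFrame({"age": [21, 35, 42], "grade": ["A", "B", "A"]})
print(classify_variables(toy))
```

From a plan like this, the tool can loop over columns, draw the suggested plots, and write them out as PDF pages.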
Automating Variable Selection
- Look at ~50 different metrics: p-values, F-values, etc.
- automate variable selection based on those metrics
Linear Regression
2 options
- statsmodels.api — needs 2 separate variables, one for X and one for Y. The X matrix is used entirely; we cannot select specific X variables within the call.
- statsmodels.formula.api — lets us pick specific X variables by name in the formula:
import statsmodels.formula.api as smf
model = smf.ols(formula='y ~ x1 + x2 + x3', data=dataset1).fit()
# don't drop the ID variable from the dataset, so that you can merge this data set back with the original
Checking VIF
- If VIF > 10, we must drop the variable
- If VIF > 5, we can still drop the variable: VIF = 5 corresponds to R Square = 80%, which is quite high
(80% of the variable's information is present in the other variables, since VIF = 1 / (1 − R²))
Feature Selection
F Regression builds a univariate regression of Y on each individual X variable.
- Calculate F Values for each variable
- retain whichever variables have the highest F values
from sklearn.feature_selection import f_regression
f_values, p_values = f_regression(train_x, train_y)
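A runnable sketch on synthetic regression data, also showing how SelectKBest can keep the top-k variables by F value:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=5.0, random_state=0)

f_values, p_values = f_regression(X, y)  # one F (and p) value per variable
print(np.round(f_values, 1))

# retain the 2 variables with the highest F values
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
X_reduced = selector.transform(X)
print(selector.get_support(), X_reduced.shape)
```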
Error In Model
from sklearn import metrics
import numpy as np

# simple to understand
mae = metrics.mean_absolute_error(test_y, y_pred)
# MSE punishes larger errors, which is useful in the real world
mse = metrics.mean_squared_error(test_y, y_pred)
# RMSE is more popular, as its unit is the same as Y's
rmse = np.sqrt(mse)
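A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn import metrics

test_y = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

mae = metrics.mean_absolute_error(test_y, y_pred)  # (0.5 + 0 + 1) / 3 = 0.5
mse = metrics.mean_squared_error(test_y, y_pred)   # (0.25 + 0 + 1) / 3
rmse = np.sqrt(mse)
print(mae, round(mse, 4), round(rmse, 4))
```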
Checking for Heteroscedasticity
Do a scatter plot with:
- predicted Y on the x axis
- residuals on the y axis
If the spread of the residuals changes with the predicted value (e.g. a funnel shape), heteroscedasticity is present.
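A sketch of that plot with matplotlib, on synthetic residuals deliberately constructed to fan out (all names and numbers are illustrative):

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_pred = rng.uniform(1, 10, size=200)
# residual spread grows with the prediction: a funnel shape = heteroscedasticity
residuals = rng.normal(scale=0.5 * y_pred)

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, s=10)
ax.axhline(0, color="red", linewidth=1)
ax.set_xlabel("Predicted Y")
ax.set_ylabel("Residuals")
fig.savefig("residuals_vs_fitted.png")
saved = os.path.exists("residuals_vs_fitted.png")
print(saved)
```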
QQ Plot
- x axis shows: quantiles of standard normal distribution
- y axis shows: quantiles of residuals
The points should fall along a 45-degree straight line.
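A sketch using scipy's probplot, which returns the paired quantiles (and a correlation r) instead of drawing, so the straight-line check can be done numerically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=500)  # well-behaved residuals for illustration

# x values: quantiles of the standard normal; y values: ordered residuals
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 3))  # r close to 1 means the points hug the straight line
```

Passing `plot=plt` instead would draw the QQ plot directly.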
Auto Correlation
Look at the Durbin-Watson statistic reported with the linear regression results. In time series data, the value of Y depends on previous values of Y; the Durbin-Watson statistic detects this autocorrelation in the residuals (values near 2 mean no autocorrelation, values near 0 mean strong positive autocorrelation).
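A sketch with statsmodels' durbin_watson on two synthetic residual series: independent noise versus a random walk, where each value depends on the previous one:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
noise = rng.normal(size=500)            # no autocorrelation
walk = np.cumsum(rng.normal(size=500))  # strong positive autocorrelation

dw_noise = durbin_watson(noise)
dw_walk = durbin_watson(walk)
print(round(dw_noise, 2), round(dw_walk, 2))  # near 2 vs. near 0
```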