Machine Learning — 3 (Python Models)

Scikit Learn

Gaurav Madan
2 min read · Dec 17, 2019

This library contains most of the algorithms required for machine learning.

  • TSNE: takes high-dimensional data and reduces it to 2 dimensions so that it can be visualized.
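A minimal sketch of t-SNE with scikit-learn, assuming X is a NumPy array of high-dimensional samples (the toy data shape is illustrative):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(100, 50)                   # 100 samples, 50 features (toy data)
X_2d = TSNE(n_components=2).fit_transform(X)  # reduce to 2 dimensions
print(X_2d.shape)                             # (100, 2), ready for a scatter plot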

Summarizing Numerical and Categorical Variables

For a categorical variable, if the number of distinct categories is more than sqrt(N) (where N is the number of samples), then the variable is not useful. Some people use N/2 as the threshold instead.
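A minimal sketch of this check with pandas, using the sqrt(N) rule above (the function name is hypothetical):

import numpy as np
import pandas as pd

def too_many_categories(series: pd.Series) -> bool:
    # flag a categorical variable whose distinct-category count exceeds sqrt(N)
    return series.nunique() > np.sqrt(len(series))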

Profiling Data with pandas

import pandas_profiling
pandas_profiling.ProfileReport(data)

This is a powerful tool for EDA.
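To keep the report for offline review, it can be written to a file (a sketch; the output filename is arbitrary):

import pandas_profiling

profile = pandas_profiling.ProfileReport(data)
profile.to_file('report.html')  # write the EDA report to an HTML file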

Most people create their own library of functions that takes raw data and automates some of the manual steps.

Custom EDA Tools

  • identify which variables are numerical and which are categorical
  • decide what plot is best for each variable (scatter vs. other plot types)
  • draw bivariate plots between variables and export them as PDF
  • plot X against X, and X against Y
  • apply transformations and plot the transformed variables
  • eventually generate 6 PDF files which can be analyzed offline (a sketch of such helpers follows this list)
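A minimal sketch of two such helpers, assuming data is a pandas DataFrame (the function names and the single output PDF are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def split_variable_types(data: pd.DataFrame):
    # classify each column as numerical or categorical by dtype
    numerical = data.select_dtypes(include=[np.number]).columns.tolist()
    categorical = [c for c in data.columns if c not in numerical]
    return numerical, categorical

def export_univariate_plots(data: pd.DataFrame, columns, path='univariate.pdf'):
    # one histogram per variable, collected into a single PDF for offline review
    with PdfPages(path) as pdf:
        for col in columns:
            fig, ax = plt.subplots()
            data[col].hist(ax=ax)
            ax.set_title(col)
            pdf.savefig(fig)
            plt.close(fig)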

Automating Variable Selection

  • look at up to 50 different metrics (p values, F values, etc.)
  • automate the variable selection based on those metrics (see the sketch after this list)
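A minimal sketch of one such automation, backward elimination on p values, assuming X is a DataFrame of predictors and y is the target (the helper name and threshold are illustrative):

import statsmodels.api as sm

def backward_eliminate(X, y, p_threshold=0.05):
    # repeatedly refit OLS and drop the predictor with the worst p value
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        p_values = model.pvalues.drop('const')
        worst = p_values.idxmax()
        if p_values[worst] <= p_threshold:
            break
        cols.remove(worst)
    return cols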

Linear Regression

There are 2 options:

  • statsmodels.api: needs 2 separate variables, one for X and one for y. The X matrix is used in its entirety; we cannot select specific X variables.
  • statsmodels.formula.api:

import statsmodels.formula.api as smf

model = smf.ols(formula='y ~ x1 + x2 + x3', data=dataset1).fit()
# don't drop the ID variable from the dataset, so that you can
# merge this data set back with the original
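For contrast, a minimal sketch of the statsmodels.api route, which takes X and y separately (the column names are carried over from the formula above):

import statsmodels.api as sm

X = sm.add_constant(dataset1[['x1', 'x2', 'x3']])  # add the intercept column
model = sm.OLS(dataset1['y'], X).fit()
print(model.summary())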

Checking VIF

  • If VIF > 10, we must drop the variable.
  • If VIF > 5, we can still drop the variable: since VIF = 1 / (1 - R squared), VIF = 5 corresponds to R squared = 80%, which is quite high.

(80% of the variable's information is already present in the other variables.)
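A minimal sketch of computing VIFs with statsmodels, reusing the X design matrix built above:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(dataset1[['x1', 'x2', 'x3']])
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'const'}
print(vifs)  # drop any predictor whose VIF exceeds the thresholds above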

Feature Selection

f_regression builds a univariate regression of y on each individual X variable.

  • calculate the F value for each variable
  • retain whichever variables have the highest F values
from sklearn.feature_selection import f_regression

f_values, p_values = f_regression(train_x, train_y)
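To retain the top variables directly, scikit-learn's SelectKBest can wrap the same score function (k = 10 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=10)  # keep the 10 best features
train_x_selected = selector.fit_transform(train_x, train_y)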

Error In Model

from sklearn import metrics
import numpy as np

# MAE is simple to understand
mae = metrics.mean_absolute_error(test_y, y_pred)
# MSE punishes larger errors, which is useful in the real world
mse = metrics.mean_squared_error(test_y, y_pred)
# RMSE is more popular, as its unit is the same as y's
rmse = np.sqrt(mse)

Checking for Heteroscedasticity

Make a scatter plot with:

  • Predicted y on the x axis
  • Residuals on the y axis

If the spread of the residuals changes as the predicted value grows (a funnel shape), heteroscedasticity is present; an even band around zero is what we want to see.
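A minimal sketch with matplotlib, reusing the test_y and y_pred arrays from the metrics example above:

import matplotlib.pyplot as plt

residuals = test_y - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')  # reference line at zero residual
plt.xlabel('Predicted y')
plt.ylabel('Residuals')
plt.show()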

QQ Plot

  • x axis shows: quantiles of standard normal distribution
  • y axis shows: quantiles of residuals

If the residuals are normally distributed, the graph should look like a 45-degree straight line.
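A minimal sketch with statsmodels, standardizing the residuals so they are comparable with standard normal quantiles:

import statsmodels.api as sm
import matplotlib.pyplot as plt

standardized = (residuals - residuals.mean()) / residuals.std()
sm.qqplot(standardized, line='45')  # draw the 45-degree reference line
plt.show()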

Autocorrelation

Look at the Durbin-Watson statistic from the linear regression output. Autocorrelation means the value of y depends on a previous value of y from the past, as in time series data.
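A minimal sketch with statsmodels, reusing the fitted model from the linear regression section:

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
# values near 2 suggest no autocorrelation; values toward 0 suggest positive
# autocorrelation, and values toward 4 suggest negative autocorrelation
print(dw)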
