Machine Learning — 3 (Python Models)
Scikit Learn
This library contains most of the algorithms that are required for machine learning.
- TSNE — It takes high-dimensional data and reduces it to 2 dimensions, so that it can be visualized.
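As a minimal sketch (using the built-in iris dataset as stand-in data), t-SNE can be run like this:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data  # 150 samples, 4 dimensions

# reduce 4 dimensions down to 2 for visualization
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)  # (150, 2)
```

The 2-D embedding can then be passed to any scatter-plot routine, coloring points by class label.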
Summarizing Numerical and Categorical variables
For categorical variables, if the number of distinct categories for a variable is more than the square root of N (the number of samples), then the variable is not useful. Some people use N/2 as the threshold instead.
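A quick sketch of the sqrt(N) rule on a toy DataFrame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "C"] * 3,             # 3 distinct categories
    "user_id": [f"u{i}" for i in range(9)],  # distinct for every row
})

threshold = np.sqrt(len(df))  # sqrt(N) rule of thumb
useful = {col: df[col].nunique() <= threshold
          for col in df.select_dtypes(include="object").columns}
print(useful)  # city passes the check, user_id does not
```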
Profiling of data using pandas
pandas_profiling.ProfileReport(data)
This is a powerful tool for quick EDA. (Note: the pandas_profiling package has since been renamed to ydata-profiling; the report API is the same.)
Most people create their own library of functions that takes raw data and automates some of the manual steps.
Custom EDA Tools
- identifies which variables are numerical and which are categorical
- decides what plot is best for each variable (scatter vs. other plot types)
- builds bivariate plots between variables and exports them as PDF
- plots X against X, and X against Y
- applies transformations and plots the transformed variables
- eventually generates 6 PDF files which can be analyzed offline
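A minimal sketch of the first step of such a custom tool: splitting variables by type and suggesting a plot for each (the function name and plot choices here are hypothetical):

```python
import pandas as pd

def classify_variables(df):
    """Hypothetical helper: label each column numerical/categorical
    and suggest a univariate plot for it."""
    plan = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            plan[col] = {"type": "numerical", "plot": "histogram"}
        else:
            plan[col] = {"type": "categorical", "plot": "bar chart"}
    return plan

toy = pd.DataFrame({"age": [21, 35, 42], "grade": ["A", "B", "A"]})
print(classify_variables(toy))
```

From a plan like this, the tool can loop over columns, draw the suggested plots, and write them out as PDF pages.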
Automating Variable Selection
- Look at ~50 different metrics: p-values, F-values, etc.
- automate variable selection based on those metrics
Linear Regression
2 options
- statsmodels.api — needs 2 separate variables, one for X and one for Y. The X matrix is used entirely; we cannot select specific X variables within the call.
- statsmodels.formula.api — lets us pick specific X variables by name in the formula:
import statsmodels.formula.api as smf
model = smf.ols(formula='y ~ x1 + x2 + x3', data=dataset1).fit()
# don't drop the ID variable from the dataset, so that you can merge this data set back with the original
Checking VIF
- If VIF > 10, we must drop the variable
- If VIF > 5, we can still drop the variable: VIF = 5 corresponds to R Square = 80%, which is quite high
(80% of the variable's information is present in the other variables, since VIF = 1 / (1 − R²))
Feature Selection
F Regression builds a univariate regression of Y on each individual X variable.
- Calculate F Values for each variable
- retain whichever variables have the highest F values
from sklearn.feature_selection import f_regression
f_values, p_values = f_regression(train_x, train_y)
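A runnable sketch on synthetic regression data, also showing how SelectKBest can keep the top-k variables by F value:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=5.0, random_state=0)

f_values, p_values = f_regression(X, y)  # one F (and p) value per variable
print(np.round(f_values, 1))

# retain the 2 variables with the highest F values
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
X_reduced = selector.transform(X)
print(selector.get_support(), X_reduced.shape)
```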
Error In Model
from sklearn import metrics
import numpy as np

# simple to understand
mae = metrics.mean_absolute_error(test_y, y_pred)
# MSE punishes larger errors, which is useful in the real world
mse = metrics.mean_squared_error(test_y, y_pred)
# RMSE is more popular, as its unit is the same as Y's
rmse = np.sqrt(mse)
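A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn import metrics

test_y = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

mae = metrics.mean_absolute_error(test_y, y_pred)  # (0.5 + 0 + 1) / 3 = 0.5
mse = metrics.mean_squared_error(test_y, y_pred)   # (0.25 + 0 + 1) / 3
rmse = np.sqrt(mse)
print(mae, round(mse, 4), round(rmse, 4))
```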
Checking for Heteroscedasticity
Do a scatter plot with:
- predicted Y on the x axis
- residuals on the y axis
If the spread of the residuals changes with the predicted value (e.g. a funnel shape), heteroscedasticity is present.
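A sketch of that plot with matplotlib, on synthetic residuals deliberately constructed to fan out (all names and numbers are illustrative):

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_pred = rng.uniform(1, 10, size=200)
# residual spread grows with the prediction: a funnel shape = heteroscedasticity
residuals = rng.normal(scale=0.5 * y_pred)

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, s=10)
ax.axhline(0, color="red", linewidth=1)
ax.set_xlabel("Predicted Y")
ax.set_ylabel("Residuals")
fig.savefig("residuals_vs_fitted.png")
saved = os.path.exists("residuals_vs_fitted.png")
print(saved)
```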
QQ Plot
- x axis shows: quantiles of standard normal distribution
- y axis shows: quantiles of residuals
The points should fall along a 45-degree straight line.
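A sketch using scipy's probplot, which returns the paired quantiles (and a correlation r) instead of drawing, so the straight-line check can be done numerically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=500)  # well-behaved residuals for illustration

# x values: quantiles of the standard normal; y values: ordered residuals
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 3))  # r close to 1 means the points hug the straight line
```

Passing `plot=plt` instead would draw the QQ plot directly.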
Auto Correlation
Look at the Durbin-Watson statistic reported with the linear regression results. In time series data, the value of Y depends on previous values of Y; the Durbin-Watson statistic detects this autocorrelation in the residuals (values near 2 mean no autocorrelation, values near 0 mean strong positive autocorrelation).
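A sketch with statsmodels' durbin_watson on two synthetic residual series: independent noise versus a random walk, where each value depends on the previous one:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
noise = rng.normal(size=500)            # no autocorrelation
walk = np.cumsum(rng.normal(size=500))  # strong positive autocorrelation

dw_noise = durbin_watson(noise)
dw_walk = durbin_watson(walk)
print(round(dw_noise, 2), round(dw_walk, 2))  # near 2 vs. near 0
```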