Data Science (Python) :: Basics

Sunil Kumar SV
4 min read · Jun 29, 2017


The intention of this post is to give a quick refresher (so it’s assumed that you are already familiar with the material). You can also treat it as a set of FAQs.

Why do we need feature scaling?

When features are on different scales (e.g. size from 0 to 2,000 and number of bedrooms from 1 to 5), gradient descent takes more time to converge (reach its minimum). Feature scaling therefore helps gradient descent converge faster. With feature scaling, all the values are brought roughly into the range -1 to 1.
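A minimal sketch of feature scaling using scikit-learn’s StandardScaler (standardisation, one common way to scale); the feature matrix X below is made up purely for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: house size and number of bedrooms
X = np.array([[2104, 3], [1600, 3], [2400, 4], [1416, 2]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has zero mean and unit variance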

*************************************************

What is mean normalisation?

It’s a technique used for feature scaling. In this method, each value of the feature is updated with the formula below:

value of feature = (value − mean of feature) / (max − min of feature)
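As a quick sketch (the feature matrix X is again a made-up example), mean normalisation can be applied column-wise with NumPy:

import numpy as np

# Hypothetical feature matrix: house size and number of bedrooms
X = np.array([[2104, 3], [1600, 3], [2400, 4], [1416, 2]], dtype=float)

# Mean normalisation: (value - mean) / (max - min), computed per column
X_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))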

*************************************************

Why do we need encoding? Sample code for encoding?

Encoding is needed so that categorical features get their own weights, which makes the regression model more accurate.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Convert the categorical column (index 3) into integer labels
labelencoder_X = LabelEncoder()
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])

# One-hot encode that column so the integer labels are not treated as ordered
# (note: the categorical_features argument was removed in newer scikit-learn versions)
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
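In newer scikit-learn versions the categorical_features argument no longer exists; the usual replacement is a ColumnTransformer. A minimal sketch, assuming X is a NumPy array whose column 3 is the categorical feature:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 and pass all other columns through unchanged
ct = ColumnTransformer([('onehot', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)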

*************************************************

Why do we need to split the data into training set & test set? How to split?

The data split is done mainly to validate the model. The training set is used to train the model, and the model is then evaluated on the test set. When doing a split, we can instruct the function to split by percentage. For example, a test_size of 0.2 (as in the code below) tells the function to make an 80% : 20% split, 80% for the training set and 20% for the test set. Thus, if a data set has 1,000 records, a test_size of 0.2 yields 800 records for the training set and 200 records for the test set.

from sklearn.model_selection import train_test_split

# 80% training / 20% test split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

************************************************

How to display unique values of a column?

<DataFrame>.<ColumnName>.unique()
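For example (with a hypothetical DataFrame df and a City column, purely for illustration):

import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']})
print(df.City.unique())   # ['Delhi' 'Mumbai' 'Chennai']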

************************************************

How to plot a bar graph for counts of a specific column?

import pandas as pd
<DataFrame>['<ColumnName>'].value_counts().plot.bar()

************************************************

How to select only a few columns from a Data Frame?

<DataFrame>.filter(['<Col1>', '<Col2>', ...], axis=1)

**********************************************

How to remove a few columns from a Data Frame?

<DataFrame>.drop(['<Col1>', '<Col2>', ...], axis=1)
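For example (again with a made-up DataFrame, purely for illustration):

import pandas as pd

df = pd.DataFrame({'Size': [2104, 1600], 'Bedrooms': [3, 2], 'Price': [400, 330]})

# axis=1 means "drop columns"; drop() returns a new DataFrame by default
df_reduced = df.drop(['Bedrooms'], axis=1)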

************************************************

Library used for StatsModelling?

import statsmodels.formula.api as sm
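A minimal usage sketch (the DataFrame and column names are made up for illustration), fitting an ordinary least squares model with an R-style formula:

import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical data: house size vs. price
df = pd.DataFrame({'size': [2104, 1600, 2400, 1416],
                   'price': [400, 330, 369, 232]})

model = sm.ols(formula='price ~ size', data=df).fit()
print(model.summary())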

*************************************************

How do you add a column of all 1's at the beginning of a matrix?

import numpy as np
# Prepend a column of 1's (the intercept term) to the feature matrix X
X = np.append(arr=np.ones((len(X), 1)).astype(int), values=X, axis=1)

*************************************************

What is forward elimination?

Forward Selection chooses a subset of the predictor variables for the final model.

We can do forward stepwise selection in the context of linear regression whether n is less than p or n is greater than p.

Forward selection is a very attractive approach, because it’s both tractable and it gives a good sequence of models.

> Start with a null model. The null model has no predictors, just one intercept (The mean over Y).

> Fit p simple linear regression models, each with one of the variables plus the intercept. So basically, you just search through all the single-variable models for the best one (the one that results in the lowest residual sum of squares). You pick this one and fix it in the model.

> Now search through the remaining p minus 1 variables and find out which variable should be added to the current model to best improve the residual sum of squares.

> Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold (a minimal code sketch follows below).
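A rough sketch of forward selection (not taken from any particular library; the function name and threshold are illustrative), using statsmodels and assuming X is a NumPy feature matrix and y the target vector:

import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, p_threshold=0.05):
    """Greedily add the variable that gives the lowest residual sum of squares,
    stopping when the best remaining candidate is no longer significant."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining:
        best_rss, best_var, best_pval = None, None, None
        for var in remaining:
            model = sm.OLS(y, sm.add_constant(X[:, selected + [var]])).fit()
            if best_rss is None or model.ssr < best_rss:
                best_rss, best_var, best_pval = model.ssr, var, model.pvalues[-1]
        if best_pval > p_threshold:   # stopping rule: best candidate is not significant
            break
        selected.append(best_var)
        remaining.remove(best_var)
    return selected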

*************************************************

What is backward elimination?

Unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.

In order to be able to perform backward selection, we need to be in a situation where we have more observations than variables because we can do least squares regression when n is greater than p. If p is greater than n, we cannot fit a least squares model. It’s not even defined.

> Start with all variables in the model.

> Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.

> The new (p − 1)-variable model is fit, and again the variable with the largest p-value is removed.

> Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a significant p-value defined by some significance threshold (a minimal code sketch follows below).
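A rough sketch of backward elimination along the same lines (again illustrative, not a library routine), assuming X is a NumPy feature matrix and y the target vector:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, p_threshold=0.05):
    """Start with all variables and repeatedly drop the least significant one
    until every remaining variable has a p-value below the threshold."""
    selected = list(range(X.shape[1]))
    while selected:
        model = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
        pvalues = model.pvalues[1:]          # skip the intercept's p-value
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= p_threshold:    # all remaining variables are significant
            break
        selected.pop(worst)                  # drop the least significant variable
    return selected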

Next: Data Science (Python) :: Linear Regression

If you liked this article, please hit the ❤ icon below


Sunil Kumar SV

#ProductManager #TechEnthusiast #DataScienceEnthusiast #LoveToSolveProblemsUsingTech #Innovation