Day 44 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode · 3 min read · Jul 30, 2020

Quadratic Discriminant Analysis

Discriminant analysis encompasses methods that can be used for both classification and dimensionality reduction. Linear discriminant analysis (LDA) is particularly popular because it is both a classifier and a dimensionality reduction technique. Quadratic discriminant analysis (QDA) is a variant of LDA that allows for non-linear separation of data.

These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune.

I picked up this diagram from the sklearn documentation; it shows how these two models classify and separate data depending on factors such as the covariance structure and standard deviation of the classes.
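Since the diagram itself is not reproduced here, below is a minimal sketch on toy data of my own (not from the documentation): two Gaussian classes with different covariance matrices, a setting where QDA's quadratic boundary has the edge over LDA's linear one.

import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.RandomState(0)
# Two Gaussian classes with *different* covariance matrices.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 2], [[2.0, 1.5], [1.5, 2.0]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# LDA assumes one shared covariance (linear boundary); QDA fits one
# covariance per class (quadratic boundary).
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print('LDA accuracy:', lda.score(X, y))
print('QDA accuracy:', qda.score(X, y))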

I shall share the implementation code in Python along with a link to the dataset which I have used.

Let's start by importing the libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from tqdm import tqdm_notebook

Here, we have a number of libraries imported and I’ll give a gist of the ones I haven’t used before.
- StratifiedKFold from sklearn.model_selection: a variation of KFold that preserves the class proportions of the target in every fold. It splits the data into n_splits parts and uses each part once as the test set; if shuffling is enabled, the data is shuffled only once, before the split (a small demo follows this list).
- roc_auc_score from sklearn.metrics: computes the area under the ROC curve, in which the true positive rate (sensitivity) is plotted against the false positive rate (1 − specificity) for different cut-off points of a parameter.
- VarianceThreshold from feature_selection: Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
- from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis: A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class.
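Before moving on, here is a quick standalone demo (toy data of my own, not from the dataset) of what StratifiedKFold and VarianceThreshold actually do:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import VarianceThreshold

# 1) StratifiedKFold keeps the class ratio identical in every fold.
y = np.array([0] * 80 + [1] * 20)        # imbalanced labels: 20% positives
X = np.arange(len(y)).reshape(-1, 1)     # dummy feature matrix
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    print(f'fold {fold}: positive rate = {y[test_idx].mean():.2f}')  # 0.20 every time

# 2) VarianceThreshold drops low-variance columns.
Z = np.array([[1.0, 0.0, 10.0],
              [1.0, 0.1, -10.0],
              [1.0, -0.1, 20.0]])
kept = VarianceThreshold(threshold=2).fit_transform(Z)
print(kept.shape)  # (3, 1): only the third, high-variance column survives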

The next step would be to read all the input data using Pandas.

train = pd.read_csv('../input/instant-gratification/train.csv')
test = pd.read_csv('../input/instant-gratification/test.csv')

Next, I want only the feature columns, so I have defined a list comprehension that drops the id, target, and wheezy-copper-turtle-magic columns, as seen below.

cols = [c for c in train.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']]

We shall now create two zero-filled arrays: oof will hold the out-of-fold predictions for the training rows, and preds will accumulate the averaged predictions for the test rows.

oof = np.zeros(len(train))
preds = np.zeros(len(test))

The next portion is the main code, and the explanation follows right after.

# Build one model per value of the categorical feature 'wheezy-copper-turtle-magic'.
for i in tqdm_notebook(range(512)):
    # Subset train and test to the rows where the magic feature equals i.
    train2 = train[train['wheezy-copper-turtle-magic'] == i]
    test2 = test[test['wheezy-copper-turtle-magic'] == i]
    idx1 = train2.index; idx2 = test2.index
    train2.reset_index(drop=True, inplace=True)

    # Fit VarianceThreshold on train and test together, keeping only the
    # columns whose variance exceeds 2 (the informative features).
    data = pd.concat([pd.DataFrame(train2[cols]), pd.DataFrame(test2[cols])])
    data2 = VarianceThreshold(threshold=2).fit_transform(data[cols])
    train3 = data2[:train2.shape[0]]; test3 = data2[train2.shape[0]:]

    # Stratified 11-fold CV (shuffle=True is needed for random_state to apply).
    skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=42)
    for train_index, test_index in skf.split(train2, train2['target']):
        # Pass reg_param by keyword: the first positional argument of
        # QuadraticDiscriminantAnalysis is priors, not the regularizer.
        clf = QuadraticDiscriminantAnalysis(reg_param=0.1)
        clf.fit(train3[train_index, :], train2.loc[train_index]['target'])
        # Out-of-fold predictions for the held-out training rows...
        oof[idx1[test_index]] = clf.predict_proba(train3[test_index, :])[:, 1]
        # ...and test predictions averaged over the 11 folds.
        preds[idx2] += clf.predict_proba(test3)[:, 1] / skf.n_splits

I'll give a gist of what the code means. The column wheezy-copper-turtle-magic takes 512 distinct values, so we loop over them and train a separate model on each subset of the data. For each subset, we concatenate the train and test rows, apply VarianceThreshold to keep only the high-variance (informative) features, and then run stratified 11-fold cross-validation: a QDA classifier is fit on each fold, its predictions on the held-out rows fill the oof array, and its test predictions are averaged across the folds into preds.

Finally, we evaluate the model by computing the AUC score value on the out-of-fold predictions.
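This is a one-liner using the roc_auc_score we imported earlier (the exact score depends on the run, so I am not quoting a number here):

# Overall cross-validation AUC, measured on the out-of-fold predictions.
auc = roc_auc_score(train['target'], oof)
print('CV AUC:', round(auc, 5))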

That’s it for today. Keep Learning.

Cheers.
