Stories by Gayathri Gopalsami on Medium

Central Limit Theorem: Let’s learn with an example using Python

Gayathri Gopalsami — Wed, 02 Feb 2022 17:49:18 GMT

A Probability Theory

A small piece of information can tell the whole story

The Central Limit Theorem (CLT) states that if we take random samples of a sufficiently large size from a population, which may or may not follow a Gaussian (Normal) distribution, the distribution of the sample means will be approximately Gaussian (Normal), provided the population has a finite variance. We will try to understand this statement with the help of an example.

Before we do that, let us first answer the below 3 questions.

1. What is Gaussian/Normal Distribution?

It is a symmetric, bell-shaped distribution in which values near the mean are more likely (highlighted in grey) and values farther away from the mean are less likely (highlighted in blue).

Gaussian/Normal Distribution

2. What are population and population mean?

The “population” is the entire dataset collected based on a common characteristic, which can be used for statistical purposes.

For example: the weights of all the fish in the sea.

Population Depiction

The “population mean” is the average value calculated on the entire population.

For example: the average weight of all the fish in the population.

3. What are sample and sample mean?

The “sample” is a subset of a population with fewer data points i.e. data selected randomly from a population.

Sample Depiction

The “sample mean” is the average value calculated on the sample dataset.

For example: the average weight of all the fish in the sample.

Now that we know what population, sample, and Gaussian distribution mean, let’s understand the Central Limit Theorem with the help of an example dataset.

The dataset used in this example consists of the salaries of employees in 2019. The base salary data in this dataset will be our population, and we will select random samples from this population to calculate the sample mean.

We will use the pandas, matplotlib, and seaborn (imported as sns) modules in Python to load and analyze the data.

Import the library modules and load the dataset using pandas

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
population= pd.read_csv('employee_salary.csv')
population

Below is the snapshot of the dataset which has 10105 rows and 8 columns.

For understanding CLT, we will just concentrate on the “Base Salary” column and call it the salary of the entire population.

population_base_salary = population['Base Salary']

Let’s calculate the population mean of the Base Salary and plot the histogram to check the distribution

print("Population Mean")
print("---------------")
print(population_base_salary.mean())
sns.distplot(population_base_salary, color='grey')
plt.xlabel('Population Base Salary')
plt.ylabel('Probability Density')

Population Distribution of Base Salary

The above figure shows that the distribution is slightly asymmetric i.e. does not exactly follow the Gaussian Distribution.

Now we will randomly select a sample of a fixed size and calculate its mean. We repeat this process 500 times to obtain 500 sample means and then plot their distribution.

To make it easier, we can use the function calc_sample_mean below, which takes “sample_size” and “no_of_sample_means” (the number of sample means to be calculated) as input. Each time, the function selects a random sample, calculates its mean, and stores it in a list, which it returns at the end.

If we pass sample_size=2 and no_of_sample_means=500, this function will pick 2 random “Base Salary” values from the dataset, calculate the sample mean and store it in a list. It will repeat this process until 500 such sample means are stored and then return the list of 500 sample means.

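def calc_sample_mean(sample_size, no_of_sample_means):
    # Start from an empty list on every call so means from earlier calls are not mixed in
    mean = []
    for i in range(no_of_sample_means):
        # Draw a random sample of the given size (without replacement) from the population
        sample_base_salary = population_base_salary.sample(n=sample_size)
        sample_mean = sample_base_salary.mean()
        mean.append(sample_mean)
    return mean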

Let’s use the function to calculate mean and plot the distribution for sample_size=2

mean_2=calc_sample_mean(sample_size=2, no_of_sample_means=500)
sns.distplot(mean_2, color='b')
plt.xlabel('Sample Base Salary (Sample size =2)')
plt.ylabel('Probability Density')

Distribution Curve from Sample Mean with Sample Size 2

There it is!

Even with a sample size of just 2, the sample means produce a roughly bell-shaped curve, although it is still skewed.

This illustrates the theorem: even when the population does not follow a Gaussian (Normal) distribution, the distribution of the means of random samples drawn from it moves toward a Gaussian (Normal) distribution, and the approximation improves as the sample size grows.

Let us check by following the same process but by increasing the sample_size.

sample_size=3

mean_3=calc_sample_mean(sample_size=3, no_of_sample_means=500)
sns.distplot(mean_3, color='r')
plt.xlabel('Sample Base Salary Mean(Sample size =3)')
plt.ylabel('Probability Density')

Distribution Curve from Sample Mean with Sample Size 3

sample_size=10

mean_10=calc_sample_mean(sample_size=10, no_of_sample_means=500)
sns.distplot(mean_10, color='y')
plt.xlabel('Sample Base Salary Mean(Sample size =10)')
plt.ylabel('Probability Density')

Distribution Curve from Sample Mean with Sample Size 10

sample_size=20

mean_20=calc_sample_mean(sample_size=20, no_of_sample_means=500)
sns.distplot(mean_20, color='g')
plt.xlabel('Sample Base Salary Mean(Sample size =20)')
plt.ylabel('Probability Density')

Distribution Curve from Sample Mean with Sample Size 20

sample_size=30

mean_30=calc_sample_mean(sample_size=30, no_of_sample_means=500)
sns.distplot(mean_30, color='maroon')
plt.xlabel('Sample Base Salary Mean(Sample size =30)')
plt.ylabel('Probability Density')

Distribution Curve from Sample Mean with Sample Size 30

sample_size>30

Distribution Curve from Sample Mean with Sample Size 40

Distribution Curve from Sample Mean with Sample Size 100

As we increased the sample size, the skewness reduced and the curve became sharper. Check out the code in GitHub.

This shows that even when the population distribution is not Gaussian, the distribution of the sample means approaches a Gaussian (Normal) distribution as the sample size increases.

A sample size of 30 or more is a common rule of thumb for the theorem to hold, but it is not always necessary. In our example, we get an approximately Gaussian distribution with a sample size of 10 or 20 too.
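The same effect can be reproduced without the salary file. The sketch below is only an illustration under assumptions: it uses a made-up, right-skewed population drawn from an exponential distribution (not the article's dataset), and the helper names sample_means and skewness are hypothetical. It prints, for each sample size, the skewness and the spread of 2,000 sample means.

```python
import random
import statistics

def sample_means(population, sample_size, n_means, rng):
    """Return n_means means of random samples (without replacement) of the given size."""
    return [statistics.fmean(rng.sample(population, sample_size)) for _ in range(n_means)]

def skewness(values):
    """Third standardized moment: 0 for a perfectly symmetric distribution."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical right-skewed "salary" population, NOT the article's CSV
population = [rng.expovariate(1 / 50_000) for _ in range(10_000)]

results = {}
for n in (2, 10, 30, 100):
    means = sample_means(population, n, 2_000, rng)
    results[n] = (skewness(means), statistics.pstdev(means))
    print(f"sample_size={n:>3}  skewness={results[n][0]:.2f}  spread={results[n][1]:,.0f}")
```

As the sample size grows, the skewness should move toward 0 and the spread of the sample means should shrink, which is the "sharper" curve seen in the plots.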

So one can ask, what is the practical implication of the Central Limit Theorem?

Collecting and statistically analysing data for an entire population is practically impossible in almost all cases. Thanks to the CLT, sample data can be used to draw conclusions about the overall population, because we know that the means of sufficiently large samples are approximately normally distributed.
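One common way to put this into practice is a confidence interval for the population mean built from a single sample. The sketch below is a minimal illustration, assuming a hypothetical skewed population (again not the article's salary data) and a sample size of 100; it relies on the normal approximation that the CLT provides.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical skewed population standing in for the salary data
population = [rng.expovariate(1 / 50_000) for _ in range(10_000)]

sample = rng.sample(population, 100)  # one random sample of size 100
sample_mean = statistics.fmean(sample)
# Standard error of the mean: sample standard deviation / sqrt(sample size)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# 1.96 is the 97.5th percentile of the standard normal distribution
low = sample_mean - 1.96 * std_error
high = sample_mean + 1.96 * std_error
print(f"95% interval for the population mean: {low:,.0f} to {high:,.0f}")
print(f"Actual population mean:               {statistics.fmean(population):,.0f}")
```

If we repeated this with many different samples, roughly 95% of the intervals would be expected to contain the actual population mean; any single interval may or may not.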

Happy Learning!!!!

Mlearning.ai Submission Suggestions

Confusion Matrix: Let’s learn with example

Gayathri Gopalsami — Sun, 23 Jan 2022 17:33:39 GMT

Everything about the Confusion Matrix

Confusion Matrix, by definition, is a table that summarizes the performance of a classification algorithm. We will try to understand this statement with an example.

Let’s take the example of our favorite game of Cricket. There is a Cricket Board (ABC) that forms the teams, organizes and schedules the matches for the tournament. This time the board members take a crazy decision. They announce that Chuck- a random guy is going to be the new umpire for this tournament. Then they add to it saying that Chuck has only one job to do as an umpire- to decide if the batsman is “out” or “not-out” (In short, he has to classify between 2 classes — “Not Out” or “Out”).

Everyone, including Chuck, is surprised. Chuck doesn’t know anything about the game. People get worried and start questioning the board about this decision. The board then explains that Chuck has to undergo training: he should watch the previous matches and learn for himself the pattern of when a batsman is declared “out” or “not-out”. Chuck is now a little hopeful and agrees to watch all the matches. He keenly observes all the patterns and tries to understand when to decide if a player is “out” or “not out”. After learning the pattern, he feels that he is ready for the job.

The tournament begins and Chuck makes his decisions based on the pattern he has observed and learned.

Now, the board members want to see how Chuck performed with what he has learned from his training. They compare Chuck’s decision with a decision that an experienced umpire would have made. They come up with the data as in the table below. In the table, Not-Out is the positive scenario which is represented as 1, Out is the negative scenario which is represented as 0.

Experienced Umpire Vs Chuck’s decision

Out of 10 decisions made by Chuck, 3 of them were wrong as highlighted in Red. Not bad for the first time though!!!

Wrong Decisions By Chuck (Highlighted in Red)

Let’s check his performance and analyze the implications by counting “Actual and Predicted” values.

Experienced Umpire and Chuck’s decisions highlighted with different colors

Actual “Not out” — Count the number of green cells — 3.

Actual “Out” — Count the number of orange cells — 7

Chuck’s predicted “Not out” — Count the number of yellow cells — 4

Chuck’s predicted “Out” — Count the number of blue cells — 6

Now we can start substituting the numbers in matrix form.

The initial version of Confusion Matrix

Let’s check how many “Outs” and “Not-Outs” Chuck predicted correctly. We know that he made 3 wrong decisions. This means that he has made 7 correct decisions.

Correct Decisions by Chuck

Correctly predicted “Not out” — Count the number of pink cells — 2.

Correctly predicted “Out” — Count the number of aqua green cells —5.

After substituting in our matrix we get:

The Confusion Matrix with all correctly predicted “Outs” and “Not-Outs”

Now we are ready to understand what are True Positives and True Negatives.

True Positives (TP): Correctly predicted positive values — in our case, the number of correct decisions by Chuck as “Not-out” (Pink Cell)

True Negatives (TN): Correctly predicted negative values — in our case, the number of correct decisions by Chuck as “Out” (Aqua Green Cell)

Let’s move forward with wrong predictions.

Wrong Decisions By Chuck

Wrongly predicted “Out” as “Not-Out” — Count the number of red cells — 2.

Wrongly predicted “Not out” as “Out”— Count the number of blue cells — 1.

This completes our matrix:

The complete confusion matrix of Chuck’s decision

Here we will try to understand False Positives and False Negatives

False Positives (FP): Wrongly predicted as positive values — in our case, the number of wrong decisions by Chuck where the player was actually “Out” but Chuck decided he was “Not-out” (Red Cell).

False Negatives (FN): Wrongly predicted as negative values. In our case, the number of wrong decisions by Chuck where the player was actually “Not-Out” but Chuck decided he was “Out” (Blue Cell).
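The four counts can also be obtained by comparing the two lists of decisions pair by pair. The sketch below uses the same decision data that appears in the scikit-learn snippet later in this article (1 = “Not-Out”, 0 = “Out”); the lowercase variable names are only chosen for this illustration.

```python
# Decisions as recorded by the board: 1 = Not-Out (positive), 0 = Out (negative)
experienced_umpire = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # actual values
chuck = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]               # predicted values

pairs = list(zip(experienced_umpire, chuck))

# Each cell of the confusion matrix is a count of (actual, predicted) pairs
tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```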

It is very obvious that when we do something wrong, we call it an error:

Type 1 Errors: False Positives are called Type 1 Errors

Type 2 Errors: False Negatives are called Type 2 Errors

Below figure shows the Confusion Matrix representing Chuck’s decision on the left side and generic confusion matrix representation on the right side:

The Generic Confusion Matrix

As we have the confusion matrix now, we can now calculate Chuck’s performance scores by answering the below questions.

1) How accurate was Chuck in deciding whether a player was “Not-Out” or “Out”?

Accuracy

He was accurate 7 out of 10 times. Hence his Accuracy score is 7/10= 0.7 or 70%

Accuracy = (TP+TN) / Total

2) How inaccurate was Chuck in deciding whether a player was “Not-Out” or “Out”?

Misclassification

He was inaccurate 3 out of 10 times. Hence his Misclassification score is 3/10= 0.3 or 30%

Misclassification= (FP+FN) / Total

3) How many times did Chuck correctly decide that a player is “Not-Out” from all the “Not-Outs” he declared?

Precision

Chuck was correct 2 times out of total 4 times that he decided that a player is “Not-Out”. Hence his Precision score is 2/4 =0.5 or 50 %

Precision = TP / (TP+FP)

4) How many times did Chuck correctly decide that a player is “Not-Out” from all the actual “Not-Outs”?

Recall

Chuck was correct 2 times out of a total of 3 times that a player was actually “Not-Out”. Hence his Sensitivity/Recall score is 2/3 ≈ 0.67 or 67%

Recall= TP / (TP+FN)

5) How many times did Chuck correctly decide that a player is “Out” from all the actual “Outs”?

Specificity

Chuck correctly decided 5 times out of a total of 7 times that a player was actually “Out”. Hence his Specificity score is 5/7 ≈ 0.71 or 71%

Specificity= TN / (FP+TN)
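To double-check the arithmetic, the five formulas above can be evaluated directly from the four counts in Chuck's confusion matrix. This is just a plain-Python sketch of the formulas, with variable names chosen for illustration.

```python
# Counts from Chuck's confusion matrix
tp, tn, fp, fn = 2, 5, 2, 1
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
misclassification = (fp + fn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity
specificity = tn / (fp + tn)

for name, value in [("Accuracy", accuracy), ("Misclassification", misclassification),
                    ("Precision", precision), ("Recall", recall),
                    ("Specificity", specificity)]:
    print(f"{name:<18}: {value:.2f}")
```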

Let us use Python scikit-learn to generate a “confusion matrix” and calculate performance scores:

#Code snippet used for this example :

from sklearn.metrics import accuracy_score ,confusion_matrix ,precision_score , recall_score , ConfusionMatrixDisplay
import matplotlib.pyplot as plt

#Data collected by the board
Experienced_umpire = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
Chuck = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

#Plotting Confusion Matrix
fig, ax = plt.subplots(1,1,figsize=(7,4))
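# Chuck's decisions are passed first on purpose: rows then show Chuck and columns show the experienced umpire, matching the axis labels below
results = confusion_matrix(Chuck, Experienced_umpire , labels=[1,0])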
cm_display = ConfusionMatrixDisplay(results, display_labels=['Not-out','Out']).plot(values_format=".0f",ax=ax)
ax.set_xlabel("Experienced Umpire's Decision")
ax.set_ylabel("Chuck's Decision")
plt.show()

#Printing Performance Metrics
print("Accuracy Score  : ", accuracy_score(Experienced_umpire, Chuck))
print("Precision Score : ", precision_score(Experienced_umpire, Chuck))
print("Recall Score    : ", recall_score(Experienced_umpire, Chuck))

Chuck’s decision confusion matrix in Python

The scores have been calculated. What should Chuck do now? He should try to improve his performance. The 3 wrong predictions that he made are going to affect the teams very badly. He should try to increase the Accuracy. Accuracy increases when True Positives or True Negatives increase, which means that False Positives or False Negatives must be reduced.

Reducing False Positives will increase Precision and reducing False Negatives will increase Recall. Hence, Chuck needs to focus on increasing either his Recall or his Precision score.

In practice, reducing both False Positives and False Negatives might NOT be possible. In those cases, we need to analyze whether it is more important to reduce False Positives or False Negatives, i.e.

Chuck declaring a “Not-Out” as “Out” should be reduced or vice versa

In our scenario, reducing both might be important. But Chuck declaring a “Not-Out” as “Out” could have more impact. So, Chuck should focus more on reducing False Negative i.e. increasing his Recall Score.

Jargon at a glance:

True Positives (TP): Correctly predicted positive values.

True Negatives (TN): Correctly predicted negative values.

False Positives (FP): Wrongly predicted positive values.

False Negatives (FN): Wrongly predicted negative values.

Accuracy = (TP+TN) / Total

Misclassification= (FP+FN) / Total

Precision = TP / (TP+FP)

Sensitivity/Recall= TP / (TP+FN)

Specificity= TN / (FP+TN)

Here we go…. that’s the explanation for the statement -

“confusion matrix is a table that summarizes the performance of classification algorithm”.

Happy Learning!!!!!


Reusable Python Functions in my repo to quickly develop any Machine Learning Models

Gayathri Gopalsami — Tue, 18 Jan 2022 04:28:35 GMT

Build once use many


While executing any end-to-end data science project, a data science professional or student has to focus mainly on problem definition, data collection, data investigation, cleansing, statistical and visual analysis, feature engineering, decision making, and model building. In the entire lifecycle of a data science project, one step where one should not invest much time is writing code for building ML models.

This article assumes that the reader has a good understanding of ML models and how to build/implement them using the “scikit-learn (sklearn)” library in Python. It is about making our code reusable so that we can develop any model without wasting much time on coding and avoid writing repeated code.

Always try to write reusable code wherever possible

After a long and necessary process of Exploratory Data Analysis on a given dataset, we split the dataset ( X — independent variables , y — target variable ) into training and test set using train_test_split. Now we have 4 variables returned by the function — X_train , X_test , y_train and y_test.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

Note: All the feature engineering and feature selection processes should be performed before the below-given steps.

To implement any ML model I keep the below functions in my code repository which I reuse to develop any classifier or regressor model.

Code Snippet 1:

Initialize the data frames to store and compare the performance metrics of the models.

I have this below code snippet to store the model performance scores in a data frame to compare the different models after the predictions.

 # For Classifier

import pandas as pd
import numpy as np

#This dataframe stores the scores from classifier models
df_model=pd.DataFrame(columns=['Model','Accuracy Score' ,'F1 Score', 'Precision Score' , 'Recall Score' ,'ROC AUC'])
df_model_performance =df_model

#This dataframe stores the train and test accuracy from classifier models to compare at the end of the model building. This can also be further modified to compare the other scores such as F1 score etc
df_model_test_train_acc = pd.DataFrame(columns=['Model' , 'Train Accuracy Score' ,'Test Accuracy Score'])
df_model_accuracy =df_model_test_train_acc

# For Regressor

import pandas as pd
import numpy as np

#This dataframe stores the scores from regressor models
df_model=pd.DataFrame(columns=['Model', 'MAE' ,'RMSE', 'R2 Score' , 'Adjusted R2 Score'])
df_model_performance =df_model

#This data frame stores the train and test "adjusted R2 scores" from regressor models to compare at the end of the model building. This can also be further modified to compare the other score such as MSE , RMSE  etc
df_model_test_train_r2 = pd.DataFrame(columns=['Model' , 'Train Adjusted R2 Score' ,'Test Adjusted R2 Score'])
df_model_r2 =df_model_test_train_r2

Code Snippet 2 :

Function to obtain the best model by performing hyperparameter tuning using GridSearchCV .

I have defined a function “get_best_hyperparameters” which does the hyperparameter tuning using GridSearchCV by taking classifier or regressor model as input. This function returns the best model which can be used to fit and predict. This step can be skipped if one just wants to build a basic model without performing any hyperparameter tuning.

# For both Classifier and Regressor

from sklearn.model_selection import GridSearchCV 
def get_best_hyperparameters(model, params, cv_value , X_train, y_train ): 
    search = GridSearchCV(estimator=model, param_grid=params, n_jobs=-1, verbose=1,cv=cv_value) 
    search.fit(X_train, y_train)  
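    # best_score_ is the mean cross-validated score: accuracy for classifiers, R2 for regressors by default
    print("Best CV Score    :",  search.best_score_) 
    print("Best Parameters  : ", search.best_params_)
    print("Best Estimator   : ",  search.best_estimator_)  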
    best_grid = search.best_estimator_
    return best_grid
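It helps to know how much work a grid search does before passing it a large grid. GridSearchCV tries every combination of the listed values and fits each one once per cross-validation fold. The sketch below mimics only that enumeration in plain Python; expand_grid and the small grid are hypothetical names and values used for illustration, not part of scikit-learn.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one dict per combination of the values listed in param_grid."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical small grid, for illustration only
param_grid = {"C": [0.1, 1.0, 10.0], "penalty": ["l2"], "random_state": [42, 99]}
cv_value = 5

candidates = list(expand_grid(param_grid))
n_fits = len(candidates) * cv_value  # each candidate is fitted once per fold

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
for candidate in candidates[:2]:
    print(candidate)
```

The number of fits grows multiplicatively with every parameter added to the grid, so it is worth keeping the grid as small as the problem allows.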

Code Snippet 3:

Function to fit and predict the model:

These functions, get_classifier_predictions for classifiers and get_regressor_predictions for regressors, take the model as input, fit it, and return the predicted train and test results. For a classifier, the predicted train and test probabilities are returned as well.

#For Classifier

def get_classifier_predictions(classifier, X_train, y_train, X_test): 
    classifier.fit(X_train,y_train)
    y_pred_train =classifier.predict(X_train)
    y_pred_test = classifier.predict(X_test)
    y_pred_prob_train = classifier.predict_proba(X_train)
    y_pred_prob_test = classifier.predict_proba(X_test)
    return y_pred_train, y_pred_test, y_pred_prob_train,y_pred_prob_test

#For Regressor

def get_regressor_predictions(regressor, X_train, y_train, X_test):  
    regressor.fit(X_train,y_train)
    y_pred_train =regressor.predict(X_train)
    y_pred_test = regressor.predict(X_test)
    return y_pred_train, y_pred_test

Code Snippet 4:

Function to calculate and print the performance metrics of train and test dataset

The function print_classifier_scores / print_regressor_scores calculates and prints the performance metrics for a classification / regression algorithm respectively, and returns them as two dictionaries that can be appended to the comparison data frames.

# For Classifier

from sklearn.metrics import accuracy_score ,confusion_matrix ,precision_score , recall_score , f1_score, ConfusionMatrixDisplay ,roc_auc_score
import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib
%matplotlib inline

def print_classifier_scores(classifier, X_train, X_test, y_train ,y_test,y_pred_train, y_pred_test,y_pred_prob_train, y_pred_prob_test,algorithm):
    # store classifier scores for Training Dataset (binary target; column 1 holds the positive-class probability)
    v_recall_score_train =  recall_score(y_train,y_pred_train)
    v_precision_score_train = precision_score(y_train,y_pred_train)
    v_f1_score_train =  f1_score(y_train,y_pred_train)
    v_accuracy_score_train = accuracy_score(y_train,y_pred_train)
    v_roc_auc_train = roc_auc_score(y_train, y_pred_prob_train[:,1])

    # print classifier scores for Training Dataset
    print('Train-Set Confusion Matrix:\n', confusion_matrix(y_train,y_pred_train))
    print("Recall Score    : ", v_recall_score_train)
    print("Precision Score : ", v_precision_score_train)
    print("F1 Score        : ", v_f1_score_train)
    print("Accuracy Score  : ", v_accuracy_score_train)
    print("ROC AUC         :  {}".format(v_roc_auc_train))
    print("Predict Probability  :" , y_pred_prob_train)
    # plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
    # The default display labels follow the classifier's sorted classes (0 before 1).
    ConfusionMatrixDisplay.from_estimator(classifier, X_train, y_train)
    plt.grid(False)

    # store classifier scores for Testing Dataset
    v_recall_score_test =  recall_score(y_test,y_pred_test)
    v_precision_score_test = precision_score(y_test,y_pred_test)
    v_f1_score_test =  f1_score(y_test,y_pred_test)
    v_accuracy_score_test = accuracy_score(y_test,y_pred_test)
    v_roc_auc_test = roc_auc_score(y_test, y_pred_prob_test[:,1])

    # print classifier scores for Testing Dataset
    print('Test-Set Confusion Matrix:\n', confusion_matrix(y_test,y_pred_test))
    print("Recall Score    : ", v_recall_score_test)
    print("Precision Score : ", v_precision_score_test)
    print("F1 Score        : ", v_f1_score_test)
    print("Accuracy Score  : ", v_accuracy_score_test)
    print("ROC AUC         :  {}".format(v_roc_auc_test))
    print("Predict Probability  :" , y_pred_prob_test)
    ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test)
    plt.grid(False)

    # store the results as dictionaries, to be appended to the dataframes for final comparison of performance
    df_model_test_train_acc = dict({'Model' : algorithm, 'Train Accuracy Score' :v_accuracy_score_train,'Test Accuracy Score' :v_accuracy_score_test })
    df_model_performance = dict({'Model' : algorithm, 'Accuracy Score' :v_accuracy_score_test, 'F1 Score' : v_f1_score_test, 'Precision Score' : v_precision_score_test, 'Recall Score' :v_recall_score_test, 'ROC AUC' : v_roc_auc_test})

    return df_model_test_train_acc , df_model_performance

# For regressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def print_regressor_scores(regressor, X_train, X_test, y_train ,y_test,y_pred_train, y_pred_test,algorithm):
    
    # store regressor scores for Training Dataset
    MAE_train = mean_absolute_error(y_train, y_pred_train)
    RMSE_train = np.sqrt( mean_squared_error(y_train, y_pred_train))
    r2_score_train = r2_score(y_train, y_pred_train)
    # Calculating Adjusted R2 for training set
    SS_Residual_train = sum((y_train-y_pred_train)**2)
    SS_Total_train = sum((y_train-np.mean(y_train))**2)
    r_squared_train = 1 - (float(SS_Residual_train))/SS_Total_train
    adj_r_sq_train = 1 - (1-r_squared_train)*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1)
    
    # print regressor scores for Training Dataset
    print('MAE for training set is {}'.format(MAE_train))
    print('RMSE for training set is {}'.format(RMSE_train))
    print('R squared score for training set is {}'.format(r2_score_train))
    print('Adjusted R squared score for training set is {}'.format(adj_r_sq_train))
    
    # store regressor scores for Test Dataset
    MAE_test = mean_absolute_error(y_test, y_pred_test)
    RMSE_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    r2_score_test = r2_score(y_test, y_pred_test)
    # Calculating Adjusted R2 for test set
    SS_Residual_test = sum((y_test-y_pred_test)**2)
    SS_Total_test = sum((y_test-np.mean(y_test))**2)
    r_squared_test = 1 - (float(SS_Residual_test))/SS_Total_test
    adj_r_sq_test = 1 - (1-r_squared_test)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
    
    # print regressor scores for Test Dataset 
    print('MAE for test set is {}'.format(MAE_test))
    print('RMSE for test set is {}'.format(RMSE_test))
    print('R squared score for test set is {}'.format(r2_score_test))
    print('Adjusted R squared score for testing set is {}'.format(adj_r_sq_test))
    
    # store to append the results in dataframe for final comparison of performance
    df_model_test_train_r2= dict({'Model' : algorithm, 'Train Adjusted R2 Score' :adj_r_sq_train,'Test Adjusted R2 Score' :adj_r_sq_test })
    df_model_performance = dict({'Model' : algorithm, 'MAE' : MAE_test, 'RMSE' : RMSE_test, 'R2 Score' : r2_score_test, 'Adjusted R2 Score' :adj_r_sq_test})
    return df_model_test_train_r2 , df_model_performance
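The adjusted R2 calculation inside print_regressor_scores can be read more easily as a small standalone function. The sketch below uses the same formula as the code above, 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of samples and p the number of features; the function name adjusted_r2 and the example numbers are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: penalizes R2 for the number of features used by the model."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator

# Hypothetical example: the same R2 looks worse as more features are used
few_features = adjusted_r2(0.80, n_samples=100, n_features=5)
many_features = adjusted_r2(0.80, n_samples=100, n_features=50)
print(round(few_features, 4), round(many_features, 4))
```

This is why the adjusted score is the one stored for the train/test comparison: it does not reward a model simply for having more input columns.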

There it is!

Now I can develop any ML model and I can do the prediction, calculate the scores and compare the model performance by just giving the right model and parameters to the above functions.

Classifier Example :

Below example shows how to use these functions to build a Logistic Regression Model(GitHub link here):

1. Set up the parameters for hyperparameter tuning and pass the initialized model to the function get_best_hyperparameters to obtain the best grid. This step is optional; an empty parameter dictionary ({}) can also be passed, in which case the model is cross-validated with the settings it was initialized with.

from sklearn.linear_model import LogisticRegression
logreg_params = {'penalty' : ['l2'],
                 'C' : np.logspace(-1, 2, 100),
                 'solver' :['liblinear'],
                 'random_state' :[42,99]
                 }
lr_best_grid= get_best_hyperparameters(LogisticRegression(), logreg_params, 5, X_train, y_train)

2. Pass the best model to function get_classifier_predictions to get the predicted results and probability.

y_pred_train, y_pred_test, y_pred_prob_train, y_pred_prob_test = get_classifier_predictions(lr_best_grid, X_train, y_train, X_test )

3. Input the predicted results to function print_classifier_scores to calculate the performance metric scores and print the results.

df_model_test_train_acc1, df_model_performance1=print_classifier_scores(lr_best_grid, X_train, X_test, y_train , y_test, y_pred_train, y_pred_test, y_pred_prob_train, y_pred_prob_test , 'Logistic Regression')

4. Append the results to the dataframe to compare all the built model performance

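# DataFrame.append was removed in pandas 2.0, so the result dictionaries are added as one-row frames with pd.concat
df_model = pd.concat([df_model, pd.DataFrame([df_model_performance1])], ignore_index=True)
df_model_test_train_acc = pd.concat([df_model_test_train_acc, pd.DataFrame([df_model_test_train_acc1])], ignore_index=True)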

Regressor Example:

Below example shows how to use these functions to build a Linear Regression Model(GitHub link here):

1. Set up the parameters for hyperparameter tuning and pass the initialized model to the function get_best_hyperparameters to obtain the best grid. This step is optional; an empty parameter dictionary ({}) can also be passed, in which case the model is cross-validated with the settings it was initialized with.

from sklearn.linear_model import LinearRegression
parameters = {'fit_intercept':[True,False],  'copy_X':[True, False]}
lr_best_grid= get_best_hyperparameters(LinearRegression(), parameters, 5, X_train, y_train)

2. Pass the best model to function get_regressor_predictions to get the predicted results.

y_pred_train, y_pred_test = get_regressor_predictions(lr_best_grid, X_train, y_train, X_test )

3. Input the predicted results to function print_regressor_scores to calculate the performance metric scores and print the results.

df_model_test_train_r2_1, df_model_performance1=print_regressor_scores(lr_best_grid, X_train, X_test, y_train , y_test, y_pred_train, y_pred_test , 'Linear Regression')

4. Append the results to the dataframe to compare all the built model performance

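# DataFrame.append was removed in pandas 2.0, so the result dictionaries are added as one-row frames with pd.concat
df_model = pd.concat([df_model, pd.DataFrame([df_model_performance1])], ignore_index=True)
df_model_r2 = pd.concat([df_model_r2, pd.DataFrame([df_model_test_train_r2_1])], ignore_index=True)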

Now we can use these functions to develop any model by passing the model specific parameters as done for Linear Regression and Logistic Regression in the example above .

Yay….Happy model building !!!


Six Basic Features about any ML Algorithm a Data Scientist should definitely know ..

Gayathri Gopalsami — Tue, 11 Jan 2022 09:32:02 GMT


Any algorithm, let alone an ML algorithm, has its own purpose and specialized features. That’s what makes each algorithm unique!

An improperly implemented ML algorithm on a dataset can lead to disaster no matter how advanced and powerful it may be.

The result of “blind coding” is always disappointing. A proper, in-depth understanding of any ML algorithm is very much necessary. Even when we lack that depth, knowing these basics before implementing will save our day…

1. Know the “Assumptions”

Let’s take a simple example of “Linear Regression” which assumes that the variables have linear relationships. If this basic assumption is not true about your dataset, then the algorithm might fail.

An ML algorithm might or might not be based on assumptions. It is very important for us to know and understand them properly. Verify all the assumptions before implementing an algorithm on the dataset. If the required conditions are not met, your algorithm just might not work!!
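As a quick illustration of checking the linearity assumption, the sketch below computes the Pearson correlation by hand for two made-up datasets: one where y depends linearly on x and one where it depends on x squared. The data and the helper pearson_r are hypothetical, and a correlation value is only one rough check, not a full verification of the assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: strength of the LINEAR relationship."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = list(range(-10, 11))
linear_ys = [3 * x + 2 for x in xs]   # satisfies the linearity assumption
quadratic_ys = [x ** 2 for x in xs]   # strong relationship, but not linear

r_linear = pearson_r(xs, linear_ys)
r_quadratic = pearson_r(xs, quadratic_ys)
print(f"linear data:    r = {r_linear:.2f}")
print(f"quadratic data: r = {r_quadratic:.2f}")
```

In the second dataset y is completely determined by x, yet the linear correlation is essentially zero, so a straight-line model would explain almost nothing.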

2. Know the “Pros and Cons”

One of the advantages of “Support Vector Machines” is that they work really well on high-dimensional data. If your dataset has more dimensions than samples, then SVM might be your answer.

Every algorithm has its pros and cons. This should be one of the basic factors in deciding when to use an algorithm and when not to. Know all of them before you start programming with any ML algorithm on your dataset.

3. Know if “Missing Data Handling” is required:

“KNN — K-Nearest Neighbors” algorithm can’t work when data is missing. For this kind of algorithm, data needs to be manually imputed to make it work.

Data is never clean. Handling missing data is a very important step of the “EDA” process before building ML models and is always recommended. Some algorithms take care of missing values themselves. Whether missing data needs to be handled or not should be an informed decision based on the ML algorithm we are going to use.
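One simple form of imputation is to fill each gap with the mean of the observed values. The sketch below shows that idea in plain Python on a made-up column; impute_mean is a hypothetical helper, and mean imputation is only one of several possible strategies.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill_value = statistics.fmean(observed)
    return [fill_value if v is None else v for v in values]

ages = [25, None, 31, 40, None]  # hypothetical feature column with gaps
imputed_ages = impute_mean(ages)
print(imputed_ages)
```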

4. Know if “Feature Scaling” is required:

There is no need for scaling or normalizing data before building your model with “XG-Boost”

Not all algorithms require feature scaling. Distance-based algorithms, which are affected by the range of the features, do require it. Identify the algorithms that require feature scaling and do it only for those.
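To see why distance-based algorithms care about scale, consider a made-up example with two features, age and salary. Without scaling, the salary column dominates the Euclidean distance simply because its numbers are larger. The data and the helpers min_max_scale and nearest_to_first below are hypothetical, and min-max scaling is just one possible scaling method.

```python
import math

def min_max_scale(column):
    """Rescale a column to the range 0..1."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean distance) to rows[0], excluding itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Hypothetical people: person 1 is much older, person 2 is of similar age
ages = [25, 60, 27, 45]
salaries = [50_000, 52_000, 60_000, 100_000]

raw_rows = list(zip(ages, salaries))
scaled_rows = list(zip(min_max_scale(ages), min_max_scale(salaries)))

raw_neighbor = nearest_to_first(raw_rows)
scaled_neighbor = nearest_to_first(scaled_rows)
print("Nearest neighbor of person 0 without scaling:", raw_neighbor)
print("Nearest neighbor of person 0 with scaling:   ", scaled_neighbor)
```

Before scaling, the 35-year age gap to person 1 is invisible next to a salary difference of 2,000; after scaling, both features contribute on a comparable footing and the nearest neighbor changes.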

5. Know “Outliers” impact:

Tree-based algorithms like “Random Forest” are robust to outliers. There is no need to handle outliers for building models by implementing such algorithms.

Outliers can mislead the algorithms which can affect the performance and can lead to a poor result. Understanding the impact of outliers on the ML algorithm is a must.
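A tiny made-up example shows how a single outlier can pull a statistic away from the bulk of the data. The mean, which many algorithms rely on directly or through squared errors, shifts a lot, while the median barely moves. The numbers below are hypothetical.

```python
import statistics

salaries = [42, 45, 47, 50, 52]   # hypothetical salaries, in thousands
with_outlier = salaries + [900]   # one extreme value added

mean_before = statistics.fmean(salaries)
mean_after = statistics.fmean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```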

6. Know the type of “Problem Statement” the ML algorithm can solve

For problems such as Sentiment Analysis, Text Classification, “Naïve Bayes” tends to be the solution.

Most algorithms are built with a specific purpose. Before implementing, understand the problem statement and be thorough with the dataset. Handpick the algorithm that specializes in solving that kind of problem statement and start implementing it.
