Artificial Intelligence and Machine Learning: Supervised Learning and Ensemble Techniques
Machine Learning is the field of computer science that gives computer systems the ability to “learn” from data without being explicitly programmed.
Machine Learning is a subset of Artificial Intelligence, and Deep Learning is a subset of Machine Learning.
There are three types of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
Supervised Learning starts with a training data set containing actual labels. The training set is a combination of independent (predictor) variables and a dependent (target) variable. For example, consider a banking problem where the business has to predict whether a customer will subscribe to a term deposit. A supervised learning algorithm trains on millions of customer records with the actual target variable (label) indicating term-deposit subscription, and then predicts the target value for new records it has never seen before.
Supervised machine learning problems are of two types: Regression and Classification.
We have different supervised algorithms for these problems, such as Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM).
Ensemble learning is an ML technique where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or more robust models. Decision Trees, Bagging, Random Forest, AdaBoost, Gradient Boosting, XGBoost, and Stacking are the most common algorithms used in ensemble techniques.
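As a minimal sketch of the “combine weak learners” idea, the snippet below bags shallow decision trees on a synthetic dataset (the dataset, depth, and number of estimators are illustrative assumptions, not from the case study):

```python
# Sketch: a single shallow tree is a "weak learner"; bagging trains many
# such trees on bootstrap samples and votes. Synthetic data is used here
# purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One weak learner: a depth-2 tree
weak = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)

# An ensemble of 50 such trees, combined by voting
bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2), n_estimators=50, random_state=42
).fit(X_train, y_train)

print("single tree:", accuracy_score(y_test, weak.predict(X_test)))
print("bagged trees:", accuracy_score(y_test, bag.predict(X_test)))
```

The ensemble typically scores at least as well as any single weak learner, which is exactly the hypothesis stated above.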
Data Science using Python
Context: Leveraged customer information from bank marketing campaigns to predict whether a customer will subscribe to a term deposit or not. Different classification algorithms (Logistic Regression, KNN, Naive Bayes, SVM) were used, and ensemble techniques (Random Forest, Gradient Boosting) were used to further improve the classification results.
You can download the bank marketing data set from the following link,
I have done a case study on the bank data set in a Jupyter Notebook and added insights into the patterns hidden in the data. The goal of this project is a classification model. The data science steps involved are: data pre-processing, Exploratory Data Analysis (EDA), data preparation, splitting and seeding the data set into training and test sets, building the models (supervised and ensemble), training the models and making predictions on the test set, tuning the models with various hyperparameters for better predictions, and cross-validating all models to declare the best-performing algorithm in terms of recall, mean, standard deviation, and accuracy.

- Import the necessary Python libraries for machine learning (numpy, pandas, seaborn, matplotlib, scipy, sklearn).
- Read the bank dataset, load the data into a pandas DataFrame, and check the top 50 records.
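The loading step can be sketched as below. The filename and semicolon separator follow the usual UCI bank-marketing layout and are assumptions; for illustration, a tiny stand-in frame with bank-like columns replaces the real file:

```python
import pandas as pd

# Real dataset (filename/separator assumed from the UCI layout):
# df = pd.read_csv("bank-full.csv", sep=";")

# Tiny stand-in frame with the same kind of columns, for illustration only
df = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["admin.", "technician", "retired"],
    "balance": [1200, 300, 5400],
    "y": ["no", "no", "yes"],   # target: term-deposit subscription
})
print(df.head(50))  # inspect the top records
```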

- In Exploratory Data Analysis, we check the shape of the data, feature importance (correlation), the presence of missing values, the distribution of the data, the presence of outliers using box plots (data skewness), and the five-point summary.
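These EDA checks map onto a few pandas one-liners; the small frame below is a stand-in, not the real bank data:

```python
import pandas as pd

# Stand-in frame with one missing value, for illustration
df = pd.DataFrame({
    "age": [30, 45, 52, 29, 61],
    "balance": [1200, 300, 5400, None, 800],
})

print(df.shape)            # rows x columns
print(df.isnull().sum())   # missing values per column
print(df.describe())       # five-point summary (min, 25%, 50%, 75%, max)
print(df.corr())           # pairwise correlation (feature importance hint)
# df.boxplot(column="balance")  # box plot to eyeball outliers / skewness
```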
- Data preparation involves converting object columns to categorical columns, replacing unknown values, and one-hot encoding the categorical columns.
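A sketch of those three steps follows. Replacing "unknown" with the most frequent category is one common choice and an assumption here, not necessarily what the original notebook did:

```python
import pandas as pd

df = pd.DataFrame({"job": ["admin.", "unknown", "retired"],
                   "age": [30, 45, 52]})

# Replace "unknown" with the most frequent known category (mode imputation)
mode = df.loc[df["job"] != "unknown", "job"].mode()[0]
df["job"] = df["job"].replace("unknown", mode)

# Convert the object column to categorical, then one-hot encode it
df["job"] = df["job"].astype("category")
encoded = pd.get_dummies(df, columns=["job"])
print(encoded.columns.tolist())
```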



- Using the z-score, we can also check for the presence of outliers.

- We have to ensure outliers are removed from the data; otherwise, a model deployed into production with outliers may produce erroneous predictions and misclassifications and will not perform well. Below is one way to handle outliers.
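The z-score approach can be sketched as follows: keep only rows whose numeric z-scores all fall within ±3 (the ±3 threshold is a common convention, assumed here):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in column: 19 typical balances and one extreme value
df = pd.DataFrame({"balance": [100] * 19 + [10000]})

z = np.abs(stats.zscore(df))         # |z-score| per numeric cell
filtered = df[(z < 3).all(axis=1)]   # drop rows with any |z| >= 3
print(len(df), "->", len(filtered))  # the extreme row is removed
```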

- Now I split the filtered dataset into training and test sets in a 70:30 ratio.
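The split is one call to sklearn; `random_state` pins the shuffle so the split is reproducible (the toy arrays stand in for the prepared bank features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # stand-in feature matrix
y = np.array([0, 1] * 25)           # stand-in labels

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```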

- Train the supervised classification algorithms and ensemble methods on the training data, then predict values for the test data.
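A sketch of this fit-and-predict loop over a few of the algorithms named above, on synthetic data (the model choices and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)              # train on the training split
    predictions[name] = model.predict(X_test)  # predict the test split
    print(name, model.score(X_test, y_test))
```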



- A confusion matrix is a table of correct and incorrect predictions, summarized as counts broken down by class. For binary classification it is a 2 x 2 table of TP, FP, FN, and TN.
- TN / True Negative: when a case was negative and predicted negative.
- TP / True Positive: when a case was positive and predicted positive.
- FN / False Negative: when a case was positive but predicted negative.
- FP / False Positive: when a case was negative but predicted positive.
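These four cells can be read straight off sklearn's `confusion_matrix`, which lays the binary table out as [[TN, FP], [FN, TP]] (the toy labels below are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into the four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```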

- The classification report measures the quality of predictions from a classification algorithm: how many predictions are correct and how many are not.
- Precision: accuracy of positive predictions.
  Precision = TP / (TP + FP)
- Recall: fraction of positives that were correctly identified.
  Recall = TP / (TP + FN)
- F1 score: a weighted harmonic mean of precision and recall, where the best score is 1.0 and the worst is 0.0.
  F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
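The formulas can be checked by hand against sklearn's metric functions on a toy example (TP = 2, FP = 1, FN = 1 for these labels):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Counts from this toy example: TP = 2, FP = 1, FN = 1
precision = 2 / (2 + 1)                               # TP / (TP + FP)
recall = 2 / (2 + 1)                                  # TP / (TP + FN)
f1 = 2 * (recall * precision) / (recall + precision)  # harmonic mean

print(precision, recall, f1)
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```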


- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. A single parameter, k, refers to the number of groups the data sample is split into; hence the procedure is often called k-fold cross-validation.
- The final step is to cross-validate all the machine learning models on accuracy and recall and draw insights about model performance on the data.
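This step can be sketched with `cross_val_score`, scoring each fold on both metrics (k = 10 and the stratified splitter are common choices, assumed here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Per-fold scores for one model, on accuracy and on recall
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=kfold, scoring="accuracy")
rec = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=kfold, scoring="recall")
print("accuracy:", acc.mean(), "+/-", acc.std())
print("recall:  ", rec.mean(), "+/-", rec.std())
```

Repeating this loop for every trained model yields the mean and standard deviation figures the comparison below is based on.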

- Algorithm comparison based on “accuracy”



- Algorithm comparison based on “recall”
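Both comparisons above are typically drawn as box plots of the per-fold cross-validation scores; the sketch below uses two models and 5 folds on synthetic data (all illustrative assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

results, names = [], []
for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("GBM", GradientBoostingClassifier(random_state=0))]:
    results.append(cross_val_score(model, X, y, cv=5, scoring="accuracy"))
    names.append(name)

# One box per algorithm: the spread of its per-fold scores
fig, ax = plt.subplots()
ax.boxplot(results)
ax.set_xticklabels(names)
ax.set_title("Algorithm comparison (accuracy)")
fig.savefig("comparison.png")
```

Swapping `scoring="accuracy"` for `scoring="recall"` produces the recall-based comparison.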



From the above insights, Gradient Boosting, an ensemble technique, performs best for this classification problem in terms of both accuracy and recall.