Machine Learning Classification Project — Part 4: Creating Dummy Variables, Splitting Data into Train & Test Sets, and Handling Imbalanced Data

Roi Polanitzer
5 min read · Mar 19, 2022


Photo Credit: dataaspirant.com

5. Creating Dummy Variables

Create dummy variables for four categorical variables.

df_loan_dummies = pd.get_dummies(df_loan, columns=['LOAN_TERM', 'LOAN_PURPOSE', 'EMPLOYMENT_LENGTH', 'HOUSING'])
df_loan_dummies.head().T
df_loan_dummies.to_csv("df_loan_dummies.csv", index=False)
df_loan_dummies.columns

Index(['LOAN_AMOUNT', 'INTEREST_RATE', 'MONTHLY_PAYMENT', 'ANNUAL_INCOME', 'DEBT_TO_INCOME', 'DEFAULT', 'LOAN_TERM_ 36 months', 'LOAN_TERM_ 60 months', 'LOAN_PURPOSE_car', 'LOAN_PURPOSE_credit_card', 'LOAN_PURPOSE_debt_consolidation', 'LOAN_PURPOSE_educational', 'LOAN_PURPOSE_home_improvement', 'LOAN_PURPOSE_house', 'LOAN_PURPOSE_major_purchase', 'LOAN_PURPOSE_medical', 'LOAN_PURPOSE_moving', 'LOAN_PURPOSE_other', 'LOAN_PURPOSE_renewable_energy', 'LOAN_PURPOSE_small_business', 'LOAN_PURPOSE_vacation', 'LOAN_PURPOSE_wedding', 'EMPLOYMENT_LENGTH_1-2 Years', 'EMPLOYMENT_LENGTH_3-4 Years', 'EMPLOYMENT_LENGTH_5-6 Years', 'EMPLOYMENT_LENGTH_7-8 Years', 'EMPLOYMENT_LENGTH_9-10 Years', 'EMPLOYMENT_LENGTH_< 1 year', 'EMPLOYMENT_LENGTH_>10 Years', 'HOUSING_no', 'HOUSING_yes'], dtype='object')
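As a quick illustration of what get_dummies does here, a minimal sketch on a toy frame (the toy values are hypothetical stand-ins for df_loan):

```python
import pandas as pd

# Toy stand-in for df_loan with one categorical column
toy = pd.DataFrame({"LOAN_AMOUNT": [5000, 12000, 8000],
                    "HOUSING": ["yes", "no", "yes"]})

# Each category becomes its own 0/1 indicator column,
# appended after the untouched numeric columns
toy_dummies = pd.get_dummies(toy, columns=["HOUSING"])
print(list(toy_dummies.columns))  # ['LOAN_AMOUNT', 'HOUSING_no', 'HOUSING_yes']
```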

7. Splitting Data into Train & Test Sets

from sklearn.model_selection import train_test_split

X = df_loan_dummies.drop("DEFAULT", axis=1)
y = df_loan_dummies["DEFAULT"]
y = y.astype('int')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=47, stratify=y)
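The stratify=y argument is what keeps the class ratio intact in both splits. A small self-contained check, using a synthetic label vector rather than the loan data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 90:10 label vector standing in for the DEFAULT column
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=47, stratify=y_demo)

# Both splits keep roughly a 10% minority share
print(y_tr.mean(), y_te.mean())
```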

8. Handling Imbalanced Data

8.1. Checking the balance

df_loan_dummies["DEFAULT"].value_counts()
sns.countplot(x="DEFAULT",data=df_loan_dummies)
plt.show()
count_no_default = len(df_loan_dummies[df_loan_dummies['DEFAULT']==0])
count_default = len(df_loan_dummies[df_loan_dummies['DEFAULT']==1])
pct_of_no_default = count_no_default/(count_no_default+count_default)
print("\033[1m percentage of no default is", pct_of_no_default*100)
pct_of_default = count_default/(count_no_default+count_default)
print("\033[1m percentage of default", pct_of_default*100)

percentage of no default is 89.77159086363454

percentage of default 10.228409136365455

Our classes are imbalanced, and the ratio of no-default to default instances is 90:10.
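The same percentages can also be read off in one line with value_counts(normalize=True), shown here on a toy Series with the same 90:10 split:

```python
import pandas as pd

# Toy DEFAULT column with a 90:10 class split
default = pd.Series([0] * 9 + [1] * 1, name="DEFAULT")

# normalize=True returns class proportions instead of raw counts
print(default.value_counts(normalize=True))
```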

8.2. Over-sampling using SMOTE

With our training data created, I’ll up-sample the default class using the SMOTE algorithm (Synthetic Minority Oversampling Technique). At a high level, SMOTE:

  1. Creates synthetic samples from the minority class (the default class) instead of simply copying existing ones.
  2. Randomly chooses one of a sample’s k nearest neighbors and uses it to create a similar, but randomly tweaked, new observation.
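The interpolation in step 2 can be sketched from first principles. This is a toy illustration of the idea only, not the imbalanced-learn implementation:

```python
import numpy as np

rng = np.random.default_rng(47)

def smote_point(x, neighbors):
    """Synthesize one sample: pick a random neighbor of x and take a
    random point on the line segment between x and that neighbor."""
    nb = neighbors[rng.integers(len(neighbors))]
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (nb - x)

# One minority-class point and its (here, single) nearest neighbor
x = np.array([1.0, 1.0])
neighbors = np.array([[2.0, 2.0]])
synthetic = smote_point(x, neighbors)

# The synthetic point lies between the two originals
print(synthetic)
```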

We are going to implement SMOTE in Python.

from imblearn.over_sampling import SMOTE

os = SMOTE(random_state=47)
columns1 = X.columns

# fit_resample (named fit_sample in older imbalanced-learn releases) generates the synthetic samples
os_data_X, os_data_y = os.fit_resample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X, columns=columns1)
os_data_y = pd.DataFrame(data=os_data_y, columns=['DEFAULT'])

# we can check the numbers of our data
print("length of oversampled data is ", len(os_data_X))
print("Number of no default in oversampled data", len(os_data_y[os_data_y['DEFAULT']==0]))
print("Number of default", len(os_data_y[os_data_y['DEFAULT']==1]))
print("Proportion of no default data in oversampled data is ", len(os_data_y[os_data_y['DEFAULT']==0])/len(os_data_X))
print("Proportion of default data in oversampled data is ", len(os_data_y[os_data_y['DEFAULT']==1])/len(os_data_X))

cols = columns1
X_train = os_data_X[cols]
y_train=os_data_y['DEFAULT']
X_train.columns

Index(['LOAN_AMOUNT', 'INTEREST_RATE', 'MONTHLY_PAYMENT', 'ANNUAL_INCOME', 'DEBT_TO_INCOME', 'LOAN_TERM_ 36 months', 'LOAN_TERM_ 60 months', 'LOAN_PURPOSE_car', 'LOAN_PURPOSE_credit_card', 'LOAN_PURPOSE_debt_consolidation', 'LOAN_PURPOSE_educational', 'LOAN_PURPOSE_home_improvement', 'LOAN_PURPOSE_house', 'LOAN_PURPOSE_major_purchase', 'LOAN_PURPOSE_medical', 'LOAN_PURPOSE_moving', 'LOAN_PURPOSE_other', 'LOAN_PURPOSE_renewable_energy', 'LOAN_PURPOSE_small_business', 'LOAN_PURPOSE_vacation', 'LOAN_PURPOSE_wedding', 'EMPLOYMENT_LENGTH_1-2 Years', 'EMPLOYMENT_LENGTH_3-4 Years', 'EMPLOYMENT_LENGTH_5-6 Years', 'EMPLOYMENT_LENGTH_7-8 Years', 'EMPLOYMENT_LENGTH_9-10 Years', 'EMPLOYMENT_LENGTH_< 1 year', 'EMPLOYMENT_LENGTH_>10 Years', 'HOUSING_no', 'HOUSING_yes'], dtype='object')

About the Author

Roi Polanitzer, PDS, ADL, MLS, PDA, CPD

Roi Polanitzer, PDS, ADL, MLS, PDA, CPD, F.IL.A.V.F.A., FRM, is a data scientist with extensive experience in solving machine learning problems, such as: regression, classification, clustering, recommender systems, anomaly detection, text analytics & NLP, and image processing. Mr. Polanitzer is the Owner and Chief Data Scientist of Prediction Consultants — Advanced Analysis and Model Development, a data science firm headquartered in Rishon LeZion, Israel. He is also the Owner and Chief Appraiser of Intrinsic Value — Independent Business Appraisers, a business valuation firm that specializes in corporates, intangible assets and complex financial instruments valuation.

Over more than 16 years, he has performed data science projects such as: regression (e.g., house prices, CLV — customer lifetime value, and time-to-failure), classification (e.g., market targeting, customer churn), probability (e.g., spam filters, employee churn, fraud detection, loan default, and disease diagnostics), clustering (e.g., customer segmentation, and topic modeling), dimensionality reduction (e.g., p-values, itertools combinations, principal components analysis, and autoencoders), recommender systems (e.g., products for a customer, and advertisements for a surfer), anomaly detection (e.g., supermarkets’ revenue and profits), text analytics (e.g., identifying market trends, web searches), NLP (e.g., sentiment analysis, cosine similarity, and text classification), image processing (e.g., image binary classification of dogs vs. cats, and image multiclass classification of digits in sign language), and signal processing (e.g., audio binary classification of males vs. females, and audio multiclass classification of urban sounds).

Mr. Polanitzer holds various professional designations, such as a global designation called “Financial Risk Manager” (FRM, which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as K-Means, SVM and KNN for credit risk measurement and management) from the Global Association of Risk Professionals (GARP), a designation called “Fellow Actuary” (F.IL.A.V.F.A., which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as GLM, RF and NN for determining premiums in general insurance) from the Israel Association of Valuators and Financial Actuaries (IAVFA), and a designation called “Certified Risk Manager” (CRM, which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as DT, NB and PCA for operational risk management) from the Israeli Association of Risk Managers (IARM).

Mr. Polanitzer studied actuarial science (i.e., implementation of statistical and data mining techniques for solving time-series analysis, dimensionality reduction, optimization and simulation problems) at the prestigious 250-hour training program of the University of Haifa, financial risk management (i.e., building statistical predictive and probabilistic models for solving regression, classification, clustering and anomaly detection problems) at the prestigious 250-hour training program of Ariel University, and machine learning and deep learning (i.e., building recommender systems and training neural networks for image processing and NLP) at the prestigious 500-hour training program of the John Bryce College.

He graduated from various professional trainings at the John Bryce College, such as: “Introduction to Machine Learning, AI & Data Visualization for Managers and Architects”, “Professional training in Practical Machine Learning, AI & Deep Learning with Python for Algorithm Developers & Data Scientists”, “Azure Data Fundamentals: Relational Data, Non-Relational Data and Modern Data Warehouse Analytics in Azure”, and “Azure AI Fundamentals: Azure Tools for ML, Automated ML & Visual Tools for ML and Deep Learning”.

Mr. Polanitzer also graduated from various professional trainings at the Professional Data Scientists’ Israel Association, such as: “Neural Networks and Deep Learning”, “Big Data and Cloud Services”, and “Natural Language Processing and Text Mining”.


Roi Polanitzer

Chief Data Scientist at Prediction Consultants — Advanced Analysis and Model Development. https://polanitz8.wixsite.com/prediction/english