Make It So: A Simple Step-by-Step Intro to Logistic Regression with Captain Picard
Data Science, the final frontier. Join me on my continuing mission to explore strange new datasets and boldly simplify complex concepts. Today’s mission: determine the odds of one’s favorite Starfleet captain being the legendary Jean-Luc Picard. Lay in a course for Logistic Regression. Engage!
Setting Your Target
There is no neutral zone. We have identified two types of beings: those who think Picard is the finest leader Starfleet has ever known and those who don’t. So we create a target column with two values: ‘1’ for Captain Picard, and ‘0’ for Not Captain Picard.
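As a minimal sketch in pandas (these records and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical survey responses.
df = pd.DataFrame({
    "name": ["Will", "Deanna", "Worf", "Q"],
    "favorite_captain": ["Picard", "Picard", "Sisko", "Picard"],
})

# Binary target: 1 for Captain Picard, 0 for Not Captain Picard.
df["is_picard"] = (df["favorite_captain"] == "Picard").astype(int)
```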
But what’s the meaning of this voyage? Do we want to limit false positives? In other words, avoid predicting that someone’s favorite is Captain Picard when in fact it is not? (Perhaps our client is attending a Star Trek convention and doesn’t want to appear basic by picking everyone’s fave.)
Or are they more concerned with false negatives wherein we predict a person does not love Captain Picard when in fact they do? (Perhaps our client is in charge of driving tune-in for Picard and doesn’t want to miss out on potential audience growth.)
The answers to these questions will shape our mission. Let’s see what’s out there!
Exploratory Data Analysis
Exploring means taking our time with the dataset to learn all we can about the data and what significance they might hold. Since we are no strangers to the final frontier, we have the industry knowledge necessary to make informed decisions. Patterns and correlations can be identified through things like charts, value counts, and packages like dabl (which analyzes target-feature relationships).
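A quick first look, assuming the toy df from above and that the dabl package is installed, might be:

```python
import dabl

# Class balance of the target.
print(df["is_picard"].value_counts(normalize=True))

# dabl plots each feature against the target for a fast visual survey.
dabl.plot(df, target_col="is_picard")
```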
Data Cleaning
Lucky for us our data universe is as pristine as the brand new USS Enterprise-G. Usually this is not the case. (To prepare for your next mission: Ultimate Guide to Data Cleaning.)
Creating a Simple Baseline Model
The simplest baseline model is one that predicts the most popular target (aka the Majority Class) 100% of the time. In this completely fictional example, it turns out 80% of the people in our dataset chose Captain Picard.
Therefore, a baseline model that chooses Captain Picard every time will be 80% accurate. 80% might not sound terrible, but if our client wants to avoid socially awkward moments at the next convention they would hardly be pleased with being told to just assume everyone loves Jean-Luc.
In that case, precision would be a more suitable metric than accuracy for evaluating our model. (Our precision score will tell us what percentage of positive predictions were actually positive by dividing the number of true positives by the number of true and false positives combined. More on metrics — precision, recall, accuracy, f1 — and how to use them here.)
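A minimal sketch of that baseline with sklearn’s DummyClassifier, assuming the train/test split covered in the sections below:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score

# Majority-class baseline: always predict the most frequent class (Picard).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))   # ~0.80 in our fictional data
print("Precision:", precision_score(y_test, y_pred))  # TP / (TP + FP)
```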
Selecting Your Features
The data universe is vast and contains far more information than we need for our model. Through exploring the dataset we have begun to narrow down features of interest to us.
For our first model iteration, let’s say we want to look at the following demographics and behaviors: age, how many Star Trek series they have watched, which series they have watched, and how many cups of tea, earl grey, hot they have consumed.
NOTE: You must do a train/test split before imputing missing values or balancing the data; otherwise information from the test set leaks into your training process and inflates your results. Think of this as a Prime Directive with a zero tolerance policy.
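A minimal sketch with sklearn (the feature column names here are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split

X = df[["age", "num_series_watched", "series_watched", "cups_earl_grey_hot"]]
y = df["is_picard"]

# Split FIRST, so imputing and balancing never see the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1701  # NCC-1701, naturally
)
```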
Handling Missing Data
Predictive models won’t run if we have missing data, so what can we do with incomplete records?
Drop it like it’s hot: Easier said than done in some cases, but in our example, if a record is missing the favorite captain field, we can safely drop it.
I like to impute, impute it: Let’s say we really want to use age as a feature (aka independent variable) in our model, but some records we don’t want to drop are missing that information. We can use something like MICE or KNN imputation to fill in ages for those records based on the records that do have age filled in, as sketched below.
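A sketch using sklearn’s KNNImputer (sklearn also offers an experimental IterativeImputer for MICE-style imputation). This assumes the features are all numeric, so the categorical series column would need to be encoded first (see Encoding below):

```python
from sklearn.impute import KNNImputer

# Fill missing ages from the 5 most similar complete records.
# Fit on the training split only so the test set never informs the imputation.
imputer = KNNImputer(n_neighbors=5)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
```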
Scaling the Data
Scaling is particularly important when you’re dealing with features that have different ranges and units. Consider the ranges of our chosen features:
- Age: This could be any value from (let’s say) 10 to 100.
- How many Star Trek series they have watched: This could be anywhere from 0 to 12.
- How many cups of tea, earl grey, hot they have consumed: This could easily reach into the hundreds, if not thousands. (amiright?)
In the unscaled version of this dataset, the feature with the largest range (in this case, cups of earl grey) would dominate the others in most machine learning algorithms. This could make our model largely dependent on this feature.
Standardization, a common form of scaling, solves this issue by transforming all features to have a mean of 0 and a standard deviation of 1. That way no feature dominates the others, and the model can learn from all features equally.
This is particularly important in distance-based algorithms like K-Nearest Neighbors (KNN) or Support Vector Machines (SVM), where the distance between data points is a crucial part of the algorithm. Left unscaled, the wide range in earl grey consumption would dominate the distance between points.
Scaling is also advantageous for Gradient Descent based algorithms like Linear Regression, Logistic Regression, Neural Networks etc., where scaling the data can speed up the convergence of the algorithm. And for those of us stuck in the 21st century, time (and computational power) is money. Sigh. Beam me up, Scotty!
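In sklearn, standardization takes a few lines, again fitting only on the training split:

```python
from sklearn.preprocessing import StandardScaler

# Transform each feature to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
```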
Dealing with Imbalance: Oversampling vs Undersampling
As mentioned above, our dataset is 80% Captain Picard and 20% Not Captain Picard. Keeping this ratio will limit our model’s ability to accurately predict the minority class of Not Picard.
The simplest (and perhaps the least controversial) way to balance our dataset would be random undersampling: randomly dropping Picard records until there is a 50/50 split between Captain Picard and Not Captain Picard. Conversely, random oversampling would randomly replicate our Not Captain Picard records.
Other options include using an algorithm like Synthetic Minority Oversampling Technique (SMOTE) to generate synthetic data. And of course there are those who feel strongly we should leave the imbalance as is.
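These resampling approaches live in the imbalanced-learn (imblearn) package. A sketch, applied to the training data only so the test set keeps its natural imbalance:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Random undersampling: drop majority-class (Picard) records.
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(
    X_train_scaled, y_train
)

# Or SMOTE: synthesize new minority-class (Not Picard) records instead.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(
    X_train_scaled, y_train
)
```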
Encoding
Predictive models run on numeric values, so our categorical values must be encoded numerically. In this example our categorical variable is which series they have watched. One hot encoding is the preferred option for categorical variables that have no inherent ordering.
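With pandas this can be as simple as the following (assuming a single series_watched column; a “pick all that apply” survey field would need a little more wrangling):

```python
import pandas as pd

# One 0/1 column per series, with no ranking implied between them.
df = pd.get_dummies(df, columns=["series_watched"], prefix="watched")
```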
Evaluating and Interpreting the Model
In addition to metrics like accuracy, precision, f1, and AUC ROC (see links below for more detailed explanations of each), we also want to report out the feature importance. In other words, of the variables we selected, which are the strongest predictors for card-carrying members of the Jean-Luc Picard Fandom?
Our logistic regression model assigns each feature a coefficient: a value that represents how much that feature contributes to the prediction.
You can accomplish this by using the .coef_ attribute from sklearn’s Logistic Regression classifier to get your coefficients and then convert them from log odds to regular odds with np.exp() in order to make them more interpretable.
Breaking down the math: the odds of our target are the probability of Captain Picard divided by the probability of Not Captain Picard. Taking the natural log of those odds (the logit function) makes the odds of Captain Picard symmetrical to the odds of Not Captain Picard so we can accurately compare the two. (If you are still confused by odds, probabilities, and log odds, don’t worry. It’s a lot. This video breaks it down even further.)
However, log odds can be confusing to interpret when we are trying to quickly quantify the influence a given feature has on our prediction. Converting the log odds to regular odds allows us to make observations like:
For every additional cup of tea, earl grey, hot consumed, the odds of a person favoring Captain Picard and forsaking all others are multiplied by {exponentiated coefficient} when all other features remain the same.
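Putting it together, a sketch of fitting the model and exponentiating its coefficients (variable names carried over from the earlier sketches):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# .coef_ holds the log-odds coefficients; np.exp() converts them to odds ratios.
feature_names = X_train.columns  # column order is preserved through scaling
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=feature_names)

# An odds ratio of, say, 1.25 means each unit increase in that feature
# multiplies the odds of favoring Picard by 1.25, all else held equal.
print(odds_ratios.sort_values(ascending=False))
```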
Model Iteration
Finally, don’t settle on the first model you create. Experiment with different features and hyperparameters. Look for signs of overfitting (where your model works too well on the training data and poorly on the test data), and underfitting (where your model isn’t doing well on either the training or the test data).
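One way to iterate systematically is a small grid search over the regularization strength, scored on precision to match our mission (a sketch, reusing names from above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}  # inverse regularization strength
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="precision", cv=5
)
search.fit(X_train_scaled, y_train)
print(search.best_params_, round(search.best_score_, 3))
```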
Keep iterating until you’ve built a model that best serves your mission. As Captain Picard once said: “There is a way out of every box, a solution to every puzzle. It’s just a matter of finding it.”
Sources & Further Reading
https://www.geeksforgeeks.org/understanding-logistic-regression/
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
https://www.datacamp.com/tutorial/categorical-data
https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
https://towardsdatascience.com/precision-and-recall-88a3776c8007
https://towardsdatascience.com/understanding-the-roc-curve-and-auc-dd4f9a192ecb
https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/
https://www.simplilearn.com/tutorials/machine-learning-tutorial/overfitting-and-underfitting