What Does Regression Mean in Data Science?
How the Word Gets Confused with the Model Type
In data science, regression problems are problems where we want to train a model using features derived from data that can predict a numeric value (e.g. the target), within a certain confidence band (usually 95%). The assumption is that the output we are interested in predicting is a continuous numerical value.
For example, let’s say we wanted to predict the approximate salary of a job based on the words provided in the job description. As a quick aside, a company is actually working on this very problem (see here).
Because the outcome of interest, salary, is easy to conceptualize as a continuous, numeric outcome, this example is an easy real problem to translate into a data science problem. In fact, regression problems are often the first step in any organization at incorporating data science into their solutioning because most businesses are clear on the metrics that drive their business decisions (e.g. money [profit, revenue, salary, etc], volumes [units produced per day, hours worked, etc], etc).
In order to train a regression model to solve the problem of predicting salary from job descriptions, we need to create a data set that includes both features derived from the job descriptions and salaries associated with each job description. A simple solution may look something like this:
1. Use a count vectorizer to generate the top n words and phrases from the job descriptions
a. Each word and phrase will be a feature column in the data set
2. Transform each job description using our count vectorizer such that each job description is now represented by the frequency of different words and phrases in each of n columns
3. Align with the salaries associated with each job description
4. Train a classic multiple regression model using the features to predict the salaries
5. Evaluate performance using R-squared
6. Evaluate the contribution of each feature (e.g. word or phrase column) to the solution by examining the coefficients (strength and significance [p-value]).
In the above example, we demonstrated how to derive a simple multiple regression model to predict the salaries of job descriptions based on the features we could derive from those descriptions. We also assume that each feature (e.g. word or phrase) shares a linear relationship with salary using a simple multiple regression. Moreover, without any feature selection in place we also run the risk of reducing the models ability to accurately generalize to new job descriptions that are in-need of a salary prediction.
Fortunately, there are many other model types that we can use to help us deal with issues such as non-linearity and model complexity. Here are just a few different regression model algorithms and a few notes on what they help us to overcome:
The most basic model algorithm that learns the weights (e.g. coefficients) needed to perform a linear combination of features to predict a numerical outcome. These models are simple to build, easy to interpret, and are computationally efficient. On the downside, they also assume rectilinear (e.g. straight lines) relationships with the outcome variable. These models are best generalized when the outcome is normally distributed, the coefficients of the features are significantly related to the outcome, and the errors (e.g. residuals) are also normally distributed.
Ridge and Lasso Regression:
Both ridge and lasso regression use different corrections when estimating coefficients that help to deal with very complex models. That is, models that include a lot of features. The general idea is that coefficients are penalized (e.g. reduced to 0) the more complex (e.g. more features) the model.
Non-Linear Regression Algorithms:
o Decision Trees are a form of a non-linear model that can be used to predict numerical values. The basic idea is that decision trees create a series of splitting rules based on features in order to determine a complex “tree” of splits at different feature values to ultimately come to a prediction. These models can be computationally more taxing than traditional regression models and are very sensitive to the ranges of the training data, meaning the models may not generalize well.
o Random Forests are like decision trees but rather than just a single decision tree, a random forest combines multiple decision trees derived from samples of the data and used to create an average prediction. The averaging across multiple, “weak” trees is why these models are known as ensemble models. Random forests are even more computationally taxing than decision trees and so take a long time to train.
o Support Vector Machines (SVM) are different from traditional regression in that they attempt to identify a hyperplane of predictive values that exist at higher dimensions than regression. In multiple regression we derive a prediction line. In SVMs we derive a hyperplane that is created as a function of the dot product of each feature vector to determine the coefficient values that give a hyperplane with the smallest distance between each data point.
Like engaging to learn about data science, career growth, life, or poor business decisions? Sign up for my newsletter here and get a link to my free ebook.