What is Classification in Data Science
And Some Examples of Its More Practical Use
At its core, data science is all about developing models of data that allow us to predict something about the world. Our predictions largely fall into 4 different classes of models. Learning these classes is fundamental to starting out strong in data science and to building an effective mindset for translating real problems into data science problems.
In other words, a large majority of real problems fit into one of 4 data science problem classes. In last week’s newsletter we introduced these 4 classes, along with some sub-classes that are informative to consider.
Herein I focus on Classification.
Recall that data science is all about using data to build models that help us predict things in the world. Classification is the process of learning from old data to properly classify new data; that is, predicting which class or group new data should belong to.
This sounds like clustering; however, in classification we know how the old data should be classified, whereas in clustering we do not know how the data should be grouped. Thus, we refer to classification as supervised learning and clustering as unsupervised learning.
Many real problems can be reframed as classification problems. Let’s look at a few:
1. Document Classification
a. The real problem: An employee at a company is overwhelmed reading digital PDFs to determine which of many different teams each document should go to.
b. The data science problem: Build a classification model that determines which documents belong to which teams. To build the solution, we would need to either convert the PDFs to images and train a convolutional neural network (CNN), or use OCR to extract the text, create features from that text, and then train a text-based classification model.
2. Ad Click Prediction
a. The real problem: A marketing team has several ads they want to use on their company website but don’t know which to prioritize for different users.
b. The data science problem: Build a classification model that predicts which users are likely to click on which ads. To build the solution, we would use historical data on users’ ad clicks to derive user features and train a classification model that predicts whether a given user will click a given ad.
3. Disease State Classification
a. The real problem: A healthcare company wants to reach patients who may have developed a chronic condition, since early detection reduces the cost of treatment.
b. The data science problem: Build a classification model that estimates the likelihood a patient has the disease, given historical medical and patient features. To construct this model, we need to identify the point in time when each case patient was diagnosed with the disease, draw a time-matched random sample of patients who were not diagnosed, derive features from the period prior to the diagnosis date, and then train a model to classify the two groups of patients.
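To make the first example concrete, here is a minimal sketch of the text-based route for document classification: OCR’d text goes through TF-IDF features into a linear classifier. The documents and team labels below are invented placeholders, not real company data — a real pipeline would start from the OCR output of the PDFs.

```python
# Sketch: OCR'd document text -> TF-IDF features -> linear classifier.
# Documents and team labels are made-up placeholders for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "invoice payment due net 30 remittance",
    "purchase order quantity unit price shipping",
    "employee onboarding benefits enrollment form",
    "payroll direct deposit tax withholding",
]
teams = ["finance", "finance", "hr", "hr"]

# Pipeline keeps vectorization and classification as one object.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, teams)

# Route a new document to a team.
print(model.predict(["tax withholding form for new employee"]))
```

With real data, the team labels would come from how documents were historically routed, which is exactly the “old data” the article describes learning from.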
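The ad click example can be sketched the same way. The feature columns below (age, past click rate, whether the ad matches the user’s interests) are hypothetical stand-ins for whatever user features the historical click data actually supports:

```python
# Toy sketch of click prediction: each row is one (user, ad) impression
# with hypothetical features; the target is whether the user clicked.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [user_age, past_click_rate, ad_category_match (0/1)]
X = np.array([
    [25, 0.10, 1],
    [34, 0.02, 0],
    [41, 0.30, 1],
    [29, 0.01, 0],
    [52, 0.25, 1],
    [23, 0.03, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = clicked

clf = LogisticRegression().fit(X, y)

# Rank two candidate ads for the same user by predicted click probability.
new_impressions = np.array([[30, 0.20, 1], [30, 0.20, 0]])
probs = clf.predict_proba(new_impressions)[:, 1]
print(probs)
```

Prioritizing ads is then just sorting by these predicted probabilities, which is what makes this a classification problem rather than a guessing game for the marketing team.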
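For the disease example, the trickiest step is the cohort construction rather than the model itself. A sketch of that step in pandas, with invented column names and dates, might look like:

```python
# Sketch of cohort construction: pick an index date per patient (the
# diagnosis date for cases, a matched date for controls) and keep only
# feature records from before that date. Column names are invented.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "visit_date": pd.to_datetime(["2020-01-05", "2020-06-01",
                                  "2020-02-10", "2020-07-15"]),
    "blood_pressure": [120, 140, 118, 122],
})
index_dates = pd.DataFrame({
    "patient_id": [1, 2],
    "index_date": pd.to_datetime(["2020-05-01", "2020-05-01"]),
    "diagnosed": [1, 0],  # 1 = case, 0 = time-matched control
})

# Keep only visits strictly before each patient's index date, then derive
# one feature row per patient (here: mean blood pressure before index).
cohort = records.merge(index_dates, on="patient_id")
prior = cohort[cohort["visit_date"] < cohort["index_date"]]
features = prior.groupby("patient_id").agg(
    mean_bp=("blood_pressure", "mean"),
    diagnosed=("diagnosed", "first"),
).reset_index()
print(features)
```

Restricting features to the pre-diagnosis window is what keeps the model honest: otherwise it would “predict” the disease from information recorded after the diagnosis.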
In fact, a great many business problems can be reframed as classification problems in data science. The algorithms that support classification can be either binary classifiers, multi-class classifiers, or multi-label classifiers.
Binary classifiers are classifiers that attempt to classify data into one of two classes. Popular binary classifiers include:
- K-Nearest Neighbors
- Logistic Regression
- Support Vector Machines (SVM)
- Decision Trees/Random Forests
- Naive Bayes
Multi-class classifiers are classifiers that attempt to classify data into one of three or more classes. The more popular approaches either build a binary model for every possible pair of classes (one-vs-one; OvO) or build one binary classifier per class, where that class is compared against all the rest combined (one-vs-rest; OvR). Because multi-class classifiers essentially employ binary classification in one way or another, many of the same algorithms used for binary classification are also used in multi-class classification. Popular multi-class classifiers include:
- K-Nearest Neighbors
- Naive Bayes
- Boosted Models (e.g. XGBoost)
- Decision Trees/Random Forests
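Scikit-learn exposes both strategies as explicit wrappers around any binary base classifier, which makes the OvO/OvR distinction easy to see. A small sketch using a linear SVM on the iris dataset, which stands in for any problem with three or more classes:

```python
# One-vs-one vs. one-vs-rest, wrapped around the same binary base
# classifier (a linear SVM). Iris has 3 classes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)  # one model per class pair
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # one model per class

print(len(ovo.estimators_))  # 3 classes -> 3 pairwise models
print(len(ovr.estimators_))  # 3 classes -> 3 one-vs-rest models
```

With k classes, OvO fits k(k-1)/2 models while OvR fits k, which is the practical trade-off between the two strategies described above.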
Finally, multi-label classifiers are classifiers that allow us to learn one or more labels for each example we are classifying. In other words, the labels are not mutually exclusive, so one example can have multiple labels assigned. These are sometimes referred to as soft classifiers because the predicted probabilities for more than one class can be used to assign multiple class labels to a single example. Popular algorithms include:
- Multi-Label Random Forest
- Multi-Label Decision Trees
- Multi-Label Gradient Boosting
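A minimal multi-label sketch: each example can carry several non-exclusive labels, MultiLabelBinarizer turns the label sets into an indicator matrix, and MultiOutputClassifier fits one random forest per label (the binary-relevance approach). The topics and sentences below are invented for illustration:

```python
# Multi-label classification: one binary random forest per label.
# Texts and label sets are invented examples.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "stock prices and election results",
    "the championship game final score",
    "new tax policy debated in congress",
    "star player traded before the playoffs",
]
labels = [{"finance", "politics"}, {"sports"}, {"politics"}, {"sports"}]

# Indicator matrix: one 0/1 column per label, multiple 1s allowed per row.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

vec = CountVectorizer()
X = vec.fit_transform(texts)

model = MultiOutputClassifier(RandomForestClassifier(random_state=0))
model.fit(X, Y)

pred = model.predict(vec.transform(["election results and the final score"]))
print(mlb.inverse_transform(pred))  # zero, one, or several labels
```

Note how the first training example carries two labels at once; that is exactly what separates multi-label from multi-class classification.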
As practice, take a look at the things that challenge you in your own life. For example, I was spending a lot of time grading and noticed that my comments were very similar across the papers I was grading. I translated this into a classification problem by reframing the issue as one of assigning a comment to each paragraph of a paper. I treated the comments as different classes and the paragraphs of each paper as the data, and built a multi-class classifier to assign a comment to each paragraph.
Enjoy learning about data science, career growth, life, or poor business decisions? Sign up for my newsletter here and get a link to my free ebook.