Which Machine Learning Algorithm Should I Use for My Data Science Project?

Alice Zhao
Learning Data

--

A common question I receive from my data science students is, “Which machine learning algorithm should I use for my particular dataset or project?”

The answer is that it depends on what type of problem you’re trying to solve. This is the flow chart I walk through every time before I kick off a data science project.

Question 1: Are you trying to predict something?

If the answer is yes, then you’re going to want to use a supervised learning technique for your analysis.

If the answer is no, then you’re going to want to use an unsupervised learning technique for your analysis.

Question 2a: Are you predicting something continuous or categorical?

If you go down the supervised learning route, the next question you need to answer is whether you’re predicting something continuous (house prices, temperature, etc.) or categorical (sale or not, spam or not, etc.).

If you want to predict something continuous, you’ll need to use a regression technique:

  • The first regression technique I always start with is Linear Regression
  • If I want to try a different model, I’ll use Regularized Regression (Ridge Regression, LASSO Regression, etc.); both approaches are sketched in the code below
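
Here’s a minimal sketch of that progression, assuming a scikit-learn setup and a small made-up dataset (nothing here is tied to any particular project):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Made-up continuous-target data (stand-in for house prices, temperature, etc.)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Start with plain Linear Regression, then try regularized variants
models = [
    ("Linear Regression", LinearRegression()),
    ("Ridge Regression", Ridge(alpha=1.0)),
    ("LASSO Regression", Lasso(alpha=0.1)),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(name, "R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
```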

If you want to predict something categorical, you’ll need to use a classification technique:

  • The first classification technique I always start with is Logistic Regression
  • I then move on to tree-based models including Random Forests and Gradient Boosting, which often perform very well for classification (see the sketch after this list)
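
And a similar sketch for the classification side, again assuming scikit-learn and made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Made-up categorical-target data (stand-in for spam or not, sale or not, etc.)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Start with Logistic Regression, then move on to tree-based models
models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(name, "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```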

Question 2b: Are you trying to group data points or reduce features?

If you aren’t trying to predict something and you go down the unsupervised learning route, the next question you need to answer is whether you want to group your data or reduce the number of features (columns).

If you want to group your data (aka segment or cluster your data), then you’ll need to use a clustering technique:

  • The first clustering technique I always start with is K-Means Clustering
  • I then move on to other clustering techniques, including Hierarchical Clustering, DBSCAN, etc. (see the sketch after this list)
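
A rough sketch of that clustering progression, assuming scikit-learn and some made-up, unlabeled data points:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

# Made-up unlabeled data (stand-in for customers you want to segment)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # scaling matters for distance-based methods

# Start with K-Means, then compare other clustering techniques
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

print("K-Means cluster sizes:", [list(kmeans_labels).count(i) for i in range(4)])
```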

If you want to reduce the number of features, then you’ll need to use a dimensionality reduction technique. Imagine you have 10 columns of data and you want to turn them into 2 columns that capture the behavior of the original 10 — that is where dimensionality reduction comes in.

  • If I’m trying to reduce features before modeling, then the first technique I try is Principal Component Analysis (PCA)
  • If I’m trying to reduce features to visualize my data, then I start with PCA and potentially move on to t-SNE
  • If I’m trying to reduce features in a Natural Language Processing context (I have many documents and I want to find the main topics or themes within them), then I would use topic modeling techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF), as sketched below
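
As a quick illustration of two of these, here’s a sketch of PCA and NMF using scikit-learn and a tiny made-up document set (the documents and column counts are purely for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import NMF, PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# PCA: squeeze 10 made-up columns down to 2 that capture most of the variance
X, _ = make_blobs(n_samples=200, n_features=10, random_state=42)
X_2d = PCA(n_components=2).fit_transform(X)
print("Reduced shape:", X_2d.shape)  # (200, 2)

# NMF topic modeling: find 2 "topics" in a tiny made-up document set
docs = [
    "the cat sat on the mat",
    "dogs and cats make great pets",
    "stocks rose as markets rallied",
    "investors bought shares today",
]
doc_term = TfidfVectorizer(stop_words="english").fit_transform(docs)
doc_topics = NMF(n_components=2, random_state=42).fit_transform(doc_term)
print("Document-topic weights:\n", doc_topics.round(2))
```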

Question 3: Are there algorithms that don’t fall within these four buckets?

Yes, but in my experience, the majority of data science problems do fall into these buckets.

Here’s an example of a situation that is outside of this flow chart:

Sometimes data scientists are tasked with answering questions such as:

  • Which products were purchased by the highest spending customers?
  • How have cancellation rates changed over time?

These questions don’t require machine learning at all (no predictions, no grouping data points, no reducing dimensions), and instead can be answered by slicing and dicing the data using techniques such as filtering, sorting and grouping. This is more formally known as Exploratory Data Analysis (EDA).
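
For example, the first question could be answered with a few lines of pandas on a hypothetical orders table (the column names and values below are made up):

```python
import pandas as pd

# Hypothetical purchase data
orders = pd.DataFrame({
    "customer": ["Ana", "Ana", "Ben", "Ben", "Cal"],
    "product":  ["Laptop", "Mouse", "Laptop", "Desk", "Chair"],
    "amount":   [1200, 25, 1150, 300, 150],
})

# "Which products were purchased by the highest spending customers?"
spend = orders.groupby("customer")["amount"].sum().sort_values(ascending=False)
top_customers = spend.head(2).index  # top 2 spenders
print(orders.loc[orders["customer"].isin(top_customers), "product"].unique())
```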

Here’s an example of a situation that falls under multiple buckets in this flow chart:

Natural Language Processing is a great example of a field that can incorporate a variety of machine learning techniques.

While there is some text preprocessing you need to do before applying these algorithms (tokenization, etc.), once your text is transformed into a format that can be fed into a model, you can use the same machine learning flow chart. A quick spam-or-not sketch follows the list below.

  • If you want to predict a readability score → use regression techniques
  • If you want to classify emails as spam or not spam → use classification techniques
  • If you want to organize or cluster documents → use clustering techniques
  • If you want to find the main topics within your documents (aka topic modeling) → use dimensionality reduction techniques
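
Here’s that spam-or-not sketch, assuming scikit-learn and a handful of made-up emails; the vectorizer handles the text preprocessing, and the classifier is the same kind of model from the supervised bucket above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up labeled emails: 1 = spam, 0 = not spam
emails = [
    "win a free prize now",
    "claim your free reward today",
    "meeting moved to 3pm",
    "are we still on for lunch tomorrow?",
]
labels = [1, 1, 0, 0]

# Turn raw text into features, then apply the usual classification step
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)
print(model.predict(["free prize inside"]))  # most likely [1]
```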

Final Thoughts

I truly believe that this flow chart is the best way to kick off a data science project. My students are very used to seeing me draw this on a whiteboard before we discuss any algorithms or even look at the data!

  • Just knowing whether to take a supervised or unsupervised approach can help you structure your data in a way that can be easily input into a machine learning model.
  • Often I’ll try techniques from several buckets if it makes sense for my project. I also keep in mind that EDA might be good enough for my project, and I may not need to use any machine learning algorithms.

So, to answer the question, “Which machine learning algorithm should I use for my data science project?” Use the flow chart to decide!

More details on how to scope a data science project and prepare data for analysis can be found in my course, Data Science in Python: Data Prep & EDA on the Maven Analytics platform and on Udemy.


--

Hi! 👋 I'm a data scientist & author of the SQL Pocket Guide (O’Reilly). Check out my Data Science in Python series on Maven / Udemy & my blog, A Dash of Data.