10 Machine Learning Flavors in sklearn

It is easy to paint machine learning in broad strokes: just one big black box where you plug in what you have and something relevant comes out the other side. In reality it is much more complicated, and there is a wide variety of tools, each better suited to specific problems.

Before computers, much of this was calculated mathematically by statisticians working with some very complex models (here is a great article about statistics and machine learning). Modern computers can run these algorithms in minutes, depending on the amount of data at hand.

Today machine learning and AI (artificial intelligence) have grown exponentially. For many, barriers to entry like expensive computing servers or specialist programmers are being lowered. A solid computer can run models, and if there is too much data or complexity, you are one click away from renting time on the Amazon Cloud with all the horsepower you want.

Here is a great primer that explains some of the conceptual differences between types of algorithms:


There are several classes of machine learning, and the choice of algorithm reflects the application. Here are the basic breakdowns as defined by Scikit-learn (sklearn), a great library for machine learning:

Classification: Identifying which category an object belongs to.
Applications: Spam detection, Image recognition.
Algorithms: SVM, nearest neighbors, random forest, … 

Regression: Predicting a continuous-valued attribute associated with an object.
Applications: Drug response, Stock prices.
Algorithms: SVR, ridge regression, Lasso, … 


Clustering: Automatic grouping of similar objects into sets.
Applications: Customer segmentation, Grouping experiment outcomes.
Algorithms: k-Means, spectral clustering, mean-shift, …

Here is Microsoft's cheat sheet on which algorithm to use:

For each of these groups we are going to dig in a bit deeper, but by no means is this an exhaustive list. For me it was a great way to learn a little more about each one, and it should provide a light overview when you are thinking about what is out there. Each title is linked to the related sklearn page so you can explore the parameters for your own projects.


A Classification Algorithm is a procedure for selecting a hypothesis from a set of alternatives that best fits a set of observations. Or in normal words, it is a way to determine which group an object belongs to using multiple variables.

Random Forest Classifier (Classification)

Random Forest is a commonly used algorithm, named for the multitude of decision trees it builds, which are combined either to classify (RandomForestClassifier) or to return the mean prediction (RandomForestRegressor) of the individual trees. It fits each tree on a different sub-sample of the dataset and averages the results to improve predictive accuracy and control overfitting.

It is a fairly strong model and widely used on Kaggle for its versatility; it was actually the first one I used on my Titanic problem.


Fast, simple to use and robust to noise and missing data, but may be difficult to interpret. There is a reason it is growing in popularity.
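To make the Random Forest description a little more concrete, here is a minimal sketch. The iris dataset, the 75/25 split and the choice of 100 trees are my own assumptions for illustration, not something taken from the Titanic problem mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset bundled with sklearn (assumed here purely for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# n_estimators is the number of decision trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# Importances give a rough view of which features the trees relied on
print("Feature importances:", forest.feature_importances_.round(2))
```

The feature importances are one way to soften the "difficult to interpret" downside, although they are only a rough guide.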

K Nearest Neighbors Classifier (Classification)

Nearest Neighbors-based classification is a type of instance-based learning where classification is determined from a simple majority vote of the nearest neighbors of each point. Out of the box it uses uniform weights, but that can be manually changed to fine-tune the model. In cases where the data is not uniformly sampled, the RadiusNeighborsClassifier can also be used, as it relies on a fixed radius set by the user.


Simple and powerful with no real training phase needed, but memory-hungry and slow to predict on new instances, and it performs poorly as dimensionality increases.
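As a rough sketch of the weighting option described above, the snippet below compares the default uniform vote with distance-weighted voting. The iris data and n_neighbors=5 are assumptions I picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset (an assumption for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

scores = {}
for weights in ("uniform", "distance"):
    # "uniform": every neighbor gets one vote
    # "distance": closer neighbors count for more
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train, y_train)  # fit mostly just stores the training data
    scores[weights] = knn.score(X_test, y_test)
    print(f"{weights:>8}: {scores[weights]:.2f}")
```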

SVM (Classification/Regression)

There are multiple types of Support Vector Machines (for example SVC, which is C-Support Vector Classification, and LinearSVC, or SVR on the regression side). They are popular in text classification problems where very high-dimensional feature spaces are the norm, but Random Forest seems to be stealing their crown.


Great for complex non-linear relationships and good with noise, but parameter control can get complicated and it uses a lot of memory and processing power as it scales.
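Here is a small sketch of the non-linear point, assuming a synthetic "two moons" dataset and default-ish parameters of my own choosing. It fits an RBF-kernel SVC and a LinearSVC on the same data so the two can be compared.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic data with a curved class boundary (assumed for illustration)
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVMs are sensitive to feature scale, so scale inside a pipeline
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))

rbf_score = rbf_svm.fit(X_train, y_train).score(X_test, y_test)
linear_score = linear_svm.fit(X_train, y_train).score(X_test, y_test)

print(f"RBF kernel SVC: {rbf_score:.2f}")
print(f"LinearSVC:      {linear_score:.2f}")
```

C and gamma are the parameters that usually need tuning, which is where the "parameter control can get complicated" part comes in.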

Gradient Boosting (Classification/Regression)

Gradient Boosting is a combination of Gradient Descent and Boosting. It builds, in a forward stage-wise manner, an aggregate of weak prediction models to produce a strong prediction model, and it can optimize arbitrary differentiable loss functions. A similar Boosting algorithm is AdaBoost, where again the sum of the parts is much stronger than the bunch of weak predictions that make it up.


Handles missing values well with no need to transform variables, but it can overfit and it struggles when scaling.
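The stage-wise idea can be watched in action with the sketch below. The synthetic dataset and the parameter values (100 stages, learning rate 0.1, depth 3) are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (assumed for illustration)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each stage adds one shallow tree (a weak learner) to the ensemble
gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbc.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each stage
staged_accuracy = [
    (pred == y_test).mean() for pred in gbc.staged_predict(X_test)
]
print(f"After   1 stage:  {staged_accuracy[0]:.2f}")
print(f"After 100 stages: {staged_accuracy[-1]:.2f}")
```

Watching the staged accuracy is also a simple way to spot where extra stages stop helping, which is one sign of overfitting.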

Gaussian NB/Gaussian Naive Bayes (Classification)

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naïve” assumption of independence between every pair of features. Basically it looks at each feature (like “red” and “round”) as independent and determines the classification probability for each (i.e. if it is an apple) rather than trying to consider multiple features together and then determine the probability as a whole. This is what helps make it fast.

There are several variations on Naive Bayes, including MultinomialNB for multinomially distributed data and BernoulliNB for multivariate Bernoulli distributions. In other words, BernoulliNB specifically requires binary-valued (Bernoulli, boolean) variables as features.


Fast for classification and can be trained on partial sets of data if the whole data set is too big to fit in memory, but it makes an assumption about feature independence that may not hold true in the real world.
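The "trained on partial sets of data" point maps to GaussianNB's partial_fit method. Below is a minimal sketch that feeds the model five chunks one at a time. The iris data and the chunk count are assumptions, and in a real out-of-memory case the chunks would be read from disk rather than split from an array.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Toy dataset, shuffled so every chunk contains a mix of classes
X, y = load_iris(return_X_y=True)
order = np.random.RandomState(0).permutation(len(X))
X, y = X[order], y[order]

nb = GaussianNB()
classes = np.unique(y)  # the full class list is required on the first call

# Pretend the data arrives in 5 chunks that never sit in memory together
for X_chunk, y_chunk in zip(np.array_split(X, 5), np.array_split(y, 5)):
    nb.partial_fit(X_chunk, y_chunk, classes=classes)

probabilities = nb.predict_proba(X[:3])  # one probability per class
accuracy = nb.score(X, y)
print(probabilities.round(3))
print(f"Accuracy on the data seen: {accuracy:.2f}")
```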


A Regression Algorithm uses a statistics-based approach for estimating the relationships among variables. The result can be a linear regression, usually represented as a line of best fit in a scatterplot, or it may be a more complicated relationship between a dependent variable and one or more independent variables (or ‘predictors’). One of the simplest is a direct linear regression (sklearn.linear_model.LinearRegression), which is perfect for data exploration in visualizations.

Logistic Regressions (Regression)

Logistic Regressions are used to predict the odds of a binary dependent variable (what you are trying to predict) based on the values of the independent variables (features). Basically it will try to determine, based on the features, whether the final result is or is not. Despite the name, sklearn's LogisticRegression is used as a classifier.


Nice probabilistic focus and fast to train on big data, but it requires work to make it fit non-linear functions.
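To show the probabilistic side, here is a short sketch using predict_proba. The breast cancer dataset, the scaling step and max_iter=1000 are assumptions I made so that the example converges cleanly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary toy dataset bundled with sklearn (assumed for illustration)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Scaling the features helps the solver converge
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Each row holds [P(class 0), P(class 1)] for one sample
proba = logreg.predict_proba(X_test[:5])
accuracy = logreg.score(X_test, y_test)
print(proba.round(3))
print(f"Test accuracy: {accuracy:.2f}")
```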

Random Forest Regressor (Regression)

As with the Random Forest Classifier, the Random Forest Regressor fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.


Fast to train and does well with noise and missing values, but computing requirements grow when scaling for accuracy as the number of trees increases.
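The averaging can be checked directly. In this sketch (synthetic data and tree counts are my own assumptions) the forest's prediction for one sample is compared against the plain mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed for illustration)
X, y = make_regression(
    n_samples=500, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# More trees cost more compute, in exchange for a steadier average
r2_by_trees = {}
for n_trees in (10, 100):
    reg = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    reg.fit(X_train, y_train)
    r2_by_trees[n_trees] = reg.score(X_test, y_test)
    print(f"{n_trees:>3} trees: R^2 = {r2_by_trees[n_trees]:.2f}")

# The forest's prediction is the mean of its individual trees' predictions
sample = X_test[:1]
tree_mean = np.mean([tree.predict(sample) for tree in reg.estimators_], axis=0)
forest_prediction = reg.predict(sample)
```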

Ordinary Least Squares/Ridge Regression (Regression)

Ridge Regression is a refinement of Ordinary Least Squares Regression. Both are linear regression models, that is, methods for estimating the unknown parameters in a linear regression model, with the goal of minimizing the sum of the squares of the differences between the observed responses in the given dataset and those predicted by a linear function of a set of explanatory variables. The key difference is that Ridge adds regularization, a penalty on the size of the coefficients, to prevent overfitting as the coefficients grow. For real detail take a look here.


Ordinary Least Squares is the more commonly used regression, but it struggles with outliers and with anything, obviously, that is non-linear in nature.
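A quick way to see what the regularization does is to fit both models on two nearly identical features, a situation where plain least squares tends to produce unstable coefficients. The made-up data and alpha=1.0 below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = rng.normal(size=(50, 1))

# Two almost identical (collinear) features, made up for this example
X = np.hstack([x, x + rng.normal(scale=0.01, size=(50, 1))])
y = 3 * x[:, 0] + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha sets the penalty strength

print("OLS coefficients:  ", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))

# The penalty shrinks the overall size of the coefficient vector
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
```

A larger alpha shrinks the coefficients further, so it is usually worth trying a few values.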


Clustering is about discovering structure by grouping observations based on their features. It is an unsupervised machine learning method, which basically means there are no labels to train against, and it is widely used in data mining.

k-Means (Clustering)

The k-means problem is solved using Lloyd’s algorithm. Its goal is to cluster data by trying to separate samples into k groups of equal variance. The number (k) of clusters must be specified.


Scales well to a large number of samples and is fast, but if the clusters are not very spherical it can struggle, and running it repeatedly may not return exactly the same answer.
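Here is a minimal k-means sketch on synthetic blobs. The data, k=3 and n_init=10 are assumptions for illustration. Fixing random_state is one way to get the same answer on repeated runs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three roughly spherical blobs (synthetic data, assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# k must be chosen up front; n_init reruns the algorithm from several
# random starts and keeps the best result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centers:")
print(kmeans.cluster_centers_.round(2))
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.1f}")
```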

Mean-Shift (Clustering)

Mean-Shift is a non-parametric (i.e. it doesn’t expect a bell curve distribution) feature-space analysis technique for locating the maxima of a density function. Clusters are formed by iteratively moving candidate centers toward the “high” peaks of density.


Good for uneven distributions in a data set, but it is not highly scalable.
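Unlike k-means, Mean-Shift does not need the number of clusters up front, only a bandwidth. The sketch below is a rough illustration. The synthetic blobs and quantile=0.2 are assumptions, and the number of clusters found will depend on the bandwidth.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic blobs (assumed for illustration)
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.6, random_state=0,
)

# The bandwidth sets how wide a neighborhood is used to estimate density
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

mean_shift = MeanShift(bandwidth=bandwidth)
labels = mean_shift.fit_predict(X)

# The number of clusters is discovered rather than specified
n_clusters = len(mean_shift.cluster_centers_)
print(f"Estimated bandwidth: {bandwidth:.2f}")
print(f"Clusters found: {n_clusters}")
```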


The best algorithm is always the one out of MANY you try that produces the best result. As I researched this article (I will go back and update it as I use them more), and after two cups of considerably strong coffee, it became clear that these are very complex tools with many parameters that can be tuned to match the data.

When you are car shopping, you don’t test drive just one car and leave it at that. Likewise, it would be a disservice to use only one algorithm when you are solving a data problem.
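In that test-drive spirit, here is a rough sketch of trying several of the classifiers above on the same data with 5-fold cross-validation. The dataset and the untuned default parameters are assumptions, so treat the ranking as a starting point rather than a verdict.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy dataset (an assumption for illustration)
X, y = load_iris(return_X_y=True)

# Untuned candidates; in practice each one deserves its own tuning
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
}

# Mean accuracy over 5 cross-validation folds for each candidate
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name:<20} {score:.3f}")

best = max(results, key=results.get)
```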

Happy Hunting!