Learning about data mining algorithms is not for the faint of heart, and the literature on the web makes it even more intimidating. It seems as though most of the data mining information online is written by Ph.D.s for other Ph.D.s.
Here is a drill-down on the top ten data mining algorithms, which seem to get a lot of attention and Google search volume. In the spirit of demystifying these algorithms, I tried to use everyday language wherever I could instead of technical language. I won’t apologize for this, but I do want the ‘purists’ to be aware of this deliberate choice. This article is for the ‘rest of us’ who just want to scratch the surface to do enough damage, but aren’t interested in diving deep (yet).
One of the first questions people ask about a particular algorithm is whether it is ‘supervised’ or ‘unsupervised’. Here is what those terms mean.
Supervised learning — Algorithms that need a ‘training’ set of data to learn.
Unsupervised learning — Algorithms that don’t need any training data to work properly.
Another key question is ‘What type of algorithm is it, based on how it functions?’ Here are the main types of algorithms.
Classification: These algorithms put the existing data (or past data) into various ‘classes’ (hence classification) based on their attributes (properties) and use that classified data to make predictions.
Regression: These algorithms build a mathematical model from existing data elements and use that model to predict one or more values. They are mostly used with numbers such as profit, cost, real estate values, etc. The primary difference between classification and regression algorithms is the type of output: regression algorithms predict numeric values, whereas classification algorithms predict a ‘class label’.
Segmentation or clustering: These algorithms divide data into groups, or clusters, of items that have similar properties.
Association: These algorithms find some relation (technically called correlation) between different attributes or properties in existing data and attempt to create ‘association’ rules to be used for predictions. The algorithms find items in data that frequently occur together.
Sequence analysis: These algorithms find frequent sequences in data (Ex: Series of clicks in a web site, or a series of log events preceding machine breakdown).
Time series: These algorithms are similar to regression algorithms in that they predict numerical values, but time series algorithms focus on forecasting future values of an ordered series and also incorporate seasonal cycles (ex: warehouse inventory management).
Dimensionality reduction: Some datasets contain so many variables that it is almost impossible to identify the ones with a real impact on prediction. Dimensionality reduction algorithms help identify the most important variables.
Additionally, there are some key technical terms that we need to know before we find out about the algorithms. They are:
Classifier program — A program to sort data entries into different classes. E.g. a classifier might sort cars into classes such as sedans, SUVs etc.
Outliers — Data points that are out of the usual range. E.g. in a test with most scores between 40–45, a score of 100 would be an outlier.
Noisy data — Data containing many outliers or errors.
With that background, let us now move on to our featured topic of the most popular data mining algorithms. I have curated this list from various publications, but the most important source is a research paper from the IEEE International Conference on Data Mining. Drum roll please. Here we go!
C4.5 algorithm / Supervised / Classification Type
An algorithm that generates a decision tree (technical term: classifier) from a set of training data; the tree can then be used to classify new samples.
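To make this concrete, here is a toy sketch in plain Python of the splitting step at the heart of C4.5: measure entropy, then pick the feature with the highest information gain. (Real C4.5 uses the gain ratio, handles numeric features and missing values, and builds the tree recursively; the weather-style data below is made up for illustration.)

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in entropy from splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def best_feature(rows, labels):
    """Pick the feature with the highest information gain -- the core
    splitting step that C4.5 applies recursively to grow a tree."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))

# Made-up toy data: 'outlook' perfectly predicts the label, 'windy' does not.
rows = [{"outlook": "sunny", "windy": "yes"},
        {"outlook": "sunny", "windy": "no"},
        {"outlook": "overcast", "windy": "yes"},
        {"outlook": "overcast", "windy": "no"}]
labels = ["no", "no", "yes", "yes"]
```

Running `best_feature(rows, labels)` on this toy data picks ‘outlook’, because splitting on it leaves perfectly pure groups.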
K-Means algorithm / Unsupervised / Clustering type
Partitions the data into a predetermined number of clusters with each cluster having a center of gravity (technical term Centroid) around which the data is clustered.
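Here is what that assign-then-recenter loop looks like as a toy sketch in plain Python, with a deliberately naive initialization (the first k points); real implementations use random restarts or k-means++ and a proper convergence check.

```python
def kmeans(points, k, iters=20):
    """Toy k-means: assign each point to the nearest centroid, then move
    each centroid to the mean (center of gravity) of its cluster."""
    # Naive init: take the first k points as starting centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [tuple(sum(vals) / len(cl) for vals in zip(*cl))
                     if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Made-up data: two obvious blobs, one near (0, 0) and one near (10, 10).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(points, k=2)
```

After a few iterations the two centroids settle at the centers of the two blobs.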
Support Vector Machines (SVM) algorithm / Supervised / Classification or Regression type
The SVM classification algorithm attempts to separate data into target classes with the widest possible margin around the decision boundary (technical term: hyperplane); with two classes in two dimensions, that boundary could be just a line. The SVM regression algorithm, on the other hand, tries to find a continuous function such that the maximum number of data points fall within an epsilon-wide tube around it.
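The ‘widest possible margin’ idea is easiest to see in one dimension: for cleanly separated 1-D data, the maximum-margin boundary is simply the midpoint of the gap between the two closest points of opposite classes. The helper below is only that geometric intuition on made-up numbers; a real SVM solves a quadratic optimization problem and supports soft margins and kernels.

```python
def max_margin_threshold(neg, pos):
    """For 1-D data where every negative point is below every positive
    point, the maximum-margin separator is the midpoint between the two
    closest opposite-class points; the margin is half that gap."""
    boundary = (max(neg) + min(pos)) / 2
    margin = (min(pos) - max(neg)) / 2
    return boundary, margin

# Made-up 1-D data: negatives cluster low, positives cluster high.
neg = [1.0, 2.0, 3.0]
pos = [7.0, 8.0, 9.0]
boundary, margin = max_margin_threshold(neg, pos)
```

Here the closest opposite-class points are 3 and 7, so the widest-margin boundary sits halfway between them.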
Apriori algorithm / Unsupervised / Association type
Identifies the frequent individual items in the database and extends them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to derive association rules, which highlight general trends in the database.
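A toy sketch of that grow-and-prune loop in plain Python (the grocery transactions are made up; a real Apriori implementation prunes candidates much more aggressively and then derives rules with confidence scores):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Toy Apriori: keep the single items that appear in at least
    min_support transactions, then repeatedly grow the surviving item
    sets by one item, pruning any set that is no longer frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}
    current = [s for s in (frozenset([i]) for i in items)
               if support(s) >= min_support]
    while current:
        for s in current:
            frequent[s] = support(s)
        # Candidate generation: union pairs of frequent k-sets into (k+1)-sets.
        size = len(current[0]) + 1
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size}
        current = [c for c in candidates if support(c) >= min_support]
    return frequent

# Made-up shopping baskets.
transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"],
                ["bread", "milk", "beer"]]
frequent_sets = apriori(transactions, min_support=3)
```

On this data, {bread, milk} is the only pair that clears the support threshold, which is exactly the kind of ‘these items occur together’ finding association rules are built on.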
Expectation-Maximization (EM) algorithm / Unsupervised / Clustering type
The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood estimates when the data is incomplete or has hidden values. It works by taking a guess at the hidden values, using that guess to estimate the model’s parameters, then using the new parameters to produce a better guess, repeating until the estimates converge on a fixed point.
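Here is a bare-bones sketch of EM fitting a mixture of two 1-D Gaussians in plain Python. The data and the crude initialization are made up for illustration; a real implementation would handle k components, multivariate data, and a proper convergence test.

```python
import math

def em_two_gaussians(data, iters=50):
    """Toy EM for a 1-D mixture of two Gaussians: the E-step computes how
    much each point 'belongs' to each component, the M-step re-estimates
    means, variances, and weights from those soft assignments."""
    # Crude init: split the sorted data in half.
    data = sorted(data)
    half = len(data) // 2
    mu = [sum(data[:half]) / half, sum(data[half:]) / (len(data) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            w[k] = nk / len(data)
    return mu, var, w

# Made-up data drawn around two centers, 1 and 10.
mu, var, w = em_two_gaussians([0, 1, 2, 9, 10, 11])
```

On this well-separated data the estimated means settle near 1 and 10, with each component claiming half the points.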
PageRank algorithm / Unsupervised / Association type
PageRank, made famous by Google’s use of it to rank web pages, is a link analysis algorithm that determines the relative importance of an object linked within a network of objects.
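A minimal power-iteration sketch in plain Python shows the core idea: each page repeatedly passes its importance along its outbound links. The three-page link graph is made up; real implementations work at web scale with sparse-matrix tricks.

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank on a dict of node -> list of outbound links.
    Each round, every node splits its current rank among the nodes it
    links to; the damping factor models a surfer who sometimes jumps
    to a random page."""
    nodes = list(links)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if not outs:  # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# Made-up link graph: both b and c link to a, so a should rank highest.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
```

The ranks always sum to 1, and page ‘a’ comes out on top because it collects links from both other pages.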
Adaboost algorithm / Supervised / Classification type
Adaboost, short for Adaptive Boosting, is part of a family of boosting algorithms, of which GBM and XGBoost are other popular members. Adaboost combines multiple ‘weak classifiers’ into a single ‘strong classifier’, or in other words, uses multiple ‘weak’ learning systems to put together a ‘strong’ learning system.
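This toy sketch shows the mechanism on made-up 1-D data, using simple threshold ‘stumps’ as the weak classifiers: each round picks the stump with the lowest weighted error, then increases the weights of the points it got wrong so the next stump focuses on them. (A real implementation would use proper decision trees as weak learners and handle multi-dimensional data.)

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: +1 on one side of a threshold, -1 on the other."""
    return polarity if x >= threshold else -polarity

def adaboost(xs, ys, rounds=10):
    """Toy AdaBoost on 1-D data: each round finds the best weighted
    stump, gives it a vote (alpha) based on its accuracy, and re-weights
    the training points so mistakes count more next round."""
    n = len(xs)
    w = [1 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in xs:  # try each data value as a threshold
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                          if stump_predict(xi, t, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Re-weight: boost the misclassified points, then normalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, pol))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """The 'strong' classifier: a weighted vote of all the stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Made-up labeled points: -1 below 4, +1 from 4 upward.
xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys, rounds=3)
```

The ensemble’s weighted vote classifies all the training points correctly, and the same weighting trick lets boosting fit data that no single stump could.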
K-Nearest Neighbors (KNN) algorithm / Supervised / Classification type
The KNN algorithm first finds the k closest labeled training data points, in other words the k nearest neighbors; k here is simply the number of neighbors to consult. Then, using those neighbors’ classes, KNN decides how the new data point should be classified, typically by majority vote.
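KNN is simple enough to fit in a few lines of plain Python; the two-color point clouds below are made up for illustration, and a real implementation would use spatial indexes (like k-d trees) rather than sorting every point for every query.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify a query point by majority vote among its k nearest
    labeled neighbors (squared Euclidean distance)."""
    # train is a list of (point, label) pairs.
    neighbors = sorted(train,
                       key=lambda item: sum((a - b) ** 2
                                            for a, b in zip(item[0], query)))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up labeled points: a 'red' cluster near the origin, 'blue' near (5, 5).
train = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
```

A query like `(1, 1)` lands among the red points, so its three nearest neighbors outvote everything else.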
Naive Bayes algorithm / Supervised / Classification type
Naive Bayes, a family of algorithms, makes predictions using Bayes’ Theorem, which derives the probability of an event based on prior knowledge of conditions that might be related to that event. The “naive” comes from the algorithm’s assumption of conditional independence between every pair of features in the data set. It is used in applications such as spam filtering, text classification, and sentiment analysis.
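A toy categorical Naive Bayes in plain Python makes the ‘multiply the probabilities’ idea concrete. The spam-style example data is made up, and the add-one (Laplace) smoothing here is a simplification of what production libraries do.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count how often each (feature, value) appears per class; these
    counts become the conditional probabilities Naive Bayes multiplies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature) -> value counts
    for row, label in zip(rows, labels):
        for feat, val in row.items():
            feature_counts[(label, feat)][val] += 1
    return class_counts, feature_counts

def predict_nb(model, row):
    """Score each class as prior * product of per-feature likelihoods,
    naively assuming the features are independent given the class."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, count in class_counts.items():
        p = count / total  # prior P(class)
        for feat, val in row.items():
            counts = feature_counts[(label, feat)]
            # Add-one (Laplace) smoothing so unseen values don't zero out p.
            p *= (counts[val] + 1) / (count + len(counts) + 1)
        if p > best_p:
            best, best_p = label, p
    return best

# Made-up training data for a tiny spam filter.
rows = [{"contains_offer": "yes"}, {"contains_offer": "yes"},
        {"contains_offer": "no"}, {"contains_offer": "no"},
        {"contains_offer": "yes"}]
labels = ["spam", "spam", "ham", "ham", "ham"]
model = train_nb(rows, labels)
```

Even though ‘ham’ is the more common class, a message containing ‘offer’ tips the multiplied probabilities toward ‘spam’.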
Classification And Regression Trees (CART) algorithm / Supervised / Classification & Regression type
It is a decision tree learning technique that outputs either classification or regression trees. The CART algorithm provides a foundation for important ensemble methods like bagged decision trees, random forests, and boosted decision trees.
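CART’s signature move is finding the numeric threshold that minimizes Gini impurity, which this plain-Python sketch illustrates on made-up numbers. (C4.5 earlier used entropy; CART uses Gini and always makes binary splits, applying this step recursively to grow the tree.)

```python
def gini(labels):
    """Gini impurity: the chance that two random draws from this group
    have different labels. 0 means the group is pure."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """CART's core step for one numeric feature: try a threshold between
    each pair of adjacent values and keep the one that minimizes the
    weighted Gini impurity of the two sides."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up feature values where the classes separate cleanly around 6.5.
threshold, impurity = best_numeric_split([1, 2, 3, 10, 11, 12],
                                         ["a", "a", "a", "b", "b", "b"])
```

The best split lands at 6.5 with zero impurity, since each side of the threshold contains only one class.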
And here is a cheat sheet that you can download and keep in your pocket for quick reference.
Of course, there are a lot of other algorithms, like random forest, GBM, XGBoost, GMM, kernel approximation, etc., and choosing the best algorithm for a specific analytical task can be a challenge. For the same business problem, you can use different algorithms; each algorithm produces a different result, and some algorithms can produce more than one type of result. Hopefully this article has at least made you familiar with the most popular ones.
What is your favorite algorithm? What other important algorithms am I missing in this list?
Thank you for reading.