Career-Changing Algorithms to Make You a Better Data Analyst
With experience, you will find that a handful of algorithms can solve most of your problems. This article covers the 3 most useful classic machine learning (CML) algorithms you need in your toolbox.
The following 3 classification algorithms are the go-to choices for CML problems:
I- Naive Bayes (NB)
NB is a probability-based modeling algorithm based on Bayes’ theorem. Bayes’ theorem simply states the following:
The probability of an event is based on prior knowledge of conditions that might be related to the event.
Bayes’ theorem discusses conditional probability. Conditional probability is the likelihood that event A occurs given that condition B is true.
For example, consider human eyesight and its relationship to a person’s age. According to Bayes’ theorem, age can help assess more accurately the probability that a person wears glasses, compared to an assessment made without knowledge of the person’s age. In this example, the age of the person is the condition.
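The glasses example can be worked through numerically. Every probability below is invented purely for illustration: suppose 40% of people wear glasses, 60% of glasses wearers are over 50, and 30% of the whole population is over 50.

```python
# All probabilities here are hypothetical, chosen only to illustrate Bayes' theorem.
p_glasses = 0.4               # P(G): prior probability that a person wears glasses
p_over50_given_glasses = 0.6  # P(A|G): probability that a glasses wearer is over 50
p_over50 = 0.3                # P(A): probability that any person is over 50

# Bayes' theorem: P(G|A) = P(A|G) * P(G) / P(A)
p_glasses_given_over50 = p_over50_given_glasses * p_glasses / p_over50
print(round(p_glasses_given_over50, 2))  # roughly 0.8
```

Without knowing the age, the best estimate is the 40% prior; conditioning on the age pushes it to about 80%.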
The "naive" part of the name comes from the algorithm's very naive assumption that the attributes are independent of one another.
Some advantages of NB algorithms include:
1- NB is good for spam detection where classification returns a category such as spam or not spam.
2- NB can accept categorical and continuous data types.
3- NB can work with missing values in the dataset by omitting them when estimating probabilities.
4- NB is also effective with noisy data because the noise averages out with the use of probabilities.
5- NB is highly scalable and it is especially suited for large databases.
6- NB can adapt to most kinds of classification, and it’s an excellent algorithm choice for document classification, spam filtering, and fraud detection.
7- NB is good for updating incrementally.
8- NB offers an efficient use of memory and fast training speeds. The algorithm is suitable for parallel processing.
The main disadvantage of NB is that it does not work well when data attributes have some degree of correlation, because that violates the naive assumption of the algorithm.
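As a concrete sketch of the spam-filtering use case from advantage 1, here is a minimal multinomial-style naive Bayes classifier in plain Python. The four-message training corpus is invented for illustration; Laplace smoothing handles words that never appear in a class, which is how the algorithm copes with sparse counts.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy corpus (all messages invented for illustration)
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# Class priors and per-class word counts
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Start from the log prior P(label)
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace (+1) smoothing avoids zero probability for unseen words
            p = (word_counts[label][word] + 1) / (total + len(vocab))
            score += math.log(p)  # naive independence: multiply word probabilities
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("free money"))     # classified as spam
print(predict("lunch at noon"))  # classified as ham
```

Because only word counts and priors are stored, updating the model incrementally (advantage 7) is just a matter of adding new counts.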
II- Random Forest (RF)
To understand RF, it is first necessary to understand decision trees. A decision tree is a supervised learning method for classification that is grown from the training data set. Decision trees take a divide-and-conquer approach to learning and can then classify instances in the test data set. A random forest is an ensemble of many decision trees, each grown on a random subset of the training data and features, with the trees' votes combined into a final classification.
Random forests are great with high-dimensional data since each tree works with a subset of the data. Each tree is also fast to train because it considers only a subset of the features, so the model can easily work with hundreds of features.
The RF algorithm has several advantages:
1- The individual trees in an RF are easy to visualize, so you can understand the factors that lead to a classification result. This can be very useful if you have to explain how your algorithm works to business domain experts or users.
2- Each tree in a random forest grows its structure on random features, minimizing the bias.
3- Unlike the naive Bayes algorithm, the decision tree-based algorithms work well when attributes have some correlation.
4- RF is one of the most simple, robust, and easily understood algorithms.
5- The RF bagging feature is very useful. It provides a strong fit and typically does not over-fit.
6- RF is highly scalable and gives reasonable performance.
RF has some disadvantages:
1- Decision trees can be slow with large training times when they are complex.
2- Missing values can pose a problem for decision tree-based algorithms.
3- Attribute ordering is important, such that those with the most information gain appear first.
The RF algorithm is a good complement to the naive Bayes algorithm. One of the main reasons RF has become popular is that it is very easy to get good results.
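The ideas above can be sketched in a few lines, assuming scikit-learn is available; the built-in iris data set and the parameter choices are illustrative, not prescriptive.

```python
# Sketch assuming scikit-learn is installed; dataset and settings are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample with random feature subsets
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))  # accuracy on the held-out test split
print(clf.feature_importances_)   # which attributes drive the classification
```

The `feature_importances_` attribute is one way to explain the result to domain experts, as advantage 1 suggests.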
III- K-Nearest Neighbors Algorithm (KNN)
The k-nearest neighbors algorithm is a simple algorithm that yields good results. KNN is useful for classification and regression.
KNN algorithms classify each new instance based on the classes of its nearest neighbor(s).
Some advantages of KNN include:
1- KNN makes no assumptions on the underlying data.
2- KNN is a simple classifier that works well on basic recognition problems.
3- KNN is easy to visualize and understand how classification is determined.
4- Unlike naive Bayes, KNN has no problem with correlated attributes and works well with noisy data if the dataset is not large.
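The neighbor-voting step itself fits in a few lines of plain Python. The 2-D points and labels below are invented for illustration:

```python
import math
from collections import Counter

# Toy 2-D training points with class labels (invented for illustration)
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]

def knn_predict(point, k=3):
    # Sort the training points by Euclidean distance to the query point
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))
    # Majority vote among the k nearest neighbors decides the class
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.5)))  # lands near the "A" cluster
print(knn_predict((5.5, 5.2)))  # lands near the "B" cluster
```

Note that all four training points are scanned for every prediction, which is exactly why the algorithm struggles on large data sets (see the disadvantages below).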
KNN also has some disadvantages:
1- Choosing K can be problematic, and you may need to spend time tuning K values.
2- KNN is subject to the curse of dimensionality due to reliance on distance-based measures. To help combat this, you can try to reduce dimensions or perform feature selection prior to modeling.
3- KNN is instance-based and processes the entire dataset for classification, which is resource intensive. KNN is not a great algorithm choice for large datasets.
4- Transforming categorical values to numeric values does not always yield good results.
5- As a lazy classifier, KNN is not a good algorithm choice for real-time classification.
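Disadvantage 1 is commonly handled by cross-validating over candidate K values. A hedged sketch, assuming scikit-learn is available and using its built-in iris data as a stand-in for your own:

```python
# Sketch assuming scikit-learn is installed; the candidate K values are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K with 5-fold cross-validation
scores = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # the K with the highest mean CV accuracy
```

Odd values of K are a common choice for binary problems because they avoid tied votes.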
In conclusion, KNN is a simple, useful classifier. Consider it for an initial classification attempt, particularly if the disadvantages listed above are not an issue for your problem.
Originated from: Blog.selcote.com