Data Science Beginner’s Algorithms

Harsita Mav
Butterfly Effect | MetaMorphoSys
8 min read · Dec 8, 2022

Everyone aspires to be a data scientist before realizing what it takes to succeed as one. As the name implies, data science involves working with data. To get started, one needs to be familiar with a wide range of ideas, including mathematics, statistics, algorithms, and the different approaches available for different problems. Most importantly, one needs to be willing to keep studying in order to find the most effective technique.

Algorithms play a crucial role in Data Science and Machine Learning. In a world where practically all manual jobs are becoming automated, the very idea of "manual" is shifting. Machine Learning algorithms can help computers play chess, assist in surgery, identify sentiment in text, and become smarter and more personalized. We live in an era of ongoing technological growth, and we can forecast what will happen in the future by looking at how computing has grown over the years. One of the most notable aspects of this revolution is the democratization of computing tools and processes. Over the last five years, data scientists have built sophisticated data-crunching systems by applying innovative techniques, and the outcomes have been mind-blowing.

“To get just an inkling of the fire we’re playing with, consider how content-selection algorithms function on social media. They aren’t particularly intelligent, but they are in a position to affect the entire world because they directly influence billions of people.”

— Stuart Russell

Prof. Stuart Russell clearly highlights how important algorithms are, using social media as his example. Social media has become an integral part of daily life, providing a platform for people to stay connected from one end of the planet to the other. By understanding what users search for, these platforms can identify each user's interests and target ads and content accordingly. Algorithms are used to understand this behaviour: they monitor and analyze how users react while browsing, and then show them content related to their previous activity. This clearly shows the importance of algorithms in our daily lives.

Here are 11 basic algorithms that every Data Science enthusiast should start with:

I. Linear Regression

Linear regression models the relationship between a dependent variable and an independent variable. Consider a firm selling a commercial product that has to decide how much to invest in advertising. By looking at the sales of a particular product, the firm can decide how much more investment in advertising is required: the higher the sales, the less advertising is needed, and vice versa. In this process, a relationship is established between the independent and dependent variables by fitting them to a regression line, represented by the linear equation Y = aX + b, where:

Y — Dependent variable
a — Slope
X — Independent variable
b — Intercept
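
As a quick illustration, here is a minimal scikit-learn sketch that fits such a line; the advertising-spend and sales numbers are invented purely to show the API.

```python
# A minimal sketch of fitting Y = aX + b with scikit-learn.
# The advertising-vs-sales numbers below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [20], [30], [40], [50]])   # X: independent variable
sales = np.array([25, 45, 65, 85, 105])               # Y: dependent variable

model = LinearRegression()
model.fit(ad_spend, sales)

print("slope a:", model.coef_[0])
print("intercept b:", model.intercept_)
print("prediction for X = 60:", model.predict([[60]])[0])
```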

II. Logistic Regression

Logistic regression is used when several independent features are needed to predict a discrete value (e.g. 0/1). This is best understood with a loan-approval example. Anyone applying for a loan must meet a few conditions for the loan to be approved. Gender, number of dependents, source of income, amount of income, and a variety of other factors all play a role in evaluating whether a person is eligible for a particular loan plan. Logistic regression predicts the likelihood of an outcome by fitting the data to a logit function.
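
A hedged sketch of the idea, using scikit-learn's LogisticRegression on a tiny made-up loan dataset; the features and labels are purely illustrative.

```python
# A sketch of logistic regression for a loan-approval style problem.
# The data is synthetic; a real dataset would have many more columns.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [income, number of dependents]
X = np.array([[2, 3], [8, 1], [5, 2], [12, 0], [3, 4], [9, 1]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = loan approved, 0 = rejected

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the probability produced by the logit (sigmoid) function
print(clf.predict_proba([[7, 2]]))  # [[P(rejected), P(approved)]]
print(clf.predict([[7, 2]]))        # discrete 0/1 prediction
```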

III. Decision Tree

The Decision Tree method is a supervised learning technique used in machine learning, mainly for classification problems. It works with both categorical and continuous dependent variables. Using this approach, we split the sample into two or more homogeneous sets based on the most significant attributes/independent variables. The result is a graphical representation of all possible solutions to a problem/decision under specified conditions, with internal nodes representing dataset attributes, branches representing decision rules, and each leaf node representing an outcome.
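
A small illustrative sketch with scikit-learn's DecisionTreeClassifier on made-up data; printing the learned rules makes the internal nodes, branches and leaves visible.

```python
# A toy decision tree on synthetic data.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [age, income]; labels: 0 = "won't buy", 1 = "will buy" (toy example)
X = [[22, 20], [35, 60], [47, 80], [29, 35], [51, 90], [23, 25]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text prints the conditions (internal nodes), branches and leaf outcomes
print(export_text(tree, feature_names=["age", "income"]))
```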

IV. Random Forest

Random Forest is based on the principle of majority voting. The algorithm combines many decision trees to produce a result and consists of four main steps. It starts with a random selection of the data for training and testing. Next, the individual decision trees are built. The trees' predictions are then combined by voting (or averaged, for regression). Finally, the prediction that receives the most votes is chosen as the final forecast. The process of combining these trees is called an ensemble; ensemble methods rely on the two techniques of bagging and boosting.

Three main hyperparameters are used to enhance the predictive power (a code sketch follows this list):

  • n_estimators: Number of trees to be built by the algorithm.
  • max_features: Maximum number of features considered when looking for the best split at a node.
  • min_samples_leaf: The minimum number of samples required at a leaf node.
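
The sketch below shows these hyperparameters in scikit-learn's RandomForestClassifier; the iris dataset and the parameter values are chosen only for illustration, not as recommendations.

```python
# A hedged random forest sketch using the three hyperparameters listed above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees built by the algorithm
    max_features="sqrt",   # features considered when looking for the best split
    min_samples_leaf=2,    # minimum samples required at a leaf node
    random_state=42,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```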

V. XGBoost

XGBoost is a decision-tree-based ensemble learning framework that uses gradient descent to optimise its underlying objective function, offering a lot of flexibility while delivering strong results and making optimal use of processing power. It is a scalable and highly accurate implementation of gradient boosting that pushes the computational limits of boosted tree algorithms. XGBoost has a number of features that make it distinctive and convenient to use (a code sketch follows the list):

  • Highly complex XGBoost models are penalised using several regularisation techniques, including Lasso (L1) and Ridge (L2).
  • It can manage sparse data and handles missing values natively. XGBoost's block structure in the system design allows it to utilise several CPU cores simultaneously.
  • Cache awareness and out-of-core computing, which optimise disk usage and processing speed, are features of XGBoost that were designed with the best utilisation of hardware in mind.
  • Its built-in support for cross-validating models helps decrease the likelihood of overfitting, which aids in maintaining the bias-variance trade-off.
  • After splitting the data up to the specified max depth, XGBoost prunes the tree backwards, deleting splits that offer no further gain.
  • This backward tree pruning prevents XGBoost from acting greedily and from producing overfit models.
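
A minimal sketch using the xgboost Python package, assuming it is installed (pip install xgboost); the parameter values are illustrative, not tuned.

```python
# A hedged XGBoost classification sketch on a built-in scikit-learn dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    max_depth=4,          # trees are grown to this depth, then pruned backwards
    learning_rate=0.1,
    reg_alpha=0.1,        # L1 (Lasso) regularisation
    reg_lambda=1.0,       # L2 (Ridge) regularisation
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```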

VI. Naive Bayes

The Naive Bayes Classifier, which is based on Bayes' Theorem of probability, is one of the most widely used algorithms for building machine learning models, especially for disease prediction and document classification. The Bayes theorem is used to calculate probabilities in this classification method. It works well for text data, scales to huge datasets, and is simple and straightforward to use. It is mostly utilised in applications such as sentiment analysis, spam classification, document classification, and news categorization. Despite assuming conditional independence between features, the Naive Bayes Classifier has performed well in a variety of application domains.

In this method, Bayes' theorem takes the form:

P(A|B) = P(B|A) × P(A) / P(B)

where:
  • P(A|B) is the posterior probability of A given B,
  • P(A) is the prior probability of A,
  • P(B|A) is the likelihood, i.e. the probability of B given A, and
  • P(B) is the prior probability of B.
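
A hedged sketch of Naive Bayes for spam classification with scikit-learn's MultinomialNB; the four-message corpus is made up solely to demonstrate the workflow.

```python
# Naive Bayes spam classification on a tiny synthetic corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win a free prize now",       # spam
    "lowest price guaranteed",    # spam
    "meeting agenda for monday",  # not spam
    "project report attached",    # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

nb = MultinomialNB()
nb.fit(X, labels)

new_mail = vectorizer.transform(["free prize meeting"])
print(nb.predict_proba(new_mail))  # posterior P(class | words) via Bayes' theorem
```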

VII. KNN

KNN is a supervised learning method used in machine learning. It does not learn from the training set immediately; instead, it stores the dataset and only acts on it when classifying new data. Because of this trait, it is known as the 'lazy learner' algorithm. Based on similarity, it assigns new points to the appropriate groups. To calculate similarity, distance-based measures are used, such as the Euclidean, Manhattan, Minkowski, and Hamming distances. In this approach, the value of k is crucial: it determines how many neighbours are consulted when placing a point into a group. KNN can be useful when dealing with noisy and large datasets. Its primary drawback is the significant processing cost of computing the distance between a new point and every training sample.
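
A minimal KNN sketch with scikit-learn; the Minkowski metric with p = 2 reduces to the Euclidean distance mentioned above, and the choice of k = 5 is arbitrary.

```python
# A hedged k-nearest-neighbours sketch on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# metric="minkowski" with p=2 is the ordinary Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)  # "lazy": fitting mostly just stores the training data
print("test accuracy:", knn.score(X_test, y_test))
```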

VIII. K-Means Clustering

The K-means algorithm is an iterative technique that partitions a data set into k clusters. It begins by selecting k data points at random as initial cluster centres and assigning every point to its nearest centre. The centroid of each cluster is then recalculated, and the process is repeated until the centroids stop changing. Compared to hierarchical clustering, this algorithm runs faster for small values of k. Search engines such as Google and Yahoo use methods like this to group web pages by similarity and estimate the relevance of search results.

[Figure: clusters with their centroids using k-means with k = 3]
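
A short sketch of k-means with k = 3 on synthetic blob data (scikit-learn's make_blobs), showing the final centroids and cluster assignments.

```python
# k-means with k = 3 on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster assignments:", labels[:10])
```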

IX. Hierarchical Clustering

Hierarchical clustering is an unsupervised method used to group unlabeled datasets into clusters; it is also referred to as HCA, or hierarchical cluster analysis. The main distinction from k-means is that we are not required to know the number of clusters in advance. This approach produces clusters in the form of a tree known as a "dendrogram". Agglomerative and divisive are the two strategies used in hierarchical clustering:

  1. The "bottom-up" or agglomerative technique starts with every data point as its own cluster and repeatedly merges the closest clusters until only a single cluster remains.
  2. The "top-down" or divisive strategy works the opposite way: it starts with one large cluster containing all the data and repeatedly splits it into smaller clusters, down to individual data points (see the sketch below).
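
A hedged sketch of the bottom-up (agglomerative) variant using SciPy; the blob data is synthetic, "ward" is just one of several possible linkage criteria, and the dendrogram call is left as a comment since it needs a plotting backend.

```python
# Agglomerative (bottom-up) hierarchical clustering with SciPy.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="ward")                    # merge history of the bottom-up process
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print("cluster labels:", labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram
# when a matplotlib backend is available.
```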

X. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another clustering algorithm. It is based on the idea that a cluster should contain a minimum number of data points within a given radius. Compared with k-means and hierarchical clustering, DBSCAN often proves more useful because those algorithms only find roughly spherical clusters, whereas real-world data can be highly irregular and noisy, so a cluster can take any arbitrary shape.

The algorithm distinguishes three kinds of data points (see the sketch after this list):

  1. Core point: A point is a core point if at least MinPts points (including itself) fall within its eps radius.
  2. Border point: A border point has fewer than MinPts points within its eps radius but lies within the eps radius of a core point.
  3. Noise or outlier: A point that is neither a core point nor a border point is treated as noise or an outlier.
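
A minimal DBSCAN sketch with scikit-learn; eps and min_samples correspond to the radius and MinPts above, the two-moons data is synthetic, and the label -1 marks noise points.

```python
# DBSCAN on non-spherical (two-moons) synthetic data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)  # eps = radius, min_samples = MinPts
labels = db.fit_predict(X)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", list(labels).count(-1))
```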

XI. Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning technique that is mostly used for classification problems. SVM separates the classes in the data with a hyperplane. To construct this hyperplane, it selects the extreme points/vectors that lie closest to the boundary; these extreme instances are called support vectors, and they form the basis of the method. The hyperplane is designed through margin maximisation, i.e. maximising the distance between the hyperplane and the classes on either side. SVM comes in two variants: linear and non-linear (kernel-based). Its classification strength makes it useful in practice, offering high accuracy and good generalisation to upcoming data, and it is relatively resistant to overfitting.
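
A hedged sketch comparing a linear and a non-linear (RBF-kernel) SVM using scikit-learn's SVC on a built-in dataset; the C and gamma values are defaults, not tuned.

```python
# Linear vs. non-linear SVM classification.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))
print("support vectors per class:", linear_svm.n_support_)
```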

“Some of the biggest challenges faced by computers and human minds alike: how to manage finite space, finite time, limited attention, unknown unknowns, incomplete information, and an unforeseeable future; how to do so with grace and confidence; and how to do so in a community with others who are all simultaneously trying to do the same.”

— Brian Christian and Tom Griffiths, Algorithms to Live By

The statement above demonstrates the significance of algorithms and the wonders they can produce. This article should be useful for anyone who wants to begin building their foundation in the field of Data Science.
