Mastering the Fundamentals of Machine Learning Algorithms 📝 📚 — Part 2

⭐️ Check out Part-1 here ⭐️

Pavan Saish
6 min read · Jun 30, 2023

Random Forests

Random forests are ensemble learning methods that combine multiple decision trees to make predictions. They reduce overfitting and improve accuracy compared to individual decision trees.

(Image credit: IBM)

Math Formulas:

Random forests build upon decision trees; beyond bootstrap sampling and the aggregation of tree predictions (majority vote for classification, averaging for regression), they introduce no additional formulas.

Assumptions:

Same as decision trees.
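Before moving on to bagging, here is a minimal sketch (not from the original post) of fitting a random forest with scikit-learn. The dataset is synthetic and the hyperparameter values are illustrative, not recommendations.

```python
# Minimal random forest sketch on synthetic data (illustrative values only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees; max_features controls the random
# feature subset considered at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```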

Bagging (Bootstrap Aggregating):

Bagging is a technique used in ensemble learning where multiple models are trained on different subsets of the training data and their predictions are combined. The key idea is to introduce randomness and diversity among the models, which reduces overfitting, improves generalization, and promotes stability and robustness. The steps involved in bagging are listed below, followed by a short code sketch:

1. Bootstrap Sampling: Bagging starts by creating multiple bootstrap samples from the original training data. A bootstrap sample is obtained by randomly selecting data points from the training set with replacement (some variants also sample a random subset of features).

2. Model Training: For each bootstrap sample, a separate model (e.g., decision tree) is trained on the selected subset of data. Each model is trained independently, with no knowledge of the other models or the complete training set.

3. Aggregation: Once all the models are trained, their predictions are combined to make the final prediction. In classification tasks, this can be done through majority voting, where the class with the highest number of votes from the models is selected as the final prediction. In regression tasks, the predictions can be averaged across the models.
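As a hedged illustration of these three steps, here is a short scikit-learn sketch using BaggingClassifier on synthetic data; the estimator and the number of models are arbitrary choices for the example.

```python
# Bagging sketch: bootstrap sampling, independent training, vote aggregation.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True draws each training subset with replacement (step 1);
# each of the 50 trees is trained independently on its sample (step 2);
# predict() aggregates the trees by majority vote (step 3).
# Note: older scikit-learn versions name this parameter base_estimator.
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict(X[:5]))
```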

Boosting:

Boosting is another ensemble learning technique that combines multiple weak learners (models with modest predictive power) into a strong learner. The key idea behind boosting is to build models sequentially, where each subsequent model focuses on correcting the mistakes made by the previous ones. Common boosting algorithms include AdaBoost, Gradient Boosting, XGBoost, and CatBoost. The steps involved in boosting are listed below, followed by a short code sketch:

1. Weight Initialization: Each training example in the dataset is assigned an initial weight. Initially, all the weights are set to be equal, indicating that each example is equally important.

2. Model Training and Weight Update: A weak learner (e.g., decision tree) is trained on the training set, giving higher weight to the misclassified examples from the previous models. This emphasizes the importance of the misclassified examples and helps the subsequent models to focus on those cases.

3. Weight Adjustment: After each model is trained, the weights of the training examples are adjusted based on the model’s performance. Misclassified examples are assigned higher weights, while correctly classified examples receive lower weights. This adaptive weighting gives more emphasis to the challenging examples, making them more influential in the subsequent models.

4. Model Combination: The final prediction is obtained by combining the predictions of all the models, usually through weighted voting, where the models with higher performance are given more weight in the ensemble.
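To make these four steps concrete, here is a simplified AdaBoost-style sketch (illustrative only, not the exact formulation used by any particular library) with decision stumps as weak learners and binary labels in {-1, +1}; the data is synthetic and the number of rounds is arbitrary.

```python
# Simplified AdaBoost-style sketch mirroring the four boosting steps above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y == 0, -1, 1)                 # labels in {-1, +1}

n_rounds = 20
w = np.full(len(y), 1.0 / len(y))           # 1. equal initial weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)        # 2. train weak learner on weighted data
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    w *= np.exp(-alpha * y * pred)          # 3. up-weight misclassified examples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# 4. weighted combination of the weak learners
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
final_pred = np.sign(scores)
print("Training accuracy:", np.mean(final_pred == y))
```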

Interview-based Q&A

Q1. What is the main advantage of using random forests over individual decision trees?

Ans: Random forests reduce overfitting and improve prediction accuracy by combining multiple decision trees and introducing randomness through feature and sample selection.

Q2. How does random feature selection help in random forests?

Ans: Random feature selection ensures that each tree in the random forest uses only a subset of features, which promotes diversity among the trees and reduces correlation between them.

Q3. Can random forests handle missing values and categorical variables?

Ans: Yes, random forests can handle missing values by imputing them and can handle categorical variables by using one-hot encoding or similar techniques.

Q4. How do random forests handle overfitting?

Ans: Random forests mitigate overfitting by averaging predictions from multiple trees and by using techniques like feature bagging and random feature selection.

Q5. How can you determine feature importance in a random forest?

Ans: Feature importance in a random forest can be determined by measuring the average decrease in impurity (e.g., Gini impurity) across all trees when a particular feature is used for splitting.
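A quick sketch of this in practice (synthetic data, illustrative settings): scikit-learn exposes the impurity-based importances of a fitted forest through feature_importances_.

```python
# Impurity-based feature importances from a fitted random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ averages each feature's impurity decrease over all trees.
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature {idx}: {rf.feature_importances_[idx]:.3f}")
```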

Support Vector Machines (SVM)

SVM is a powerful supervised learning algorithm used for classification and regression tasks. It finds the optimal hyperplane that maximally separates different classes or predicts values based on support vectors.

Math Formulas:

Hypothesis function: h(x) = sign(θᵀx + θ₀)

Cost function: J(θ) = C * Σ max(0, 1 - yᵢ(θᵀxᵢ + θ₀)) + 0.5 * Σ(θⱼ²)

SVM uses L2 regularization: the 0.5 * Σ(θⱼ²) term penalizes the norm of θ, and minimizing it is what maximizes the margin. More generally, the choice between L1 and L2 regularization depends on the requirements of the problem and the desired trade-offs between sparsity, interpretability, and computational complexity.
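To make the cost function concrete, here is a small NumPy sketch (not from the original post) that evaluates the soft-margin objective above; θ, θ₀, C, and the data are arbitrary values chosen only for illustration.

```python
# Evaluate J(θ) = C * Σ max(0, 1 - yᵢ(θᵀxᵢ + θ₀)) + 0.5 * Σ θⱼ² on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # toy features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels in {-1, +1}

theta = np.array([0.5, -0.3])                 # arbitrary weights for the sketch
theta0 = 0.1
C = 1.0

margins = y * (X @ theta + theta0)
hinge = np.maximum(0.0, 1.0 - margins).sum()  # hinge-loss term
reg = 0.5 * np.sum(theta ** 2)                # L2 regularization term
J = C * hinge + reg
print("objective:", J)
```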

Assumptions:

Linear separability: SVM assumes the classes are (approximately) linearly separable; with appropriate kernel functions, it can also handle data that is not linearly separable in the original feature space.

Interview-based Q&A

Q1. What is the role of the kernel function in SVM?

Ans: The kernel function in SVM is used to transform the input features into a higher-dimensional space, making it possible to separate non-linearly separable data points.
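As a hedged illustration of this answer, the sketch below compares a linear-kernel and an RBF-kernel SVM on scikit-learn's synthetic two-moons data, which is not linearly separable in the original feature space; the C and gamma values are defaults, not tuned choices.

```python
# Linear vs. RBF kernel on non-linearly separable synthetic data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
```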

Q2. What is the significance of support vectors in SVM?

Ans: Support vectors are the data points that lie closest to the decision boundary and have the most influence on determining the location of the hyperplane. They are used to define the decision boundary.

Q3. What is the purpose of the regularization parameter (C) in SVM?

Ans: The regularization parameter C controls the trade-off between maximizing the margin and minimizing the misclassification of training samples. A higher value of C allows fewer misclassifications but may result in overfitting.

Q4. Can SVM handle multi-class classification problems?

Ans: Yes, SVM can handle multi-class problems using techniques like One-vs-One or One-vs-All, where multiple binary SVM classifiers are trained to distinguish each class from the rest.

Q5. How does SVM handle outliers?

Ans: SVM is less sensitive to outliers because its objective is to maximize the margin, which focuses on the samples near the decision boundary. Outliers have little influence on the placement of the decision boundary.

PCA (Principal Component Analysis):

PCA is an unsupervised learning algorithm used for dimensionality reduction by transforming the original features into a new set of uncorrelated variables called principal components. It captures the maximum variance in the data.

Math Formulas:

Covariance matrix (for mean-centered X): Σ = (1/m) * XᵀX

Eigenvalue decomposition: Σ = WΛWᵀ

Projection of data points onto the new feature space: X’ = XW
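The following NumPy sketch (illustrative, synthetic data, k chosen arbitrarily) mirrors these formulas: center the data, form the covariance matrix, eigendecompose it, and project onto the top components.

```python
# PCA from the formulas above: Σ = (1/m) XᵀX, Σ = WΛWᵀ, X' = XW.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                  # PCA assumes mean-centered data

m = X.shape[0]
cov = (X.T @ X) / m                     # covariance matrix Σ

eigvals, W = np.linalg.eigh(cov)        # eigendecomposition (eigh: symmetric Σ)
order = np.argsort(eigvals)[::-1]       # sort components by explained variance
eigvals, W = eigvals[order], W[:, order]

k = 2
X_proj = X @ W[:, :k]                   # project onto the top-k components
print("explained variance ratio:", eigvals[:k] / eigvals.sum())
```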

Assumptions:

Linearity: PCA assumes that the relationship between the original features and the principal components is linear.

Interview-based Q&A

Q1. What is the objective of PCA?

Ans: The objective of PCA is to reduce the dimensionality of the data while preserving as much information (variance) as possible.

Q2. How do you interpret the eigenvalues in PCA?

Ans: Eigenvalues represent the amount of variance explained by each principal component. Larger eigenvalues indicate that the corresponding principal components capture more information from the data.

Q3. How do you choose the number of principal components to retain in PCA?

Ans: The number of principal components to retain is determined by considering the cumulative explained variance, where a threshold (e.g., 95%) is set to retain enough information.
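A short sketch of this answer in practice (synthetic data, and the 95% threshold is just the example value from above): scikit-learn's PCA exposes the explained variance ratio, from which the cumulative curve is easy to compute.

```python
# Pick the number of components that explains ~95% of the variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=20, random_state=0)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.95) + 1)
print("components needed for 95% variance:", n_components)

# Alternatively, PCA(n_components=0.95) selects this count automatically.
```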

Q4. What is the relationship between PCA and feature selection?

Ans: PCA is a dimensionality reduction technique that creates new features (principal components). In contrast, feature selection aims to select the most informative original features without creating new ones.

Q5. Can PCA be used for data with categorical features?

Ans: No, PCA is typically applied to continuous numeric data. It may not be suitable for data with categorical features unless appropriate preprocessing techniques are applied.

However, it is important to note that each algorithm encompasses a broader range of concepts, techniques, and mathematical underpinnings that go beyond the scope of this blog.

Machine learning is a vast field, and delving deeper into each algorithm would unveil further mathematical foundations, optimization techniques, regularization methods, and advanced concepts that contribute to their effectiveness and performance.

Follow my page for more content on ML & DL. Next up, I will cover NLP fundamentals and concepts. Follow and 👏🏻 :)

LinkedIn: https://www.linkedin.com/in/pavansaish/

Happy reading …! 📚


Pavan Saish

Dedicated to contributing to the ever-evolving landscape of AI | AI Researcher | VIT’24 | SWE @Honeywell.