100 Days of Data Science and AI Meditation (Day 7: Supervised Machine Learning)
This is part of my data science and AI marathon, and I will write about what I have studied and implemented in academia and work every single day.
Whoso neglects learning in his youth, loses the past and is dead for the future. (Euripides)
Learning is the ability of an agent to improve its behaviour based on experience. The ability to learn is essential to any intelligent agent. As Euripides's words suggest, learning involves an agent remembering its past in such a way that it is useful for the future.
Machine learning is a subfield of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. Imagine it as teaching a computer to recognize patterns in data and make predictions or decisions based on those patterns. It’s like training a pet; initially, you show the pet how to do a trick, and over time, it learns to perform it better and more accurately.
In machine learning, we feed the computer a large amount of data, like images of cats and dogs, or historical stock market prices, along with their corresponding labels or outcomes. The computer then uses this data to identify patterns and relationships, and it creates a model or a set of rules to make predictions about new, unseen data.
For example, after training a machine learning model with images of cats and dogs, it can distinguish between cats and dogs in new pictures it has never seen before. Similarly, in finance, the model can analyse patterns in historical stock market data to predict future trends.
Machine learning has many practical applications, such as personalized recommendations on streaming platforms, fraud detection in banking, medical diagnosis, and even self-driving cars. It’s all about enabling computers to learn and adapt on their own, just like humans do, making our lives easier and more efficient.
Supervised learning is a type of machine learning where the algorithm is trained on labelled data, meaning the input data is paired with corresponding output labels or target values. The goal of supervised learning is to learn a mapping function from the input to the output so that the algorithm can make accurate predictions on new, unseen data.
In supervised learning, the process involves two main components:
- Training Data: This is the labelled data used to train the algorithm. It consists of input features (also known as independent variables) and their corresponding output labels (also known as dependent variables). For example, in a cat vs. dog image classification task, the training data would include images of cats and dogs along with labels indicating whether each image is of a cat or a dog.
- Model Training: During the training phase, the supervised learning algorithm uses the training data to learn the relationship between the input features and the output labels. It adjusts its internal parameters to minimize the difference between its predicted outputs and the actual labels in the training data.
Once the model is trained, it can be used to make predictions on new, unseen data. The algorithm takes the input features of the new data and uses the learned mapping function to predict the corresponding output labels.
Below is an example of supervised learning using Python and Scikit-learn to build a simple linear regression model:
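A minimal sketch, assuming the simulated data is a noisy straight line generated with NumPy (the exact data-generating process is an illustrative choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Simulated data: y depends roughly linearly on X, plus some noise
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X.ravel() + rng.randn(100)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model with mean squared error and R-squared
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))

# Visualize the data and the fitted regression line
plt.scatter(X, y, alpha=0.5, label="data")
x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(x_line, model.predict(x_line), color="red", label="fitted line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```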
In this example, we have some simulated data X and y, where X is the independent variable and y is the dependent variable. We split the data into training and testing sets using train_test_split from Scikit-learn. Then, we create a linear regression model using LinearRegression, train it on the training data, and make predictions on the test data. We calculate the mean squared error and R-squared to evaluate the model’s performance. Finally, we visualize the data and the linear regression line using matplotlib.
Evaluating predictions in machine learning
Evaluating predictions in machine learning is a critical step in assessing the performance and accuracy of the trained model. The goal is to measure how well the model generalizes to new, unseen data and how reliable its predictions are. Several evaluation metrics and techniques are used to assess the model’s performance, and these can vary depending on the type of machine learning problem (classification, regression, etc.) and the specific requirements of the task.
Here are some common methods for evaluating predictions in machine learning:
- Train-Test Split: The simplest evaluation method involves dividing the dataset into two parts: a training set and a test set. The model is trained on the training set and then evaluated on the test set, which contains unseen data. The accuracy or other relevant metrics are calculated based on the model’s performance on the test set.
- Cross-Validation: Cross-validation is a more robust evaluation technique that helps reduce the variance of performance estimates. It involves dividing the dataset into multiple subsets (folds), training the model on some folds, and testing it on the remaining fold. This process is repeated several times, and the average performance is calculated (see the sketch after this list, which demonstrates cross-validation alongside several of the metrics below).
- Confusion Matrix: A confusion matrix is commonly used for classification problems. It provides a tabular representation of the model’s predictions against the actual labels. From the confusion matrix, metrics such as accuracy, precision, recall, and F1 score can be calculated.
- Mean Squared Error (MSE): MSE is a common evaluation metric for regression problems. It measures the average squared difference between the predicted values and the actual target values. A lower MSE indicates better predictive performance.
- Receiver Operating Characteristic (ROC) Curve: The ROC curve is used for binary classification problems to assess the trade-off between the true positive rate and the false positive rate at different classification thresholds. The Area Under the Curve (AUC) is also calculated to quantify the model’s overall performance.
- Precision-Recall Curve: For imbalanced classification tasks, the precision-recall curve is used; it focuses on the precision and recall values at different classification thresholds.
- R-squared (R2): R-squared is a metric used to evaluate how well the regression model fits the data. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
- Mean Absolute Error (MAE): MAE is another metric for regression problems that measures the average absolute difference between the predicted values and the actual values.
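To make a few of these metrics concrete, here is a brief sketch on a synthetic binary classification problem; the dataset and the choice of logistic regression are illustrative assumptions, not tied to any particular task:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Cross-validation: average accuracy over 5 folds of the training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())

# Fit on the full training set and evaluate on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # class-1 probabilities for ROC AUC

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```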
Choosing the appropriate evaluation metric depends on the nature of the problem and the desired outcome. It is essential to consider factors such as data distribution, class imbalance, and the specific goals of the machine learning task. Evaluating predictions helps identify the strengths and weaknesses of the model, and it guides the process of fine-tuning the model to achieve better performance and generalization on new data.
How do supervised learning algorithms work?
Supervised learning algorithms are a class of machine learning algorithms that learn from labelled data to make predictions or decisions. The “supervised” aspect refers to the fact that during the training process, the algorithm is provided with input-output pairs (features and corresponding labels) to learn the relationship between the input and output variables. Once trained, the algorithm can then make predictions on new, unseen data.
Mathematically, supervised learning algorithms aim to find a function that maps the input features (X) to the corresponding output labels (y). Let’s denote this function as f(X) = y. The goal is to learn the parameters (weights and biases) of the function f such that it can accurately predict y for new inputs X.
The general steps in a supervised learning algorithm are as follows:
- Input Data: First, the algorithm is provided with a dataset containing input features X and their corresponding output labels y.
- Model Selection: Choose a specific type of model or algorithm that best fits the problem at hand. Common types include linear regression, decision trees, support vector machines, and neural networks, among others.
- Model Training: The training process involves finding the optimal parameters for the chosen model to minimize prediction errors. This is typically done by defining a loss function that measures the difference between the predicted outputs and the actual labels, and then using optimization techniques (e.g., gradient descent) to minimize the loss, as shown in the sketch after this list.
- Model Evaluation: After training, the model’s performance is evaluated on a separate dataset, known as the test set, to assess its ability to generalize to unseen data. Common evaluation metrics include mean squared error, accuracy, precision, recall, F1-score, and others.
- Prediction: Once the model is trained and evaluated, it can be used to make predictions on new, unseen data by applying the learned function f(X) = y to the input features X.
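To illustrate the training step, here is a toy gradient-descent loop that fits a one-variable linear model f(X) = wX + b by minimizing the mean squared error; the learning rate and iteration count are arbitrary choices for this sketch:

```python
import numpy as np

# Toy dataset: y is roughly 3x + 4 with a little noise
rng = np.random.RandomState(0)
X = rng.rand(200)
y = 3 * X + 4 + 0.1 * rng.randn(200)

# Model f(X) = w * X + b, trained by gradient descent on the MSE loss
w, b = 0.0, 0.0
lr = 0.1  # learning rate

for step in range(2000):
    y_pred = w * X + b
    error = y_pred - y
    # Gradients of MSE = mean(error^2) with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach 3 and 4
```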
Supervised learning algorithms learn a mapping between input features and output labels by iteratively adjusting model parameters based on the provided labelled training data. The objective is to find a model that can generalize well to make accurate predictions on new, unseen data.
Learning Issues in Supervised Learning:
- Overfitting: Overfitting occurs when a model learns the training data too well and captures noise or random variations in the data. As a result, the model performs well on the training data but poorly on new, unseen data. It is essential to prevent overfitting with techniques such as regularization or cross-validation (see the sketch after this list).
- Underfitting: Underfitting happens when a model is too simple to capture the underlying patterns in the data, so it performs poorly on both the training data and new data. To address underfitting, more complex models or feature engineering may be needed.
- Bias-Variance Trade-off: The bias-variance trade-off refers to the balance between model bias (the error introduced by approximating a real problem with a simplified model) and model variance (the model’s sensitivity to variations in the training data). Finding the right balance is crucial to build a model that generalizes well to new data.
- Data Quality: The quality of the training data is essential for the success of a supervised learning model. Noisy or incorrect labels, imbalanced classes, or missing data can affect the model’s performance.
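A quick way to see underfitting and overfitting in practice is to compare training and test error for polynomial models of increasing degree; the sketch below uses a synthetic sine-wave dataset, and the specific degrees are illustrative choices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a sine wave with noise
rng = np.random.RandomState(1)
X = np.sort(rng.rand(40, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.randn(40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Degree 1 underfits, degree 4 fits reasonably, degree 15 overfits
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The overfit model's training error keeps shrinking while its test error grows, which is exactly the gap that regularization and cross-validation are meant to catch.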
By understanding and addressing these learning issues, we can develop robust and accurate supervised learning models that can solve a wide range of real-world problems effectively.
If you enjoy reading stories like these and want to support me as a writer, consider signing up to become a Medium member. It’s $5/month, giving you unlimited access to thousands of stories on Medium, written by thousands of writers. If you sign up using my link https://medium.com/@fhuqtheta, I’ll earn a small commission.