Introduction to Machine Learning, Machine Learning Basics
What is Machine Learning?
Machine learning is a type of artificial intelligence in which algorithms learn from data instead of being explicitly programmed. It is widely used in data science and focuses on developing algorithms and models that make predictions or take actions based on input data. ML algorithms are often used to identify patterns in data, predict future events, or classify data into different categories.
What are the main principles of machine learning?
- Training data: ML algorithms require a large amount of data to learn and function correctly.
- Features: Input data must be organized into features (the measurable input variables, such as the columns of a table) that the algorithm can learn from.
- Loss function: This measures how far the model’s predictions are from the correct outputs. Training aims to make the loss as small as possible.
- Optimization: Adjusting the model to reduce the loss and improve accuracy. This includes tuning hyperparameters (settings chosen before training that affect how the model works), which can be done manually or with hyperparameter tuning methods like GridSearchCV in scikit-learn.
- Regularization: This is a technique used to prevent overfitting (which occurs when a model is too complex and performs badly on new data). Regularization methods add a penalty to the loss function based on the magnitude of the model’s parameters, constraining the model to prevent too much complexity.
- Evaluation: Performance must be carefully evaluated on data that is kept separate from the training data in order to assess the effectiveness of the model.
- Cross-validation: Splitting the data into multiple subsets (folds), training the model on all but one of them and using the remaining subset for evaluation. This is repeated so that each subset is used for evaluation once, and model performance is averaged across all rounds. Helps detect overfitting and gives a more reliable estimate of performance.
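Three of the principles above (the loss function, regularization and cross-validation) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: mean squared error and an L2 penalty are one common choice among many, the function names are made up for this example, and the folds are taken in order without shuffling, which real projects usually would not do.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between correct outputs and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_regularized_loss(y_true, y_pred, weights, penalty_strength):
    """Mean squared error plus a penalty that grows with the size of the weights."""
    penalty = penalty_strength * sum(w ** 2 for w in weights)
    return mean_squared_error(y_true, y_pred) + penalty


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


# Made-up numbers purely for illustration.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
plain_loss = mean_squared_error(y_true, y_pred)
penalized_loss = l2_regularized_loss(y_true, y_pred, weights=[3.0, 4.0], penalty_strength=0.1)
folds = list(k_fold_splits(n_samples=6, k=3))
```

In each fold the model would be trained on the `train` indices and scored on the `validation` indices, and the scores would then be averaged. Libraries such as scikit-learn provide ready-made versions of these pieces.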
What are the three main types of machine learning task?
The three main types of machine learning tasks are classification, regression and clustering.
Classification is all about training a model to predict the category/class that an input belongs to.
E.g. is this a dog?
Regression trains the model to predict a continuous value, for example a price or a probability. It is often used when the output is a quantity rather than a category/class. E.g. predicting the likelihood that a patient has a type of cancer based on their medical history.
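As a rough illustration of regression, the sketch below fits a straight line to a handful of points using the standard least-squares formulas. The data (floor area against price) is invented for the example, and `fit_line` is a hypothetical helper rather than a library function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: floor area in square metres and price in thousands.
areas = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

slope, intercept = fit_line(areas, prices)
# The output is a continuous quantity, not a category.
predicted_price = slope * 70.0 + intercept
```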
Clustering is a type of unsupervised learning that involves dividing datasets into groups/clusters based on similarity. This is helpful for discovering underlying patterns, structures and hidden relationships in your dataset. For example, market research uses clustering to segment markets into different groups.
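One widely used clustering method is k-means, which the text above does not name; it is used here only as an example. The sketch below is a minimal one-dimensional version that assumes the caller supplies the starting centroids, and the customer spending figures are invented.

```python
def k_means(points, centroids, iterations=10):
    """Group 1-D points around centroids, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Invented data: monthly spend of six customers, with no labels attached.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, clusters = k_means(spend, centroids=[1.0, 2.0])
```

No labels are given to the function; the two groups (low and high spenders) emerge from the similarity of the values alone.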
What are the curse of dimensionality, underfitting and overfitting?
The curse of dimensionality refers to the challenges that arise when datasets contain a large number of variables/dimensions. More data is needed to achieve reliable results, and it becomes increasingly difficult to visualize and analyse data with a large number of variables.
A large amount of redundancy can occur in datasets with huge numbers of features/dimensions. This makes model training more difficult. Models trained on high-dimensional data need a larger amount of data to attain good performance, slowing down development and evaluation.
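One way to get a feel for the curse of dimensionality is to look at distances between random points. The sketch below is an illustration added here, not something taken from the text above: it assumes points drawn uniformly at random, and compares the nearest and farthest distance from a query point. As the number of dimensions grows, the two become more alike, so "near" and "far" carry less information.

```python
import math
import random


def nearest_to_farthest_ratio(n_dimensions, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dimensions)]
        squared = sum((q - p) ** 2 for q, p in zip(query, point))
        distances.append(math.sqrt(squared))
    return min(distances) / max(distances)


# A ratio near 0 means some points are much closer than others;
# a ratio near 1 means all points are roughly equally far away.
ratio_2d = nearest_to_farthest_ratio(2)
ratio_500d = nearest_to_farthest_ratio(500)
```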
Overfitting is when a model is far too complex and performs poorly on new data, even though it performs well on the training data. It occurs when a model is trained on limited data and learns random fluctuations (noise) in the data instead of the underlying patterns. Overfitting is caused by factors such as an overly complex model, insufficient training data or a large number of features.
Underfitting is associated with high bias. It happens when the model is too simple and cannot learn the underlying patterns in the data, resulting in poor performance on both the training data and new data. It is caused by factors such as a lack of features, a lack of model complexity or insufficient training.
What are the four main types of machine learning?
The four main types are supervised, unsupervised, semi-supervised and reinforcement learning.
- Supervised learning involves training models on labelled data where the correct output is provided in the dataset for every sample of data. The model gets a helping hand in identifying patterns and learns how to map the input data to correct outputs.
- Unsupervised learning uses unlabelled data. The correct output is not provided in the dataset and the model has to find all the patterns by itself.
- Semi-supervised learning is a combination of supervised and unsupervised. The model is trained with a mix of data, labelled and unlabelled.
- Reinforcement learning trains a model to take actions in an environment to maximize a specific reward. It involves trial and error: the model receives positive or negative feedback after each action/iteration and learns from this to modify its approach. E.g. training an AI to play a video game.
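The trial-and-error loop of reinforcement learning can be sketched with a very small example. The code below uses an epsilon-greedy strategy on a made-up environment with three actions and fixed rewards. Both the strategy and the environment are assumptions chosen for illustration, and real environments are far richer.

```python
import random


def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing among actions that pay fixed rewards."""
    rng = random.Random(seed)
    n_actions = len(rewards)
    estimates = [0.0] * n_actions  # running average reward seen for each action
    counts = [0] * n_actions       # how often each action has been taken
    for step in range(steps):
        if step < n_actions:
            action = step                      # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: pick a random action
        else:
            # Exploit: pick the action that looks best so far.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = rewards[action]               # feedback from the environment
        counts[action] += 1
        # Incremental average: nudge the estimate towards the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    best_action = max(range(n_actions), key=lambda a: estimates[a])
    return best_action, counts


# Hypothetical environment: action 2 pays the most.
best_action, counts = run_bandit([0.1, 0.5, 0.9])
```

The agent is never told which action is best. It only sees the reward after each action and gradually favours the action that has paid off most.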
General workflow for machine learning model training
- Define the problem and determine the goals of the model.
- Collect and prepare the data. This may involve cleaning, transforming, and pre-processing the data to make it suitable for training a model. E.g. removing null columns/rows.
- Select and extract the features from the data. This involves selecting the relevant variables and transforming them into a format that can be used as input to the model, i.e. deciding which columns of the data the model should learn from.
- Split the data into training and test sets. This is done to evaluate the model’s performance on unseen data.
- Train the model on the training data. This involves using an algorithm to learn the underlying patterns in the data. E.g. the k-NN algorithm.
- Evaluate the model’s performance on the test data. This allows you to assess the model’s ability to make accurate predictions on unseen data.
- Fine-tune the model. This may involve adjusting the model’s hyperparameters, adding regularization, or using other techniques to improve the model’s performance.
- Use the model to make predictions on new data.
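The split, train, evaluate and predict steps above can be sketched end to end with a hand-written k-NN classifier. Everything here is illustrative: the dataset is invented and already clean, and `split_train_test`, `knn_predict` and `accuracy` are hypothetical helpers written for this example, not library functions.

```python
import random
from collections import Counter


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and split them into training and test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, features, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    by_distance = sorted(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], features)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == label for x, label in test)
    return correct / len(test)


# Invented dataset of (features, label) pairs forming two well-separated groups.
data = [((i * 0.1, i * 0.2), "small") for i in range(10)]
data += [((10 + i * 0.1, 10 + i * 0.2), "large") for i in range(10)]

train_set, test_set = split_train_test(data)      # split the data
test_accuracy = accuracy(train_set, test_set)     # evaluate on unseen data
prediction = knn_predict(train_set, (9.5, 9.8))   # predict for a new input
```

k-NN has no separate training step beyond storing the training samples, which keeps the sketch short. With real data the groups are rarely this cleanly separated, so the test accuracy would normally be below perfect.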