Machine learning — All you need to know from scratch

Syed Sohaib Uddin
9 min read · May 24, 2020


Machine learning is the most widely talked-about subset of Artificial Intelligence. It underpins nearly every practical implementation of AI, which earns it prime importance on our road to achieving Super AI.

Imagine a 3-year-old kid learning to talk. You have been asked to teach him to recognize cats. You take up the challenge and walk him around cats, show him videos, pictures and cartoons, and every time point to the cats and very loudly pronounce ‘cat’. Two days later, the kid points to a lion on the TV and screams C-A-T. Now, you are more determined than ever to make him distinguish and properly recognize cats.

You fine-tune the content to show him and frequently repeat the exercise of pointing and pronouncing ‘cat’. Finally, a day later he correctly recognizes a cat.

Relating this to the process of machine learning: you are the programmer and the kid is the machine. You first provide appropriate data, in the form of images, to train the machine to recognize cats. Then you train the machine by labeling the cats in those images. Once done, you provide a sample image to the machine and test whether it properly recognizes cats. The machine, in turn, checks for similarities between the images you showed it earlier and the sample image. If the match is strong, it predicts it’s a cat.

Technical touch:

In order to train your computer to do this, you will have to write code. First, you will need to store all the cat images in a directory; this is your dataset. Next, you will pick a mathematical algorithm that can efficiently take in your images, assign labels to them and create other recognition parameters. You will then run the algorithm on your dataset and identify all the images as cats. Once the algorithm has been executed on your dataset, it has created recognition parameters specific to the cats in your dataset, and the algorithm has now transformed into a model for the task of identifying cats. Finally, you test the model by running it on sample images and inspecting the output.
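Below is a minimal sketch of that workflow in Python with scikit-learn, assuming the cat and non-cat images have already been converted into numeric feature vectors. The file names and labels here are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X: one row per image (e.g. flattened pixel values), y: 1 = cat, 0 = not a cat
X = np.load("features.npy")   # hypothetical pre-extracted image features
y = np.load("labels.npy")     # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = SVC()                   # the "algorithm"
clf.fit(X_train, y_train)     # once fitted to the data, it becomes the "model"

print(clf.predict(X_test[:5]))  # test the model on a few sample images
```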

This process of training your computer to learn from the data you provide and allow it to make independent decisions is called machine learning.

What is ML?

Machine Learning is a subset of AI in which a system learns from data, i.e. from information gathered through previous experience. The concept is simple: the machine takes in data and learns certain tasks so as to maximize its performance on those tasks. At a very high level, it is the process of teaching a computer system how to learn and acquire knowledge, thus enabling it to make accurate predictions and decisions. Learning from data (lots of data) is the key to making machine learning possible.

Classifications

— Supervised learning

Supervised learning is the type of machine learning in which the machine is trained using data that is well labeled, i.e. it is already tagged with the correct output for a given input. The machine trains to map the input to the output and adjusts the algorithm to get the best results. The goal is to approximate the mapping function well enough that the output can be predicted for new input data.

There are two types of Supervised Learning techniques: Regression and Classification.

  • Regression: Regression is a technique that is used when the machine is required to predict/produce a continuous-valued output. For example, to predict the prices, weights, heights, marks, salary etc.
  • Classification: Classification is a technique that is used when the output is a category or a class and not a continuous value. It classifies inputs into a discrete category of outputs. For example, predicting a fruit as an apple or orange depending upon weight, color etc.
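A small sketch contrasting the two techniques, using scikit-learn's bundled datasets (diabetes for regression, iris for classification); the dataset choices are just illustrative.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (a disease-progression score)
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))   # a continuous number

# Classification: predict a discrete class (an iris species)
X_clf, y_clf = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
print(clf.predict(X_clf[:1]))   # a class label (0, 1 or 2)
```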

Use-case: This technique can be used to train machines when the input and its desired output are all known from a history of data i.e data is labeled. The machines can map the input with the output and learn relationships between them, eventually producing high accuracy.

Machines trained via supervised learning are like small children. Everything has to be taught and fed to them in order to get the desired results. But the chances of training them in the way you want are very high.

— Unsupervised learning

Unsupervised learning is the training of a machine using information that is neither classified nor labeled, and the algorithm has to act on the information without any guidance. Here, the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on labeled data. The machine is left to find the hidden structures in unlabeled data by itself.

There are two types of Unsupervised learning techniques: Clustering and Association.

  • Clustering: This unsupervised technique deals with finding a pattern of similarity and forming groups (natural clusters) without any guidelines or restrictions in a collection of uncategorized data.
  • Association: This unsupervised technique deals with finding relationships between data based upon rules and guidelines and forms groups (associations) in the collection of uncategorized data.
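A minimal clustering sketch: k-means groups unlabeled points into natural clusters. The synthetic blobs below are illustrative data, not anything from the article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])   # cluster assignments discovered by the model
```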

Use-case: The best time to use unsupervised machine learning is when you do not have data on desired outcomes, such as determining a target market for an entirely new product that your business has never sold before. However, if you are trying to get a better understanding of your existing consumer base, supervised learning is the optimal technique.

Machines trained via unsupervised learning are like students. They know how to learn on their own and only require you to provide the study material. To see if the learning has been effective, you just have to test them. The chances of you training them in the way you want are uncertain.

— Semi-supervised learning

Semi-supervised learning is the approach where an algorithm is trained upon a combination of labeled and unlabeled data, making it a combination of both supervised and unsupervised learning. This combination will typically contain a very small amount of labeled data and a very large amount of unlabeled data.
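A hedged sketch of the idea using scikit-learn's SelfTrainingClassifier: most of the labels are hidden (set to -1, scikit-learn's convention for "unlabeled") and the model bootstraps from the few labeled points. The 90% masking rate is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.9] = -1   # pretend ~90% of the labels are missing

model = SelfTrainingClassifier(SVC(probability=True)).fit(X, y_partial)
print(model.predict(X[:5]))
```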

Use-case: When you don’t have enough labeled data to produce an accurate model and also don’t have the ability or resources to get more data, you can use semi-supervised techniques to increase the size of your training data.

Machines trained via semi-supervised learning are like interns. They know how to learn. However, you have to teach them a small portion of the work and leave the rest for them to self learn.

— Reinforcement learning

Reinforcement learning is the ML approach where the machine is provided with data and a complex algorithm. The programmers create a feedback based reward system. The machine trains on the data and produces output(s) using trial and error from its thousands of parallel learning outcomes and the feedback system grades it with respect to our expectations. The machine over time fine-tunes the learning results and tries to achieve the maximum reward.
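To make the idea concrete, here is a toy tabular Q-learning sketch (one of the reinforcement learning algorithms listed later). The tiny "walk right to reach the goal" environment is entirely made up for illustration.

```python
import numpy as np

n_states, n_actions = 5, 2           # states 0..4, actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # the action values being learned
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    s = np.random.randint(n_states - 1)            # start in a random non-goal state
    while s != n_states - 1:                       # an episode ends at the goal state
        # epsilon-greedy: mostly exploit, sometimes explore (trial and error)
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        reward = 1.0 if s_next == n_states - 1 else 0.0   # the feedback/reward system
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # "right" should end up with the higher value in every state
```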

Use-case: It is used in scenarios where the state-space is very large, i.e. the number of possible ways to achieve an outcome is enormous and the most efficient ones are required. It would be very time-consuming and costly to make the machine learn by providing labels for every outcome (supervised learning), or to let it reach the outcomes on its own (unsupervised learning) without it knowing how each outcome impacts the environment.

Machines trained in reinforcement learning are like entrepreneurs. They are aware of the numerous ways of achieving their goal but improvise their approach based on market trends and customer feedback from time to time.

Algorithms

An algorithm refers to a formal set of steps or course of action taken in order to perform a particular task. In ML, algorithms are the mathematical formulations or procedures that map, group, categorize and analyze inputs with respect to outputs. Once trained on a dataset, an algorithm transforms into a model. An ML model refers to a mathematical expression of the parameter values of an algorithm, obtained after running it numerous times on specific data, such that the algorithm produces the desired output.

There are many mathematical formulas that can be used to perform operations and produce the desired results we need. However, some of them are very commonly used.

  • Supervised algorithms: Linear Regression, Nearest Neighbor, Gaussian Naive Bayes, Decision Trees, Support Vector Machine (SVM), Random Forest, Neural Networks
  • Unsupervised algorithms: Hierarchical clustering, K-means clustering, Neural Networks, Algorithms based on Association rules.
  • Semi-supervised algorithms: Transductive support vector machine (TSVM), graph-based methods, heuristic methods
  • Reinforcement learning algorithms: Q-learning, Monte Carlo methods, value functions, temporal difference

Each and every algorithm above has its own set of terminologies and procedures. It is better to understand each one of them closely and implement them via code. I will try to cover as many as I can, with complete details, in my other blogs.

ML Process

The Machine learning process outlines the various steps taken sequentially from gathering input data to getting the desired results.

Gathering and preparing data

  • The first step is to gather the data. The quantity & quality of data dictates the accuracy of the model.
  • After gathering the data, we prepare it for training. All the errors or deficiencies in the data that may misguide the training must be removed. This is done by cleaning, randomizing, visualizing and then splitting it into training and testing data.
  • In addition, performing Exploratory Data Analysis (EDA) helps to understand relationships and extract features from the data.
  • A good train/test split is 80/20, 70/30, or a similar ratio depending on the domain, data availability and other dataset particulars (see the sketch below).
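A minimal data-preparation sketch with pandas and scikit-learn, assuming a hypothetical CSV file with an illustrative "target" column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")   # hypothetical file
df = df.dropna()                  # basic cleaning: drop rows with missing values
print(df.describe())              # quick EDA: summary statistics per column

X = df.drop(columns=["target"])   # features (illustrative column name)
y = df["target"]                  # labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)   # randomized 80/20 split
```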

Choosing a model

  • The model selected must be well-suited to the data collected and to the problem at hand. Different algorithms suit different tasks, and the ML approach being used largely determines which algorithms are worth considering.
  • There are several techniques used to determine the models suitable for a given scenario such as the Akaike information criterion, the Bayes factor and the Bayesian information criterion.
  • Generally, the ML type (supervised/unsupervised, etc.) dictates the algorithm set, and trials among those algorithms determine the model.
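In practice, a quick way to choose is to cross-validate a few candidate algorithms on the training data (continuing from the split sketched earlier); the two candidates below are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for candidate in (LogisticRegression(max_iter=1000), SVC()):
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    print(type(candidate).__name__, scores.mean())   # keep the best-scoring candidate
```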

Training

  • The goal of training is to program the model to make a prediction correctly as often as possible.
  • The model parameters, i.e. good values for all the weights and the bias, are what training determines. These can be learned directly from the training data.
  • Next, we must determine the Hyperparameters. They are parameters used in model training that cannot be learned by the training process and are very crucial in determining the model training and accuracy.
  • In order to determine the hyperparameters, we perform cross-validation on the validation set (a subset of the training set). It is performed by looping over the training data and determining the best hyperparameter value(s).
  • We fit the model with the model parameters, hyperparameters and perform training.
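A hedged sketch of the training step, continuing from the earlier split: candidate hyperparameter values are compared with cross-validation on the training data, then the model is fitted. The random forest and its values are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for n_trees in (50, 100, 200):                      # candidate hyperparameter values
    candidate = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    print(n_trees, cross_val_score(candidate, X_train, y_train, cv=5).mean())

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                         # the final fit on the training set
```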

Evaluation

  • Model Evaluation is an integral part of the model development process. It helps to find the best model that represents our data by determining the feasibility, success and performance of the model.
  • There are two methods of evaluating models: Hold-Out and Cross-Validation. (Here, cross-validation is performed on the test dataset.)
  • Evaluating model performance with the data used for training is not acceptable, as it can easily generate overoptimistic and overfitted models.
  • Models are evaluated against evaluation metrics. Different evaluation metrics are used for different kinds of model training techniques. There are several evaluation metrics like confusion matrix, cross-validation, AUC-ROC curve, etc.
  • Very common evaluation metrics for classification models are accuracy and precision, both of which are calculated from a confusion matrix.
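A short evaluation sketch, continuing from the model fitted above: score the held-out test set with a confusion matrix, accuracy and precision.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
```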

Hyperparameter Tuning

  • Models can be configured to improve accuracy using hyperparameter tuning. This enables us to improve probabilistic outcomes, but at the cost of increased computing resources and time.
  • Model hyperparameters such as the number of training steps, learning rate, initialization values, etc. can be adjusted (tuned) from their initial values to improve the model.
  • Methods such as grid search and random search allow us to determine the best hyperparameters.
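A grid search sketch over a small, purely illustrative hyperparameter grid, again continuing from the earlier training data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # the best combination found by the search
```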

Prediction/Testing

The final step is all about getting accurate results on new input data (the test set) and seeing how the outcomes play out. In case you aren't satisfied, go back to training.

Finally, getting hands-on ML 🚀

For the majority, ML models are programmed using Python, the main reason being the variety of libraries available that make our work easier. When working with any AI application in general, it is very common to see Python and its libraries around. In order to get started with ML, the last thing you need to know is which library to use and for what purpose.

  • Numpy, Theano — Used for multi-dimensional arrays and matrix processing. They are helpful when dealing with operations on datasets.
  • Pandas — Non-ML library that provides support for data analysis, extraction and preparation before training.
  • Matplotlib — Non-ML library that provides support for all kinds of data visualizations.
  • Scipy — A library that supports linear algebra, statistics and also image processing.
  • Scikit-learn — ML library for classical ML algorithms and supports nearly all supervised and unsupervised learning.
  • Tensorflow, Keras, Pytorch — These libraries involve tensor-based computation and support building and training neural networks and deep neural networks.
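For reference, a typical import block combining several of the libraries above; install whichever ones you actually use first (e.g. with pip).

```python
import numpy as np                  # arrays and matrix operations
import pandas as pd                 # data analysis and preparation
import matplotlib.pyplot as plt     # data visualization
from sklearn.model_selection import train_test_split   # classical ML utilities
from sklearn.ensemble import RandomForestClassifier
```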

Summary

These were the fundamentals I think you need before starting your journey with ML. There is more to learn in depth about each and every concept. I urge you to focus on the fundamentals.

I hope it was helpful. 😄

Cheers:)
