As someone who is currently learning about data science and machine learning techniques I thought it would be a useful exercise to try and explain my understanding of some of the basic concepts and algorithms I have learnt so far.
Living with two Harry Potter geeks I (perhaps foolishly) have tried to use HP analogies throughout. Pretty risky…but anyway, here we go!
Machine Learning Basics
Uses for Machine Learning (ML):
- Classification: This performs the function of the Sorting Hat. We can ‘train’ algorithms to sort new data (inputs) into a predefined category/class (outputs): i.e. Harry (input) → Gryffindor (output).
- Regression: Observes trends in the data to make predictions. Much like drawing a straight line on a graph and using extrapolation to predict what a value should be at another point on the graph. But this can get complex, with multi-variable and nonlinear models usually employed.
Types of ML:
Supervised Learning: Takes in data for which we already know what the result should be and uses this to ‘train’ a chosen algorithm. We can then use a ‘test set’ of this known data to determine how well the algorithm has learnt the proper way to give us the correct output. The ‘accuracy’ of the algorithm is judged by how many answers it got right on its test.
- Spam email detection and filtering
- Facial recognition in photos from previously ‘tagged’ photos
- Medical technologies — cancer identification, genetic sequencing, etc.
Unsupervised Learning: Just receives a whole bunch of data and we ask it if it can recognise any patterns or relationships that we might not (as mere humans) be able to see.
- Identify customers with similar interests for product recommendation
Reinforcement Learning: Based on algorithms that use the trial and error of a series of actions to seek the greatest ‘reward’. This leads to a refinement or optimisation of the algorithm based on outcomes from its actions.
Terminology in ML:
The quickest way to complicate something is to use jargon. So although the field of ML often tries its very hardest to confuse us, the terms used can be easily grasped and pretty analogous to other fields in math and computer science. Here are some fundamental ones I’ll be referring to later on:
- Features: These are the ‘variables’, and their values are used as inputs into the algorithms.
- Labels: These are the values assigned to a given input. In the Harry Potter example, the labels would be Gryffindor, Slytherin, Hufflepuff, and Ravenclaw, for which the features would be the Hogwarts students, with values of Harry, Ron, Neville, etc.
- Algorithm: This is the mathematical model that is chosen for performing the analysis. There are lots to choose from, which I’ll go into a little bit below. But really these are the crux of ML, as they essentially ‘crunch the numbers’ (i.e. inputs) to tell us something interesting (i.e. output).
Machine Learning Algorithms
The choice of algorithm is very important, and depends on what you want to get out of it. Depending on your circumstance you might want a highly complex algorithm that takes a long time to run, or you might just be happy with a less accurate algorithm with a much shorter runtime. Additionally, each algorithm can take a number of parameters which affect the accuracy of your ML predictions.
Some common supervised learning algorithms are:
The Naive-Bayes algorithm is based on a fundamental piece of statistics: ‘Bayes’ Rule’ (or ‘Bayes’ Theorem’). Bayes’ Rule for probability goes something like this:
prior probability + test evidence → new probability
The Naive-Bayes (NB) algorithm uses the multiplication of probabilities for different variables to calculate a total probability. For each potential outcome (or ‘label’) these probabilities can be normalized (i.e. by taking the ratio) and the highest probable value used as the output. This provides a method of classification that is relatively simple, and thus scales well for large datasets with a wide featureset.
For example, in the image above we have a list of spells and the fraction for which Harry and Hermione use each of them. If we were investigating a break-in at Gringotts and knew for sure that the burglar had used obliviate and accio, we can make a prediction as to whether it was more likely to be Harry or Hermione. Given a ‘prior’ probability of 0.5 for both (that is, both are initially assumed equally-likely to be our suspect) you simply multiply the known fractions together to obtain a new probability:
Harry: 0.5 (prior) x 0.25 (accio) x 0.10 (obliviate) = 0.01250
Hermione: 0.5 (prior) x 0.15 (accio) x 0.55 (obliviate) = 0.04125
Normalizing these new fractions gives us the final probability:
Harry: 0.01250 / (0.01250 + 0.04125) = 23%
Hermione: 0.04125 / (0.01250 + 0.04125) = 77%
We can therefore be 77% sure that it was Hermione! This is an extremely simplified case (obliviously), but the same principle can be applied to vast numbers of ‘spells’, with many iterations of probabilistic calculations, making it a highly versatile and useful algorithm. Bonus: The reason that this algorithm is “naive” is that it does not take into account the order in which the spells were performed! This would of course add a much higher level of sophistication and complexity.
Support Vector Machines:
Imagine a 2D-scatterplot with data points that can be classified as belonging to Group A or Group B. The Support Vector Machine (SVM) algorithm draws a line through the plot that divides the data into (predominantly) Group A and (predominantly) Group B. The best line will be the one that classifies the most data points correctly, and also maximises the average distance between the line and the data points (this is called the ‘margin’). This line of demarcation is called a ‘decision surface’ in ML parlance, and is a very important concept in classification algorithms. In practice, ‘decision surfaces’ can be multidimensional, with the 2D-scatterplot just being the most simple representation.
Once this decision boundary/surface is generated, new “unclassified” data points can classified depending on which side of the boundary line they fall. Parameters for this algorithm can also be tuned, such as if you prefer a smoother boundary over more incorrect classifications or vice versa. This can have a large effect on the ability of the algorithm to generalize to unseen data points, and hence its performance and reliability.
The Decision Tree (DT) algorithm uses a series of ‘linear’ questions to arrive at an answer. This is like a yes/no questionnaire that branches off multiple times and eventually arrives at an output value. The series of decisions made along the way allow us to draw a decision surface, from which new, unknown data can be assessed against to make predictions. The nodes in the decision tree are called ‘leafs’, while the connections between nodes are called ‘branches’.
The tree-like structure is quite straightforward, easily allowing us to sort Harry into Gryffindor from a relatively small set of key characteristics.
As the name suggests, the K-Nearest Neighbors (k-NN) algorithm tries to find a predefined number (that number being ‘k’) of data points surrounding the point being investigated, so that it can predict the label for the investigated point based on the surrounding ones. For example, in the image below we can see that most of the Hogwarts students hang out in their House common areas/bunkrooms. If we had a student for whom we didn’t know their house, looking at their location relative to the other students might help us to determine which house this student belonged to.
As you can see, the unknown student is closest to Cedric, and most likely lots of other Hufflepuffs. Therefore classifying them as a member of Hufflepuff seems like the most reasonable guess. Setting the value of ‘k’ determines the number of students that the algorithm will take into account before making a decision (starting from the closest ones and the moving further out). If it often common when using k-NN to also pass in a weighting parameter that tells the algorithm to give more significance to closer data points (students) than those that are geometrically further away.
A common unsupervised learning algorithm is:
The K-Means Clustering (k-MC) algorithm is unsupervised. That means that we don’t need to know any of the ‘labels’ beforehand. Given a dataset, k-MC simply picks a point in (for example) a 2D-scatterplot, then it calculates the mean distance to the surrounding points. The position of the original point is then updated and this cycle repeated until the minimum mean distance to the k-surrounding points is achieved (i.e. the local minimum). This point is then called the ‘centroid’, and it identifies one ‘cluster’. We can set the ‘number of clusters’ that we think there should be. In the example for k-NN above, we could potentially even use this unsupervised approach to identify the students belonging to each house. If we set the number of clusters to 4, then the 4 points with the minimum average distances between students would be approximately located in each house common room. We still wouldn’t know which group corresponded to which house (e.g. Gryffindor), but the students would be classified into 4 distinct groups.
Note that because the algorithm only finds the local minimum, the initial position of the centroids can have a large effect, and in some cases can even lead to ‘counterintuitive clusters’.
However, this algorithm is a very useful method for performing sorting/classifying operations in an unsupervised way.
The concepts and algorithms discussed here are pretty fundamental machine learning things, and there are many more complex algorithms that one may choose for a given task.
That being said, I hope that this article has helped to explain some aspects of machine learning, so that now when you hear people talking about “machine learning” or “artificial intelligence” you might recognise a few words here or there and not just be completely overwhelmed.