Machine Learning #3 — Grades
If you read my first post in this series, you may recall the example I employed to illustrate the nature of machine learning — tumor analysis and detection.
Now, obviously, a real-world problem like tumor detection is plagued with complexity, as the size of the tumor isn’t the only variable in play. In reality, scientists have to face countless other variables, from cell uniformity to the biological intricacies of the patient. For this reason, it’s probably not the easiest example to get started with, so I will be introducing a slightly more straightforward one.
I have a Math test next week. I’ve been given data for the number of hours spent studying, and the final grade it corresponds to. From this, I face the itching question that all high-schoolers drool over — How can I spend the least time studying, and still get the best grade?
Fortunately, it is a question no longer. Thanks to our good old (or new) friend Machine Learning, we don’t have to burn away our brain power pondering this question for hours on end. We don’t have to spend hours wasting away over how best to answer this perplexing puzzle; now we can let the computer tell us.
The figure above (Fig. 1.1) is a scatter plot of data I created. These numbers are entirely made up and do not reflect any true data collection; however, for the sake of this exercise, let’s assume they do ;)
Now, to tie in with what I discussed in the last post regarding supervised and unsupervised learning, which of the two branches do you think this problem falls under?
It’s supervised. Think about it. We’re feeding in the ‘correct’ data: the actual number of hours spent studying, and the final grade that resulted. So we’re giving the algorithm a labeled data set and asking it to predict the output for any given number of hours spent studying.
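To make the idea of a labeled data set concrete, here is a minimal sketch in Python. The numbers are invented for illustration (they are not the points from Fig. 1.1), and the names `hours`, `grades`, and `labeled_examples` are my own placeholders.

```python
# Hypothetical data, made up for illustration only.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]        # input: hours spent studying
grades = [52.0, 61.0, 70.0, 74.0, 83.0]  # label: the final grade that resulted

# In supervised learning, every input comes paired with its known
# 'correct' output, so the data set is a list of (input, label) pairs.
labeled_examples = list(zip(hours, grades))

for studied, grade in labeled_examples:
    print(f"studied {studied} h -> scored {grade}")
```

An unsupervised problem would hand us only the `hours` list, with no grades attached.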
Regression and Classification
In supervised learning, we come across two types of problems: regression and classification.
Regression, a term that may have popped up in Math or Science class, describes a statistical process for determining and quantifying the strength of the relationship between variables. When we plot a line of best fit onto our scatter plots, we’re really employing the tricks of linear regression without even knowing it. A supervised learning problem is said to be a regression problem if the output it predicts is a continuous, real-valued number.
What this implies is that if the value we’re predicting can be any number within a certain range (i.e. it is continuous), we know we’re dealing with a regression problem.
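For the curious, the line of best fit mentioned above can be sketched in a few lines of Python. This uses the standard closed-form least-squares formulas for a straight line, not a derivation, and the data is made up for illustration. The helper names `fit_line` and `predict_grade` are hypothetical.

```python
from statistics import mean

# Hypothetical data, made up for illustration only.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
grades = [52.0, 61.0, 70.0, 74.0, 83.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = how x and y vary together, divided by how much x varies alone.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    # The least-squares line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(hours, grades)


def predict_grade(hours_studied):
    """Predict a continuous grade for any number of hours studied."""
    return slope * hours_studied + intercept


print(predict_grade(3.5))
```

Notice that `predict_grade` can return any real number, not just the grades that appear in the data. That is exactly what makes this a regression problem.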
If the output we’re predicting is discrete, then we have to tackle it as a classification problem. These problems, while still focused on predicting the output, look at discrete values as opposed to continuous numbers. The tumor example, all the way back in post one, is a classification problem, as we have to ‘classify’ a tumor as either benign or malignant; there’s nothing in between, so the output is discrete.
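The difference shows up clearly in the type of the output. Below is a toy sketch in Python: the function `classify_tumor` and its `threshold_cm` cut-off are entirely hypothetical, chosen only to illustrate a discrete output, and have no medical meaning. A real classifier would learn its decision rule from labeled data rather than have it hard-coded.

```python
from typing import Literal

# A classification output can only be one of a fixed set of labels.
TumorLabel = Literal["benign", "malignant"]


def classify_tumor(size_cm: float, threshold_cm: float = 2.0) -> TumorLabel:
    """Toy rule with a made-up threshold; for illustration only."""
    return "malignant" if size_cm >= threshold_cm else "benign"


for size in (0.5, 1.9, 2.0, 4.2):
    print(size, "->", classify_tumor(size))
```

However finely we vary the input size, the answer is always one of two labels, whereas a regression model would return a number anywhere on a continuous scale.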
Now, if we come back from this little field trip or detour we just took and look back at the grades problem, it becomes rather transparent that we are faced with a classic regression problem.
The desired output, our final grade, is a continuous number (as is our input, the number of hours spent studying), so this problem falls quite easily under regression.
In the next post, I’ll try to cover how to actually implement and derive the linear regression algorithm, but I will warn you ahead of time, it gets very, very math heavy…