Using classification to predict iris species within the Iris dataset
Classification, in the context of machine learning, is a method for determining which group an object belongs to, based on the known categorization of similar objects. Or, to put it in English: if we have a bunch of data points, say people, about whom we know some qualitative measure, say whether or not they will vote in an upcoming election, as well as additional information such as their age, income, registered party, and profession, we can train a model on these data points. This model will then take in new observations (new people), along with the information about their age, etc., and predict whether or not they will vote. Simple enough.
K nearest neighbors is a particular type of classification that serves as a good example of machine learning because it is relatively simple. A new observation is classified by distance, typically Euclidean: it takes on the class of whichever data point(s) in the training set it is closest to. When k is greater than 1, the most common class among the k closest points wins.
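To make the distance idea concrete, here is a minimal sketch of K nearest neighbors in plain Python. The function names (euclidean, knn_predict), the made-up measurements, and the labels "A" and "B" are all hypothetical and exist only for illustration; a real project would more likely use a library implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Pair each training point's distance to the query with its label, nearest first.
    by_distance = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up (length, width) pairs for two hypothetical classes, illustration only.
train_points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_points, train_labels, query=(1.6, 0.4), k=3)
print(prediction)
```

The query sits right next to the three "A" points, so all three of its nearest neighbors vote the same way. Choosing an odd k is a common way to reduce tied votes when there are two classes.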
The Iris dataset, introduced in 1936 by the British statistician and biologist Ronald Fisher, records sepal length and width and petal length and width for 150 flowers, 50 from each of three iris species. This particular dataset is widely used in machine learning applications, and works well to illustrate the efficacy of classification models.
By plotting the petal length and width of the individually observed flowers, we see that these measurements seem to group by species:
In the above graph, each distinct color represents one of the three iris species in the dataset. A similar grouping appears when we plot sepal width and length, though the species overlap more there.
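A plot like the one described could be produced along these lines. This is only a sketch: it assumes the copy of the dataset bundled with scikit-learn and uses matplotlib for drawing, neither of which is named in the text, and the output filename is arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
# Columns of X: sepal length, sepal width, petal length, petal width (all in cm).
X, y = iris.data, iris.target

fig, ax = plt.subplots()
# One scatter call per species, so each species gets its own color and legend entry.
for species_index, species_name in enumerate(iris.target_names):
    mask = y == species_index
    ax.scatter(X[mask, 2], X[mask, 3], label=species_name)

ax.set_xlabel(iris.feature_names[2])  # petal length
ax.set_ylabel(iris.feature_names[3])  # petal width
ax.legend(title="species")
fig.savefig("iris_petals.png")  # arbitrary filename; plt.show() also works interactively
```

Swapping the column indices 2 and 3 for 0 and 1 gives the corresponding sepal plot.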
We can train a K nearest neighbors model on some portion of the Iris dataset (I used 70%), using petal width and length and sepal width and length as predictors of iris species, and have this model produce predictions of iris species. Once trained on data with known species, the classification algorithm takes a new set of sepal and petal measurements and compares it to the values it has stored from the training data. Depending on which data points from the training set are “nearest,” i.e. which observations most closely resemble our query, the model then outputs a predicted class.
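The workflow just described might look roughly like the following with scikit-learn. The library, the value of k, the random seed, and the stratified split are all assumptions made for this sketch; only the 70% training share comes from the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X holds the four sepal/petal measurements, y the species encoded as 0, 1, 2.
X, y = load_iris(return_X_y=True)

# Train on 70% of the flowers and hold out the rest.
# random_state and stratify are assumptions, added for a reproducible, balanced split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

# n_neighbors=5 is an assumption (it is also the library default).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)  # "training" here amounts to storing the labeled points

predictions = model.predict(X_test)     # predicted species for the held-out flowers
accuracy = model.score(X_test, y_test)  # fraction of held-out flowers classified correctly
print(f"held-out accuracy: {accuracy:.2f}")
```

Scoring on flowers the model has not seen gives a fairer picture than scoring on the training set, where each point's nearest neighbor is itself.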