A Machine Learning Model with Python

Published in The Startup · 7 min read · Jul 3, 2020

A simple Machine Learning model to predict the preference of music genres among men and women of varying ages.

Thousands of words have been spoken about machine learning and artificial intelligence, and in this article we hope to turn some of them into something concrete. Wouldn’t it be interesting to build a real project together and get a feel for what an AI developer does? This machine learning model will predict the genre of music that a user is likely to download and listen to, based on the trend among people of their age and gender. We will train the model with sufficient data about people of different ages and genders and the genres they prefer. The model will learn from this and will be able to guess the genre of music a particular person might like, given their age and gender.

This tutorial is aimed at people who can read Python, people who are curious about the inner workings of machine learning algorithms, and fans of artificial intelligence and data science. We shall demonstrate a technique that is simple to use yet rich in detail: the decision tree classifier.
A basic knowledge of Python is expected. Even so, a non-programmer with a basic grasp of mathematics, such as gradients and calculus, should still be able to follow along, because mathematics is fundamental to machine learning and data science.

If you want to follow this tutorial hands-on, you can open a Colab tab in your browser and write live Python code. Colab allows anybody to write and execute arbitrary Python code through the browser and is especially well suited to machine learning, data analysis, and education. You can start writing Python code without installing anything on your computer.
For order and understanding, we will run this project in seven (7) stages:

1 Importing Data: Data is an indispensable part of machine learning, which is why we so often borrow the methods of data science. We need data to train our model and data to test it. The files used in machine learning and data science usually come in formats such as .csv, .xlsx, .json, and .txt. Python has libraries and modules containing thousands of classes (bundles of reusable functions and attributes), and we call the methods (functions) of these classes. The most common Python library for working with data is Pandas, a favourite of data scientists and machine learning engineers alike. Pandas lets us import data, view its properties, and transform it. To load the data we are going to use, we first import Pandas itself. While importing it, we can give it an alias with the keyword “as”, so that we type the short name “pd” instead of “pandas” every time we call it. After importing Pandas, we use its “read_csv” function to load our data file, and we store the result in a variable called “dataframe”.

import pandas as pd
dataframe = pd.read_csv('music.csv')

We use the inbuilt **print()** function to display a preview of the first 10 rows of the data: print(dataframe.head(10)). The genders need to be represented as integers, so we use 1 for male and 0 for female. This is necessary because the model can only read numerical values when we fit it to the training data.
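
The encoding step described above can be sketched as follows. This is a minimal illustration, not the article's own code: the rows are made up, and the assumption that the raw GENDER column holds the text labels "male" and "female" is ours. Only the column names and the 1/0 convention come from the article.

```python
import pandas as pd

# Hypothetical rows standing in for music.csv; the text labels in GENDER
# are an assumption made only to demonstrate the encoding step.
raw = pd.DataFrame({
    "AGE": [20, 23, 25, 21, 26, 30],
    "GENDER": ["male", "male", "male", "female", "female", "female"],
    "MUSIC GENRE": ["HipHop", "HipHop", "Jazz", "Dance", "Acoustic", "Classical"],
})

# Map the text labels to the integers used in the article: 1 = male, 0 = female.
dataframe = raw.assign(GENDER=raw["GENDER"].map({"male": 1, "female": 0}))

print(dataframe.head(10))  # previews up to the first 10 rows
print(dataframe.dtypes)    # AGE and GENDER are now both integer columns
```

If the file already stores gender as 1 and 0, this mapping is unnecessary and read_csv alone is enough.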

2 Cleaning the Data: Machine learning engineers are happy when the data is organized into rows and columns, the tabular form that data scientists work with. Sometimes, however, some columns are not needed or some cells are missing values, and the engineer first has to clean the data to make it usable for the project. The music .csv file we are using has no unnecessary columns, duplicates, or empty cells. We do, however, need to divide the data into two parts: the input data and the output data. The input data holds the age and gender columns, while the output holds what is to be predicted, the genre column. Thus, we train the model to accept two input values, age and gender, and to predict the result (genre) from them. To create the input set, we drop (remove) the “MUSIC GENRE” column from the data frame and store the remaining columns (age and gender) as X:

X = dataframe.drop(columns=["MUSIC GENRE"])

After dropping the “MUSIC GENRE” column, we are left with “AGE” and “GENDER” which serve as the **input** set, X.

Now we need to create the output set, y, which holds only the “MUSIC GENRE” column. In machine learning this is commonly called the “label”, and it is what the model learns to predict.

y = dataframe['MUSIC GENRE']

So, we have the age and gender of the user as the input set (X) and we expect our model to predict which genre of music (the output set — y) the user would like.

Having extracted the “MUSIC GENRE” column from the data frame and given it the name y, we now have our output set, which can be previewed with **print(y.head(10))**.
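
The article's file needs no cleaning, but the checks mentioned in this step can be sketched as below. The rows are hypothetical, and one AGE value is deliberately left empty so the check has something to find. Dropping the incomplete row is just one possible remedy, chosen here for brevity.

```python
import pandas as pd

# Hypothetical rows; the missing AGE (None) is planted on purpose.
dataframe = pd.DataFrame({
    "AGE": [20, 23, None, 21, 26, 30],
    "GENDER": [1, 1, 1, 0, 0, 0],
    "MUSIC GENRE": ["HipHop", "HipHop", "Jazz", "Dance", "Acoustic", "Classical"],
})

# Count empty cells per column and fully duplicated rows.
missing_per_column = dataframe.isna().sum()
duplicate_rows = dataframe.duplicated().sum()
print(missing_per_column)
print("duplicate rows:", duplicate_rows)

# One simple remedy: drop every row that has an empty cell.
cleaned = dataframe.dropna()

# Split the cleaned data into the input set X and the output set y.
X = cleaned.drop(columns=["MUSIC GENRE"])
y = cleaned["MUSIC GENRE"]
print(X.shape, y.shape)
```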

3 Splitting the data into a ‘training set’ and a ‘test set’: We need to split the data (X and y) into a training set and a test set. The training set is used to train the model, while the test set is used to evaluate whether its predictions are accurate enough. Thus, we split X into X_train and X_test, and y into y_train and y_test. Pandas has no ready-made helper for this, so we import one from another library. The scikit-learn library has a module, model_selection, which provides a splitter function, train_test_split(). We will use this function to split our data into X_train, X_test, y_train, and y_test. Here we pass it three arguments: the input set (X), the output set (y), and the size of the test set. A common convention is to use about 80% of the data for training and about 20% for testing. So, when calling the function, the test_size argument gives the size of the test set as a decimal, 0.2, as seen in the code below. The function returns a list of four objects (two data frames and two series), which we unpack into the four variables X_train, X_test, y_train, and y_test.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
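
To see what the split produces, here is a small self-contained sketch. The 18 rows are a hypothetical stand-in for music.csv, and the random_state argument is our addition: it fixes the shuffle so the same rows land in each set on every run, which the article's call does not do.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for music.csv (18 rows); column names follow the article.
dataframe = pd.DataFrame({
    "AGE": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "GENDER": [1] * 9 + [0] * 9,
    "MUSIC GENRE": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
                    + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})
X = dataframe.drop(columns=["MUSIC GENRE"])
y = dataframe["MUSIC GENRE"]

# random_state (our addition) makes the random shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)                      # rows and columns in each input set
print(type(X_train).__name__, type(y_train).__name__)   # DataFrame and Series
```

Without random_state, the rows are shuffled differently on each run, so the later accuracy score will vary from run to run.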

4 Creating the Model: Having split the data, we can go ahead and create the model, which is simple enough using the “tree” module from the same scikit-learn library. The tree module has the class DecisionTreeClassifier. After importing this class from sklearn.tree, we create the model with a single line of code by creating an instance of the class:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

5 Training the Model: After creating the model, we train it with the training data we have, X_train and y_train. It learns the relationship between the age and gender in each row of X_train and the corresponding genre of music in y_train. One more line of code is all it takes to train the model.

model.fit(X_train, y_train)
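
To get a feel for what the model has learned, the fitted tree can be printed as plain-text rules with scikit-learn's export_text. The sketch below is illustrative only: the rows are a hypothetical stand-in for music.csv, the model is fitted on all of them rather than on a training split, and random_state and the new_people frame are our additions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for music.csv; column names follow the article.
dataframe = pd.DataFrame({
    "AGE": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "GENDER": [1] * 9 + [0] * 9,
    "MUSIC GENRE": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
                    + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})
X = dataframe.drop(columns=["MUSIC GENRE"])
y = dataframe["MUSIC GENRE"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Print the learned if/else rules on AGE and GENDER, one branch per line.
rules = export_text(model, feature_names=list(X.columns))
print(rules)

# Ask the fitted model about two people it has not seen (hypothetical inputs).
new_people = pd.DataFrame({"AGE": [21, 22], "GENDER": [1, 0]})
predictions = model.predict(new_people)
print(predictions)
```

Each line of the printed rules is a threshold test on one column, which is all a decision tree does: it asks a sequence of yes/no questions and ends on a genre.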

6 Making Predictions: The next step, after training the model, is to give it data different from what it has already learned from, the X_train. We present X_test as new data for which the model predicts corresponding values. These values can be stored in a new variable, y_prediction, and compared with the actual values we have in y_test. So, ultimately, y_test is reserved for evaluating the accuracy of the model’s predictions.
To make a prediction, we write:

y_prediction = model.predict(X_test)

Let us quickly preview the predictions by printing y_prediction.

print(y_prediction)

Printing y_prediction gives us an array of the predictions, but this isn’t easy to read. To make it readable, we loop over the result and print each value on a separate line.
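
A loop of that kind might look like the sketch below. The array here is a hypothetical stand-in for the output of model.predict(X_test), and the row numbering is our addition.

```python
import numpy as np

# Hypothetical predictions standing in for model.predict(X_test).
y_prediction = np.array(["Classical", "HipHop", "Dance", "Jazz"])

# Print one prediction per line, numbered from 1 for easy reference.
for row_number, genre in enumerate(y_prediction, start=1):
    print(row_number, genre)
```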

It is easier to read now. To recap: we had a test set, X_test, which was 20% of the overall X data set, and we passed it to our model to predict. The model looks at the first row of the data, age and gender, then predicts the genre of music the person would like. In our run, the prediction for the first row is “classical music”. You can print X_test alongside to see the data the model is predicting for.
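
One way to view the inputs, the predictions, and the true labels side by side is to attach them to X_test as extra columns. This is a sketch on hypothetical rows standing in for music.csv. The column names PREDICTED and ACTUAL and the random_state values are our own choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv; column names follow the article.
dataframe = pd.DataFrame({
    "AGE": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "GENDER": [1] * 9 + [0] * 9,
    "MUSIC GENRE": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
                    + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})
X = dataframe.drop(columns=["MUSIC GENRE"])
y = dataframe["MUSIC GENRE"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_prediction = model.predict(X_test)

# assign() returns a copy of X_test with two extra columns; X_test itself is unchanged.
comparison = X_test.assign(PREDICTED=y_prediction, ACTUAL=y_test)
print(comparison)
```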

7 Evaluation: The accuracy_score() function in the metrics module of sklearn lets us evaluate the accuracy of this prediction. It takes two arguments, the two data sets we wish to compare: the actual values first, then the predicted values. We are comparing the prediction, y_prediction, with the actual values stored in y_test to see if the model guessed right. We first import the function, then use it to compare the two.

from sklearn.metrics import accuracy_score
prediction_accuracy = accuracy_score(y_test, y_prediction)

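The accuracy of our model’s prediction comes out as 0.7857…, equivalent to 78.57%, which is fairly good. Because train_test_split() shuffles the rows randomly, this figure can change from run to run. Whenever we make another prediction, we measure its accuracy again to see if the model is doing well.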
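
The accuracy score is simply the fraction of predictions that match the actual values. The sketch below checks this by hand on two short, made-up lists. For comparison, a score of 0.7857 is what 11 correct predictions out of 14 would give, though the article does not state the counts behind its figure.

```python
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted genres for five people.
y_test = ["Classical", "HipHop", "Dance", "Jazz", "Acoustic"]
y_prediction = ["Classical", "HipHop", "Dance", "Acoustic", "Acoustic"]

# Count the positions where the prediction equals the actual value.
matches = sum(actual == predicted
              for actual, predicted in zip(y_test, y_prediction))
manual_accuracy = matches / len(y_test)

# accuracy_score takes the actual values first, then the predictions.
prediction_accuracy = accuracy_score(y_test, y_prediction)

print(matches, "of", len(y_test), "correct")
print(manual_accuracy, prediction_accuracy)
```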

If the accuracy suggests the model is not good enough, we can adjust the ratio of training data to test data, add more data, or switch to the learning algorithm that gives the highest accuracy across repeated rounds of prediction. In this project we used the decision tree classifier, but there are other algorithms to choose from depending on what we intend to do with the model: Linear Regression (the simplest, basically y = mx + c), Logistic Regression, Decision Tree (the family we just used), Naive Bayes, k-Nearest Neighbour, Random Forest, etc.
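
One way to compare algorithms over several rounds is cross-validation, which scikit-learn provides as cross_val_score. This is a technique we are swapping in for illustration, not something the article itself uses. The rows are a hypothetical stand-in for music.csv, and the three candidates and their settings are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv; column names follow the article.
dataframe = pd.DataFrame({
    "AGE": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "GENDER": [1] * 9 + [0] * 9,
    "MUSIC GENRE": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
                    + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})
X = dataframe.drop(columns=["MUSIC GENRE"])
y = dataframe["MUSIC GENRE"]

# Candidate algorithms to compare (an arbitrary selection).
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=3),
    "naive Bayes": GaussianNB(),
}

# cv=3 trains and tests each model three times on different slices of the data;
# the mean of the three accuracy scores is used for the comparison.
mean_scores = {
    name: cross_val_score(candidate, X, y, cv=3).mean()
    for name, candidate in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.2f}")
```

With so few rows the scores are rough, so treat the ranking as a hint rather than a verdict.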

You will find the source code in this repository.

Peter Michael
University of Lagos, Nigeria

SGC GI: 048 — Robotics/IoT/AI


Written by Michael Peter
