If you are just starting out in Machine Learning and want to know the basic workflow for building your own model, the number of decisions can be confusing: what processing needs to be done, how to organise the data, whether to scale it, whether to encode it and, if so, exactly which data to encode, how to apply something as simple as linear regression to your data, and finally how to get to your output. This post will familiarise you with a basic model pipeline you can use to build your very first model and take a step towards the bigger and better problems that are out there for the taking.
Wait, what is a Machine Learning Pipeline, you say?
When you build a machine learning model, a lot of moving parts are involved which, when combined and executed in order, give you the results. Without some of these “parts” your model might not function properly. And when I say might, I mean that which parts you put in your pipeline really depends on your problem and, more specifically, on your data.
Each stage of your pipeline receives processed input from the preceding stage, applies its own processing and passes its output to the next stage. These stages do not depend on whether you want to perform Logistic Regression, Support Vector Regression or K-Means clustering. It is important to remember that I am not depicting a particular pipeline but rather the stages involved in a pipeline. As explained before, which stages you put in your pipeline depends on what kind of learning you want your model to do and on your problem. Some of these stages are obviously necessary, some maybe not. I’ll explain in more detail as we go further. The stages we are going to cover are:
- Importing the dataset
- Taking care of missing data
- Splitting the dataset into the Training set and Test set
- Feature Scaling
- Encoding categorical data
- Fitting the estimator to the training set and making a confusion matrix
- Visualising Results
Importing the dataset
The first and most important thing to do is import the dataset. We generally read the dataset into something called a DataFrame, which is a 2-D data structure that stores our data in a tabular format, i.e. in rows and columns. Another step involved while importing the dataset is splitting it into the dependent and independent variables. We use the pandas library for importing the data.
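As a minimal sketch of this step, the snippet below reads a small CSV into a DataFrame and separates the features from the target. The column names and values are made up for illustration (Purchased is a hypothetical target column), and the in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; with a real file you would
# pass its path to pd.read_csv instead of this in-memory buffer.
csv_text = """Age,EstimatedSalary,Purchased
19,19000,0
35,20000,0
26,43000,0
47,25000,1
45,26000,1
"""

# Read the tabular data into a DataFrame (rows and columns).
dataset = pd.read_csv(io.StringIO(csv_text))

# Independent variables (features): every column except the last one.
X = dataset.iloc[:, :-1].values
# Dependent variable (target): the last column.
y = dataset.iloc[:, -1].values

print(dataset.shape)
print(X.shape, y.shape)
```

This assumes the target sits in the last column; if yours does not, select the columns by name instead.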
Taking care of missing data
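Nothing in this world is perfect, and such is the case with our datasets. Sometimes values are missing from the dataset, and that can cause problems because most ML algorithms do not support missing data. We will use scikit-learn's imputer to replace the missing values, which are represented as numpy.nan, with a substitute such as the mean of the column. The original Imputer class from sklearn.preprocessing was removed in scikit-learn 0.22; its replacement is SimpleImputer from sklearn.impute.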
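Here is a small sketch of mean imputation, assuming a recent scikit-learn where SimpleImputer is available. The array is invented for illustration, and the mean strategy is just one possible choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with two missing entries marked as np.nan.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [35.0, np.nan],
    [45.0, 80000.0],
])

# Replace every np.nan with the mean of the known values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Other strategies such as "median" or "most_frequent" may suit skewed or categorical columns better.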
Splitting dataset into training set and test set
The next step is to split your dataset into a training set, which your algorithm learns from, and a test set, on which the algorithm's generalisation is measured, that is, how well it performs on data it has not seen. This very simple step takes one line of code using scikit-learn's train_test_split function.
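A minimal sketch of the split, using a toy array rather than a real dataset. The 80/20 ratio and the random_state value are arbitrary assumptions, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and a binary label per sample.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing; random_state makes the shuffle
# reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```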
Feature Scaling
Feature scaling is yet another important step in your machine learning pipeline. Here it means standardising each feature by removing its mean and scaling it to unit variance. If your features do not look more or less like standard normally distributed data, some estimators might not work properly.
Suppose you have a couple of features in your dataset and one of them takes values over a much larger range than the other. For example, you are predicting housing prices and you have just two features: the square-foot area and the number of bedrooms. The square-foot feature takes much larger values than the number of bedrooms. When you draw the contours of the cost function they might look like very elongated ovals, and gradient descent is going to have a lot of difficulty finding the optimum. When you scale your features to standard deviation 1, the contours look more round and gradient descent can take a much straighter path to the optimum, i.e. the minimum. To picture this I will borrow images from Andrew Ng’s lecture slides.
The code looks something like this:
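A minimal sketch of standardisation with scikit-learn's StandardScaler, assuming made-up values for square-foot area and number of bedrooms. The scaler is fitted on the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up housing features: [square-foot area, number of bedrooms].
X_train = np.array([
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 4.0],
    [1416.0, 2.0],
    [3000.0, 4.0],
])
X_test = np.array([[1985.0, 4.0], [1534.0, 3.0]])

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data,
# then subtract the mean and divide by the standard deviation.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

Reusing the training statistics on the test set keeps information about the test data from leaking into the model.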
Encoding Categorical Data
Sometimes your data has labelled values instead of numerical ones. For example, one of the features of your housing dataset is the state in which the house is located. The state here is labelled, or categorical, data. Most estimators have no idea how to process such an entity, which is why we need to encode categorical data. There are many ways to encode categorical data; we are going to use one-hot encoding here.
In one-hot encoding, the basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. For example, imagine you have a gender feature in column 1 with two distinct values, Male and Female. Your encoded data would now look something like this:
This means that your dataset, instead of having a gender column, now has a Male column and a Female column. For entries that are Male, the Male column value is 1 and the Female column value is 0. Similarly, for Female entries, the Female column value is 1 and the Male column value is 0.
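The gender example can be sketched with scikit-learn's OneHotEncoder. The four rows are invented for illustration. Note that the encoder orders the new columns by sorted category, so Female comes before Male.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with two distinct values (made-up rows).
gender = np.array([["Male"], ["Female"], ["Female"], ["Male"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it a dense array.
encoded = encoder.fit_transform(gender).toarray()

# Categories found in the column, in the order of the output columns.
print(encoder.categories_[0])
# One row per sample, one column per category.
print(encoded)
```

Each row has exactly one 1, in the column of the category that the sample belongs to.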
Now that we are done encoding our data and our dataset is ready, we can fit any estimator to it.
Fitting the estimator to the dataset and making Confusion Matrix
Choosing the right estimator for your machine learning problem is a very important step, and your output depends heavily on the estimator you choose. Since the choice depends on what you want to accomplish and what kind of data you have, it can get a little tricky; it takes practice and experience. Fortunately for those who are just starting out with Machine Learning, scikit-learn provides a very helpful chart for choosing the right estimator for the job.
Once you have chosen the right estimator, fitting it to your data is relatively easy. In the example below, I am using a K Nearest Neighbours classifier for a classification task.
We fit our training data using the classifier’s fit() function. After that, predicting the test results is as simple as calling predict() on your test set, i.e. X_test.
Then we use sklearn’s confusion_matrix to get a confusion matrix, which gives us a better idea of how our model has performed. For a binary classification task, it is a simple 2x2 matrix which describes our model's performance in terms of:
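A minimal sketch of fit() and predict() with a k-NN classifier. The two small clusters of points and the choice of n_neighbors=3 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two made-up, well-separated clusters: class 0 near (0, 0), class 1 near (1, 1).
X_train = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [1.0, 1.0], [0.9, 1.2], [1.1, 0.8],
])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.1, 0.1], [1.0, 0.9]])

# Each prediction is a majority vote among the 3 nearest training points.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(y_pred)
```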
- True Positives — When the predicted output is true and so is the ground truth
- True Negatives — When the predicted output is false and so is the ground truth
- False Positives — When the predicted output is true but the ground truth is false
- False Negatives — When the predicted output is false but the ground truth is true
Note: Confusion Matrix can be just as easily expanded for more than two classes.
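As a small sketch, the four counts can be read off sklearn's confusion_matrix like this. The true and predicted labels are invented; in practice they would be y_test and the output of predict().

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for a binary task (1 = positive).
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Rows are the true classes and columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For labels 0 and 1, flattening the 2x2 matrix row by row gives
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```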
With this confusion matrix you can derive a lot of information about the performance of your model. Furthermore, you can use it to calculate various metrics such as:
- Misclassification Rate
- F1 score
- ROC Curve (each point on it comes from the confusion matrix at one decision threshold, so it needs predicted scores rather than a single matrix)
and many other metrics. But these are the ones you will mostly be looking at. For better intuition we can also visualise the results using a visualisation library.
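To make the first two metrics concrete, here is a sketch that computes them from the four confusion matrix counts. The counts are hypothetical numbers chosen for the example.

```python
# Hypothetical counts from a 2x2 confusion matrix.
tn, fp, fn, tp = 50, 10, 5, 35

total = tn + fp + fn + tp

# Fraction of all predictions that were wrong.
misclassification_rate = (fp + fn) / total

# Precision: of everything predicted positive, how much really was positive.
precision = tp / (tp + fp)
# Recall: of everything really positive, how much the model found.
recall = tp / (tp + fn)
# F1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(misclassification_rate)
print(round(f1, 4))
```

scikit-learn also offers ready-made functions such as f1_score in sklearn.metrics if you prefer not to compute these by hand.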
Visualising Results
To get a better and more intuitive understanding of what a miracle you just created (🎉), let us visualise the result with a library called matplotlib. You can use various other amazing libraries out there, such as Bokeh or Seaborn, but here we are going to stick with matplotlib. I am going to visualise the result of the k-NN classifier I used earlier and give you an understanding of how to interpret the graph.
The ListedColormap class is just used to colour the points of our dataset.
Then we prepare a grid for our graph with np.meshgrid() at a resolution of 0.01, specified by step = 0.01. Basically, we take the minimum of each feature with X_set[:, 0].min() and, in the same way, the maximum with X_set[:, 0].max() to set the range of the grid.
After that, plt.contourf() is the actual visualisation function we use to draw the filled contour. We use our predict() function to predict the region in which each grid point lies (either red or green, as specified by ListedColormap(('red', 'green'))).
Then we set the limits of the x and y axes.
The next step is to actually plot the data points using plt.scatter(), which makes a scatter plot in matplotlib.
After training k-NN on a dataset of age and estimated salary, the results look something like this:
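The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the exact original code: the data is synthetic (two random clusters standing in for scaled age and estimated salary), n_neighbors=5 is an arbitrary choice, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, already scaled two-feature data in two classes.
rng = np.random.RandomState(0)
X_set = np.vstack([
    rng.normal(-1.0, 0.5, size=(30, 2)),
    rng.normal(1.0, 0.5, size=(30, 2)),
])
y_set = np.array([0] * 30 + [1] * 30)

classifier = KNeighborsClassifier(n_neighbors=5).fit(X_set, y_set)
cmap = ListedColormap(("red", "green"))

# Grid covering the data range (plus a margin of 1) at a resolution of 0.01.
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, step=0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, step=0.01),
)

# Predict the class of every grid point, then reshape back to the grid.
grid_points = np.column_stack([X1.ravel(), X2.ravel()])
Z = classifier.predict(grid_points).reshape(X1.shape)

# Filled contour: the red region is predicted class 0, the green region class 1.
plt.contourf(X1, X2, Z, alpha=0.75, cmap=cmap)
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Scatter the actual points on top, coloured by their true class.
for i, label in enumerate(np.unique(y_set)):
    plt.scatter(
        X_set[y_set == label, 0], X_set[y_set == label, 1],
        color=cmap(i), edgecolor="black", label=str(label),
    )

plt.title("k-NN decision regions (synthetic data)")
plt.xlabel("Feature 1 (scaled)")
plt.ylabel("Feature 2 (scaled)")
plt.legend()
plt.savefig("knn_decision_regions.png")  # hypothetical output file name
```

A point whose colour differs from the region it sits in is one the classifier gets wrong. In an interactive session you would drop the Agg line and call plt.show() instead of saving.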
THAT IS ALL!
Yes, that’s all there is to getting started with making your own Machine Learning model. If you have any doubts, leave a comment, and feel free to suggest edits :)