So far, we have had an overview of what Machine Learning is, its practical use cases, and how we can leverage its power. We also discussed the different types of Machine Learning techniques.
We will now look at the Machine Learning pipeline and its workflow.
At first glance it is a simple 5-step process, but a word of caution beforehand: although the process is sequential, the individual steps are iterated many times so that the model improves over time.
1. Get Data
Data is the ultimate fuel for a Machine Learning algorithm, so the first step is to gather it.
What data we need depends on the type of problem we are tackling and the kind of input and output we desire.
For example: if I want to predict the number of Covid19 cases, my output should be a continuous real number, so I can make use of Regression-based techniques.
If I want to do face recognition, I expect an output video with a bounding box drawn around each face. Identifying whose face it is is a Classification task (locating the box itself is an object-detection task).
Hence, the most important step is to understand the motive of the project and to identify the input and output formats. Our aim in building a Machine Learning model is to get meaningful insights and consistent results.
Once the motive is set, we embark on a quest to find the right data. Most of the time, if we are trying to solve a common use case, curated datasets are readily available and can be imported easily. But if we are trying to solve a novel problem for which no dataset exists, we need to use web-scraping techniques to extract raw data from various sites.
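When a curated dataset exists, importing it often really is one line. As a minimal sketch, here is scikit-learn's built-in iris dataset standing in for whatever dataset your own problem needs:

```python
# Loading a curated dataset: scikit-learn ships several classic
# datasets that can be imported in a single call.
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)   # features and labels as pandas objects
X, y = data.data, data.target
print(X.shape)                    # 150 samples, 4 features
```

For datasets found on the web, `pandas.read_csv` with a URL or file path plays the same role.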
“Getting the right data is the most difficult thing in the entire Machine Learning Workflow.”
2. Clean, Prepare and Manipulate Data
Once the dataset is obtained or created, the next and most important step is to clean the data.
Now what is meant by cleaning the data?
Most of the time, data cannot be used in its natural raw form; it needs to be cleaned and filtered so that it has a proper structure and uniformity.
Let's understand this in depth:
Use-case: I want to analyze the Instagram comments on my post so that I can get the overall view of my followers.
Input : Since I want to analyze comments, my input will be the comments, i.e. the text data.
Output : I just want to know whether my followers like my post or dislike it, so I can simply encode:
1 → Like
0 → Dislike
Type of problem : It's a classic example of a Classification problem.
If you are not aware of the types of Machine Learning techniques, have a look at the previous part of this article.
So, cleaning the data would involve removing special characters like (#, *, !, …), extra whitespace and numbers.
P.S. Check out regex for the cleaning purpose.
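As a minimal sketch of that regex-based cleaning (the sample comment is made up for illustration):

```python
import re

def clean_comment(text):
    """Strip special characters, digits and extra whitespace from a comment."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # drop #, *, !, digits, ...
    text = re.sub(r"\s+", " ", text)          # collapse repeated whitespace
    return text.strip().lower()

print(clean_comment("Loved it!!! #awesome 100%"))  # → "loved it awesome"
```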
Once the data is clean, we then look for anomalies in it.
This includes taking care of missing values: we can either drop the affected rows or use various imputation methods. We will cover them in detail in upcoming articles.
There is also a strong possibility that the values in different input features lie on very different scales, i.e. some might lie in [0, 30] and others in [90000, 1000000]. In such cases we need to scale the data so that all features lie on a common scale. This helps the model estimate its coefficients well.
Now comes the best part of Machine Learning: feature selection. Once the data is preprocessed, we need to look for the best features to keep, since we might not require all of them.
Do I need to do this manually?
There is no need for manual inspection; there are plenty of mathematical formulations and theories for this. We can simply use variance thresholds and correlation thresholds to drop some features.
P.S. We can do PCA, RFE and a lot more.
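As a sketch of the variance-threshold idea (the feature names here are hypothetical): a feature that never varies carries no information, so it can be dropped automatically.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: "const" is the same for every sample.
df = pd.DataFrame({
    "age":    [22, 35, 58, 41],
    "income": [30, 90, 120, 75],
    "const":  [1, 1, 1, 1],
})

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
kept = selector.fit_transform(df)
print(df.columns[selector.get_support()].tolist())  # → ['age', 'income']
```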
3. Train the model
You might think that training the model is difficult, but it is actually the easiest part: just run 3 to 4 commands in Python and you are good to go.
Python has a rich variety of libraries for Machine Learning: pandas, numpy, matplotlib and sklearn are among the best and simplest to get started with right away.
Training a model involves complex mathematical computations and can take a long time, but you don't have to do anything in between, so meanwhile you can relax and sleep!
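Those "3 to 4 commands" look roughly like this. A sketch on the built-in iris dataset (your own data and choice of model will differ):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # the complex math happens in here
print(model.score(X_test, y_test))
```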
4. Test the model
Choosing the right set of hyper-parameters, loss functions and other required settings needs some knowledge and intuition.
That said, nowadays even these things can easily be done automatically.
You need to keep an eye on the accuracy on both the train and the test datasets.
There is a strong possibility that the model may start to overfit or underfit.
Overfitting is basically when your model, while trying to learn from your data, mugs up the data so well that it forgets to generalize. If I ask a question the model was trained on, it gives me the perfect answer, but if I ask an almost identical question it fails drastically.
Underfitting, on the other hand, is the exact opposite: the model is not even able to learn the training data well. Try increasing the model's capacity or training for longer (more data, in contrast, is the usual remedy for overfitting).
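Comparing train and test accuracy is the quickest diagnostic. A sketch: an unconstrained decision tree will usually memorize the training set, and the gap to the test score hints at overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)
print("train:", deep.score(X_train, y_train))  # typically near-perfect
print("test :", deep.score(X_test, y_test))    # lower; a big gap = overfitting
```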
“A good Machine Learning Algorithm is the one which is able to generalize its observations and hence delivers excellent performance on new unseen data.”
5. Improve the model
It might seem that this is the last step and that afterwards we are ready to deploy our Machine Learning model. But before that we need to improve the model, and this is where real knowledge and practical experience come into play.
Improving the model can become an endless cycle if performed without an understanding of the various results and test scores. Accuracy is not the only metric for evaluating a model.
“No matter how perfect you are, there is always a scope of improvement.”
After many iterations with new methods and strategies you can achieve a fairly accurate model. Do brainstorm about how to improve it. The best way is to visualize the learning curves and see where the problem is occurring.
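scikit-learn can compute those learning curves for you. A sketch on the iris dataset (the estimator and cross-validation settings here are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.2, 1.0, 5),
    shuffle=True, random_state=0)

# A large, persistent gap between the two curves suggests overfitting;
# two low, converged curves suggest underfitting.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```

Plotting these two means against `sizes` with matplotlib gives the familiar learning-curve picture.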
“If there are two models with similar performance, always go for the simpler one.”
Thank you! We will cover a hands-on approach to all the steps stated above in the next part. Stay tuned.
Suggestions are always welcome.