Machine Learning — A baseline helps comprehensibility
Machine Learning… It started two years ago when I joined a small, bustling, really ambitious startup, Wynk Limited. The task of building intelligent Music Search and Discovery soon landed on my plate. That was my first encounter with Machine Learning, as I went on to explore Music Recommendation, Auto-suggestion and Relevance re-ranking for Search. Today, after two years of practicing Software Engineering and Machine Learning, I sit here at North Carolina State University, half a world away, still practicing both. One thing I should tell you: Machine Learning can be as simple as you want, or as complex as you wish! “You can check out any time you like, but you can never leave!” Haha, not entirely true, but machine learning does captivate you. In this post, I try to understand the high-level details of Machine Learning, especially with respect to businesses. I want to explain why a simple, comprehensible baseline model is important before you delve into off-the-shelf complex machine learning models.
Most of the problems I have solved are in natural language processing and involve word vectorization. Word frequency can be modeled as a probabilistic distribution. Hence, more often than not, I build a Naive Bayes model as my baseline and then go on to build more complex ones such as a Decision Tree, a Support Vector Machine and a Random Forest. Wait…! But what are these models? Why do I call one a baseline and the others complex? Before I delve into this, let me tell you why a baseline is important.
“Give me the fruitful error any time, full of seeds, bursting with its own corrections. You can keep your sterile truth for yourself” — Vilfredo Pareto
Consistently, we develop more and more complex software, and consistently, the winners turn out to be the simple ones. Comprehensibility is an important criterion for a Machine Learning model. That is the reason we first build a simple, comprehensible and reasonable model as the baseline, then build more complex ones on top of it and, every time, try to do better than the baseline.
Naive Bayes is a probabilistic model built on the “naive” assumption that all features are conditionally independent given the class. Not only is it easy to understand and explain, it is also very computationally efficient. And Naive Bayes turns out to be better than a random guess almost always. However, I am not claiming that the baseline will always be Naive Bayes. All I say is that NB is a fantastic initial baseline model.
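To make this concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing, written from scratch in plain Python. The toy documents and labels are purely illustrative, not real data:

```python
import math
from collections import Counter, defaultdict

# Toy training set of (tokens, label) pairs. All data here is made up.
train = [
    ("great song love the melody".split(), "music"),
    ("amazing track catchy tune".split(), "music"),
    ("stock prices fell sharply today".split(), "finance"),
    ("market rally lifts shares".split(), "finance"),
]

def fit_nb(train):
    """Estimate log priors and Laplace-smoothed per-class word log likelihoods."""
    class_counts = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)        # label -> word -> count
    vocab = set()
    for tokens, label in train:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: math.log(n / len(train)) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return priors, likelihoods

def predict_nb(tokens, priors, likelihoods):
    """Pick the class maximizing log prior + sum of word log likelihoods.

    Words never seen in training are simply skipped (the "naive" independence
    assumption lets us sum per-word scores).
    """
    scores = {}
    for c in priors:
        scores[c] = priors[c] + sum(
            likelihoods[c][w] for w in tokens if w in likelihoods[c]
        )
    return max(scores, key=scores.get)

priors, likelihoods = fit_nb(train)
print(predict_nb("love this catchy melody".split(), priors, likelihoods))  # music
```

The whole model is two dictionaries of counts, which is exactly why it makes such a readable baseline: you can print the counts and see why any prediction was made.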
Classification is a fundamental task in Machine Learning, and Decision Trees are very intuitive here because they are so comprehensible. How is a Decision Tree created? It is simple: look at the diversity of the class labels, split on the feature that most reduces the diversity of the two resulting branches, and do this recursively until you are satisfied that the diversity is low enough. For categorical labels (ordinal and nominal), entropy is used as the measure of diversity, while standard deviation is used for numerical ones (interval and ratio). A Random Forest is just a collection of these Decision Trees, also known as an ensemble. An ensemble generalizes the decision across multiple models, so its error is generally lower than that of any individual model. However, we start losing comprehensibility now!
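The splitting rule above can be sketched in a few lines. This picks the first split for a tiny hypothetical dataset (the feature names and labels are invented for illustration) by maximizing information gain, i.e. the reduction in entropy:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def split_gain(rows, labels, feature):
    """Information gain of splitting on a boolean feature:
    parent entropy minus the size-weighted entropy of the two branches."""
    left = [l for r, l in zip(rows, labels) if r[feature]]
    right = [l for r, l in zip(rows, labels) if not r[feature]]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical toy data: will a listener skip a track?
rows = [
    {"long_intro": True,  "live": False},
    {"long_intro": True,  "live": True},
    {"long_intro": False, "live": False},
    {"long_intro": False, "live": True},
]
labels = ["skip", "skip", "play", "play"]

best = max(["long_intro", "live"], key=lambda f: split_gain(rows, labels, f))
print(best)  # the root of the tree splits on this feature
```

A real tree builder just applies this choice recursively to each branch, which is why you can read the finished tree top-down as a series of plain-English questions.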
A Support Vector Machine (SVM) is slightly different from the other models. An SVM is built on the observation that certain data points matter more than others, especially the ones very close to the decision boundary. These points (essentially, vectors) are called support vectors. The SVM tries to maximize the margin between the class labels. Using Lagrangian multipliers and quadratic programming, it finds the data points (vectors) that are most useful for class differentiation. From the mathematics, an SVM may seem a very complex model; however, when you look closely, every step of the SVM primal and dual problems can be explained geometrically. I believe an SVM is comprehensible, as it gives us the specific vectors that discriminate between the class labels. Moreover, we can tune the penalty for misclassification in the optimization problem to get as hard a margin as we want.
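You do not need a quadratic-programming solver to see the idea. Here is a rough sketch of a soft-margin linear SVM trained by subgradient descent on the hinge loss; the tiny 2-D dataset is made up, and the hyperparameters are illustrative, not tuned:

```python
# Tiny linearly separable 2-D data with +1 / -1 labels. Illustrative only.
X = [(2.0, 2.5), (1.5, 2.0), (3.0, 3.0), (-2.0, -1.5), (-1.0, -2.0), (-3.0, -2.5)]
y = [1, 1, 1, -1, -1, -1]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Soft-margin linear SVM via subgradient descent.

    Minimizes lam/2 * ||w||^2 + average hinge loss max(0, 1 - y * (w.x + b)).
    The regularization term lam plays the role of the misclassification
    penalty: smaller lam pushes toward a harder margin.
    """
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), yi in zip(X, y):
            margin = yi * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:
                # Point is inside the margin (a "support vector" candidate):
                # both the hinge subgradient and the regularizer apply.
                w[0] += lr * (yi * x1 - lam * w[0])
                w[1] += lr * (yi * x2 - lam * w[1])
                b += lr * yi
            else:
                # Safely classified: only the regularizer shrinks w.
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

w, b = train_linear_svm(X, y)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in X]
print(preds)
```

After training, the points whose margin `y * (w.x + b)` hovers near 1 are the support vectors; the far-away points stopped influencing `w` long ago, which is the geometric story told above.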
The steps of solving a machine learning problem would roughly be as follows:
- Identify your problem and data set.
- Data cleaning, pre-processing and simple visualization to understand the patterns of different variables and class labels.
- Come up with a baseline model. As you have seen, Naive Bayes acts as a tremendously good baseline model because of its probabilistic approach. It is comprehensible and simple.
- Establish your baseline evaluation metrics, such as Precision, Recall, F1 score, F2 score, etc.
- Try 2–3 different models, such as NB, Decision Tree, Random Forest, SVM, etc., to find the best baseline model.
- Now is the time for real engineering. Start with the various techniques followed in practice, such as dimensionality reduction (e.g., Principal Component Analysis), different kernels, adding external features (word embeddings in NLP), etc. Combine different classifiers optimized for particular class labels. For instance, use a max-voting ensemble of two different classifiers, say an SVM with a linear kernel and a Random Forest, to predict two different class labels.
- Compare the performance of the new classifier against the baseline using the evaluation metrics defined earlier.
- If the performance is better, establish this model as the baseline for future developments. And continue, and continue. As more and more diverse data shows up for your problem, you will constantly need to beat your baseline model, either by creating new models or by improving the hyperparameters of the baseline or the new model.
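The comparison step in the loop above can be sketched very simply. This computes precision, recall and F1 from scratch and promotes a candidate model only if it beats the baseline; the prediction lists are hypothetical stand-ins for real model output:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 from two parallel label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical predictions from the baseline and a candidate on one test set.
y_true    = ["pos", "pos", "neg", "neg", "pos", "neg"]
baseline  = ["pos", "neg", "neg", "pos", "pos", "neg"]
candidate = ["pos", "pos", "neg", "neg", "pos", "pos"]

base_f1 = precision_recall_f1(y_true, baseline, "pos")[2]
cand_f1 = precision_recall_f1(y_true, candidate, "pos")[2]
# Promote the candidate only if it beats the baseline on the agreed metric.
print(cand_f1 > base_f1)
```

The key discipline is that the metric and the test set are fixed before the experiment, so "better than the baseline" means the same thing in every iteration.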
We generally lose comprehensibility as we move to more complex models. But if we have good baseline models, we can always explore comprehensibility in terms of them. For instance, suppose we build an NB model for a Natural Language Processing task: categorizing news articles on the web. A deep learning model with GloVe-trained word embeddings and Long Short-Term Memory units may perform better. But how do we explain what the word embeddings are doing? We can look at the top words the NB model uses for classification and compare them against the neural network model to probe its comprehensibility. A Machine Learning model without appropriate comprehension is a black box, and a business does not thrive on black boxes. A business works on actionable analytics and an understanding of its experiments, successes and failures. Essentially, comprehension of every feature that is rolled out, and a reason for a particular model to work or not to work.
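The "top words" check mentioned above is cheap to do with an NB baseline: rank each word by its log-odds between two classes. The word probabilities below are hypothetical stand-ins for the smoothed likelihoods a trained news classifier would produce:

```python
import math

# Hypothetical smoothed word probabilities from a trained NB news classifier.
word_probs = {
    "sports":   {"match": 0.04,  "goal": 0.03,  "election": 0.001, "market": 0.002},
    "politics": {"match": 0.002, "goal": 0.001, "election": 0.05,  "market": 0.004},
}

def top_indicative_words(word_probs, cls, other, k=2):
    """Rank words by log-odds of cls vs other: a cheap comprehensibility probe.

    A large positive score means the word strongly signals `cls`.
    """
    scores = {
        w: math.log(word_probs[cls][w] / word_probs[other][w])
        for w in word_probs[cls]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_indicative_words(word_probs, "sports", "politics"))
```

If the complex model's predictions shift on documents containing exactly these words, that is evidence it has learned something similar to the interpretable baseline, and that comparison is the comprehensibility bridge the paragraph above describes.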
Baselines are important in businesses. Baselines enable multiple quick iterations. Baselines are benchmarks for your use case. It is often necessary to roll out multiple versions of a product and note down the results of the experiments to analyze the model. Then we learn the pros and cons of a particular model, not only with respect to the baseline evaluation metrics but also with actual customer experience. A black-box model does not let us incorporate the learnings of this outcome. A comprehensible model, however, lets us understand why a certain feature works and another does not. A baseline is important to establish comprehensibility for complex machine learning tasks and businesses.
Let’s continue exploring machine learning applications in further blogs…
As I work mostly in Natural Language Processing, primarily in text-based search, I have written some blogs on setting up and using Apache Solr in previous stories, i.e. what is SOLR, setup solr and configuring solr.