Auto-Tagging Questions on Seedly with Deep Learning
Introduction
A month back, my friend and I embarked on a project to automate the question tagging process for Seedly. With the original tagging process being a manual task, an automated tagging system would save a great deal of time and manpower. On top of the obvious perks of automation, this auto-tagging system also improves the user experience by reducing the time and effort users spend tagging their questions, and it does so with greater accuracy. Fast forward to today, our auto-tagging system is complete and has achieved an F1-Score of 90%! Although Seedly has yet to officially deploy our system on their website, I would like to outline some of my reflections and thought processes throughout this project.
This article will outline my thoughts and findings around these 4 main points:
1. Multiple Labels
2. Objective and Subjective Labels
3. Imbalance of Labels
4. Ease of Deployment
Multiple Labels
Dealing with multiple labels for each datapoint was not common in my academic journey as a data scientist. Just to get the jargon out of the way, in terms of a machine learning classification problem, there are two main kinds of scenarios: multi-class and multi-label. A multi-class problem means classifying something into a single class, or category, out of multiple classes. A multi-label problem, on the other hand, means classifying something into more than one category at a time, out of multiple categories. Seedly’s Question and Answer (Q&A) page has over 30 categories and each question can be labelled with more than one category at a time, which places this project squarely in multi-label territory! However, quite a few considerations came up while working in a multi-label scenario.
Firstly, a multi-label scenario restricts the machine learning models that we are able to use as the basis of our auto-tagging system. The intuition behind this is actually simpler than I thought. Multi-class problems imply that the categories are mutually exclusive: once a category is chosen, the other categories are not chosen or even considered. This feels natural for a computer since it essentially emulates binary decision making. If you think about it, a binary classification (1 or 0) is a special multi-class scenario with only 2 categories. If you extend the number of categories to 3, it would just look like this: (2 or (1 or 0)). If it is not 2, then it is either 1 or 0.
On the other hand, multi-labelling approaches things a little differently. When assigning any one label, it makes sense to also consider the other labels on the same datapoint. For a more visual example, take a look at this cat.
Let’s say we have these labels for the possible colour combinations of a cat: [white, black, grey, orange]. Now let’s say we know that the colour orange is present on the cat. Do you think that information would help estimate the probability of the other labels being positive? With my current knowledge of cats, I would say that I have never seen a grey and orange cat before. Therefore, if I know that the colour orange is present, I would say that the probability of grey also being present on the cat is low, while the chances of colours like white or black are high! This shows the increased complexity of multi-label problems compared to multi-class ones.
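To make the difference concrete, here is a small sketch of how multi-label targets are usually represented as a binary matrix, using scikit-learn’s MultiLabelBinarizer and the cat-colour labels above (the cats themselves are purely illustrative):

```python
# Multi-label targets: each datapoint can have several 1's at once,
# unlike multi-class where exactly one category would be picked.
from sklearn.preprocessing import MultiLabelBinarizer

colours = ["white", "black", "grey", "orange"]
mlb = MultiLabelBinarizer(classes=colours)

cats = [("white", "orange"), ("black",), ("white", "black", "grey")]
print(mlb.fit_transform(cats))
# [[1 0 0 1]
#  [0 1 0 0]
#  [1 1 1 0]]
```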
Secondly, as a result of this cross-referencing between labels, some complications to the labelling accuracy arose. This was simply due to the nature of the questions themselves. Given that Seedly mainly deals with financial topics, most of the keywords we found were financial in nature as well, which made it difficult for the system to discern patterns for the various sub-categories under the main financial umbrella. For example, a question about Savings could look like a question about Retirement, even if that was not the intention.
This also brought up something interesting to consider. If a user asks a question intended to be only under Savings but uses keywords that allude to Retirement as well, should it, objectively speaking, also be under Retirement?
Objective and Subjective Labels
The data that we fed into our machine learning system consisted only of the labels that the users themselves tagged onto the questions. This is definitely not the best arrangement, as the users may not have labelled the questions properly; as with any human decision, it is subjective. Hence, if we feed this user-skewed data into the auto-tagging system, it would only learn the labelling patterns of the users and not the objectively correct labels.
Although, intuitively, it makes sense to prefer objective over subjective, making use of the user inputs could prove useful as well. After all, subjective data is part of what makes machine learning models so useful to begin with! With subjective data, the machine would be able to learn the patterns of the users and ‘understand’ why certain decisions were made, or why a user labelled a question as such. This fully assumes that the user knows for sure what a question should be labelled as and what it should not. If we give them the benefit of the doubt, it is then easy to see how valuable these subjective labels actually are.
Ultimately, it would be best to find a way to objectively label the questions while still taking the subjective labels into consideration! That is why we developed an ‘if-else’ model, on top of the machine learning model, that objectively labels the questions based on the presence of each category’s keywords. For example, if the word ‘save’ is in the question, the model would label it as ‘Savings’; otherwise, it would not. This helped us generate an objective set of labels that we could then merge with the user labels to train the auto-tagging system.
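As a rough illustration of that ‘if-else’ idea, here is a minimal keyword-based labeller. The category names and keyword lists below are made up for the example and are not Seedly’s actual ones:

```python
# A minimal sketch of a keyword-based 'objective' labeller.
# Keywords and categories here are illustrative placeholders only.
KEYWORDS = {
    "Savings": ["save", "saving", "savings account"],
    "Retirement": ["retire", "retirement", "cpf"],
    "Insurance": ["insure", "insurance", "premium"],
}

def keyword_labels(question: str) -> list[str]:
    """Return every category whose keywords appear in the question text."""
    text = question.lower()
    return [category for category, words in KEYWORDS.items()
            if any(word in text for word in words)]

print(keyword_labels("How much should I save each month before I retire?"))
# ['Savings', 'Retirement']
```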
Imbalance of Labels
Out of the 30-odd categories, each question would on average be labelled with 3–5 categories. This imbalance makes it difficult to evaluate the accuracy of the system. For example, let’s say there are 30 categories and the system labels a question 1 if it is in a category and 0 if it is not. Given a particular question that belongs to only 5 categories, if the system labels 4 out of these 5 categories correctly and also labels the remaining 25 categories correctly, how accurate would you say this classification is? Some possible ways to see this are:
1. 29 out of 30 labels are correct (96.7%),
2. out of the 5 that were supposed to be labelled 1, 4 were labelled correctly (80%). This is also known as Recall
3. out of the 4 labels that were labelled 1, all of them were labelled properly (100%). This is also known as Precision
But if you use the conventional Accuracy metric, which only counts a prediction as correct when the full set of labels matches exactly, this would be considered a 100% incorrect classification.
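As a quick sanity check, here is a small sketch of this worked example using scikit-learn’s metrics. The 30-category setup is the illustrative one from this section, not Seedly’s actual data:

```python
import numpy as np
from sklearn.metrics import hamming_loss, precision_score, recall_score, accuracy_score

y_true = np.zeros((1, 30), dtype=int)
y_true[0, :5] = 1   # the question truly belongs to 5 categories
y_pred = np.zeros((1, 30), dtype=int)
y_pred[0, :4] = 1   # the system finds 4 of them and misses 1

print(1 - hamming_loss(y_true, y_pred))                   # 0.967 -> 29/30 labels right
print(recall_score(y_true, y_pred, average="micro"))      # 0.8   -> 4 of the 5 true labels
print(precision_score(y_true, y_pred, average="micro"))   # 1.0   -> all 4 predicted 1's are right
print(accuracy_score(y_true, y_pred))                     # 0.0   -> exact-match accuracy calls it wrong
```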
The first way mentioned above (the one with 96.7%) corresponds to a metric called the Hamming Loss, which looks at the fraction of individual labels that are wrong as a whole (96.7% here is simply one minus that loss). However, this view does not make much sense when only 3–5 categories are labelled 1 on average. Even if we set every label to 0, we would still get roughly 25/30 (83%) of the labels right, or higher. That may seem fantastic on paper, but it is not truly representative of what we want to evaluate, which is how good the system is at predicting the right categories to label as 1.
Precision and Recall resonate more with how we want to evaluate our system. Recall combats the trivial strategy of labelling all categories as 0. It looks at the categories that should actually be labelled 1 and compares them to the predicted labels for those same categories. Hence, if we predicted all 0’s, the Recall would evaluate to 0%. On the other hand, Precision combats the flip side of labelling too many categories as 1. It looks at all labels that were predicted as 1 and compares them to the categories that should actually be labelled 1. Hence, if we labelled all 30 categories as 1 when only 5 categories are actually 1, the Precision would evaluate to only 5/30 (16.7%). Thankfully for us, there is a metric called the F1-Score that makes use of both Recall and Precision.
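For reference, the F1-Score is the harmonic mean of the two:

F1 = 2 × (Precision × Recall) / (Precision + Recall)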
Basically, the equation shows how the F1-Score takes both Precision and Recall into consideration. Only when both of these values are 100% would the F1-Score be 100% as well; if either Precision or Recall falls short, the F1-Score is compromised. As mentioned in the introduction, our system managed to achieve an F1-Score of 90%!
Ease of Deployment
Programming is not a skillset that everyone has. Therefore, packaging the system in a way that makes it easier to use and understand seemed like an obvious step in our project. This required quite a bit of code cleaning and organising to fit everything into various forms of ‘packaging’. I honestly enjoyed this process as it felt pretty therapeutic. Anyway, here are 3 iterations of the ‘packaging’ for this auto-tagging system.
In this image, you can see how raw the code is in general; if you have no experience in programming, it probably looks mind-boggling to you. Even if you do know how to code, cleaning this up and organising it better would definitely help.
In this next image, the working code is consolidated into a single package/module, as you can see on the left. This package/module can be executed to tag questions using just a few lines of code, as seen on the right. Everything is so much cleaner and easier to use in general!
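To give a feel for it, here is a hypothetical sketch of what using such a packaged module could look like; the module name `autotagger` and its functions are my own illustration, not the actual code:

```python
# Hypothetical usage of a packaged auto-tagger module (names are illustrative).
from autotagger import AutoTagger

tagger = AutoTagger.load("model.pkl")   # load the trained model once
tags = tagger.predict("Should I top up my CPF or invest the money instead?")
print(tags)   # e.g. ['CPF', 'Investments']
```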
In this last image, the code sits in the backend of a website, and the website itself showcases how to use such an auto-tagging system. If it makes any sense, the code is like the backstage crew at a concert while the website is like the performer on stage. Compared to the very first iteration shown above, it no longer takes someone who knows how to code to work this auto-tagging model.
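As a minimal sketch of how the ‘backstage’ part could sit behind a website, here is one way to expose the same hypothetical `autotagger` module through Flask; Seedly’s actual stack and our deployed setup may well differ:

```python
# A minimal sketch of serving the tagger behind a web backend (illustrative only).
from flask import Flask, request, jsonify
from autotagger import AutoTagger

app = Flask(__name__)
tagger = AutoTagger.load("model.pkl")   # load the trained model at startup

@app.route("/tag", methods=["POST"])
def tag_question():
    question = request.json["question"]
    return jsonify({"tags": tagger.predict(question)})

if __name__ == "__main__":
    app.run()
```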
Conclusion
Voila! That’s pretty much it. It was definitely interesting to see how machine learning and coding could be used for an application like this. Looking for things that could be improved by using machine learning is no simple task though. It’s always more comfortable to stick to the status quo, but if we do that, we would rarely see progress. Hopefully, after this project, I will improve not only my coding and problem-solving skills, but also my knack for finding aspects of the world that could benefit from coding, machine learning and AI.
Thanks for reading this far and stay safe!