Understanding the Common Interest between University Students and Industry with Explainable Machine Learning
How the public sector uses a machine learning model for strategic decision-making and for improving a digital product
Executive Summary
The Magang dan Studi Independen Bersertifikat (MSIB) program was created by the Ministry of Education, Culture, Research, and Technology (MoECRT) to make it easier for students and industry to find each other. Developing such a program requires a good understanding of both sides' interests. There are many factors to consider, such as students’ academic background, students’ domicile, industry category, and type of job position.
A machine learning model can address this challenge by framing the problem as a binary classification task, i.e. predicting whether a student and an industry partner are a match given the observed factors. Random forest is one of the models typically used in practice because of its good performance. Although random forest can rank the importance of each factor to the model, translating those importance scores into their effect on an individual prediction remains difficult.
Fortunately, a breakthrough from research in explainable machine learning called SHAP (SHapley Additive exPlanations) can fill this gap by assigning a contribution score to each factor. The enhanced factor importance can be used to identify impactful factors for targeted industry acquisition and for program improvements, such as more accurate position recommendations. In this way, SHAP proves helpful in refining the user experience for students and industry partners in the MSIB program and on the Kampus Merdeka platform.
MSIB Program
The Ministry of Education, Culture, Research, and Technology (MoECRT) has initiated a program to bridge the gap between college students and industry. It is called “Magang dan Studi Independen Bersertifikat”, or MSIB for short, where students can search for internships or independent study opportunities. Industry, in turn, can find talent to fill its needs. The key to this initiative is how easily both parties can find and connect with each other; that is one challenge to address in the MSIB program.
Let’s say, for example, there are 500 law students and a couple of industry partners who offer relevant opportunities with a total quota of 50. That covers 10% of the students, which means the remaining 90% must either find other, less relevant opportunities or not participate in MSIB at all. Conversely, if industry partners need 200 accounting students but no more than 20 accounting students are available, opportunities will be wasted. Both scenarios give students and industry a bad experience because they cannot accomplish what they aim for, i.e. being connected to the other side.
Such scenarios can be minimized by optimizing the MSIB program to accommodate the interests of both students and industry; to be precise, by studying the multiple factors related to a student and an industry partner being a match. The factors are collected from both the students’ and the industry’s interests. Developing a machine learning model is one way to solve this kind of task. However, having a good model is not enough: the model must also be well understood by the business to support the right decisions. This is where SHAP helps, by explaining how the model works in a more human-friendly way. Together, a good machine learning model and SHAP give the business insight into optimizing the MSIB program.
Matchmaking
Matchmaking is about determining which pairs of a student & an industry partner lead to a match based on certain factors. Collecting the factors to include in the matchmaking is not a one-person effort. It requires insight from all the teams involved in developing MSIB, both those who work in the field and those who work behind the scenes. Together, they hypothesized 24 potential factors that could contribute to the matchmaking process. From the student side, the factors include the student’s demographic data, academic background, and behaviour in applying to MSIB. From the industry side, the factors include the industry’s demographics, job requirements, and quotas.
There are thousands of observations (or data points) available for matchmaking. Each observation represents a pair of a student & an industry partner with a label indicating a match or no match. The goal is to understand the 24 factors that might affect matchmaking. This kind of problem is commonly addressed in the machine learning literature as a binary classification problem. The task is to develop a model that best represents all observations, where quality is measured by the classification error (e.g. the model says match when in fact there is no match, and vice versa).
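As a minimal sketch of how the classification error described above can be counted, the snippet below compares true labels with predicted labels. The function name and the toy labels are illustrative assumptions, not MSIB data.

```python
def classification_errors(actual, predicted):
    """Count both kinds of mistakes and the overall error rate.

    Labels are encoded as 1 = match, 0 = no match.
    """
    pairs = list(zip(actual, predicted))
    # Model says "match" while in fact there is no match.
    false_match = sum(1 for a, p in pairs if a == 0 and p == 1)
    # Model says "no match" while in fact there is a match.
    missed_match = sum(1 for a, p in pairs if a == 1 and p == 0)
    error_rate = (false_match + missed_match) / len(pairs)
    return false_match, missed_match, error_rate


# Toy labels for eight student-industry pairs (made up for illustration).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

false_match, missed_match, error_rate = classification_errors(actual, predicted)
print(false_match, missed_match, error_rate)
```

Reporting the two kinds of mistakes separately is a design choice: a wrongly predicted match and a missed match may carry different costs for the program.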
Random forest is one of the machine learning algorithms known to give good results. In practice, however, it is better to develop several algorithms for comparison, because a good result is not the only thing to consider. Interpretability, types of data, size of data, and even the business domain are other things to take into account. A model that scores slightly lower but suits the business better is more likely to be chosen. For the student-industry observations, two algorithms are explored, namely logistic regression and random forest. The classical logistic regression is easier to interpret than random forest, but its performance scores are inconsistent, sometimes low and sometimes high. Random forest, on the other hand, has a slightly lower performance score than logistic regression, but its consistency makes it more reliable. Thus, the random forest model is chosen, with an average score of 80%.
Although random forest is less interpretable than logistic regression, factor importance [1] can be used to interpret it. Factor importance is a list of factors with corresponding scores, ordered by each factor’s contribution to the model’s performance. Factor importance can therefore be used to determine which factors matter more in predicting whether a student and an industry partner are a match. That said, the factor importance score from random forest is not directly meaningful to non-technical users. For instance, the factor importance might show that GPA is less important than the industry category, but it doesn’t tell how much more likely a student with a GPA of 3.5 is to get a match than a student with a GPA of 3.0.
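The selection reasoning above (prefer the more consistent model when the average scores are close) can be sketched as follows. The per-fold scores, the `max_mean_gap` tolerance, and the function names are hypothetical placeholders chosen for illustration; only the 80% average for random forest comes from the text, and the numbers are not actual MSIB results.

```python
from statistics import mean, pstdev


def summarize(fold_scores):
    """Return (average score, spread) of per-fold performance scores."""
    return mean(fold_scores), pstdev(fold_scores)


def pick_model(scores_by_model, max_mean_gap=0.03):
    """Among models whose average is within `max_mean_gap` of the best
    average, pick the one with the smallest spread (most consistent)."""
    summaries = {name: summarize(s) for name, s in scores_by_model.items()}
    best_mean = max(avg for avg, _ in summaries.values())
    close_enough = {
        name: spread
        for name, (avg, spread) in summaries.items()
        if best_mean - avg <= max_mean_gap
    }
    return min(close_enough, key=close_enough.get), summaries


# Hypothetical per-fold scores (illustrative only).
scores_by_model = {
    "logistic_regression": [0.90, 0.72, 0.88, 0.70, 0.89],
    "random_forest": [0.80, 0.81, 0.79, 0.80, 0.80],
}

chosen, summaries = pick_model(scores_by_model)
print(chosen, summaries)
```

In practice the per-fold scores would come from a cross-validation routine, for example scikit-learn's `cross_val_score`, rather than being typed in by hand.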
Using SHAP to Enhance the Interpretability
SHAP (SHapley Additive exPlanations) is a method for interpreting how a model arrives at a prediction [2]. For binary classification, it works with the model’s predicted probability of a match (0% means no match, 100% means a match) rather than the bare match/no match output. There is usually a threshold that separates a match from a no match, e.g. a probability of >50% is a match; otherwise, no match. Even better, the predicted probability can be broken down into a contribution score for each factor. These scores resemble random forest’s factor importance scores in that they can be used to determine which factors are more important. However, SHAP scores are more readable for two reasons. Firstly, they apply directly to non-technical users, because a contribution expressed as a change in probability (on a 0% to 100% scale) is easier to understand. Secondly, each score has a sign (positive or negative) that indicates whether a factor favours a match (pushes the probability towards 100%) or no match (pushes the probability towards 0%). In addition, SHAP is model-agnostic, which means it can be used with any machine learning model, including random forest.
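To make the idea of per-factor contribution scores concrete, here is a minimal sketch that computes exact Shapley values by brute force for a tiny, made-up scoring function. It is not the MSIB model and not the optimized algorithm used by the SHAP library; `toy_model`, its three factors, and the background rows are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one observation `x`.

    A factor that is "unknown" is filled in from each background row,
    and the predictions are averaged. Feasible only for a few factors.
    """
    n = len(x)

    def value(known):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in known else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    base = value(frozenset())  # prediction without knowing anything
    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                known = frozenset(subset)
                phi += weight * (value(known | {i}) - value(known))
        contributions.append(phi)
    return base, contributions


def toy_model(features):
    """Hypothetical match probability from (gpa, same_city, field_match)."""
    gpa, same_city, field_match = features
    score = 0.2 + 0.1 * (gpa - 3.0) + 0.1 * same_city + 0.3 * field_match
    if same_city and field_match:
        score += 0.1  # interaction between two factors
    return min(max(score, 0.0), 1.0)


# Made-up background observations and one pair to explain.
background = [[3.0, 0, 0], [3.2, 1, 0], [2.8, 0, 1], [3.4, 1, 1]]
x = [3.5, 1, 1]

base, phi = shapley_values(toy_model, x, background)
for name, score in zip(["gpa", "same_city", "field_match"], phi):
    print(f"{name}: {score:+.3f}")
print("base + contributions:", base + sum(phi), "| model:", toy_model(x))
```

The last line illustrates the additive property: the base score plus the signed contributions reproduces the model's predicted probability for that observation.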
To get a better understanding, an example of SHAP’s factor importance is shown in the following illustration. Factors are sorted by importance, with the most important factor on top. For example, factor 3 is the most important, and higher values of factor 3 (coloured in red) push the prediction towards a match (the SHAP value is positive). How much it contributes to the prediction can be read from the x-axis. The red region spans from around 0.02 to 0.15, which means that a higher value of factor 3 can increase the probability of a match by 2 to 15 percentage points. On the other hand, when the value of factor 3 is lower (blue region), the probability of a match is reduced by ~5 percentage points. The importance order is calculated by averaging contribution scores over all observations, which is why factor 2 ranks as less important even though, in some observations, its contribution score reaches ~18 percentage points.
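The ranking logic described above can be sketched as follows, assuming the order is based on the mean absolute contribution per factor (a common convention for SHAP summary plots). The contribution values below are made up to mimic the pattern in the text and are not taken from the MSIB model.

```python
def rank_factors(shap_rows, factor_names):
    """Rank factors by mean absolute contribution across observations."""
    n_obs = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_obs
        for j in range(len(factor_names))
    ]
    return sorted(zip(factor_names, mean_abs), key=lambda p: p[1], reverse=True)


# Hypothetical contribution scores: one row per observation,
# one column per factor (factor 1, factor 2, factor 3).
shap_rows = [
    [0.01, 0.18, 0.10],
    [0.00, -0.01, -0.05],
    [0.02, 0.00, 0.12],
    [-0.01, 0.01, -0.05],
    [0.01, 0.00, 0.15],
]
factor_names = ["factor 1", "factor 2", "factor 3"]

ranking = rank_factors(shap_rows, factor_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here factor 2 has the single largest contribution (0.18) in one observation, yet it ranks below factor 3 because its contributions are close to zero everywhere else.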
Every factor is assigned a contribution score together with a direction (towards match, towards no match, or indefinite). It’s important to note that the score is relative to the base score, which is the probability of getting a match without knowing anything about the pair. In a balanced dataset (where the numbers of “match” and “no match” observations are the same), the base score is 50%. So, knowing that the GPA is 3.5 (contribution score of +1.2%) changes the probability of a match to 50% + 1.2% = 51.2%. This interpretation is available for all factors and observations, which makes it a useful source of insight for the business. For instance, knowing that industry category C1 has the greatest contribution tells the partnership team to engage more with industry partners from C1.
Last but not least, it cannot be emphasized enough that SHAP cannot tell whether a factor causes the match. The model may say that a higher GPA goes with a higher chance of a match, but that doesn’t necessarily mean GPA is the cause of the match. There might be another, unobserved factor that affects both GPA and being a match, such as the student’s achievements or internship history. To know whether a factor is indeed the cause, experimentation such as A/B testing should be conducted. A separate article has been published about how experimentation is conducted in the public sector.
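The arithmetic above can be written out as a short sketch. The base score is derived from a balanced set of toy labels, the GPA contribution of +1.2% is the figure quoted in the text, and the function name and the 50% threshold default are illustrative assumptions.

```python
def explain_prediction(base_score, contributions, threshold=0.5):
    """Add signed contribution scores to the base score and apply
    the match threshold (a probability above it counts as a match)."""
    probability = base_score + sum(contributions.values())
    label = "match" if probability > threshold else "no match"
    return probability, label


# Balanced toy labels (1 = match, 0 = no match) give a base score of 50%.
labels = [1, 0, 1, 0]
base_score = sum(labels) / len(labels)

# Contribution quoted in the text: knowing that the GPA is 3.5 adds +1.2%.
contributions = {"GPA = 3.5": 0.012}

probability, label = explain_prediction(base_score, contributions)
print(f"{probability:.1%} -> {label}")
```

With more factors, each one would simply add its own signed score to the dictionary, and the same sum would give the final probability.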
Closing Statements
Machine learning models can learn hidden patterns within the observations with good performance. However, particularly for answering business questions, a pattern that is not readable to humans has little practical use. SHAP responds to that challenge by revealing the pattern in a way that is friendly and digestible for everyone. The MoECRT also faces the challenge of working with multidimensional factors of interest and thus finds SHAP helpful for narrowing down which dimensions to focus on. In this case, SHAP helps pinpoint which factors to look at to improve students’ and industry partners’ experience in the MSIB program and on the Kampus Merdeka platform.
Notes & References
[1] Factor importance is usually known as feature importance.
[2] Lundberg, S., & Lee, S.I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30, 4765–4774.