Srijit Mukherjee
Job after MBA — A Data Analysis View

Data Description

You can find the data here. Let me briefly describe it so the rest of the project is easy to follow.

The data set consists of placement records of students on a college campus. It includes secondary and higher secondary school percentages and specializations, degree specialization and type, work experience, and the salary offered to the placed students.

The data contains the following Predictors:

  1. Gender (Categorical)
  2. Secondary Education Percentage (Numerical)
  3. Secondary Board of Education (Categorical)
  4. Higher Secondary Education Percentage (Numerical)
  5. Higher Secondary Board of Education (Categorical)
  6. Specialization in Higher Secondary Education (Categorical)
  7. Degree Percentage (Numerical)
  8. Field of degree Education (Categorical)
  9. Work Experience (Categorical)
  10. Employability Test Percentage (Numerical)
  11. Post Graduation (MBA) Specialization (Categorical)
  12. MBA percentage (Numerical)

The Response Variables are

  1. Status of Placement (Categorical)
  2. Salary (Numerical)


Which factors are really important in getting placed?

Installing and Loading Packages in R
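The code below uses rpart and rpart.plot for the decision tree, rattle for fancyRpartPlot, and Boruta for feature selection. A minimal setup looks like this (the install line is commented out; run it once if the packages are missing):

```r
# install.packages(c("rpart", "rpart.plot", "rattle", "Boruta"))
library(rpart)       # decision trees
library(rpart.plot)  # rpart.plot()
library(rattle)      # fancyRpartPlot()
library(Boruta)      # Boruta feature selection
```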


Loading and Cleaning the Data

#read the data
data <- read.csv("data.csv")
#convert the categorical columns to factors and the numeric columns to numeric
#(lapply keeps the data frame structure; sapply would coerce it to a matrix
#and turn the factors back into integer codes)
data[, c(2,4,6,7,9,10,12,14)] <- lapply(data[, c(2,4,6,7,9,10,12,14)], as.factor)
data[, c(1,3,5,8,11,13,15)] <- lapply(data[, c(1,3,5,8,11,13,15)], as.numeric)
#replace NA with 0 (unplaced students have no salary recorded)
data[is.na(data)] <- 0
#For the Job Status analysis, drop the useless serial number column and the
#Salary column; the Salary analysis below drops serial number and Status instead.
Data_status <- data[, -c(1,15)]

Method 1 (Decision Tree)

#Job Status decision tree
model_status <- rpart(status ~ ., data = Data_status, method = "class")
rpart.plot(model_status, box.palette = "RdBu", shadow.col = "gray", nn = TRUE)
fancyRpartPlot(model_status) #for better visualization (rattle package)
#predict on the same data and compute the confusion matrix and accuracy
predict_unseen <- predict(model_status, Data_status, type = "class")
table_mat <- table(Data_status$status, predict_unseen)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))
#> "Accuracy for test 0.893023255813953"

You can see from the tree that the important variables for getting placed are

  1. Secondary Percentage
  2. Higher Secondary Percentage
  3. Degree Percentage
  4. MBA Percentage
  5. Work Experience
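The list above is read off the fitted tree, but rpart also records numeric importance scores that you can inspect directly. A quick sketch, assuming the `model_status` object fitted above:

```r
#variable importance scores recorded by rpart during splitting
#(larger values = more influence on the placement decision)
importance <- model_status$variable.importance
print(round(importance, 2))
#a simple bar plot of the same scores, largest first
barplot(sort(importance, decreasing = TRUE), las = 2, cex.names = 0.7)
```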

Method 2 (Boruta: Feature Selection by Random Forest Wrappers)

#replace NA salaries (unplaced students) with 0, then drop the serial number
#and Status columns for the Salary analysis
data[is.na(data)] <- 0
Data_salary <- data[, -c(1,14)]
#run the Boruta feature-selection wrapper around random forests
boruta_salary <- Boruta(salary ~ ., data = Data_salary, doTrace = 2, maxRuns = 100)
plot(boruta_salary, las = 2, cex.axis = 0.5)
getSelectedAttributes(boruta_salary, withTentative = FALSE)

You can see from here that the important variables for salary are

  1. Secondary Percentage
  2. Higher Secondary Percentage
  3. Degree Percentage
  4. Gender
  5. Specialization
  6. MBA Percentage
  7. Work Experience
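Beyond the list of selected attributes, Boruta keeps per-feature importance statistics that you can tabulate. A sketch, assuming the `boruta_salary` object from above:

```r
#summary table of Boruta importance statistics per predictor:
#mean/median/min/max importance across runs and the final decision
stats <- attStats(boruta_salary)
print(stats[order(-stats$meanImp), ])
```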

Note: Predicting salary with interpretable methods such as ANOVA, linear models, and decision trees gave a really poor fit, so I have not included those results here to avoid confusion.

Note: I also did not show a train/test split in the first part, because I only wanted to show that this is a valid line of plausible reasoning for quick decisions.
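If you do want a held-out estimate, a minimal train/test split would look like this (a sketch; the 80/20 split and the seed are my choices, not part of the original analysis):

```r
set.seed(123)                                   #for reproducibility
n <- nrow(Data_status)
train_idx <- sample(seq_len(n), size = 0.8 * n) #80% train, 20% test
train <- Data_status[train_idx, ]
test  <- Data_status[-train_idx, ]
model <- rpart(status ~ ., data = train, method = "class")
pred  <- predict(model, test, type = "class")
mean(pred == test$status)                       #held-out accuracy
```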

Thanks for reading the post.

Stay Tuned! Stay Blessed!

If you like the article, don’t forget to clap and share it with your enthusiastic friends.

Srijit Mukherjee.


