How we used Machine Learning to identify what sort of technologists register on Geektrust

Some time back at Geektrust, we went live with our recommendation engine, which recommends kick-ass developers to our clients based on their profile, years of experience and technology. It was built on Go and Elasticsearch. It matched user profiles against the requirements of the companies and, in addition, surfaced the developers whose profiles were most similar to what the company was looking for. Those similarity matches were our first small steps into the world of Machine Learning.

But we had a problem. We were relying on the technologies that our users entered, and people always entered Java. Not everyone, but at least 60–70% of them did. This resulted in QA Engineers with some Java expertise getting recommended to a company that was looking for a hardcore Java Backend developer. This didn’t look good in front of the customer, so we decided to tighten up our recommendations. What we needed to do was identify what kind of developers were registering on our site. And that’s where Machine Learning came to our rescue.

Training Data

We went with supervised learning, which means classifying data based on previously labelled training data.

As part of this, the first thing we did was list the features we wanted to identify in a developer in order to classify him/her as a Backend, Frontend or Fullstack developer. With that in mind, we came up with a set of features such as the count of backend technologies on a user’s profile, and we scanned their resumes to understand what kind of technologies they had worked on.

Once these were identified, we churned out a file that marked these features against each developer in our database. This was the first step in creating our training data.

In this file, the columns indicate the features and the last column indicates the item we want to predict. In the training data, the existing data was cleaned and marked with a specific role. These labels can be generated with scripts, but not with 100% accuracy, so we had to correct by hand the column that needs to be predicted, which is role. Six of us worked through around 7000 rows of data, verifying, validating and, where required, modifying them.
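To make the shape of that file concrete, here is a minimal sketch of how such feature rows could be built. The technology lists, the column names and the `build_features` and `write_training_file` helpers are all assumptions for illustration; the post does not list the actual features beyond the count of backend technologies and the resume scan.

```python
import csv

# Hypothetical technology lists; the real lists are not given in the post.
BACKEND_TECH = {"java", "go", "python", "sql", "spring"}
FRONTEND_TECH = {"javascript", "react", "angular", "css", "html"}

# Assumed column layout: features first, the label ("role") last.
FIELDNAMES = [
    "backend_tech_count",
    "frontend_tech_count",
    "resume_backend_mentions",
    "resume_frontend_mentions",
    "role",
]


def build_features(technologies, resume_text):
    """Turn one developer's profile into a row of numeric features."""
    techs = {t.strip().lower() for t in technologies}
    # Very naive resume scan: split on whitespace after dropping commas.
    words = set(resume_text.lower().replace(",", " ").split())
    return {
        "backend_tech_count": len(techs & BACKEND_TECH),
        "frontend_tech_count": len(techs & FRONTEND_TECH),
        "resume_backend_mentions": len(words & BACKEND_TECH),
        "resume_frontend_mentions": len(words & FRONTEND_TECH),
    }


def write_training_file(developers, out):
    """Write one feature row per developer, with the role as last column."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for dev in developers:
        row = build_features(dev["technologies"], dev["resume"])
        row["role"] = dev["role"]  # the label that was verified by hand
        writer.writerow(row)
```

A script like this only produces a first draft of the file; the role column still needs the manual verification described above.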

Training & Testing

Once the training data was ready, we had to test the accuracy of a model trained on it. We tried different classifiers like K Nearest Neighbours (KNN), Decision Trees and Support Vector Machines (SVM). We also had to go back and fine-tune the training data, adding more features and removing some of them. All these decisions were based on the accuracy we got from the different classifiers. SVM gave us the best accuracy, and we went ahead with it using a linear kernel. The implementation was in Python. With well-established libraries like scikit-learn, NumPy, SciPy and pandas available, we had little doubt in selecting Python as our go-to language for ML.
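A comparison like the one described above could be sketched with scikit-learn as follows. This is not our actual evaluation script: the data is a synthetic stand-in generated with `make_classification`, the hyperparameters are defaults chosen for illustration, and 5-fold cross-validation is just one reasonable way to measure accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled feature file, with three classes
# playing the part of Backend, Frontend and Fullstack.
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# The three classifier families named in the post; settings are assumed.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm_linear": SVC(kernel="linear"),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {}
for name, clf in candidates.items():
    fold_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()

# Print the candidates from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Which classifier comes out on top depends on the data, so the ranking on this synthetic set says nothing about the ranking on real profiles.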

To train and test the model, the labelled data in the CSV file is split into training and test sets. The model is fitted on the training set, and the test set is then used to check whether its predictions are correct. The model also gives you the probability of each prediction. The predictions made by the SVM were verified against the already labelled data and were pretty accurate.
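The split, fit and check steps described above can be sketched like this. It is a minimal example under stated assumptions rather than the code we run in production: the data is synthetic, the `feature_N` column names and the 80/20 split are made up, and probabilities come from scikit-learn's `probability=True` option on `SVC`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the training CSV. With real data this would be
# loaded with pd.read_csv(...) from the labelled feature file instead.
X_raw, y_raw = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)
roles = ["Backend", "Frontend", "Fullstack"]
data = pd.DataFrame(X_raw, columns=[f"feature_{i}" for i in range(6)])
data["role"] = [roles[i] for i in y_raw]  # last column is the label

# Features are every column except the label.
X = data.drop(columns="role")
y = data["role"]

# Hold out 20% of the rows for testing, keeping the role mix the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Linear-kernel SVM; probability=True enables predict_proba.
model = SVC(kernel="linear", probability=True, random_state=0)
model.fit(X_train, y_train)

# Check predictions on rows the model has not seen.
predicted = model.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f"accuracy on held-out rows: {accuracy:.3f}")

# Per-role probabilities for the first held-out developer.
probabilities = model.predict_proba(X_test)
for role, prob in zip(model.classes_, probabilities[0]):
    print(f"{role}: {prob:.0%}")
```

One thing to keep in mind: the probabilities from `SVC` are estimated with an internal calibration step, so on borderline rows the most probable role can occasionally differ from what `predict` returns.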

And that’s it. Now, every time a developer updates their profile on Geektrust, we build the feature data and predict the role from it. Once we have predicted what kind of developer he/she is, we use that information to recommend them accurately for a specific requirement of a company. As a result, our recommendations to companies are better than before.

However, this doesn’t guarantee that the predicted role is correct all the time. Sometimes we get predicted roles with probabilities as low as 20–30%. In such cases we manually override the prediction with what we think the role of the user is. We also update the existing training data at periodic intervals, which helps improve future predictions.
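One simple way to decide which predictions need a manual look is a probability cut-off. The sketch below is an assumption about how that could work, not a description of our pipeline: the `pick_role` helper and the 0.5 threshold are hypothetical.

```python
# Hypothetical cut-off below which a prediction is sent for manual review.
REVIEW_THRESHOLD = 0.5


def pick_role(class_probabilities, threshold=REVIEW_THRESHOLD):
    """Return (role, probability, needs_review) for one developer.

    class_probabilities maps each role name to its predicted probability.
    """
    role, prob = max(class_probabilities.items(), key=lambda kv: kv[1])
    return role, prob, prob < threshold


confident = pick_role({"Backend": 0.90, "Frontend": 0.04, "Fullstack": 0.06})
unsure = pick_role({"Backend": 0.45, "Frontend": 0.30, "Fullstack": 0.25})
print(confident)  # ('Backend', 0.9, False)
print(unsure)     # ('Backend', 0.45, True)
```

Rows flagged this way are also good candidates to add to the training data once someone has corrected the role by hand.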

At Geektrust, we continue to work on ML & AI related stuff and this is an ongoing activity. We have started working on our own resume parser and look forward to building an advanced code evaluation engine that checks for clean code.

[About the author — Dhanush is co-founder at Geektrust and takes care of all things tech.

Geektrust is built for technologists to connect with remarkable job opportunities.]