Domain Classification based on LinkedIn Summaries

Akash Tripathi
AITS Journal
4 min read · Sep 15, 2020


Introduction

The range of job titles and positions for working professionals grows every day, and different organizations often describe more or less the same responsibilities in very different ways. LinkedIn, the social network for working professionals, is home to job seekers and employers alike, and its extensive individual summaries offer a way to study this. A LinkedIn summary acts as the first interaction between an applicant and an employer. The platform also uses the summary, along with many other parameters, to suggest job openings. However, specific keywords in a summary can often narrow the job search, excluding many potentially suitable openings for the user.

Problem Statement

We aim to read users' LinkedIn summaries and generalize their work domain, so that job openings can be suggested more effectively and the predicted domain can be used for further analysis.

Dataset and Preprocessing

We used a Kaggle dataset (https://www.kaggle.com/heet9022/linkedin-dataset) containing various fields from users' profiles. The columns are Category, LinkedIn profile picture, Description Summary, Experience, Name, Location and Skills.

For our domain classification, we needed only the description, which is the LinkedIn summary, and the category, which serves as the label. The dataset contained many null rows; after dropping them, the final row count came to 670. This is a small amount of data for deep learning algorithms. The pre-processing steps were converting the description to lower case and removing stop words, punctuation and links.
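
The pre-processing steps above can be sketched as follows. This is a minimal illustration and not the project's actual code: the stop-word list, the URL pattern, the dictionary keys `description` and `category`, and the function names are all assumptions made for the example.

```python
import re
import string

# Small illustrative stop-word list (an assumption; the article does not say
# which stop-word list was used).
STOP_WORDS = {
    "a", "an", "the", "and", "or", "in", "on", "at", "of", "to",
    "for", "with", "is", "am", "are", "i", "my", "me",
}

# Matches http(s) links and bare www links up to the next whitespace.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")

# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(summary: str) -> str:
    """Lower-case a summary and strip links, punctuation and stop words."""
    text = summary.lower()
    # Remove links before punctuation so that a URL is dropped as a whole.
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


def clean_rows(rows):
    """Drop rows with a missing description or category, then preprocess."""
    cleaned = []
    for row in rows:
        description = row.get("description")
        category = row.get("category")
        if not description or not category:
            continue  # null row: skip it
        cleaned.append((preprocess(description), category))
    return cleaned


print(preprocess("I am a Data Scientist at https://example.com, building ML models!"))
# data scientist building ml models
```

Links are removed before punctuation on purpose: stripping punctuation first would break a URL into separate word-like fragments that survive as noise.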

The above steps were common to all the algorithms implemented. For Linear SVC, we used tf-idf (term frequency-inverse document frequency) weight vectors to train efficiently. For training Text-CNN, we tokenized the data to obtain a text sequence. The input was then taken from an embedding matrix containing pre-trained GloVe vectors for the words in the sequence.
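
To make the tf-idf weighting concrete, here is a small sketch that computes it by hand using the plain textbook form (term frequency multiplied by log(N / document frequency)). It is only an illustration: library implementations typically add smoothing and normalization, and the function name `tfidf_vectors` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocab = sorted(doc_freq)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, vectors


# Toy, made-up summaries (already pre-processed).
vocab, vectors = tfidf_vectors([
    "data scientist python",
    "software engineer python",
])
print(vocab)  # ['data', 'engineer', 'python', 'scientist', 'software']
```

In this toy case, "python" appears in both documents, so its weight is zero, while terms unique to one summary receive a positive weight. This is the property that makes tf-idf useful for separating domains.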

Methodology

To achieve the best results, we applied several machine learning algorithms, including Linear SVC (*** Paper link***), Random Forest and K-Nearest Neighbours, among others. The best accuracies were obtained with Linear SVC (test accuracy: 57.4%) and MLP Classifier (test accuracy: 59.07%). These figures are low, mainly due to the lack of data.
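
A comparison of this kind could be set up with scikit-learn roughly as shown below. This is a hedged sketch, not the project's code: the toy summaries and labels are made up, and the hyperparameters (`hidden_layer_sizes`, `max_iter`, `random_state`) are assumptions, since the article does not state the settings used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up summaries and labels for illustration only
# (not taken from the Kaggle dataset).
train_texts = [
    "python machine learning data analysis",
    "statistics data modelling python",
    "recruitment hiring talent acquisition",
    "employee relations hiring onboarding",
]
train_labels = ["Data Science", "Data Science", "HR", "HR"]
test_texts = ["data analysis python", "talent hiring"]

# Each model gets its own tf-idf vectorizer inside a pipeline, so the
# vocabulary is learned from the training texts only.
models = {
    "LinearSVC": make_pipeline(TfidfVectorizer(), LinearSVC()),
    "MLPClassifier": make_pipeline(
        TfidfVectorizer(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(test_texts)
    print(name, list(predictions[name]))
```

With only 670 rows spread over 25 classes, a held-out test split is small, so accuracy figures from such a comparison should be read as rough estimates.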

Job_CNN

Turning to deep learning, we implemented TextCNN (*** Paper link ***) to classify domains according to the description. The CNN reads the text much as it would an image, which helps it extract important features. We generate a sentence matrix from pre-trained GloVe vectors and the tokenized sequence. The matrix has dimensions (10000, 300): 10000 is the maximum feature length, i.e. the maximum number of words to be included in the token sequence, and 300 is the vector dimension. The CNN architecture uses filter sizes of 1, 2, 3 and 5, with the number of filters set to a single value, 36. A softmax layer was used as the output to classify among 25 classes.
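
The forward pass of such an architecture can be sketched in plain Python to show how the shapes fit together. This is an illustration under stated assumptions, not the trained model: the weights are random, the ReLU activation and max-over-time pooling follow the usual TextCNN design rather than anything stated in the article, 36 is assumed to be the number of filters per filter size, and the toy sentence matrix is far smaller than (10000, 300).

```python
import math
import random

FILTER_SIZES = (1, 2, 3, 5)  # filter sizes from the article
NUM_FILTERS = 36             # assumed to mean 36 filters per filter size
NUM_CLASSES = 25             # number of output classes from the article


def conv_max_pool(sentence, kernel, bias):
    """Slide one filter over the sentence matrix, apply ReLU, max-pool over positions."""
    height = len(kernel)
    best = 0.0  # ReLU output is never negative, so 0.0 is a safe starting maximum
    for start in range(len(sentence) - height + 1):
        total = bias
        for i in range(height):
            total += sum(w * x for w, x in zip(kernel[i], sentence[start + i]))
        best = max(best, total)
    return best


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def text_cnn_forward(sentence, rng):
    """Run an untrained TextCNN forward pass; returns (pooled features, class probabilities)."""
    embed_dim = len(sentence[0])
    features = []
    for height in FILTER_SIZES:
        for _ in range(NUM_FILTERS):
            # Random weights stand in for learned filters (illustration only).
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                      for _ in range(height)]
            features.append(conv_max_pool(sentence, kernel, 0.0))
    # Dense layer mapping the concatenated features to the 25 class logits.
    dense = [[rng.uniform(-0.1, 0.1) for _ in features] for _ in range(NUM_CLASSES)]
    logits = [sum(w * f for w, f in zip(row, features)) for row in dense]
    return features, softmax(logits)


# Toy sentence matrix: 10 tokens, each an 8-dimensional stand-in for a GloVe vector.
rng = random.Random(0)
sentence = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
features, probs = text_cnn_forward(sentence, rng)
print(len(features), len(probs))  # 144 25
```

Each of the four filter sizes contributes 36 pooled values, so the concatenated feature vector has 4 × 36 = 144 entries regardless of the sentence length, and the softmax layer turns it into a probability distribution over the 25 domains.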

Results

Job Loss
Confusion Matrix
Job Whiskers

You can find ***Project Link*** on GitHub.

You can find ***Paper Link*** here.

Future Work

We aim to use this classifier primarily to suggest job openings according to the domain predicted from users' LinkedIn summaries. It could also support further analysis of LinkedIn users, such as finding their major working domains, or help a large organization filter applications based on resume summaries.

CREDITS: Balan Dhanka, Taha Zanzibarwala
