Predicting Fraud Job Vacancies with 99% Accuracy using Natural Language Processing in Python

Syahriza Ilmi
8 min read · Jun 30, 2020


Photo by Markus Spiske on Unsplash

Given how important job vacancies are, it is no surprise that many people try to commit fraud through them. It is therefore in a job board's interest to have some algorithm for detecting fraudulent job postings, and one option is using Natural Language Processing to do so automatically. This story gives a step-by-step worked example of how to build a simple natural language processing pipeline to predict fraudulent job vacancies.

Importing Data

First of all, real-world data of real and fake job postings is needed. The data used here is from https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction. After downloading the data, import it into a Jupyter notebook with the code below. A sample of the data is also shown below.
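A minimal sketch of the import, assuming the CSV downloaded from Kaggle is saved as fake_job_postings.csv (the dataset's default file name) in the working directory:

```python
import pandas as pd

# Load the Kaggle dataset; the file name is assumed to be the default one.
df = pd.read_csv("fake_job_postings.csv")

# Peek at a few rows to verify the import worked.
print(df.head())
```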

Sample of data

Looking inside the Data

The code below will provide a general overview of the data.
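Something like the following standard pandas calls gives that overview:

```python
# Shape, column dtypes, and non-null counts.
print(df.shape)
print(df.info())

# Missing values per column.
print(df.isnull().sum())

# Class balance of the target column.
print(df["fraudulent"].value_counts(normalize=True))
```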

General overview of data

Some key points observed:

a. job_id is unique to each row and serves only as an identifier of the job vacancy.

b. salary_range is sometimes written as a range, sometimes as a single value, and is often missing. There are also a great many distinct ranges.

c. company_profile and description are free-text columns.

d. Categorical columns like employment_type, required_education, and required_experience have many missing values.

e. fraudulent is imbalanced; the ratio of real to fake job vacancies is about 20:1.

Exploratory Data Analysis

Next, the percentage of fraudulent job vacancies within each categorical variable will be examined. Matplotlib and seaborn will be used to plot the results. To see whether missing data has a higher or lower fraud percentage, missing categorical values will first be imputed with a new class, “missing_columns”.
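A sketch of the imputation and plotting; the column list is an assumption that covers the variables discussed in the points below:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed subset of the categorical columns discussed below.
categorical_cols = [
    "telecommuting", "has_company_logo", "has_questions",
    "employment_type", "required_education",
]

# Impute missing categorical values with a new class.
df[categorical_cols] = df[categorical_cols].fillna("missing_columns")

# Percentage of fraudulent postings for each category value.
for col in categorical_cols:
    fraud_pct = df.groupby(col)["fraudulent"].mean() * 100
    sns.barplot(x=fraud_pct.index.astype(str), y=fraud_pct.values)
    plt.title(col)
    plt.ylabel("% fraudulent")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
```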

  1. Telecommuting

As observed, jobs that allow telecommuting (working from home) are almost 2x more likely to be fraudulent.

2. Company Logo

A job vacancy without a company logo is 8x more likely to be fraudulent.

3. Has Question

Job vacancies without questions are 3x more likely to be fraudulent.

4. Employment Type

Part-time jobs have a very high fraud percentage, almost 10%. Meanwhile, job vacancies that do not specify the employment type have the second highest fraud percentage.

5. Required Education

A striking observation is that index 9 (Some High School Coursework) has a very high fraud percentage: about 70% of the postings with this value are fraudulent. Certification is also high, at about 10%.

6. Salary Range

Looking into the salary range, we find values such as the ones below.

Some value of salary range, found in profile report

There are a great many distinct salary ranges. Hence, the median of each salary range will be extracted (a sketch of this step appears at the end of this subsection).

ValueError

Some rows contain strange values, so these will be labeled “strange salary_range”. Next, the boxplot of the median salary will be shown.

Boxplot of salary

It looks like there are outliers that are very far from the rest of the data. To look at the distribution, those outliers need to be removed for a moment.

Boxplot of salary (outliers removed)

The distribution is highly skewed, with many very high values. To see the fraud percentage within each salary band, the salary data will be binned into five groups. However, because there are outliers with very high values, the last group will be split into two.

Strangely, there is one row that is still numeric. Since there is only one, it will be converted manually into its group, medium_salary_range. After that, the bar plot will be shown.

Exceptionally high salary ranges have a very high fraud percentage, about 30%, although only 37 rows fall into this group. Moreover, data without a salary range has a lower fraud percentage.
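Here is the promised sketch of the salary handling: extracting the median of each range, labeling missing and unparsable values, and binning. The cut points, group names, and number of bins are assumptions, not the article's exact choices.

```python
import numpy as np
import pandas as pd

def salary_median(value):
    """Median of a 'low-high' salary range, or a label for odd values."""
    if pd.isnull(value):
        return "no_salary_range"
    try:
        return float(np.median([float(p) for p in str(value).split("-")]))
    except ValueError:
        # Dates and other non-numeric text end up here.
        return "strange salary_range"

def salary_group(value):
    """Bin numeric medians into groups; the cut points are assumptions."""
    if not isinstance(value, float):
        return value                      # keep the labels from above
    if value < 20_000:
        return "low_salary_range"
    if value < 50_000:
        return "medium_salary_range"
    if value < 100_000:
        return "high_salary_range"
    if value < 1_000_000:
        return "very_high_salary_range"
    return "exceptionally_high_salary_range"

df["salary_group"] = df["salary_range"].apply(salary_median).apply(salary_group)
print(df.groupby("salary_group")["fraudulent"].mean() * 100)
```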

Feature Engineering

The data needs to be processed before it is fed into a machine learning model. First, all of the features will be combined into one text per posting. The text will then undergo various cleaning steps such as normalization. Finally, features will be extracted with TfidfVectorizer.

  1. Changing features before appending into one text

First, all boolean values need to be converted to text so that TfidfVectorizer can tell which boolean belongs to which column. After that, categorical data needs to be converted in a way that preserves the information that it is categorical rather than free text. This can be done by prefixing each value with its column name and using “_” instead of spaces so that each category stays a single token.
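A minimal sketch of this conversion; the column lists and the exact token format are assumptions:

```python
binary_cols = ["telecommuting", "has_company_logo", "has_questions"]
categorical_cols = ["employment_type", "required_experience",
                    "required_education", "industry", "function",
                    "salary_group"]

# Booleans become explicit tokens, e.g. 1 -> "has_company_logo_yes".
for col in binary_cols:
    df[col] = df[col].map({1: f"{col}_yes", 0: f"{col}_no"})

# Categorical values get the column name as a prefix and stay a single
# token, e.g. "Full-time" -> "employment_type_Full_time".
for col in categorical_cols:
    values = df[col].fillna("missing_columns").astype(str)
    df[col] = col + "_" + values.str.replace(r"[\s\-]+", "_", regex=True)
```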

2. Appending into one text

The appending itself is quite easy: just concatenate with the code below. However, job_id will be dropped because it carries no information, and the fraudulent column will also be dropped because it is the target to be predicted.
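A sketch of the concatenation, assuming the combined text is stored in a new column called text:

```python
# Drop the identifier and the target, then join the remaining columns
# into a single text per posting.
feature_cols = df.columns.drop(["job_id", "fraudulent"])
df["text"] = df[feature_cols].fillna("").astype(str).agg(" ".join, axis=1)
```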

3. Text Cleaning

There are some text cleaning steps that need to be done before features are extracted from the text; a combined sketch of these steps is shown under step d.

a. Lowering the case

Lowercasing is important since, in English, capital letters are mainly used for the first word of a sentence and for names. Since we expect the same predictive power from lowercase and uppercase forms, all words in the text will be lowercased.

b. Lemmatization and Removing Stopwords

Lemmatization is the process of converting a word into its base form. For example, “feet” will be converted to “foot”. Lemmatization can be done with spaCy, as in the combined sketch under step d. Tagging is enabled to differentiate nouns, verbs, and so on; parsing is disabled to make the pipeline faster; and entity recognition is kept to differentiate named entities (like Apple or Google) from regular nouns. After that, stopwords (“the”, “in”) will be removed since they carry no predictive power. However, “no” and “not” will be kept.

c. Removing Special Character and Punctuation

Special characters like $, %, and #, and punctuation like ; and ? will be removed because they are not expected to carry predictive power and would only introduce noise into the text. Deleting punctuation leaves double spaces in the data, so double spaces will be collapsed into single spaces. Contractions and possessives such as “he’s” also leave a stray “ ’s ”, which will be deleted as well.

d. Combining

It’s time to apply the text cleaning to the text. Since there are 17,880 rows, it will take some time to finish.
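A combined sketch of steps a–c applied in one function; the spaCy model name en_core_web_sm and the exact regular expressions are assumptions:

```python
import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Keep the tagger and named-entity recognizer, disable the parser for speed.
nlp = spacy.load("en_core_web_sm", disable=["parser"])

# Remove stopwords, but keep "no" and "not".
stopwords = STOP_WORDS - {"no", "not"}

def clean_text(text):
    # a. Lowercasing.
    text = text.lower()
    # b. Lemmatization and stopword removal.
    doc = nlp(text)
    text = " ".join(tok.lemma_ for tok in doc if tok.lemma_ not in stopwords)
    # c. Drop stray "'s", special characters and punctuation,
    #    then collapse the double spaces this leaves behind.
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"[^a-z0-9_\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["clean_text"] = df["text"].apply(clean_text)
```

Underscores are deliberately kept by the character filter so that the prefixed category tokens created earlier survive as single words.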

4. Feature Extraction using TfIdfVectorizer

Next, the cleaned text will be transformed into a matrix that a machine learning model can process. One way of doing that is with TfidfVectorizer. This method weighs how often a word appears in a document against how rare the word is across documents. The ngram_range argument specifies which n-grams to extract; a value of (1, 2) means extracting both unigrams (e.g., “box”) and bigrams (e.g., “black box”).
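A sketch of the vectorization; apart from ngram_range, the settings are scikit-learn defaults:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract unigrams and bigrams from the cleaned text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["fraudulent"]
```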

5. Splitting into Train and Test data

Train data will be used to train the model and test data will be used to check the accuracy of the predictions. The data will be split 80–20.
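A sketch of the split; the stratify argument and the random_state are assumptions, added so the fraud ratio and the split are reproducible:

```python
from sklearn.model_selection import train_test_split

# 80-20 split, keeping the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```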

6. Oversampling

As observed at the beginning, the target class is very imbalanced. Hence, an oversampling method, ADASYN, will be used to counter the class imbalance.
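A sketch of the oversampling with imbalanced-learn; only the training set is resampled so the test set keeps its original distribution:

```python
from imblearn.over_sampling import ADASYN

# Oversample the minority (fraud) class in the training data only.
X_train_res, y_train_res = ADASYN(random_state=42).fit_resample(X_train, y_train)
```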

Modelling

  1. Feature Selection

Various models will be trained. To minimize training time, feature selection will be conducted first: only the features found to be important by a LinearSVC trained on the data will be kept.
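A sketch of this step using a LinearSVC wrapped in SelectFromModel; the wrapper and its default threshold are assumptions about how the selection was done:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Train a LinearSVC and keep only the features whose weights pass
# SelectFromModel's default threshold.
selector = SelectFromModel(LinearSVC(C=1, max_iter=5000))
selector.fit(X_train_res, y_train_res)

X_train_sel = selector.transform(X_train_res)
X_test_sel = selector.transform(X_test)
```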

2. Base model

Baseline Random Forest, LinearSVC, Gradient Boosting, and XGBoost models will be trained without tuning the hyper-parameters. The performance of each model will be measured by accuracy score, since for this particular problem false positives and false negatives are assumed to carry the same weight.
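A sketch of the baseline comparison; the default hyper-parameters and the random_state values are assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "LinearSVC": LinearSVC(C=1, max_iter=5000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
}

# Train each model on the selected features and report test accuracy.
for name, model in models.items():
    model.fit(X_train_sel, y_train_res)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test_sel)):.4f}")
```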

While also training faster, Random Forest and LinearSVC have the better accuracy scores, with LinearSVC slightly ahead. Therefore, only LinearSVC and Random Forest will be tuned.

3. Tuning SVC

SVC will be tuned using GridSearchCV.
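A sketch of the grid search; the candidate C values are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Search over the regularization strength C with 5-fold cross-validation.
grid_svc = GridSearchCV(LinearSVC(max_iter=5000),
                        {"C": [0.01, 0.1, 1, 10, 100]},
                        scoring="accuracy", cv=5)
grid_svc.fit(X_train_sel, y_train_res)
print(grid_svc.best_params_, grid_svc.best_score_)
```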

It seems like using C = 1 is already the best choice.

4. Tuning Random Forest

The results show that, apart from min_samples_split and n_estimators, the original parameters already give the best accuracy score. The model will be trained again, varying min_samples_split and n_estimators.
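A sketch of the second grid search, varying only the two parameters mentioned; the candidate values are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Vary only n_estimators and min_samples_split.
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42),
                       {"n_estimators": [100, 300, 500],
                        "min_samples_split": [2, 5, 10]},
                       scoring="accuracy", cv=5)
grid_rf.fit(X_train_sel, y_train_res)
print(grid_rf.best_params_, grid_rf.best_score_)
```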

The model's accuracy has been successfully increased. However, LinearSVC remains the better model overall.

5. Conclusion and Feature Importance

LinearSVC gives the best accuracy score, at almost 99%. Let's take a look at the classification metrics.
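A sketch of how those metrics can be produced for the tuned LinearSVC:

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import LinearSVC

# Refit the tuned LinearSVC and inspect its test-set metrics.
best_model = LinearSVC(C=1, max_iter=5000)
best_model.fit(X_train_sel, y_train_res)
preds = best_model.predict(X_test_sel)

print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds, digits=4))
```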

The precision score is lower than the other metrics because of the number of false positives. Recall and accuracy give satisfying results, with scores greater than 95%. Finally, let's look at how the LinearSVC model classifies the results by examining the top 10 most important features.
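A sketch of how the top features can be read off the LinearSVC coefficients, mapping the selected columns back to their n-gram names:

```python
import numpy as np

# Recover the n-gram names of the columns kept by the feature selector
# and rank them by the absolute value of the LinearSVC coefficients.
feature_names = np.array(vectorizer.get_feature_names_out())
selected_names = feature_names[selector.get_support()]

coefs = best_model.coef_.ravel()
top10 = np.argsort(np.abs(coefs))[-10:][::-1]
for name, weight in zip(selected_names[top10], coefs[top10]):
    print(f"{name}: {weight:+.3f}")
```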

Having no company logo and no questions are the most important features in the classification. Surprisingly, “administrative” also plays a big role in the model's classification. These results are worth keeping in mind when looking for a job.

Reference

  1. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.
  2. Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
  3. https://towardsdatascience.com/text-classification-in-python-dd95d264c802
  4. https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
