Classification of Real and Fake Job Postings Using Ensemble Model

Sze Yeung
14 min read · Feb 26, 2022


Fake job postings, like spam and fake news, enable scams and other types of abuse, so every job ads platform wants to filter them out. What are the characteristics of fake job ads? How can we use a machine learning model to do this job?

In this project on a dataset of real and fake job postings, which includes textual and non-textual fields, I first explore the characteristics of fake job ads and then derive features to highlight them for model training. Classification with imbalanced classes and the treatment of missing values are two of the trickier issues in machine learning, and I tackle both with an ensemble model: I train three different machine learning models on different segments of the samples and take a simple majority vote of the three as the final prediction. The three models turn out to complement each other and achieve decent results.

The Dataset

The dataset, named the Employment Scam Aegean Dataset (EMSCAD) by researchers at the University of the Aegean, consists of 17,880 job ads posted between 2012 and 2014 through Workable, a recruiting software platform; its 866 fraudulent job ads were manually annotated by Workable employees. The criteria for inclusion are said to include “client’s suspicious activity on the system, false contact or company information, candidate complaints and periodic meticulous analysis of the clientele”. So on one hand there may be a small number of mislabeled job ads, and on the other hand the basis of annotation may include factors not contained in the dataset.

The dataset includes both structured and unstructured data. The open text fields are “title”, “company profile”, “description”, “requirements” and “benefits”, and, to some extent, “location”, “department” and “salary range”. The structured fields include “employment type”, “required experience”, “required education”, “industry” and “function”, along with binary fields indicating whether the ad has a “company logo” and screening “questions”, and whether the job involves “telecommuting”. Unlike the other fields, which are filled in by the clients, these three binary fields are presumably generated by the system. The following table lists the details of the feature fields of the dataset:

Dataset description

Some Examples of Fake Job Ads

What do fake job ads look like? Let’s see some examples:

A casual review of the fake ads reveals some patterns. Some fake ads use typical spam phrases, such as “Use Your Spare Time to Start Earning More” in the title, or emphasize the money to be earned with dollar signs in the title, and talk about earning money remotely. These ads seem to target people with a high school level of educational attainment. Some fake ads include external URL, EMAIL and PHONE links; is this because scammers want to bypass the ad system and make direct contact with their targets?

Furthermore, some fake ads are very brief and barely provide any details, with one extreme case that has only a title (Office Manager) and a location (PL, MZ, Warsaw), leaving all other fields blank.

But there are also fake ads with long descriptions that look like genuine ones, such as the following description pretending to be from NETGEAR:

NETGEAR, Inc. (NASDAQGM: NTGR) designs innovative, branded technology solutions that address the specific networking, storage, and security needs of small- to medium-sized businesses and home users. The company offers an end-to-end networking product portfolio to enable users to share Internet access, peripherals, files, multimedia content, and applications among multiple computers and other Internet-enabled devices. Products are built on a variety of proven technologies such as wireless, Ethernet and powerline, with a focus on reliability and ease-of-use. NETGEAR products are sold in over 27,000 retail locations around the globe, and via more than 37,000 value-added resellers. The company’s headquarters are in San Jose, Calif., with additional offices in 25 countries.

Director of Engineering | HMA Security Products
San Jose, CA

Reporting to the VP of Service Provider Engineering, the Director of Engineering will have responsibility for managing the successful development and deployment of the company’s Security products and solutions. The individual will manage a focused team of engineers in addition to leading and directing numerous outside technology partners. This includes partnerships with chip set providers, ODMs, new technology start-ups, and 3rd party software providers. As the engineering leader, the individual will work closely with the product marketing team in generating the roadmap of products and solutions that will need to be developed. The Director of Engineering and their team of engineers will then be responsible for determining the best engineering approach to realizing that roadmap, including product architectures, selection of technology partners, resource planning, test planning, product scheduling, costing, and NPI planning. The individual will then manage the team and external partners to ensure the project objectives are met. The Director of Engineering will work closely with customers to get products certified and approved for use. Once introduced, the Engineering Manager will work closely with the sales and technical support to ensure customer satisfaction and product quality objectives are being met.

Job Responsibilities
Ability to be both a strong Manager and technical leader for the group, with strong domain/forum knowledge of Security products, tables, routers, wireless, and hands-on IP networking experience. VoIP experience an advantage.
Ideally be known-in and reputable within the networking Industry.
10+ years of demonstrable success of strong engineering management background in communications networking hardware and software utilizing Test Driven Development
Demonstrated understanding and skills in project and program management, risk management, including 3rd parties
Demonstrated success in developing products by utilizing outside company resources and partnerships. Experience with ODM developments an advantage
Can attract, motivate and retain top caliber engineers for the organization.
One whom customers and technology partners find credible and look to for direction.
5 or more years experience working for a small company, in addition to 5 or more years experience working for a larger more mature market leader.
Team player who can effectively work with the cross functional team, and can effectively communicate throughout all levels of the organization.
An understanding and desire of how to continuously improve product quality.
Demonstrated ability to use lightweight processes to improve engineering results.
Can stay on top of and apply the latest technology trends and engineering processes for the organization.

Characteristics of Fake Job Ads

1. The Clues of Missing Values

At first glance at the dataset, with identical duplicated records dropped (these amount to less than 2% of all records), two things stand out. The first is the high imbalance of the data, with the fraudulent class totaling less than 5% of all records. So we may set the baseline accuracy of the trained model at 95%, as that score can be achieved by simply labelling every ad as genuine.
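
As a minimal sketch of these two checks, assuming the Kaggle copy of the dataset with its fraudulent label column (the file name and column names are assumptions):

```python
import pandas as pd

# Column names follow the Kaggle copy of EMSCAD; the file name is an assumption.
df = pd.read_csv("fake_job_postings.csv")

# Drop identical duplicated records (less than 2% of all rows)
df = df.drop_duplicates()

# The fraudulent class is under 5%, so labelling everything "real"
# already yields roughly 95% accuracy, which serves as the baseline.
fraud_rate = df["fraudulent"].mean()
print(f"fraudulent share: {fraud_rate:.1%}, naive baseline accuracy: {1 - fraud_rate:.1%}")
```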

The second thing is the extensive presence of missing values, with missing-value rates exceeding 80% for some features. In most cases, missing values are an inconvenience for machine learning, as they reflect system malfunction or human error. But in this job ads dataset, we can assume they result from willful omission by the clients. Sometimes they choose to, say, put information about benefits in the description field or even in the title; sometimes they just leave some fields blank. So a missing value may itself become a clue for identifying fake ads.

The following comparison table shows that, in all but one of the features that have missing values, fake ads are more likely to have missing values; the exception is “salary range”, where fake ads (26%) are more likely to provide information than real ads (16%). A series of two-proportion z-tests, with the two-tailed p-value significance level set at 0.005, indicates that the differences in NA rates between real and fake job postings are mostly statistically significant, except in “location”, “department”, “benefits” and “function”. The difference is largest in “company profile”, where 68% of fake ads have no company profile but only 16.1% of real ads omit it.
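
One such comparison can be sketched with the proportions_ztest function from statsmodels, reusing the dataframe loaded above (column names again follow the Kaggle CSV):

```python
from statsmodels.stats.proportion import proportions_ztest

fake = df[df["fraudulent"] == 1]
real = df[df["fraudulent"] == 0]

# Missing-value counts for "company profile" in each class
na_counts = [fake["company_profile"].isna().sum(), real["company_profile"].isna().sum()]
n_obs = [len(fake), len(real)]

# Two-sided two-proportion z-test; significant at the 0.005 level used in the text
stat, p_value = proportions_ztest(na_counts, n_obs)
print(f"z = {stat:.2f}, two-tailed p = {p_value:.3g}")
```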

2. Shorter Text Length

Apart from being more likely to omit information, fake job ads tend to be shorter in textual content, as the following graph of the total number of characters in the textual fields shows:

3. Missing Company Logo

Examination of the three binary features shows that ads for jobs involving telecommuting, ads without a company logo and ads without screening questions are more likely to be fake. Among the three, the absence of a company logo is the most indicative: 16% of ads without a company logo are fake, while only 2% of ads with a company logo are fraudulent.

4. Title mentions “$”

Some fake job ads emphasize the money to be earned with a dollar sign: 6.8% of fake ads have this characteristic, but only 0.4% of real ads do so.

5. External Links

Some fake job ads contain external URL, email and phone links, as scammers are suspected of trying to bypass the ad system to make direct contact with their targets. While real and fake ads have #URL links at more or less the same rate, 20% of fake ads have #EMAIL links but less than 7% of real ads do, and 9.2% of fake ads have #PHONE links while only 2.6% of real ads do.

6. List US as Location

84% of fake ads in the dataset list the US as their location, while 58% of real ads come from the US.

7. Other Peculiar Traits

In terms of employment type, ads that specify “full-time”, “contract” or “temporary” are less likely to be fake, while “part-time” has a fraudulent rate of about 9%, the highest in this field. For required experience, ads that specify “associate”, “mid-senior level” or “internship” are less likely to be fake. For required education, each category has more or less the same low fraudulent rate, with the exception of “some high school coursework”, where the fraudulent rate is 75%. Among the functions listed, “administrative”, with a fraudulent rate of about 20%, stands out. As for the 131 industry categories listed, the most common one among fake ads is “oil and energy”; its fraud rate of 37.8% is also among the highest, though a few isolated cases make the fraud rates of some industries like ranching, military and animation even higher.

Feature Construction

To recap the above observations, the following features are constructed for their potential to identify fake job ads (a construction sketch follows the list):

  • missing company profile
  • missing salary range
  • title mention $
  • text length
  • email link
  • phone link
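
A minimal construction sketch in pandas, assuming the Kaggle column names and the EMSCAD convention of masking links with #URL_…#, #EMAIL_…# and #PHONE_…# tokens:

```python
text_cols = ["title", "company_profile", "description", "requirements", "benefits"]

# Flags for willfully omitted fields
df["missing_company_profile"] = df["company_profile"].isna().astype(int)
df["missing_salary_range"] = df["salary_range"].isna().astype(int)

# Titles that emphasize money with a dollar sign
df["title_mentions_dollar"] = df["title"].str.contains(r"\$", na=False).astype(int)

# Join the open text fields; missing values become empty strings
joined = df[text_cols].fillna("").agg(" ".join, axis=1)
df["text_length"] = joined.str.len()

# External contact links (masked as #EMAIL_...# / #PHONE_...# in the text)
df["email_link"] = joined.str.contains("#EMAIL", regex=False).astype(int)
df["phone_link"] = joined.str.contains("#PHONE", regex=False).astype(int)
```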

For the missing values, “na” is used as a category in the categorical fields, but in the combined text a missing value is replaced by an empty string to avoid overlapping with the newly created features denoting the absence of values. Furthermore, the less frequently appearing countries in the location column are collapsed into “other”.

The contents of the open textual fields “title”, “company_profile”, “description”, “requirements” and “benefits” are joined for bag-of-words (BoW) analysis.

After one-hot encoding, mutual information scores show that among the non-textual features, “text length”, “missing company profile”, “missing salary range”, “has company logo”, “industry: oil and energy” and “location: US” are the top six features related to fake job ads.
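
A sketch of how such scores could be computed with scikit-learn's mutual_info_classif after one-hot encoding, reusing the engineered columns above (the dropped column names are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# One-hot encode the categorical fields, keeping "na" as its own category
non_text = pd.get_dummies(
    df.drop(columns=text_cols + ["fraudulent", "job_id"]),
    dummy_na=True,
)

mi = mutual_info_classif(non_text, df["fraudulent"], random_state=0)
mi_scores = pd.Series(mi, index=non_text.columns).sort_values(ascending=False)
print(mi_scores.head(6))
```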

Sampling and Model Construction Strategy

In dealing with imbalanced classification, two of the most common strategies are undersampling and oversampling. One drawback of undersampling is that the undersampled majority class may lose information and become biased; one drawback of oversampling is that it is prone to overfitting, especially when the upsampling rate is high.

On the other hand, this dataset has rich textual content for constructing bag-of-words (BoW) models, as well as a large number of features for training a separate model. So to keep the upsampling rate low while using all of the majority class cases, I devise a strategy of dividing the majority class cases into three portions, combining each with the randomly upsampled full set of minority class cases to train three models, and making an ensemble of the three models by simple majority vote. This strategy is similar to bagging, except that here the samples are not drawn with replacement.

In the dataset, the ratio of real ads to fake ads is about 19:1. When we divide the real cases into three portions while keeping all the fake cases, the ratio becomes about 6.3:1. With an upsampling rate of about 4×, the ratio can be further reduced to about 1:0.6.
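
A sketch of this splitting-and-upsampling scheme (the 75/25 train/test split and the random seeds are assumptions; the roughly 4× upsampling follows the text):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

train, test = train_test_split(df, test_size=0.25, stratify=df["fraudulent"], random_state=0)

real = train[train["fraudulent"] == 0]
fake = train[train["fraudulent"] == 1]

# Shuffle the majority class and cut it into three disjoint portions
real_portions = np.array_split(real.sample(frac=1, random_state=0), 3)

# Pair each portion with the full minority class, upsampled about 4x with replacement
train_sets = [
    pd.concat([portion, resample(fake, replace=True, n_samples=4 * len(fake), random_state=i)])
    for i, portion in enumerate(real_portions)
]
```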

And the three models are:

  1. A BoW model using a simple CountVectorizer and a linear support vector machine (SVM)

  2. A BoW model using a TF-IDF vectorizer and a random forest

  3. An XGBoost classifier on the non-textual features

The choice of algorithms also takes imbalanced classification into consideration. Both SVM and random forest have a built-in option for balancing class weights, and XGBoost is known to be effective on imbalanced datasets.

SVM BoW Model

A bag-of-words model simply keeps counts of the occurrence of each word in the text. By feeding it into a machine learning algorithm, we hope the algorithm can identify keywords that distinguish fake job ads from real ones. Linear SVM is known to work quickly on high-dimensional data and is highly interpretable. The resulting performance is quite respectable.
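
A minimal version of this first model, reusing text_cols from the feature sketch and the first training split from above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def join_text(frame):
    """Concatenate the open text fields of each ad into one string."""
    return frame[text_cols].fillna("").agg(" ".join, axis=1)

# Word counts fed into a linear SVM with balanced class weights
svm_bow = make_pipeline(
    CountVectorizer(stop_words="english"),
    LinearSVC(class_weight="balanced"),
)

train_1 = train_sets[0]
svm_bow.fit(join_text(train_1), train_1["fraudulent"])
svm_pred = svm_bow.predict(join_text(test))
```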

I use Susan Li’s function for listing the top textual features used for predictions, and it reveals that some of the words raised in the earlier discussion, such as “earn”, “immediate”, “cash”, “money”, “apply link”, “phone” and “work home”, emerge as keywords for predictions, while certain top words related to “northwestern hospital build website” are apparently learned from a series of fake ads in the sample.
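
Susan Li’s function itself is not reproduced here, but the same idea can be sketched from the fitted vectorizer and the SVM coefficients (assuming a recent scikit-learn with get_feature_names_out):

```python
import numpy as np

vectorizer = svm_bow.named_steps["countvectorizer"]
svm = svm_bow.named_steps["linearsvc"]

feature_names = np.array(vectorizer.get_feature_names_out())
coefs = svm.coef_.ravel()

# The largest positive coefficients push a prediction towards "fraudulent"
top_fake_terms = feature_names[np.argsort(coefs)[::-1][:15]]
print("top fraud-leaning terms:", ", ".join(top_fake_terms))
```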

Random Forest TF-IDF BoW Model

TF-IDF (term frequency–inverse document frequency) adjusts the word counts of the BoW model by suppressing the importance of words that occur across many texts, thus concentrating on distinctive words that appear frequently in a particular text. The eli5 library can reveal the textual features that carry the most weight in the random forest model, though it does not distinguish whether they are used to predict real or fake ads; still, we can see certain words like “earn”, “work home” and “no experience” appear again. And compared with the SVM BoW model, which has more false positives than false negatives, the random forest TF-IDF BoW model has more false negatives than false positives.
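
A comparable sketch of the second model, trained on the second split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# TF-IDF weighting suppresses words common to most ads; the forest balances class weights
rf_tfidf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0),
)

train_2 = train_sets[1]
rf_tfidf.fit(join_text(train_2), train_2["fraudulent"])
rf_pred = rf_tfidf.predict(join_text(test))
```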

XGBoost Features Model

The non-textual features model, even using the highly powerful XGBoost algorithm, still has lower accuracy; in particular it has a large number of false positives, perhaps reflecting that non-textual information alone is not enough to distinguish truly fake ads from genuine ones. The feature importance figures show that the model is dominated by variables of industry, required education and location, while missing salary range comes out on top.
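
A sketch of this third model on the one-hot-encoded non-textual features, trained on the third split (the hyperparameters are placeholders, not the ones used in the project):

```python
import pandas as pd
from xgboost import XGBClassifier

# Non-textual columns: everything except the open text fields, the label and the id
feature_cols = [c for c in df.columns if c not in text_cols + ["fraudulent", "job_id"]]

train_3 = train_sets[2]
X_train = pd.get_dummies(train_3[feature_cols], dummy_na=True)
X_test = pd.get_dummies(test[feature_cols], dummy_na=True).reindex(columns=X_train.columns, fill_value=0)

# scale_pos_weight counteracts the remaining class imbalance
xgb = XGBClassifier(
    n_estimators=300,
    scale_pos_weight=(train_3["fraudulent"] == 0).sum() / (train_3["fraudulent"] == 1).sum(),
    eval_metric="logloss",
)
xgb.fit(X_train, train_3["fraudulent"])
xgb_pred = xgb.predict(X_test)
```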

Making the Ensemble Model

The ensemble model is made in a simple way. We take the predictions of the three models, where 1 denotes a prediction of a fake ad, and add up the scores of each record; when the total score is 2 or 3, the record has the majority vote and is classified as fraudulent, otherwise it is classified as real.
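
A sketch of the vote, reusing the three prediction arrays from the model sketches above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Each model votes 1 for "fraudulent"; two or more votes carry the majority
votes = np.vstack([svm_pred, rf_pred, xgb_pred]).sum(axis=0)
ensemble_pred = (votes >= 2).astype(int)  # raise to 3 (or lower to 1) to shift the FP/FN balance

print("accuracy:", accuracy_score(test["fraudulent"], ensemble_pred))
print("f1:", f1_score(test["fraudulent"], ensemble_pred))
```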

The accuracy (98.6%) and F1 score (0.85) of the ensemble model are higher than those of the SVM BoW model (97.7% and 0.78), the random forest TF-IDF BoW model (98.3% and 0.81) and the XGBoost features model (96.0% and 0.66).

When we apply the ensemble model to the whole dataset, the numbers look even more impressive. But as each component model has seen 1/4 of the class 0 data and 3/4 of the class 1 data, the numbers are somewhat inflated.

Among the correctly predicted fake ads that get 2 votes and just pass the majority threshold, the 2 votes come from all three possible pairings of models, indicating that the three component models complement each other.

Evaluation of Incorrect Classifications

Still, we should examine the incorrectly classified cases to see what can be improved.

Among the 36 false negative cases, the following one fools all three component models. The content looks genuine and lacks the telltale words, so it escapes detection by both BoW models. On the features side, though it has no company profile and no company logo, it does not seem to share enough common features with other fake ads to make the XGBoost features model raise the flag.

And in a false positive case where all three component models wrongly raise the flag, it seems that the term “datum entry” leads both BoW models to classify it as fake. On the features side, the omission of several fields and the relative shortness of the text content lead XGBoost to classify it as fake. One point worth noticing is that in this prediction the XGBoost model takes the missing salary range and non-telecommuting as the top contributing factors for classifying it as fraudulent, which is contrary to the observations from data exploration and raises the possibility that the model operates in a more complex way that defies easy comprehension.

Conclusion

This ensemble modeling exercise illustrates that, even with just a simple majority vote, the ensemble method can let different algorithms compensate for each other’s limitations and improve performance, which is welcome in the tricky problem of imbalanced classification. By changing the classification threshold, say raising it from two votes to three, or lowering it from two votes to one, we can trade off the numbers of false positives and false negatives, depending on which suits the need.

Though the modeling method can be improved, the examination of false negative and false positive cases shows that, in identifying fraudulent job ads, spam emails or fake news, we need to use two kinds of information. The first is more “generic”, such as the emphasis on “easy money” and the lack of information provided; anyone who sees these should grow suspicious. But fraudsters can always set a trap by making an ad seem genuine, and there are always real ads that share many characteristics with the fake ones. To make proper distinctions, we need more “empirical” information, such as the locations and industries the fraudsters habitually target, or the genuine companies they tend to impersonate. This kind of information is not as apparent as it seems and needs to be inferred from known cases. That means an automated algorithm in itself is never sufficient. We need other ways to keep collecting fraudulent ads, and to continue analysing them and feeding them to machine learning models.

Original Dataset from Kaggle and EMSCAD

Source Code: Github and Kaggle
