Identification of Online Recruitment Fraud (ORF) through Predictive Models

Online job postings have become popular because they connect employers with job seekers around the world. However, fraudulent employers also post jobs online and expect people to apply to them. These fraudulent employers threaten job seekers' privacy and spawn fake job offers. We observed that most Online Recruitment Fraud (ORF) postings share matching features. Although users cannot easily categorize such postings themselves, we propose using predictive models such as Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest, Naïve Bayes, and Logistic Regression to detect them. A dataset with 17,880 job postings was downloaded from Kaggle to identify which of the proposed models best predicts fraudulent job postings. The dataset includes 14 features used to determine whether an online job posting is fraudulent or non-fraudulent. 70% of these job postings are used to train the models, and the remaining 30% to test their performance. Each model is evaluated using four metrics: Classification Accuracy (CA), Precision, Recall, and F-1 score. The research is useful from two sides: websites can identify fake jobs before they are published, and job seekers are protected from fraudulent job postings.


Introduction
The advent of online platforms has advantages and disadvantages. On the positive side, we can share ideas, news, and other information. On the negative side, it is exceedingly challenging to tell which news and information are deceitful and which are genuine. One increasingly common online incident is the fake job posting, often termed Online Recruitment Fraud (ORF) (Vidros et al., 2017). Ryan (2018) asserts that almost 90% of job seekers fall prey to ORF. According to Mehta (2020), ORF has become an industry in India, where, with an unemployment rate of 7.9%, job seekers easily fall victim to these rackets. Also, in 2021, fake jobs cheated 27,000 people in India by posting 13,000 vaccines online (DNA, 2021). Agency Central, a recruitment agency in the UK, reports a 300% rise in ORF from September 2014 to September 2016 and expects it to grow even more (Agency Central, 2019). It further states that only 17% of students and graduates in the UK are aware of ORF. An article published by CNN states that more than 16,000 people fell prey to online employment scams in the US, with more than $59 million in losses (CNN, 2021).
The motivation of the study is thus based on the fact that ORF is on the rise worldwide. The people involved in ORF are sneaky and use sophisticated technology to scam unsuspecting job seekers (Phillips, 2021). Common tricks used in almost all ORF cases include conducting interviews over messaging services, asking for personal information, and asking for payment before joining (Porter, 2021).
The objective of the study is to identify and analyze the factors that cause job seekers to believe in online job postings by identifying the variables that affect them. First, the study uses selected predictive models to analyze the data with 14 features. The predictive models are assessed using four accuracy standards (classification accuracy, precision, recall, and F-1 score) to test the effectiveness of each proposed model. The results obtained support the adoption of the best predictive model for identifying fraudulent jobs.
The rest of the paper is organized as follows. Section 2 reviews the literature on the use of predictive modelling for ORF. Section 3 describes the research methodology, covering business understanding, data understanding, and data preparation for data analysis. Section 4 is devoted to data modelling, covering the model construction phase and the executed predictive models. Section 5 presents the analysis results, comparing the proposed models using four accuracy standards. The outcome deployment is in Section 6, based on the predictive models for the 30% test data.

Literature Review
There is ample literature on online crime detection models of different forms, and almost all of it applies different predictive models to identify ORF. Alharby and Alghamdi (2019) detected ORF using the Random Forest algorithm and gauged classification accuracy to assess the model's effectiveness. Another suggested approach for ORF is Hierarchical Clusters-based Deep Neural Networks (HC-DNN), which conducts fraud detection through cross-validation (Kim et al., 2019) and calculates precision to ascertain the efficacy of the model. Lal et al. (2019) proposed an ensemble learning-based model for ORF detection and evaluated it using recall, with an F1-score of 94% and an accuracy of 95.4%. Cao et al. (2019) introduced a model called TitAnt to classify job postings as fraudulent or non-fraudulent. Choi (2020) conducted a study using two types of classifiers, single and ensemble, to classify fraudulent job posts. Ranparia et al. (2020) used natural language processing (NLP) to analyze the sentiments and patterns of online job postings and predict fraudulent jobs posted on LinkedIn. Nasser and Alzaanin (2020) used multiple predictive models, namely Multinomial Naïve Bayes, Support Vector Machine, Decision Tree, K-Nearest Neighbours, and Random Forest, to differentiate between fraudulent and non-fraudulent job postings. Shree et al. (2021) used an ensemble algorithm together with Logistic Regression, K-Nearest Neighbours, and Random Forest predictive models to analyze job scams. Shibly et al. (2021) identified fraudulent online job postings using a combination of decision tree and random forest algorithms. Shishupal et al. (2021) used an NLP algorithm to classify job postings as fake or real.
The present study uses five predictive models, Random Forest, Artificial Neural Network, Naïve Bayes, Logistic Regression and Support Vector Machine, to classify fraudulent and non-fraudulent job postings.
The efficiency and effectiveness of the predictive models are judged using four accuracy standards: CA, Precision, Recall, and F1-score.

Research Methodology
The research methodology adopted for the study uses the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework for thorough data analysis. The CRISP-DM approach is applied to use the five different predictive models for identifying fraudulent job postings and has 6 phases:
1. Business understanding (section 3).
2. Data understanding (section 3).
3. Data preparation (section 3).
4. Data modeling (section 4).
5. Results evaluation: The results are evaluated based on different performance metrics (section 5).
6. Outcome deployment: Use the predictive models to predict fraudulent job postings (section 6).

Business Understanding
The research objectives of the study define the business understanding for the case. There are two research objectives:
i. To use five predictive models for the identification of ORF, and
ii. To evaluate the effectiveness of the predictive models in terms of predictive accuracy.

Data Understanding
The dataset used for the study comprises 17,880 job postings collected from Kaggle. The downloaded dataset includes 14 features, detailed in Table 1. Apart from these 14 features, the dependent variable, fraudulent, takes the value 1 when the job posting is fraudulent and 0 when it is non-fraudulent.
The predictive models used for the study divide the data randomly into two subsets: 70% of the data (12,516) for training the model and 30% of the data (5,364) for testing the model.
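The random 70/30 split can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function name `split_indices`, the seed, and the use of Python's standard `random` module are assumptions, and the row total is taken as the sum of the two reported subset sizes.

```python
import random

def split_indices(n_rows, train_tenths=7, seed=42):
    """Shuffle row indices and split them into a training and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    # Integer arithmetic avoids floating-point rounding in the split size.
    n_train = n_rows * train_tenths // 10
    return indices[:n_train], indices[n_train:]

# Assumed total: the reported 12,516 training rows plus 5,364 test rows.
train_idx, test_idx = split_indices(12516 + 5364)
print(len(train_idx), len(test_idx))  # prints 12516 5364
```

Fixing the seed keeps the split reproducible across the five models, so that each model is trained and tested on the same rows.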

Descriptive Analysis for general variables
Descriptive analysis helps in detecting the distribution of the target variable in the dataset. As shown in Figure 1, 56% of online job postings are for full-time jobs and 44% (22% each) are for part-time and contract jobs. Figure 3 reports the fraudulent job postings as a function of education level. The dataset has four education fields: associate degree, bachelor's degree, master's degree, and high school or equivalent. Figure 3 also shows that the number of fraudulent job postings is highest for bachelor's degrees and decreases for master's degrees. Job postings with associate degrees and unspecified education levels do not have any fraudulent job postings. Figure 4 shows the distribution of fraudulent job postings by industry. Almost one-third of the job postings did not mention any industry-specific information. The industry with the highest percentage of fraudulent job postings is oil & energy, with 12% of reported cases. Accounting and Healthcare reported 6% of fraudulent job postings, and Marketing 5%. The selection criterion was to include only reported cases where the number of fraudulent job postings was greater than 10.

Data preparation for predictive modeling
Data preparation is an essential and critical step for predictive modeling. For this study, data preparation mapped the textual values to numeric values for in-depth analysis (see Table 3).
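As an illustration of mapping textual values to numeric values, the sketch below assigns integer codes to the distinct values of one column. The column name `employment_type`, the helper names, and the code values themselves are hypothetical; the mapping actually used in the study is the one in Table 3.

```python
def build_mapping(values):
    """Assign consecutive integer codes to distinct text values, in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

def encode(values, mapping, unknown=-1):
    """Replace each text value by its code; unseen values get the `unknown` code."""
    return [mapping.get(value, unknown) for value in values]

# Hypothetical column using the job types named in the descriptive analysis.
employment_type = ["full-time", "part-time", "contract", "full-time"]
mapping = build_mapping(employment_type)
codes = encode(employment_type, mapping)
print(mapping)  # prints {'full-time': 0, 'part-time': 1, 'contract': 2}
print(codes)    # prints [0, 1, 2, 0]
```

The mapping should be built on the training subset only and then reused on the test subset, which is why unseen values are given a separate code here.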

Data Modeling
The fraudulent job posting classification is performed using five different predictive models. All five predictive models treat the task as a binary classification of job postings as fraudulent (1) or non-fraudulent (0). This section describes the mathematical formulation of each selected predictive model and the mechanism of its classification. The selection of the model is based on the Receiver Operating Characteristic (ROC) curve.

Support Vector Machine (SVM)
SVM finds the optimal hyperplane with respect to the feature set to divide the data points into two classes. Several hyperplanes can separate the data points into different categories, but SVM helps find the optimal one, which directly affects the model's performance. The model is trained on the closest data points, known as support vectors, which lie on two hyperplanes (one for the fraudulent and one for the non-fraudulent class) on either side of the optimal hyperplane. The cost function for the SVM classifier is given mathematically in eq. 4.
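To make the idea of an SVM cost function concrete, the sketch below evaluates the widely used soft-margin form, 0.5·||w||² + C·Σ max(0, 1 − y_i(w·x_i + b)). This form is an assumption and is not necessarily identical to eq. 4; the penalty `C`, the toy data, and the recoding of labels to −1 (non-fraudulent) and +1 (fraudulent) are illustrative choices.

```python
def hinge_cost(w, b, X, y, C=1.0):
    """Soft-margin SVM cost: 0.5*||w||^2 + C * sum of hinge losses.

    Labels in y must be -1 (non-fraudulent) or +1 (fraudulent).
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        hinge += max(0.0, 1.0 - margin)  # zero once the point is beyond the margin
    return regularizer + C * hinge

# Toy data with two made-up features per posting.
X = [[2.0, 0.0], [0.0, 2.0]]
y = [1, -1]
print(hinge_cost([1.0, -1.0], 0.0, X, y))  # prints 1.0
print(hinge_cost([0.0, 0.0], 0.0, X, y))   # prints 2.0
```

With the first weight vector both points lie outside the margin, so only the regularization term remains; with zero weights every point incurs a hinge loss of 1.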

Naïve Bayes
Naive Bayes is a probabilistic predictive model that depends on the Bayes Theorem, as given in eq. 5:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (5)$$

The text data of the job description is converted into vectors. This numeric encoding of the text helps decide whether the vector representation of a job posting belongs to the fraudulent or the non-fraudulent class. A Naive Bayes classifier is trained with a given set of features so that job descriptions are automatically categorized as fraudulent or non-fraudulent using the probabilities defined in the Bayes theorem. Replacing A and B in eq. 5 with the response vector y and the feature matrix X, respectively, gives eq. 6:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)} \qquad (6)$$

Thus, the probability of a target class k given the feature matrix X is proportional to the probability of X given that class (fraudulent or non-fraudulent) times the probability of belonging to that class.
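The classification rule can be sketched on tokenized text as follows. This is a minimal multinomial Naive Bayes under stated assumptions: the toy postings are made up, the function names are hypothetical, and add-one (Laplace) smoothing is an added choice that the text does not mention.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class counts and per-class word counts from tokenized documents."""
    priors = Counter(labels)
    word_counts = {c: Counter() for c in priors}
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return priors, word_counts, vocab

def predict_nb(tokens, priors, word_counts, vocab):
    """Return the class maximizing log P(y) + sum_j log P(x_j | y)."""
    total_docs = sum(priors.values())
    best_class, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c] / total_docs)
        # Add-one smoothing (assumption) keeps unseen words from zeroing the product.
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[c][t] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up toy postings; 1 = fraudulent, 0 = non-fraudulent.
docs = [["earn", "money", "fast"], ["send", "payment", "now"],
        ["software", "engineer", "role"], ["nurse", "hospital", "role"]]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(["earn", "payment"], *model))       # prints 1
print(predict_nb(["engineer", "hospital"], *model))  # prints 0
```

The denominator P(X) of the Bayes theorem is the same for both classes, so it is omitted when comparing scores.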

Artificial Neural Network (ANN)
The ANN experimental setup for classifying fraudulent employers has two steps: Logistic Regression for the activation layer and Stochastic Gradient Descent (SGD) to minimize the cost. This section explains the mathematical representation of each of the two steps.
Apart from these 14 features, each job posting also carries a binary class label (1 for fraudulent and 0 for non-fraudulent). The data are grouped into a matrix $X$, in which each column corresponds to a feature and each row $x_i$ represents a single data point. The outputs form a vector $y$ whose entries are either 0 for not fraudulent or 1 for fraudulent. For the ANN, we also define the weight vector $w$, whose values are adjusted according to the cost function. The weighted sum for a data point is characterized as:

$$z_i = \sum_{j=1}^{14} w_j\,x_{ij} \qquad (7)$$

which in vector form is:

$$z_i = w^{\top} x_i \qquad (8)$$

The Logistic Regression uses the probabilistic logistic function of the weighted sum in equation (8):

$$\sigma(z_i) = \frac{1}{1 + e^{-z_i}} \qquad (9)$$

Based on equation (9), if the logistic output for a data point is nearing 1, we predict class 1 for that data point, and 0 otherwise (shown here with the conventional threshold of 0.5):

$$\hat{y}_i = \begin{cases} 1, & \sigma(z_i) \ge 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

For correct classification, Stochastic Gradient Descent (SGD) iteratively updates the weights to reduce the cost:

$$w \leftarrow w + \eta\,(y_i - \hat{y}_i)\,x_i \qquad (11)$$

where $w$ starts from an initial value, $\eta$ is the learning rate, and $\hat{y}_i$ is based on equation (10).
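The two steps, a logistic activation followed by SGD updates of the weights, can be sketched as below. This is an illustrative single-unit model, not the study's network: the learning rate, number of epochs, zero initialization, 0.5 threshold, and toy data are all assumptions, and the update uses the sigmoid output directly (the standard log-loss gradient), which may differ in detail from the update in equation (11).

```python
import math
import random

def sigmoid(z):
    """Logistic function mapping a weighted sum to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, threshold=0.5):
    """Predict class 1 when the logistic output reaches the threshold."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if sigmoid(z) >= threshold else 0

def sgd_train(X, y, lr=0.1, epochs=200, seed=0):
    """Update the weights one sample at a time using the log-loss gradient."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])  # zero initialization (assumption)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])))
            # The log-loss gradient for one sample is (p - y_i) * x_i.
            w = [wj - lr * (p - y[i]) * xj for wj, xj in zip(w, X[i])]
    return w

# Toy data: the first column is a constant 1.0 acting as a bias term.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
w = sgd_train(X, y)
print([predict(w, x) for x in X])
```

On this separable toy set the learned weights reproduce the training labels; real feature matrices would normally be scaled before training so that a single learning rate suits all 14 features.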

Random Forest
The random forest predictive model fits multiple decision trees and averages their outputs to improve predictive accuracy and control over-fitting. It uses the Gini index to measure the probability of a randomly chosen variable being wrongly classified. The Gini index varies between 0 and 1 and is denoted as:

$$\mathrm{Gini} = 1 - \sum_{i=1}^{c} p_i^{2}$$

where $p_i$ is the probability of an object being classified to a particular class, and $c$ refers to the classes (fraudulent and not fraudulent).
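A short sketch of the Gini index for a single tree node follows, assuming the standard impurity form 1 − Σ p_i², where p_i is the share of class i among the postings in the node. The function name and the example counts are illustrative.

```python
def gini_index(class_counts):
    """Gini impurity 1 - sum(p_i^2) for a node with the given per-class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty node is treated as pure
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

# Counts are [non-fraudulent, fraudulent] postings in a node.
print(gini_index([10, 0]))  # prints 0.0
print(gini_index([5, 5]))   # prints 0.5
```

A pure node scores 0, and with two classes the impurity peaks at 0.5 for an even mix, so a split is preferred when it moves the child nodes toward 0.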
The analysis and adoption of the model are based on the ROC curve. The ROC curve is considered an accurate and straightforward way of assessing how well a model separates classes 0 and 1. The higher the AUC (Area Under Curve), the better the predictive model is at differentiating between fraudulent and non-fraudulent job postings. The AUC values for the five predictive models are given in Table 4. Based on Table 4, the Random Forest predictive model gives the highest AUC value and is thus suitable for predicting job postings as fraudulent or non-fraudulent. For detailed analysis, in the following section, we use four accuracy standards for all five predictive models and then select the suitable ones.
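The AUC can be read as the probability that a randomly chosen fraudulent posting receives a higher score than a randomly chosen non-fraudulent one. The sketch below computes it by direct pairwise comparison; the scores are made-up values, the function name is hypothetical, and the quadratic loop is only meant for small illustrations.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up model scores for four postings (1 = fraudulent).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_score(labels, scores))  # prints 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the five models are compared.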

Results Evaluation
The data modeling process involves selecting machine learning techniques for predictive modeling. The artificial neural network (ANN) model is used to recognize fraudulent job postings online. The accuracy standards from the confusion matrix are classification accuracy, precision, and recall (see Table 4). This phase assesses the predictive models' abilities using four accuracy criteria (classification accuracy, precision, recall, and F-1 score) from the confusion matrix, summarized in Table 5. The outcomes are for the 70% of the dataset used for training the predictive models.
As shown in Table 6, the Random Forest predictive model achieves almost 95% on all four accuracy criteria. Table 7 presents the confusion matrix of the Random Forest predictive model, which classified the job postings as fraudulent or non-fraudulent with 95.2% accuracy on the 30% test data.
• The true-negative rate (TNR) was 100%, indicating that the non-fraudulent job postings were classified correctly by the model;
• The true-positive rate (TPR) is 95%, with the proposed model classifying 5096 of the 5364 job postings correctly.
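The accuracy criteria derived from a confusion matrix can be sketched as follows. The labels and predictions are made-up toy values rather than the study's results, the function names are hypothetical, and the metrics are computed for the fraudulent class only, which may differ from class-averaged values reported by some tools.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = fraudulent."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def evaluate(tp, fp, tn, fn):
    """Return CA, precision, recall (TPR), TNR, and F-1 from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"CA": (tp + tn) / total, "precision": precision,
            "recall": recall, "TNR": tnr, "F1": f1}

# Toy example: 3 fraudulent and 5 non-fraudulent postings.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
results = evaluate(*confusion_counts(y_true, y_pred))
print(results["CA"], results["TNR"])  # prints 0.75 0.8
```

Reporting TPR and TNR alongside accuracy matters when one class is rare, because accuracy alone can be high even if few fraudulent postings are caught.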

Outcome deployment
This paper uses the ANN predictive model approach to study fraudulent job postings online using the downloaded Kaggle dataset. The results were analyzed using four performance metrics for assessing the proposed model: classification accuracy (CA), precision, recall, and F-1 score. The experiment revealed that the ANN provides acceptable outcomes (CA=0.950, precision=0.907, recall=0.950, and F-1 score=0.928).
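As a quick consistency check on the reported figures, the snippet below recomputes the F-1 score from the reported precision and recall. It assumes the F-1 score is the harmonic mean of those two values; if the reported metrics are class-weighted averages, the relationship would only hold approximately.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported values: precision = 0.907, recall = 0.950.
f1 = f1_from(0.907, 0.950)
print(round(f1, 3))  # prints 0.928
```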
Future work for this study will develop the model further by deepening the analysis of the variables used in the models. The available data are limited with respect to the specifics of the fraudulent posters and the timeline, which constrains the analysis of how fraudulent job postings behave.