PREDICTING CYBER CRIME PATTERNS AND THE ROLE OF CONFUSION MATRIX - Case Study
What is a Cybercrime?
Cybercrime is criminal activity that either targets or uses a computer, a computer network or a networked device.
Most, but not all, cybercrime is committed by cybercriminals or hackers who want to make money. Cybercrime is carried out by individuals or organizations.
Some cybercriminals are organized, use advanced techniques and are highly technically skilled. Others are novice hackers.
Rarely, cybercrime aims to damage computers for reasons other than profit. These could be political or personal.
Examples of the different types of cybercrime:
- Email and internet fraud.
- Identity fraud.
- Theft of financial or card payment data.
- Theft and sale of corporate data.
- Cyberextortion (demanding money to prevent a threatened attack).
- Ransomware attacks (a type of cyberextortion).
- Cryptojacking (where hackers mine cryptocurrency using resources they do not own).
- Cyberespionage (where hackers access government or company data).
Most cybercrime falls under two main categories:
- Criminal activity that targets computers
- Criminal activity that uses computers to commit other crimes
Now let's look at a case study that predicts cybercrime.
Case Study:
KNOWLEDGE BASED SYSTEM FOR PREDICTING CYBER CRIME PATTERNS USING DATA MINING TECHNIQUES by Dr. G. Michael
In this study, a novel approach combining data mining and visualization techniques is implemented to predict the distribution of cyber crime over major areas of India. First, the cyber crime datasets were preprocessed and data mining algorithms were used to extract facts from them; the hidden relationships among the data were then explored and reported. Next, the cyber crime patterns were explored. These patterns help cyber crime analysts examine the networks through visualization for cyber crime prediction, and so support the prevention of cyber crimes. The work describes the design of a knowledge base for an intelligent cyber crime pattern identification system dealing with cases under the Information Technology Act (IT Act).
PROPOSED MODEL FOR CYBER CRIME ANALYSIS — KNOWLEDGE BASED SYSTEM
The purpose of the proposed model is to predict cyber crime patterns over selected areas across India more efficiently by using data mining techniques, and to propose a Knowledge Based System for cyber crime data analysis. For this purpose, cyber crime datasets were collected from 11 states of India over a period of 11 years and stored in a cyber crime database. The proposed model was developed in two phases:
Phase I: Develop Enhanced Random Forest classifier Component
1: Pre-processing by Attribute Greedy Stepwise selection method.
2: Clustering by TwoLogMean clustering algorithm.
3: Classification of the clustered cyber crime datasets by the Enhanced Random Forest classifier, to improve the efficiency of the classifier.
Phase II: Develop Knowledge Based System Component
1: Collect and feed the IT and IPC sections of cyber crime types of existing cyber crime cases in India.
2: Enclose the details of penalty and punishments of every type of cyber crime.
3: Embed the preventive suggestions (before & after) of such cyber crimes.
We will focus on Phase I to understand the role of the confusion matrix with the help of this case study.
Enhanced Random Forest
Random Forest builds a forest of many decision trees that can be used for classification, regression and other data mining tasks. During the training phase, the Random Forest classifier constructs multiple decision trees. Random Forest can also rank the variables by importance. Decision tree models became popular in data mining because they are easy to use, flexible in handling diverse data types, and interpretable. However, single decision tree models are unstable and overly sensitive to the specific training data. Ensemble techniques such as Random Forest address this issue by building a set of models and aggregating their predictions to decide the class label for a data point. Ensembles perform well when their individual members disagree with one another, and random forests achieve this diversity among individual trees through two sources of randomness: first, every tree is built on a separate bootstrap sample of the training data; second, only a randomly selected subset of the attributes is considered at each node when constructing the individual trees.
Every tree in the ensemble is built for a given training cybercrime dataset of N cases described by B attributes, as follows:
- Draw a bootstrap sample of N cases.
- At every node, randomly select a subset of b attributes.
- Grow the complete tree without pruning.
Because every tree is built independently of the others, random forests are computationally efficient.
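The steps above can be sketched in a few lines of Python. This is a minimal illustration of the two sources of randomness only, not the study's Enhanced Random Forest: the helper names (`bootstrap_sample`, `random_attribute_subset`), the toy data and the choice of b as the integer square root of B are assumptions made for the example.

```python
import math
import random

def bootstrap_sample(cases, rng):
    """Draw N cases with replacement from a dataset of N cases."""
    n = len(cases)
    return [cases[rng.randrange(n)] for _ in range(n)]

def random_attribute_subset(num_attributes, b, rng):
    """Pick b distinct attribute indices out of B, as done at every node."""
    return sorted(rng.sample(range(num_attributes), b))

# Hypothetical toy dataset: N = 8 cases, each described by B = 9 attributes.
rng = random.Random(0)
cases = [[rng.random() for _ in range(9)] for _ in range(8)]

B = len(cases[0])
b = max(1, math.isqrt(B))  # assumed rule of thumb: b = floor(sqrt(B))

for tree_id in range(3):
    sample = bootstrap_sample(cases, rng)            # bootstrap sample for this tree
    candidates = random_attribute_subset(B, b, rng)  # attributes tried at one node
    print(f"tree {tree_id}: {len(sample)} cases, candidate attributes {candidates}")
```

Because sampling is done with replacement, some cases appear several times in a tree's sample and others not at all, which is what makes the trees differ from one another.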
Let’s look into the above process step-wise
Process Enhanced Random Forest -
Input: clustered cybercrime dataset
Output: decision rule for Knowledge Based System
Load the attribute-selected cyber crime dataset and assign the data to the clustered dataset from the TwoLogMean clustering algorithm
Read the data file and set validation to null
Perform ten-fold cross-validation
Separate the total number of instances into training and testing arrays
Run the clustered cyber crime dataset through the classifier
For every training-testing split pair, train and test the classifier
Compute the overall accuracy of the classifier across all splits
Display the classifier name, accuracy and confusion matrix, and generate the ROC curve
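The cross-validation part of this procedure can be sketched as follows. This is a minimal, standard-library illustration rather than the study's implementation: the `MajorityClassifier` stand-in, the `k_fold_indices` and `cross_validate` helpers and the toy labels are all assumptions, and neither the Enhanced Random Forest nor the ROC curve is reproduced.

```python
import random
from collections import Counter

class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k=10, seed=0):
    """Train and test on every split; return overall accuracy and confusion counts."""
    folds = k_fold_indices(len(X), k, seed)
    confusion = Counter()  # keys are (actual, predicted) pairs
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = MajorityClassifier().fit([X[j] for j in train_idx],
                                         [y[j] for j in train_idx])
        predictions = model.predict([X[j] for j in test_idx])
        for j, predicted in zip(test_idx, predictions):
            confusion[(y[j], predicted)] += 1
    correct = sum(n for (actual, predicted), n in confusion.items()
                  if actual == predicted)
    return correct / len(X), confusion

# Toy data: 100 instances, 70 labelled 1 and 30 labelled 0.
X = [[i] for i in range(100)]
y = [1] * 70 + [0] * 30
accuracy, confusion = cross_validate(X, y, k=10)
print(f"overall accuracy: {accuracy:.2f}")
print("confusion counts (actual, predicted):", dict(confusion))
```

Every instance is used for testing exactly once, so the confusion counts accumulated across the ten splits cover the whole dataset. The stand-in classifier always predicts the majority label, so on this toy data it reaches an accuracy of 0.70 while misclassifying every instance of class 0, which only the confusion counts reveal.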
Let’s evaluate the above model
Performance analysis of Enhanced Random Forest with existing classifier :
For the comparative analysis of Enhanced Random Forest with the existing classifier, several evaluation measures were considered. The cyber crime dataset comprises 717 samples.
From Table 1, it is clear that Enhanced Random Forest was more accurate (99.58%) than the existing Naïve Bayes classifier (87.30%) and took comparatively less time.
Table 2 tabulates the performance assessment of the existing Naïve Bayes classifier versus the Enhanced Random Forest classifier. The Kappa statistic of Enhanced Random Forest (0.9697) indicates almost perfect agreement between its predictions and the actual classes.
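For context, Cohen's kappa compares the observed agreement between predictions and actual labels with the agreement expected by chance, so it can be computed directly from a confusion matrix. The sketch below uses a made-up 2x2 matrix, not the study's data, and the function name `cohen_kappa` is an assumption for illustration.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column proportions, summed over classes.
    expected = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(size)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical counts, for illustration only.
example = [[45, 5],
           [10, 40]]
print(round(cohen_kappa(example), 4))
```

A value of 1 means the predictions agree with the actual classes in every case, while a value near 0 means the agreement is no better than chance.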
But what is a confusion matrix and why is it important?
A confusion matrix is a table that displays and compares actual values with the model's predicted values. Within the context of machine learning, a confusion matrix is used to analyze how a machine learning classifier performed on a dataset. It provides the counts from which metrics like precision, accuracy, specificity, and recall are calculated.
The confusion matrix is particularly useful because it gives a more complete picture of how a model performed. Relying only on a metric like accuracy can lead to a situation where the model completely and consistently misidentifies one class, yet this goes unnoticed because average performance is good. The confusion matrix, by contrast, breaks the results down into False Negatives, True Negatives, False Positives, and True Positives.
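A small, made-up example shows how accuracy alone can hide this kind of failure. The labels below are hypothetical: 95 negative cases, 5 positive cases, and a degenerate model that predicts negative every time.

```python
# Hypothetical imbalanced sample: 95 negative cases (0) and 5 positive cases (1).
actual = [0] * 95 + [1] * 5
# A degenerate model that predicts "negative" for every case.
predicted = [0] * len(actual)

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(pairs)
true_positives = sum(a == 1 and p == 1 for a, p in pairs)
false_negatives = sum(a == 1 and p == 0 for a, p in pairs)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.95, recall = 0.00
```

The model looks 95% accurate even though it never identifies a single positive case; the five False Negatives in the confusion matrix make that visible immediately.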
The Positive/Negative label refers to the outcome predicted by the model, while True/False indicates whether that prediction matches the actual outcome.
Now let’s understand each term :
- True Positive: the actual class of a data point is 1 and the model predicted 1. (The model correctly says positive, so the prediction can be trusted.)
- False Negative: the actual class of a data point is 1 and the model predicted 0. (The model wrongly says negative, so the prediction is not reliable.)
- False Positive: the actual class of a data point is 0 and the model predicted 1. (The model wrongly says positive, so the prediction is not reliable.)
- True Negative: the actual class of a data point is 0 and the model predicted 0. (The model correctly says negative, so the prediction can be trusted.)
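The four terms can be counted directly from paired lists of actual and predicted labels. The sketch below assumes binary labels with 1 as the positive class; the function name `confusion_counts` and the sample labels are made up for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN for binary labels where 1 is the positive class."""
    names = {(1, 1): "TP", (1, 0): "FN", (0, 1): "FP", (0, 0): "TN"}
    counts = Counter(names[(a, p)] for a, p in zip(actual, predicted))
    return {name: counts[name] for name in ("TP", "FN", "FP", "TN")}

# Hypothetical labels, for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(actual, predicted))
# prints: {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```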
Given below is a list of rates that are often computed from a confusion matrix for a binary classifier:
- Accuracy: Overall, how often is the classifier correct?
(TP + TN)/total
- Misclassification Rate: Overall, how often is it wrong?
(FP + FN)/total. It is equivalent to 1 minus Accuracy and is also known as “Error Rate”
- True Positive Rate: When it’s actually yes, how often does it predict yes?
TP/actual yes, also known as “Sensitivity” or “Recall”
- False Positive Rate: When it’s actually no, how often does it predict yes?
FP/actual no
- True Negative Rate: When it’s actually no, how often does it predict no?
TN/actual no. It is equivalent to 1 minus False Positive Rate and is also known as “Specificity”
- Precision: When it predicts yes, how often is it correct?
TP/predicted yes
- Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total
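All of these rates follow from the four cells of the confusion matrix. The sketch below uses hypothetical counts rather than the study's figures, and the function name `rates` is an assumption; it also assumes that every class occurs at least once, otherwise some of the divisions would fail.

```python
def rates(tp, fn, fp, tn):
    """Derive the common binary-classification rates from the four confusion-matrix cells."""
    total = tp + fn + fp + tn
    actual_yes, actual_no = tp + fn, fp + tn
    predicted_yes = tp + fp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / actual_yes,    # sensitivity / recall
        "false_positive_rate": fp / actual_no,
        "true_negative_rate": tn / actual_no,     # specificity
        "precision": tp / predicted_yes,
        "prevalence": actual_yes / total,
    }

# Hypothetical counts, for illustration only.
for name, value in rates(tp=100, fn=5, fp=10, tn=50).items():
    print(f"{name}: {value:.2f}")
```

The two identities mentioned in the list can be checked on the result: accuracy and misclassification rate sum to 1, and so do the false positive rate and the true negative rate.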
Let's get back to our model and look at its confusion matrix again.
Table 2 tabulates the performance of the existing Naïve Bayes classifier against the Enhanced Random Forest classifier.
Comparing the values obtained in the confusion matrices:
- True Positive: for Enhanced Random Forest this is higher, which is good, because it means more cases were predicted correctly.
- False Negative: for Enhanced Random Forest this is zero, which is very good, because it means no actual positive case was wrongly predicted as negative.
Consequently, the accuracy of Enhanced Random Forest is very high, at 99.58%.
Machine learning techniques have proven beneficial for the whole security industry. However, the application of machine learning is often limited by the lack of standardized datasets, overfitting issues, the architecture cost, and so on. Therefore, it is important to design and apply new approaches that keep the benefits of machine learning algorithms while addressing these limitations in practice. In this study, data mining algorithms and visualization techniques were used to help law enforcement officials predict cyber crimes.
The developed cyber crime analysis tool provides a framework for visualizing the diverse cyber crime types and cyber crime prone areas in India on Google Maps and for investigating them with data mining algorithms. It enables law enforcement officials to examine cyber crime networks through interactive visualizations. The interactive and visual aspects help in detecting and understanding cyber crime patterns. In the performance evaluation of the existing and proposed classifiers, Enhanced Random Forest achieved an accuracy of 99.58% with less computation time than Naïve Bayes.