Exploratory Data Analysis (EDA), Feature Selection, and machine learning prediction on time series data.
INTRODUCTION
Dataset description: Swedish crime statistics from 1950 to 2015.
Attribute information of dataset:
- crimes.total: total number of reported crimes.
- crimes.penal.code: total number of reported crimes against the criminal code.
- crimes.person: total number of reported crimes against a person.
- murder: total number of reported murders.
- sexual.offences: total number of reported sexual offences.
- rape: total number of reported rapes.
- assault: total number of reported aggravated assaults.
- stealing.general: total number of reported crimes involving stealing or robbery.
- robbery: total number of reported armed robberies.
- burglary: total number of reported armed burglaries.
- vehicle.theft: total number of reported vehicle thefts.
- house.theft: total number of reported thefts inside a house.
- shop.theft: total number of reported thefts inside a shop.
- out.of.vehicle.theft: total number of reported thefts from a vehicle.
- criminal.damage: total number of reported criminal damages.
- other.penal.crimes: number of other penal crime offenses.
- fraud: total number of reported frauds.
- narcotics: total number of reported narcotics abuses.
- drunk.driving: total number of reported drunk driving incidents.
- Year: the year.
- population: the total estimated population of Sweden at the time.
Download Dataset: https://www.kaggle.com/mguzmann/swedishcrime
GOAL: In this project we will train a machine learning model to predict the murder rate in Sweden using the Swedish crime dataset. We will also perform exploratory data analysis (EDA) and feature selection by accomplishing the following on the dataset:
1. Load and view dataset
2. Data visualization
3. Data preprocessing (data encoding, handling missing values, handling outliers (detection, removal and replacement) and normalization)
4. Feature selection with filter, embedded and wrapper methods
5. Compare training without feature selection and with feature selection (filter method (chi-square), wrapper method (RFE) and embedded method (Lasso))
6. Time series or regression algorithms comparison (Naïve Bayes, k-nearest neighbor, support vector machines, convolutional neural network and recurrent neural network (RNN, LSTM))
7. Save trained model
1. Load and view dataset
View the first five rows and the shape of the dataset, check for missing data, and finally check the data type of each column.
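A minimal sketch of this step with pandas. The CSV file name in the comment is an assumption, and the small frame with made-up numbers only stands in for the real data so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# In the project the data would be read from the downloaded CSV, e.g.
# df = pd.read_csv("reported.csv")  # file name is an assumption
# Toy stand-in with made-up values (not taken from the dataset):
df = pd.DataFrame({
    "Year": [1950, 1951, 1952, 1953, 1954, 1955],
    "murder": [1.0, 1.0, np.nan, 2.0, 1.0, 1.0],
    "fraud": [10, 12, 11, 13, 15, 14],
})

first_rows = df.head()        # first five rows
shape = df.shape              # (number of rows, number of columns)
missing = df.isnull().sum()   # count of missing values per column
dtypes = df.dtypes            # data type of each column

print(first_rows)
print(shape)
print(missing)
print(dtypes)
```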
2. Data visualization
We can plot some visualizations using line charts, density plots, scatter plots, bar charts and histograms.
3. Data preprocessing (data encoding, handling missing values, handling outliers (detection, removal and replacement) and normalization)
3.1 Handling missing values
Replace the missing data with the mean of each column.
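Mean imputation can be sketched in one line with pandas. The frame below uses made-up values purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (made-up numbers).
df = pd.DataFrame({
    "murder": [1.0, np.nan, 3.0],
    "fraud": [10.0, 20.0, np.nan],
})

# df.mean() skips NaN by default, so each gap gets its own column's mean.
filled = df.fillna(df.mean())
print(filled)
```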
3.2 Encoding data
Since there are no object (non-numeric) columns in the dataset, there is no need to encode any columns.
3.3 Outlier detection
3.3.1 Visualization method
Outliers can be spotted visually with boxplots, distribution plots and other plots. Boxplots and distribution plots are used in this project.
3.3.2 Z-score detection
Another method for outlier detection is the Z-score.
The Z-score, also known as the standard score, tells how many standard deviations a data point is from the mean. It helps to understand whether a data value is greater or smaller than the mean and how far away from the mean it is. More specifically, z = (x - mean) / standard deviation.
We calculate the Z-scores for each column and set a threshold (3 is a common choice). A data point whose absolute Z-score exceeds the threshold is quite different from the other data points and is treated as an outlier.
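A minimal sketch of Z-score detection with NumPy. The threshold of 3 and the sample values are assumptions chosen for illustration; `zscore_outlier_mask` is a hypothetical helper name.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Made-up column: twelve ordinary values and one extreme value.
data = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 50], dtype=float)

mask = zscore_outlier_mask(data)
clean = data[~mask]   # keep only the points inside the threshold
print(data[mask], clean.shape)
```

Note that in a very small sample a single extreme value inflates the standard deviation, so its Z-score may never reach the threshold.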
3.3.3 Interquartile range
The interquartile range (IQR) measures variability by dividing a dataset into quartiles. Q1, Q2 and Q3, called the first, second and third quartiles, are the values that split the dataset.
- Q1 represents the 25th percentile of the data.
- Q2 represents the 50th percentile of the data.
- Q3 represents the 75th percentile of the data.
IQR is the range between the first and the third quartiles, Q1 and Q3: IQR = Q3 - Q1. Data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are outliers.
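The rule above can be sketched with NumPy as follows. The helper name `iqr_bounds` and the sample values are assumptions for illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up column with one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

lower, upper = iqr_bounds(data)
outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)
```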
3.3.4 Removing outliers
In the previous sections, we saw how to detect outliers using the Z-score and the interquartile range. Now we want to remove or filter the outliers and get clean data.
We now have a new shape of (47, 21) after removal with the Z-score.
We now have a new shape of (51, 21) after removal with the interquartile range.
We can also replace outliers with the median value of each column.
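Both options, dropping rows and replacing with the median, can be sketched with pandas. This uses the IQR rule on a toy frame with made-up numbers; `iqr_outlier_mask` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def iqr_outlier_mask(col, k=1.5):
    """Boolean Series that is True where a value lies outside the IQR fences."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

# Toy frame (made-up values); only "murder" contains an outlier.
df = pd.DataFrame({
    "murder": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0],
    "fraud": [5.0, 6.0, 5.0, 7.0, 6.0, 5.0, 6.0, 7.0, 6.0],
})

mask = df.apply(iqr_outlier_mask)   # one boolean column per feature

# Option 1: drop every row that has an outlier in any column.
removed = df[~mask.any(axis=1)]

# Option 2: keep all rows, but overwrite each outlier with its column median.
replaced = df.apply(lambda col: col.mask(iqr_outlier_mask(col), col.median()))

print(removed.shape, replaced.shape)
```

Removal shrinks the dataset, which matters with only a few dozen yearly rows, while median replacement keeps every row.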
3.4 Normalization with min-max scaler
Min-max normalization is one of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
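A minimal NumPy sketch of the transformation described above. The helper name `min_max_scale` and the sample matrix are assumptions; scikit-learn's `MinMaxScaler` applies the same formula with its default settings.

```python
import numpy as np

def min_max_scale(X):
    """Scale every column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # constant column: avoid division by zero
    return (X - col_min) / col_range

# Made-up matrix with two features on different scales.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])

scaled = min_max_scale(X)
print(scaled)
```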
4. Feature selection with filter, embedded and wrapper methods
Feature selection, also known as attribute selection, is the process of extracting the most relevant features from the dataset before applying machine learning algorithms, in order to improve model performance. Feature selection can lead to better learning performance, higher accuracy, lower computational cost and better model interpretability.
4.1 Filter method
In the filter method, features are selected based on statistical measures. It is independent of the learning algorithm and requires less computational time. Examples are information gain, the chi-square test, Fisher score, correlation coefficient, variance threshold and ANOVA.
4.1.1 Chi-square
Calculate the chi-square statistic between each feature and the target, and select the desired number of features with the best chi-square scores. The test determines whether the association between two categorical variables in the sample reflects a real association in the population.
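A hedged sketch of chi-square selection with scikit-learn's `SelectKBest` and `chi2`. The data is synthetic, not the crime dataset. Note that `chi2` expects non-negative features and class labels, so a continuous target would first have to be binned into classes; that binning step is an assumption and is not shown here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic two-class target and three non-negative count features.
y = np.array([0] * 20 + [1] * 20)
X = np.column_stack([
    y * 10 + rng.integers(0, 3, size=40),   # strongly tied to the class
    rng.integers(0, 10, size=40),           # noise
    rng.integers(0, 10, size=40),           # noise
])

# Keep the single feature with the highest chi-square score.
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
selected = selector.get_support(indices=True)
print(selector.scores_, selected)
```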
4.1.2 Pearson correlation
Correlation is a measure of the linear relationship of 2 or more variables. Through correlation, we can predict one variable from the other. The logic behind using correlation for feature selection is that the good variables are highly correlated with the target.
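As an illustration, features can be ranked by the absolute value of their Pearson correlation with the target. The column names match the dataset, but the numbers are made up.

```python
import pandas as pd

# Toy frame with made-up values; "murder" plays the role of the target.
df = pd.DataFrame({
    "assault": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fraud": [2.0, 1.0, 4.0, 3.0, 5.0],
    "murder": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# DataFrame.corr() computes Pearson correlation by default.
corr_with_target = df.corr()["murder"].drop("murder")

# Rank features from most to least correlated with the target.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```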
4.1.3 Information gain
It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
4.1.4 ANOVA
ANOVA is an acronym for analysis of variance and is used for determining whether the means from two or more samples of data (often three or more) come from the same distribution or not. The results of this test can be used for feature selection where those features that are independent of the target variable can be removed from the dataset.
4.2 Wrapper method
The wrapper methodology treats the selection of feature sets as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate each combination of features and assign it a performance score. RFE, forward selection and backward elimination are used on the dataset.
4.2.1 RFE
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. Features are ranked by the model’s coef_ or feature_importances_ attributes, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.
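A minimal RFE sketch with scikit-learn. The linear regression estimator, the number of features to keep and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: only features 0 and 3 actually drive the target.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

# Drop one feature per iteration (step=1) until two remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)   # indices of the kept features
print(selected, rfe.ranking_)             # rank 1 marks a kept feature
```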
4.2.2 Forward selection
Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.
4.2.3 Backward elimination
In backward elimination, we start with all the features and, at each iteration, remove the least significant feature, that is, the one whose removal most improves the performance of the model. We repeat this until removing features brings no further improvement.
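Forward selection and backward elimination can both be sketched with scikit-learn's `SequentialFeatureSelector`. This variant stops at a fixed number of features rather than when the score stops improving; that choice, the estimator and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: only features 0 and 3 actually drive the target.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

# Forward: start with no features and add the best one at each step.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward: start with all features and remove the least useful one each step.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

forward_idx = forward.get_support(indices=True)
backward_idx = backward.get_support(indices=True)
print(forward_idx, backward_idx)
```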
4.3 Embedded method
Embedded methods combine the benefits of the wrapper and filter methods: they account for interactions between features while keeping a reasonable computational cost. Feature selection happens as part of model training itself, so at each training iteration the model retains the features that contribute the most.
4.3.1 Lasso Regularization
Lasso, or L1 regularization, adds a penalty to the parameters of the machine learning model to avoid over-fitting. In linear model regularization, the penalty is applied to the coefficients that multiply each of the predictors. Among the different types of regularization, Lasso has the property that it can shrink some of the coefficients exactly to zero, so the corresponding features can be removed from the model.
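A minimal sketch of Lasso-based selection with scikit-learn: features whose coefficients are shrunk to zero are dropped. The value of `alpha` and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only features 0 and 3 actually drive the target.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

# A larger alpha means a stronger L1 penalty and more zeroed coefficients.
lasso = Lasso(alpha=0.5).fit(X, y)

kept = np.flatnonzero(lasso.coef_)   # indices with a non-zero coefficient
X_selected = X[:, kept]
print(lasso.coef_, kept)
```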
4.3.2 Ridge Regression
Ridge regression, or L2 regularization, on the other hand, is useful when you have collinear or codependent features. Ridge regression adds the squared magnitude of the coefficients as a penalty term to the loss function.
4.3.3 Random forest importance
Feature importance refers to a class of techniques that assign a score to each input feature of a predictive model, indicating the relative importance of that feature when making a prediction.
Random forest is a bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of the nodes, in other words by the decrease in impurity (Gini impurity) over all trees.
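A hedged sketch of impurity-based importances with scikit-learn. A regressor is assumed here, in which case the impurity being reduced is the variance of the target rather than Gini impurity, which applies to classifiers. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: only features 0 and 3 actually drive the target.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

importances = rf.feature_importances_     # one score per feature, summing to 1
ranking = np.argsort(importances)[::-1]   # most important feature first
print(importances, ranking)
```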
4.3.4 Principal Component Analysis (PCA)
PCA is a dimensionality reduction method that can be described and implemented using the tools of linear algebra. The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings).
5. Compare training without feature selection and with feature selection (filter method (chi-square), wrapper method (RFE) and embedded method (Lasso))
5.1 Naïve Bayes training without feature selection
We have an accuracy of 0.7 after training with naïve Bayes without feature selection.
5.2 Naïve Bayes training with feature selection
5.2.1 Chi-square
We have an accuracy of 0.714 after training with naïve Bayes with filter method (chi-square) feature selection.
5.2.2 RFE
We have an accuracy of 0.643 after training with naïve Bayes with wrapper method (RFE) feature selection.
5.2.3 LASSO
We have an accuracy of 0.785 after training with naïve Bayes with embedded method (LASSO) feature selection.
6. Time series or regression algorithms comparison (Naïve Bayes, k-nearest neighbor, support vector machines, convolutional neural network and RNN (LSTM))
6.1 Naive Bayes
Naive Bayes is an algorithm based on Bayes’ theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. When this independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
6.2 K nearest neighbor
KNN algorithm can be used for both classification and regression problems. The KNN algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set.
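A small illustration of KNN regression with scikit-learn: the prediction for a new point is the average target of its k closest training points. The data and the choice of k = 3 are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-feature training set forming two well-separated groups.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# Each prediction is the mean target of the three nearest training points.
pred = knn.predict(np.array([[2.1], [10.9]]))
print(pred)
```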
6.3 Support Vector Machine (SVM)
Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). In the case of regression, a margin of tolerance (epsilon) is set around the fitted function, and errors that fall within this margin are not penalized.
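A minimal sketch of support vector regression with scikit-learn's `SVR`. The linear kernel, the values of `C` and `epsilon`, and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Made-up, perfectly linear data: y = 2x.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

# epsilon is the tolerance: errors smaller than it carry no penalty.
svr = SVR(kernel="linear", C=100.0, epsilon=0.1).fit(X, y)

max_error = np.max(np.abs(svr.predict(X) - y))
print(max_error)
```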
6.4 Convolutional Neural Network
Convolutional Neural Network (CNN) models are mainly used for two-dimensional arrays like image data. However, we can also apply CNN with regression data analysis. In this case, we apply a one-dimensional convolutional network and reshape the input data according to it. Keras provides the Conv1D class to add a one-dimensional convolutional layer into the model.
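The reshaping step can be sketched with NumPy alone. Keras `Conv1D` expects a 3D input of shape (samples, steps, channels), so each tabular row is treated here as a sequence of steps with one channel. The sample matrix is made up, and the model itself is not shown.

```python
import numpy as np

# Made-up tabular data: 4 rows (samples) and 3 features.
X = np.arange(12, dtype=float).reshape(4, 3)

# Add a trailing channel axis: (samples, features) -> (samples, steps, 1).
X_cnn = X.reshape(X.shape[0], X.shape[1], 1)
print(X_cnn.shape)
```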
6.5 RNN (LSTM)
RNNs (LSTMs) are good at extracting patterns from the input feature space when the input data spans long sequences. The gated architecture of an LSTM lets it manipulate its memory state, which makes LSTMs well suited to regression and time series problems.
7. Save trained model
Since the machine learning model has been trained, we can now save this model with pickle.
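A minimal sketch of saving and reloading a model with pickle. The linear regression model, the toy data and the file name are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on made-up data.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# Hypothetical file name, written to the system temp directory.
path = os.path.join(tempfile.gettempdir(), "crime_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)        # save the trained model

with open(path, "rb") as f:
    restored = pickle.load(f)    # load it back later

print(restored.predict(X))
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.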
CONCLUSION
This project explained the process of EDA on the Swedish crime dataset. We covered visualization and data preprocessing (handling missing data, outliers and normalization), explained feature selection methods, compared training accuracy with chi-square, RFE and Lasso feature selection, and finally compared SVM, KNN, Naïve Bayes, CNN and LSTM.
WRITER: OLUYEDE SEGUN . A(jr)
Resources used (References) and further reading:
linkedin profile: https://www.linkedin.com/in/oluyede-segun-adedeji-jr-a5550b167/
Link to explanatory notebook: https://github.com/juniorboycoder/TIME_SEREIS_EDA_FEATURE_SELECTION_AND_PREDICITVE_ANALYSIS/blob/main/eda_and_feature_Selection_timeseries_project.ipynb
twitter profile: https://twitter.com/oluyedejun1
TAGS: #FeatureSelection #Outlier #timeseries #regression #CNN #SVM #KNN #LSTM #Naivebayes #filter #wrapper #embedded #EDA