Using Logistic Regression Analysis to predict parole violators using R

Apr 4, 2020

In this study we try to predict whether a parolee will violate the terms of parole. The data set is limited, so no scientific conclusions can be drawn from this study; it is used only to demonstrate logistic regression in R.

The data set covers only the year 2004, and only parolees who served no more than 6 months in prison and whose maximum sentence did not exceed 18 months. It contains parolees who either successfully completed their term of parole during 2004 or violated its terms.

The parole data set contains the following variables:

male: 1 if the parolee is male, 0 if female
race: 1 if the parolee is white, 2 otherwise
age: the parolee’s age (in years) when he or she was released from prison
state: a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the data set.
time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
crime: a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.

Let's view the basic structure and dimensions of the data. We have 675 rows and 9 columns, and all columns are numeric. The number of dimensions is low, so we will be able to produce a simple model.

The summary statistics show that there are no missing values and no outliers in the data.

Before splitting the data into train and test sets, we check the class balance: there are 78 parole violators and 597 parolees who did not violate the terms of parole.

We then divide the data into train and test sets; the train count is 540 and the test count is 135 (an 80/20 split of the 675 rows).
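The inspection steps above can be sketched in base R. Since the original data file is not included here, the sketch builds a synthetic stand-in with the same column names and codings as the variable list; the random values, the age range, and the `parole.csv` file name in the comment are assumptions for illustration only.

```r
# Synthetic stand-in for the parole data (NOT the real data set).
# With the real file you would instead do something like:
#   parole <- read.csv("parole.csv")   # hypothetical file name
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 675
parole <- data.frame(
  male              = rbinom(n, 1, 0.8),
  race              = sample(1:2, n, replace = TRUE),
  age               = round(runif(n, 18, 67), 1),   # arbitrary age range
  state             = sample(1:4, n, replace = TRUE),
  time.served       = round(runif(n, 0, 6), 1),
  max.sentence      = sample(1:18, n, replace = TRUE),
  multiple.offenses = rbinom(n, 1, 0.5),
  crime             = sample(1:4, n, replace = TRUE),
  violator          = rbinom(n, 1, 0.12)
)

str(parole)                 # column types and a preview of the values
data_dim <- dim(parole)     # c(rows, columns)
print(data_dim)
print(summary(parole))      # min, quartiles, mean and max per column

# Count missing values per column and check that every column is numeric.
missing_per_column <- colSums(is.na(parole))
all_numeric <- all(sapply(parole, is.numeric))
print(missing_per_column)
```

Looking at the minimum and maximum of each column in `summary()` is a quick way to spot values outside the documented codings, such as a `state` code above 4.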
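A minimal sketch of the class count and the split follows. The article does not say which function produced the split, so this assumes a simple random 80/20 split with base R's `sample()`; the data frame is a synthetic stand-in that only reproduces the class counts quoted above, and the seed is arbitrary.

```r
set.seed(144)  # arbitrary seed so the split is reproducible

# Synthetic stand-in: 675 rows with 78 violators and 597 non-violators.
parole <- data.frame(
  max.sentence = sample(1:18, 675, replace = TRUE),
  violator     = rep(c(1, 0), times = c(78, 597))
)

# Class balance before splitting.
class_counts <- table(parole$violator)
print(class_counts)

# 80/20 split: sample row indices for training, keep the rest for testing.
n_train   <- round(0.8 * nrow(parole))
train_idx <- sample(seq_len(nrow(parole)), size = n_train)
train <- parole[train_idx, ]
test  <- parole[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

With only 78 violators, a stratified split that keeps the violator share equal in both sets may be preferable to a purely random one.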

Next, we run the logistic regression model and see which coefficients come out as significant.

The model converges in a small number of Fisher scoring iterations, and its AIC gives a baseline for comparison (AIC is meaningful only relative to other models fitted to the same data). We can improve the model by removing the variables that are not significant.

Let us re-run the model after removing the variables that are not significant, i.e. male, age, time.served and crime.

With the new model, all of the variables are now significant and the AIC is lower than before, so this model is the better of the two; the number of Fisher scoring iterations is also still low. Race has a positive coefficient of 0.8176, which on the odds scale is a factor of exp(0.8176) ≈ 2.27 per unit increase in the race code. State and max.sentence are also significant, but their coefficients are negative.
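The two fits discussed above can be sketched with `glm()`. The training data here is a synthetic stand-in (the generating values are arbitrary), so the printed coefficients, p-values and AIC values will not match the article's figures; the point is only the sequence of calls. As in the article, `state` and `crime` are treated as numeric codes rather than factors.

```r
# Synthetic stand-in for the 540-row training set (NOT the real data).
set.seed(7)  # arbitrary seed, for reproducibility only
n <- 540
train <- data.frame(
  male              = rbinom(n, 1, 0.8),
  race              = sample(1:2, n, replace = TRUE),
  age               = round(runif(n, 18, 67), 1),   # arbitrary age range
  state             = sample(1:4, n, replace = TRUE),
  time.served       = round(runif(n, 0, 6), 1),
  max.sentence      = sample(1:18, n, replace = TRUE),
  multiple.offenses = rbinom(n, 1, 0.5),
  crime             = sample(1:4, n, replace = TRUE)
)
# Simulated outcome; the generating coefficients are arbitrary.
train$violator <- rbinom(n, 1, plogis(-2.5 + 1.2 * train$multiple.offenses))

# Full model: every other column as a predictor.
full_model <- glm(violator ~ ., data = train, family = binomial)
print(summary(full_model))  # coefficients, p-values, AIC, Fisher scoring iterations

# Reduced model: male, age, time.served and crime removed.
reduced_model <- glm(
  violator ~ race + state + max.sentence + multiple.offenses,
  data = train, family = binomial
)

p_values  <- coef(summary(reduced_model))[, "Pr(>|z|)"]
aic_table <- AIC(full_model, reduced_model)  # lower AIC is preferred
n_iter    <- c(full = full_model$iter, reduced = reduced_model$iter)
print(aic_table)
print(n_iter)
```

Comparing the two AIC values side by side is what justifies preferring one model over the other; a single AIC value on its own says little about fit.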

The above model predicts that a parolee who committed multiple offenses has odds of being a violator 1.5486 times those of an otherwise identical parolee who did not commit multiple offenses. This reading assumes 1.5486 is the exponentiated coefficient (the odds ratio); if it is the raw coefficient on the log-odds scale, the odds ratio is exp(1.5486) ≈ 4.70.

We also do not see any multicollinearity in the model, as all vif() scores are below 5.

The R-squared of the model is as follows; for a logistic regression model this is McFadden's pseudo R-squared.

Now we run the model on the test data, which the model has not seen, and the accuracy is 0.866667 (117 of the 135 test cases).
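Logistic regression coefficients are on the log-odds scale, so turning them into odds ratios is a matter of exponentiating. The short sketch below does this for the two numbers quoted in the text, under the assumption that both are raw coefficients as printed by `summary()`.

```r
# Values quoted in the text, read here as log-odds coefficients (an assumption).
log_odds <- c(race = 0.8176, multiple.offenses = 1.5486)

# exp() converts a log-odds coefficient into an odds ratio.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))
# race: 2.27, multiple.offenses: 4.70
```

For a fitted model object, `exp(coef(model))` gives the odds ratios for all terms at once.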
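The article does not say which package its `vif()` comes from; a common choice is the car package. As a rough base-R sketch of the same idea for terms with a single coefficient each, the variance inflation factors can be read off the inverse of the correlation matrix of the coefficient estimates. The helper name `vif_one_df` is hypothetical and the data is a synthetic stand-in.

```r
# Synthetic stand-in for the training data (NOT the real data).
set.seed(2004)  # arbitrary seed
n <- 540
train <- data.frame(
  race              = sample(1:2, n, replace = TRUE),
  state             = sample(1:4, n, replace = TRUE),
  max.sentence      = sample(1:18, n, replace = TRUE),
  multiple.offenses = rbinom(n, 1, 0.5)
)
train$violator <- rbinom(n, 1, plogis(-2.5 + 1.2 * train$multiple.offenses))

model <- glm(violator ~ race + state + max.sentence + multiple.offenses,
             data = train, family = binomial)

# Hypothetical helper: VIF per coefficient for a model whose terms each
# contribute exactly one coefficient (no multi-level factors).
vif_one_df <- function(fit) {
  v <- vcov(fit)[-1, -1, drop = FALSE]  # coefficient covariance, intercept dropped
  diag(solve(cov2cor(v)))               # diagonal of the inverse correlation matrix
}

vifs <- vif_one_df(model)
print(round(vifs, 3))
any_above_5 <- any(vifs > 5)  # the rule of thumb used in the text
```

A value of 1 means a predictor is uncorrelated with the others; the threshold of 5 is a convention rather than a hard rule.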
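McFadden's pseudo R-squared compares the log-likelihood of the fitted model with that of an intercept-only model: 1 - logLik(model) / logLik(null). A minimal sketch on synthetic stand-in data follows; the resulting value is illustrative only and is not the article's figure.

```r
# Synthetic stand-in for the training data (NOT the real data).
set.seed(2004)  # arbitrary seed
n <- 540
train <- data.frame(
  race              = sample(1:2, n, replace = TRUE),
  state             = sample(1:4, n, replace = TRUE),
  max.sentence      = sample(1:18, n, replace = TRUE),
  multiple.offenses = rbinom(n, 1, 0.5)
)
train$violator <- rbinom(n, 1, plogis(-2.5 + 1.2 * train$multiple.offenses))

model      <- glm(violator ~ race + state + max.sentence + multiple.offenses,
                  data = train, family = binomial)
null_model <- glm(violator ~ 1, data = train, family = binomial)  # intercept only

# McFadden's pseudo R-squared from the two log-likelihoods.
mcfadden_r2 <- 1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
print(mcfadden_r2)
```

For a 0/1 outcome like `violator`, the same number can be obtained as `1 - model$deviance / model$null.deviance`. Values are typically much lower than a linear-regression R-squared, so the two should not be compared directly.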
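Scoring the held-out data can be sketched as follows. The article does not state the classification cutoff, so a threshold of 0.5 is assumed here; the data is again a synthetic stand-in, so the resulting accuracy is illustrative only.

```r
# Synthetic stand-in for the full data set (NOT the real data).
set.seed(2004)  # arbitrary seed
n <- 675
parole <- data.frame(
  race              = sample(1:2, n, replace = TRUE),
  state             = sample(1:4, n, replace = TRUE),
  max.sentence      = sample(1:18, n, replace = TRUE),
  multiple.offenses = rbinom(n, 1, 0.5)
)
parole$violator <- rbinom(n, 1, plogis(-2.5 + 1.2 * parole$multiple.offenses))

# 540 training rows and 135 test rows, as in the text.
train_idx <- sample(seq_len(n), size = 540)
train <- parole[train_idx, ]
test  <- parole[-train_idx, ]

model <- glm(violator ~ race + state + max.sentence + multiple.offenses,
             data = train, family = binomial)

# Predicted probabilities on unseen data, then a 0.5 cutoff (an assumption).
probs     <- predict(model, newdata = test, type = "response")
predicted <- as.integer(probs >= 0.5)

conf_mat <- table(actual = test$violator, predicted = predicted)
accuracy <- mean(predicted == test$violator)
print(conf_mat)
print(accuracy)
```

Because violators are a small minority, it is worth reading the confusion matrix alongside the accuracy: a model that predicts "no violation" for everyone already scores highly on accuracy alone.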