In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.
Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.
For this prediction task, we will use data parole.csv from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:
- male: 1 if the parolee is male, 0 if female
- race: 1 if the parolee is white, 2 otherwise
- age: the parolee's age (in years) when he or she was released from prison
- state: a code for the parolee's state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. These three states were singled out because they are highly represented in the dataset.
- time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
- max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
- multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
- crime: a code for the parolee's main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
- violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
We can load the dataset into R with the following command:
parole = read.csv("parole.csv")
Then we can count the number of parolees in the dataset with nrow(parole), or by inspecting the output of str(parole); the count is 675.
78 of the parolees in the dataset violated the terms of their parole. This can be observed by running
table(parole$violator)
The variables male, race, state, crime, and violator are all unordered factors, and only state and crime have at least 3 levels in this dataset.
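As a quick check, we can count the distinct values of each candidate variable (this sketch assumes parole has been loaded as above):

```r
# Number of distinct values for each candidate factor variable;
# only state and crime should show at least 3
sapply(parole[, c("male", "race", "state", "crime", "violator")],
       function(x) length(unique(x)))
```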
To convert to factors, the following commands should be run:
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
The output of summary(parole$state) or summary(parole$crime) now shows a breakdown of the number of parolees with each level of the factor, which is most similar to the output of the table() function.
Splitting into a Training and Testing Set
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
Roughly 70% of the parolees have been allocated to the training set and roughly 30% to the testing set.
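We can verify the allocation, assuming the split above has been run:

```r
# sample.split balances the outcome variable across the two sets
nrow(train) / nrow(parole)  # roughly 0.70
nrow(test) / nrow(parole)   # roughly 0.30
```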
Building a Logistic Regression Model
If we experimented with other training/testing set splits in the previous section, we should re-run the original 5 lines of code to restore the original split.
Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Our dependent variable is "violator", and we should use all of the other variables as independent variables.
mod = glm(violator~., data=train, family="binomial")
summary(mod)
The variables race, state4, and multiple.offenses are significant: each has at least one star, i.e., a p-value below 0.05 in the Pr(>|z|) column of the summary output.
Two properties of the logistic regression coefficients are important to note:
- If a variable has coefficient c, then a unit increase in that variable increases the log odds (logit) by c.
- Equivalently, a unit increase in that variable multiplies the odds by e^c.
For parolees A and B who are identical other than A having committed multiple offenses, the predicted log odds of A is 1.61 more than the predicted log odds of B. Then we have:
ln(odds of A) = ln(odds of B) + 1.61
exp(ln(odds of A)) = exp(ln(odds of B) + 1.61)
exp(ln(odds of A)) = exp(ln(odds of B)) * exp(1.61)
odds of A = exp(1.61) * odds of B
odds of A = 5.01 * odds of B
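We can confirm this multiplier directly in R:

```r
# Odds multiplier e^c for the multiple.offenses coefficient c = 1.6119919
exp(1.6119919)  # approximately 5.01
```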
Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. According to the model, the odds and probability that this individual is a violator can be calculated as follows.
From the logistic regression equation, we have
log(odds) = -4.2411574 + 0.3869904*male + 0.8867192*race - 0.0001756*age + 0.4433007*state2 + 0.8349797*state3 - 3.3967878*state4 - 0.1238867*time.served + 0.0802954*max.sentence + 1.6119919*multiple.offenses + 0.6837143*crime2 - 0.2781054*crime3 - 0.0117627*crime4.
This parolee has male=1, race=1, age=50, state2=0, state3=0, state4=0, time.served=3, max.sentence=12, multiple.offenses=0, crime2=1, crime3=0, crime4=0. We conclude that log(odds) = -1.700629.
Therefore, the odds are exp(-1.700629) = 0.183, and the predicted probability of violation is 1/(1 + exp(1.700629)) = 0.154.
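Rather than plugging values into the equation by hand, we can let predict() do the arithmetic. This sketch assumes the fitted model mod from above; the new.parolee data frame is constructed here for illustration, with state and crime supplied as factor levels matching the training data (Maryland falls under state 1, "any other state"):

```r
# The parolee described above, as a one-row data frame
new.parolee = data.frame(male = 1, race = 1, age = 50,
                         state = factor(1, levels = 1:4),
                         time.served = 3, max.sentence = 12,
                         multiple.offenses = 0,
                         crime = factor(2, levels = 1:4))
predict(mod, newdata = new.parolee, type = "link")      # log odds, approximately -1.70
predict(mod, newdata = new.parolee, type = "response")  # probability, approximately 0.154
```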
Evaluating the Model on the Testing Set
The following commands make the predictions and display a summary of the values:
predictions = predict(mod, newdata=test, type="response")
summary(predictions)
To obtain the confusion matrix, we can use the following command:
table(test$violator, as.numeric(predictions >= 0.5))
There are 202 observations in the test set.
The accuracy (percentage of values on the diagonal) is (167+12)/202 = 0.886.
The sensitivity (proportion of the actual violators we got correct) is 12/(11+12) = 0.522, and the specificity (proportion of the actual non-violators we got correct) is 167/(167+12) = 0.933.
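These quantities can also be computed directly from the confusion matrix, assuming the predictions object from above:

```r
cm = table(test$violator, as.numeric(predictions >= 0.5))
sum(diag(cm)) / sum(cm)        # accuracy: fraction on the diagonal
cm["1", "1"] / sum(cm["1", ])  # sensitivity: true positives / actual positives
cm["0", "0"] / sum(cm["0", ])  # specificity: true negatives / actual negatives
```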
For the accuracy of a simple model that predicts that every parolee is a non-violator:
table(test$violator)
We can see that there are 179 negative examples, which are the ones that the baseline model would get correct. Thus the baseline model would have an accuracy of 179/202 = 0.886.
Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole.
If the board used the model for parole decisions, a negative prediction would lead to a prisoner being granted parole, while a positive prediction would lead to a prisoner being denied parole. The parole board would experience more regret for releasing a prisoner who then violates parole (a negative prediction that is actually positive, or false negative) than it would experience for denying parole to a prisoner who would not have violated parole (a positive prediction that is actually negative, or false positive).
Decreasing the cutoff leads to more positive predictions, which increases false positives and decreases false negatives. Meanwhile, increasing the cutoff leads to more negative predictions, which increases false negatives and decreases false positives. The parole board assigns high cost to false negatives, and therefore should decrease the cutoff.
The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board.
From the previous question, the parole board would likely benefit from decreasing the logistic regression cutoff, which decreases the false negative rate while increasing the false positive rate.
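As an illustration, the confusion matrix at a lower cutoff can be inspected directly (0.3 here is a hypothetical cutoff chosen for demonstration, not a recommendation; the board would pick a value reflecting its actual costs):

```r
# Lowering the cutoff produces more positive predictions:
# fewer false negatives at the cost of more false positives
table(test$violator, as.numeric(predictions >= 0.3))
```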
The AUC value for the model is 0.8945834; it can be obtained with the following commands:
library(ROCR)
pred = prediction(predictions, test$violator)
as.numeric(performance(pred, "auc")@y.values)
The AUC can be interpreted as the probability that the model assigns a higher predicted probability to a randomly selected violator than to a randomly selected non-violator. It does not depend on the regression cutoff selected.