Sun 21 June 2015
| by Usman Ijaz Malik
| in Trees
| tags: R Data Analysis
The United States government periodically collects demographic information by conducting a census.
Today, we are going to use census information about an individual to predict how much a person earns -- in particular, whether the person earns more than $50,000
per year. This data comes from the UCI Machine Learning Repository.
The file census.csv contains 1994 census data for 31,978 individuals in the United States.
The dataset includes the following 13 variables:
age = the age of the individual in years
workclass = the classification of the individual's working status (does the person work for the federal government, work for the local government, work without pay, and so on)
education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, so on)
maritalstatus = the marital status of the individual
occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales and so on)
relationship = relationship of individual to his/her household
race = the individual's race
sex = the individual's sex
capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
hoursperweek = the number of hours the individual works per week
nativecountry = the native country of the individual
over50k = whether or not the individual earned more than $50,000 in 1994
A Logistic Regression Model
Let's begin by building a logistic regression model to predict whether an individual's earnings are above $50,000 (the variable "over50k") using all of the
other variables as independent variables. First, let's read the dataset census.csv into R.
census = read.csv("census.csv")
Let's split the data randomly into a training set and a testing set, setting the seed to 2000 before creating the split.
We are splitting the data so that the training set contains 60% of the observations, while the testing set contains 40% of the observations.
library(caTools)
set.seed(2000)
spl = sample.split(census $ over50k , SplitRatio = 0.6)
Train = subset(census, spl==TRUE)
Test = subset(census, spl==FALSE)
Next, let's build a logistic regression model to predict the dependent variable "over50k", using all of the other variables in the dataset as independent
variables. We are using the training set to build the model.
censusLog = glm(over50k ~ . , data = Train, family = "binomial")
summary(censusLog)
We can see that the age, workclass, education, maritalstatus, occupation, relationship, sex, capitalgain, capitalloss and hoursperweek are the significant variables if we
use 0.1 as your significance threshold, so variables with a period or dot in the stars column should be counted too.
predictLog = predict(censusLog,newdata = Test,type="response")
table(Test $ over50k , predictLog >= 0.5)
If we divide the sum of the main diagonal by the sum of all of the entries in the matrix, we obtain the testing set accuracy:
(9051+1888)/(9051+662+1190+1888) = 0.8552107
For the testing set baseline accuracy, we need to first determine the most frequent outcome in the training set. To do that, we table the dependent variable in the training set:
"<=50K" is the more frequent outcome (14570 observations), so this is what the baseline predicts.
To generate the accuracy of the baseline on the test set, we can table the dependent variable in the test set:
The baseline accuracy is
9713/(9713+3078) = 0.7593621.
To find out Area Under the Curve(AUC):
library(ROCR)
PredictROC = predict(censusLog, newdata = Test)
pred = prediction(PredictROC,Test $ over50k )
as.numeric(performance(pred,"auc")@y.values)
perfLog=performance(pred,"tpr","fpr")
plot(perfLog)
And the AUC value is : 0.9061598
A CART Model
We have just seen how the logistic regression model for this data achieves a high accuracy. Moreover, the significances of the variables give us a way to
gauge which variables are relevant for this prediction task. However, it is not immediately clear which variables are more important than the others,
especially due to the large number of factor variables in this problem.
Let us now build a classification tree to predict "over50k". We will use the training set to build the model, and all of the other variables as independent
variables. Let's use the default parameters, so we don't set a value for minbucket or cp. We also need to specify method="class" as an argument to rpart,
since this is a classification problem.
library("rpart")
library(rpart.plot)
CARTmodel = rpart(over50k ~ ., data=Train,method="class")
prp(CARTmodel)
As we print the tree, we can see that there are four splits. The first split uses the variable relationship. The second splits are on capitalgains and education.
Let's find out the out of sample accuracy on the test data set.
cartPredict = predict(CARTmodel,newdata=Test,type="class")
table(Test $ over50k ,cartPredict)
(9243+1596)/(9243+470+1482+1596) = 0.8473927
This highlights a very regular phenomenon when comparing CART and logistic regression. CART often performs a little worse than logistic regression in
out-of-sample accuracy. However, as is the case here, the CART model is often much simpler to describe and understand.
Let us now consider the ROC curve and AUC for the CART model on the test set. We will need to get predicted probabilities for the observations in the test set to
build the ROC curve and compute the AUC. We need to remember that we can do this by removing the type="class" argument when making predictions, and taking the
second column of the resulting object.
Let's Plot the ROC curve for the CART model we have estimated.
predictTest = predict(CARTmodel,newdata=Test)
pred1 = prediction(predictTest[,2],Test $ over50k )
pred1auc = as.numeric(performance(pred1,"auc")@y.values)
perf=performance(pred1,"tpr","fpr")
plot(perf)
We observe that compared to the logistic regression ROC curve, the CART ROC curve is less smooth than the logistic regression ROC curve.
The probabilities from the CART model take only a handful of values (five, one for each end bucket/leaf of the tree);
the changes in the ROC curve correspond to setting the threshold to one of those values. The breakpoints of the curve correspond to the false and true positive
rates when the threshold is set to the five possible probability values.
The area under the curve for the CART model is 0.8470256 .
A Random Forest Model
Before building a random forest model, we'll down-sample our training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory when trying to train the model since random forests is much more computationally intensive than CART or Logistic Regression. For this reason, before continuing we will define a new training set to be used when building our random forest model, that contains 2000 randomly selected obervations from the original training set. Do this by running the following commands in your R console (assuming your training set is called "train"):
set.seed(1)
trainSmall = train[sample(nrow(train), 2000), ]
Let us now build a random forest model to predict "over50k", using the dataset "trainSmall" as the data used to build the model. Let's set the seed to 1 again
right before building the model, and use all of the other variables in the dataset as independent variables. (If you get an error that random forest "can not
handle categorical predictors with more than 32 categories", re-build the model without the nativecountry variable as one of the independent variables.)
set.seed(1)
trainSmall = Train[sample(nrow(Train), 2000), ]
library("randomForest")
set.seed(1)
censusForest = randomForest(over50k ~ . , data= trainSmall)
predictForest = predict(censusForest,newdata = Test)
table(Test $ over50k ,predictForest)
The accuracy of the model should be around
(9614+1050)/nrow(test) = 0.8337112
Also, note that your accuracy might be different from the one reported here, since random forest models can still differ depending on your operating system,
even when the random seed is set.
As we know, random forest models work by building a large collection of trees. As a result, we lose some of the interpretability that comes with CART in terms
of seeing how predictions are made and which variables are important. However, we can still compute metrics that give us insight into which variables are
important.
One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model,
that a certain variable is selected for a split. To view this metric, let's run the following lines of R code .
vu = varUsed(censusForest, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted $ x , names(censusForest $ forest $ xlevels [vusorted $ ix ]))
This code produces a chart that for each variable measures the number of times that variable was selected for splitting (the value on the x-axis).
We can see that the varibale age is the most important in terms of the number of splits.
A different metric we can look at is related to "impurity", which measures how homogenous each bucket or leaf of the tree is.
In each tree in the forest, whenever we select a variable and perform a split, the impurity is decreased. Therefore, one way to measure the importance of a
variable is to average the reduction in impurity, taken over all the times that variable is selected for splitting in all of the trees in the forest.
To compute this metric, run the following command in R :
We can see that occupation gives a larger reduction in impurity than the other variables.
We can notice that the importance as measured by the average reduction in impurity is in general different from the importance as measured by the number of times the
variable is selected for splitting. Although age and occupation are important variables in both metrics, the order of the variables is not the same in the two plots.
Selecting cp by Cross-Validation
We now conclude our study of this data set by looking at how CART behaves with different choices of its parameters.
Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds. We can do this by using the train function.
Let's set the seed beforehand to 2. We are testing cp values from 0.002 to 0.1 in 0.002 increments, by using the following command:
cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))
Also, remember to use the entire training set "train" when building this model. The train function might take some time to run.
The final output should read
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.002.
In other words, the best value was cp = 0.002, corresponding to the lowest cp value. If we look more closely at the accuracy at different cp values,
we can see that it seems to be decreasing steadily as the cp value increases. Often, the cp value needs to become quite low before the accuracy begins
to deteriorate.
library(caret)
set.seed(2)
fitControl = trainControl( method = "cv", number = 10 )
cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))
train( over50k ~ . , data = Train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid )
We can create a CART model with the following command:
model = rpart(over50k~., data=train, method="class", cp=0.002)
We can make predictions on the test set using the following command (where "model" is the name of your CART model):
predictTest = predict(model, newdata=test, type="class")
Then we can generate the confusion matrix with the command
table(test $ over50k , predictTest)
The accuracy is (9178+1838)/(9178+535+1240+1838) = 0.8612306 .
If we plot the tree with prp(model), where "model" is the name of your CART tree, we should see that there are 18 splits!
Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the
old one -- or should we?
This highlights one important tradeoff in building predictive models. By tuning cp, we improved our accuracy by over 1%, but our tree became significantly
more complicated. In some applications, such an improvement in accuracy would be worth the loss in interpretability. In others, we may prefer a less accurate
model that is simpler to understand and describe over a more accurate -- but more complicated -- model.