READING TEST SCORES

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data we are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.

Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

  • grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)

  • male: Whether the student is male (1/0)

  • raceeth: The race/ethnicity composite of the student

  • preschool: Whether the student attended preschool (1/0)

  • expectBachelors: Whether the student expects to obtain a bachelor's degree (1/0)

  • motherHS: Whether the student's mother completed high school (1/0)

  • motherBachelors: Whether the student's mother obtained a bachelor's degree (1/0)

  • motherWork: Whether the student's mother has part-time or full-time work (1/0)

  • fatherHS: Whether the student's father completed high school (1/0)

  • fatherBachelors: Whether the student's father obtained a bachelor's degree (1/0)

  • fatherWork: Whether the student's father has part-time or full-time work (1/0)

  • selfBornUS: Whether the student was born in the United States of America (1/0)

  • motherBornUS: Whether the student's mother was born in the United States of America (1/0)

  • fatherBornUS: Whether the student's father was born in the United States of America (1/0)

  • englishAtHome: Whether the student speaks English at home (1/0)

  • computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)

  • read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)

  • minutesPerWeekEnglish: The number of minutes per week the student spends in English class

  • studentsInEnglish: The number of students in this student's English class at school

  • schoolHasLibrary: Whether this student's school has a library (1/0)

  • publicSchool: Whether this student attends a public school (1/0)

  • urban: Whether this student's school is in an urban area (1/0)

  • schoolSize: The number of students in this student's school

  • readingScore: The student's reading score, on a 1000-point scale

The datasets can be loaded with:

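# stringsAsFactors = TRUE keeps raceeth as a factor (the default became FALSE in R 4.0.0)
pisaTrain = read.csv("pisa2009train.csv", stringsAsFactors = TRUE)
pisaTest = read.csv("pisa2009test.csv", stringsAsFactors = TRUE)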

We can then find the number of rows in the training set with str(pisaTrain) or nrow(pisaTrain); there are 3663.

The average reading score is 483.5325 for males and 512.9406 for females, computed with the following command:

tapply(pisaTrain$readingScore, pisaTrain$male, mean)
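To see what tapply is doing here, the following is a minimal sketch on a made-up data frame (the five scores below are hypothetical, not PISA data). It splits the scores by the value of male and takes the mean of each group.

```r
# Toy data, NOT the PISA data: hypothetical scores for five students.
toy <- data.frame(
  male         = c(1, 0, 1, 0, 0),
  readingScore = c(480, 520, 500, 510, 530)
)

# tapply(values, groups, fun) splits `values` by `groups` and applies `fun`.
groupMeans <- tapply(toy$readingScore, toy$male, mean)

# The result is named by group: "0" is the female mean, "1" the male mean.
groupMeans["1"]  # (480 + 500) / 2 = 490
groupMeans["0"]  # (520 + 510 + 530) / 3 = 520
```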

We can read which variables have missing values from summary(pisaTrain). Because most variables are collected from study participants via survey, it is expected that most questions will have at least one missing value.
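As a complement to reading summary() by eye, missing values can also be counted per column. This is a small sketch on a hypothetical data frame; the column names are borrowed from the list above, but the values are invented.

```r
# Toy data with some missing entries (values are made up).
toy <- data.frame(
  grade      = c(10, NA, 9, 10),
  motherHS   = c(1, 1, NA, NA),
  schoolSize = c(800, 1200, 950, 400)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
naCounts <- colSums(is.na(toy))
naCounts                                  # grade: 1, motherHS: 2, schoolSize: 0

# Names of the columns that have at least one missing value.
colsWithNA <- names(naCounts)[naCounts > 0]
colsWithNA                                # "grade" "motherHS"
```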

Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.

We type the following commands into the R console to remove observations with any missing value from pisaTrain and pisaTest:

pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)

After running the provided commands we can use str(pisaTrain) and str(pisaTest), or nrow(pisaTrain) and nrow(pisaTest), to check the new number of rows in the datasets. Now we have 2414 observations in the training set and 990 in the testing set.
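The effect of na.omit can be checked on a small example. The sketch below uses an invented four-row data frame; only the rows with no missing value in any column survive.

```r
# Toy data (made up): rows 2 and 3 each have one missing value.
toy <- data.frame(
  grade        = c(10, NA, 9, 10),
  readingScore = c(500, 480, NA, 530)
)

# na.omit() drops every row that contains at least one NA.
toyClean <- na.omit(toy)

nrow(toy)       # 4
nrow(toyClean)  # 2 (rows 1 and 4 remain)
```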

Factor Variables

Factor variables are variables that take on a discrete set of values, like a "Region" variable. Such a variable is an unordered factor because there isn't any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications "large," "medium," and "small").

The variable male has only 2 levels (1 and 0), so it is already a binary variable. There is no natural ordering between the different values of raceeth, so it is an unordered factor. Meanwhile, we can order grades (8, 9, 10, 11, 12), so grade is an ordered factor.

To include unordered factors in a linear regression model, we define one level as the "reference level" and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.

As an example, consider the unordered factor variable "color", with levels "red", "green", and "blue". If "green" were the reference level, then we would add binary variables "colorred" and "colorblue" to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.
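The color example can be reproduced with model.matrix, which builds the binary variables R would use in a regression. This is only an illustrative sketch; the four observations are invented, and "green" is made the reference level by listing it first.

```r
# Hypothetical observations; "green" is listed first so it is the reference level.
color <- factor(c("red", "green", "blue", "green"),
                levels = c("green", "red", "blue"))

# model.matrix() expands the factor into an intercept plus n-1 binary columns.
dummies <- model.matrix(~ color)

colnames(dummies)  # "(Intercept)" "colorred" "colorblue"
dummies[, c("colorred", "colorblue")]
# red   -> colorred = 1, colorblue = 0
# green -> colorred = 0, colorblue = 0
# blue  -> colorred = 0, colorblue = 1
```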

Now, consider the variable "raceeth" in our problem, which has levels "American Indian/Alaska Native", "Asian", "Black", "Hispanic", "More than one race", "Native Hawaiian/Other Pacific Islander", and "White". Because it is the most common in our population, we will select White as the reference level.

Consider adding our unordered factor raceeth to the regression model with reference level "White". For a student who is Asian, raceethAsian will be set to 1, and raceethAmerican Indian/Alaska Native, raceethBlack, raceethHispanic, raceethMore than one race, and raceethNative Hawaiian/Other Pacific Islander will be set to 0.

Because "White" is the reference level, a white student will have all raceeth binary variables set to 0.

Building a Model

Because the raceeth variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv(); we can see this by running str(pisaTrain) or str(pisaTest). (In R 4.0.0 and later, read.csv() only does this when called with stringsAsFactors = TRUE.) However, by default R selects the first level alphabetically ("American Indian/Alaska Native") as the reference level of our factor instead of the most common level ("White"). We can set the reference level of the factor by typing the following two lines in the R console:

pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")
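What relevel changes can be seen on a small factor. The sketch below uses an invented four-element vector with only three of the raceeth levels; it is not the PISA data.

```r
# Hypothetical factor with three of the raceeth levels.
raceeth <- factor(c("Asian", "White", "Black", "White"))

levelsBefore <- levels(raceeth)
levelsBefore  # alphabetical: "Asian" "Black" "White"

# relevel() moves the chosen level to the front, making it the reference level.
raceeth <- relevel(raceeth, "White")

levelsAfter <- levels(raceeth)
levelsAfter   # "White" "Asian" "Black"
```

Only the order of the levels changes; the values stored for each student stay the same.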

Now, we build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

It would be time-consuming to type all the variables, but R provides the shorthand notation "readingScore ~ ." to mean "predict readingScore using all the other variables in the data frame." The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called "Y", your independent variables are called "X1", "X2", and "X3", and your training data set is called "Train", instead of the regular notation:

LinReg = lm(Y ~ X1 + X2 + X3, data = Train)

we could use the shorthand LinReg = lm(Y ~ ., data = Train). For our problem, we use the following command to build the model:

lmScore = lm(readingScore~., data=pisaTrain)
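To check that the period shorthand really is equivalent to listing every variable, here is a small sketch on simulated data. The data frame Train and its columns are randomly generated for illustration only.

```r
# Simulated data (not PISA): three predictors and a response.
set.seed(1)
Train <- data.frame(X1 = rnorm(20), X2 = rnorm(20), X3 = rnorm(20))
Train$Y <- 1 + 2 * Train$X1 - Train$X2 + rnorm(20)

# Explicit formula versus the "." shorthand.
explicit  <- lm(Y ~ X1 + X2 + X3, data = Train)
shorthand <- lm(Y ~ ., data = Train)

# Both models have the same coefficients.
sameCoefs <- isTRUE(all.equal(coef(explicit), coef(shorthand)))
sameCoefs
```

Note that "." picks up every other column in the data frame, so any column that should not be a predictor has to be dropped before fitting.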

We can then read the training set R^2 from the "Multiple R-squared" value of summary(lmScore) which is 0.3251.

The training-set RMSE can be computed by first computing the SSE:

SSE = sum(lmScore$residuals^2)

and then dividing by the number of observations and taking the square root:

RMSE = sqrt(SSE / nrow(pisaTrain))

An alternative way of getting this answer would be with the following command:

sqrt(mean(lmScore$residuals^2))

The training set RMSE is 73.36555.
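The two ways of computing the RMSE agree because the mean of the squared residuals is the SSE divided by the number of residuals. A quick sketch with four made-up residuals (not taken from lmScore):

```r
# Hypothetical residuals, chosen so the arithmetic is easy to follow.
residualsToy <- c(3, -4, 0, 5)

# Route 1: SSE first, then divide by the count and take the square root.
SSE   <- sum(residualsToy^2)                 # 9 + 16 + 0 + 25 = 50
RMSE1 <- sqrt(SSE / length(residualsToy))    # sqrt(50 / 4)

# Route 2: square root of the mean squared residual.
RMSE2 <- sqrt(mean(residualsToy^2))

all.equal(RMSE1, RMSE2)  # TRUE
```

In the walkthrough, dividing by nrow(pisaTrain) works because rows with missing values were already removed, so the model has one residual per row.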

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. The coefficient 29.54 on grade is the difference in predicted reading score between two students who are identical other than having a difference in grade of 1. Because A and B have a difference in grade of 2, the model predicts that student A has a reading score that is 2*29.54 = 59.08 points larger.
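The arithmetic behind this comparison can be written out directly. The sketch below uses the grade coefficient quoted above; the variable names are just for illustration.

```r
# Coefficient on grade, as reported in the text.
gradeCoef <- 29.54

# Students A and B differ only in grade.
gradeA <- 11
gradeB <- 9

# In a linear model, all other terms cancel when the two predictions are subtracted.
predictedGap <- gradeCoef * (gradeA - gradeB)
predictedGap  # 59.08
```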

Also, the only difference between an Asian student and a white student with otherwise identical variables is that the former has raceethAsian=1 and the latter has raceethAsian=0. The predicted reading score for these two students will differ by the coefficient on the variable raceethAsian.

From summary(lmScore), we can see which variables were significant at the 0.05 level. Because several of the binary variables generated from the race factor variable are significant, we should not remove this variable.

Using the "predict" function and supplying the "newdata" argument, we use the lmScore model to predict the reading scores of students in pisaTest. We call this vector of predictions "predTest".

predTest = predict(lmScore, newdata=pisaTest)

From summary(predTest), we see that the maximum predicted reading score is 637.7, and the minimum predicted score is 353.2. Therefore, the range is 284.5.
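The range can also be computed rather than read off the summary. This sketch plugs in the rounded maximum and minimum quoted above, then shows the same idea on a short made-up vector of predictions.

```r
# Rounded values reported by summary(predTest) in the text.
predMax <- 637.7
predMin <- 353.2
predRange <- predMax - predMin
predRange  # 284.5

# On a vector of predictions, range() returns c(min, max) and diff() subtracts them.
toyPred  <- c(410, 505, 620, 380)   # hypothetical predictions
toyRange <- diff(range(toyPred))    # 620 - 380 = 240
```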

The sum of squared errors (SSE) of lmScore on the testing set is 5762082 and the root-mean squared error (RMSE) of lmScore on the testing set is 76.29079.

sum((predTest-pisaTest$readingScore)^2)
sqrt(mean((predTest-pisaTest$readingScore)^2))
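The same predict-then-score pattern can be tried end to end on a tiny example. Everything below is invented for illustration: a four-row training set where y is exactly 2*x, and a two-row test set.

```r
# Toy training and test sets (made up); in train, y = 2 * x exactly.
train <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
test  <- data.frame(x = c(5, 6),       y = c(10, 13))

# Fit on the training set, predict on the test set.
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)    # approximately 10 and 12

# Test-set SSE and RMSE, using the same expressions as above.
testSSE  <- sum((pred - test$y)^2)          # about 0^2 + (-1)^2 = 1
testRMSE <- sqrt(mean((pred - test$y)^2))   # about sqrt(1 / 2)
```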

The predicted test score used in the baseline model is 517.9629 and can be computed as:

baseline = mean(pisaTrain$readingScore)

The sum of squared errors of the baseline model on the testing set, also called the total sum of squares (SST), is 7802354. It can be computed as:

sum((baseline-pisaTest$readingScore)^2)

The test-set R-squared value of lmScore is 0.2614944.

The test-set R^2 is defined as 1-SSE/SST, where SSE is the sum of squared errors of the model on the test set and SST is the sum of squared errors of the baseline model. For this model, the R^2 is then computed to be 1-5762082/7802354.
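As a quick check, the reported figures can be plugged back into these definitions. The sketch below uses only the SSE, SST, and test-set size quoted in this post; since the SSE and SST are rounded, the results match only to the printed precision.

```r
# Values reported above for the test set.
SSE   <- 5762082   # sum of squared errors of lmScore
SST   <- 7802354   # sum of squared errors of the baseline model
nTest <- 990       # rows in pisaTest after na.omit

# Test-set R^2 = 1 - SSE/SST.
R2 <- 1 - SSE / SST
round(R2, 4)    # 0.2615

# Test-set RMSE = sqrt(SSE / n).
RMSE <- sqrt(SSE / nTest)
round(RMSE, 2)  # 76.29
```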
