
In this report, the prostate cancer dataset will be used for analysis. The data will be analyzed with tree and regression methods to predict both **lpsa** and **lcavol**. In total, there will be four separate analyses to compare. The prostate cancer dataset comes from a study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures in 97 men who were about to receive a radical prostatectomy.

## Install and Load Packages

There will be multiple libraries we need to install and load in this R tutorial before we load the data.

**Input:**

```r
install.packages('C50')
install.packages('gmodels')
install.packages('party')
install.packages('RColorBrewer')
install.packages('RWeka')
install.packages('psych')
install.packages('rpart')
install.packages('rpart.plot')

library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
library(RWeka)
library(psych)
library(rpart)
library(rpart.plot)
```

Since we will be using the prostate data set, you will need to download it. This dataset is already packaged and available for an easy download from the dataset page, or directly from here: Prostate – prostate.csv

## Load and View the Prostate Data

Below, we will load the prostate data file, view the contents of the data with the **summary()** function, and build histograms with the **hist()** function.

### summary() function

**Input:**

```r
prostate <- read.csv("prostate.csv", stringsAsFactors = TRUE)
summary(prostate)
```

**Output:**

```
     lcavol            age             lbph              lcp            gleason     
 Min.   :-1.3471   Min.   :41.00   Min.   :-1.3863   Min.   :-1.3863   Min.   :6.000
 1st Qu.: 0.5128   1st Qu.:60.00   1st Qu.:-1.3863   1st Qu.:-1.3863   1st Qu.:6.000
 Median : 1.4469   Median :65.00   Median : 0.3001   Median :-0.7985   Median :7.000
 Mean   : 1.3500   Mean   :63.87   Mean   : 0.1004   Mean   :-0.1794   Mean   :6.753
 3rd Qu.: 2.1270   3rd Qu.:68.00   3rd Qu.: 1.5581   3rd Qu.: 1.1787   3rd Qu.:7.000
 Max.   : 3.8210   Max.   :79.00   Max.   : 2.3263   Max.   : 2.9042   Max.   :9.000
      lpsa        
 Min.   :-0.4308  
 1st Qu.: 1.7317  
 Median : 2.5915  
 Mean   : 2.4784  
 3rd Qu.: 3.0564  
 Max.   : 5.5829  
```

### hist() function

**Input:**

```r
hist(prostate$lpsa)
```

**Output:**

**Input:**

```r
hist(prostate$lcavol)
```

**Output:**

## Rplot

Here we fit a linear model for each outcome variable against the other predictors within the prostate dataset. We will be comparing two outcome variables: **lcavol** and **lpsa**.

### LCAVOL Rplot

**Input:**

```r
lcavol_modelR <- lm(lcavol ~ age + lbph + lcp + lpsa + gleason, data = prostate)
lcavol_modelR
```

**Output:**

```
Call:
lm(formula = lcavol ~ age + lbph + lcp + lpsa + gleason, data = prostate)

Coefficients:
(Intercept)          age         lbph          lcp         lpsa      gleason  
   -1.49371      0.01902     -0.08918      0.29727      0.53955      0.05240  
```

### LPSA Rplot

**Input:**

```r
lpsa_modelR <- lm(lpsa ~ lcavol + age + lbph + lcp + gleason, data = prostate)
lpsa_modelR
```

**Output:**

```
Call:
lm(formula = lpsa ~ lcavol + age + lbph + lcp + gleason, data = prostate)

Coefficients:
(Intercept)       lcavol          age         lbph          lcp      gleason  
    1.83165      0.65523     -0.01027      0.14506      0.07173      0.06168  
```

## Train and Test Data

### rpart Formula Implementation

We can specify **lcavol** (and later **lpsa**) as the outcome variable on the left of the `~` and list the remaining columns in the prostate_train data frame on the right to be used as predictors.

**Input:**

```r
lcavol_pred <- predict(lcavol_modelR, prostate)
lpsa_pred <- predict(lpsa_modelR, prostate)
prostate_train <- prostate[1:60, ]
prostate_test <- prostate[61:97, ]
```
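The 60/37 split above simply takes the first 60 rows for training and the rest for testing. The same pattern can be sketched on any data frame; here is a minimal illustration using the built-in `mtcars` data (32 rows) as a stand-in:

```r
# Illustrative ordered split on the built-in mtcars data (32 rows);
# the tutorial uses rows 1:60 / 61:97 of the prostate data.
train <- mtcars[1:22, ]   # roughly the first two-thirds for training
test  <- mtcars[23:32, ]  # remaining rows for testing
nrow(train)  # 22
nrow(test)   # 10
```

Note that an ordered split like this assumes the rows are not sorted by the outcome; when in doubt, drawing the training rows with `sample()` is safer.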

### LCAVOL Train Data

**Input:**

```r
lcavol.rpart <- rpart(lcavol ~ age + lbph + lcp + gleason, data = prostate_train)
lcavol.rpart
```

**Output:**

```
n= 60 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 60 61.667440  0.81724440  
   2) lcp< 0.2101769 50 41.521730  0.57621020  
     4) gleason< 6.5 28 22.037790  0.27570190  
       8) age< 63.5 17 12.795910 -0.08539058 *
       9) age>=63.5 11  3.599646  0.83375390 *
     5) gleason>=6.5 22 13.737240  0.95867530  
      10) age>=63.5 15  7.008366  0.73164000 *
      11) age< 63.5  7  4.298893  1.44518000 *
   3) lcp>=0.2101769 10  2.716462  2.02241500 *
```

### LPSA Train Data

**Input:**

```r
lpsa.rpart <- rpart(lpsa ~ age + lbph + lcp + gleason, data = prostate_train)
lpsa.rpart
```

**Output:**

```
n= 60 

node), split, n, deviance, yval
      * denotes terminal node

1) root 60 38.815370 1.809756  
  2) gleason< 6.5 30 20.284900 1.481545  
    4) lbph< 0.9410666 19 12.333000 1.192772 *
    5) lbph>=0.9410666 11  3.630779 1.980336 *
  3) gleason>=6.5 30 12.067120 2.137967  
    6) lbph< 0.3306992 14  7.923121 1.927040 *
    7) lbph>=0.3306992 16  2.976120 2.322529 *
```

## Decision Trees

As one can see below, these decision trees are a big improvement over the earlier chapter. They give greater detail on the outcome and predictors for the prostate data. I prefer the rpart.plot version, with its color-coded decision trees showing percentages and node outcomes. Let’s visualize the data for lcavol and lpsa.

### LCAVOL Decision Tree

**Input:**

```r
rpart.plot(lcavol.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
```

**Output:**

### LPSA Decision Tree

**Input:**

```r
rpart.plot(lpsa.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
```

**Output:**

## Model Performance Evaluation

### LCAVOL

**Input:**

```r
p.lcavol <- predict(lcavol.rpart, prostate_test)
summary(p.lcavol)
```

**Output:**

```
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08539  1.44518  2.02242  1.66620  2.02242  2.02242 
```

### LPSA

**Input:**

```r
p.lpsa <- predict(lpsa.rpart, prostate_test)
summary(p.lpsa)
```

**Output:**

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.193   1.927   1.980   2.063   2.323   2.323 
```

## Correlation and MAE

First, measure the correlation between the predicted and actual values of lcavol and lpsa; then compute the mean absolute error (MAE).

### LCAVOL

**Input:**

```r
cor(p.lcavol, prostate_test$lcavol)
```

**Output:**

```
[1] 0.3837933
```

### LPSA

**Input:**

```r
cor(p.lpsa, prostate_test$lpsa)
```

**Output:**

```
[1] 0.05227984
```

### MAE function for actual and predicted values

```r
# Mean absolute error: average size of the prediction errors
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
```
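To see what this function computes, here is a minimal sketch on toy vectors (the values are made up for illustration):

```r
# MAE as defined above, demonstrated on toy vectors
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Absolute errors are 0.5, 0, and 1, so the mean is 0.5
MAE(c(1, 2, 3), c(1.5, 2, 2))  # 0.5
```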

#### LCAVOL MAE

**Input:**

```r
MAE(p.lcavol, prostate_test$lcavol)
mean(prostate_test$lcavol)
MAE(2.21, prostate_test$lcavol)
```

**Output:**

```
[1] 0.7968151
[1] 2.213953
[1] 0.6830092
```

#### LPSA MAE

**Input:**

```r
MAE(p.lpsa, prostate_test$lpsa)
mean(prostate_test$lpsa)
MAE(3.56, prostate_test$lpsa)
```

**Output:**

```
[1] 1.499867
[1] 3.562653
[1] 0.5247962
```

## Regression

Correlation matrix for lcavol, age, lbph, lcp, gleason, and lpsa.

**Input:**

```r
cor(prostate[c("lcavol", "age", "lbph", "lcp", "gleason", "lpsa")])
```

**Output:**

```
           lcavol       age         lbph          lcp    gleason      lpsa
lcavol  1.0000000 0.2249999  0.027349703  0.675310484 0.43241706 0.7344603
age     0.2249999 1.0000000  0.350185896  0.127667752 0.26889160 0.1695928
lbph    0.0273497 0.3501859  1.000000000 -0.006999431 0.07782045 0.1798094
lcp     0.6753105 0.1276678 -0.006999431  1.000000000 0.51483006 0.5488132
gleason 0.4324171 0.2688916  0.077820447  0.514830063 1.00000000 0.3689868
lpsa    0.7344603 0.1695928  0.179809410  0.548813169 0.36898680 1.0000000
```

### scatterplot matrix using pairs()

**Input:**

```r
pairs(prostate[c("lcavol", "age", "lbph", "lcp", "gleason", "lpsa")])
```

**Output:**

### Correlation Ellipse

The oval-shaped object on each scatterplot is called a **correlation ellipse**. It provides a visual indication of correlation strength. The dot at the center of the ellipse marks the point at the mean values of the x- and y-axis variables. The correlation between the two variables is indicated by the shape of the ellipse: the more stretched it is, the stronger the correlation. A nearly perfect round oval indicates a very weak correlation.

**Input:**

```r
pairs.panels(prostate[c("lcavol", "age", "lbph", "lcp", "gleason", "lpsa")])
```

**Output:**

## Implementing lm() function

The **lm()** function will return a regression model object that can be used to make predictions. Interactions between independent variables can be specified using the * operator.
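As a quick illustration of the `*` operator, here is a minimal sketch with synthetic data (the variables `x1`, `x2`, and `y` are made up for this example). The formula `y ~ x1 * x2` expands to both main effects plus their interaction:

```r
# y ~ x1 * x2 expands to x1 + x2 + x1:x2 (main effects plus interaction)
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1 * d$x2 + rnorm(50, sd = 0.1)

m <- lm(y ~ x1 * x2, data = d)
names(coef(m))  # "(Intercept)" "x1" "x2" "x1:x2"
```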

### Dependent lcavol against age, lbph, lcp, and gleason

```r
> pro_model1 <- lm(lcavol ~ age + lbph + lcp + gleason, data = prostate)
> pro_model1
```

```
Call:
lm(formula = lcavol ~ age + lbph + lcp + gleason, data = prostate)

Coefficients:
(Intercept)          age         lbph          lcp      gleason  
   -0.78185      0.02085     -0.01689      0.51970      0.13253  
```

### Dependent lpsa against age, lbph, lcp, and gleason

```r
> pro_model2 <- lm(lpsa ~ age + lbph + lcp + gleason, data = prostate)
> pro_model2
```

```
Call:
lm(formula = lpsa ~ age + lbph + lcp + gleason, data = prostate)

Coefficients:
(Intercept)          age         lbph          lcp      gleason  
   1.319355     0.003393     0.133996     0.412253     0.148513  
```

## Improving Model Performance

The breakdown below of **lm()** with **summary()** shows residuals and coefficients. The residuals section provides a statistical summary of the prediction errors, and some of these errors are quite substantial. Fifty percent of the errors fall between the 1Q and 3Q values. For the dependent variable lcavol, this means that half of the predictions were between 0.437 over the true value and 0.501 under the true value, while for lpsa they were between 0.565 over and 0.575 under the true value.

For each estimated regression coefficient, the p-value provides an estimate of the probability that the true coefficient is zero given the value of the estimate. In the models below, lcp is highly significant for both outcomes, and it seems to be related to them in a logical way.

The coefficient of determination, also known as the multiple R-squared value, provides a measure of how well our model as a whole explains the values of the dependent variable: the closer the value is to 1.0, the better the model explains the data. As one can see in the output below, the adjusted R-squared is 0.458 for lcavol and 0.314 for lpsa. This tells us that roughly 46 percent (lcavol) and 31 percent (lpsa) of the variation in the dependent variable is explained by the model.

```r
> summary(pro_model1)
```

```
Call:
lm(formula = lcavol ~ age + lbph + lcp + gleason, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.93235 -0.43727  0.00085  0.50103  2.51215 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.78185    1.15927  -0.674    0.502    
age          0.02085    0.01315   1.586    0.116    
lbph        -0.01689    0.06528  -0.259    0.796    
lcp          0.51970    0.07400   7.023 3.66e-10 ***
gleason      0.13253    0.14733   0.899    0.371    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8677 on 92 degrees of freedom
Multiple R-squared:  0.4805,	Adjusted R-squared:  0.458 
F-statistic: 21.28 on 4 and 92 DF,  p-value: 1.901e-12
```

```r
> summary(pro_model2)
```

```
Call:
lm(formula = lpsa ~ age + lbph + lcp + gleason, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.05359 -0.56531 -0.03972  0.57489  2.34593 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.319355   1.277511   1.033   0.3044    
age         0.003393   0.014489   0.234   0.8154    
lbph        0.133996   0.071935   1.863   0.0657 .  
lcp         0.412253   0.081544   5.056 2.17e-06 ***
gleason     0.148513   0.162362   0.915   0.3627    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9563 on 92 degrees of freedom
Multiple R-squared:  0.3423,	Adjusted R-squared:  0.3137 
F-statistic: 11.97 on 4 and 92 DF,  p-value: 7.115e-08
```
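The numbers printed above, such as the adjusted R-squared and the per-coefficient p-values, can also be extracted programmatically from the object returned by `summary()`. A minimal sketch on synthetic data (the variables `x` and `y` are made up for this example):

```r
# The numbers printed by summary(lm) can be pulled out programmatically
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 3 * d$x + rnorm(100)

s <- summary(lm(y ~ x, data = d))
s$adj.r.squared                  # adjusted R-squared
s$coefficients["x", "Pr(>|t|)"]  # p-value for the x coefficient
```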

How can we improve model performance?

The key difference between regression modeling and other machine learning approaches is that regression typically leaves feature selection and model specification to the user.

### Adding non-linear relationships

The below will add age and age2 to the **lm()** formula using the lcavol ~ age + age2 form (and likewise for lpsa). This allows the model to separate the linear and non-linear impact of age on **lcavol** and **lpsa**.
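To see why the squared term helps, here is a minimal sketch on synthetic data (the data-generating formula is made up for illustration). A quadratic relationship in age is invisible to a purely linear term, but `lm()` can capture it once `age2` is added:

```r
# Adding a squared term lets lm() fit a curve in age
set.seed(7)
d <- data.frame(age = runif(80, 40, 80))
d$y <- 0.002 * (d$age - 60)^2 + rnorm(80, sd = 0.1)  # U-shaped in age
d$age2 <- d$age^2

m <- lm(y ~ age + age2, data = d)
coef(m)[["age2"]]  # positive coefficient: upward curvature recovered
```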

### Converting a numeric variable to a binary indicator

The **ifelse()** function will be used to convert a numeric variable to a binary indicator.
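A minimal sketch of the idea, with a made-up vector of Gleason scores:

```r
# ifelse() turns a numeric score into a 0/1 indicator:
# 1 where the score equals 6, 0 elsewhere
gleason <- c(6, 7, 6, 9, 7)
gle_ind <- ifelse(gleason == 6, 1, 0)
gle_ind  # 1 0 1 0 0
```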

### Putting it all together

From the above model improvements, we can try to improve the regression formula by:

1. Adding a non-linear term for age
2. Adding a binary indicator for a Gleason score of 6

```r
> prostate$age2 <- prostate$age^2
> prostate$gle13 <- ifelse(prostate$gleason == 6, 1, 0)
```

#### LCAVOL

```r
> pro_model1l <- lm(lcavol ~ age + age2 + lbph + lcp + gleason + gle13, data = prostate)
> summary(pro_model1l)
```

```
Call:
lm(formula = lcavol ~ age + age2 + lbph + lcp + gleason + gle13, 
    data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.81382 -0.47487 -0.05274  0.42309  2.62414 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.6376607  4.2453612   0.386   0.7006    
age          0.0015325  0.1273732   0.012   0.9904    
age2         0.0001434  0.0010434   0.137   0.8910    
lbph        -0.0258883  0.0652195  -0.397   0.6924    
lcp          0.4869437  0.0761986   6.390 7.15e-09 ***
gleason     -0.1037065  0.2032584  -0.510   0.6111    
gle13       -0.5211846  0.3097377  -1.683   0.0959 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8639 on 90 degrees of freedom
Multiple R-squared:  0.4964,	Adjusted R-squared:  0.4628 
F-statistic: 14.78 on 6 and 90 DF,  p-value: 1.094e-11
```

#### LPSA

```r
> pro_model2l <- lm(lpsa ~ age + age2 + lbph + lcp + gleason + gle13, data = prostate)
> summary(pro_model2l)
```

```
Call:
lm(formula = lpsa ~ age + age2 + lbph + lcp + gleason + gle13, 
    data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41908 -0.61200  0.06326  0.53455  2.52596 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.122462   4.574862   1.994   0.0492 *  
age         -0.157913   0.137259  -1.150   0.2530    
age2         0.001304   0.001124   1.160   0.2491    
lbph         0.118201   0.070281   1.682   0.0961 .  
lcp          0.361107   0.082113   4.398 2.99e-05 ***
gleason     -0.236838   0.219034  -1.081   0.2825    
gle13       -0.825832   0.333778  -2.474   0.0152 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9309 on 90 degrees of freedom
Multiple R-squared:  0.3903,	Adjusted R-squared:  0.3497 
F-statistic: 9.602 on 6 and 90 DF,  p-value: 3.77e-08
```

I believe for the prostate dataset, the decision tree was the easiest to interpret when compared to regression. As one can see from the above, adding a non-linear term for age produced only a modest improvement in R-squared: LCAVOL improved from 0.4805 to 0.4964 and LPSA from 0.3423 to 0.3903.

The decision tree really visualizes the biggest factors in the prostate data. As one can see, the lcavol dependent variable's biggest factor is **lcp**, then **gleason**, while lpsa's biggest factor is **gleason**, then **lbph**. Overall, linear regression was the better method for prediction on the prostate dataset.