
In this report, the prostate cancer dataset will be used for analysis. The data will be analyzed with tree and regression methods to predict both **lpsa** and **lcavol**. In total, there will be four separate analyses to compare. The prostate cancer dataset comes from a study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures in 97 men who were about to receive a radical prostatectomy.

## Install and Load Packages

There will be multiple libraries we need to install and load in this R tutorial before we load the data.

**Input:**

```r
install.packages('C50')
install.packages('gmodels')
install.packages('party')
install.packages('RColorBrewer')
install.packages('RWeka')
install.packages('psych')
install.packages('rpart')
install.packages('rpart.plot')

library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
library(RWeka)
library(psych)
library(rpart)
library(rpart.plot)
```

Since we will be using the prostate data set, you will need to download it. This dataset is already packaged and available for an easy download from the dataset page, or directly from here: Prostate – prostate.csv

## Load and View the Prostate Data

Below, we will load the prostate data file, view the contents of the data with the **summary()** function, and build histograms with the **hist()** function.

### summary() function

**Input:**

```r
prostate <- read.csv("prostate.csv", stringsAsFactors = TRUE)
summary(prostate)
```

**Output:**

```
     lcavol            age             lbph              lcp            gleason     
 Min.   :-1.3471   Min.   :41.00   Min.   :-1.3863   Min.   :-1.3863   Min.   :6.000
 1st Qu.: 0.5128   1st Qu.:60.00   1st Qu.:-1.3863   1st Qu.:-1.3863   1st Qu.:6.000
 Median : 1.4469   Median :65.00   Median : 0.3001   Median :-0.7985   Median :7.000
 Mean   : 1.3500   Mean   :63.87   Mean   : 0.1004   Mean   :-0.1794   Mean   :6.753
 3rd Qu.: 2.1270   3rd Qu.:68.00   3rd Qu.: 1.5581   3rd Qu.: 1.1787   3rd Qu.:7.000
 Max.   : 3.8210   Max.   :79.00   Max.   : 2.3263   Max.   : 2.9042   Max.   :9.000
      lpsa        
 Min.   :-0.4308  
 1st Qu.: 1.7317  
 Median : 2.5915  
 Mean   : 2.4784  
 3rd Qu.: 3.0564  
 Max.   : 5.5829  
```

### hist() function

**Input:**

```r
hist(prostate$lpsa)
```

**Output:**

**Input:**

```r
hist(prostate$lcavol)
```

**Output:**

## Rplot

Here we fit a linear model for each outcome variable against the other predictors within the prostate dataset. We will be comparing two outcome variables: **lcavol** and **lpsa**.

### LCAVOL Rplot

**Input:**

```r
lcavol_modelR <- lm(lcavol ~ age + lbph + lcp + lpsa + gleason, data = prostate)
lcavol_modelR
```

**Output:**

```
Call:
lm(formula = lcavol ~ age + lbph + lcp + lpsa + gleason, data = prostate)

Coefficients:
(Intercept)          age         lbph          lcp         lpsa      gleason  
   -1.49371      0.01902     -0.08918      0.29727      0.53955      0.05240  
```

### LPSA Rplot

**Input:**

```r
lpsa_modelR <- lm(lpsa ~ lcavol + age + lbph + lcp + gleason, data = prostate)
lpsa_modelR
```

**Output:**

```
Call:
lm(formula = lpsa ~ lcavol + age + lbph + lcp + gleason, data = prostate)

Coefficients:
(Intercept)       lcavol          age         lbph          lcp      gleason  
    1.83165      0.65523     -0.01027      0.14506      0.07173      0.06168  
```

## Train and Test Data

### rpart Formula Implementation

We can specify **lcavol** (and later **lpsa**) as the outcome variable on the left of the `~` and list the remaining columns in the prostate_train data frame on the right to be used as predictors.

**Input:**

```r
lcavol_pred <- predict(lcavol_modelR, prostate)
lpsa_pred <- predict(lpsa_modelR, prostate)
prostate_train <- prostate[1:60, ]
prostate_test <- prostate[61:97, ]
```
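The 60/37 split above simply takes the first 60 rows for training and the rest for testing. The same pattern can be sketched on any data frame; here is a minimal illustration using the built-in `mtcars` data (32 rows) as a stand-in:

```r
# Illustrative ordered split on the built-in mtcars data (32 rows);
# the tutorial uses rows 1:60 / 61:97 of the prostate data.
train <- mtcars[1:22, ]   # roughly the first two-thirds for training
test  <- mtcars[23:32, ]  # remaining rows for testing
nrow(train)  # 22
nrow(test)   # 10
```

Note that an ordered split like this assumes the rows are not sorted by the outcome; when in doubt, drawing the training rows with `sample()` is safer.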

### LCAVOL Train Data

**Input:**

```r
lcavol.rpart <- rpart(lcavol ~ age + lbph + lcp + gleason, data = prostate_train)
lcavol.rpart
```

**Output:**

```
n= 60 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 60 61.667440  0.81724440  
   2) lcp< 0.2101769 50 41.521730  0.57621020  
     4) gleason< 6.5 28 22.037790  0.27570190  
       8) age< 63.5 17 12.795910 -0.08539058 *
       9) age>=63.5 11  3.599646  0.83375390 *
     5) gleason>=6.5 22 13.737240  0.95867530  
      10) age>=63.5 15  7.008366  0.73164000 *
      11) age< 63.5  7  4.298893  1.44518000 *
   3) lcp>=0.2101769 10  2.716462  2.02241500 *
```

### LPSA Train Data

**Input:**

```r
lpsa.rpart <- rpart(lpsa ~ age + lbph + lcp + gleason, data = prostate_train)
lpsa.rpart
```

**Output:**

```
n= 60 

node), split, n, deviance, yval
      * denotes terminal node

1) root 60 38.815370 1.809756  
  2) gleason< 6.5 30 20.284900 1.481545  
    4) lbph< 0.9410666 19 12.333000 1.192772 *
    5) lbph>=0.9410666 11  3.630779 1.980336 *
  3) gleason>=6.5 30 12.067120 2.137967  
    6) lbph< 0.3306992 14  7.923121 1.927040 *
    7) lbph>=0.3306992 16  2.976120 2.322529 *
```

## Decision Trees

As one can see below, these decision trees are a big improvement over the earlier chapter. They give greater detail on the outcome and predictors for the prostate data. I prefer the rpart.plot version, with its color-coded decision trees showing percentages and node outcomes. Let’s visualize the data for lcavol and lpsa.

### LCAVOL Decision Tree

**Input:**

```r
rpart.plot(lcavol.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
```

**Output:**

### LPSA Decision Tree

**Input:**

```r
rpart.plot(lpsa.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
```

**Output:**

## Model Performance Evaluation

### LCAVOL

**Input:**

```r
p.lcavol <- predict(lcavol.rpart, prostate_test)
summary(p.lcavol)
```

**Output:**

```
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08539  1.44518  2.02242  1.66620  2.02242  2.02242 
```

### LPSA

**Input:**

```r
p.lpsa <- predict(lpsa.rpart, prostate_test)
summary(p.lpsa)
```

**Output:**

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.193   1.927   1.980   2.063   2.323   2.323 
```

## Correlation and MAE

First, measure the correlation between the predicted and actual values of lcavol and lpsa; then compute the mean absolute error (MAE).

### LCAVOL

**Input:**

```r
cor(p.lcavol, prostate_test$lcavol)
```

**Output:**

```
[1] 0.3837933
```

### LPSA

**Input:**

```r
cor(p.lpsa, prostate_test$lpsa)
```

**Output:**

```
[1] 0.05227984
```

### MAE function for actual and predicted values

```r
# Mean absolute error: average size of the prediction errors
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
```
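To see what this function computes, here is a minimal sketch on toy vectors (the values are made up for illustration):

```r
# MAE as defined above, demonstrated on toy vectors
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Absolute errors are 0.5, 0, and 1, so the mean is 0.5
MAE(c(1, 2, 3), c(1.5, 2, 2))  # 0.5
```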

#### LCAVOL MAE

**Input:**

```r
MAE(p.lcavol, prostate_test$lcavol)
mean(prostate_test$lcavol)
MAE(2.21, prostate_test$lcavol)
```

**Output:**

```
[1] 0.7968151
[1] 2.213953
[1] 0.6830092
```

#### LPSA MAE

**Input:**

```r
MAE(p.lpsa, prostate_test$lpsa)
mean(prostate_test$lpsa)
MAE(3.56, prostate_test$lpsa)
```

**Output:**

```
[1] 1.499867
[1] 3.562653
[1] 0.5247962
```

## Regression

Correlation matrix for lcavol, age, lbph, lcp, gleason, and lpsa.

**Input:**

```r
cor(prostate[c("lcavol", "age", "lbph", "lcp", "gleason", "lpsa")])
```

**Output:**

```
           lcavol       age         lbph          lcp    gleason      lpsa
lcavol  1.0000000 0.2249999  0.027349703  0.675310484 0.43241706 0.7344603
age     0.2249999 1.0000000  0.350185896  0.127667752 0.26889160 0.1695928
lbph    0.0273497 0.3501859  1.000000000 -0.006999431 0.07782045 0.1798094
lcp     0.6753105 0.1276678 -0.006999431  1.000000000 0.51483006 0.5488132
gleason 0.4324171 0.2688916  0.077820447  0.514830063 1.00000000 0.3689868
lpsa    0.7344603 0.1695928  0.179809410  0.548813169 0.36898680 1.0000000
```

### scatterplot matrix using pairs()

**Input:**

```r
pairs(prostate[c("lcavol", "age", "lbph", "lcp", "gleason", "lpsa")])
```

**Output:**

### Correlation Ellipse

The oval-shaped object on each scatterplot is called a **correlation ellipse**. It provides a visual indication of correlation strength. The dot at the center of the ellipse marks the point at the mean values of the x- and y-axis variables. The correlation between the two variables is indicated by the shape of the ellipse: the more stretched it is, the stronger the correlation. A nearly perfect round oval indicates a very weak correlation.

**Input:**

```r
pairs.panels(prostate[c("lcavol", "age", "lbph", "lcp", "gleason", "lpsa")])
```

**Output:**

## Implementing lm() function

The **lm()** function will return a regression model object that can be used to make predictions. Interactions between independent variables can be specified using the * operator.
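As a quick illustration of the `*` operator, here is a minimal sketch with synthetic data (the variables `x1`, `x2`, and `y` are made up for this example). The formula `y ~ x1 * x2` expands to both main effects plus their interaction:

```r
# y ~ x1 * x2 expands to x1 + x2 + x1:x2 (main effects plus interaction)
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1 * d$x2 + rnorm(50, sd = 0.1)

m <- lm(y ~ x1 * x2, data = d)
names(coef(m))  # "(Intercept)" "x1" "x2" "x1:x2"
```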

### Dependent lcavol against age, lbph, lcp, and gleason

```r
> pro_model1 <- lm(lcavol ~ age + lbph + lcp + gleason, data = prostate)
> pro_model1
```

```
Call:
lm(formula = lcavol ~ age + lbph + lcp + gleason, data = prostate)

Coefficients:
(Intercept)          age         lbph          lcp      gleason  
   -0.78185      0.02085     -0.01689      0.51970      0.13253  
```

### Dependent lpsa against age, lbph, lcp, and gleason

```r
> pro_model2 <- lm(lpsa ~ age + lbph + lcp + gleason, data = prostate)
> pro_model2
```

```
Call:
lm(formula = lpsa ~ age + lbph + lcp + gleason, data = prostate)

Coefficients:
(Intercept)          age         lbph          lcp      gleason  
   1.319355     0.003393     0.133996     0.412253     0.148513  
```

## Improving Model Performance

The breakdown below of **lm()** with **summary()** shows residuals and coefficients. The residuals section provides a statistical summary of the prediction errors, and some of these errors are quite substantial. Fifty percent of the errors fall between the 1Q and 3Q values. For the dependent variable lcavol, this means that half of the predictions were between 0.437 over the true value and 0.501 under the true value, while for lpsa they were between 0.565 over and 0.575 under the true value.

For each estimated regression coefficient, the p-value provides an estimate of the probability that the true coefficient is zero given the value of the estimate. In the models below, lcp is highly significant for both outcomes, and it seems to be related to them in a logical way.

The coefficient of determination, also known as the multiple R-squared value, provides a measure of how well our model as a whole explains the values of the dependent variable: the closer the value is to 1.0, the better the model explains the data. As one can see in the output below, the adjusted R-squared is 0.458 for lcavol and 0.314 for lpsa. This tells us that roughly 46 percent (lcavol) and 31 percent (lpsa) of the variation in the dependent variable is explained by the model.

```r
> summary(pro_model1)
```

```
Call:
lm(formula = lcavol ~ age + lbph + lcp + gleason, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.93235 -0.43727  0.00085  0.50103  2.51215 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.78185    1.15927  -0.674    0.502    
age          0.02085    0.01315   1.586    0.116    
lbph        -0.01689    0.06528  -0.259    0.796    
lcp          0.51970    0.07400   7.023 3.66e-10 ***
gleason      0.13253    0.14733   0.899    0.371    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8677 on 92 degrees of freedom
Multiple R-squared:  0.4805,	Adjusted R-squared:  0.458 
F-statistic: 21.28 on 4 and 92 DF,  p-value: 1.901e-12
```

```r
> summary(pro_model2)
```

```
Call:
lm(formula = lpsa ~ age + lbph + lcp + gleason, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.05359 -0.56531 -0.03972  0.57489  2.34593 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.319355   1.277511   1.033   0.3044    
age         0.003393   0.014489   0.234   0.8154    
lbph        0.133996   0.071935   1.863   0.0657 .  
lcp         0.412253   0.081544   5.056 2.17e-06 ***
gleason     0.148513   0.162362   0.915   0.3627    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9563 on 92 degrees of freedom
Multiple R-squared:  0.3423,	Adjusted R-squared:  0.3137 
F-statistic: 11.97 on 4 and 92 DF,  p-value: 7.115e-08
```
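The numbers printed above, such as the adjusted R-squared and the per-coefficient p-values, can also be extracted programmatically from the object returned by `summary()`. A minimal sketch on synthetic data (the variables `x` and `y` are made up for this example):

```r
# The numbers printed by summary(lm) can be pulled out programmatically
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 3 * d$x + rnorm(100)

s <- summary(lm(y ~ x, data = d))
s$adj.r.squared                  # adjusted R-squared
s$coefficients["x", "Pr(>|t|)"]  # p-value for the x coefficient
```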

How can we improve model performance?

The key difference between regression modeling and other machine learning approaches is that regression typically leaves feature selection and model specification to the user.

### Adding non-linear relationships

The below will add age and age2 to the **lm()** formula using the lcavol ~ age + age2 form (and likewise for lpsa). This allows the model to separate the linear and non-linear impact of age on **lcavol** and **lpsa**.
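To see why the squared term helps, here is a minimal sketch on synthetic data (the data-generating formula is made up for illustration). A quadratic relationship in age is invisible to a purely linear term, but `lm()` can capture it once `age2` is added:

```r
# Adding a squared term lets lm() fit a curve in age
set.seed(7)
d <- data.frame(age = runif(80, 40, 80))
d$y <- 0.002 * (d$age - 60)^2 + rnorm(80, sd = 0.1)  # U-shaped in age
d$age2 <- d$age^2

m <- lm(y ~ age + age2, data = d)
coef(m)[["age2"]]  # positive coefficient: upward curvature recovered
```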

### Converting a numeric variable to a binary indicator

The **ifelse()** function will be used to convert a numeric variable to a binary indicator.
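A minimal sketch of the idea, with a made-up vector of Gleason scores:

```r
# ifelse() turns a numeric score into a 0/1 indicator:
# 1 where the score equals 6, 0 elsewhere
gleason <- c(6, 7, 6, 9, 7)
gle_ind <- ifelse(gleason == 6, 1, 0)
gle_ind  # 1 0 1 0 0
```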

### Putting it all together

From the above model improvements, we can try to improve the regression formula by:

1. Adding a non-linear term for age
2. Adding a binary indicator for a Gleason score of 6

```r
> prostate$age2 <- prostate$age^2
> prostate$gle13 <- ifelse(prostate$gleason == 6, 1, 0)
```

#### LCAVOL

```r
> pro_model1l <- lm(lcavol ~ age + age2 + lbph + lcp + gleason + gle13, data = prostate)
> summary(pro_model1l)
```

```
Call:
lm(formula = lcavol ~ age + age2 + lbph + lcp + gleason + gle13, 
    data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.81382 -0.47487 -0.05274  0.42309  2.62414 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.6376607  4.2453612   0.386   0.7006    
age          0.0015325  0.1273732   0.012   0.9904    
age2         0.0001434  0.0010434   0.137   0.8910    
lbph        -0.0258883  0.0652195  -0.397   0.6924    
lcp          0.4869437  0.0761986   6.390 7.15e-09 ***
gleason     -0.1037065  0.2032584  -0.510   0.6111    
gle13       -0.5211846  0.3097377  -1.683   0.0959 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8639 on 90 degrees of freedom
Multiple R-squared:  0.4964,	Adjusted R-squared:  0.4628 
F-statistic: 14.78 on 6 and 90 DF,  p-value: 1.094e-11
```

#### LPSA

```r
> pro_model2l <- lm(lpsa ~ age + age2 + lbph + lcp + gleason + gle13, data = prostate)
> summary(pro_model2l)
```

```
Call:
lm(formula = lpsa ~ age + age2 + lbph + lcp + gleason + gle13, 
    data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41908 -0.61200  0.06326  0.53455  2.52596 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.122462   4.574862   1.994   0.0492 *  
age         -0.157913   0.137259  -1.150   0.2530    
age2         0.001304   0.001124   1.160   0.2491    
lbph         0.118201   0.070281   1.682   0.0961 .  
lcp          0.361107   0.082113   4.398 2.99e-05 ***
gleason     -0.236838   0.219034  -1.081   0.2825    
gle13       -0.825832   0.333778  -2.474   0.0152 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9309 on 90 degrees of freedom
Multiple R-squared:  0.3903,	Adjusted R-squared:  0.3497 
F-statistic: 9.602 on 6 and 90 DF,  p-value: 3.77e-08
```

I believe for the prostate dataset, the decision tree was the easiest to interpret when compared to regression. As one can see from the above, adding a non-linear term for age produced only a modest improvement in R-squared: LCAVOL improved from 0.4805 to 0.4964 and LPSA from 0.3423 to 0.3903.

The decision tree really visualizes the biggest factors in the prostate data. As one can see, the lcavol dependent variable's biggest factor is **lcp**, then **gleason**, while lpsa's biggest factor is **gleason**, then **lbph**. Overall, linear regression was the better method for prediction on the prostate dataset.