In the last tutorial, Decision Tree Analysis with Credit Data in R | Part 1, we learned how to create decision trees using ctree(). This function performs recursive partitioning for continuous, censored, ordered, nominal, and multivariate response variables in a conditional inference framework.
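As a quick refresher, ctree() from the party package is called with a simple formula interface. Below is a minimal, self-contained sketch on R's built-in iris data (an illustration of the function only, not the credit analysis from Part 1):

library(party)

# Fit a conditional inference tree predicting species from all other columns
iris_ctree <- ctree(Species ~ ., data = iris)

# Print the fitted tree and plot its structure
print(iris_ctree)
plot(iris_ctree)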
C5.0 algorithm
In this R tutorial, we will learn to use the C5.0 algorithm, which improves on the C4.5 algorithm by adding adaptive boosting. Boosting is a method in which many decision trees are built and the trees vote on the best class for each example. Applying it is easy: we add a trials parameter indicating the number of separate decision trees to use in the boosted team. The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy.
Install and Load Packages
Below are the packages that we will need to install and load to complete this tutorial.
Input:
install.packages("C50")
install.packages("gmodels")
install.packages("party")
install.packages("RColorBrewer")

library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
Download and Load the Credit Dataset
Now that our libraries are loaded, let’s pull in the data. Since we will be working with the credit dataset, you will need to download it. This dataset is already packaged and available for an easy download from the dataset page or directly from here: Credit Dataset – credit.csv
Input:
credit <- read.csv("credit.csv")
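One caveat worth noting: on R 4.0 and later, read.csv() no longer converts strings to factors by default, and C5.0() requires the class variable to be a factor. A quick, version-safe check (assuming the default column is coded "no"/"yes", as shown in the cross tables below):

# C5.0() requires a factor target; on R >= 4.0 read.csv() leaves strings
# as character vectors, so convert explicitly
credit$default <- factor(credit$default)

# Sanity checks: structure and class balance of the target variable
str(credit$default)
table(credit$default)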
10 Trials with the Credit Dataset
We will start out with 10 trials, which is the de facto standard; research has shown that this can reduce error rates on the test data by about 25 percent. The trials argument boosts the decision tree and, as noted above, sets an upper limit: the algorithm will stop adding trees if additional trials do not seem to be improving the accuracy.
Input:
# Split the data: 900 training rows, 100 test rows
set.seed(123)
train_sample <- sample(1000, 900)
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]

# Fit a boosted C5.0 model (10 trials); column 17 is the default target
credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                       trials = 10)
credit_boost_pred10 <- predict(credit_boost10, credit_test)

# Compare predicted vs. actual defaults on the test set
CrossTable(credit_test$default, credit_boost_pred10,
           prop.chisq = FALSE, prop.r = FALSE, prop.c = FALSE,
           dnn = c('actual default', 'predicted default'))
Output:
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

               | predicted default
actual default |        no |       yes | Row Total |
---------------|-----------|-----------|-----------|
            no |        62 |         5 |        67 |
               |     0.620 |     0.050 |           |
---------------|-----------|-----------|-----------|
           yes |        13 |        20 |        33 |
               |     0.130 |     0.200 |           |
---------------|-----------|-----------|-----------|
  Column Total |        75 |        25 |       100 |
---------------|-----------|-----------|-----------|
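The table shows 18 total mistakes (5 false positives and 13 false negatives) out of 100 test cases. If you want to dig into the boosted model itself, the C50 package’s print and summary methods report the individual trees and the training confusion matrix, and the test error rate can be computed directly:

# Basic model info: samples, predictors, number of boosting iterations
credit_boost10

# Detailed view of all 10 boosted trees plus a training-set confusion
# matrix (the output is lengthy)
summary(credit_boost10)

# Test-set error rate: proportion of mismatched predictions (0.18 here)
mean(credit_boost_pred10 != credit_test$default)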
What’s the cost of defaulting?
When a person defaults, it’s costly to the lender or bank and to the person’s credit. This mistake can be very costly to the lender, and one solution is to reduce the number of false negatives: applicants predicted not to default who actually do. Lenders may also prefer to reject applicants who are close to the margin because of the risk involved.
The C5.0 algorithm allows us to assign a penalty to different types of errors via what is known as a cost matrix. The cost matrix outlines the various types of errors that could occur and discourages the model from making the costly ones.
Four types of credit default predictions:
- Predicted no, actual no (correct)
- Predicted yes, actual no (false positive)
- Predicted no, actual yes (false negative)
- Predicted yes, actual yes (correct)
Creating Matrix Dimensions
A cost matrix specifies how much costlier each error is. We begin by naming the matrix dimensions for the predicted and actual values.
Input:
# Name the dimensions of the cost matrix: predicted values x actual values
matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")
matrix_dimensions
Output:
$predicted
[1] "no"  "yes"

$actual
[1] "no"  "yes"
Create Error Cost
Now that the dimension names for the matrix are created, we can assign the error cost for each of the four outcomes.
Input:
# Penalize a false negative (predicted no, actual yes) four times as
# heavily as a false positive; correct predictions cost nothing
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = matrix_dimensions)

# Refit the model, this time with the cost matrix applied
credit_cost <- C5.0(credit_train[-17], credit_train$default,
                    costs = error_cost)
error_cost
Output:
         actual
predicted no yes
      no   0   4
      yes  1   0
As you can see from the above, each error has a cost: a false negative (predicted no, actual yes) has a cost factor of 4, four times the cost of a false positive.
With the addition of the error cost, the overall number of mistakes in the cross table below increases: the cost-sensitive model has a 37 percent error rate among test applicants, while the boosted model had only an 18 percent error rate. The trade-off is that the costly false negatives drop from 13 to 7, at the expense of many more false positives.
Input:
# Recreate the cost matrix (as above)
matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = matrix_dimensions)

# Fit the cost-sensitive model and evaluate it on the test set
credit_cost <- C5.0(credit_train[-17], credit_train$default,
                    costs = error_cost)
credit_cost_pred <- predict(credit_cost, credit_test)
CrossTable(credit_test$default, credit_cost_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
Output:
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

               | predicted default
actual default |        no |       yes | Row Total |
---------------|-----------|-----------|-----------|
            no |        37 |        30 |        67 |
               |     0.370 |     0.300 |           |
---------------|-----------|-----------|-----------|
           yes |         7 |        26 |        33 |
               |     0.070 |     0.260 |           |
---------------|-----------|-----------|-----------|
  Column Total |        44 |        56 |       100 |
---------------|-----------|-----------|-----------|
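To quantify the trade-off directly, we can count each model’s false negatives, the mistake our cost matrix penalizes most heavily (a quick check, assuming the "no"/"yes" factor levels shown in the tables):

# Costly false negatives (predicted no, actual yes) for each model
sum(credit_boost_pred10 == "no" & credit_test$default == "yes")  # 13 boosted
sum(credit_cost_pred == "no" & credit_test$default == "yes")     # 7 cost-sensitive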
Lenders take a major risk when lending and walk a fine line when an applicant is close to rejection. From the analysis, lenders take chances on applicants, and an application can end in default for a number of reasons. As seen within the decision trees, age, employment duration, and savings all account for a major share of defaulted applications. Based on the analysis, building savings would help decrease defaults among applicants by giving them money to fall back on in the case of an emergency.
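If you would like to verify which attributes drive the splits in the fitted model, the C50 package includes C5imp(), which reports the percentage of training samples each predictor is used to classify; a quick sketch:

# Predictor usage for the cost-sensitive model: the percentage of
# training cases that pass through a split on each attribute
C5imp(credit_cost, metric = "usage")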