In this R tutorial, we will review credit scoring of mortgage loans and the criteria that cause an applicant to be rejected. We will review each applicant and the percentage of applications that were approved but should have been rejected. Lending mortgages is risky, and it requires a detailed review of each applicant and walking the fine line of who should and shouldn't be approved.
C5.0 Algorithm
The method we will use to review credit defaults is the C5.0 algorithm, created by J. Ross Quinlan, which produces decision trees. The C5.0 algorithm has some notable strengths for this kind of default assessment: its results can be interpreted without a mathematical background, and it can be used on small or large datasets.
Install and Load Packages
Below are the packages and libraries that we will need to load to complete this tutorial.
Input:
install.packages("C50")
install.packages("gmodels")
install.packages("party")
install.packages("RColorBrewer")

library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
Download and Load the Credit Dataset
Now that our libraries are loaded, let's pull in the data. Since we will be using the credit dataset, you will need to download it. This dataset is already packaged and available for easy download from the dataset page or directly from here: Credit Dataset – credit.csv
credit <- read.csv("credit.csv")
View the Credit Dataset
Once the data is imported, you can run a series of commands to see sample data of the credit data.
str() function
The str() command displays the internal structure of an R object. This function is an alternative to summary(). When using the str() function, only one line for each basic structure will be displayed.
Input:
str(credit)
Output:
'data.frame':	1000 obs. of  17 variables:
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
 $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
 $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
 $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
 $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
 $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
summary() function
The summary() function is a generic function used to produce result summaries of various objects and model fits.
Input:
summary(credit)
Output:
 checking_balance  months_loan_duration  credit_history                purpose   
 < 0 DM    :274    Min.   : 4.0          critical :293   business            : 97
 > 200 DM  : 63    1st Qu.:12.0          good     :530   car                 :337
 1 - 200 DM:269    Median :18.0          perfect  : 40   car0                : 12
 unknown   :394    Mean   :20.9          poor     : 88   education           : 59
                   3rd Qu.:24.0          very good: 49   furniture/appliances:473
                   Max.   :72.0                          renovations         : 22

     amount        savings_balance  employment_duration  percent_of_income
 Min.   :  250   < 100 DM     :603   < 1 year   :172     Min.   :1.000
 1st Qu.: 1366   > 1000 DM    : 48   > 7 years  :253     1st Qu.:2.000
 Median : 2320   100 - 500 DM :103   1 - 4 years:339     Median :3.000
 Mean   : 3271   500 - 1000 DM: 63   4 - 7 years:174     Mean   :2.973
 3rd Qu.: 3972   unknown      :183   unemployed : 62     3rd Qu.:4.000
 Max.   :18424                                           Max.   :4.000

 years_at_residence      age        other_credit  housing     existing_loans_count
 Min.   :1.000       Min.   :19.00   bank :139    other:108   Min.   :1.000
 1st Qu.:2.000       1st Qu.:27.00   none :814    own  :713   1st Qu.:1.000
 Median :3.000       Median :33.00   store: 47    rent :179   Median :1.000
 Mean   :2.845       Mean   :35.55                            Mean   :1.407
 3rd Qu.:4.000       3rd Qu.:42.00                            3rd Qu.:2.000
 Max.   :4.000       Max.   :75.00                            Max.   :4.000

         job         dependents     phone     default  
 management:148   Min.   :1.000   no :596   no :700
 skilled   :630   1st Qu.:1.000   yes:404   yes:300
 unemployed: 22   Median :1.000
 unskilled :200   Mean   :1.155
                  3rd Qu.:1.000
                  Max.   :2.000
head() function
To get an idea of what data is being processed, we can use the head() function to print the first six values of the following columns:
- Checking balance
- Savings balance
- Default
Input:
head(credit$checking_balance)
Output:
[1] < 0 DM     1 - 200 DM unknown    < 0 DM     < 0 DM     unknown   
Levels: < 0 DM > 200 DM 1 - 200 DM unknown
Input:
head(credit$savings_balance)
Output:
[1] unknown  < 100 DM < 100 DM < 100 DM < 100 DM unknown 
Levels: < 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown
Input:
head(credit$default)
Output:
[1] no  yes no  no  yes no 
Levels: no yes
table() function
In addition, we can use the table() function to print the total yes or no defaults within the credit data.
Input:
table(credit$default)
Output:
 no yes 
700 300 
Next, we will create a credit default plot for the above table. First, we ensure the column is a factor by using the as.factor() function in R.
Input:
credit_fac <- as.factor(credit$default)
plot(credit_fac)
Output:
       no       yes 
0.7033333 0.2966667 

  no  yes 
0.67 0.33 
To Prune Early... or to Prune Later?
A decision tree can grow indefinitely as it keeps splitting on features and dividing the data. Just as you would prune an oversized tree in your yard, pruning a decision tree reduces its size. Pruning can be pre-pruning or post-pruning: pre-pruning stops the tree once it reaches a certain depth or number of decision nodes, while post-pruning grows the full tree and then trims back branches that add little value. In my opinion, pre-pruning a decision tree before letting it grow to an optimal size could miss important patterns; the purpose of a decision tree is to learn the data in depth, and pre-pruning works against that. I would rather post-prune, because it lets the decision tree reach its maximum depth first and keeps all of the important splits available to the algorithm.
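As a rough illustration of the two approaches, the C50 package exposes both styles through C5.0Control(). This is a minimal sketch rather than part of the original tutorial; the object names are my own, and the values minCases = 10 and CF = 0.10 are illustrative, not tuned.

# Sketch: controlling pruning in C5.0 via C5.0Control() (illustrative values).

# Pre-pruning style: require at least 10 cases before a split is made,
# so the tree stops growing before branches become very small.
prepruned_model <- C5.0(credit[-17], credit$default,
                        control = C5.0Control(minCases = 10))

# Post-pruning style: let the tree grow, then prune more aggressively
# by lowering the confidence factor (smaller CF means heavier pruning).
postpruned_model <- C5.0(credit[-17], credit$default,
                         control = C5.0Control(CF = 0.10))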
Putting C5.0 Algorithm into Action
The CrossTable below covers a total of 100 test applications and shows the false positives and false negatives, i.e., the approvals the lender got wrong. You will also see the set.seed(123) function, which seeds R's random number generator. This function is very useful for creating simulations or random samples that can be reproduced.
Input:
set.seed(123)
train_sample <- sample(1000, 900)

credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]

credit_model <- C5.0(credit_train[-17], credit_train$default)
credit_pred <- predict(credit_model, credit_test)

CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
Output:
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100 

               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |        59 |         8 |        67 | 
               |     0.590 |     0.080 |           | 
---------------|-----------|-----------|-----------|
           yes |        19 |        14 |        33 | 
               |     0.190 |     0.140 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        78 |        22 |       100 | 
---------------|-----------|-----------|-----------|
Predicted no, actual no – 59 (correctly predicted non-defaults)
Predicted yes, actual no – 8 (false positives: flagged as defaults but did not default)
Predicted no, actual yes – 19 (false negatives: approved but actually defaulted)
Predicted yes, actual yes – 14 (correctly predicted defaults)

These counts are totaled in the short sketch below.
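As a quick check (this is my own summary sketch, not code from the original tutorial), we can total those cells to get overall accuracy and the rate of missed defaults:

# Sketch: summarizing the confusion matrix above.
conf <- table(actual = credit_test$default, predicted = credit_pred)

accuracy <- sum(diag(conf)) / sum(conf)           # (59 + 14) / 100 = 0.73
missed_defaults <- conf["yes", "no"] / sum(conf)  # 19 / 100 = 0.19

accuracy
missed_defaults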
ctree() Evaluation
We will use the ctree() algorithm to evaluate credit_train, starting with default and age. ctree() builds conditional inference trees, which perform recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
Based on the decision tree, 175 applicants are younger than 25 years old, and about 42 percent of them defaulted. The remaining 725 applicants are over 25, and about 25 percent of that group defaulted. That breaks down to roughly 74 defaults among applicants under 25 and about 181 defaults among those older than 25.
Input:
credit_train <- credit[train_sample, ]

credit_ctree_age <- ctree(default ~ age, data = credit_train)
Decision Tree Includes Default and Age
This age gap points to a real difference in default risk between the age groups. This could be for a variety of reasons, such as younger applicants lacking financial responsibility or stable employment.
Input:
plot(credit_ctree_age)
Output:

(Plot: conditional inference tree of default by age)
Total percentage breakdown:
- Node 2 – 175 applicants * .42 = 74 defaults
- Node 3 – 725 applicants * .25 = 181 defaults
Decision Tree Includes Employment Duration
If instability in keeping a job is a factor, adding employment_duration to the decision tree will show how those groups rank. As one can see below, node 2 covers applicants with an employment duration of less than a year or who are unemployed, a total of 206 applicants with a default rate of 0.39. Node 3 covers the remaining 694 applicants, whose employment duration is greater than 7 years, 4-7 years, or 1-4 years, with a default rate of 0.25.
Input:
credit_train <- credit[train_sample, ]

credit_ctree_job <- ctree(default ~ employment_duration, data = credit_train)
plot(credit_ctree_job)
Output:

(Plot: conditional inference tree of default by employment_duration)
Total percentage breakdown:
- Node 2 – 206 applicants * .39 = 80 defaults
- Node 3 – 694 applicants * .25 = 173 defaults
From the above analysis, employment duration appears to be a stronger driver of defaults than age. I believe the largest factor in defaults is losing employment with no backup funds to pay the lender.
Decision Tree Includes Savings Balance
My biggest personal concern with a large loan such as a mortgage is thinking of the worst:
- What if I lose my job?
- Will I have enough in my savings to cover my mortgage plus other expenses?
The decision tree below, based on savings_balance, shows a clear difference for applicants who have a larger sum of savings. Node 2 groups the higher balances (greater than 1000 DM and 500 - 1000 DM), while node 3 groups the lower balances (less than 100 DM and 100 - 500 DM).
Input:
credit_train <- credit[train_sample, ]

credit_ctree_savings <- ctree(default ~ savings_balance, data = credit_train)
plot(credit_ctree_savings)
Output:

(Plot: conditional inference tree of default by savings_balance)
Total percentage breakdown:
- Node 2 – 261 applicants * 0.18 defaults = 47 defaults
- Node 3 – 639 applicants * 0.38 defaults = 242 defaults
As per the above, there is roughly a 20 percentage point difference in default rates between the higher-savings and lower-savings groups. This is the largest gap among the decision trees based on age, employment duration, and savings. I believe having savings helps decrease defaults, but lenders cannot force each applicant to be financially savvy.
Decision Tree Includes Multiple Factors
The decision tree below uses multiple factors, and the plot will require some modification so the data is easy to read; a sketch of one possible readability tweak follows the printed tree output below.
Input:
credit_train <- credit[train_sample, ]

output_credit_ctree <- ctree(default ~ checking_balance + other_credit +
                               months_loan_duration + credit_history +
                               savings_balance, data = credit_train)
plot(output_credit_ctree)
Output:

(Plot: conditional inference tree of default by checking_balance, other_credit, months_loan_duration, credit_history, and savings_balance)
As you can see in the above plot, the data is split across 17 nodes, 9 of which are terminal nodes. In addition, we can print output_credit_ctree to view the tree structure and its inputs.
Input:
output_credit_ctree
Output:
	 Conditional inference tree with 9 terminal nodes

Response:  default 
Inputs:  age, checking_balance, other_credit, months_loan_duration, credit_history, savings_balance 
Number of observations:  900 

1) checking_balance == {> 200 DM, unknown}; criterion = 1, statistic = 120.21
  2) other_credit == {none, store}; criterion = 1, statistic = 21.281
    3) checking_balance == {unknown}; criterion = 0.989, statistic = 10.421
      4)*  weights = 314 
    3) checking_balance == {> 200 DM}
      5)*  weights = 51 
  2) other_credit == {bank}
    6)*  weights = 47 
1) checking_balance == {< 0 DM, 1 - 200 DM}
  7) months_loan_duration <= 20; criterion = 1, statistic = 25.7
    8) credit_history == {perfect, very good}; criterion = 1, statistic = 24.001
      9)*  weights = 24 
    8) credit_history == {critical, good, poor}
      10)*  weights = 234 
  7) months_loan_duration > 20
    11) savings_balance == {> 1000 DM, unknown}; criterion = 0.985, statistic = 16.461
      12) checking_balance == {< 0 DM}; criterion = 1, statistic = 17.417
        13)*  weights = 18 
      12) checking_balance == {1 - 200 DM}
        14)*  weights = 21 
    11) savings_balance == {< 100 DM, 100 - 500 DM, 500 - 1000 DM}
      15) months_loan_duration <= 47; criterion = 0.967, statistic = 7.664
        16)*  weights = 157 
      15) months_loan_duration > 47
        17)*  weights = 34 
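The exact plot modification used originally is not shown in this section. As one possible (assumed) approach for making a tree this size easier to read, party's plot method accepts type = "simple", which collapses the terminal panels into compact node summaries:

# Sketch: a more compact rendering of the multi-factor tree (illustrative).
plot(output_credit_ctree, type = "simple")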
In the next R tutorial, we will improve the model's performance. We will continue to use the C5.0 algorithm, which improved on C4.5 with the addition of boosting. Within C5.0, trials is an integer specifying the number of boosting iterations, and a value of one indicates that a single model is used. The C5.0 model can also take the form of a full decision tree or a collection of rules, which will help us expand on decision trees for the credit data.
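As a brief preview (a sketch only; the tuning is covered in Part 2, the object names are my own, and trials = 10 is an illustrative value), boosting and rule-based output are both switched on through arguments to C5.0():

# Sketch: boosting with 10 iterations, and a rule-based version of the model.
credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)
credit_rules   <- C5.0(credit_train[-17], credit_train$default, rules = TRUE)

summary(credit_boost10)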
Check out Part 2: Decision Tree Analysis with Credit Data in R | Part 2