In this R tutorial, we will estimate the quality of wines using regression trees and model trees. Machine learning has been used to discover key differences in the chemical composition of wines from different regions and to identify the chemical factors that lead a wine to taste sweeter. Wine quality is usually judged by expert ratings, and a model that can mimic those ratings can help predict whether a wine belongs on the bottom shelf or the top shelf.
We will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. This will allow winemakers to identify the key factors that contribute to better-rated wines.
Install and Load Packages
Below are the packages and libraries that we will need to install and load to complete this tutorial. Note that the RWeka package requires a working Java installation.
Input:
install.packages("C50")
install.packages("gmodels")
install.packages("party")
install.packages("RColorBrewer")
install.packages("psych")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("RWeka")

library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
library(psych)
library(rpart)
library(rpart.plot)
library(RWeka)
Download and Load the White Wine Dataset
Since we will be using the wine datasets, you will need to download them. The datasets are already packaged and available for easy download from the dataset page or directly from here: White Wine – whitewines.csv
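If you prefer to fetch the file from within R, a quick sketch using download.file() is shown below; the URL is only a placeholder, so substitute the actual link from the dataset page.

# Placeholder URL -- replace with the real link to whitewines.csv
url <- "https://example.com/datasets/whitewines.csv"
download.file(url, destfile = "whitewines.csv", mode = "wb")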
Example import command for the white wine CSV file.
Input:
wine <- read.csv("whitewines.csv", stringsAsFactors = FALSE)
View the White Wine Dataset
str() function
Input:
str(wine)
Output:
'data.frame': 4898 obs. of 12 variables:
 $ fixed.acidity       : num  6.7 5.7 5.9 5.3 6.4 7 7.9 6.6 7 6.5 ...
 $ volatile.acidity    : num  0.62 0.22 0.19 0.47 0.29 0.14 0.12 0.38 0.16 0.37 ...
 $ citric.acid         : num  0.24 0.2 0.26 0.1 0.21 0.41 0.49 0.28 0.3 0.33 ...
 $ residual.sugar      : num  1.1 16 7.4 1.3 9.65 0.9 5.2 2.8 2.6 3.9 ...
 $ chlorides           : num  0.039 0.044 0.034 0.036 0.041 0.037 0.049 0.043 0.043 0.027 ...
 $ free.sulfur.dioxide : num  6 41 33 11 36 22 33 17 34 40 ...
 $ total.sulfur.dioxide: num  62 113 123 74 119 95 152 67 90 130 ...
 $ density             : num  0.993 0.999 0.995 0.991 0.993 ...
 $ pH                  : num  3.41 3.22 3.49 3.48 2.99 3.25 3.18 3.21 2.88 3.28 ...
 $ sulphates           : num  0.32 0.46 0.42 0.54 0.34 0.43 0.47 0.47 0.47 0.39 ...
 $ alcohol             : num  10.4 8.9 10.1 11.2 10.9 ...
 $ quality             : int  5 6 6 4 6 6 6 6 6 7 ...
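Before modeling, it can be worth a quick sanity check that the outcome variable looks reasonable and that there are no missing values; the two lines below are an optional addition rather than part of the original tutorial steps.

summary(wine$quality)   # five-number summary plus mean of the outcome variable
anyNA(wine)             # TRUE would indicate missing values that need handling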
Histogram of the Quality of Wine
Input:
hist(wine$quality)
Output:
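hist() accepts the usual base-graphics arguments if you want a more polished figure; the title, axis label, and color below are merely suggestions.

hist(wine$quality,
     main = "Distribution of white wine quality ratings",  # illustrative title
     xlab = "Quality score",
     col = "lightblue")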
Create Wine Train and Test Datasets
Input:
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
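The split above simply takes the first 3750 rows for training and the remaining 1148 rows for testing, which is fine when the rows are already in random order. If you are unsure whether your data is randomized, a random split is safer; the sketch below (object names such as train_idx are our own) keeps roughly the same 75/25 proportion.

set.seed(123)                                          # for reproducibility
train_idx <- sample(nrow(wine), floor(0.75 * nrow(wine)))
wine_train_rand <- wine[train_idx, ]                   # ~75% of rows for training
wine_test_rand  <- wine[-train_idx, ]                  # remaining ~25% for testing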
Methods for training a model on the data
We will use rpart() to specify quality as the outcome variable and use the dot notation to allow all the other columns in the wine_train data frame to be used as predictors.
Input:
m.rpart <- rpart(quality ~ ., data = wine_train)
m.rpart
Output:
n= 3750

node), split, n, deviance, yval
      * denotes terminal node

 1) root 3750 2945.53200 5.870933
   2) alcohol< 10.85 2372 1418.86100 5.604975
     4) volatile.acidity>=0.2275 1611  821.30730 5.432030
       8) volatile.acidity>=0.3025 688  278.97670 5.255814 *
       9) volatile.acidity< 0.3025 923  505.04230 5.563380 *
     5) volatile.acidity< 0.2275 761  447.36400 5.971091 *
   3) alcohol>=10.85 1378 1070.08200 6.328737
     6) free.sulfur.dioxide< 10.5 84   95.55952 5.369048 *
     7) free.sulfur.dioxide>=10.5 1294  892.13600 6.391036
      14) alcohol< 11.76667 629  430.11130 6.173291
        28) volatile.acidity>=0.465 11   10.72727 4.545455 *
        29) volatile.acidity< 0.465 618  389.71680 6.202265 *
      15) alcohol>=11.76667 665  403.99400 6.596992 *
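For a more detailed look at the fitted tree, the rpart package also offers printcp() and summary(); these are optional diagnostics rather than part of the original walkthrough.

printcp(m.rpart)    # complexity parameter table with cross-validated error
summary(m.rpart)    # detailed description of each split (verbose output)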
Decision Tree Visualization
As one can see below, the visualization of the decision tree is much easier to read than the text output. The digits parameter rounds the displayed numbers to three significant digits.
Input:
rpart.plot(m.rpart, digits = 3)
Output:
fallen.leaves parameter addition to the decision tree
This addition helps with the dissemination of regression tree results, as the resulting visualizations are readily understood even without a mathematics background. Setting fallen.leaves = TRUE forces the leaf nodes to be aligned at the bottom of the plot, while the type and extra parameters adjust the way the nodes are labeled. The leaf nodes show the predicted value for the examples reaching that node.
Input:
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
Output:
Test Data Prediction
To make predictions on the test data, we use the predict() function. This returns the estimated numeric value of the outcome variable for each example in the test set.
Input:
p.rpart <- predict(m.rpart, wine_test)
summary(p.rpart)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.545   5.563   5.971   5.893   6.202   6.597
Input:
summary(wine_test$quality)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   5.000   6.000   5.901   6.000   9.000
Correlation
Checking the correlation between the predicted and actual quality values provides a simple way to gauge the model’s performance. As one can see, the correlation is about 0.54, which is acceptable but not ideal. Keep in mind that correlation only measures how strongly the predictions are related to the true values; it is not a measure of how far off the predictions were from the true values.
Input:
cor(p.rpart, wine_test$quality)
Output:
[1] 0.5369525
Mean Absolute Error
Let’s consider another way to measure the model’s performance: how far, on average, its prediction was from the true value. This measurement is known as the mean absolute error (MAE), which we define as a small helper function below.
Input:
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
MAE(p.rpart, wine_test$quality)
Output:
[1] 0.5872652
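MAE is only one way to summarize prediction error. If large mistakes should be penalized more heavily, a root mean squared error (RMSE) function can be written in the same style as MAE above; the RMSE helper below is our own addition, not part of the original tutorial code.

# Root mean squared error: square the errors, average them, then take the root
RMSE <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
RMSE(wine_test$quality, p.rpart)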
On average, the difference between our model’s predictions and the true quality score (the MAE) was about 0.59. To put this in perspective, consider how well we would do by simply predicting the mean quality score from the training data for every wine:
Input:
mean(wine_train$quality)
Output:
[1] 5.870933
Input:
MAE(5.87, wine_test$quality)
Output:
[1] 0.6722474
The above shows there is still room for improvement, but the regression tree is clearly useful: its MAE of 0.59 comes closer, on average, to the true quality score than simply imputing the mean, which gives an MAE of 0.67.
M5 Algorithm Improvement
The M5P() function from the RWeka package implements the M5 algorithm and returns a model tree object that can be used to make predictions. As before, the predict() function will return a vector of predicted numeric values.
Input:
m.m5p <- M5P(quality ~ ., data = wine_train)
summary(m.m5p)
Output:
=== Summary ===

Correlation coefficient                  0.6666
Mean absolute error                      0.5151
Root mean squared error                  0.6614
Relative absolute error                 76.4921 %
Root relative squared error             74.6259 %
Total Number of Instances             3750
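To inspect the structure of the model tree itself, including the splits and the linear models fitted at each leaf, you can simply print the object:

m.m5p   # prints the tree structure and the per-leaf linear regression models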
Let’s make predictions on the unseen test data. As shown below, both the correlation and the mean absolute error improve over the regression tree.
Input:
p.m5p <- predict(m.m5p, wine_test)
summary(p.m5p)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.389   5.430   5.863   5.874   6.305   7.437
Input:
cor(p.m5p, wine_test$quality)
Output:
[1] 0.6272973
Input:
MAE(wine_test$quality, p.m5p)
Output:
[1] 0.5463023
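As a final check, the results computed above can be collected side by side; the small data frame below (the name comparison is illustrative) makes it easy to see that the model tree improves on the regression tree for both metrics.

comparison <- data.frame(
  model       = c("rpart regression tree", "M5P model tree"),
  correlation = c(cor(p.rpart, wine_test$quality), cor(p.m5p, wine_test$quality)),
  MAE         = c(MAE(wine_test$quality, p.rpart), MAE(wine_test$quality, p.m5p))
)
comparison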
In this tutorial, decision trees were used for numeric prediction to model the wine data. We also built a model tree, a hybrid approach that fits a regression model at each leaf node. Although the correlation from cor() did not improve dramatically, the model tree surpassed the published performance of a neural network model on this data, and its mean absolute error came close to the published value of 0.45 for a support vector machine model, while using a much simpler learning method.