In this R tutorial, we will analyze a dataset of mushrooms. Every year, many people become sick or even die from eating poisonous mushrooms. The purpose of this report is to show how edible and poisonous mushrooms can be told apart, with odor as the main distinguishing feature.
What are Greedy Algorithms?
To work with the various mushroom types and odors, we will use greedy algorithms. The two methods used are OneR() and JRip(). OneR() tests the target variable (type) against each predictor (odor, etc.) and keeps the single predictor that classifies best. Here it shows how well odor alone separates poisonous mushrooms from edible ones. JRip() produces a list of rules that read like a chain of if-else statements, similar to most programming logic, and prints those rules so they can be used to tell an edible mushroom from a poisonous one.
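To make the OneR() idea concrete, here is a minimal base-R sketch of a one-rule classifier. It is not the RWeka implementation: the toy data frame and the helper names one_rule() and simple_one_r() are invented for illustration. For each predictor it maps every level to its most frequent class, counts the errors, and keeps the predictor with the fewest.

```r
# Toy data, invented for illustration (not the mushrooms dataset)
toy <- data.frame(
  type      = c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous"),
  odor      = c("almond", "none", "foul", "foul", "none", "pungent"),
  gill_size = c("broad", "broad", "narrow", "broad", "narrow", "narrow"),
  stringsAsFactors = TRUE
)

# For one predictor, map each level to its most frequent class
# and count how many rows that rule gets wrong.
one_rule <- function(data, target, predictor) {
  counts <- table(data[[predictor]], data[[target]])
  rule <- colnames(counts)[apply(counts, 1, which.max)]
  names(rule) <- rownames(counts)
  errors <- as.numeric(sum(counts) - sum(apply(counts, 1, max)))
  list(predictor = predictor, rule = rule, errors = errors)
}

# Keep the single predictor whose rule makes the fewest errors
simple_one_r <- function(data, target) {
  predictors <- setdiff(names(data), target)
  candidates <- lapply(predictors, function(p) one_rule(data, target, p))
  errors <- vapply(candidates, function(r) r$errors, numeric(1))
  candidates[[which.min(errors)]]
}

best <- simple_one_r(toy, "type")
print(best$predictor)
print(best$rule)
print(best$errors)
```

On this toy data the odor rule makes no errors while the gill_size rule makes two, so odor is selected.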
Install and Load Packages
Below are the packages and libraries that we will need to load to complete this tutorial.
Input:
# Install the packages (only needed once)
install.packages("C50")
install.packages("gmodels")
install.packages("party")
install.packages("RColorBrewer")
install.packages("RWeka")

# Load the packages
library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
library(RWeka)
Download and Load the Mushrooms Dataset
This tutorial uses the mushrooms dataset, so you will need to download it first. The dataset is already packaged and available for download from the dataset page, or directly here: Mushroom Dataset – mushrooms.csv
Input:
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
mushrooms$veil_type <- NULL
View the Mushrooms Dataset
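Once the data is imported, you can run a series of commands to inspect the mushrooms dataset.
str() function
Run the str() function to see the structure of the mushrooms dataset: the number of observations, the type of each variable and its first few values. The output below still lists veil_type, so it reflects the data before that column was removed; with the column dropped, str() reports 22 variables.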
Input:
str(mushrooms)
Output:
'data.frame': 8124 obs. of 23 variables:
 $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
 $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
 $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
 $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
 $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
 $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
 $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
 $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
 $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
 $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
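The str() output lists veil_type with a single level ("partial"), so that column cannot help separate the two classes, which is consistent with it being dropped in the loading step. The sketch below shows one way to find such constant columns automatically. It uses a small toy data frame invented for illustration, not the real file.

```r
# Toy data frame, invented for illustration; veil_type has one level only
toy <- data.frame(
  type      = factor(c("edible", "poisonous", "edible")),
  veil_type = factor(c("partial", "partial", "partial")),
  odor      = factor(c("almond", "foul", "none"))
)

# Count the levels of each column and flag those with a single level
level_counts <- vapply(toy, nlevels, integer(1))
constant_cols <- names(level_counts)[level_counts == 1]
print(constant_cols)

# Drop the constant columns, as the tutorial does for veil_type
toy[constant_cols] <- NULL
print(names(toy))
```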
table() function
We can use the table() function to view how many mushrooms are edible versus poisonous.
Input:
table(mushrooms$type)
Output:
   edible poisonous
     4208      3916
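The raw counts are easier to compare as proportions. As a small sketch, the counts from the table() output above are typed in by hand here so the snippet runs without the CSV file; with the data loaded, prop.table(table(mushrooms$type)) would do the same job.

```r
# Class counts copied from the table() output above
type_counts <- c(edible = 4208, poisonous = 3916)

# Share of each class, rounded to three decimals
type_share <- round(prop.table(type_counts), 3)
print(type_share)
```

The two classes are close to balanced, at roughly 52% edible and 48% poisonous.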
Input:
plot(mushrooms$type)
Output:
[Figure: bar chart of edible versus poisonous mushroom counts]
OneR() classification
This classification compares the target variable, mushroom type, against all the predictors in mushrooms. The formula type ~ . tells OneR() to consider every predictor, and it keeps the one that predicts type with the fewest errors. In the output below, the odor feature is selected. It shows that a mushroom with a creosote, fishy, foul, musty, pungent or spicy smell is classified as poisonous, while a mushroom that smells of almond or anise, or has no odor, is classified as edible.
Input:
mushroom_1R <- OneR(type ~ ., data = mushrooms)
mushroom_1R
Output:
odor:
    almond   -> edible
    anise    -> edible
    creosote -> poisonous
    fishy    -> poisonous
    foul     -> poisonous
    musty    -> poisonous
    none     -> edible
    pungent  -> poisonous
    spicy    -> poisonous
(8004/8124 instances correct)
Evaluating Model Performance on OneR
In the output below, the confusion matrix breaks the results down by class, where a = edible and b = poisonous. OneR() did not classify any edible mushrooms as poisonous. However, it did classify 120 poisonous mushrooms as edible, which could be a deadly mistake.
Input:
summary(mushroom_1R)
Output:
=== Summary ===

Correctly Classified Instances        8004               98.5229 %
Incorrectly Classified Instances       120                1.4771 %
Kappa statistic                          0.9704
Mean absolute error                      0.0148
Root mean squared error                  0.1215
Relative absolute error                  2.958  %
Root relative squared error             24.323  %
Total Number of Instances             8124

=== Confusion Matrix ===

    a    b   <-- classified as
 4208    0 |    a = edible
  120 3796 |    b = poisonous
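The headline figures in this summary can be recomputed from the confusion matrix alone. The sketch below re-enters the four cell counts by hand and derives the accuracy, the kappa statistic and the share of poisonous mushrooms that the odor rule labels edible. The variable names are illustrative only.

```r
# Confusion matrix copied from the summary() output above
# (rows = actual class, columns = predicted class)
conf <- matrix(
  c(4208,    0,
     120, 3796),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("edible", "poisonous"),
                  predicted = c("edible", "poisonous"))
)

n <- sum(conf)
accuracy <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(conf) * colSums(conf)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

# Share of poisonous mushrooms that the rule labels edible
missed_poisonous <- conf["poisonous", "edible"] / sum(conf["poisonous", ])

print(round(c(accuracy = accuracy, kappa = kappa_stat,
              missed_poisonous = missed_poisonous), 4))
```

The accuracy and kappa values agree with the summary (98.5229 % and 0.9704), and about 3% of the poisonous mushrooms are labeled edible.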
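Below is a conditional inference tree relating odor and mushroom type. Note that the formula odor ~ type makes odor the response and type the input, so the tree splits on type; to predict type from odor, the formula would be type ~ odor.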
Input:
mushrooms_ctree <- ctree(odor ~ type, data = mushrooms)
mushrooms_ctree
Output:
Conditional inference tree with 2 terminal nodes

Response: odor
Input: type
Number of observations: 8124

1) type == {edible}; criterion = 1, statistic = 7658.784
  2)* weights = 4208
1) type == {poisonous}
  3)* weights = 3916
Input:
plot(mushrooms_ctree)
Output:
[Figure: plot of the conditional inference tree]
Improve Model Performance with JRip() Classification
OneR() and JRip() share nearly the same syntax. We will use JRip() to improve on the OneR() model. Once you have run the JRip() classification, print the classifier mushroom_JRip to review its rules.
Input:
mushroom_JRip <- JRip(type ~ ., data = mushrooms)
mushroom_JRip
Output:
JRIP rules:
===========

(odor = foul) => type=poisonous (2160.0/0.0)
(gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
(gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
(odor = creosote) => type=poisonous (192.0/0.0)
(spore_print_color = green) => type=poisonous (72.0/0.0)
(stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
(habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
(stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
 => type=edible (4208.0/0.0)

Number of Rules : 9
What’s this JRip() Output?
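As you can see, the classifier mushroom_JRip contains 9 rules, each of which is simple to read. The two numbers in parentheses after each rule give the instances the rule covers and how many of those it misclassifies, which is zero for every rule here. Below is what we can determine from the JRip() rules: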
- If odor is foul, mushroom type is poisonous
- If gill size is narrow and gill color is buff, mushroom type is poisonous
- If gill size is narrow and odor is pungent, mushroom type is poisonous
- If odor is creosote, mushroom is poisonous
- If spore print color is green, mushroom type is poisonous
- If stalk surface below ring is scaly and stalk surface above ring is silky, mushroom is poisonous
- If habitat is leaves and cap color is white, mushroom is poisonous
- If stalk color above ring is yellow, mushroom is poisonous
- Otherwise, the mushroom is edible (this is also shown on the above plot)
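Because the rules are checked in order and the last one is a catch-all, they translate directly into an if-else chain. The function below is a hand-written sketch of that translation, not something RWeka generates. The name classify_mushroom() and the example records are invented for illustration, and the attribute values in the base record are placeholders.

```r
# The printed JRip rules written out as an ordered if-else chain.
# `m` is a named list (or one-row data frame) describing one mushroom.
classify_mushroom <- function(m) {
  if (m$odor == "foul") return("poisonous")
  if (m$gill_size == "narrow" && m$gill_color == "buff") return("poisonous")
  if (m$gill_size == "narrow" && m$odor == "pungent") return("poisonous")
  if (m$odor == "creosote") return("poisonous")
  if (m$spore_print_color == "green") return("poisonous")
  if (m$stalk_surface_below_ring == "scaly" &&
      m$stalk_surface_above_ring == "silky") return("poisonous")
  if (m$habitat == "leaves" && m$cap_color == "white") return("poisonous")
  if (m$stalk_color_above_ring == "yellow") return("poisonous")
  "edible"  # default rule: nothing above matched
}

# Hypothetical record that matches none of the poisonous rules
base_record <- list(
  odor = "none", gill_size = "broad", gill_color = "brown",
  spore_print_color = "black",
  stalk_surface_below_ring = "smooth", stalk_surface_above_ring = "smooth",
  habitat = "woods", cap_color = "brown", stalk_color_above_ring = "white"
)

# Variants that each trigger exactly one rule
foul_record  <- modifyList(base_record, list(odor = "foul"))
green_record <- modifyList(base_record, list(spore_print_color = "green"))

print(classify_mushroom(base_record))
print(classify_mushroom(foul_record))
print(classify_mushroom(green_record))
```

The order matters only for readability here, since every rule except the default predicts the same class; the default must come last.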
Building on these 9 rules, below is a decision tree that predicts type from odor and gill size.
Input:
mushrooms_ctreeall <- ctree(type ~ odor + gill_size, data = mushrooms)
plot(mushrooms_ctreeall)
Output:
[Figure: plot of the decision tree for type by odor and gill size]
If you were deciding whether to eat a mushroom, visual observation would be the quickest way to label it as poisonous or non-poisonous. Among the first predictors to notice are the gill size and gill color: if the gill size is narrow and the gill color is buff, the mushroom is poisonous. If the gill size is narrow and the odor is pungent, the mushroom is also poisonous. Finally, the biggest predictor of a poisonous mushroom is the odor. As stated earlier, a foul, fishy or musty odor means the mushroom is most likely poisonous, whereas a mushroom that smells like almond or anise is classified as edible.