In this R tutorial, we will analyze and visualize the Halloween Candy Power Ranking dataset using ggplot(). The data comes from an online survey with over 260,000 votes and is available on Kaggle.com as The Ultimate Halloween Candy Power Ranking dataset. We will first inspect the candy data with the head(), str(), and summary() functions.
The dataset contains many variables for comparing types of candy, such as chocolate, fruity, caramel, and hard. We will also plot the sugar percentage against the price percentage and see how each candy ranks by win percentage.
Install and Load Packages
Below are the packages and libraries that we will need to load to complete this tutorial.
Input:
# Install the packages (only needed once)
install.packages("dplyr")
install.packages("ggplot2")
install.packages("ggalt")
install.packages("gridExtra")
install.packages("png")
# grid ships with base R, so it does not need to be installed

# Load the libraries
library(dplyr)
library(ggplot2)
library(ggalt)
library(gridExtra)
library(grid)
library(png)
Download and Load the Kaggle Halloween Candy Power Ranking Dataset
Since we will be using the Halloween Candy Power Ranking dataset, you will need to download it first. The dataset is already packaged and available for an easy download from the dataset page, or directly as the file halloween_candy_power_ranking.csv. Save the file in your R working directory so that read.csv() can find it.
Input:
halloween_candy <- read.csv("halloween_candy_power_ranking.csv", stringsAsFactors = FALSE)
View the Kaggle Halloween Candy Power Ranking Dataset
Now that our libraries are loaded and the data is read in, let’s take a look at the Halloween Candy Power Ranking dataset. We can view the first rows, the structure, and a statistical summary of the data with the head(), str(), and summary() functions.
head() function
Input:
head(halloween_candy)
Output:
  competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus
1      100 Grand         1      0       1              0      0                1    0   1        0
2   3 Musketeers         1      0       0              0      1                0    0   1        0
3       One dime         0      0       0              0      0                0    0   0        0
4    One quarter         0      0       0              0      0                0    0   0        0
5      Air Heads         0      1       0              0      0                0    0   0        0
6     Almond Joy         1      0       0              1      0                0    0   1        0
  sugarpercent pricepercent winpercent
1        0.732        0.860   66.97173
2        0.604        0.511   67.60294
3        0.011        0.116   32.26109
4        0.011        0.511   46.11650
5        0.906        0.511   52.34146
6        0.465        0.767   50.34755
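head() also accepts an optional second argument, n, for the number of rows to return (the default is 6), and tail() does the same from the end of the data. The sketch below shows both on a small made-up data frame; toy_candy and its values are hypothetical stand-ins, not rows from the Kaggle file.

```r
# Hypothetical toy data, not taken from the real dataset
toy_candy <- data.frame(
  competitorname = paste("Candy", 1:10),
  winpercent     = seq(30, 75, by = 5),
  stringsAsFactors = FALSE
)

first_three <- head(toy_candy, n = 3)  # first 3 rows
last_two    <- tail(toy_candy, n = 2)  # last 2 rows

print(first_three)
print(last_two)
```

The same calls work on halloween_candy once the CSV has been loaded.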
str() function
Another way to inspect the Halloween Candy Power Ranking data is the str() function, which compactly displays the internal structure of an R object. It is an alternative to the summary() function. With str(), each variable is displayed on a single line.
Input:
str(halloween_candy)
Output:
'data.frame':   85 obs. of  13 variables:
 $ competitorname  : chr  "100 Grand" "3 Musketeers" "One dime" "One quarter" ...
 $ chocolate       : int  1 1 0 0 0 1 1 0 0 0 ...
 $ fruity          : int  0 0 0 0 1 0 0 0 0 1 ...
 $ caramel         : int  1 0 0 0 0 0 1 0 0 1 ...
 $ peanutyalmondy  : int  0 0 0 0 0 1 1 1 0 0 ...
 $ nougat          : int  0 1 0 0 0 0 1 0 0 0 ...
 $ crispedricewafer: int  1 0 0 0 0 0 0 0 0 0 ...
 $ hard            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bar             : int  1 1 0 0 0 1 1 0 0 0 ...
 $ pluribus        : int  0 0 0 0 0 0 0 1 1 0 ...
 $ sugarpercent    : num  0.732 0.604 0.011 0.011 0.906 ...
 $ pricepercent    : num  0.86 0.511 0.116 0.511 0.511 ...
 $ winpercent      : num  67 67.6 32.3 46.1 52.3 ...
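The str() output shows that the feature columns are integers, and the values on display are all 0 or 1. If you want to confirm that for every row rather than just the first few, a check along these lines could be used. This is a minimal sketch on a hypothetical toy_candy data frame with made-up values; indicator_cols is an illustrative name.

```r
# Hypothetical toy data with two indicator columns
toy_candy <- data.frame(
  competitorname = c("Candy A", "Candy B", "Candy C"),
  chocolate      = c(1L, 0L, 1L),
  fruity         = c(0L, 1L, 0L),
  winpercent     = c(66.9, 52.3, 50.3),
  stringsAsFactors = FALSE
)

# Class of every column, similar to what str() reports
column_classes <- sapply(toy_candy, class)
print(column_classes)

# Check that each indicator column holds only 0 or 1
indicator_cols <- c("chocolate", "fruity")
is_binary <- sapply(toy_candy[indicator_cols],
                    function(col) all(col %in% c(0, 1)))
print(is_binary)
```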
summary() function
Input:
summary(halloween_candy)
Output:
 competitorname       chocolate          fruity          caramel       peanutyalmondy
 Length:85          Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000
                    Mean   :0.4353   Mean   :0.4471   Mean   :0.1647   Mean   :0.1647
                    3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000
                    Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
     nougat        crispedricewafer       hard             bar            pluribus
 Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.00000   Median :0.00000   Median :0.0000   Median :0.0000   Median :1.0000
 Mean   :0.08235   Mean   :0.08235   Mean   :0.1765   Mean   :0.2471   Mean   :0.5176
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000
 Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
  sugarpercent     pricepercent      winpercent
 Min.   :0.0110   Min.   :0.0110   Min.   :22.45
 1st Qu.:0.2200   1st Qu.:0.2550   1st Qu.:39.14
 Median :0.4650   Median :0.4650   Median :47.83
 Mean   :0.4786   Mean   :0.4689   Mean   :50.32
 3rd Qu.:0.7320   3rd Qu.:0.6510   3rd Qu.:59.86
 Max.   :0.9880   Max.   :0.9760   Max.   :84.18
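Because the feature columns are coded 0/1, the Mean that summary() reports for them is simply the share of candies with that feature. A mean of 0.4353 for chocolate over 85 candies, for instance, corresponds to 37 chocolate candies. The sketch below illustrates the idea using the first ten chocolate values shown in the str() output; the variable names are illustrative.

```r
# First ten chocolate values as displayed by str() above
chocolate <- c(1, 1, 0, 0, 0, 1, 1, 0, 0, 0)

share_chocolate <- mean(chocolate)  # proportion of 1s
count_chocolate <- sum(chocolate)   # number of 1s

print(share_chocolate)
print(count_chocolate)

# Relating the reported mean back to a count: 37 of 85 candies
print(round(37 / 85, 4))
```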
Sugar Percentage and Price Percentage Scatterplot
Below we will create a scatterplot of the sugar percentage and price percentage to see what effect the amount of sugar has on the cost of candy. Each point in the plot is positioned by its value on the x-axis (sugar percentage) and on the y-axis (price percentage).
Input:
ggplot(data = halloween_candy, aes(x = sugarpercent, y = pricepercent)) +
  geom_point()
Output:
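To put a number on the relationship a scatterplot shows, one option is the correlation coefficient from cor(). This is a minimal sketch on made-up vectors (the values are hypothetical, not from the dataset); on the real data you would pass halloween_candy$sugarpercent and halloween_candy$pricepercent instead.

```r
# Hypothetical values standing in for the two dataset columns
sugarpercent <- c(0.10, 0.25, 0.40, 0.60, 0.85)
pricepercent <- c(0.15, 0.20, 0.50, 0.45, 0.80)

# Pearson correlation: values near 1 mean a strong positive linear relationship
sugar_price_cor <- cor(sugarpercent, pricepercent)
print(sugar_price_cor)
```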
Sugar Percentage and Price Percentage Scatterplot with Encircling
In some cases, I like to encircle groups of points in a scatterplot to draw attention to them. We will still be using ggplot, this time adding geom_encircle(). This function is part of the ggalt package, so please make sure it’s installed. geom_encircle() automatically encloses the selected points in a polygon.
Note: If you are working with large numbers and want to disable scientific notation, make sure to run: options(scipen = 999)
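The effect of the scipen option can be seen with format(), which follows the same rules R uses when printing numbers. A small sketch (the variable names are illustrative):

```r
# Start from R's default setting and remember the previous options
old_options <- options(scipen = 0)
default_format <- format(100000)

# A large scipen value makes R prefer fixed notation
options(scipen = 999)
fixed_format <- format(100000)

# Restore whatever was set before
options(old_options)

print(default_format)  # "1e+05"
print(fixed_format)    # "100000"
```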
Input:
options(scipen = 999)

# Select the low-priced candies to encircle
candy_select <- halloween_candy[halloween_candy$sugarpercent > 0.01 &
                                halloween_candy$sugarpercent <= 1 &
                                halloween_candy$pricepercent > 0.01 &
                                halloween_candy$pricepercent <= 0.1, ]

ggplot(halloween_candy, aes(x = sugarpercent, y = pricepercent)) +
  geom_point(aes(col = sugarpercent, size = pricepercent)) +
  geom_smooth(method = "loess") +
  xlim(c(0, 1)) +
  ylim(c(0, 1)) +
  geom_encircle(aes(x = sugarpercent, y = pricepercent),
                data = candy_select,
                color = "purple",
                size = 3,
                expand = 0.05,
                position = "identity") +
  # Axis labels match the aesthetics: x is sugarpercent, y is pricepercent
  labs(title = "Halloween Candy Power Ranking Scatterplot with Encircling",
       subtitle = "Sugar Percentage versus Price Percentage",
       x = "Sugar Percentage",
       y = "Price Percentage") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))
Output:
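The rows that get encircled are chosen with an ordinary logical filter on sugarpercent and pricepercent. The sketch below applies the same thresholds to a hypothetical four-row data frame using subset(), which is a base R alternative to the bracket indexing used above; toy_candy and its values are made up.

```r
# Hypothetical toy data, not rows from the real dataset
toy_candy <- data.frame(
  competitorname = c("Candy A", "Candy B", "Candy C", "Candy D"),
  sugarpercent   = c(0.50, 0.90, 0.005, 0.30),
  pricepercent   = c(0.05, 0.60, 0.08, 0.10),
  stringsAsFactors = FALSE
)

# Keep rows whose sugar and price values fall inside the chosen ranges
toy_select <- subset(toy_candy,
                     sugarpercent > 0.01 & sugarpercent <= 1 &
                     pricepercent > 0.01 & pricepercent <= 0.1)

print(toy_select)
```

Candy B is dropped for its price and Candy C for its sugar value, leaving Candy A and Candy D.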
Sugar Percentage and Price Percentage Scatterplot with Text
From a visual perspective, I really like the scatterplot for this tutorial. However, I believe adding the Halloween candy names to the plot will make the analysis even more useful. We can add geom_text() and set a few arguments to create the plot.
Input:
ggplot(data = halloween_candy,
       aes(x = sugarpercent, y = pricepercent, label = competitorname)) +
  geom_point(color = "orange") +
  geom_smooth(method = "lm") +
  geom_text(check_overlap = TRUE,
            vjust = "bottom",
            nudge_y = 0.01,
            angle = 35,
            size = 2,
            color = "purple") +
  # Axis labels match the aesthetics: x is sugarpercent, y is pricepercent
  labs(title = "Halloween Candy Power Ranking Scatterplot with Text",
       x = "Sugar Percentage",
       y = "Price Percentage") +
  theme(plot.title = element_text(hjust = 0.5))
Output:
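The straight line drawn by geom_smooth(method = "lm") is a linear regression of the y variable on the x variable. To see the intercept and slope behind such a line, the same model can be fitted directly with lm(). This is a sketch on made-up values that lie exactly on a line, so the coefficients are easy to check; on the real data you would use lm(pricepercent ~ sugarpercent, data = halloween_candy).

```r
# Hypothetical values constructed to lie on the line y = 0.2 + 0.5 * x
sugarpercent <- c(0.1, 0.3, 0.5, 0.7, 0.9)
pricepercent <- 0.2 + 0.5 * sugarpercent

# Fit price as a linear function of sugar
fit <- lm(pricepercent ~ sugarpercent)

# Intercept and slope of the fitted line
print(coef(fit))
```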
Halloween Candy Features
The variables in the Halloween Candy Power Ranking dataset include various attributes that describe each candy in the rankings. Take chocolate, for example: it will be either 1 (TRUE) or 0 (FALSE). A piece of candy can have more than one attribute. For example, the 100 Grand candy bar is 1 (TRUE) for chocolate and 1 (TRUE) for caramel.
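Since the attributes are coded 0/1, adding them up across a row counts how many attributes a candy has. The sketch below uses three rows and three feature columns copied from the head() output earlier; feature_cols and multi_feature are illustrative names.

```r
# Three rows and three feature columns taken from the head() output
toy_candy <- data.frame(
  competitorname = c("100 Grand", "Air Heads", "One dime"),
  chocolate      = c(1, 0, 0),
  fruity         = c(0, 1, 0),
  caramel        = c(1, 0, 0),
  stringsAsFactors = FALSE
)

feature_cols  <- c("chocolate", "fruity", "caramel")
feature_count <- rowSums(toy_candy[, feature_cols])

# Candies that have more than one of these attributes
multi_feature <- toy_candy$competitorname[feature_count > 1]
print(multi_feature)
```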
Halloween Candy Chocolate Bar Chart
Let’s start off by creating a simple bar chart of chocolate candy.
Input:
# candy_features is only created in the next section, so plot from halloween_candy here
ggplot(halloween_candy, aes(x = factor(chocolate))) +
  geom_bar()
Output:
Halloween Candy Chocolate and Caramel Bar Chart
Previously I brought up the fact that a candy can have more than one feature, using the example of chocolate and caramel. Let’s create a bar chart with chocolate and caramel. The first step is to create a variable holding columns 2:10 of the data. This leaves out competitorname, sugarpercent, pricepercent, and winpercent. Secondly, we convert all of the features to logical values with lapply(). The lapply() function returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. Assigning the result to candy_features[] keeps the object a data frame.
Input:
candy_features <- halloween_candy %>%
  select(2:10)
candy_features[] <- lapply(candy_features, as.logical)
Now let’s run ggplot on the new variable, candy_features, with caramel as the fill.
Input:
ggplot(candy_features, aes(x = chocolate, fill = caramel)) +
  geom_bar()
Output:
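The heights of the stacked bar segments are the counts of each chocolate and caramel combination, which can also be read as numbers with table(). A minimal sketch on a hypothetical five-row toy_features data frame (made-up values, standing in for candy_features):

```r
# Hypothetical logical features, standing in for candy_features
toy_features <- data.frame(
  chocolate = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  caramel   = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Cross-tabulate the two features: rows are chocolate, columns are caramel
counts <- table(chocolate = toy_features$chocolate,
                caramel   = toy_features$caramel)
print(counts)
```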
Halloween Chocolate Candy Features Grid Arrange
First we must create a plot variable for chocolate paired with each other feature, as shown below:
- chocolate and bar
- chocolate and caramel
- chocolate and crispedricewafer
- chocolate and fruity
- chocolate and hard
- chocolate and nougat
- chocolate and peanutyalmondy
- chocolate and pluribus
Input:
chocolate_bar <- ggplot(candy_features, aes(x = chocolate, fill = bar)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_caramel <- ggplot(candy_features, aes(x = chocolate, fill = caramel)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_crispedricewafer <- ggplot(candy_features, aes(x = chocolate, fill = crispedricewafer)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_fruity <- ggplot(candy_features, aes(x = chocolate, fill = fruity)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_hard <- ggplot(candy_features, aes(x = chocolate, fill = hard)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_nougat <- ggplot(candy_features, aes(x = chocolate, fill = nougat)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_peanutyalmondy <- ggplot(candy_features, aes(x = chocolate, fill = peanutyalmondy)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
chocolate_pluribus <- ggplot(candy_features, aes(x = chocolate, fill = pluribus)) +
  geom_bar() +
  scale_fill_manual(values = c('navy', 'orangered2'))
The grid.arrange() function from the gridExtra package arranges multiple plots on one page, so let’s call grid.arrange() with the plot variables we made.
Input:
# grid.arrange() draws the grid right away and also returns it as an object
chocolate_features_grid <- grid.arrange(chocolate_bar, chocolate_caramel,
                                        chocolate_crispedricewafer, chocolate_fruity,
                                        chocolate_hard, chocolate_nougat,
                                        chocolate_peanutyalmondy, chocolate_pluribus,
                                        top = "Halloween Chocolate Candy Features Grid Arrange",
                                        ncol = 2,
                                        nrow = 4)

# Printing the object only shows its layout table;
# use grid.draw(chocolate_features_grid) to draw the plots again
Output:
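The eight plot definitions differ only in the fill variable, so they could also be generated in a loop. The sketch below is one possible way to do that, not the approach used in this tutorial: it builds the plots with lapply() and ggplot2's .data pronoun on a hypothetical toy_features data frame with made-up values. The names toy_features, features, and chocolate_plots are illustrative.

```r
library(ggplot2)

# Hypothetical logical features, standing in for candy_features
toy_features <- data.frame(
  chocolate = c(TRUE, TRUE, FALSE, FALSE),
  bar       = c(TRUE, FALSE, FALSE, FALSE),
  caramel   = c(TRUE, FALSE, FALSE, TRUE)
)

# Every feature except chocolate gets its own plot
features <- setdiff(names(toy_features), "chocolate")

chocolate_plots <- lapply(features, function(feature) {
  ggplot(toy_features, aes(x = chocolate, fill = .data[[feature]])) +
    geom_bar() +
    scale_fill_manual(values = c("navy", "orangered2")) +
    labs(fill = feature)  # label the legend with the feature name
})
names(chocolate_plots) <- features
```

A list built this way can be handed to gridExtra through the grobs argument, for example grid.arrange(grobs = chocolate_plots, ncol = 2).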
Halloween Candy Power Ranking with Lollipop Chart
The next section will focus on the win percentage to see which candy is the favorite. The first step is to order the data by the winpercent column from the highest percentage to the lowest.
Input:
# Sort the rows by winpercent, highest first
halloween_candy_order <- halloween_candy[order(halloween_candy$winpercent, decreasing = TRUE), ]

head(halloween_candy_order)
Output:
             competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar
53 Reeses Peanut Butter cup         1      0       0              1      0                0    0   0
52        Reeses Miniatures         1      0       0              1      0                0    0   0
80                     Twix         1      0       1              0      0                1    0   1
29                  Kit Kat         1      0       0              0      0                1    0   1
65                 Snickers         1      0       1              1      1                0    0   1
54            Reeses pieces         1      0       0              1      0                0    0   0
   pluribus sugarpercent pricepercent winpercent
53        0        0.720        0.651   84.18029
52        0        0.034        0.279   81.86626
80        0        0.546        0.906   81.64291
29        0        0.313        0.511   76.76860
65        0        0.546        0.651   76.67378
54        1        0.406        0.651   73.43499
As we can see from the above, the Reese’s Peanut Butter Cup has the highest win percentage, at 84.18029. Now let’s create a variable holding the top 25 Halloween candies.
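The row numbers on the left (53, 52, 80, ...) are the original positions of the rows, which is a quick sign that the data was reordered rather than changed. The sketch below shows how order() produces that effect on a hypothetical three-row data frame with made-up values.

```r
# Hypothetical toy data, not rows from the real dataset
toy_candy <- data.frame(
  competitorname = c("Candy A", "Candy B", "Candy C"),
  winpercent     = c(46.1, 84.2, 67.0),
  stringsAsFactors = FALSE
)

# order() returns the row positions that sort winpercent from high to low
row_positions <- order(toy_candy$winpercent, decreasing = TRUE)
toy_order     <- toy_candy[row_positions, ]

print(row_positions)
print(toy_order)
```

With dplyr loaded, arrange(toy_candy, desc(winpercent)) gives the same row order.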
Input:
halloween_candy_top_25 <- head(halloween_candy_order, 25)

head(halloween_candy_top_25)
Output:
             competitorname chocolate fruity caramel peanutyalmondy nougat
53 Reeses Peanut Butter cup         1      0       0              1      0
52        Reeses Miniatures         1      0       0              1      0
80                     Twix         1      0       1              0      0
29                  Kit Kat         1      0       0              0      0
65                 Snickers         1      0       1              1      1
54            Reeses pieces         1      0       0              1      0
   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
53                0    0   0        0        0.720        0.651   84.18029
52                0    0   0        0        0.034        0.279   81.86626
80                1    0   1        0        0.546        0.906   81.64291
29                1    0   1        0        0.313        0.511   76.76860
65                0    0   1        0        0.546        0.651   76.67378
54                0    0   0        1        0.406        0.651   73.43499
Note: Read the tutorial How to add a Background Image in ggplot to learn how to add a picture to the background of ggplot.
Input:
reeses_background <- png::readPNG("halloween_candy_power_ranking_reeses_peanut_butter_cup.png")

# reorder() by -winpercent keeps each name paired with its own value
# and sorts the x-axis from the highest win percentage to the lowest
ggplot(halloween_candy_top_25,
       aes(x = reorder(competitorname, -winpercent), y = winpercent)) +
  annotation_custom(rasterGrob(reeses_background,
                               width = unit(1, "npc"),
                               height = unit(1, "npc")),
                    -Inf, Inf, -Inf, Inf) +
  geom_point(size = 4) +
  geom_segment(aes(xend = reorder(competitorname, -winpercent),
                   y = 0,
                   yend = winpercent)) +
  labs(title = "Halloween Candy Power Ranking with Lollipop Chart",
       subtitle = "Top 25 Halloween Candy Power Rankings",
       x = "Halloween Candy Names",
       y = "Win Percentage") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))
Output:
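The order of the candies along the x-axis comes from reorder(), which turns the names into a factor whose levels are sorted by a second variable. The sketch below shows the resulting level order on three made-up candies; negating the values reverses the order so the highest value comes first.

```r
# Hypothetical names and win percentages
competitorname <- c("Candy A", "Candy B", "Candy C")
winpercent     <- c(46.1, 84.2, 67.0)

# Levels sorted by increasing winpercent
ascending <- levels(reorder(competitorname, winpercent))

# Levels sorted by decreasing winpercent
descending <- levels(reorder(competitorname, -winpercent))

print(ascending)   # "Candy A" "Candy C" "Candy B"
print(descending)  # "Candy B" "Candy C" "Candy A"
```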
Hope you enjoyed this tutorial on Halloween Candy Power Rankings!