In this R tutorial, we analyze Powerball data by counting how often each number has been drawn since 2010. We look at the most frequently drawn winning numbers (white balls) and the most frequently drawn Powerball (red ball).
Install and Load Packages
Below are the packages and libraries that must be installed to complete this R tutorial.
Input:
install.packages("plyr")
install.packages("dplyr")
install.packages("ggplot2")
library(plyr)
library(dplyr)
library(ggplot2)
Download and Load Powerball Datasets
This data is already packaged and is available for download from Lottery Powerball Winning Numbers: Beginning 2010. You can also download the Powerball dataset from our dataset page or directly from the links below:
- Powerball Numbers Since 2010 (non-formatted) – powerball_lottery_numbers.csv
- Powerball Numbers Since 2010 (formatted) – powerball_lottery_numbers_reformatted.csv
Looking at the downloaded dataset, we can see that all 5 winning numbers (white balls) are stored in a single cell. Below are the steps that we must complete in Microsoft Excel to split them out.
- Remove header row
- Create five columns between the second and the last column (between Winning Numbers and Multiplier)
- Add header Row
- Label Columns as W1, W2, W3, W4, W5, and Powerball
Steps 1 and 2
Steps 3 and 4
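The same reshaping can also be sketched in R instead of Excel. This is a minimal example under assumptions, not the method used in this tutorial: it assumes the raw file keeps all six numbers in one space-separated column, given the hypothetical name Winning.Numbers, and it uses a two-row toy data frame rather than the real file.

```r
# Toy stand-in for the non-formatted file; the column names are assumptions.
raw <- data.frame(
  Date = c("2/3/2010", "2/6/2010"),
  Winning.Numbers = c("17 22 36 37 52 24", "14 22 52 54 59 04"),
  Multiplier = c(2, 3),
  stringsAsFactors = FALSE
)

# Split each string on whitespace and stack the pieces into a 6-column matrix.
parts <- do.call(rbind, strsplit(trimws(raw$Winning.Numbers), "\\s+"))

# Convert the character matrix to integers while keeping its shape.
numbers <- as.data.frame(matrix(as.integer(parts), nrow = nrow(parts)))
names(numbers) <- c("W1", "W2", "W3", "W4", "W5", "Powerball")

# Put Date first and Multiplier last, matching the reformatted layout.
reformatted <- cbind(Date = raw$Date, numbers, Multiplier = raw$Multiplier)
reformatted
```

If the real column uses a different name or separator, the strsplit() pattern and the column reference would need to change accordingly.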
Now that the data is organized, we can pull the data in.
Input:
powerball_numbers_2010 <- read.csv("powerball_lottery_numbers_reformatted.csv", stringsAsFactors = FALSE)
head(powerball_numbers_2010)
Output:
       Date W1 W2 W3 W4 W5 Powerball Multiplier
1  2/3/2010 17 22 36 37 52        24          2
2  2/6/2010 14 22 52 54 59         4          3
3 2/10/2010  5  8 29 37 38        34          5
4 2/13/2010 10 14 30 40 51         1          4
5 2/17/2010  7  8 19 26 36        15          3
6 2/20/2010 13 27 37 41 54        32          2
In a previous R tutorial, Odds of Winning Powerball Grand Prize with R, we worked out the odds of winning the Powerball. Matching all 5 white balls and the 1 red ball is a 1 in 292,201,338 chance, and there are only 839 results in this dataset, so the chance of any line having a count greater than 1 is slim to none.
Which numbers occur most often, and has any set of numbers ever won more than once?
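The 1 in 292,201,338 figure can be reproduced with choose(), and a rough birthday-problem estimate shows why a repeated line is so unlikely. The sketch below assumes every one of the 839 drawings used the 69 white ball and 26 red ball format; that is a simplification, so the second number is only an approximation.

```r
# Number of possible lines: choose 5 of 69 white balls, times 26 red balls.
white_combos <- choose(69, 5)
total_combos <- white_combos * 26
total_combos  # 292201338

# Rough expected number of repeated lines among n independent drawings:
# the number of drawing pairs divided by the number of possible lines.
# Assumption: all drawings share the same 69/26 format.
n_draws <- 839
expected_repeats <- choose(n_draws, 2) / total_combos
expected_repeats  # roughly 0.0012
```

An expected value this far below 1 is consistent with never seeing the same line twice in the dataset.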
Odds of Duplicate Powerball Lines
ddply() function
The plyr package offers the ddply() function, which splits a data frame into subsets by one or more grouping variables, applies a function to each subset, and combines the results into a new data frame.
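To make the split, apply, and combine idea concrete, here is a small base R sketch of the same three stages. It deliberately uses split(), lapply(), and rbind() rather than ddply() itself, and the data frame and its column names (g, x) are hypothetical.

```r
# Toy data frame (hypothetical values) with a grouping column g.
df <- data.frame(g = c("a", "b", "a", "b", "a"), x = c(1, 2, 3, 4, 5))

# Split: one data frame per group.
pieces <- split(df, df$g)

# Apply: summarise each piece into a one-row data frame.
summaries <- lapply(pieces, function(piece) {
  data.frame(g = piece$g[1], rows = nrow(piece), total = sum(piece$x))
})

# Combine: stack the per-group results back into a single data frame.
result <- do.call(rbind, summaries)
result
```

ddply() wraps these stages into one call, which is why it is convenient for grouped counts like the ones in this tutorial.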
Input:
powerball_numbers_2010 <- read.csv("powerball_lottery_numbers_reformatted.csv", stringsAsFactors = FALSE)
head(powerball_numbers_2010, n = 10)
Output:
        Date W1 W2 W3 W4 W5 Powerball Multiplier
1   2/3/2010 17 22 36 37 52        24          2
2   2/6/2010 14 22 52 54 59         4          3
3  2/10/2010  5  8 29 37 38        34          5
4  2/13/2010 10 14 30 40 51         1          4
5  2/17/2010  7  8 19 26 36        15          3
6  2/20/2010 13 27 37 41 54        32          2
7  2/24/2010  4 17 35 50 57        12          2
8  2/27/2010 18 47 51 53 58        30          2
9   3/3/2010  7  9 14 45 49        23          4
10  3/6/2010 10 29 33 41 59        15          2
This looks as expected but we can clean up the data frame by removing Date and Multiplier.
Input:
num_l_c <- data.frame(powerball_numbers_2010$W1,
                      powerball_numbers_2010$W2,
                      powerball_numbers_2010$W3,
                      powerball_numbers_2010$W4,
                      powerball_numbers_2010$W5,
                      powerball_numbers_2010$Powerball)
head(num_l_c, n = 10)
Output:
   powerball_numbers_2010.W1 powerball_numbers_2010.W2 powerball_numbers_2010.W3
1                         17                        22                        36
2                         14                        22                        52
3                          5                         8                        29
4                         10                        14                        30
5                          7                         8                        19
6                         13                        27                        37
7                          4                        17                        35
8                         18                        47                        51
9                          7                         9                        14
10                        10                        29                        33
   powerball_numbers_2010.W4 powerball_numbers_2010.W5 powerball_numbers_2010.Powerball
1                         37                        52                               24
2                         54                        59                                4
3                         37                        38                               34
4                         40                        51                                1
5                         26                        36                               15
6                         41                        54                               32
7                         50                        57                               12
8                         53                        58                               30
9                         45                        49                               23
10                        41                        59                               15
The code below counts how many times each line (the 5 white balls plus the Powerball) occurs.
Input:
# plyr::count is named explicitly because dplyr, loaded after plyr, masks count()
num_l_c <- ddply(num_l_c,
                 .(powerball_numbers_2010.W1, powerball_numbers_2010.W2,
                   powerball_numbers_2010.W3, powerball_numbers_2010.W4,
                   powerball_numbers_2010.W5, powerball_numbers_2010.Powerball),
                 plyr::count)
head(num_l_c, n = 10)
Output:
   powerball_numbers_2010.W1 powerball_numbers_2010.W2 powerball_numbers_2010.W3
1                         17                        22                        36
2                         14                        22                        52
3                          5                         8                        29
4                         10                        14                        30
5                          7                         8                        19
6                         13                        27                        37
7                          4                        17                        35
8                         18                        47                        51
9                          7                         9                        14
10                        10                        29                        33
   powerball_numbers_2010.W4 powerball_numbers_2010.W5 powerball_numbers_2010.Powerball freq
1                         37                        52                               24    1
2                         54                        59                                4    1
3                         37                        38                               34    1
4                         40                        51                                1    1
5                         26                        36                               15    1
6                         41                        54                               32    1
7                         50                        57                               12    1
8                         53                        58                               30    1
9                         45                        49                               23    1
10                        41                        59                               15    1
The column names need to be cleaned up for easier reading.
Input:
names(num_l_c) <- c("W1", "W2", "W3", "W4", "W5", "Powerball", "Frequency")
head(num_l_c, n = 10)
Output:
   W1 W2 W3 W4 W5 Powerball Frequency
1  17 22 36 37 52        24         1
2  14 22 52 54 59         4         1
3   5  8 29 37 38        34         1
4  10 14 30 40 51         1         1
5   7  8 19 26 36        15         1
6  13 27 37 41 54        32         1
7   4 17 35 50 57        12         1
8  18 47 51 53 58        30         1
9   7  9 14 45 49        23         1
10 10 29 33 41 59        15         1
count() function
Now we can use plyr's count() function on its own to return the same results.
Input:
# plyr::count is named explicitly because dplyr, loaded after plyr, masks count()
line_count <- plyr::count(powerball_numbers_2010,
                          vars = c("W1", "W2", "W3", "W4", "W5", "Powerball"))
names(line_count) <- c("W1", "W2", "W3", "W4", "W5", "Powerball", "Frequency")
head(line_count, n = 10)
Output:
   W1 W2 W3 W4 W5 Powerball Frequency
1  17 22 36 37 52        24         1
2  14 22 52 54 59         4         1
3   5  8 29 37 38        34         1
4  10 14 30 40 51         1         1
5   7  8 19 26 36        15         1
6  13 27 37 41 54        32         1
7   4 17 35 50 57        12         1
8  18 47 51 53 58        30         1
9   7  9 14 45 49        23         1
10 10 29 33 41 59        15         1
As you can see, both functions produce the same output in the end. Which one to use is personal preference, but I prefer whichever saves time, and count() gets there in a single call.
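To answer the question of whether any line has won more than once, the frequency column can be checked for values above 1, or the rows can be tested directly. The sketch below uses base R's duplicated() on a toy data frame in which the third row deliberately repeats the first; the values are illustrative, not real drawings.

```r
# Toy data: the third row deliberately repeats the first (not real drawings).
lines <- data.frame(
  W1 = c(17, 14, 17), W2 = c(22, 22, 22), W3 = c(36, 52, 36),
  W4 = c(37, 54, 37), W5 = c(52, 59, 52), Powerball = c(24, 4, 24)
)

# duplicated() flags rows that repeat an earlier row across all columns.
repeat_rows <- lines[duplicated(lines), ]
nrow(repeat_rows)       # number of repeated lines in the toy data
any(duplicated(lines))  # TRUE when at least one line occurs more than once
```

Run against the six number columns of the real data frame, the same check would report whether any line occurs more than once.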
Top Powerball Number Occurrences
For this task, we will need to merge columns W1, W2, W3, W4, and W5 into a single column to count the top occurring numbers.
Input:
powerball_numbers_2010 <- read.csv("powerball_lottery_numbers_reformatted.csv", stringsAsFactors = FALSE)
Since we are merging the 5 columns into 1, the new column will hold roughly 4,200 values (5 per drawing).
Let’s create a variable that holds all of them.
Input:
all_white_balls <- c(powerball_numbers_2010$W1,
                     powerball_numbers_2010$W2,
                     powerball_numbers_2010$W3,
                     powerball_numbers_2010$W4,
                     powerball_numbers_2010$W5)
head(all_white_balls, n = 10)
Output:
[1] 17 14 5 10 7 13 4 18 7 10
The next step will be to use this variable and create a data.frame.
Input:
all_white_balls <- data.frame(all_white_balls)
names(all_white_balls) <- c("Numbers")
head(all_white_balls, n = 10)
Output:
   Numbers
1       17
2       14
3        5
4       10
5        7
6       13
7        4
8       18
9        7
10      10
Now let’s pull the 10 most frequently occurring numbers using the dplyr package loaded earlier.
Input:
top_10 <- all_white_balls %>%
  group_by(Numbers) %>%
  tally() %>%
  ungroup() %>%
  top_n(10)
top_10
Output:
# A tibble: 10 x 2
   Numbers     n
     <int> <int>
 1      10    77
 2      11    76
 3      12    76
 4      23    83
 5      28    84
 6      32    81
 7      39    78
 8      41    81
 9      52    78
10      54    78
The next steps rename the n column to Count, convert both columns from integer to numeric, and sort the rows from the most frequent number to the least frequent.
Input:
names(top_10) <- c("Numbers", "Count")
top_10$Numbers <- as.numeric(top_10$Numbers)
top_10$Count <- as.numeric(top_10$Count)
# Sort by Count (descending); ties are broken by Numbers (ascending)
top_10 <- top_10[order(top_10$Count, -rank(top_10$Numbers), decreasing = TRUE), ]
top_10
Output:
# A tibble: 10 x 2
   Numbers Count
     <dbl> <dbl>
 1    28.0  84.0
 2    23.0  83.0
 3    32.0  81.0
 4    41.0  81.0
 5    39.0  78.0
 6    52.0  78.0
 7    54.0  78.0
 8    10.0  77.0
 9    11.0  76.0
10    12.0  76.0
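The counting and sorting steps can be cross-checked with base R alone. The sketch below swaps in table() and sort() for the dplyr pipeline, and runs on a short made-up vector rather than the real all_white_balls column.

```r
# Toy vector standing in for all_white_balls$Numbers (not real drawings).
numbers <- c(28, 23, 28, 41, 23, 28, 10)

# table() counts each distinct value; sort() orders the counts high to low.
counts <- sort(table(numbers), decreasing = TRUE)

# Convert to a data frame with the column names used in this tutorial.
top <- data.frame(
  Numbers = as.numeric(names(counts)),
  Count = as.numeric(counts)
)
head(top, n = 3)
```

If both approaches are run on the same column, the counts per number should agree, which makes this a quick sanity check on the pipeline.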
Powerball Top 10 Numbers by Data Visualization
Now that we have cleaned up the data, we can start to create graphs with the data. We will create two graphs with the top 10 Powerball numbers occurrences.
Graph type for data visualization
- Plain graphic with a point for each number – ggplot() + geom_point()
- Colorful graphic with a bar for each number – ggplot() + geom_bar()
ggplot() + geom_point()
Input:
ggplot(top_10, aes(Numbers, Count)) +
  geom_point() +
  geom_text(aes(label = Numbers), hjust = 0, vjust = 0) +
  xlab("Powerball Numbers") +
  ylab("Total Number Count") +
  ggtitle("Top 10 Powerball Numbers") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
Output:
ggplot() + geom_bar()
Input:
ggplot(data = top_10, aes(x = Numbers, y = Count, fill = Numbers)) +
  geom_bar(stat = "identity", width = .5) +
  xlab("Powerball Numbers") +
  ylab("Total Number Occurrences") +
  ggtitle("Top 10 Powerball Numbers") +
  scale_fill_gradient(low = "blue", high = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
Output:
All Powerball Numbers by Data Visualization
Now let’s run the analysis on all of the white Powerball numbers and visualize the result with ggplot(). We repeat a few of the steps above and sort from the most frequent number to the least frequent. Because Powerball has 69 white balls, we pass 69 to top_n() so that every number is kept.
Input:
top_69 <- all_white_balls %>%
  group_by(Numbers) %>%
  tally() %>%
  ungroup() %>%
  top_n(69)
top_69
Output:
# A tibble: 69 x 2
   Numbers     n
     <int> <int>
 1       1    66
 2       2    60
 3       3    70
 4       4    60
 5       5    67
 6       6    54
 7       7    71
 8       8    68
 9       9    70
10      10    77
# ... with 59 more rows
As before, the next steps rename the n column to Count, convert both columns from integer to numeric, and sort the rows from the most frequent number to the least frequent.
Input:
names(top_69) <- c("Numbers", "Count")
top_69$Numbers <- as.numeric(top_69$Numbers)
top_69$Count <- as.numeric(top_69$Count)
top_69 <- top_69[order(top_69$Count, -rank(top_69$Numbers), decreasing = TRUE), ]
top_69
Output:
# A tibble: 69 x 2
   Numbers Count
     <dbl> <dbl>
 1    28.0  84.0
 2    23.0  83.0
 3    32.0  81.0
 4    41.0  81.0
 5    39.0  78.0
 6    52.0  78.0
 7    54.0  78.0
 8    10.0  77.0
 9    11.0  76.0
10    12.0  76.0
# ... with 59 more rows
ggplot() + geom_bar()
Input:
ggplot(data = top_69, aes(x = Numbers, y = Count, fill = Numbers)) +
  geom_bar(stat = "identity", width = .8) +
  xlab("Powerball Numbers") +
  ylab("Total Number Occurrences") +
  ggtitle("All Powerball Numbers Occurrences") +
  scale_fill_gradient(low = "red", high = "blue") +
  theme(axis.text.x = element_text(face = "bold", color = "#FF33C1", size = 10, angle = 45, hjust = 1),
        axis.text.y = element_text(face = "bold", color = "#FF33C1", size = 10, angle = 45)) +
  theme(plot.title = element_text(hjust = 0.5))
Output:
Top Red Ball Powerball Number Occurrences
Now that we have analyzed the most frequent white balls in Powerball, what about the most frequently drawn Powerball (red ball)? We follow the same steps as before.
Input:
powerball_numbers_2010 <- read.csv("powerball_lottery_numbers_reformatted.csv", stringsAsFactors = FALSE)
all_powerballs <- c(powerball_numbers_2010$Powerball)
all_powerballs <- data.frame(all_powerballs)
names(all_powerballs) <- c("Powerballs")
head(all_powerballs, n = 10)
Output:
   Powerballs
1          24
2           4
3          34
4           1
5          15
6          32
7          12
8          30
9          23
10         15
26 Red Powerballs Analysis
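Powerball currently has 26 red balls, but this dataset also contains higher Powerball numbers from earlier drawings. We therefore pass top_n() a value larger than the number of distinct Powerballs (40) so that every number is counted, and trim the result to 26 in the next step.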
Input:
top_26 <- all_powerballs %>%
  group_by(Powerballs) %>%
  tally() %>%
  ungroup() %>%
  top_n(40)
top_26
Output:
# A tibble: 39 x 2
   Powerballs     n
        <int> <int>
 1          1    24
 2          2    23
 3          3    23
 4          4    23
 5          5    27
 6          6    28
 7          7    27
 8          8    27
 9          9    25
10         10    27
# ... with 29 more rows
The next step is to remove all numbers over 26 from the data. Because the tally is sorted by Powerball number, the first 26 rows are the numbers 1 through 26, so we can do this with the head() function and n = 26.
Input:
top_26 <- head(top_26, n = 26)
top_26
Output:
# A tibble: 26 x 2
   Powerballs     n
        <int> <int>
 1          1    24
 2          2    23
 3          3    23
 4          4    23
 5          5    27
 6          6    28
 7          7    27
 8          8    27
 9          9    25
10         10    27
# ... with 16 more rows
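Because head() keeps rows by position, it relies on the table being sorted by Powerball number with no gaps. A slightly more defensive sketch filters on the value itself. The example below uses base R subsetting on a made-up tally with placeholder counts; tally_all and top_26_by_value are hypothetical names, not objects from this tutorial.

```r
# Made-up tally: Powerballs 1 to 39 with placeholder counts (not real data).
tally_all <- data.frame(Powerballs = 1:39, n = rep(1L, 39))

# Keep only the numbers that exist in the 26-ball format, selected by value.
top_26_by_value <- tally_all[tally_all$Powerballs <= 26, ]
nrow(top_26_by_value)
```

Filtering by value gives the same 26 rows here, and would still be correct if the rows were in a different order.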
As with the white balls, the next steps rename the n column to Count, convert both columns from integer to numeric, and sort the rows from the most frequent number to the least frequent.
Input:
names(top_26) <- c("Powerballs", "Count")
top_26$Powerballs <- as.numeric(top_26$Powerballs)
top_26$Count <- as.numeric(top_26$Count)
top_26 <- top_26[order(top_26$Count, -rank(top_26$Powerballs), decreasing = TRUE), ]
top_26
Output:
# A tibble: 26 x 2
   Powerballs Count
        <dbl> <dbl>
 1      24.0   32.0
 2      15.0   29.0
 3      17.0   29.0
 4      25.0   29.0
 5       6.00  28.0
 6      19.0   28.0
 7       5.00  27.0
 8       7.00  27.0
 9       8.00  27.0
10      10.0   27.0
# ... with 16 more rows
From the above, we can confirm that the most occurring red Powerball is the number 24 with 32 occurrences.
ggplot() + geom_bar()
Input:
ggplot(data = top_26, aes(x = Powerballs, y = Count, fill = Powerballs)) +
  geom_bar(stat = "identity", width = .8) +
  xlab("Powerball Numbers") +
  ylab("Total Number Occurrences") +
  ggtitle("Red Powerball Numbers Occurrences") +
  scale_fill_gradient(low = "orange", high = "red") +
  theme(axis.text.x = element_text(face = "bold", color = "#EC0505", size = 10, angle = 45, hjust = 1),
        axis.text.y = element_text(face = "bold", color = "#EC8705", size = 10, angle = 45)) +
  theme(plot.title = element_text(hjust = 0.5))
Output:
Top Powerball Numbers Occurring
From the data analysis and data visualization of the Powerball numbers, we can conclude that the most frequent white Powerball numbers are 28 with 84 occurrences, 23 with 83 occurrences, 32 and 41 with 81 occurrences each, and 39, 52, and 54 with 78 occurrences each. The most frequently occurring red Powerball number is 24 with 32 occurrences.