In this R tutorial, we will use the highway mpg dataset and visualize it with a variety of scatterplots and histograms. The scatterplots plot cyl against hwy and cyl against cty, which makes it easy to see which 4-, 6-, and 8-cylinder vehicles get the best mpg in city and highway driving. The histograms cover the different drive types (drv): the data is broken into subsets so that the vehicle classes can be identified for each of the three drive types. Two additional histograms break the data into subsets of cyl vs. drv and cyl vs. class.
Install and Load Packages
Before we load the data, we will need to load the appropriate libraries for this R tutorial.
Input:
```r
# Install the packages (only needed once per machine)
install.packages("C50")
install.packages("gmodels")
install.packages("party")
install.packages("car")
install.packages("ggplot2")

# Load the packages for this session
library(C50)
library(gmodels)
library(party)
library(car)
library(ggplot2)
```
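Reinstalling every package on each run is slow. As a small optional sketch (the helper name `missing_packages` is made up for this example and is not part of any package), you could first check which packages are absent and install only those:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("C50", "gmodels", "party", "car", "ggplot2")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the snippet only reports what is missing.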
Download and Load the Highway MPG Dataset
This dataset is already packaged and available for download from the dataset page, or directly as highway_mpg.csv (Highway MPG Dataset). Save the file in your R working directory so that read.csv() can find it.
Input:
```r
mpg <- read.csv("highway_mpg.csv", stringsAsFactors = FALSE)
```
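The structure printed later in this tutorial (234 observations of 11 variables, beginning with audi a4 rows) matches the mpg dataset that ships with ggplot2. Assuming the CSV is simply an export of that dataset, which is an assumption on my part, a fallback for when the file is not at hand could look roughly like this:

```r
csv_path <- "highway_mpg.csv"

if (file.exists(csv_path)) {
  mpg <- read.csv(csv_path, stringsAsFactors = FALSE)
} else {
  # Assumption: the CSV mirrors ggplot2's built-in mpg data.
  # as.data.frame() turns the tibble into a plain data frame, like read.csv() returns.
  mpg <- as.data.frame(ggplot2::mpg)
}

# Quick sanity check on the size of the data
dim(mpg)
```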
ggplot2 offers many features, and we will use several of them to view the mpg data in different ways.
View the Highway MPG Dataset
Let’s take a quick look at the data by using the str() and summary() functions.
str() function
The str() function compactly displays the internal structure of an R object, printing roughly one line for each component. It is a useful alternative to summary() when you want to see the column names, their types, and the first few values.
Input:
```r
str(mpg)
```
Output:
```
'data.frame':	234 obs. of  11 variables:
 $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
 $ model       : chr  "a4" "a4" "a4" "a4" ...
 $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr  "f" "f" "f" "f" ...
 $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr  "p" "p" "p" "p" ...
 $ class       : chr  "compact" "compact" "compact" "compact" ...
```
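If you only want the column types without the sample values, sapply() with class() gives a more compact view. The sketch below rebuilds the first ten rows of the numeric columns from the str() output so that it runs on its own; `mpg_head` is just a name chosen for this example.

```r
# First ten rows of the numeric columns, copied from the str() output.
mpg_head <- data.frame(
  displ = c(1.8, 1.8, 2, 2, 2.8, 2.8, 3.1, 1.8, 1.8, 2),
  year  = c(1999L, 1999L, 2008L, 2008L, 1999L, 1999L, 2008L, 1999L, 1999L, 2008L),
  cyl   = c(4L, 4L, 4L, 4L, 6L, 6L, 6L, 4L, 4L, 4L),
  cty   = c(18L, 21L, 20L, 21L, 16L, 18L, 18L, 18L, 16L, 20L),
  hwy   = c(29L, 29L, 31L, 30L, 26L, 26L, 27L, 26L, 25L, 28L)
)

# One class name per column
col_types <- sapply(mpg_head, class)
print(col_types)

# Names of the columns that can be used in numeric plots
numeric_cols <- names(mpg_head)[sapply(mpg_head, is.numeric)]
print(numeric_cols)
```

On the full dataset the same two calls work with mpg in place of mpg_head.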
summary() function
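The summary() function is a generic function that is used to produce result summaries of many kinds of objects, including the results of model-fitting functions. For a data frame, it summarizes each column: the minimum, quartiles, mean, and maximum for numeric columns, and the length, class, and mode for character columns.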
Input:
```r
summary(mpg)
```
Output:
```
 manufacturer          model               displ            year           cyl       
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889  
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.000   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy             fl           
 Length:234         Length:234         Min.   : 9.00   Min.   :12.00   Length:234        
 Class :character   Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Median :24.00   Mode  :character  
                                       Mean   :16.86   Mean   :23.44                     
                                       3rd Qu.:19.00   3rd Qu.:27.00                     
                                       Max.   :35.00   Max.   :44.00                     
    class          
 Length:234        
 Class :character  
 Mode  :character  
```
Scatterplot
A scatterplot compares two variables: each point is one observation, positioned by its value on the x-axis and its value on the y-axis. Below is a series of scatterplots that compare cyl against cty and hwy. They also include a regression line, which shows mpg declining from the 4-cylinder vehicles to the 8-cylinder vehicles.
As the plots show, the vehicle classes that do well in the city also do well on the highway. The class that does best is the subcompact class, which does exceptionally well in both city and highway mpg.
Can you see how city and highway mpg go down from 4 to 6 to 8 cylinders?
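The same trend can be checked numerically by averaging mpg within each cylinder group. The sketch below uses only the first ten rows shown in the str() output (so only 4- and 6-cylinder vehicles appear); `mpg_head` is a name chosen for this example, and on the real data you would pass mpg instead.

```r
# cyl, cty and hwy for the first ten rows, copied from the str() output.
mpg_head <- data.frame(
  cyl = c(4L, 4L, 4L, 4L, 6L, 6L, 6L, 4L, 4L, 4L),
  cty = c(18L, 21L, 20L, 21L, 16L, 18L, 18L, 18L, 16L, 20L),
  hwy = c(29L, 29L, 31L, 30L, 26L, 26L, 27L, 26L, 25L, 28L)
)

# Mean city and highway mpg for each number of cylinders
mean_mpg <- aggregate(cbind(cty, hwy) ~ cyl, data = mpg_head, FUN = mean)
print(mean_mpg)
```

Even in this small sample, the 6-cylinder rows average lower city and highway mpg than the 4-cylinder rows.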
cyl vs cty Scatterplot
Scatterplot for cyl vs cty, mapping class to the color aesthetic, with a regression line.
Input:
```r
mpg_stat <- ggplot(mpg, aes(x = cyl, y = cty))
mpg_stat + geom_point(aes(color = class)) +
  stat_smooth(method = "lm")
```
Output:
cyl vs hwy Scatterplot
Scatterplot for cyl vs hwy, mapping class to the color aesthetic, with a regression line.
Input:
```r
mpg_stat <- ggplot(mpg, aes(x = cyl, y = hwy))
mpg_stat + geom_point(aes(color = class)) +
  stat_smooth(method = "lm")
```
Output:
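The line that stat_smooth(method = "lm") draws is an ordinary least-squares fit, so its slope can also be read off numerically with lm(). This is a rough sketch on the first ten rows from the str() output (`mpg_head` is a name made up for the example); the slope on the full dataset will differ.

```r
# cyl and hwy for the first ten rows, copied from the str() output.
mpg_head <- data.frame(
  cyl = c(4L, 4L, 4L, 4L, 6L, 6L, 6L, 4L, 4L, 4L),
  hwy = c(29L, 29L, 31L, 30L, 26L, 26L, 27L, 26L, 25L, 28L)
)

# Fit highway mpg as a linear function of the number of cylinders
fit <- lm(hwy ~ cyl, data = mpg_head)

# The cyl coefficient is the change in hwy mpg per additional cylinder
slope <- coef(fit)[["cyl"]]
print(slope)
```

A negative slope corresponds to the downward-sloping line in the plot.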
Additional Scatterplots for MPG
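The scatterplots above give a great view of how mpg decreases as the number of cylinders goes up. However, many points are plotted in exactly the same location, which makes it difficult to see the distribution. geom_jitter() adds a small amount of random noise to each point's position so that overlapping points spread apart.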
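To get a feel for how much overplotting there is, you can count the rows that repeat an earlier (cyl, hwy) pair. The sketch below does this for the first ten rows from the str() output and shows base R's jitter() as a stand-in for what geom_jitter() does to the x positions; `mpg_head` and the amount of 0.2 are choices made for this example only.

```r
# cyl and hwy for the first ten rows, copied from the str() output.
mpg_head <- data.frame(
  cyl = c(4L, 4L, 4L, 4L, 6L, 6L, 6L, 4L, 4L, 4L),
  hwy = c(29L, 29L, 31L, 30L, 26L, 26L, 27L, 26L, 25L, 28L)
)

# Rows whose (cyl, hwy) pair has already appeared would be drawn on top of an earlier point
n_overlapping <- sum(duplicated(mpg_head[c("cyl", "hwy")]))
print(n_overlapping)

# Adding a little random noise separates the overlapping x positions
set.seed(1)
cyl_jittered <- jitter(mpg_head$cyl, amount = 0.2)
print(round(cyl_jittered, 2))
```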
Scatterplot Cyl vs. Hwy
Input:
```r
ggplot(mpg, aes(cyl, hwy)) +
  geom_jitter()
```
Output:
Scatterplot Cyl vs. Cty
Input:
```r
ggplot(mpg, aes(cyl, cty)) +
  geom_jitter()
```
Output:
The plot below is a simple way to relate highway and city mpg to the number of cylinders.
Notice how the 4-cylinder points are darker and the 8-cylinder points are lighter?
Input:
```r
mpg_gg <- ggplot(mpg)
mpg_gg + geom_point(aes(x = hwy, y = cty, color = cyl))
```
Output:
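The dark-to-light gradient appears because cyl is stored as a number, so ggplot2 gives it a continuous color scale. If you would rather have one distinct color per cylinder count, wrapping cyl in factor() is one option. A minimal sketch on the first ten rows from the str() output (`mpg_head` and `p_discrete` are names made up for this example):

```r
library(ggplot2)

# cyl, cty and hwy for the first ten rows, copied from the str() output.
mpg_head <- data.frame(
  cyl = c(4L, 4L, 4L, 4L, 6L, 6L, 6L, 4L, 4L, 4L),
  cty = c(18L, 21L, 20L, 21L, 16L, 18L, 18L, 18L, 16L, 20L),
  hwy = c(29L, 29L, 31L, 30L, 26L, 26L, 27L, 26L, 25L, 28L)
)

# factor(cyl) makes the color scale discrete: one legend entry per cylinder count
p_discrete <- ggplot(mpg_head) +
  geom_point(aes(x = hwy, y = cty, color = factor(cyl))) +
  labs(color = "cyl")

print(p_discrete)
```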
Scatterplot Matrix with Histogram
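This one is a bit more complicated: a scatterplot matrix of cyl, cty, and hwy, with a histogram of each variable on the diagonal.
Input:
```r
# Note: spread and diagonal = "histogram" follow the older car interface; newer
# car releases may expect these options in list form (see ?scatterplotMatrix).
scatterplotMatrix(~ cyl + cty + hwy, data = mpg, spread = FALSE, diagonal = "histogram",
                  lty = 1, main = "Scatterplot Matrix of Cyl and Cty vs Hwy MPG")
```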
Output:
Scatterplot + Facet Grid
The scatterplot below compares cty and hwy mpg, with one panel for each combination of cyl (rows) and class (columns).
Input:
```r
scatterplot_class <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()
scatterplot_class + facet_grid(cyl ~ class)
```
Output:
Histogram with Drives
Below are histograms (bar charts of counts) of the drive types (drv), first for the whole dataset and then broken up into subsets by class. We will also identify the classes with the highest number of four-wheel, front-wheel, and rear-wheel drive vehicles.
Input:
```r
mpg_gg + geom_bar(aes(x = drv, fill = factor(drv)), position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Output:
MPG Class and Cylinder Table
Input:
```r
# Cross-tabulate vehicle class (rows) against number of cylinders (columns)
class_cyl <- table(mpg$class, mpg$cyl)
class_cyl
```
Output:
```
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38
```
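Row and column totals make a table like this easier to read. As a sketch, the counts from the output are re-entered by hand below so that the example runs without the CSV; `class_by_cyl` is a name chosen for this example. With the real data you could call addmargins() directly on the result of table().

```r
# Counts copied from the class-by-cylinder table output.
class_by_cyl <- as.table(matrix(
  c( 0, 0,  0,  5,
    32, 2, 13,  0,
    16, 0, 23,  2,
     1, 0, 10,  0,
     3, 0, 10, 20,
    21, 2,  7,  5,
     8, 0, 16, 38),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    class = c("2seater", "compact", "midsize", "minivan", "pickup", "subcompact", "suv"),
    cyl   = c("4", "5", "6", "8")
  )
))

# Append a "Sum" row and a "Sum" column
with_totals <- addmargins(class_by_cyl)
print(with_totals)

# Share of each cylinder count within a class (each row sums to 1)
row_share <- round(prop.table(class_by_cyl, margin = 1), 2)
print(row_share)
```

The grand total comes to 234, which matches the number of observations reported by str().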
Subsets
The histograms below compare cyl, class, and drv. There are three histograms in total, so that each comparison gets its own visual.
Subset of Class with a factor of drive with binwidth
These histograms identify the classes with the highest number of four-wheel, front-wheel, and rear-wheel drive vehicles. Within each class, they also show a shaded count for each drive type at each cylinder count (4, 6, and 8).
Input:
```r
# binwidth = 1 gives each cylinder count its own bar; a binwidth wider than the
# 4-8 range of cyl would lump every vehicle into a single bin.
# geom_histogram() already bins the data, so no separate stat_bin() layer is needed.
ggplot(mpg, aes(x = cyl, fill = drv)) +
  geom_histogram(binwidth = 1, alpha = .5, position = "identity") +
  facet_wrap(~ class)
```
Output:
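Bin width has a large effect on a histogram of a variable such as cyl, which takes only a few distinct values. The sketch below illustrates this with base R's cut() and table(). The cylinder counts (81, 4, 79, and 70) are the column totals of the class-by-cylinder table shown earlier, and the bin edges are chosen for illustration; they are not necessarily the edges ggplot2 would pick.

```r
# Rebuild the cyl column from the column totals of the class-by-cylinder table.
cyl <- rep(c(4, 5, 6, 8), times = c(81, 4, 79, 70))

# Bins of width 1: every cylinder count gets its own bin
narrow_bins <- table(cut(cyl, breaks = seq(3.5, 8.5, by = 1)))
print(narrow_bins)

# One bin of width 20: all vehicles land in the same bin
wide_bins <- table(cut(cyl, breaks = c(-10, 10)))
print(wide_bins)
```

With the narrow bins the four cylinder counts stay separate (plus an empty bin around 7), while the wide bin only reports the overall total.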
Subset of class with a factor of drive
The plot below gives the number of vehicles in each class for each drv type (4, f, and r).
Input:
```r
mpg_gg + geom_bar(aes(x = drv, fill = factor(drv)), position = "dodge") +
  facet_wrap(~ class, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Output:
Input:
```r
drv_class <- table(mpg$class, mpg$drv)
drv_class
```
Output:
```
              4  f  r
  2seater     0  0  5
  compact    12 35  0
  midsize     3 38  0
  minivan     0 11  0
  pickup     33  0  0
  subcompact  4 22  9
  suv        51  0 11
```
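Rather than scanning the table by eye, you can ask R which class has the most vehicles for each drive type. In this sketch the counts from the output are typed in by hand so that it runs on its own (`class_by_drv` and `top_class` are names made up for the example); with the real data, the same apply() call works on drv_class.

```r
# Counts copied from the class-by-drive table output.
class_by_drv <- matrix(
  c( 0,  0,  5,
    12, 35,  0,
     3, 38,  0,
     0, 11,  0,
    33,  0,  0,
     4, 22,  9,
    51,  0, 11),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    class = c("2seater", "compact", "midsize", "minivan", "pickup", "subcompact", "suv"),
    drv   = c("4", "f", "r")
  )
)

# For each drive type (column), find the row with the largest count
top_idx <- apply(class_by_drv, 2, which.max)
top_class <- setNames(rownames(class_by_drv)[top_idx], colnames(class_by_drv))
print(top_class)  # suv for 4, midsize for f, suv for r
```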
After reviewing the data and the graphical analysis, the subcompact class would be my top vehicle class choice, because this class does well both in the city and on the highway. Also, as the table above shows, the midsize class has the most front-wheel drive vehicles (38), and the suv class has both the most four-wheel drive vehicles (51, ahead of pickup with 33) and the most rear-wheel drive vehicles (11).