In this R tutorial, we will learn some basic functions using the used cars dataset. Along the way, we will use data analysis to explore how a car’s mileage plays into its final price.

## Install and Load Packages

Below are the packages and libraries that we will need to load to complete this tutorial.

**Input:**

    install.packages("ggplot2")
    library(ggplot2)

## Download and Load the Used Cars Dataset

Since we will be using the used cars dataset, you will need to download it first. The dataset is already packaged and available for download from the dataset page, or directly as usedcars.csv.

**Input:**

usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
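To see what `stringsAsFactors = FALSE` does without downloading anything, here is a minimal sketch. The three rows and the temporary file are made up for illustration and are not taken from the real usedcars.csv.

```r
# Hypothetical sample written to a temporary file, so the sketch runs
# without the real usedcars.csv. The rows are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,model,price,mileage,color,transmission",
             "2011,SEL,21992,7413,Yellow,AUTO",
             "2011,SEL,20995,10926,Gray,AUTO",
             "2012,SE,17500,8367,White,AUTO"),
           csv_path)

# Text columns stay as character vectors instead of becoming factors
sample_cars <- read.csv(csv_path, stringsAsFactors = FALSE)

class(sample_cars$model)  # "character"
class(sample_cars$price)  # "integer"
nrow(sample_cars)         # 3
```

Since R 4.0.0, `FALSE` is already the default for this argument, so spelling it out mainly matters on older installations.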

## View the Used Cars Dataset Data

Once the data is imported, you can run a series of commands to see sample data of the used cars.

A few that I chose to use are str(), summary(), range(), and diff().

### str(usedcars)

The **str()** command compactly displays the internal structure of an R object. It is a diagnostic function and an alternative to **summary()**. Ideally, **str()** displays only one line for each basic structure, such as each column of a data frame.

**Input:**

str(usedcars)

**Output:**

    'data.frame':   150 obs. of  6 variables:
     $ year        : int  2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
     $ model       : chr  "SEL" "SEL" "SEL" "SEL" ...
     $ price       : int  21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
     $ mileage     : int  7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
     $ color       : chr  "Yellow" "Gray" "Silver" "Gray" ...
     $ transmission: chr  "AUTO" "AUTO" "AUTO" "AUTO" ...
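As a rough illustration of the one-line-per-structure behavior, the sketch below runs str() on a small data frame and counts the lines it prints. `sample_cars` is a hypothetical stand-in for the real dataset, with invented values.

```r
# Hypothetical two-row data frame (values invented for illustration)
sample_cars <- data.frame(
  year  = c(2011L, 2012L),
  model = c("SEL", "SE"),
  price = c(21992L, 17500L),
  stringsAsFactors = FALSE
)

# Print the structure as usual
str(sample_cars)

# Capture the printed lines: one header line plus one line per column
structure_lines <- capture.output(str(sample_cars))
length(structure_lines)
```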

### summary(usedcars)

The **summary()** function is a generic function that is used to produce result summaries of various objects, including the results of model fitting functions. For a data frame, it summarizes each column.

**Input:**

summary(usedcars)

**Output:**

          year         model               price          mileage      
     Min.   :2000   Length:150         Min.   : 3800   Min.   :  4867  
     1st Qu.:2008   Class :character   1st Qu.:10995   1st Qu.: 27200  
     Median :2009   Mode  :character   Median :13592   Median : 36385  
     Mean   :2009                      Mean   :12962   Mean   : 44261  
     3rd Qu.:2010                      3rd Qu.:14904   3rd Qu.: 55125  
     Max.   :2012                      Max.   :21992   Max.   :151479  
        color           transmission      
     Length:150         Length:150        
     Class :character   Class :character  
     Mode  :character   Mode  :character  

In addition, you can summarize a single column of the used cars dataset. For example, let’s complete a summary of only the year of the used cars.

**Input:**

summary(usedcars$year)

**Output:**

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       2000    2008    2009    2009    2010    2012 
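For a numeric column, the six figures reported by summary() can also be computed one at a time. The sketch below checks this on a short, made-up vector of years (`years` is hypothetical and not drawn from the dataset).

```r
# Hypothetical model years, invented for illustration
years <- c(2000, 2008, 2009, 2010, 2013)

year_summary <- summary(years)
print(year_summary)

# The same six figures computed individually
by_hand <- c(
  min(years),
  quantile(years, 0.25),  # first quartile
  median(years),
  mean(years),
  quantile(years, 0.75),  # third quartile
  max(years)
)

# Compare the two, ignoring names
all.equal(as.numeric(year_summary), as.numeric(by_hand))
```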

### range()

The **range()** function returns a vector containing the minimum and maximum of all the given arguments.

**Input:**

range(usedcars$price)

**Output:**

[1] 3800 21992

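In addition, you can apply the **diff()** function to the result of **range()**. In general, diff() returns suitably lagged and iterated differences; applied to the two-element range, it returns the difference between the maximum and the minimum.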

**Input:**

diff(range(usedcars$price))

**Output:**

[1] 18192
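Here is a small sketch of how range() and diff() fit together, using a made-up `prices` vector (hypothetical values, chosen so the spread matches the figure above).

```r
# Hypothetical prices, invented for illustration
prices <- c(17500, 3800, 21992, 12995)

# range() returns the minimum first, then the maximum
price_range <- range(prices)
price_range

# diff() of the two-element range is the spread: max minus min
spread <- diff(price_range)
spread

# On a longer vector, diff() returns the successive differences
steps <- diff(c(1, 4, 9, 16))
steps
```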

## Quantile Function of Probabilities

The **quantile()** function produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

Special cases of quantiles divide the data into a fixed number of parts:

- tertiles: three parts
- quintiles: five parts
- deciles: ten parts
- percentiles: 100 parts

Quartiles divide the data into four parts. The difference between the first quartile (Q1) and the third quartile (Q3) is known as the **interquartile range (IQR)**.

**Input:**

IQR(usedcars$price)

**Output:**

[1] 3909.5
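To make the definition concrete, the sketch below computes Q3 minus Q1 by hand and compares it with IQR(). The `prices` vector is hypothetical and uses R's default quantile method.

```r
# Hypothetical prices, invented for illustration
prices <- c(3800, 10995, 12500, 13592, 14904, 16000, 21992)

# First and third quartiles
q <- quantile(prices, probs = c(0.25, 0.75))
q

# Interquartile range computed by hand: Q3 - Q1
iqr_by_hand <- unname(q["75%"] - q["25%"])

# The built-in function should give the same value
iqr_builtin <- IQR(prices)
all.equal(iqr_by_hand, iqr_builtin)
```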

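The quantile() function can use one of several methods to handle ties among values and data sets with no single middle value. By supplying the **probs** parameter with a vector of cut points, you can request arbitrary quantiles, such as the 1st and 99th percentiles.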

**Input:**

quantile(usedcars$price, probs = c(0.01, 0.99))

**Output:**

          1%      99% 
     5428.69 20505.00 

### seq()

The **seq()** function is used to generate vectors of evenly-spaced values.

**Input:**

quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))

**Output:**

         0%     20%     40%     60%     80%    100% 
     3800.0 10759.4 12993.8 13992.0 14999.0 21992.0 
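It can help to look at what seq() produces on its own before passing it to quantile(). In this sketch, `prices` is a hypothetical, evenly spaced vector, chosen so the quintiles are easy to verify by eye.

```r
# Cut points from 0 to 1 in steps of 0.20: six values in total
cut_points <- seq(from = 0, to = 1, by = 0.20)
cut_points

# Hypothetical prices: 1000, 2000, ..., 21000 (21 evenly spaced values)
prices <- seq(from = 1000, to = 21000, by = 1000)

# Quintiles: the values at 0%, 20%, 40%, 60%, 80%, and 100%
quintiles <- quantile(prices, probs = cut_points)
quintiles
```

With evenly spaced input, each quintile lands on every fourth value, which makes it easy to confirm that the cut points behave as expected.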

## Used Car Boxplots

A **boxplot** is a common visualization of the five-number summary (minimum, first quartile, median, third quartile, and maximum). The **boxplot()** function produces box-and-whisker plot(s) of the given (grouped) values. As you will see below, the median is the dark line in the middle of the box.

In addition, you can add extra parameters such as **main** and **ylab** to add a title to the figure and to label the **y-axis (vertical axis)**.
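The numbers behind a boxplot can be inspected without drawing anything. The sketch below uses fivenum() and boxplot.stats(), the helper that boxplot() relies on, with a made-up `mileage` vector (hypothetical values that include one very high reading).

```r
# Hypothetical odometer readings, invented for illustration
mileage <- c(4867, 27200, 30000, 36385, 40000, 55125, 151479)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(mileage)
five

# Statistics a boxplot would draw: whisker ends, box edges, and median
bp <- boxplot.stats(mileage)
bp$stats

# Points beyond 1.5 times the box height from the box are drawn as outliers
bp$out
```

In this sketch the upper whisker stops at the largest value that is not an outlier, so the last element of `bp$stats` differs from the maximum reported by fivenum().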

### Boxplot of Used Car Prices

**Input:**

boxplot(usedcars$price, main="Boxplot of Used Car Prices", ylab="Price ($)")

**Output:**

### Boxplot of Used Car Mileage

**Input:**

boxplot(usedcars$mileage, main="Boxplot of Used Car Mileage", ylab="Odometer (mi.)")

**Output:**

## Used Car Histograms

Histograms are another way to graphically depict the spread of a numeric variable. A histogram is similar to a boxplot in that it divides the variable’s values into a predefined number of portions, called bins, that act as containers for values.
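To see the bins as numbers rather than bars, hist() can be called with `plot = FALSE`. The sketch below assumes a made-up `prices` vector and fixed breaks every 5,000; both are chosen for illustration and are not the defaults R would pick for the real data.

```r
# Hypothetical prices, invented for illustration
prices <- c(3800, 6500, 9995, 10995, 12500,
            13592, 13995, 14904, 16995, 21992)

# Compute the histogram without drawing it, using bins 5,000 wide
h <- hist(prices, breaks = seq(0, 25000, by = 5000), plot = FALSE)

# Bin edges
h$breaks

# Number of values that fall into each bin
# (by default each bin includes its right edge)
h$counts
```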

### Histogram of Used Car Prices

**Input:**

hist(usedcars$price, main = "Histogram of Used Car Prices", xlab = "Price ($)")

**Output:**

### Histogram of Used Car Mileage

**Input:**

hist(usedcars$mileage, main = "Histogram of Used Car Mileage", xlab = "Odometer (mi.)")

**Output:**

## Table

The **table()** function uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

**Input:**

    table(usedcars$color)
    model_table <- table(usedcars$model)
    prop.table(model_table)
    round(prop.table(table(usedcars$color)) * 100, digits = 1)

**Output:**

     Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
        35     17      1     16      5     25     32     16      3 

           SE       SEL       SES 
    0.5200000 0.1533333 0.3266667 

     Black   Blue   Gold   Gray  Green    Red Silver  White Yellow 
      23.3   11.3    0.7   10.7    3.3   16.7   21.3   10.7    2.0 
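The same pattern of counts, proportions, and rounded percentages can be tried on a tiny vector. In this sketch, `colors` is a hypothetical set of eight values, not taken from the dataset.

```r
# Hypothetical car colors, invented for illustration
colors <- c("Black", "Silver", "Black", "Red",
            "Silver", "Black", "White", "Red")

# Counts per color
color_table <- table(colors)
color_table

# Proportions per color (they sum to 1)
color_prop <- prop.table(color_table)

# Percentages rounded to one decimal place
color_pct <- round(color_prop * 100, digits = 1)
color_pct
```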

## Scatterplot

A scatterplot pairs up the values of two quantitative variables in a data set and displays them as geometric points inside a Cartesian diagram.

**Input:**

plot(x = usedcars$mileage, y = usedcars$price, main = "Scatterplot of Price vs. Mileage", xlab = "Used Car Odometer (mi.)", ylab = "Used Car Price ($)")

**Output:**

## Value Matching

Let’s say you wanted a vehicle in a specific color and only wanted to return the cars whose colors matched. The **match()** function returns a vector of the positions of (first) matches of its first argument in its second.

### %in%

%in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.

**Input:**

    usedcars$conservative <- usedcars$color %in% c("Black", "Gray", "Silver", "White")
    table(usedcars$conservative)

**Output:**

    FALSE  TRUE 
       51    99 

As we can see from the above output, 99 cars are **TRUE**, meaning their color is Black, Gray, Silver, or White. The remaining 51 cars do not meet the color criteria.
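To compare match() and %in% side by side, here is a short sketch. The `colors` vector is hypothetical, while `conservative` reuses the four colors from the example above.

```r
# Hypothetical car colors, invented for illustration
colors <- c("Red", "Black", "Blue", "White")
conservative <- c("Black", "Gray", "Silver", "White")

# match() gives the position of each color in `conservative`, or NA if absent
positions <- match(colors, conservative)
positions    # NA 1 NA 4

# %in% only reports whether a match exists
is_conservative <- colors %in% conservative
is_conservative    # FALSE TRUE FALSE TRUE

# The two agree: a color is "in" the set exactly when match() found a position
identical(is_conservative, !is.na(positions))
```

When only a yes/no answer is needed, as in the conservative color flag, %in% avoids the extra step of checking for NA.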