This week on RCloud: https://rstudio.cloud/project/1074544



Datasets for this class:

A random sample of 1,000 federal personnel records for March 1994:



Load Libraries

library(dplyr)
library(ggplot2)


PREDICTING/COMPARING CAR MPG (AMERICAN VS FOREIGN CARS)

Cars93 <- MASS::Cars93  # Load the dataset from package MASS
names(Cars93)           # Variable names
##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"


Are American cars more or less fuel efficient than foreign cars?


American vs. Foreign Cars: Comparing Distributions of MPG.city using boxplots:

Cars93 %>% ggplot(mapping = aes(x = Origin, y = MPG.city)) + geom_boxplot()

Let’s calculate mean MPG.city for the two groups of cars:

Cars93 %>% select(MPG.city, Origin) %>% group_by(Origin) %>% summarize(Mean.MPG.city = mean(MPG.city, na.rm = TRUE))
## # A tibble: 2 x 2
##   Origin  Mean.MPG.city
##   <fct>           <dbl>
## 1 USA              21.0
## 2 non-USA          23.9
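
The group means alone do not show how many cars are in each group or how spread out the values are. A minimal sketch, assuming the dplyr package loaded earlier, that adds the group size and standard deviation to the same kind of summary (the column names N and SD.MPG.city are illustrative choices, not part of the original notes):

```r
library(dplyr)

Cars93 <- MASS::Cars93  # Same dataset as above

# Group size, mean, and standard deviation of city MPG by origin
mpg_summary <- Cars93 %>%
  group_by(Origin) %>%
  summarize(
    N             = n(),
    Mean.MPG.city = mean(MPG.city, na.rm = TRUE),
    SD.MPG.city   = sd(MPG.city, na.rm = TRUE)
  )

mpg_summary
```

The group sizes and standard deviations are what drive the standard errors in any test of whether the two means differ.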

Bivariate regression: MPG.city ~ Origin

lm(MPG.city ~ Origin, data = Cars93) %>% summary()
## 
## Call:
## lm(formula = MPG.city ~ Origin, data = Cars93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8667 -3.8667 -0.9583  2.0417 22.1333 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.9583     0.7875  26.612   <2e-16 ***
## Originnon-USA   2.9083     1.1322   2.569   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.456 on 91 degrees of freedom
## Multiple R-squared:  0.06761,    Adjusted R-squared:  0.05737 
## F-statistic: 6.599 on 1 and 91 DF,  p-value: 0.01183
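
With a two-level factor such as Origin as the only predictor, the regression restates the comparison of group means: under R's default treatment coding, the intercept is the mean of the reference group (USA) and the Originnon-USA coefficient is the difference between the two group means. The sketch below checks this in base R and also compares the slope's p-value with a pooled-variance two-sample t test, which should agree with it. The object names fit, group_means, and tt are illustrative.

```r
Cars93 <- MASS::Cars93  # Same dataset as above

fit <- lm(MPG.city ~ Origin, data = Cars93)
group_means <- tapply(Cars93$MPG.city, Cars93$Origin, mean)

# Intercept vs. mean city MPG of the reference group (USA)
coef(fit)[["(Intercept)"]]
group_means[["USA"]]

# Slope vs. difference in group means (non-USA minus USA)
coef(fit)[["Originnon-USA"]]
group_means[["non-USA"]] - group_means[["USA"]]

# Pooled-variance t test: its p-value should match the slope's p-value
tt <- t.test(MPG.city ~ Origin, data = Cars93, var.equal = TRUE)
tt$p.value
summary(fit)$coefficients["Originnon-USA", "Pr(>|t|)"]
```

Read this way, the output above says that non-USA cars in the sample average about 2.9 more city MPG than USA cars, and that a difference of this size would be unlikely (p = 0.0118) if the two groups had the same mean.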


Do American cars in the sample tend to have larger engines?


Boxplot for EngineSize:

Cars93 %>% ggplot(mapping = aes(x = Origin, y = EngineSize)) + geom_boxplot()

Mean EngineSize:

Cars93 %>% dplyr::select(EngineSize, Origin) %>% group_by(Origin) %>% summarize(Mean.EngineSize = mean(EngineSize))
## # A tibble: 2 x 2
##   Origin  Mean.EngineSize
##   <fct>             <dbl>
## 1 USA                3.07
## 2 non-USA            2.24
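
Since the USA cars in the sample have larger engines on average, it may help to look at how EngineSize relates to MPG.city within each group. A minimal sketch, assuming ggplot2 as loaded earlier; the per-group fitted lines and the object names p and r_engine_mpg are illustrative additions, not part of the original analysis:

```r
library(ggplot2)

Cars93 <- MASS::Cars93  # Same dataset as above

# City MPG against engine size, colored by origin, with a straight-line
# fit drawn separately for each group
p <- ggplot(Cars93, aes(x = EngineSize, y = MPG.city, color = Origin)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p

# Overall correlation between engine size and city MPG
r_engine_mpg <- cor(Cars93$EngineSize, Cars93$MPG.city)
r_engine_mpg
```

If larger engines go with lower MPG in both groups, part of the raw MPG gap between USA and non-USA cars could reflect engine size and not origin itself.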


Multiple Regression

Modeling MPG.city based on car Origin and EngineSize:

lm(MPG.city ~ Origin + EngineSize, data = Cars93) %>% summary()
## 
## Call:
## lm(formula = MPG.city ~ Origin + EngineSize, data = Cars93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5478  -2.6409  -0.5944   1.9210  17.2802 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    32.9393     1.4629  22.517  < 2e-16 ***
## Originnon-USA  -0.3126     0.9050  -0.345    0.731    
## EngineSize     -3.9068     0.4383  -8.913 5.22e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.999 on 90 degrees of freedom
## Multiple R-squared:  0.5048, Adjusted R-squared:  0.4938 
## F-statistic: 45.87 on 2 and 90 DF,  p-value: 1.848e-14
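
Once EngineSize is in the model, the Originnon-USA coefficient drops from 2.9083 to -0.3126 and is no longer statistically significant. One way to see where the change comes from is the standard OLS decomposition: the Origin coefficient from the model without EngineSize equals the Origin coefficient from the model with it, plus the EngineSize coefficient times the gap in mean engine size between the two groups. The sketch below checks this numerically in base R; the names short, long, aux, and newcars are illustrative, and the 2.5-liter engine is a hypothetical value chosen only for the example.

```r
Cars93 <- MASS::Cars93  # Same dataset as above

short <- lm(MPG.city ~ Origin, data = Cars93)               # Origin only
long  <- lm(MPG.city ~ Origin + EngineSize, data = Cars93)  # Origin + EngineSize
aux   <- lm(EngineSize ~ Origin, data = Cars93)             # Engine-size gap by origin

b_short <- coef(short)[["Originnon-USA"]]
b_long  <- coef(long)[["Originnon-USA"]]
b_eng   <- coef(long)[["EngineSize"]]
gap     <- coef(aux)[["Originnon-USA"]]  # non-USA minus USA mean EngineSize

# Short-model coefficient rebuilt from the long model and the engine-size gap
c(short = b_short, rebuilt = b_long + b_eng * gap)

# Predicted city MPG for two hypothetical cars with the same 2.5-liter engine
newcars <- data.frame(
  Origin     = factor(c("USA", "non-USA"), levels = levels(Cars93$Origin)),
  EngineSize = 2.5
)
pred <- predict(long, newdata = newcars)
pred
```

Because the model has no interaction term, the two predictions differ by exactly the Originnon-USA coefficient, whatever engine size is plugged in. This suggests that, in this sample, most of the raw MPG difference between USA and non-USA cars is accounted for by the difference in engine size.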