For those who prefer to work with RCloud, a project with the same materials can be accessed using the following link:
Datasets for this class:
mtcars, from the package datasets
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
To load, run: data(mtcars)
To get more info about the dataset, run: ?mtcars
Check all the built-in datasets by running: library(help = "datasets")
library(dplyr) # for manipulating the dataset using %>%, select(), filter(), etc.
library(ggplot2) # graphics
setwd(".")
mtcars dataset
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mtcars dataset
Scatterplot of car mpg against weight (mpg ~ wt) using base graphics:
plot(x = mtcars$wt, y = mtcars$mpg) # or you can type: plot(mtcars$mpg ~ mtcars$wt)
Scatterplot of car mpg against weight (mpg ~ wt) using ggplot:
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg )) + geom_point()
Adding another dimension: the same scatterplot broken down by am (transmission type: 0 = automatic, 1 = manual)
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, col = factor(am))) + geom_point()
Scatterplot of car mpg against am:
ggplot(data = mtcars) + geom_point(mapping = aes(x = factor(am), y = mpg, col = factor(am) ))
Boxplot of car mpg against am:
ggplot(data = mtcars, mapping = aes(x = factor(am), y = mpg, col = factor(am))) + geom_boxplot()
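The boxplot compares the two transmission groups visually. As a rough numeric companion, the sketch below summarises mpg by am with dplyr. The object name mpg_by_am and the choice of summary statistics are illustrative assumptions, not part of the original materials.

```r
library(dplyr)

data(mtcars)

# Group size, mean and median mpg for each transmission type
# (am: 0 = automatic, 1 = manual)
mpg_by_am <- mtcars %>%
  group_by(am) %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    median_mpg = median(mpg)
  )

mpg_by_am
```

The group sizes are worth checking alongside the plot, since a boxplot by itself does not show how many cars fall into each group.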
mtcars dataset
Basic correlation matrix using cor() for variables selected with select():
mtcars %>% select(mpg, cyl, disp, hp, wt, am) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## mpg cyl disp hp wt am
## mpg 1.00 -0.85 -0.85 -0.78 -0.87 0.60
## cyl -0.85 1.00 0.90 0.83 0.78 -0.52
## disp -0.85 0.90 1.00 0.79 0.89 -0.59
## hp -0.78 0.83 0.79 1.00 0.66 -0.24
## wt -0.87 0.78 0.89 0.66 1.00 -0.69
## am 0.60 -0.52 -0.59 -0.24 -0.69 1.00
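A single entry of the matrix above can be checked on its own. The sketch below recomputes the mpg and wt correlation with cor() and, as an optional extra that the original handout does not use, runs cor.test() from the stats package to get a confidence interval and p-value for the same pair.

```r
data(mtcars)

# Pearson correlation for one pair of variables;
# rounds to the mpg/wt entry of the matrix above
r <- cor(mtcars$mpg, mtcars$wt)
round(r, 2)

# Optional: test of the same correlation (not part of the original materials)
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate   # the same correlation coefficient
ct$conf.int   # 95% confidence interval
ct$p.value    # p-value for H0: correlation = 0
```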
A more advanced solution using the ggplot family of libraries (GGally):
#install.packages("GGally")
library(GGally)
mtcars %>% select(mpg, cyl, disp, hp, wt, am) %>% ggpairs()
OPM94 dataset
load("Datasets/OPM94.RData")
str(opm94)
## 'data.frame': 1000 obs. of 23 variables:
## $ x : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sal : int 26045 37651 64926 18588 19573 28648 27805 16560 40440 24285 ...
## $ grade : int 7 9 14 4 3 9 7 3 11 6 ...
## $ patco : Factor w/ 5 levels "Administrative",..: 1 4 4 2 2 4 5 2 1 2 ...
## $ major : Factor w/ 23 levels " ","AGRIC",..: 16 11 10 1 1 11 1 1 1 6 ...
## $ age : int 52 34 37 26 51 44 50 37 59 57 ...
## $ male : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
## $ vet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ handvet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ hand : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ yos : int 6 4 3 6 14 1 7 5 13 6 ...
## $ edyrs : int 16 16 16 12 12 16 14 12 12 14 ...
## $ promo : Factor w/ 2 levels "no","yes": 2 1 1 1 NA 1 1 1 1 1 ...
## $ exit : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ supmgr : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 5 levels "American Indian",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ minority: int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade4 : Factor w/ 4 levels "grades 1 to 4",..: 3 4 2 1 1 4 3 1 4 3 ...
## $ promo01 : num 1 0 0 0 NA 0 0 0 0 0 ...
## $ supmgr01: num 0 0 0 0 0 0 0 0 0 0 ...
## $ male01 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exit01 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ vet01 : num 0 0 0 0 0 0 0 0 1 0 ...
Correlation matrix for selected interval-level variables:
opm94 %>% select(sal, grade, edyrs, yos ) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## sal grade edyrs yos
## sal 1.00 0.91 0.59 0.40
## grade 0.91 1.00 0.61 0.31
## edyrs 0.59 0.61 1.00 0.01
## yos 0.40 0.31 0.01 1.00
Correlation matrix with binary variables:
opm94 %>% select(sal, male01, vet01, promo01, supmgr01, minority) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## sal male01 vet01 promo01 supmgr01 minority
## sal 1.00 0.36 0.14 -0.15 0.52 -0.23
## male01 0.36 1.00 0.42 -0.07 0.18 -0.12
## vet01 0.14 0.42 1.00 -0.07 0.11 -0.02
## promo01 -0.15 -0.07 -0.07 1.00 -0.08 0.04
## supmgr01 0.52 0.18 0.11 -0.08 1.00 -0.09
## minority -0.23 -0.12 -0.02 0.04 -0.09 1.00
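The correlation calls above pass use = "pairwise.complete.obs" because some variables (promo01, for example) contain NA. The sketch below illustrates what that argument does. It uses a small made-up data frame, toy, which is purely hypothetical and not related to opm94.

```r
# Hypothetical toy data: x and z each have one missing value
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, NA, 1, 0)
)

# Default (use = "everything"): a pair involving a missing value gives NA
default_cor <- cor(toy)
default_cor

# Pairwise deletion: each correlation uses the rows where both
# variables of that pair are observed
pw_cor <- cor(toy, use = "pairwise.complete.obs")
round(pw_cor, 2)
```

With pairwise deletion, different entries of the matrix can be based on different subsets of rows, which is worth keeping in mind when comparing coefficients.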
Salary ~ Grade:
ggplot(data = opm94, aes(x = grade, y = sal)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = opm94, aes(x = grade, y = sal, color = male)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
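The warnings above mean that some rows have a missing value in one of the plotted variables and were dropped from the plot. The sketch below shows two common ways to handle this. It uses a hypothetical toy data frame in place of opm94, so the names toy and toy_complete and all values are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for opm94; one salary is missing
toy <- data.frame(
  grade = c(3, 5, 7, 9, 11),
  sal   = c(16000, 21000, NA, 33000, 41000)
)

# Count the missing values that would trigger the warning
n_missing <- sum(is.na(toy$sal))
n_missing

# Option 1: drop incomplete rows explicitly before plotting
toy_complete <- toy[!is.na(toy$sal), ]
p1 <- ggplot(data = toy_complete, aes(x = grade, y = sal)) + geom_point()

# Option 2: keep the data as is and let the geom drop NAs without a warning
p2 <- ggplot(data = toy, aes(x = grade, y = sal)) + geom_point(na.rm = TRUE)
```

Option 1 makes the exclusion visible in the code, while option 2 only silences the message; either way the same points are drawn.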
Salary ~ edyrs:
ggplot(data = opm94, aes(x = edyrs, y = sal)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = opm94, aes(x = edyrs, y = sal, color = male)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
Salary ~ yos:
ggplot(data = opm94, aes(x = yos, y = sal)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = opm94, aes(x = yos, y = sal, color = male)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
Salary ~ male:
ggplot(data = opm94, aes(x = male, y = sal)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = opm94, aes(x = male, y = sal)) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
Salary ~ supmgr:
ggplot(data = opm94, aes(x = supmgr, y = sal)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = opm94, aes(x = supmgr, y = sal)) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
Salary ~ minority:
ggplot(data = opm94, aes(x = factor(minority), y = sal)) + geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(data = opm94, aes(x = factor(minority), y = sal)) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
Yuriy Davydenko 2020