Statistical inference with the GSS data

Setup

setwd("E:/Coursera/Statistics with R - Duke University/02_Inferential Statitstics/FinalProject")

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

library(statsr)
library(foreign)

Load data

load("gss.Rdata")

Part 1: Data

The data for this project come from the General Social Survey (GSS). According to the source:

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. … The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

The variant of the GSS dataset provided for this project contains data on the following 114 variables:

length(names(gss))

## [1] 114

names(gss)

##   [1] "caseid"   "year"     "age"      "sex"      "race"     "hispanic"
##   [7] "uscitzn"  "educ"     "paeduc"   "maeduc"   "speduc"   "degree"  
##  [13] "vetyears" "sei"      "wrkstat"  "wrkslf"   "marital"  "spwrksta"
##  [19] "sibs"     "childs"   "agekdbrn" "incom16"  "born"     "parborn" 
##  [25] "granborn" "income06" "coninc"   "region"   "partyid"  "polviews"
##  [31] "relig"    "attend"   "natspac"  "natenvir" "natheal"  "natcity" 
##  [37] "natcrime" "natdrug"  "nateduc"  "natrace"  "natarms"  "nataid"  
##  [43] "natfare"  "natroad"  "natsoc"   "natmass"  "natpark"  "confinan"
##  [49] "conbus"   "conclerg" "coneduc"  "confed"   "conlabor" "conpress"
##  [55] "conmedic" "contv"    "conjudge" "consci"   "conlegis" "conarmy" 
##  [61] "joblose"  "jobfind"  "satjob"   "richwork" "jobinc"   "jobsec"  
##  [67] "jobhour"  "jobpromo" "jobmeans" "class"    "rank"     "satfin"  
##  [73] "finalter" "finrela"  "unemp"    "govaid"   "getaid"   "union"   
##  [79] "getahead" "parsol"   "kidssol"  "abdefect" "abnomore" "abhlth"  
##  [85] "abpoor"   "abrape"   "absingle" "abany"    "pillok"   "sexeduc" 
##  [91] "divlaw"   "premarsx" "teensex"  "xmarsex"  "homosex"  "suicide1"
##  [97] "suicide2" "suicide3" "suicide4" "fear"     "owngun"   "pistol"  
## [103] "shotgun"  "rifle"    "news"     "tvhours"  "racdif1"  "racdif2" 
## [109] "racdif3"  "racdif4"  "helppoor" "helpnot"  "helpsick" "helpblk"

Generalizability

According to the Code Book for General Social Surveys, 1972-2016, the data in this survey were collected using random sampling techniques. The sampling design for the GSS has been modified several times between 1972 and 2016. Overall, the survey has relied upon a stratified, multistage area national probability sample of clusters of households in the continental United States. According to the documentation, “at the block level, however, quota sampling is used with quotas based on sex, age, and employment status.”

Given the random sampling design, the sample should be broadly representative of the U.S. population, and inferences derived from this sample (estimates of means, proportions, and differences in means and proportions) can be extended to the larger population of all Americans. Also, the overall sample size (57,061 observations) is large enough to achieve very small margins of error. Because the data were collected in different years, it is worth noting that the largest sample for a single year was 4,510 observations (in 2006) and the smallest was 1,372 (in 1990), which can still provide reasonably small margins of error.

nrow(gss)

## [1] 57061

max(table(gss$year))

## [1] 4510

min(table(gss$year))

## [1] 1372
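To put the sample sizes above in perspective, the following is a rough sketch of the worst-case margin of error for a single proportion at each of them. It assumes simple random sampling, a 95 percent confidence level, and p = 0.5 (the value that maximizes the standard error); none of these assumptions comes from the GSS documentation, and the names `n`, `z_star`, and `moe` are introduced here only for illustration.

```r
# Sample sizes reported above: smallest yearly sample, largest yearly sample, full data
n <- c(smallest_year = 1372, largest_year = 4510, full_sample = 57061)

# Critical value for a 95% confidence level (an assumption of this sketch)
z_star <- qnorm(0.975)

# Worst-case margin of error for a proportion: z* x sqrt(p(1 - p) / n) with p = 0.5
moe <- z_star * sqrt(0.5 * 0.5 / n)
round(moe, 4)
```

Under these assumptions the margins come out at roughly 2.6, 1.5, and 0.4 percentage points. Because the GSS uses a clustered, multistage design rather than simple random sampling, the true margins are likely somewhat larger.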

Causal inference

The sample contains only observational data, and, although many variables in this dataset have been tracked since 1972, this is not a panel data set. Therefore, for many purposes (e.g. studying individual/household attitudes), the data are equivalent to cross-sectional data. Because this is not an experimental design (individuals were not randomly assigned to groups), the analysis is limited to determining associations between the explanatory and response variables and cannot establish causal relationships.

Part 2: Research question

The issue of abortion has long been controversial and contentious in the U.S. and across the globe. Today, hundreds of thousands of abortions are carried out annually. At the same time, abortion is illegal in many jurisdictions. Whether abortion should be legal is a complex and consequential issue with many arguments on both sides. Decisions on such a sensitive matter should therefore be well informed and justified. The purpose of this analysis is to find out whether there are education-based differences in people’s attitudes towards abortion. Hence, this research asks the following question:

Do Americans with different levels of education differ with respect to their attitudes towards legal abortions?

My expectation is that, in the U.S. population, individuals who have more education are more likely to approve of legal abortion than less educated Americans.

Part 3: Exploratory data analysis

To answer the research question, I will use the abany and educ variables from the GSS dataset:

abany: a binary variable based on the following question: Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if the woman wants it for any reason? (Yes/No)
educ: a numeric variable that measures the highest year of school completed by a respondent.

The following section describes each variable:

Years of education (educ):

summary(gss$educ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.00   12.00   12.75   15.00   20.00     164

qplot(educ, data = gss)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 164 rows containing non-finite values (stat_bin).

The summary statistics and histogram above show that the lowest amount of education in the sample is zero years and the highest is twenty years of school. The median person in the dataset has twelve years of school. The distribution is roughly symmetric, with a large concentration of respondents at exactly twelve years (the first quartile and the median are both 12). There are also 164 missing values for this variable.

Abortion for any reason (abany):

str(gss$abany)

##  Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA NA ...

table(gss$abany)

## 
##   Yes    No 
## 12887 18920

prop.table(table(gss$abany))

## 
##       Yes        No 
## 0.4051624 0.5948376

gss$abany %>% is.na() %>% sum()

## [1] 25254

The summaries for abany show that only 40.5 percent of the sample approve of abortion if the woman wants it for any reason, whereas 59.5 percent of the respondents in the sample do not. There are 25,254 missing responses for this variable in the sample.

Before proceeding with the rest of the analysis, I am going to subset the data so that it contains only the two variables needed and the missing values are removed:

my_data <- gss %>% select(educ, abany) %>% filter(!is.na(educ)) %>% filter(!is.na(abany))

## Warning: package 'bindrcpp' was built under R version 3.3.3

str(my_data)

## 'data.frame':    31732 obs. of  2 variables:
##  $ educ : int  8 4 8 16 17 14 12 10 15 9 ...
##  $ abany: Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 2 2 1 2 ...

One way to answer the research question is to split the sample into two groups based on education: one containing individuals whose years of education are at or below the sample median, and the other containing individuals with more years of school than the median. The proportions of those who approve or disapprove of abortion for any reason can then be compared across the two groups.

To accomplish that, I am going to create a new binary variable, educ_level, that indicates whether an individual in the sample has more years of education than the median value, or the same number or fewer:

my_data <- my_data %>% mutate(educ_level = ifelse(educ <= median(educ), "Below median educ", "Above median educ"))
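As a small illustration of how this split treats ties, the sketch below applies the same `ifelse()` rule to a made-up vector. The values in `educ_toy` are hypothetical and are not taken from the GSS; `educ_toy`, `med`, and `educ_level_toy` are names introduced only for this example.

```r
# Hypothetical years-of-education values, chosen only to illustrate the rule
educ_toy <- c(8, 10, 12, 12, 12, 14, 16, 18)

# Median of the toy vector
med <- median(educ_toy)

# Same rule as above: values equal to the median fall into the "Below" group
educ_level_toy <- ifelse(educ_toy <= med, "Below median educ", "Above median educ")
table(educ_level_toy)
```

Because respondents exactly at the median are assigned to the "Below median educ" group, the two groups need not be of equal size, which is worth keeping in mind when reading the group counts.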

Now we can see how attitudes toward abortion for any reason vary across the two education groups:

my_data %>% ggplot(aes(x = educ_level, fill = abany)) + geom_bar(position = "fill") + labs(y = "Proportion")

t <- table(my_data$abany, my_data$educ_level)
t

##      
##       Above median educ Below median educ
##   Yes              7387              5478
##   No               7289             11578

prop.table(t, 2)

##      
##       Above median educ Below median educ
##   Yes         0.5033388         0.3211773
##   No          0.4966612         0.6788227

diff <- my_data %>% group_by(educ_level) %>% summarize(Count = n(), PercentYes = sum(abany == "Yes")/n(), PercNo = sum(abany == "No")/n())
diff$PercentYes[2] - diff$PercentYes[1]

## [1] -0.1821615

As we can see from the graph and the crosstabs, in this sample, 50.3 percent of individuals with more than 12 years of school approve of abortion if the woman wants it for any reason; at the same time, only 32.1 percent of those with 12 or fewer years of education approve. Clearly, there is a substantial difference in attitudes towards abortion between the two education groups in this sample: more educated individuals are 18.2 percentage points more likely to say “Yes” to abortion if a woman wants it for any reason.

Part 4: Inference

The fact that there is a difference between the two groups in the sample does not yet mean there is a difference in attitudes towards abortion in the U.S. population. To check whether the relationship holds in the larger population, we need to calculate a confidence interval for the difference in proportions between the two education groups, or test the null hypothesis that the difference in proportions equals zero. The parameter of interest is the difference between the proportion of all Americans with above-median education and the proportion of all Americans with median or lower education who believe that a woman should be able to obtain a legal abortion if she wants it for any reason: p_High.Edu - p_Low.Edu

Before proceeding to the hypothesis test or the confidence interval, I check whether the conditions for inference are satisfied:

1. Independence.
- Within groups. Because the GSS sample was collected using random sampling and 31,732 observations is less than 10 percent of the U.S. population, the observations can be treated as independent of each other.
- Between groups. The data are not paired, so there is no reason to expect the two groups to be dependent.
2. Sample size/skew. Each sample should meet the success-failure condition:
- n1p1 >= 10 and n1(1-p1) >= 10
- n2p2 >= 10 and n2(1-p2) >= 10

diff$Count[1]*diff$PercentYes[1]

## [1] 7387

diff$Count[1]*(1-diff$PercentYes[1])

## [1] 7289

diff$Count[2]*diff$PercentYes[2]

## [1] 5478

diff$Count[2]*(1-diff$PercentYes[2])

## [1] 11578

As the calculations above show, the success-failure condition holds as well. Since all the conditions for inference hold, inference based on the Central Limit Theorem can proceed.
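The check above uses the observed proportions, which is the version appropriate for a confidence interval. For a hypothesis test with a null difference of zero, the condition is usually checked with the pooled proportion instead. The sketch below does that using the counts from the crosstab; the names `x`, `n`, `p_pool`, and `expected` are introduced here only for illustration.

```r
# Counts taken from the crosstab above: "Yes" responses and group sizes
x <- c(above = 7387, below = 5478)
n <- c(above = 14676, below = 17056)

# Pooled proportion of "Yes" responses under H0: p_above = p_below
p_pool <- sum(x) / sum(n)

# Expected successes and failures in each group under the pooled proportion
expected <- rbind(successes = n * p_pool, failures = n * (1 - p_pool))
expected

# The condition holds if every expected count is at least 10
all(expected >= 10)
```

With groups this large, the expected counts are in the thousands, so the condition is met under either version of the check.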

Hypothesis test: The null hypothesis states that the difference between the two population proportions equals zero. The alternative hypothesis is that the difference between the two population proportions does not equal zero.

inference(y = abany, x = educ_level, data = my_data, statistic = "proportion", type = "ht", method = "theoretical", success = "Yes", null = 0, alternative = "twosided", sig_level = 0.01)

## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Above median educ = 14676, p_hat_Above median educ = 0.5033
## n_Below median educ = 17056, p_hat_Below median educ = 0.3212
## H0: p_Above median educ =  p_Below median educ
## HA: p_Above median educ != p_Below median educ
## z = 32.9527
## p_value = < 0.0001

At the 0.01 significance level, we obtain z = 32.95 and a p-value below 0.0001, which allows us to confidently reject the null hypothesis that the difference between the two population proportions equals zero. We therefore conclude that the proportion of Americans who approve of legal abortion for any reason differs between those with above-median years of education and those with median or fewer years of education.

To obtain a quantitative estimate of the difference in the population proportions, we can also calculate a confidence interval:
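As a sanity check on the test statistic, the z value can be recomputed by hand from the crosstab counts. This is a minimal sketch that assumes the usual pooled standard error for a two-proportion z-test; it is not a description of how `inference()` works internally, and the variable names are introduced here for illustration. Base R's `prop.test()` (without continuity correction) is used as a cross-check, since its chi-square statistic is the square of this z.

```r
# Counts taken from the crosstab above
x <- c(7387, 5478)     # "Yes" responses: above median, at or below median
n <- c(14676, 17056)   # group sizes

p_hat  <- x / n                 # sample proportions
p_pool <- sum(x) / sum(n)       # pooled proportion under H0

# Pooled standard error and z statistic for the difference in proportions
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- (p_hat[1] - p_hat[2]) / se_pool
z

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

# Cross-check: the square root of prop.test()'s chi-square statistic
z_check <- sqrt(unname(prop.test(x, n, correct = FALSE)$statistic))
```

The hand-computed statistic should agree with the reported z of about 32.95 up to rounding.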

inference(y = abany, x = educ_level, data = my_data, statistic = "proportion", type = "ci", method = "theoretical", success = "Yes", conf_level = 0.99)

## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Above median educ = 14676, p_hat_Above median educ = 0.5033
## n_Below median educ = 17056, p_hat_Below median educ = 0.3212
## 99% CI (Above median educ - Below median educ): (0.1681 , 0.1962)

The results show that the 99% confidence interval for the difference in the population proportions is (0.1681, 0.1962). We can therefore be 99 percent confident that the proportion of Americans with above-median years of education who think it should be possible for a pregnant woman to obtain a legal abortion if she wants it for any reason is 16.8 to 19.6 percentage points higher than the corresponding proportion of Americans with median or fewer years of education. In other words, Americans with more education are more likely to approve of legal abortion than less educated Americans.
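The interval can also be reproduced by hand from the crosstab counts. The sketch below assumes the standard unpooled (Wald) standard error for a difference in proportions; the variable names are introduced here for illustration, and the calculation is meant as a check on the reported numbers rather than as a description of the internals of `inference()`.

```r
# Counts taken from the crosstab above
x <- c(7387, 5478)     # "Yes" responses: above median, at or below median
n <- c(14676, 17056)   # group sizes
p_hat <- x / n

# Point estimate of the difference (above median minus at or below median)
point_est <- p_hat[1] - p_hat[2]

# Unpooled standard error of the difference
se <- sqrt(sum(p_hat * (1 - p_hat) / n))

# Critical value for a 99% confidence level
z_star <- qnorm(1 - 0.01 / 2)

# Confidence interval: point estimate plus or minus the margin of error
ci <- point_est + c(-1, 1) * z_star * se
round(ci, 4)
```

The bounds should match the interval reported above up to rounding.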