Identify a research question (or a set of questions) similar to the questions we’ve discussed in this class, and a dataset that can be used to answer it.

Your project will consist of the following parts:

1. Executive summary.

A short executive summary describing your research question(s), data, and conclusions.

2. Research question(s).

Come up with one or more research questions involving at least three variables and at least one relationship between variables. Briefly discuss why this question is of interest to you and/or your audience. One option here is to perform a multiple regression analysis focusing on a numeric response variable, an explanatory variable, and a set of control variables. Another option is to compare categorical outcomes of interest across different groups. Examples of similar work are provided.

3. Data.

For this project, you can use one of the datasets provided for this course or find one from an external source. Using the dataset’s documentation, describe the dataset, including the variables and how the sample was collected. Briefly discuss the implications of the data collection method for the scope of inference (generalizability and causality).

4. Exploratory Data Analysis.

Conduct an exploratory data analysis to address the question(s) of interest. The EDA should include graphs and numerical summaries. Provide interpretation of each output.
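As a rough illustration of what this part might contain, here is a minimal sketch in base R. It uses the built-in `mtcars` data purely as a stand-in (an assumption, not one of the course datasets); substitute your own data frame and variables.

```r
# Stand-in data: the built-in mtcars dataset (not a course dataset).
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Numerical summary: mean, standard deviation, and group size of mpg
# for each transmission type.
mpg_summary <- aggregate(
  mpg ~ am, data = cars,
  FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)
print(mpg_summary)

# Graph 1: distribution of the response across the groups of interest.
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon",
        main = "MPG by transmission type")

# Graph 2: response against a possible control variable, colored by group.
plot(mpg ~ wt, data = cars, col = as.integer(cars$am), pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "MPG vs. weight")
legend("topright", legend = levels(cars$am), col = 1:2, pch = 19)
```

In the report, each output like these would be followed by a sentence or two interpreting it in terms of the research question.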

5. Inference.

Conduct inferential analysis using an appropriate technique.
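The two options mentioned under the research question part could look roughly like the sketch below. It relies on the built-in `mtcars` and `UCBAdmissions` datasets as stand-ins (an assumption, not course data), and the choice of control variables is only illustrative.

```r
# Option A: multiple regression with a numeric response, an explanatory
# variable of interest, and control variables (stand-in data: mtcars).
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

fit <- lm(mpg ~ am + wt + hp, data = cars)   # wt and hp act as controls
print(summary(fit)$coefficients)             # estimates, SEs, t, p-values
print(confint(fit, "ammanual", level = 0.95)) # CI for the adjusted difference

# Option B: compare a categorical outcome across groups
# (stand-in data: UCBAdmissions, collapsed over departments).
tab <- margin.table(UCBAdmissions, c(1, 2))  # Admit x Gender counts
print(prop.table(tab, margin = 2))           # outcome proportions by gender
chi <- chisq.test(tab)                       # test of independence
print(chi)
```

Whichever technique you use, check its conditions (for example, residual plots for regression, expected cell counts for the chi-square test) before interpreting the results.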

6. Results.

Describe your findings and conclusions in the context of your research question.

The analysis must be completed using R via RStudio, and your work must be in an R Markdown document. To help you get started, I can provide a template Rmd file.


GROUP WORK

It is not required, but this assignment can be done as a group project. Feel free to team up with your classmates to work on one project in a group of up to three people. Every member of a group should contribute to the group’s work.


TIMELINE


CONSULTATIONS

To receive help and complete this project successfully, consult with the instructor frequently.


Grading Criteria

RQ

  1. The research question(s) are well defined / not vague (Yes-1 point, No-0 points) (“Well defined” means it is obvious from the research questions which variables will be involved in the analysis.)

  2. It is clear why the research question(s) is(are) of interest to the author and/or the reader (Yes-1 point, No-0 points)

  3. The question(s) involve three or more variables (Yes-1 point, No-0 points)

  4. The author described the sampling method and potential sources of bias, and explained their implications for generalizability (Yes-1 point, No-0 points)

  5. The author discussed whether causality can be inferred from the analysis (Yes-1 point, No-0 points)

EDA

  1. The author did an exploratory data analysis (Yes-2 points, Needs more work - 1 point, No-0 points)

  2. The plots are constructed correctly and address the research question (Yes-2 points, Needs more work - 1 point, No-0 points)

  3. The summary statistics are calculated correctly and address the research question (Yes-2 points, Needs more work - 1 point, No-0 points)

  4. Each exploratory output is accompanied by a narrative that interprets the visuals and summary statistics correctly (Yes-2 points, Needs more work - 1 point, No-0 points)

INFERENCE

  1. The author performed the inferential analysis correctly (Yes-2 points, Needs more work - 1 point, No-0 points)

OVERALL

  1. The author answered the research questions or explained why the question(s) is (are) not answerable (Yes-2 points, Needs more work - 1 point, No-0 points)

  2. The report included an executive summary (Yes-1 point, No-0 points)

  3. The report is done in Rmd (knitr) and submitted as an HTML or PDF document (Yes-1 point, No-0 points)

  4. The report is clear, readable, well organized (Yes-1 point, No-0 points)

  5. Does this work deserve up to +2 extra points for effort? Does the report merely try to meet the bare minimum standards for this assignment (+0 points), or is it a good-faith effort to honor the intent of the assignment? Please explain.




AVAILABLE DATASETS




DEFAULT PROJECT SCENARIO

Instead of coming up with a unique topic of your own, you can choose to proceed with the default project scenario described below.

Social Justice & Pay Equity

The following is an extract from a 2015 White House brief on the gender pay gap:

Over the past century, American women have made tremendous strides in increasing their labor market experience and their skills. Today, women account for 47 percent of the labor force and they hold 49.3 percent of jobs. Women’s share of the labor force has been rising for more than 50 years and is continuing to increase. Today more households than ever have a woman as the primary or equal breadwinner in the household. … On Equal Pay Day, however, we focus on a stubborn and troubling fact: Despite women’s gains a large gender pay gap still exists.

Barack Obama in a State of the Union address:

“Today, women make up about half our workforce. But they still make 77 cents for every dollar a man earns. That is wrong. And in 2014, it’s an embarrassment. Women deserve equal pay for equal work.”

Does that mean that women are receiving lower pay for equal work irrespective of the level of qualification? You work for a federal agency and want to find out if this is possibly the case in your agency. Use the provided random sample of federal personnel records from 2008 to answer this or similar pay equity-related questions. For example, you can address differences in pay based on gender, race, ethnicity, and/or other employee characteristics. You can analyze the data to find out whether there are differences in pay across the groups of interest, what could explain those differences, and what could close any pay gaps you find. Choose an agency that interests you from those represented in the dataset, or analyze the federal government as a whole. You can also focus on comparing the differences of interest between 2008 and 1994 using the 1994-95 federal employee records (dataset ‘OPM94’).
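One possible way to separate a raw pay gap from a gap "for equal work" is sketched below. The data are simulated placeholders and every column name (`sex`, `education_years`, `tenure_years`, `salary`) is hypothetical; none of this reflects the actual personnel records or their variable names, so adapt it to the real dataset's documentation.

```r
# Simulated placeholder records, NOT the course dataset.
# All column names here are hypothetical.
set.seed(1)
n <- 500
toy <- data.frame(
  sex             = factor(sample(c("female", "male"), n, replace = TRUE)),
  education_years = sample(12:20, n, replace = TRUE),
  tenure_years    = sample(0:30, n, replace = TRUE)
)
# Salary is generated from qualifications plus noise only, so any gap
# seen in this toy data is due to chance.
toy$salary <- exp(10 + 0.05 * toy$education_years +
                    0.01 * toy$tenure_years + rnorm(n, sd = 0.2))

# Unadjusted comparison: ratio of median salaries (analogous to the
# "cents on the dollar" figure quoted above).
raw_ratio <- median(toy$salary[toy$sex == "female"]) /
             median(toy$salary[toy$sex == "male"])
print(raw_ratio)

# Adjusted comparison: regress log salary on sex while controlling for
# qualifications. exp(coefficient) - 1 approximates the proportional
# difference in pay between groups with the same controls.
fit_pay <- lm(log(salary) ~ sex + education_years + tenure_years, data = toy)
adj_gap <- exp(coef(fit_pay)["sexmale"]) - 1
print(adj_gap)
print(confint(fit_pay, "sexmale"))
```

Comparing the unadjusted ratio with the adjusted estimate shows how much of a raw difference is accounted for by the control variables; it does not by itself establish causality.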

A random sample of 9,000 federal personnel records for 2008:

Variables:

Another default alternative for the project is to compare changes in Americans’ attitudes on a range of issues between 1998 and 2016 using data from the General Social Survey, which you already know from this class. I will provide the 2016 dataset upon request.




Data Analysis Project Examples

These examples might help you get an idea of what a project report should look like. Your project doesn’t have to be as big and detailed as the examples, and you are not required to use all the techniques they use.

Regression Models Project: Do Manual or Automatic Transmissions Provide better MPG?

Exploring links between sleep duration and individual health and behaviors.

Statistical inference with the GSS data

Predicting movie ratings