Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

setwd("E:/Coursera/Statistics with R - Duke University/03_LinearRegression/Project")
load("movies.Rdata")

Part 1: Data

According to the project documentation, the data for this project come from Rotten Tomatoes and IMDB and include 651 randomly sampled movies produced and released before 2016. Those observational data describe 32 movie characteristics, which are provided in the codebook. The following variables are available in the dataset:

names(movies)
##  [1] "title"            "title_type"       "genre"           
##  [4] "runtime"          "mpaa_rating"      "studio"          
##  [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
## [10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
## [16] "critics_score"    "audience_rating"  "audience_score"  
## [19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [22] "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"          
## [28] "actor3"           "actor4"           "actor5"          
## [31] "imdb_url"         "rt_url"
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

As follows from the information provided above, the data for the project were collected using random sampling. Given the random sampling design, the sample should be representative of the whole population of movies listed on IMDB and Rotten Tomatoes, and conclusions derived from analyses of this sample can be extended to that larger population. Also, the sample size (n = 651) is large enough to achieve a reasonably small margin of error.

Because this is not an experimental design and the sample contains only observational data, the scope of the analysis is limited to determining associations between the explanatory and response variables; causal relationships cannot be established. Still, a model developed from such data can be suitable for making predictions.


Part 2: Research question

Among the available variables, the dataset includes the IMDB rating of each movie and the number of votes it received on IMDB. The following gives a brief summary of those two numeric variables:

summary(movies$imdb_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.493   7.300   9.000
summary(movies$imdb_num_votes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     180    4546   15120   57530   58300  893000

According to its website, IMDb is "the world's most popular and authoritative source for movie, TV and celebrity content." Therefore, the IMDB Rating and Number of Votes can be used in this analysis as measures of movie popularity.

It would be interesting to find out whether other movie characteristics in the dataset can be used to predict movie popularity. Such predictions could help movie lovers, as well as businesses that screen movies, select titles with the highest potential. Therefore, this project focuses on the following two questions:

Research question 1: Is there a relationship between the IMDB Rating and other movie characteristics, such as the studio that produced the movie, the critics rating, the audience rating, whether the movie won a best picture Oscar, and whether the director of the movie has an Oscar? In other words, can the aforementioned variables be used to predict the IMDB Rating of a movie?

Research question 2: Is there a relationship between the Number of Votes a movie gets on IMDB and other movie characteristics, such as the studio that produced the movie, the critics rating, the audience rating, whether the movie won a best picture Oscar, and whether the director of the movie has an Oscar? In other words, can the aforementioned variables be used to predict the number of votes a movie gets on IMDB?


Part 3: Exploratory data analysis

The following analysis explores patterns in the relationships between the two response variables specified in the research questions and several movie characteristics, to get a better understanding of the data and guide the development of a predictive model.

First, it is reasonable to assume that different movie studios produce movies of different quality and, as a result, of different popularity. More specifically, it can be argued that the more movies a studio produces, the more production experience it accumulates, and the more popular movies it can create. Therefore, it is interesting to see how the studios in the dataset differ in productivity and how movie popularity varies across studios that produced different numbers of movies in the given period.

The following summary shows that, indeed, studios vary greatly in how many movies they produced and released between 1970 and 2014:

summary(movies$thtr_rel_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1970    1990    2000    1998    2007    2014
head(movies %>% group_by(studio) %>% summarize(nmovies=n()) %>% arrange(desc(nmovies)))
## # A tibble: 6 x 2
##                             studio nmovies
##                             <fctr>   <int>
## 1               Paramount Pictures      37
## 2            Warner Bros. Pictures      30
## 3 Sony Pictures Home Entertainment      27
## 4               Universal Pictures      23
## 5                Warner Home Video      19
## 6                 20th Century Fox      18
tail(movies %>% group_by(studio) %>% summarize(nmovies=n()) %>% arrange(desc(nmovies)))
## # A tibble: 6 x 2
##                      studio nmovies
##                      <fctr>   <int>
## 1      Warner Bros Pictures       1
## 2        Warner Independent       1
## 3    Warners Bros. Pictures       1
## 4                   Winstar       1
## 5 Yari Film Group Releasing       1
## 6           Zeitgeist Films       1

As we can see, some studios have produced tens of movies, while many have released only one. Thus, the studios indeed vary in productivity.

To explore whether there is a relationship between a movie's popularity and the productivity of its studio, a new variable describing that productivity is needed. The following code creates it by counting the number of movies each studio produced and joining those counts back to the dataset:

studio_counts <- movies %>% group_by(studio) %>% summarize(nmovies=n())
summary(studio_counts$nmovies)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   3.071   3.000  37.000
hist(studio_counts$nmovies, breaks = 0:40)

movies2 <- movies %>% left_join(studio_counts, by = "studio")

The variable describing the number of movies produced by each studio is called nmovies, and the histogram above shows its distribution. The new variable is contained in the newly created dataset movies2.

Because the distribution of the number of movies produced per studio is broad and heavily right-skewed, it is reasonable to transform this variable into a factor with three levels: (1) studios that have produced only one movie, (2) studios that have produced two to ten movies, and (3) studios that have produced more than ten movies. The variable will be called productivity:

movies2 <- movies2 %>% mutate(productivity = cut(nmovies, breaks = c(0, 1.5, 10.5, 40), labels = c("only1", "2-10", "11plus")))

Now, the following numeric and graphical summaries show the relationship between the studio productivity and IMDB Rating:

movies2 %>% group_by(productivity) %>% summarise(Mean_imdb_rating = mean(imdb_rating), Median_imdb_rating = median(imdb_rating))
## # A tibble: 3 x 3
##   productivity Mean_imdb_rating Median_imdb_rating
##         <fctr>            <dbl>              <dbl>
## 1        only1         6.662879                6.9
## 2         2-10         6.507168                6.6
## 3       11plus         6.383333                6.4
ggplot(movies2, aes(productivity, imdb_rating)) + geom_boxplot() + labs(y="IMDB Rating")

The summaries below describe how the Number of Votes on IMDB varies across movies produced by studios of different productivity:

movies2 %>% group_by(productivity) %>% summarise(Mean_imdb_num_votes = mean(imdb_num_votes), Median_imdb_num_votes = median(imdb_num_votes))
## # A tibble: 3 x 3
##   productivity Mean_imdb_num_votes Median_imdb_num_votes
##         <fctr>               <dbl>                 <dbl>
## 1        only1            28647.57                7450.0
## 2         2-10            53625.39               15714.0
## 3       11plus            77962.53               24918.5
ggplot(movies2, aes(productivity, imdb_num_votes)) + geom_boxplot() + labs(y="Number of Votes on IMDB")

As we can see, in this sample the movies produced by highly productive studios receive, on average, more votes on IMDB. Interestingly, however, they also tend to have lower average IMDB ratings. Consistent with this, movies from more experienced studios also receive lower average critics and audience scores on Rotten Tomatoes:

movies2 %>% group_by(productivity) %>% summarise(Mean_critics_score = mean(critics_score), Median_critics_score = median(critics_score))
## # A tibble: 3 x 3
##   productivity Mean_critics_score Median_critics_score
##         <fctr>              <dbl>                <dbl>
## 1        only1           64.06818                   72
## 2         2-10           58.83513                   63
## 3       11plus           52.84583                   54
ggplot(movies2, aes(productivity, critics_score)) + geom_boxplot() + labs(y="Critics Score on Rotten Tomatoes")

movies2 %>% group_by(productivity) %>% summarise(Mean_audience_score = mean(audience_score), Median_audience_score = median(audience_score))
## # A tibble: 3 x 3
##   productivity Mean_audience_score Median_audience_score
##         <fctr>               <dbl>                 <dbl>
## 1        only1            66.21970                    72
## 2         2-10            62.87455                    65
## 3       11plus            59.64583                    61
ggplot(movies2, aes(productivity, audience_score)) + geom_boxplot() + labs(y="Audience Score on Rotten Tomatoes")

At the same time, a higher proportion (4.6 percent vs. 1.4 percent vs. 0.0 percent) of movies produced by more experienced studios appear in the Top 200 Box Office list on BoxOfficeMojo; in other words, they generate more revenue:

movies2 %>% group_by(productivity) %>% summarise(n.Count=n(), top200box.Yes.Proportion = sum(top200_box == "yes")/n())
## # A tibble: 3 x 3
##   productivity n.Count top200box.Yes.Proportion
##         <fctr>   <int>                    <dbl>
## 1        only1     132               0.00000000
## 2         2-10     279               0.01433692
## 3       11plus     240               0.04583333
movies2 %>% ggplot(aes(x = productivity, fill = top200_box)) + geom_bar(position="fill") + labs(y = "Proportion")

Besides the studio's experience, the popularity of a movie may also depend on its own and its director's success in winning an Oscar. The following boxplots show that, in this sample, there are relationships between the IMDB Rating and whether the movie was nominated for a best picture Oscar, whether it won a best picture Oscar, and whether its director ever won an Oscar:

ggplot(movies2, aes(best_pic_win, imdb_rating)) + geom_boxplot() + labs(x="Movie won a best picture Oscar")

ggplot(movies2, aes(best_pic_nom, imdb_rating)) + geom_boxplot() + labs(x="Movie nominated for a best picture Oscar")

ggplot(movies2, aes(best_dir_win, imdb_rating)) + geom_boxplot() + labs(x="Director of the movie ever won an Oscar ")

Finally, it's important to explore how critics and audience scores are related to movie popularity:

qplot(critics_score, imdb_rating, data = movies ) + geom_smooth(method = "lm")

qplot(audience_score, imdb_rating, data = movies ) + geom_smooth(method = "lm")

qplot(critics_score, imdb_num_votes, data = movies ) + geom_smooth(method = "lm")

qplot(audience_score, imdb_num_votes, data = movies ) + geom_smooth(method = "lm")

As the scatterplots above show, there are apparent relationships between the score variables and movie popularity in this sample: the movies that have higher critic and audience scores tend to be more popular.


Part 4: Modeling

The exploratory data analysis indicates that the following variables might be potential predictors of the IMDB Rating and IMDB Number of Votes:
productivity: Studio productivity
genre: Genre of movie
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar
best_pic_win: Whether or not the movie won a best picture Oscar
best_dir_win: Whether or not the director of the movie ever won an Oscar
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)

Other variables are not included in the model, as there is no theoretical justification to expect that they could predict movie popularity. For example, the runtime of a movie (in minutes), its MPAA rating (G, PG, PG-13, R, Unrated), or the year it was released in theaters should all be irrelevant to its quality and popularity. Also, from each pair of variables that describe the critics or audience evaluation as both a rating and a score, only one variable was selected (the one that produced the higher adjusted R-squared), because including both variables from a pair would introduce high collinearity.
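The pair-selection rule just described can be sketched as a small helper: fit one simple model per candidate predictor and keep the one with the higher adjusted R-squared. The sketch below is illustrative only, not the code used in this project; it runs on the built-in mtcars data (comparing the correlated pair disp and wt as predictors of mpg) so it is self-contained, but the same comparison applies to, e.g., critics_score versus critics_rating.

```r
# Illustrative helper (not part of the original analysis): choose the
# predictor from a collinear pair that yields the higher adjusted
# R-squared in a simple linear model. Demonstrated on built-in mtcars.
pick_better <- function(response, a, b, data) {
  adj_r2 <- function(p) summary(lm(reformulate(p, response), data = data))$adj.r.squared
  if (adj_r2(a) >= adj_r2(b)) a else b
}

# disp and wt are strongly correlated predictors of mpg; keep the better one.
pick_better("mpg", "disp", "wt", mtcars)
# → "wt"
```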

The analysis relies on backward elimination for the stepwise model selection process. All candidate predictors are included in the initial model, and predictors are then dropped one at a time until a parsimonious model is reached. The model with the highest adjusted R-squared is reported, which favors predictive capability and avoids the arbitrariness associated with p-value cutoffs.
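The elimination loop just described can also be automated. The following is a minimal sketch, not the code used in this project: at each step it refits the model with each remaining predictor dropped, keeps the drop that most improves adjusted R-squared, and stops when no single drop helps. It is demonstrated on the built-in mtcars data so it runs standalone; the same function could in principle be applied to the movies2 models below.

```r
# Minimal sketch of backward elimination by adjusted R-squared
# (illustrative only; demonstrated on built-in mtcars data).
adj_r2 <- function(fit) summary(fit)$adj.r.squared

backward_adjr2 <- function(formula, data) {
  fit <- lm(formula, data = data)
  repeat {
    preds <- attr(terms(fit), "term.labels")
    if (length(preds) <= 1) break
    # refit with each remaining predictor dropped in turn
    candidates <- lapply(preds, function(p) update(fit, as.formula(paste(". ~ . -", p))))
    gains <- sapply(candidates, adj_r2)
    if (max(gains) <= adj_r2(fit)) break  # no single drop improves adjusted R-squared
    fit <- candidates[[which.max(gains)]]
  }
  fit
}

final <- backward_adjr2(mpg ~ cyl + disp + hp + wt + qsec + am, mtcars)
adj_r2(final)
```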

The full model specification and summary for each of the two measures of movie popularity is as follows:

model_imdb_rating <- lm(imdb_rating ~ productivity + genre + best_pic_nom + best_pic_win + best_dir_win + critics_rating + audience_score, data = movies2)
summary(model_imdb_rating)
## 
## Call:
## lm(formula = imdb_rating ~ productivity + genre + best_pic_nom + 
##     best_pic_win + best_dir_win + critics_rating + audience_score, 
##     data = movies2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6426 -0.1997  0.0470  0.2830  1.1781 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.030499   0.132986  30.308  < 2e-16 ***
## productivity2-10                0.067078   0.054556   1.230 0.219330    
## productivity11plus              0.126867   0.057373   2.211 0.027376 *  
## genreAnimation                 -0.480665   0.179026  -2.685 0.007446 ** 
## genreArt House & International  0.206607   0.148889   1.388 0.165730    
## genreComedy                    -0.180184   0.082365  -2.188 0.029062 *  
## genreDocumentary                0.393600   0.102183   3.852 0.000129 ***
## genreDrama                      0.152302   0.070565   2.158 0.031278 *  
## genreHorror                     0.118966   0.122651   0.970 0.332440    
## genreMusical & Performing Arts  0.155128   0.161159   0.963 0.336128    
## genreMystery & Suspense         0.355421   0.091235   3.896 0.000108 ***
## genreOther                      0.022752   0.142035   0.160 0.872784    
## genreScience Fiction & Fantasy -0.179156   0.179572  -0.998 0.318816    
## best_pic_nomyes                 0.090148   0.128727   0.700 0.483993    
## best_pic_winyes                 0.107300   0.228666   0.469 0.639058    
## best_dir_winyes                 0.147848   0.085579   1.728 0.084544 .  
## critics_ratingFresh            -0.082318   0.058579  -1.405 0.160435    
## critics_ratingRotten           -0.355665   0.065045  -5.468 6.56e-08 ***
## audience_score                  0.039341   0.001316  29.893  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5015 on 632 degrees of freedom
## Multiple R-squared:  0.7922, Adjusted R-squared:  0.7862 
## F-statistic: 133.8 on 18 and 632 DF,  p-value: < 2.2e-16
model_imdb_num_votes <- lm(imdb_num_votes ~ productivity + genre + best_pic_win +best_pic_nom + best_dir_win + critics_rating + audience_score + imdb_rating, data = movies2)
summary(model_imdb_num_votes)
## 
## Call:
## lm(formula = imdb_num_votes ~ productivity + genre + best_pic_win + 
##     best_pic_nom + best_dir_win + critics_rating + audience_score + 
##     imdb_rating, data = movies2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -279368  -40618  -11143   23865  628159 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -88411.4    37432.0  -2.362 0.018483 *  
## productivity2-10                 15206.0     9815.5   1.549 0.121840    
## productivity11plus               30075.1    10349.9   2.906 0.003791 ** 
## genreAnimation                  -36150.8    32354.3  -1.117 0.264273    
## genreArt House & International  -87096.1    26796.3  -3.250 0.001214 ** 
## genreComedy                     -29155.9    14857.1  -1.962 0.050152 .  
## genreDocumentary               -120308.0    18576.8  -6.476 1.89e-10 ***
## genreDrama                      -46712.7    12727.3  -3.670 0.000263 ***
## genreHorror                     -35307.4    22057.0  -1.601 0.109936    
## genreMusical & Performing Arts -102693.8    28981.7  -3.543 0.000424 ***
## genreMystery & Suspense         -12582.8    16590.8  -0.758 0.448483    
## genreOther                       13108.6    25524.4   0.514 0.607732    
## genreScience Fiction & Fantasy   15673.4    32294.8   0.485 0.627616    
## best_pic_winyes                 170450.0    41098.9   4.147 3.82e-05 ***
## best_pic_nomyes                  61796.6    23141.4   2.670 0.007771 ** 
## best_dir_winyes                  16138.6    15415.0   1.047 0.295528    
## critics_ratingFresh             -83429.9    10543.1  -7.913 1.13e-14 ***
## critics_ratingRotten            -59581.1    11962.0  -4.981 8.18e-07 ***
## audience_score                     115.9      367.4   0.315 0.752595    
## imdb_rating                      32729.4     7148.2   4.579 5.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90130 on 631 degrees of freedom
## Multiple R-squared:  0.3728, Adjusted R-squared:  0.3539 
## F-statistic: 19.74 on 19 and 631 DF,  p-value: < 2.2e-16

For the IMDB Rating, two predictors - productivity and best_pic_win - were dropped in the stepwise selection process, giving the following, more parsimonious model with essentially the same adjusted R-squared:

model_imdb_rating <- lm(imdb_rating ~ best_pic_nom + genre + best_dir_win + critics_rating + audience_score, data = movies2)
summary(model_imdb_rating)
## 
## Call:
## lm(formula = imdb_rating ~ best_pic_nom + genre + best_dir_win + 
##     critics_rating + audience_score, data = movies2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7221 -0.1899  0.0354  0.2822  1.1683 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.119702   0.124078  33.203  < 2e-16 ***
## best_pic_nomyes                 0.128985   0.116117   1.111 0.267065    
## genreAnimation                 -0.467160   0.179220  -2.607 0.009359 ** 
## genreArt House & International  0.182437   0.148700   1.227 0.220321    
## genreComedy                    -0.177382   0.082422  -2.152 0.031766 *  
## genreDocumentary                0.353579   0.100651   3.513 0.000475 ***
## genreDrama                      0.139499   0.070447   1.980 0.048113 *  
## genreHorror                     0.098208   0.122445   0.802 0.422818    
## genreMusical & Performing Arts  0.141752   0.160737   0.882 0.378171    
## genreMystery & Suspense         0.348215   0.091259   3.816 0.000149 ***
## genreOther                      0.014260   0.142024   0.100 0.920056    
## genreScience Fiction & Fantasy -0.196169   0.179511  -1.093 0.274897    
## best_dir_winyes                 0.178683   0.081378   2.196 0.028472 *  
## critics_ratingFresh            -0.091938   0.058063  -1.583 0.113824    
## critics_ratingRotten           -0.353055   0.064910  -5.439 7.65e-08 ***
## audience_score                  0.039296   0.001317  29.827  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5024 on 635 degrees of freedom
## Multiple R-squared:  0.7904, Adjusted R-squared:  0.7855 
## F-statistic: 159.7 on 15 and 635 DF,  p-value: < 2.2e-16
confint(model_imdb_rating, level = 0.95)
##                                       2.5 %      97.5 %
## (Intercept)                     3.876049563  4.36335391
## best_pic_nomyes                -0.099034042  0.35700429
## genreAnimation                 -0.819096553 -0.11522360
## genreArt House & International -0.109565390  0.47443940
## genreComedy                    -0.339235089 -0.01552806
## genreDocumentary                0.155930183  0.55122786
## genreDrama                      0.001161312  0.27783732
## genreHorror                    -0.142237615  0.33865373
## genreMusical & Performing Arts -0.173887982  0.45739181
## genreMystery & Suspense         0.169009868  0.52741977
## genreOther                     -0.264633589  0.29315273
## genreScience Fiction & Fantasy -0.548675762  0.15633741
## best_dir_winyes                 0.018881295  0.33848532
## critics_ratingFresh            -0.205955754  0.02208071
## critics_ratingRotten           -0.480519412 -0.22558995
## audience_score                  0.036708899  0.04188304

The regression output shows that genre, best_dir_win, critics_rating, and audience_score are statistically significant predictors of a movie's IMDB Rating at the 0.05 significance level.

For the genre variable, the reference category is Action & Adventure. The estimates indicate that, all else held constant, Animation movies receive an average IMDB rating about 0.47 points lower, and Documentaries about 0.35 points higher, than Action & Adventure movies. Also, all else held constant, movies with a Rotten critics rating receive, on average, an IMDB rating about 0.35 points lower than movies with a Certified Fresh rating. Finally, we can be 95 percent confident that, in the population of movies, the average IMDB rating of otherwise comparable movies increases by between 0.037 and 0.042 as the audience score increases by one unit.

For the IMDB Number of Votes, only one variable, audience_score, was dropped to achieve the highest adjusted R-squared. The parsimonious model is as follows:

model_imdb_num_votes <- lm(imdb_num_votes ~ productivity + genre + best_pic_win +best_pic_nom + best_dir_win + critics_rating +  imdb_rating, data = movies2)
summary(model_imdb_num_votes)
## 
## Call:
## lm(formula = imdb_num_votes ~ productivity + genre + best_pic_win + 
##     best_pic_nom + best_dir_win + critics_rating + imdb_rating, 
##     data = movies2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -279365  -40598  -11249   24205  627791 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -91817      35815  -2.564 0.010588 *  
## productivity2-10                  15028       9792   1.535 0.125373    
## productivity11plus                29821      10311   2.892 0.003958 ** 
## genreAnimation                   -35122      32166  -1.092 0.275303    
## genreArt House & International   -87183      26776  -3.256 0.001190 ** 
## genreComedy                      -28915      14827  -1.950 0.051599 .  
## genreDocumentary                -120217      18561  -6.477 1.88e-10 ***
## genreDrama                       -46766      12717  -3.677 0.000256 ***
## genreHorror                      -35921      21956  -1.636 0.102325    
## genreMusical & Performing Arts  -102156      28911  -3.533 0.000440 ***
## genreMystery & Suspense          -13313      16417  -0.811 0.417724    
## genreOther                        13234      25503   0.519 0.603992    
## genreScience Fiction & Fantasy    15554      32270   0.482 0.629984    
## best_pic_winyes                  170124      41057   4.144 3.88e-05 ***
## best_pic_nomyes                   62198      23090   2.694 0.007254 ** 
## best_dir_winyes                   15997      15398   1.039 0.299228    
## critics_ratingFresh              -83684      10505  -7.966 7.64e-15 ***
## critics_ratingRotten             -60154      11815  -5.091 4.70e-07 ***
## imdb_rating                       34455       4598   7.494 2.26e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90060 on 632 degrees of freedom
## Multiple R-squared:  0.3727, Adjusted R-squared:  0.3548 
## F-statistic: 20.86 on 18 and 632 DF,  p-value: < 2.2e-16
confint(model_imdb_num_votes, level = 0.95)
##                                      2.5 %      97.5 %
## (Intercept)                    -162147.512 -21486.4966
## productivity2-10                 -4201.669  34256.6066
## productivity11plus                9572.792  50069.2704
## genreAnimation                  -98287.476  28044.1384
## genreArt House & International -139763.622 -34602.9365
## genreComedy                     -58030.172    201.1198
## genreDocumentary               -156666.128 -83767.4845
## genreDrama                      -71738.525 -21792.6087
## genreHorror                     -79035.085   7194.0090
## genreMusical & Performing Arts -158929.389 -45383.1497
## genreMystery & Suspense         -45550.973  18925.7493
## genreOther                      -36846.823  63315.4895
## genreScience Fiction & Fantasy  -47815.140  78922.0497
## best_pic_winyes                  89499.708 250747.2783
## best_pic_nomyes                  16855.465 107539.9894
## best_dir_winyes                 -14239.301  46233.5299
## critics_ratingFresh            -104312.516 -63055.3116
## critics_ratingRotten            -83355.116 -36952.1709
## imdb_rating                      25426.471  43482.8761

Model Diagnostics

To ensure that the obtained models are valid and generate credible estimates, this section conducts diagnostics that verify the models meet the conditions required for multiple linear regression (MLR).

The first condition requires a linear relationship between the response variable and each numerical explanatory variable. This can be checked using residual plots of e against each x:

plot(model_imdb_rating$residuals ~ movies2$audience_score)

plot(model_imdb_num_votes$residuals ~ movies2$audience_score)

plot(model_imdb_num_votes$residuals ~ movies2$imdb_rating)

For the IMDB Rating model, the residuals are randomly scattered around zero, which suggests the condition is met. For the IMDB Number of Votes model, the scatter around zero for imdb_rating does not look completely random, although it is quite close, so the condition is only fairly satisfied.

The second condition requires a nearly normal distribution of residuals centered at 0, which can be verified using histograms and normal Q-Q plots.

hist(model_imdb_rating$residuals)

qqnorm(model_imdb_rating$residuals)
qqline(model_imdb_rating$residuals)

hist(model_imdb_num_votes$residuals)

qqnorm(model_imdb_num_votes$residuals)
qqline(model_imdb_num_votes$residuals)

As the graphs show, this condition is satisfied for the IMDB Rating model and, again, only fairly to weakly satisfied for the IMDB Number of Votes model.

The third condition requires constant variability of residuals (homoscedasticity). Plotting the residuals against the fitted values allows one to check this condition:

plot(model_imdb_rating$residuals ~ model_imdb_rating$fitted)

plot(model_imdb_num_votes$residuals ~ model_imdb_num_votes$fitted)

It looks like the condition is fairly satisfied for the IMDB Rating model and, at most, weakly satisfied for the IMDB Number of Votes model.

The final condition, independence of residuals, should be satisfied because the data were collected using random sampling, and there is no reason to suspect a time series structure in the data.


Part 5: Prediction

In this section, I use the selected models to predict the popularity of the 2016 movie Nocturnal Animals. The data for this movie come from IMDB and Rotten Tomatoes.

The movie is a drama released by the studio Focus Features, which falls in the 2-10 productivity range; it did not win a best picture Oscar but was nominated, and its director had never won an Oscar. The movie has a Fresh critics rating, an audience score of 73, and an IMDB rating of 7.5 with 155,471 votes.

To use the predictive models, those characteristics need to be put in a data frame:

movie2016 <- data.frame(genre = "Drama", productivity = "2-10", best_pic_win = "no", best_pic_nom = "yes", best_dir_win = "no", critics_rating = "Fresh", audience_score = 73, imdb_rating = 7.5, stringsAsFactors = FALSE)

The following code generates the predictions:

predict(model_imdb_rating, movie2016)
##        1 
## 7.164855
predict(model_imdb_num_votes, movie2016)
##        1 
## 113368.8

The models predict an IMDB rating of 7.16 and roughly 113,369 IMDB votes for the movie. The prediction of the IMDB rating appears quite accurate, coming reasonably close to the actual rating of 7.5. The prediction of the number of votes is much less accurate (the actual count is 155,471), which was expected given that the corresponding model only weakly satisfies the MLR conditions.
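Point predictions like these can be supplemented with a prediction interval via the interval argument of predict(), which quantifies the uncertainty around a single new observation. The sketch below is illustrative only and uses the built-in mtcars data (with a hypothetical new car) so it runs standalone; the same interval = "prediction" call applies to model_imdb_rating and movie2016.

```r
# Illustrative only: a 95% prediction interval for one new observation.
fit <- lm(mpg ~ wt + hp, data = mtcars)
newcar <- data.frame(wt = 3.0, hp = 120)  # hypothetical new car
predict(fit, newcar, interval = "prediction", level = 0.95)
# returns a matrix with columns fit (point prediction), lwr, and upr
```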


Part 6: Conclusion

This analysis suggests that there are statistically significant relationships between movie popularity (measured as the IMDB Rating and the IMDB Number of Votes) and such movie characteristics as genre, whether the movie was nominated for or won a best picture Oscar, whether its director ever won an Oscar, the critics rating, and the audience score. Those characteristics can be used to model and predict IMDB Ratings with high accuracy (the model explains about 79 percent of the variation in IMDB Ratings) and the IMDB number of votes with lower accuracy (the model explains only about 35 percent of the variation in the response variable).

The analysis has several shortcomings worth noting. First, it relies on the limited number of variables provided in the dataset and is therefore constrained by it. For example, a useful measure of movie popularity would be the revenue a movie generates, operationalized as a numeric variable. However, the dataset provides only a two-level yes/no factor, top200_box, indicating whether the movie is in the Top 200 Box Office list on BoxOfficeMojo. The exploratory analysis showed that this measure might be related to other movie characteristics, but it was not included as a response variable because MLR is not an appropriate tool for analysing binary responses. Second, this is an observational study, which does not allow causal inferences. Finally, the model that predicts the IMDB number of votes is not very reliable, as it only weakly satisfies the MLR conditions.