Fork the repository for problem set 2, problem-set-2
(https://github.com/macss-ml20/problem-set-2). Remember, all final submissions should be a single rendered PDF with code produced in-line. Also, don't forget to open a pull request once you've committed your final submission to your forked repository. It must be merged back into the course master branch to be considered "submitted." See the syllabus for details.
Joe Biden was the 47th Vice President of the United States. He was the subject of many memes, attracted the attention of Leslie Knope, and experienced a brief surge in attention due to photos from his youth.
The goal here is to fit a regression model predicting feelings toward Biden, and then implement a couple validation techniques to evaluate the original findings. The validation techniques include the simple holdout approach and the bootstrap. Note: we are not covering cross validation (LOOCV or k-fold) in this problem set, as these topics are covered in the following week.
The nes2008.csv data contains a pared-down selection of features from the full 2008 American National Election Studies survey. These data will allow you to test competing factors that may influence attitudes toward Joe Biden. The variables are coded as follows:
- `biden` - feeling thermometer ranging from 0 to 100. Feeling thermometers are a common metric in survey research used to gauge attitudes or feelings of "warmth" toward individuals and institutions, with 0 indicating extreme "coldness" and 100 indicating extreme "warmth."
- `female` - 1 if respondent is female, 0 if respondent is male
- `age` - age of respondent in years
- `educ` - number of years of formal education completed by respondent
- `dem` - 1 if respondent is a Democrat, 0 otherwise
- `rep` - 1 if respondent is a Republican, 0 otherwise
For this exercise we consider the following functional form,

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon,$$

where $Y$ is the `biden` feeling thermometer and $X_1, \ldots, X_5$ are `female`, `age`, `educ`, `dem`, and `rep`, respectively. Including both the `dem` and `rep` party affiliation features allows the model to capture the preferences of Independents, who must be left out to serve as the baseline category; otherwise we would encounter perfect multicollinearity.
- (10 points) Estimate the MSE of the model using the traditional approach. That is, fit the linear regression model using the entire dataset and calculate the mean squared error for the entire dataset. Present and discuss your results at a simple, high level.
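The "traditional" approach can be sketched as follows. This is a minimal illustration, not a reference solution: it uses synthetic stand-in data (the variable codings mirror the description above, but the values, sample size, and coefficients are invented), and it fits OLS via `numpy.linalg.lstsq` rather than a regression package; for the actual problem set you would load `nes2008.csv` instead.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for nes2008.csv (hypothetical values; in practice,
# load the real data, e.g. with pandas.read_csv("nes2008.csv")).
n = 1000
female = rng.integers(0, 2, n)
age = rng.integers(18, 90, n)
educ = rng.integers(8, 21, n)
party = rng.integers(0, 3, n)            # 0 = Independent (baseline)
dem = (party == 1).astype(float)
rep = (party == 2).astype(float)
X = np.column_stack([np.ones(n), female, age, educ, dem, rep])
beta_true = np.array([60.0, 4.0, 0.05, -0.5, 15.0, -15.0])  # invented
y = X @ beta_true + rng.normal(0, 20, n)

# Fit OLS on the entire dataset, then compute MSE on that same data.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((y - X @ beta_hat) ** 2)
print(f"training MSE: {mse:.2f}")
```

Because the same observations are used to fit and to evaluate, this MSE is a training error and will tend to be optimistic relative to error on new data.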
- (30 points) Calculate the test MSE of the model using the simple holdout validation approach.
- (5 points) Split the sample set into a training set (50%) and a holdout set (50%). Be sure to set your seed prior to this part of your code to guarantee reproducibility of results.
- (5 points) Fit the linear regression model using only the training observations.
- (10 points) Calculate the MSE using only the test set observations.
  - (10 points) How does this value compare to the training MSE from question 1? Present a numeric comparison and briefly discuss.
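The three steps above (split, fit on training only, evaluate on the holdout) can be sketched as below. Again this uses invented synthetic data in place of `nes2008.csv`, and a bare-numpy OLS fit as an assumption about tooling:

```python
import numpy as np

rng = np.random.default_rng(0)  # set the seed first for reproducibility

# Hypothetical design matrix X (intercept column included) and outcome y;
# substitute the real nes2008 features here.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = X @ np.array([60.0, 4.0, 0.05, -0.5, 15.0, -15.0]) + rng.normal(0, 20, n)

# 50/50 split of the row indices into training and holdout sets.
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

# Fit using only the training observations...
beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# ...and calculate the MSE using only the test-set observations.
test_mse = np.mean((y[test] - X[test] @ beta_hat) ** 2)
print(f"test MSE: {test_mse:.2f}")
```

Setting the seed before the split is what makes the particular 50/50 partition, and hence the reported test MSE, reproducible.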
- (30 points) Repeat the simple validation set approach from the previous question 1000 times, using 1000 different splits of the observations into a training set and a test/validation set. Visualize your results as a sampling distribution (hint: think histogram or density plots). Comment on the results obtained.
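A sketch of the repeated-split loop, under the same assumptions as before (synthetic stand-in data, numpy OLS). Each iteration re-splits the rows, refits, and stores one test MSE; the 1000 stored values form the sampling distribution to visualize:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data; substitute the real nes2008 design matrix and outcome.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = X @ np.array([60.0, 4.0, 0.05, -0.5, 15.0, -15.0]) + rng.normal(0, 20, n)

mses = np.empty(1000)
for b in range(1000):
    # A fresh 50/50 split each iteration.
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2 :]
    beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    mses[b] = np.mean((y[test] - X[test] @ beta_hat) ** 2)

print(f"mean {mses.mean():.2f}, sd {mses.std(ddof=1):.2f}")
# Visualize e.g. with matplotlib: plt.hist(mses, bins=40)
```

The spread of this distribution shows how sensitive a single holdout estimate is to which observations happen to land in the test set.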
- (30 points) Compare the estimated parameters and standard errors from the original model in question 1 (the model estimated using all of the available data) to the parameters and standard errors estimated using the bootstrap ($B = 1000$). The comparison should include, at a minimum, numeric output as well as discussion of differences, similarities, etc. Also discuss the conceptual use and impact of bootstrapping.
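The bootstrap step can be sketched as follows, again on invented synthetic data with a numpy OLS fit standing in for the real estimation: draw $B = 1000$ resamples of the rows with replacement, refit the model on each, and take the standard deviation of each coefficient across resamples as its bootstrap standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data; substitute the real nes2008 design matrix and outcome.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = X @ np.array([60.0, 4.0, 0.05, -0.5, 15.0, -15.0]) + rng.normal(0, 20, n)

B = 1000
boot = np.empty((B, X.shape[1]))
for b in range(B):
    # Resample n rows with replacement and refit OLS on the resample.
    idx = rng.integers(0, n, n)
    boot[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

boot_coef = boot.mean(axis=0)          # bootstrap point estimates
boot_se = boot.std(axis=0, ddof=1)     # bootstrap standard errors
print("coef:", np.round(boot_coef, 3))
print("se:  ", np.round(boot_se, 3))
```

These bootstrap standard errors can then be placed side by side with the analytic standard errors from the full-data fit in question 1; the bootstrap makes no homoskedasticity or normality assumptions, which is its main conceptual appeal.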