A dynamic Bayesian model to forecast the 2022 U.S. midterm elections.
- Code for all the analyses is in `R/` and `stan/`. README files in each subdirectory contain more information.
- Tracked, processed data are in `data/`; untracked and raw data are in `data-raw/`.
- Model outputs are in `docs/`; files for this documentation page are in `readme-doc/`.
Jump to: Fundamentals • Firms • National intent • Outcomes
graph TD
mod_firms[<font size=5>FIRMS]:::model
mod_firms --> |Prior on firm error| mod_natl[<font size=5>NATIONAL INTENT]:::model
mod_fund[<font size=5>FUNDAMENTALS]:::model --> |Prior on E-day intent| mod_natl
d_ret([Historical<br />House returns]):::data -.-> mod_race
mod_natl --> |Covariate| mod_race[<font size=5>OUTCOMES]:::model
d_gen([Historical generic<br />ballot polling]):::data -.-> mod_firms
d_fund([Historical economic<br />and approval data]):::data -.-> mod_fund
d_22([2022 partisanship<br />and incumbency]):::data -.-> mod_race
classDef model fill:#aa2,stroke:#000,font-size:16pt,font-weight:bold
classDef data fill:#efeff4,stroke:#aaa,line-height:1.5,font-size:9pt
The fundamentals model is a Bayesian linear regression of the national two-way vote share for the incumbent president’s party on logit retirements; House control (1 if the incumbent president’s party controls the House); presidential control (1 for a Democratic president); an economic indicator; logit presidential approval; and several interactions with polarization, measured as the correlation between House and presidential results in the previous election. The model is fit separately to presidential and midterm years. The economic indicator is the first principal component of three economic indicators: GDP change over the past year, log unemployment rate, and urban CPI change over the past year (inflation). The principal components are calculated only on data from 1948–2006 so that the weights can be used in predictive models after 2008. To build the national two-way vote share, we impute vote share for uncontested House races using a BART model fit on contested House elections from 1976 to 2020. Coefficients are given an R2-D2 prior. The data are available here.
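The frozen-weights idea behind the economic indicator can be illustrated with a minimal sketch. It assumes the three inputs are standardized before the principal component is extracted (the text does not say), uses power iteration in place of a library PCA routine, and runs on made-up numbers; every name and value below is hypothetical and none of it comes from the repository.

```python
import math

def column_stats(rows):
    """Column means and sample standard deviations."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(k)]
    return means, sds

def standardize(rows, means, sds):
    """Center and scale each column with the supplied (frozen) statistics."""
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]

def first_pc_weights(z, iters=500):
    """Leading eigenvector of the sample covariance of z, via power iteration."""
    n, k = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(k)]
           for a in range(k)]
    w = [1.0] * k
    for _ in range(iters):
        w = [sum(cov[a][b] * w[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w

# Hypothetical training rows: (GDP change, log unemployment rate, CPI change).
train = [
    (3.1, math.log(4.5), 2.0),
    (-0.5, math.log(7.8), 5.5),
    (2.4, math.log(5.2), 3.1),
    (4.0, math.log(3.9), 1.4),
    (0.8, math.log(6.6), 4.2),
    (1.9, math.log(5.8), 2.7),
]
means, sds = column_stats(train)
weights = first_pc_weights(standardize(train, means, sds))

# Score a later (hypothetical) year using only the frozen training statistics.
new_year = [(1.2, math.log(6.0), 3.8)]
z_new = standardize(new_year, means, sds)[0]
econ_index = sum(w * z for w, z in zip(weights, z_new))
```

Because the means, scales, and weights are computed once on the training window, scoring a later year never uses information from that year or beyond, which is the point of restricting the calculation to earlier data.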
Fundamentals-only prediction for 2022:
The firm error model goes hand-in-hand with the firm error component of the national intent model, below. The idea is to use historical firm performance in polling the generic ballot and presidential races as a prior for firm performance this cycle. We can decompose firm error into several components:
- Constant year-to-year bias in all firms in polling these races.
- Year-specific bias shared by all firms.
- Firm-specific bias.
- Bias from polling methodology (IVR/online/phone/mixed/unknown).
- Bias from likely-voter (LV) polls. Due to limited data, we only code an indicator for whether a poll is not an LV poll; we don’t distinguish between RV/A/V polls.
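To make the decomposition above concrete, here is a minimal sketch that treats the components as additive. Additivity, the scale of the numbers, and every name and value are assumptions made for illustration; the repository's Stan model may combine these terms differently.

```python
from dataclasses import dataclass, field

@dataclass
class BiasComponents:
    """Hypothetical container for the bias terms listed above."""
    overall: float                               # constant bias across years
    year: dict = field(default_factory=dict)     # year-specific shared bias
    firm: dict = field(default_factory=dict)     # firm-specific bias
    method: dict = field(default_factory=dict)   # IVR/online/phone/mixed/unknown
    non_lv: float = 0.0                          # applies when a poll is not LV

def expected_poll_bias(c, year, firm, method, is_lv):
    """Sum the components that apply to one poll (assumed additive)."""
    return (c.overall
            + c.year.get(year, 0.0)
            + c.firm.get(firm, 0.0)
            + c.method.get(method, 0.0)
            + (0.0 if is_lv else c.non_lv))

# Made-up values, not estimates from the model.
components = BiasComponents(
    overall=0.010,
    year={2022: -0.005},
    firm={"Firm A": 0.020},
    method={"online": 0.003},
    non_lv=0.015,
)
bias_lv = expected_poll_bias(components, 2022, "Firm A", "online", is_lv=True)
bias_rv = expected_poll_bias(components, 2022, "Firm A", "online", is_lv=False)
```

An unknown firm or method simply contributes zero here, which plays the role of falling back to the shared components.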
Given total firm bias from all these sources, firms also vary in how closely their results cluster around this bias. If a firm consistently reports numbers 5pp too favorable for Democrats, we can adjust for that. Less consistency means less adjustment is possible. Polling variance is affected by several factors:
- Sample size
- Time to the election
- LV vs. other polls
- Firm variance
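As a rough illustration of the sample-size factor, the sketch below computes the pure sampling standard deviation of a poll on the logit scale (via the delta method) and adds a single extra variance term standing in for the other factors. How the model actually combines time to the election, LV status, and firm variance is not reproduced here; the additive form and the numbers are assumptions.

```python
import math

def sampling_sd_logit(p, n):
    """Delta-method std. dev. of logit(p_hat) for a simple random sample of size n."""
    return 1.0 / math.sqrt(n * p * (1.0 - p))

def total_sd(p, n, extra_sd):
    """Combine sampling error with a hypothetical non-sampling term (assumed independent)."""
    return math.sqrt(sampling_sd_logit(p, n) ** 2 + extra_sd ** 2)

sd_small = sampling_sd_logit(0.5, 500)
sd_large = sampling_sd_logit(0.5, 1000)
sd_with_extra = total_sd(0.5, 1000, extra_sd=0.05)
```

Quadrupling the sample size halves the sampling term, but the extra term puts a floor under the total, so very large samples from an inconsistent firm still carry limited information.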
We operationalize this framework with the following model, which is fit to around 5,100 historical polling results.
We can simulate from the model to get predictive values of firm bias and variance in hypothetical election-day likely voter polls for the 2022 election. These predictive values are the best way to evaluate each firm’s overall quality for this election. A firm is better—that is, its polls contain more information about the race—if it has lower variance (std. dev.), a lower herding value, and bias closer to 0 (though this will be adjusted for).
The intent model estimates latent national vote intent, which is assumed to evolve as a random walk, with an observation model that is closely related to the firm error model, above.
Estimates for the 2010–2020 cycles, based only on previous years, are shown below.
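The random-walk structure described in this section can be sketched with a simple local-level Kalman filter. This is a swapped-in simplification, not the repository's Stan model: it runs forward in time from a prior, treats each poll's bias and variance as known, and assumes Gaussian errors. All names and numbers are hypothetical.

```python
def filter_intent(polls, n_days, prior_mean, prior_var, step_var):
    """Filtered mean and variance of latent intent for each day.

    polls maps a day index to a list of (value, bias, obs_var) tuples,
    all on the same scale as the latent intent.
    """
    mean, var = prior_mean, prior_var
    path = []
    for day in range(n_days):
        if day > 0:
            var += step_var                  # random-walk step adds uncertainty
        for value, bias, obs_var in polls.get(day, []):
            gain = var / (var + obs_var)     # weight given to the new poll
            mean += gain * (value - bias - mean)
            var *= 1.0 - gain
        path.append((mean, var))
    return path

# One bias-adjusted poll on day 2 of a 4-day window (made-up numbers).
path = filter_intent(
    polls={2: [(0.10, 0.02, 0.01)]},
    n_days=4,
    prior_mean=0.0,
    prior_var=0.01,
    step_var=0.001,
)
```

On days without polls the estimate stays put while its variance grows; a poll pulls the estimate toward its bias-adjusted value in proportion to how precise the poll is relative to the current uncertainty.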
The outcomes model maps district partisanship, the national
environment, and other district and national factors onto vote shares in
each House district and Senate race. We use a multilevel model with a
Student-t response, as described by the following (`brms`) model syntax.
House:
ldem_seat ~ inc_pres + offset(ldem_pred) + ldem_pres_adj:ldem_gen +
polar*(inc_seat + ldem_exp + exp_mis) - polar + region +
(1 + edu_o15 | year) + (1 | division:year) + (1 | dem_cand) + (1 | rep_cand)
sigma ~ polar + I(ldem_pres_adj^2)
Senate:
ldem_seat ~ ldem_pres_adj * ldem_gen +
(midterm + inc_pres + inc_seat)^2 + miss_polls*inc_seatc +
(1 + white + edu_o15 + poll_avg | year) + (1 | region) +
(1 | cand_dem) + (1 | cand_rep)
sigma ~ polar + I(ldem_pres_adj^2)
Here, `inc_pres` is the party of the incumbent president, coded as plus
or minus 1; `inc_seat` is the party of the seat’s incumbent, coded as 1
for a Democrat, -1 for a Republican and 0 if open; `ldem_pres_adj` is
the logit last presidential result in the district, shifted back to a
neutral national environment (i.e., subtracting off the national
presidential result); `ldem_gen` is the logit generic ballot;
`ldem_pred` is the sum of these two; `polar` measures polarization as
the lagged correlation between House and presidential results;
`ldem_exp` is the logit share of campaign expenditures by the Democrat;
`exp_mis` codes whether expenditure data are missing for the race (as
they unfortunately often are); `poll_avg` is the average of polls
conducted in the last 30 days, shrunk to `ldem_pres_adj` based on the
number of polls; `miss_polls` is an indicator for no polling being
available; and `dem_cand` and `rep_cand` are the Democratic and
Republican candidates, respectively.
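A few of these covariates are simple transformations, sketched below under stated assumptions: the national presidential result is subtracted on the logit scale, and the poll average is shrunk with weight n / (n + k), where `k` is a hypothetical tuning constant. Neither detail is confirmed by the text above, and the function names are made up for illustration.

```python
import math

def logit(p):
    """Log-odds of a two-way vote share in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a logit-scale value back to a vote share."""
    return 1.0 / (1.0 + math.exp(-x))

def ldem_pres_adj(district_share, national_share):
    """District presidential result relative to a neutral national environment."""
    return logit(district_share) - logit(national_share)

def ldem_pred(pres_adj, generic_ballot_share):
    """Sum of adjusted partisanship and the logit generic ballot."""
    return pres_adj + logit(generic_ballot_share)

def shrunk_poll_avg(polls_logit, pres_adj, k=3.0):
    """Shrink the poll average toward pres_adj; more polls means less shrinkage."""
    n = len(polls_logit)
    if n == 0:
        return pres_adj
    weight = n / (n + k)
    return weight * (sum(polls_logit) / n) + (1.0 - weight) * pres_adj

# Made-up district: 55% Democratic two-way share in a 52% national year.
adj = ldem_pres_adj(0.55, 0.52)
baseline_share = inv_logit(ldem_pred(adj, 0.49))
avg = shrunk_poll_avg([logit(0.56), logit(0.54)], adj)
```

Since `ldem_pred` enters the House formula as an offset, the remaining terms can be read as adjustments around this baseline.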
For the House model, the standard deviation of the year random effects
is estimated around 0.05 (on the logit scale); the standard deviation of
the division-year random effects is estimated around 0.04. The model is
fit to all 2,320 contested House elections from 2010 to 2020. Posterior
summaries for all coefficients are shown below. The overall model
For the Senate model, the standard deviation of the year random effects
is estimated around 0.03 (on the logit scale). The model is fit to all
258 contested, two-way Senate elections from 2006 to 2020. Posterior
summaries for all coefficients are shown below. The overall model
Two Republican candidates are running against a single Democrat in the Alaska at-large district. This poses a challenge for predicting a winner, since the dynamics of ranked-choice voting could be determinative. As an ad-hoc adjustment, we simulate hypothetical Alaska at-large elections based on the results of the 2022 special election, with randomness added. We use the results of this simulation to understand the probability of a Democratic win based on the first-round balloting results. We can then translate this into a (random) vote boost to apply to the Democratic candidate in the first round in order to produce a rough approximation of the final ranked-choice reallocated vote.
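The mechanics of such a simulation might look like the sketch below. It is not the repository's procedure: the first-round shares, transfer rates, and noise level are invented placeholders rather than the special election results, and the candidate labels `D`, `R1`, and `R2` are hypothetical.

```python
import random

def final_round(first_round, transfers):
    """Eliminate the last-place candidate and reallocate their ballots.

    transfers[loser][c] is the fraction of the loser's ballots that go to
    candidate c; the remainder is treated as exhausted.
    """
    loser = min(first_round, key=first_round.get)
    flows = transfers.get(loser, {})
    kept = {c: s + first_round[loser] * flows.get(c, 0.0)
            for c, s in first_round.items() if c != loser}
    total = sum(kept.values())
    return {c: s / total for c, s in kept.items()}

def dem_win_prob(base, transfers, noise_sd=0.03, n_sims=2000, seed=1):
    """Share of noisy simulated elections in which "D" wins the final round."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        noisy = {c: max(s + rng.gauss(0.0, noise_sd), 1e-6)
                 for c, s in base.items()}
        total = sum(noisy.values())
        shares = {c: s / total for c, s in noisy.items()}
        result = final_round(shares, transfers)
        wins += max(result, key=result.get) == "D"
    return wins / n_sims

# Placeholder inputs, not real election data.
base = {"D": 0.40, "R1": 0.35, "R2": 0.25}
transfers = {
    "R2": {"R1": 0.6, "D": 0.2},
    "R1": {"R2": 0.6, "D": 0.2},
    "D": {},
}
example = final_round(base, transfers)
prob = dem_win_prob(base, transfers)
```

Repeating this over a grid of first-round Democratic shares would give a curve relating first-round results to win probability, which is the kind of relationship the adjustment described above relies on.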