search.json

[
  {
    "objectID": "assessment/index.html",
    "href": "assessment/index.html",
    "title": "Assessment",
    "section": "",
    "text": "3,500-word long research report\n   26th April 2023\n   Submit to Turnitin via Canvas\n\nMore detailed information and submission link are available on the Canvas site"
  },
  {
    "objectID": "data/data-documentation.html",
    "href": "data/data-documentation.html",
    "title": "Data documentation",
    "section": "",
    "text": "The datasets used in this course and available for download from the course website are the following:\n\n\n\nFile name\nOriginal name\nType\nVersion\nSurvey\nLinks\n\n\n\n\neb89.1\nZA6963_v1-0-0\n.dta\n.sav\n1.0.0\nEurobarometer; 89.1 (March 2018)\nSource\nQuestionnaire\nCodebook\n\n\ness9\nESS9e03_1\n.dta\n.sav\n3.1\nEuropean Social Survey; Integrated file, Round 9 (2018)\nSource\nQuestionnaire\nCodebook\n\n\nevs5\nZA7500_v4-0-0\n.dta\n.sav\n4.0.0\nEuropean Values Study; Wave 5 (2017-2020)\nSource\nQuestionnaire\nCodebook\n\n\nEUinUK2018\nEUinUK2018_Polish\n.dta\n-\nSurvey data collected by McGhee and Moreh (2018), ESRC Centre for Population Change\nSource\nQuestionnaire\nCodebook\n\n\nLaddLenz\nLaddLenz\n.dta\n-\nReplication data for Ladd and Lenz (2009), based on British Election Panel Study data\nSource\nQuestionnaire\nCodebook\n\n\nosterman\nReplication_data_ESS1-9_20201113\n.dta\n-\nReplication data for Österman (2020), based on European Social Survey Rounds 1-9 data\nSource\nQuestionnaire\nCodebook\n\n\n\nThe datasets can be read into R from \"https://cgmoreh.github.io/SSC7001M/data/FILE_NAME\" using an appropriate command from the haven package or other importing function.\n\n\n\n\n\n\n\n\n\nFile\n\n\nOriginal name\n\n\nType\n\n\nVersion\n\n\nOrigin\n\n\nAccess\n\n\n\n\n\n\nosterman\n\n\nReplication_data_ESS1-9_20201113\n\n\n.dta\n\n\nNA\n\n\nReplication data for Österman (2021), based on European Social Survey Rounds 1-9 data\n\n\nSource  Questionnaire  Codebook\n\n\n\n\nLaddLenz\n\n\nLaddLenz\n\n\n.dta\n\n\nNA\n\n\nReplication data for Ladd and Lenz (2009), based on British Election Panel Study data. Included in Hainmueller (2012)\n\n\nSource  Questionnaire  Codebook\n\n\n\n\n\n\n\n\n\n\nReferences\n\nHainmueller, Jens. 2012. “Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.” Political Analysis 20 (1): 25–46. https://doi.org/10.1093/pan/mpr025.\n\n\nLadd, Jonathan McDonald, and Gabriel S. Lenz. 2009. “Exploiting a Rare Communication Shift to Document the Persuasive Power of the News Media.” American Journal of Political Science 53 (2): 394–410. https://doi.org/10.1111/j.1540-5907.2009.00377.x.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nÖsterman, Marcus. 2021. “Can We Trust Education for Fostering Trust? Quasi-experimental Evidence on the Effect of Education and Tracking on Social Trust.” Social Indicators Research 154 (1): 211–33. https://doi.org/10.1007/s11205-020-02529-y."
  },
  {
    "objectID": "data/index.html",
    "href": "data/index.html",
    "title": "Data documentation",
    "section": "",
    "text": "File name\n\n\nType\n\n\nDescription\n\n\nLink to source\n\n\n\n\n\n\nevs5\n\n\n.sav\n\n\nEuropean Values Study; Wave 5 (2017-2021)\n\n\nSource\n\n\n\n\nosterman\n\n\n.dta\n\n\nReplication data for Österman (2021), based on European Social Survey Rounds 1-9 data\n\n\nData source  Open access article  Supplementary materials\n\n\n\n\nLaddLenz\n\n\n.dta\n\n\nReplication data for Ladd and Lenz (2009), based on British Election Panel Study data. Included in Hainmueller (2012)\n\n\nSource\n\n\n\n\nEverydayTrust\n\n\n.Rds\n\n\nReplication data for Weiss et al. (2021)\n\n\nSource\n\n\n\n\ngaltonpeas\n\n\n.Rds\n\n\nData underpinning a paper presented by Sir Francis Galton to the Royal Institute on February 9, 1877, summarising his experiments on sweet peas in which he compared the size of peas produced by parent plants to those produced by offspring plants.\n\n\nSource\n\n\n\n\ngalton1886\n\n\n.dta\n\n\nSir Francis Galton’s famous data on the heights or parents and their children underpinning his 1886 paper (Galton 1886).\n\n\nSource and more info\n\n\n\n\nValentino17\n\n\n.dta\n\n\nReplication data for Valentino et al. (2019), based on original data collected through YouGov in 11 countries. The original dataset provided by the authors is called imm.bjpols.dta and the original analysis was performed in Stata.\n\n\nData source  Open access article  Supplementary materials\n\n\n\n\nEjrnaes21\n\n\n.dta\n\n\nReplication data for Ejrnæs and Jensen (2021), based on data from the European Social Survey Round 8. The original dataset provided by the authors is called G&O_Final.tab and the original analysis was performed in Stata.\n\n\nData source  Open access article  Supplementary materials\n\n\n\n\nworkout\n\n\n.Rds\n\n\nExample dataset from Mehmetoglu and Mittner (2021); a combined version of the original workout2 and workout3 datasets included in the {astatur} package\n\n\nData source\n\n\n\n\n\n\nThe datasets can be downloaded by clicking on the file name, or read into R directly from \"https://cgmoreh.github.io/HSS8005/data/___\" (using a type-appropriate read function and replacing ___ with “File name” and “Type” extension; e.g. haven::read_dta(\"https://cgmoreh.github.io/HSS8005/data/dataset.dta\")).\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEjrnæs, Anders, and Mads Dagnis Jensen. 2021. “Go Your Own Way: The Pathways to Exiting the European Union.” Government and Opposition, February, 1–23. https://doi.org/10.1017/gov.2020.37.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGalton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–63. https://doi.org/10.2307/2841583.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nHainmueller, Jens. 2012. “Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.” Political Analysis 20 (1): 25–46. https://doi.org/10.1093/pan/mpr025.\n\n\nLadd, Jonathan McDonald, and Gabriel S. Lenz. 2009. “Exploiting a Rare Communication Shift to Document the Persuasive Power of the News Media.” American Journal of Political Science 53 (2): 394–410. https://doi.org/10.1111/j.1540-5907.2009.00377.x.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMehmetoglu, Mehmet, and Matthias Mittner. 2021. Applied Statistics Using R: A Guide for the Social & Natural Sciences. First. Thousand Oaks: SAGE Publications.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nÖsterman, Marcus. 2021. “Can We Trust Education for Fostering Trust? Quasi-experimental Evidence on the Effect of Education and Tracking on Social Trust.” Social Indicators Research 154 (1): 211–33. https://doi.org/10.1007/s11205-020-02529-y.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489.\n\n\nValentino, Nicholas A., Stuart N. Soroka, Shanto Iyengar, Toril Aalberg, Raymond Duch, Marta Fraile, Kyu S. Hahn, et al. 2019. “Economic and Cultural Drivers of Immigrant Support Worldwide.” British Journal of Political Science 49 (4): 1201–26. https://doi.org/10.1017/S000712341700031X.\n\n\nWeiss, Alexa, Corinna Michels, Pascal Burgmer, Thomas Mussweiler, Axel Ockenfels, and Wilhelm Hofmann. 2021. “Trust in Everyday Life.” Journal of Personality and Social Psychology 121: 95–114. https://doi.org/10.1037/pspi0000334."
  },
  {
    "objectID": "index.html",
    "href": "index.html",
    "title": "\n            Quantitative analysis \n        ",
    "section": "",
    "text": "Quantitative analysis \n        \n        \n            HSS8005 • Intermediate/Advanced stream • 2023\nNewcastle University (UK)\n        \n        \n            A second course in applied statistics and probability for the understanding of society and culture. It is aimed at an interdisciplinary audience through real-life research examples from various fields in the social sciences and humanities. The course emphasizes the scientific application of statistical methods, developing a reproducible research workflow, and computational techniques.\n        \n    \n\n\n\n\n\nModule leader\n\n   Dr. Chris Moreh\n   HDB.4.106\n   chris.moreh@newcastle.ac.uk\n   Tutorial booker\n \n\n\n\nTeaching Assistants\n\n   Bilal Alsharif\n   Fengting Du\n\n\n\n\n\nSession dates\n\n   Thursdays\n    Check on your Timetable app\n   Lecture: 10:00-11:30\n   Labs:   13:00-14:30 (Group 03)        14:30-16:00 (Group 04)  \n\n\n\nAssessment\n\n   26th April 2023\n   3,500-word long research report\n   Submit to Turnitin via Canvas\n\n\n\n\n Chris’s mastodon feed\nwhere he posts stuff of interest to #HSS8005\n\n\n\n\n\n\nModule overview\nThis module is offered by School X - Researcher Education and Development to postgraduate students within the Faculty of Humanities and Social Sciences at Newcastle University. The module aims to provide a broad applied introduction to more advanced methods in quantitative analysis for students from various disciplinary backgrounds. See the module plan page for details about the methods covered. The course content consists of eight lectures (1.5 hours each) and eight IT labs (1.5 hours) . The course stands on three pillars: application, reproducibility and computation.\nApplication: we will work with real data originating from large-scale representative surveys or published research, with the aim of applying methods to concrete research scenarios. IT lab exercises will involve reproducing small bits of published research, using the data and (critically) the modelling approaches used by the authors. The aim is to see how methods have been used in practice in various disciplines and learn how to reproduce (and potentially improve) those analyses. This will then enable students to apply this knowledge to their own research questions. The data used in IT labs may be cleansed to allow focusing more on modelling tasks than on data wrangling, but exercises will address some of the more common data manipulation challenges and will cover essential functions. Data cleansing scripts will also be provided so that interested students can use them in their own work.\nReproducibility: developing a reproducible workflow that allows your future self or a reviewer of your work to understand your process of analysis and reproduce your results is essential for reliable and collaborative scientific research. We enforce the ideas and procedures of reproducible research both through replicating published research (see above) and in our practice (in the IT labs and the assignment). For an overview of why it’s important to develop a reproducible workflow early on in your research career and how to do it using (some) of the tools used in this module, read Chapter 3 of TSD (see Resources>Readings). It’s also worth reading through Kieran Healy’s The Plain Person’s Guide to Plain Text Social Science, although there are now better software options than those discussed there. In this course, we will be using a suite of well-integrated free and open-source software to aid our reproducible workflow: the  statistical programming language and its currently most popular dialect – the {tidyverse} – via the  IDE for data analysis, and  for scientific writing and publishing (see Resources>Software).\nComputation: the development of computational methods underpins the application of the most important statistical ideas of the past 50 years (see Andrew Gelman’s article on these developments here or an online workshop talk here; Richard McElreath’s great talk on Science as Amateur Software Development is well worth watching too). This module aims to develop basic computational skills that allow the application of complex statistical models to practical scientific problems without advanced mathematical knowledge, and which lay the foundation on which students can then pursue further learning and research in computational humanities and social sciences.\n\nThe course and the website were written and are maintained by Chris Moreh.\n\n\nPrerequisites\nTo benefit the most from this module, students are expected to have a foundational level of knowledge in quantitative methods: a good understanding of data types and distributions, familiarity with inferential statistics, and some exposure to linear regression. This is roughly equivalent to the content covered in the Introductory stream of the module or a textbook such as OpenIntro Statistics (which you can download for free in PDF).\nThose who don’t feel completely up to date with linear regression but are determined to advance more quickly and read/practice beyond the compulsory material during weeks 1-3 are also encouraged to sign up.\nThose with a stronger background in multiple linear regression (e.g. students with undergraduate-level training in econometrics) will still benefit from weeks 1-3 as the approach we are taking is probably different from the one they are familiar with.\nNo previous knowledge of  or command-based statistical analysis software is needed. Gaining experience with using statistical software is part of the skills development aims of the module. However, it is not a general data science module, and the IT labs will cover a very limited number of functions (from both base R, the tidyverse and other reliable user-written packages) that are most useful for tackling specific analysis tasks. Students are advised to complete some additional self-paced free online training in the use of the software, such as Data Carpentry’s R for Social Scientists, and to consult Wickham, Çetinkaya-Rundel and Grolemund’s R for Data Science (2nd ed.) online book."
  },
  {
    "objectID": "materials/handouts/index.html",
    "href": "materials/handouts/index.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "Title\n\n\nDescription\n\n\nReading Time\n\n\n\n\n\n\nWeek 1 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 2 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 3 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 4 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 5 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 6 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 7 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 8 handout\n\n\n\n\n0 min\n\n\n\n\nWeek 1 handout sheet\n\n\n\n\n0 min\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "materials/index.html",
    "href": "materials/index.html",
    "title": "Materials",
    "section": "",
    "text": "Materials for each week are available from the side menu. The table below outlines the weekly topics.\n\n\n\n\n\n\nWeekly topics\n\n\n\n\n\n\n\n\nWeek 1  Gamblers, God, Guinness and peas\n\n\nA brief history of statistics\n\n\n\n\nWeek 2  Revisiting Flatland\n\n\nA review of general linear models\n\n\n\n\nWeek 3  Dear Prudence, Help! I may be cheating with my X\n\n\nInteractions and the logic of causal inference\n\n\n\n\nWeek 4  The Y question\n\n\nGeneralised linear models\n\n\n\n\nWeek 5  Do we live in a simulation?\n\n\nBasic data simulation for statistical inference and power analysis\n\n\n\n\nWeek 6  Challenging hierarchies\n\n\nMultilevel models\n\n\n\n\nWeek 7  The unobserved\n\n\nLatent variables and structural models\n\n\n\n\nWeek 8  Words, words, mere words…\n\n\nText as data\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "materials/info/index.html",
    "href": "materials/info/index.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "Title\n\n\nSubtitle\n\n\n\n\n\n\nWeek 1  Gamblers, God, Guinness and peas\n\n\nA brief history of statistics\n\n\n\n\nWeek 2  Revisiting Flatland\n\n\nA review of general linear models\n\n\n\n\nWeek 3  Dear Prudence, Help! I may be cheating with my X\n\n\nInteractions and the logic of causal inference\n\n\n\n\nWeek 4  The Y question\n\n\nGeneralised linear models\n\n\n\n\nWeek 5  Do we live in a simulation?\n\n\nBasic data simulation for statistical inference and power analysis\n\n\n\n\nWeek 6  Challenging hierarchies\n\n\nMultilevel models\n\n\n\n\nWeek 7  The unobserved\n\n\nLatent variables and structural models\n\n\n\n\nWeek 8  Words, words, mere words…\n\n\nText as data\n\n\n\n\n\n\nNo matching items\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w01.html",
    "href": "materials/info/info_w01.html",
    "title": "Week 1  Gamblers, God, Guinness and peas",
    "section": "",
    "text": "Readings\nTextbook readings\n\nROS: Chapters 1 and 2\nTSD: Chapters 1, 2 and 3 (“Foundations”)\nR4DS: Chapters 1-10 (“Whole game”)\n\nIntuition building\n\nJaynes, E. T. (2003). Probability theory: The logic of science. Cambridge University Press (available via the NU library)\n\nPreface: pp. xix-xxvii\nChapter 16 (“Orthodox methods: historical background”): pp. 490-506\n\nMcElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). Taylor and Francis, CRC Press (available online)\n\nChapter 1: pp. 1-18\n\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w02.html",
    "href": "materials/info/info_w02.html",
    "title": "Week 2  Revisiting Flatland",
    "section": "",
    "text": "Readings\nStatistics\n\nROS: Chapters 3, 4, 6-12\nTSD: Chapter 12 (“Linear models”)\n\nCoding\n\nTSD: Chapters 9 and 11\nR4DS: Chapters 11, 12\n\nApplication\n\nÖsterman, Marcus. 2021. ‘Can We Trust Education for Fostering Trust? Quasi-Experimental Evidence on the Effect of Education and Tracking on Social Trust’. Social Indicators Research 154(1):211–33 - (online)\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w03.html",
    "href": "materials/info/info_w03.html",
    "title": "Week 3  Dear Prudence, Help! I may be cheating with my X",
    "section": "",
    "text": "Readings\nStatistics\n\nROS: Chapters 10-12, 18-20\n\nCoding\n\nTSD: Chapter 14\n\nApplication\n\nÖsterman, Marcus. 2021. ‘Can We Trust Education for Fostering Trust? Quasi-Experimental Evidence on the Effect of Education and Tracking on Social Trust’. Social Indicators Research 154(1):211–33 - (online)\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w04.html",
    "href": "materials/info/info_w04.html",
    "title": "Week 4  The Y question",
    "section": "",
    "text": "Readings\nStatistics\n\nROS: Chapters 13-15\n\nCoding\n\nTSD: Chapter 13\n\nApplication\n\nLadd, Jonathan McDonald, and Gabriel S. Lenz. 2009. ‘Exploiting a Rare Communication Shift to Document the Persuasive Power of the News Media’. American Journal of Political Science 53(2):394–410. doi: 10.1111/j.1540-5907.2009.00377.x.(published version should be accessible with university login; additional Appendix available here)\nWeiss, Alexa, Corinna Michels, Pascal Burgmer, Thomas Mussweiler, Axel Ockenfels, and Wilhelm Hofmann. 2021. ‘Trust in Everyday Life’. Journal of Personality and Social Psychology 121:95–114. doi: 10.1037/pspi0000334 (access preprint version here)\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w05.html",
    "href": "materials/info/info_w05.html",
    "title": "Week 5  Do we live in a simulation?",
    "section": "",
    "text": "Readings\n\nROS: Chapters 5 (pp. 69-76) and 16 (pp. 291-310)\nTSD: TDS makes extensive use of simulation methods for various purposes at different stages of a research project (e.g. from data preparation through statistical inference to sharing results and data openly). A search on a keyword stub “simulat” can point you various sections of interest that are all worth reading.\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w06.html",
    "href": "materials/info/info_w06.html",
    "title": "Week 6  Challenging hierarchies",
    "section": "",
    "text": "Readings\nTextbook\n\nARM: Chapters 11 (pp. 237-249) and 12 (pp. 251-278)\nTSD: Chapter section 15.2\n\nApplication\n\nValentino et al. (2017) Economic and cultural drivers of immigrant support worldwide. British Journal of Political Science, 49(4), 1201–1226. (The accepted manuscript version can be downloaded from here; Note: this version of the article also contains a brief “response to reviewers” by the authors, which you may find interesting)\n\n\n\nFurther readings\n\nARM: Chapters 13 (pp. 279-299), 14 (pp. 301-323) and 15 (pp. 325-342)\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w07.html",
    "href": "materials/info/info_w07.html",
    "title": "Week 7  The unobserved",
    "section": "",
    "text": "Readings\nTextbook\n\nChapters 13 and 14 in Mehmetoglu, M. & Mittner, M. (2022) Applied statistics using R: a guide for the social sciences. London: Sage (NCL library access here)\n\nVideo\n\nKubinec, R. (2019) An introduction to latent variable models for data science. Sage Research Methods (video file, 00:17:44) (NCL library access here)\n\nApplication\n\nEjrnæs, A., & Jensen, M. D. (2022) Go Your Own Way: The Pathways to Exiting the European Union. Government and Opposition, 57(2), 253-275. https://doi.org/10.1017/gov.2020.37 (The accepted manuscript version can be downloaded from here)\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/info/info_w08.html",
    "href": "materials/info/info_w08.html",
    "title": "Week 8  Words, words, mere words…",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w01.html",
    "href": "materials/notes/draft-notes_w01.html",
    "title": "Gamblers, God, Guinness and peas",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w02.html",
    "href": "materials/notes/draft-notes_w02.html",
    "title": "Revisiting Flatland",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w03.html",
    "href": "materials/notes/draft-notes_w03.html",
    "title": "Dear Prudence, Help! I may be cheating with my X",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w04.html",
    "href": "materials/notes/draft-notes_w04.html",
    "title": "The Y question",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w05.html",
    "href": "materials/notes/draft-notes_w05.html",
    "title": "Do we live in a simulation?",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w06.html",
    "href": "materials/notes/draft-notes_w06.html",
    "title": "Challenging hierarchies",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w07.html",
    "href": "materials/notes/draft-notes_w07.html",
    "title": "The unobserved",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/draft-notes_w08.html",
    "href": "materials/notes/draft-notes_w08.html",
    "title": "Words, words, mere words…",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/notes/index.html",
    "href": "materials/notes/index.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "Title\n\n\nDescription\n\n\nReading Time\n\n\n\n\n\n\nGamblers, God, Guinness and peas\n\n\nIn the first contribution to a series of articles on the history of probability and statistics in the journal Biometrika, Florence Nightingale David (1955) (no linear relationship with the famous social reformer) paraphrased a contemporary archaeologist who quipped that “a symptom of decadence in a civilization is when men become interested in their own history”, giving the interest in his own discipline as proof of the validity of his statement. David, however, thought that this does not stand true also for scientists’ and statisticians’ own emerging interest in their disciplines. He was right, in that the critical examination of the intellectual development of statistics and probability theory that followed has improved the discipline by excavating ideas that had been buried by mainstream statistics, but he was also mistaken, in that this activity threw light on the decadence of mainstream statistical practice. In this lecture we will look back on the development of some basic statistical concepts and learn about the ideas and preoccupations that influenced them over the centuries. The aim of this overview is to build up essential intuition about the concepts and methods that we will learn later. Brains-on activities will include casting astragali, fighting Laplace’s Demon, tasting tea, and comparing peas in a pod. By the end, we will gain a clearer understanding of the limits of statistical analysis and the dangers of not acknowledging those limits.\nThe IT lab will provide a very hands-on practical introduction to the statistical software that will be used in the module.\n\n\n0 min\n\n\n\n\nRevisiting Flatland\n\n\nIn Edwin Abbott’s 1884 novella, the inhabitants of Flatland are geometric shapes living in a two-dimensional world, incapable of imagining the existence of higher dimensions. A sphere passing through the plain of their world is a fascinating but incomprehensible event: Flatlanders can only see a dot becoming a circle, increasing in circumference, then shrinking back in size and disappearing. There are, in this universe, worlds with even more limited views, like the one-dimensional Lineland and the zero-dimensional Pointland. Any attempt to expand the perspective of their inhabitant(s) is doomed to failure. But as in any good adventure story, a chosen Flatland native embarks on a journey of discovery and revelation - and ostracism and imprisonment. The story is interpreted as an allegorical criticism of Victorian-age social structure, but can equally describe the limitations of inhabiting uncritically a methodological world in which all data are ‘normal’ and all relationships are linear. Moving beyond linearity and acquiring the statistical intuition needed to think in higher dimensions and perceive more complex relationships is indeed a matter of practice-induced revelation. It’s unlikely that we will reach statistical nirvana in this short course, but we’ll attempt to build some more substantial structures upon the arid plains of linear regression. We start by looking around in the Flat-, Line- and Point-lands of quantitative analysis. Incorrigible procrastinators may want to check out a full-length computer animated film version of Flatland on YouTube. Others may be better served by this brief TED-Ed animation.\n\n\n0 min\n\n\n\n\nDear Prudence, Help! I may be cheating with my X\n\n\nMuch of what we do in quantitative data analysis is about examining relationships. We are often interested in proposing and testing models of relationships between two or more variables. Sometimes our variables cry out to us begging for help, and we turn into agony aunts and uncles to our data. Other times we must psychoanalyse our data to uncover hidden associations and interactions. This is not an easy task. Do it carelessly, and you may unwittingly cheat yourself and the readers of your research. This week we’ll build some intuition for detecting complex and uneasy relationships within the design matrix X - that promiscuous commune on the right-hand-side of our regression equations. We’ll expand on the linear additive models that we looked at in the previous week by considering interactions among our predictor variables, we’ll explore the possibilities and challenges of asking causal questions of observational data, and we’ll think about ways to avoid what evolutionary anthropologist Richard McElreath calls ‘causal salad’. We may get an uncomfortable feeling that we may have cheated with our Xs in the past, but we’ll look towards the future. By the way, Dear Prudence is Slate magazine’s advice column; I like the name because being prudent really is essential in data analysis and interpretation. If you’re done with the readings for this week, you may indulge in some Prudie advice on matters more serious than statistics.\n\n\n0 min\n\n\n\n\nThe Y question\n\n\nIt wasn’t until the last quarter of the 20th century that a unified vision of statistical modelling emerged, allowing practitioners to see how the general linear model we have explored so far is only a specific case of a more general class of models. We could have had a fancy, memorable name for this class of models - as John Nelder, one of its inventors, acknowledged later in life (Senn 2003, 127) - but back then academics were not required to undertake marketing training on the tweetabilty-factor of the chosen names for their theories; so we ended up with “generalised linear models”. These models can be applied to explananda (“explained”, “response”, “outcome”, “dependent” etc. variables, our ys) whose possible values have certain constraints (such as being limited by a lower bound or constrained to discreet choices) that makes the parameters of the Gaussian (‘normal’) distribution inefficient in describing them. Instead, they follow some of the other “exponential distributions” (and not only the exponential: cf. Gelman, Hill, and Vehtari (2020, 264)), of which the Poisson, gamma, beta, binomial and multinomial are probably the most common in human and social sciences research. Their “generalised linear modelling” involves mapping them unto a linear model using a so-called “link function”. We will explore what all of this means in practice and how it can be applied to data that we are interested in most in our respective fields of study.\n\n\n0 min\n\n\n\n\nDo we live in a simulation?\n\n\nWe have known ever since science-fiction author Philip K. Dick’s memorable “Metz address” of 1977 that our world is a computer simulation. Of course, like some common-currency theories in the social sciences, this knowledge will never be truly verified. We won’t even attempt to get to the bottom of it in class; instead, we’ll practice some basic methods of computer simulation for statistical inference and for generating data that has some idealised characteristics. Such methods play an increasingly important role in computational statistics and are extremely useful for designing robust data collection and analysis plans. If you make a mistake in the code and end up in an infinite loop, but you’re afraid that stopping the process may cause the known universe to implode, you can watch Dick on YouTube while you wait. If something like this can happen to our data, who says it couldn’t happen to us?\n\n\n0 min\n\n\n\n\nChallenging hierarchies\n\n\nBy now we got a sense that every new thing we learn about turns out to be merely a specific case of a larger class of things. So, all the models we covered so far are specific, single-level, versions of multilevel models, in which our cases can be seen as clustered within larger entities. Sometimes they are part of several cross-cutting clusters and/or the clusters are themselves clustered. In general terms, we must acknowledge that there are dependencies in our data that may influence their behaviour. It turns out that data about humans living in societies look somewhat like humans living in societies. The importance of including information about hierarchical dependencies in our models is probably emphasised by no one else more than McElreath (2020, 15), who wants “to convince the reader of something that appears unreasonable: multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach.” We will encounter some of the uses and challenges of multilevel modelling.\n\n\n0 min\n\n\n\n\nThe unobserved\n\n\nThe unobserved sounds like the title of a promising horror film; if we have achieved our aims in the module so far, our horror should be ‘merely’ metaphysical by now (Kołakowski anyone? No? Okay, never mind). We have already had to deal with various aspects of latency in our analyses. At the most fundamental level, we speak about population parameters, but we never actually observe them; even a sample statistic can be a purely imaginary case that doesn’t occur in real life. We have discussed the effects of omitted variables, which are thus unobserved by our model, but which we may have access to in our data. And, of course, our most interesting measurements are likely to be proxies of some unobservable theoretical construct (Mulvin (2021) has recently published a wonderfully rich book about proxies in general). This week we pick up an earlier thread from week 4, where we thought about binary and ordered multinomial variables as discretised manifestations of some continuous ‘latent variable’. We expand on this idea by exploring simple and then more complex latent variable models (factor analysis, structural equation modelling), as a further generalisation of the hierarchical perspective introduced earlier. This gives us a few more tools to deal with our radical uncertainty. (n.b. missing data points are another challenge that could fall under this heading, and learning how to deal with them is extremely important; but “The missing” is too good a title not to deserve a high-budget, weak-storyline, full-on special effects sequel somewhere else)\n\n\n0 min\n\n\n\n\nWords, words, mere words…\n\n\nAs researchers in humanities and the social sciences, we use words both as tools of analysis and as sources of data. Words, and more broadly, texts, are also increasingly important for quantitative research in an age of so-called ‘big data’, when the digital world is saturated with unstructured textual information. But the statistical inspection of text is neither new, nor restricted to the humanistic tail of the social sciences. For example, a documented interest in the statistical study of literary style for the purposes of attributing authorship dates back to the mid-1850s (see El-Shagi and Jung 2015); and investors can use textual data such as minutes from the Bank of England’s Monetary Policy Committee’s deliberations to estimate future monetary policy decisions before they are actually taken (cf. Lord 1958). Methods for the collection and quantitative analysis of large-scale textual data are increasingly available, but their technical implementation is complex and requires efficient combination of humanistic subject knowledge and statistical expertise. Faced with words, one is understandably caught between Shakespeare’s Troilus and Wilde’s Dorian Gray. “Words, words, mere words, no matter from the heart; th’ effect doth operate another way. … My love with words and errors still she feeds, but edifies another with her deeds” - believed the betrayed Troilus. “Words! Mere words! How terrible they were! How clear, and vivid, and cruel! One could not escape from them. And yet what a subtle magic there was in them! They seemed to be able to give a plastic form to formless things, and to have a music of their own as sweet as that of viol or of lute. Mere words! Was there anything so real as words?” - pondered Dorian.\n\n\n0 min\n\n\n\n\n\n\nNo matching items\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/index.html",
    "href": "materials/slides-frame/index.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "Title\n\n\nDescription\n\n\n\n\n\n\nWeek 1  Gamblers, God, Guinness and peas\n\n\n\n\n\n\nWeek 2  Revisiting Flatland\n\n\n\n\n\n\nWeek 3  Dear Prudence, Help! I may be cheating with my X\n\n\n\n\n\n\nWeek 4  The Y question\n\n\n\n\n\n\nWeek 5  Do we live in a simulation?\n\n\n\n\n\n\nWeek 6  Challenging hierarchies\n\n\n\n\n\n\nWeek 7  The unobserved\n\n\n\n\n\n\nWeek 8  Words, words, mere words…\n\n\n\n\n\n\n\n\nNo matching items\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w01.html",
    "href": "materials/slides-frame/slides-frame_w01.html",
    "title": "Week 1  Gamblers, God, Guinness and peas",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w02.html",
    "href": "materials/slides-frame/slides-frame_w02.html",
    "title": "Week 2  Revisiting Flatland",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w03.html",
    "href": "materials/slides-frame/slides-frame_w03.html",
    "title": "Week 3  Dear Prudence, Help! I may be cheating with my X",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w04.html",
    "href": "materials/slides-frame/slides-frame_w04.html",
    "title": "Week 4  The Y question",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w05.html",
    "href": "materials/slides-frame/slides-frame_w05.html",
    "title": "Week 5  Do we live in a simulation?",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w06.html",
    "href": "materials/slides-frame/slides-frame_w06.html",
    "title": "Week 6  Challenging hierarchies",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w07.html",
    "href": "materials/slides-frame/slides-frame_w07.html",
    "title": "Week 7  The unobserved",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides-frame/slides-frame_w08.html",
    "href": "materials/slides-frame/slides-frame_w08.html",
    "title": "Week 8  Words, words, mere words…",
    "section": "",
    "text": "View the slides full-screen in a standalone browser window here. The lecture recording is available on ReCap (requires Newcastle University login)\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/slides/dice-calculations.html#the-number-of-possible-outcomes-from-throwing-three-dice-without-counting-the-order",
    "href": "materials/slides/dice-calculations.html#the-number-of-possible-outcomes-from-throwing-three-dice-without-counting-the-order",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "The number of possible outcomes from throwing three dice, without counting the order",
    "text": "The number of possible outcomes from throwing three dice, without counting the order\nThe number of possible outcomes when throwing three six-sided dice, without considering the order of the numbers, is nCr(6+3-1, 3), where nCr represents the number of combinations.\nIn this case, nCr(6+3-1, 3) = nCr(8, 3) = 56. So there are 56 possible outcomes.\nThe formula nCr(n, r) represents the number of combinations of r elements from a set of n elements.\nIn the formula nCr(6+3-1, 3), n is 6+3-1 = 8, which represents the total number of elements in the set when considering all the possible outcomes from rolling three six-sided dice. The -1 in the formula accounts for the fact that we’re considering repeating dice rolls.\nr is 3, which represents the number of elements we want to choose (in this case, the number of dice rolls we want to consider in a combination).\nSo, nCr(6+3-1, 3) gives us the number of possible combinations of 3 dice rolls from a set of 8 possible outcomes (the 6 possible outcomes from each dice roll plus 2 duplicates). This is also known as the number of ways to choose 3 dice rolls out of 8 possible outcomes.\nThe result nCr(8, 3) = 56 means there are 56 possible combinations of 3 dice rolls from 8 possible outcomes.\nHow the ‘n’ is calculated?\nThe n in nCr(n, r) represents the total number of elements in the set from which we want to choose r elements.\nIn the case of three six-sided dice, each dice has 6 possible outcomes. If we roll three dice, the total number of possible outcomes is 6 * 6 * 6 = 216. However, because we don’t care about the order in which the dice land, we need to remove the duplicate combinations that would result from simply rearranging the order of the same three outcomes.\nTo account for duplicates, we consider the problem as choosing r elements from a set of n elements, where n is the total number of unique combinations that can be made from rolling three dice.\nTo calculate n, we can use the formula n = n + r - 1, where n is the total number of outcomes from rolling the dice (in this case, 6), and r is the number of dice we’re rolling (in this case, 3).\nSo in this case, n = 6 + 3 - 1 = 8. This means that we have 8 unique combinations of dice rolls, and we can choose 3 dice rolls from this set to form a combination. And the number of combinations of 3 dice rolls from this set of 8 unique combinations is nCr(8, 3) = 56.\nHow is ‘n = n + r - 1’ is derived?\nThe formula n = n + r - 1 is used to find the number of unique combinations of r elements from a set of n elements, when the elements can be repeated.\nThe reasoning behind this formula is as follows:\nImagine you have n distinct elements, and you want to form combinations of r elements. In this case, the number of unique combinations is simply nCr(n, r).\nHowever, when elements can be repeated, the number of unique combinations is not simply nCr(n, r), because the same combination of elements can be formed by rearranging the same r elements in different orders. To account for this, we need to consider the problem as choosing r elements from a set of n + r - 1 elements, where n + r - 1 is the total number of unique combinations that can be made from the n elements.\nThe -1 in the formula accounts for the fact that we’re considering repeating elements, so we need to remove one duplicate combination that would result from simply rearranging the same r elements.\nSo in the case of three six-sided dice, each dice has 6 possible outcomes, giving us n = 6 unique elements. If we roll three dice, we have r = 3 elements. And to find the number of unique combinations of r elements from a set of n elements, where elements can be repeated, we use the formula n = n + r - 1, giving us n = 6 + 3 - 1 = 8.\nThis means that there are 8 unique combinations of dice rolls that can be formed from rolling three six-sided dice, and we can choose 3 dice rolls from this set to form a combination. The number of combinations of 3 dice rolls from this set of 8 unique combinations is nCr(8, 3) = 56.\nHere’s an empirical demonstration in R using the formula:\n\n# Define the number of unique elements and the number of elements to choose\nn <- 6\nr <- 3\n\n# Calculate the number of unique combinations using the formula n = n + r - 1\nm <- n + r - 1\n\n# Calculate the number of combinations using the combination formula nCr(m, r)\ncombinations <- choose(m, r)\ncombinations\n\n[1] 56\n\n\nThis code returns the value 56, which is the number of unique combinations of r elements from a set of m elements, where m = n + r - 1 and n is the number of unique elements and r is the number of elements to choose. This is the same result as we found earlier using nCr(8, 3).\nAn empirical demonstration in R by enumerating all possible combinations and counting the number of unique combinations (without the choose function) (Not correct!!)\n\n# Define the number of unique elements and the number of elements to choose\nn <- 6\nr <- 3\n\n# Calculate the number of unique combinations using the formula n = n + r - 1\nm <- n + r - 1\n\n# Generate all possible combinations of length r from 1 to m\nall_combinations <- combn(1:m, r)\n\n# Convert the combinations to a matrix for easier manipulation\ncombination_matrix <- as.matrix(all_combinations)\n\n# Find the unique rows in the combination matrix\nunique_combinations <- unique(combination_matrix)\n\n# Count the number of unique combinations\nnum_unique_combinations <- nrow(unique_combinations)\nnum_unique_combinations\n\n[1] 3\n\n\nAn empirical demonstation by first finding all possible outcomes and then selecting out the duplicates: (not correct!!)\n\n# Define the number of unique elements and the number of elements to choose\nn <- 6\nr <- 3\n\n# Generate all possible combinations of length r from 1 to n\nall_combinations <- combn(1:n, r)\n\n# Convert the combinations to a matrix for easier manipulation\ncombination_matrix <- as.matrix(all_combinations)\n\n# Find the unique rows in the combination matrix\nunique_combinations <- unique(combination_matrix)\n\n# Count the number of unique combinations\nnum_unique_combinations <- nrow(unique_combinations)\nnum_unique_combinations\n\n[1] 3"
  },
  {
    "objectID": "materials/slides/w1.html#not-an-outline-slide",
    "href": "materials/slides/w1.html#not-an-outline-slide",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Not an outline slide",
    "text": "Not an outline slide\n\n\nGaming chance\nsecond topic\nThird topic\nForth topic"
  },
  {
    "objectID": "materials/slides/w1.html#section",
    "href": "materials/slides/w1.html#section",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "Gaming chance"
  },
  {
    "objectID": "materials/slides/w1.html#testing",
    "href": "materials/slides/w1.html#testing",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Testing",
    "text": "Testing\n\n\nTesting\n\n\nhow\n\n\nfragments work in\n\n\nreality"
  },
  {
    "objectID": "materials/slides/w1.html#testing-2",
    "href": "materials/slides/w1.html#testing-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Testing 2",
    "text": "Testing 2\n\nTesting\n. . .\nHow\n. . .\nfragments work\n. . .\nreally"
  },
  {
    "objectID": "materials/slides/w1.html#statistics",
    "href": "materials/slides/w1.html#statistics",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Statistics",
    "text": "Statistics\n\nand the state"
  },
  {
    "objectID": "materials/slides/w1.html#statistics-1",
    "href": "materials/slides/w1.html#statistics-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Statistics",
    "text": "Statistics\nand probability\n\n\n\nStatistics as the mathematical science of using probability to describe uncertainty"
  },
  {
    "objectID": "materials/slides/w1.html#gaming-chance",
    "href": "materials/slides/w1.html#gaming-chance",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Gaming chance",
    "text": "Gaming chance\n\n\n\nWe may never know when humans started playing games of chance, but archaeological findings suggest it was a rather long time ago\nDuring the the First Dynasty in Egypt (c. 3500 B.C.) variants of a game involving astragali (small bones in the ankle of an animal) were already documented\nOne of the chief games may have been the simple one of throwing four astragali together and noting which sides fell uppermost"
  },
  {
    "objectID": "materials/slides/w1.html#ālea-iacta-est",
    "href": "materials/slides/w1.html#ālea-iacta-est",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Ālea iacta est",
    "text": "Ālea iacta est\n\n\n\n\nThe six-sided die we know today may have been obtained from the astragalus by grinding it down until it formed a rough cube\nDice became common in the Ptolemaic dynasty (300 to 30 B.C.)\nThere is evidence that dice were used for divination rites in this period - one carried the sacred symbols of Osiris, Horus, Isis, Nebhat, Hathor and Horhudet engraved on its six sides\nIn Roman times, rule by divination attained great proportions; Emperors Septimius Severus (Emperor A.D. 193-211) and Diocletian (Emperor AD. 284-305) were notorious for their reliance on the whims of the gods"
  },
  {
    "objectID": "materials/slides/w1.html#fat-chance",
    "href": "materials/slides/w1.html#fat-chance",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Fat chance",
    "text": "Fat chance\n\n\n\nHe threw four knucklebones on to the table and committed his hopes to the throw. If he threw well, particularly if he obtained the image of the goddess herself, no two showing the same number, he adored the goddess, and was in high hopes of gratifying his passion; if he threw badly, as usually happens, and got an unlucky combination, he called down imprecations on all Cnidos, and was as much overcome by grief as if he had suffered some personal loss.\n— Lucian of Samosata (c. 125 – 180), writing in his trademark satirical style about a young man who fell in love with Praxiteles’s Aphrodite of Knidos; cited in F. N. David (1955:8)"
  },
  {
    "objectID": "materials/slides/w1.html#chance-with-limitations",
    "href": "materials/slides/w1.html#chance-with-limitations",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Chance with limitations",
    "text": "Chance with limitations\n\n\n\nDice were sometimes faked. Sometimes numbers were left off or duplicated; hollow dice have been found dating from Roman time\nDice were also imperfect; a “fair” die was the exception rather than the rule\nExperiment by F. N. David using three dice from the British Museum:"
  },
  {
    "objectID": "materials/slides/w1.html#exercise",
    "href": "materials/slides/w1.html#exercise",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Exercise",
    "text": "Exercise\n\n\nWhich of the three dice (if any) would you call “fair”?\nWhat distribution of outcomes would you expect 204 fair dice rolls to produce prior to seeing any results?\nHow would you expect that distribution to change as the number of rolls progresses towards \\(\\infty\\)?\nWhat name would you give to that distribution?\nverv\nrever"
  },
  {
    "objectID": "materials/slides/w1.html#from-chance-to-probability",
    "href": "materials/slides/w1.html#from-chance-to-probability",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "From chance to probability\n",
    "text": "From chance to probability\n\n\n\n\nUntil 18th century people had mostly used probability to solve problems about dice throwing and other games of chance\nJacob (Jacques/James) Bernoulli (1654/1655-1705), a Swiss mathematician trained as a theologian and ordained as a minister of the Reformed church in Basel, began asking questions about probabilistic inference instead\nHis work focused on the mathematics of uncertainty - what he came to call “stochastics” (from the Greek word \\(στόχος\\) [stókhos] meaning to “aim” or “guess’)\n\nArs Conjectandi (The Art of Conjecturing) - published posthumously in 1713"
  },
  {
    "objectID": "materials/slides/w1.html#inferential-questions",
    "href": "materials/slides/w1.html#inferential-questions",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Inferential questions",
    "text": "Inferential questions\n\n\n\nSuppose you are presented with a large urn full of tiny white and black pebbles, in a ratio that’s unknown to you. You begin selecting pebbles from the urn and recording their colors, black or white. How do you use these results to make a guess about the ratio of pebble colors in the urn as a whole?\n\n\nBernoulli’s solution: if you take a large enough sample, you can be very sure, to within a small margin of absolute certainty, that the proportion of white pebbles you observe in the sample is close to the proportion of white pebbles in the urn.\nA first version of the Law of Large Numbers"
  },
  {
    "objectID": "materials/slides/w1.html#large-numbers",
    "href": "materials/slides/w1.html#large-numbers",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Large numbers",
    "text": "Large numbers\n\n\nBernoulli’s solution, more technically:  For any given \\(\\epsilon\\) > 0 and any \\(s\\) > 0, there is a sample size \\(n\\) such that, with \\(w\\) being the number of white pebbles counted in the sample and \\(f\\) being the true fraction of white pebbles in the urn, the probability of \\(w/n\\) falling between \\(f − \\epsilon\\) and \\(f + \\epsilon\\) is greater than \\(1 − s\\).\nthe fraction \\(w/n\\) is the ratio of white to total pebbles we observe in our sample\n\\(\\epsilon\\) (epsilon) captures the fact that we may not see the true urn ratio exactly thanks to random variation in the sample; larger samples help assure that we get closer to the “true” value, but uncertainty always remains\n\\(s\\) reflects just how sure we want to be; for example, set \\(s\\) = 0.01 and be 99% percent sure.\n“moral certainty” as distinct from absolute certainty of the kind logical deduction provides"
  },
  {
    "objectID": "materials/slides/w2.html#guessing-game",
    "href": "materials/slides/w2.html#guessing-game",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Guessing game",
    "text": "Guessing game\n\n\n\n\n\n\n152 140 137 157 145 164 149 169 148 165 154 151 145 150 150 163 157 144 122 105 86 161 156 130 109 146 149 147 137 126 114 148 162 146 146 153 143 143 148 161 152 163 171 147 148 145 122 129 98 154 144 147 157 127 110 98 166 152 142 159 156 164 152 161 154 145 145 152 164 144 130 130 154 143 146 167 158 91 166 150 148 138 155 161 162 148 114 159 149 137 158 145 157 179 119 170 146 147 113 163 134 152 160 150 143 167 159 155 149 111 112 163 152 124 112 86 170 146 159 151 161 170 159 74 150 153 97 162 163 149 117 100 163 162 145 163 151 150 142 171 91 157 152 149 130 147 145 122 114 157 154 121 116 167 143 152 97 160 159 150 161 161 149 125 141 155 142 160 150 156 104 95 156 153 167 150 148 159 162 156 159 147 173 166 142 143 133 128 119 152 157 149 157 150 148 102 153 161 149 114 101 138 91 163 149 159 150 158 156 149 144 154 131 157 157 154 108 168 145 148 101 113 149 155 163 157 123 161 145 144 149 110 150 166 144 157 154 164 156 154 135 144 114 163 146 121 155 145 107 147 152 164 166 156 152 140 158 163 151 171 150 164 142 94 149 105 146 161 163 145 145 171 127 159 159 154 160 150 149 127 143 142 147 163 164 160 154 167 151 148 125 111 153 139 152 155 148 144 118 144 93 148 156 150 156 154 131 102 157 169 150 112 160 168 144 145 160 147 164 153 149 160 149 85 84 60 93 111 91 154 100 62 82 97 80 150 152 141 88 158 149 152 155 124 104 161 149 97 93 161 157 167 157 91 60 137 152 152 81 109 71 89 67 85 70 162 152 89 90 72 84 159 142 142 169 123 75 74 91 160 68 136 158 85 93 152 156 154 157 120 114 84 156 137 114 94 168 148 140 157 76 66 161 114 146 161 70 134 68 150 163 149 149 162 154 69 151 164 153 152 132 156 140 159 143 84 152 161 128 161 145 132 118 160 155 161 166 158 155 98 64 161 147 147 147 173 158 147 125 106 166 150 76 162 140 67 63 164 148 160 155 152 62 146 152 157 56 61 152 145 118 78 161 151 122 93 154 147 140 157 91 155 144 83 158 147 124 89 160 137 165 155 111 154 145 142 145 164 161 155 161 170 150 124 85 161 155 106 126 166 148 124 90 102 152 149 154 54 147 57 101 122 82 155 156 133 125 102 161 146 133 88 156 152 163 115 68 143 77 145 163 156 71 159\n\n\n\n\nWhat are the most appropriate summary statistics for this sample?\nWhat is the Mean (\\(\\bar{x}\\)) of this dataset? What is its standard deviation (\\(s\\))?"
  },
  {
    "objectID": "materials/slides/w2.html#revised-guesses",
    "href": "materials/slides/w2.html#revised-guesses",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Revised guesses",
    "text": "Revised guesses\n\n\nThe data consist of 544 measurements of human height\nThe mean of the data is 138.2635963\nThe standard deviation is 27.6024476"
  },
  {
    "objectID": "materials/slides/w2.html#kung-demography",
    "href": "materials/slides/w2.html#kung-demography",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "!Kung demography",
    "text": "!Kung demography\n\n\nthis is a bulletpoint\nanother bullet\nfinal one bites the bullet"
  },
  {
    "objectID": "materials/slides/w2.html#section",
    "href": "materials/slides/w2.html#section",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "library(patchwork)\n\ny <- tibble(y)\n\nggplot(y, aes(x = y)) + scale_x_continuous(n.breaks = 20, limits = c(50, 180)) +\n  geom_boxplot() + theme_void() +\nggplot(y, aes(x = y)) + \n  geom_histogram() + scale_x_continuous(n.breaks = 20, limits = c(50, 180)) + scale_y_continuous(n.breaks = 10) +\n  geom_vline(aes(xintercept = median(y))) +\n#box + hist + \n  plot_layout(nrow = 2, heights = c(0.2, 4))"
  },
  {
    "objectID": "materials/slides/w2.html#section-1",
    "href": "materials/slides/w2.html#section-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "ggplot(y, aes(x = y)) + \n  geom_histogram() + scale_x_continuous(n.breaks = 20) + scale_y_continuous(n.breaks = 10) +\n  geom_vline(aes(xintercept = median(y)))\n\n\ny <- y |> mutate(height = y, age = howell$age, male = as_factor(howell$male))\n\nggplot(y, aes(x = height, fill = factor(round(age)))) + guides(fill = FALSE) +\n  geom_histogram() + scale_x_continuous(n.breaks = 20) + scale_y_continuous(n.breaks = 10) + \n  scale_fill_grey(start = 0.9, end = 0.1)"
  },
  {
    "objectID": "materials/slides/w2.html#section-2",
    "href": "materials/slides/w2.html#section-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "y <- y |> mutate(x = howell$age, z = as_factor(howell$male))\n\n#theme_set(theme_bw())\n\nplot(y, aes(x = x, y = y, colour = z)) + \n  geom_point(alpha = 0.7) + scale_x_continuous(n.breaks = 10) + scale_y_continuous(n.breaks = 10) + \n  labs(title = \"\", x = \"x\", y = \"y\", colour = \"\") + \n  theme(legend.position = c(0.92,0.16)) +\n  scale_color_manual(name = \"Sex\", labels = c(\"Female\", \"Male\"), values = c(100,200))\n\n#+ scale_color_brewer(palette = \"Dark2\") \n  \n  \n  #xlab(\"\\nx\") + ylab(\"y\\n\") + scale_color_brewer(palette = \"Dark2\") #+ scale_colour_manual(\"\", values = c(1, 2), labels = c(\"Female\", \"Male\"))"
  },
  {
    "objectID": "materials/slides/w2.html#section-3",
    "href": "materials/slides/w2.html#section-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "summary(howell)\n\n     height           weight            age             male       \n Min.   : 53.98   Min.   : 4.252   Min.   : 0.00   Min.   :0.0000  \n 1st Qu.:125.09   1st Qu.:22.008   1st Qu.:12.00   1st Qu.:0.0000  \n Median :148.59   Median :40.058   Median :27.00   Median :0.0000  \n Mean   :138.26   Mean   :35.611   Mean   :29.34   Mean   :0.4724  \n 3rd Qu.:157.48   3rd Qu.:47.209   3rd Qu.:43.00   3rd Qu.:1.0000  \n Max.   :179.07   Max.   :62.993   Max.   :88.00   Max.   :1.0000  \n\n\n\n\nsummary(lm(height ~ weight, data = howell))\n\n\nCall:\nlm(formula = height ~ weight, data = howell)\n\nResiduals:\n     Min       1Q   Median       3Q      Max \n-28.9634  -5.7794   0.7503   6.7207  20.7799 \n\nCoefficients:\n            Estimate Std. Error t value Pr(>|t|)    \n(Intercept)  75.4359     1.0517   71.72   <2e-16 ***\nweight        1.7643     0.0273   64.63   <2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 9.363 on 542 degrees of freedom\nMultiple R-squared:  0.8851,    Adjusted R-squared:  0.8849 \nF-statistic:  4177 on 1 and 542 DF,  p-value: < 2.2e-16\n\n\n\n\nsummary(lm(height ~ weight + age + male, data = howell))\n\n\nCall:\nlm(formula = height ~ weight + age + male, data = howell)\n\nResiduals:\n    Min      1Q  Median      3Q     Max \n-29.011  -5.409   0.730   6.490  19.735 \n\nCoefficients:\n            Estimate Std. Error t value Pr(>|t|)    \n(Intercept) 75.94238    1.06471  71.327  < 2e-16 ***\nweight       1.65634    0.03740  44.289  < 2e-16 ***\nage          0.11247    0.02621   4.291 2.11e-05 ***\nmale         0.07931    0.80942   0.098    0.922    \n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 9.222 on 540 degrees of freedom\nMultiple R-squared:  0.889, Adjusted R-squared:  0.8884 \nF-statistic:  1441 on 3 and 540 DF,  p-value: < 2.2e-16"
  },
  {
    "objectID": "materials/slides/w4.html#revisiting-trust-end-education",
    "href": "materials/slides/w4.html#revisiting-trust-end-education",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Revisiting trust end education\n",
    "text": "Revisiting trust end education\n\n\nosterman <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005-data/osterman.dta\")\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_jitter(alpha = 0.03) +\n  geom_smooth(method = \"lm\") + \n  scale_y_continuous(n.breaks = 15) + \n  scale_x_continuous(n.breaks = 10)"
  },
  {
    "objectID": "materials/slides/w4.html#three-linear-models",
    "href": "materials/slides/w4.html#three-linear-models",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Three linear models",
    "text": "Three linear models\n\nlm_data <- osterman |> select(trustindex3, eduyrs25, agea, female) |> \n  mutate(med_trust = trustindex3 - median(trustindex3)) |> \n  mutate(d_trust = trustindex3 >= median(trustindex3))"
  },
  {
    "objectID": "materials/slides/w4.html#three-linear-models-1",
    "href": "materials/slides/w4.html#three-linear-models-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Three linear models",
    "text": "Three linear models\n\nlm_data <- osterman |> select(trustindex3, eduyrs25, agea, female) |> \n  mutate(med_trust = trustindex3 - median(trustindex3)) |> \n  mutate(d_trust = trustindex3 >= median(trustindex3))\n\nlm1 <- lm(trustindex3 ~ eduyrs25 + agea + female, lm_data)\nlm2 <- lm(med_trust ~ eduyrs25 + agea + female, lm_data)\nlm3 <- lm(d_trust ~ eduyrs25 + agea + female, lm_data)"
  },
  {
    "objectID": "materials/slides/w4.html#three-linear-models-2",
    "href": "materials/slides/w4.html#three-linear-models-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Three linear models",
    "text": "Three linear models\n\nlm_data <- osterman |> select(trustindex3, eduyrs25, agea, female) |> \n  mutate(med_trust = trustindex3 - median(trustindex3)) |> \n  mutate(d_trust = trustindex3 >= median(trustindex3))\n\nlm1 <- lm(trustindex3 ~ eduyrs25 + agea + female, lm_data)\nlm2 <- lm(med_trust ~ eduyrs25 + agea + female, lm_data)\nlm3 <- lm(d_trust ~ eduyrs25 + agea + female, lm_data)\n\nlm_table <- modelsummary::modelsummary(list(\n                                \"lm\" = lm1, \n                                \"lm_cent\" = lm2, \n                                \"lm_prob\" = lm3), \n                                statistic = 'conf.int')\n\nlm_table"
  },
  {
    "objectID": "materials/slides/w4.html#three-linear-models-3",
    "href": "materials/slides/w4.html#three-linear-models-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Three linear models",
    "text": "Three linear models\n\n\n\n\n\n   \n    lm \n    lm_cent \n    lm_prob \n  \n\n\n (Intercept) \n    2.955 \n    −2.379 \n    0.009 \n  \n\n  \n    [2.871, 3.039] \n    [−2.463, −2.295] \n    [−0.013, 0.031] \n  \n\n eduyrs25 \n    0.116 \n    0.116 \n    0.027 \n  \n\n  \n    [0.113, 0.120] \n    [0.113, 0.120] \n    [0.026, 0.028] \n  \n\n agea \n    0.016 \n    0.016 \n    0.004 \n  \n\n  \n    [0.014, 0.017] \n    [0.014, 0.017] \n    [0.003, 0.004] \n  \n\n female \n    0.041 \n    0.041 \n    0.010 \n  \n\n  \n    [0.013, 0.069] \n    [0.013, 0.069] \n    [0.003, 0.018] \n  \n\n\n\n\n\ntidy(m_logit) |> select(term, estimate) |>  rename(logit = estimate) |> \n  add_column(\n    tidy(m_probit) |> select(estimate) |>  rename(\"probit <br> longer name\" = estimate)) |> kable(escape = FALSE)"
  },
  {
    "objectID": "materials/slides/w4.html#the-logit-model",
    "href": "materials/slides/w4.html#the-logit-model",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "The logit model",
    "text": "The logit model\n\nm_logit <- glm(d_trust ~ eduyrs25 + agea + female, family = binomial(link = \"logit\"), data = lm_data)\nplot_predictions(m_logit, condition = c(\"eduyrs25\"))"
  },
  {
    "objectID": "materials/slides/w4.html#the-probit-model",
    "href": "materials/slides/w4.html#the-probit-model",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "The probit model",
    "text": "The probit model\n\nm_probit <- glm(d_trust ~ eduyrs25 + agea + female, family = binomial(link = \"probit\"), data = lm_data)\nplot_predictions(m_probit, condition = c(\"eduyrs25\"))"
  },
  {
    "objectID": "materials/slides/w4.html#model-comparisons",
    "href": "materials/slides/w4.html#model-comparisons",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Model comparisons",
    "text": "Model comparisons\n\n\n\n\n\n   \n    lm \n    lm_cent \n    lm_prob \n    Logit \n    Probit \n  \n\n\n (Intercept) \n    2.955 \n    −2.379 \n    0.009 \n    −2.103 \n    −1.289 \n  \n\n  \n    [2.871, 3.039] \n    [−2.463, −2.295] \n    [−0.013, 0.031] \n    [−2.201, −2.006] \n    [−1.348, −1.230] \n  \n\n eduyrs25 \n    0.116 \n    0.116 \n    0.027 \n    0.116 \n    0.071 \n  \n\n  \n    [0.113, 0.120] \n    [0.113, 0.120] \n    [0.026, 0.028] \n    [0.112, 0.120] \n    [0.069, 0.074] \n  \n\n agea \n    0.016 \n    0.016 \n    0.004 \n    0.015 \n    0.009 \n  \n\n  \n    [0.014, 0.017] \n    [0.014, 0.017] \n    [0.003, 0.004] \n    [0.014, 0.017] \n    [0.009, 0.010] \n  \n\n female \n    0.041 \n    0.041 \n    0.010 \n    0.042 \n    0.026 \n  \n\n  \n    [0.013, 0.069] \n    [0.013, 0.069] \n    [0.003, 0.018] \n    [0.011, 0.073] \n    [0.007, 0.046]"
  },
  {
    "objectID": "materials/slides/w4.html#probabilities-odds-log-odds-logit-odds-ratios",
    "href": "materials/slides/w4.html#probabilities-odds-log-odds-logit-odds-ratios",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Probabilities, Odds, Log-odds (logit), Odds Ratios",
    "text": "Probabilities, Odds, Log-odds (logit), Odds Ratios\n\n\n\n\n\n\n\\(prob = \\{0, \\dots, 1\\}\\)\n\n\\(odds = {prob \\over (1-prob)}\\)\n\n\\(log\\_odds = \\ln(odds)\\)\n\n\\(odds = {\\rm e}^{\\ln(odds)}\\)\n\n\\(odds\\_ratio = {odds1 \\over odds2} = {{prob1 \\over (1-prob1)} \\over {prob2 \\over (1-prob2)}}\\)\n\n\n\n\n\n\n\n probs \n    odds \n    log_odds \n  \n\n\n 0.01 \n    0.01 \n    -4.60 \n  \n\n 0.05 \n    0.05 \n    -2.94 \n  \n\n 0.10 \n    0.11 \n    -2.20 \n  \n\n 0.20 \n    0.25 \n    -1.39 \n  \n\n 0.33 \n    0.50 \n    -0.69 \n  \n\n 0.40 \n    0.67 \n    -0.41 \n  \n\n 0.50 \n    1.00 \n    0.00 \n  \n\n 0.60 \n    1.50 \n    0.41 \n  \n\n 0.67 \n    2.00 \n    0.69 \n  \n\n 0.80 \n    4.00 \n    1.39 \n  \n\n 0.90 \n    9.00 \n    2.20 \n  \n\n 0.95 \n    19.00 \n    2.94 \n  \n\n 0.99 \n    99.00 \n    4.60"
  },
  {
    "objectID": "materials/slides/w4.html#probabilities-odds-log-odds-logit-odds-ratios-1",
    "href": "materials/slides/w4.html#probabilities-odds-log-odds-logit-odds-ratios-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Probabilities, Odds, Log-odds (logit), Odds Ratios",
    "text": "Probabilities, Odds, Log-odds (logit), Odds Ratios\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n probs \n    odds \n    log_odds \n  \n\n\n 0.01 \n    0.01 \n    -4.60 \n  \n\n 0.05 \n    0.05 \n    -2.94 \n  \n\n 0.10 \n    0.11 \n    -2.20 \n  \n\n 0.20 \n    0.25 \n    -1.39 \n  \n\n 0.33 \n    0.50 \n    -0.69 \n  \n\n 0.40 \n    0.67 \n    -0.41 \n  \n\n 0.50 \n    1.00 \n    0.00 \n  \n\n 0.60 \n    1.50 \n    0.41 \n  \n\n 0.67 \n    2.00 \n    0.69 \n  \n\n 0.80 \n    4.00 \n    1.39 \n  \n\n 0.90 \n    9.00 \n    2.20 \n  \n\n 0.95 \n    19.00 \n    2.94 \n  \n\n 0.99 \n    99.00 \n    4.60"
  },
  {
    "objectID": "materials/slides/w4.html#poisson-model",
    "href": "materials/slides/w4.html#poisson-model",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Poisson model",
    "text": "Poisson model\n\n\n# A tibble: 78 × 3\n   department course  number_of_A\n   <chr>      <chr>         <int>\n 1 1          DEP_1_a           4\n 2 1          DEP_1_b           2\n 3 1          DEP_1_c           5\n 4 1          DEP_1_d           4\n 5 1          DEP_1_e           1\n 6 1          DEP_1_f           4\n 7 1          DEP_1_g           3\n 8 1          DEP_1_h           3\n 9 1          DEP_1_i           3\n10 1          DEP_1_j           3\n# … with 68 more rows"
  },
  {
    "objectID": "materials/slides/w4.html#poisson-model-1",
    "href": "materials/slides/w4.html#poisson-model-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Poisson model",
    "text": "Poisson model"
  },
  {
    "objectID": "materials/slides/w4.html#poisson-model-2",
    "href": "materials/slides/w4.html#poisson-model-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Poisson model",
    "text": "Poisson model\n\ngrades_base <-\n  glm(\n    number_of_A ~ department,\n    data = count_of_A,\n    family = \"poisson\"\n  )\nsummary(grades_base)\n\n\nCall:\nglm(formula = number_of_A ~ department, family = \"poisson\", data = count_of_A)\n\nDeviance Residuals: \n     Min        1Q    Median        3Q       Max  \n-2.61555  -0.69944  -0.09568   0.60343   2.28141  \n\nCoefficients:\n            Estimate Std. Error z value Pr(>|z|)    \n(Intercept)   1.3269     0.1010  13.135  < 2e-16 ***\ndepartment2   0.8831     0.1201   7.353 1.94e-13 ***\ndepartment3   1.7029     0.1098  15.505  < 2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n    Null deviance: 426.201  on 77  degrees of freedom\nResidual deviance:  75.574  on 75  degrees of freedom\nAIC: 392.55\n\nNumber of Fisher Scoring iterations: 4"
  },
  {
    "objectID": "materials/slides/w5.html#the-uses-of-simulation-methods",
    "href": "materials/slides/w5.html#the-uses-of-simulation-methods",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "The uses of simulation methods",
    "text": "The uses of simulation methods\n\nSimulating data uses generating random data sets with known properties using code (or some other method). This can be useful in various contexts.\n\n\nTo better understand our models. Probability models mimic variation in the world, and the tools of simulation can help us better understand this variation. Patterns of randomness are contrary to normal human thinking and simulation helps in training our intuitions about averages and variation\n\nTo run statistical analyses (e.g., simulating a null distribution against which to compare a sample)\n\nTo approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures"
  },
  {
    "objectID": "materials/slides/w5.html#distribution-functions",
    "href": "materials/slides/w5.html#distribution-functions",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Distribution functions",
    "text": "Distribution functions\n\nBase R functions:\n\nrnorm(): sampling from a normal distribution\nrunif(): sampling from a uniform distribution\nrbinom(): sampling from a binomial distribution\nrpois(): sampling from a Poisson distribution\n\n(Other distributions are also available)\n\nsample(): sampling elements from an R object with or without replacement\nreplicate(): often plays a role in conjunction with sampling functions; it is used to evaluate an expression N number of times repeatedly\n\nFrom non-base packages:\n\n\nMASS::mvtnorm(): multivariate normal; sampling multiple variables with a known correlation structure (i.e., we can tell R how variables should be correlated with one another) and normally distributed errors"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-uniform-distribution",
    "href": "materials/slides/w5.html#sampling-from-a-uniform-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a uniform distribution",
    "text": "Sampling from a uniform distribution\nThe runif function returns some number (n) of random numbers from a uniform distribution with a range from \\(a\\) (min) to \\(b\\) (max) such that \\(X\\sim\\mathcal U(a,b)\\) (verbally, \\(X\\) is sampled from a uniform distribution with the parameters \\(a\\) and \\(b\\)), where \\(-\\infty < a < b < \\infty\\) (verbally, \\(a\\) is greater than negative infinity but less than \\(b\\), and \\(b\\) is finite). The default is to draw from a standard uniform distribution (i.e., \\(a = 0\\) and \\(b = 1\\)):\n\n\n# Sample a vector of ten numbers and store the results in the object `rand_unifs`\n# Note that the numbers will be different each time we re-run the `runif` function above.\n# If we want to recreate the same sample, we should set a `seed` number first\n\nrand_unifs <- runif(n = 10000, min = 0, max = 1);\n\n\n\nThe first 40 numbers from the sample are:\n\n\n [1] 0.73389592 0.77027279 0.12883356 0.62677799 0.07682038 0.08668081\n [7] 0.95609747 0.76159718 0.55481559 0.61747149 0.25032236 0.19532391\n[13] 0.16115864 0.97814687 0.99120674 0.09791592 0.93735431 0.53521339\n[19] 0.47323976 0.32125960 0.04244730 0.59705072 0.07353607 0.76877016\n[25] 0.38614356 0.67211119 0.26172603 0.32942547 0.92414770 0.28457958\n[31] 0.25625157 0.26928066 0.66945283 0.08099618 0.27268495 0.60555933\n[37] 0.07795224 0.30725433 0.47694105 0.34310998 0.94421787 0.12908665\n[43] 0.18597160 0.14777209 0.46402415 0.61465427 0.78012954 0.59137894\n[49] 0.58202826 0.22381850"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-uniform-distribution-1",
    "href": "materials/slides/w5.html#sampling-from-a-uniform-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a uniform distribution",
    "text": "Sampling from a uniform distribution\nTo visualise the entire sample, we can plot it on a histogram:"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-normal-distribution",
    "href": "materials/slides/w5.html#sampling-from-a-normal-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a normal distribution",
    "text": "Sampling from a normal distribution\nThe rnorm function returns some number (n) of randomly generated values given a set mean (\\(\\mu\\); mean) and standard deviation (\\(\\sigma\\); sd), such that \\(X\\sim\\mathcal N(\\mu,\\sigma^2)\\). The default is to draw from a standard normal (a.k.a., “Gaussian”) distribution (i.e., \\(\\mu = 0\\) and \\(\\sigma = 1\\)):\n\n\nrand_norms_10000 <- rnorm(n = 10000, mean = 0, sd = 1)\n\nprint(rand_norms_10000[1:20])\n\n\n\n [1]  0.8247128 -0.2646844 -0.8189774 -1.1496807 -0.9199141  1.9054621\n [7]  2.1109840 -0.2281314  2.5573187 -1.1336439 -1.8498121  0.7892403\n[13]  0.1478274 -1.1718075  0.9450400 -0.6083184  0.9430121 -1.0393722\n[19] -0.6519066  0.4566983"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-normal-distribution-1",
    "href": "materials/slides/w5.html#sampling-from-a-normal-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a normal distribution",
    "text": "Sampling from a normal distribution\n\nHistograms allow us to check how samples from the same distribution might vary.\nExercise: Compare the above distribution with a normal distribution that had a standard deviation of 2 instead of 1.\nSample 10,000 new values in rnorm with sd = 2 instead of sd = 1 and create a new histogram with hist.\nTo see what the distribution of sampled data might look like given a low sample size (e.g., 10), repeat the process of sampling from rnorm(n = 10, mean = 0, sd = 1) multiple times and look at the shape of the resulting histogram."
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-poisson-distribution",
    "href": "materials/slides/w5.html#sampling-from-a-poisson-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a Poisson distribution",
    "text": "Sampling from a Poisson distribution\nA Poisson process describes events happening with some given probability over an area of time or space such that \\(X\\sim Poisson(\\lambda)\\), where the rate parameter \\(\\lambda\\) is both the mean and variance of the Poisson distribution (note that by definition, \\(\\lambda > 0\\), and although \\(\\lambda\\) can be any positive real number, data are always integers, as with count data).\n\nSampling from a Poisson distribution can be done in R with rpois, which takes only two arguments specifying the number of values to be returned (n) and the rate parameter (lambda). There are no default values for rpois.\n\n\nrand_poissons <- rpois(n = 10, lambda = 1.5)\n\nprint(rand_poissons)\n\n\n\n [1] 2 2 0 1 3 2 3 5 3 0"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-poisson-distribution-1",
    "href": "materials/slides/w5.html#sampling-from-a-poisson-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a Poisson distribution",
    "text": "Sampling from a Poisson distribution\nA histogram of a large number of values to see the distribution when \\(\\lambda = 4.5\\):\n\nrand_poissons_10000 <- rpois(n = 10000, lambda = 4.5)"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-binomial-distribution",
    "href": "materials/slides/w5.html#sampling-from-a-binomial-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\nA binomial distribution describes the number of ‘successes’ for some number of independent trials (\\(\\Pr(success) = p\\)).\nThe rbinom function returns the number of successes after size trials, in which the probability of success in each trial is prob.\nSampling from a binomial distribution in R with rbinom is a bit more complex than using runif, rnorm, or rpois.\nLike those previous functions, the rbinom function returns some number (n) of random numbers, but the arguments and output can be slightly confusing at first."
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-binomial-distribution-1",
    "href": "materials/slides/w5.html#sampling-from-a-binomial-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\nFor example, suppose we want to simulate the flipping of a fair coin 1000 times, and we want to know how many times that coin comes up heads (‘success’). We can do this with the following code:\n\n\ncoin_flips <- rbinom(n = 1, size = 1000, prob = 0.5)\n\ncoin_flips\n\n[1] 493\n\n\n\n\nThe above result shows that the coin came up heads 493 times. But note the (required) argument n. This allows us to set the number of sequences to run.\nIf we instead set n = 2, then this could simulate the flipping of a fair coin 1000 times once to see how many times heads comes up, then repeating the whole process a second time to see how many times heads comes up again (or, if it is more intuitive, the flipping of two separate fair coins 1000 times at the same time)."
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-binomial-distribution-2",
    "href": "materials/slides/w5.html#sampling-from-a-binomial-distribution-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\ncoin_flips_2 <- rbinom(n = 2, size = 1000, prob = 0.5)\n\ncoin_flips_2\n\n[1] 480 476\n\n\n\nA coin was flipped 1000 times and returned 480 heads, and then another fair coin was flipped 1000 times and returned 476 heads.\n\n\n\nAs with the rnorm and runif functions, we can check to see what the distribution of the binomial function looks like if we repeat this process."
  },
  {
    "objectID": "materials/slides/w5.html#sampling-from-a-binomial-distribution-3",
    "href": "materials/slides/w5.html#sampling-from-a-binomial-distribution-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\nSuppose that we want to see the distribution of the number of times heads comes up after 1000 flips. We can simulate the process of flipping 1000 times in a row with 10000 different coins:\n\n\ncoin_flips_10000 <- rbinom(n = 10000, size = 1000, prob = 0.5)"
  },
  {
    "objectID": "materials/slides/w5.html#random-sampling-using-sample",
    "href": "materials/slides/w5.html#random-sampling-using-sample",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Random sampling using sample\n",
    "text": "Random sampling using sample\n\n\nSometimes it is useful to sample a set of values from a vector or list. The R function sample is very flexible for sampling a subset of numbers or elements from some structure (x) in R according to some set probabilities (prob).\nElements can be sampled from x some number of times (size) with or without replacement (replace), though an error will be returned if the size of the sample is larger than x but replace = FALSE (default).\nSuppose we want to ask R to pick a random number from one to ten with equal probability:\n\n\nrand_number_1 <- sample(x = 1:10, size = 1)\n\nprint(rand_number_1)\n\n[1] 2"
  },
  {
    "objectID": "materials/slides/w5.html#random-sampling-using-sample-1",
    "href": "materials/slides/w5.html#random-sampling-using-sample-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Random sampling using sample\n",
    "text": "Random sampling using sample\n\n\nWe can increase the size of the sample to 10:\n\n\nrand_number_10 <- sample(x = 1:10, size = 10)\nprint(rand_number_10)\n\n [1]  5  3  1  7  6  9  2  8  4 10\n\n\n\nNote that all numbers from 1 to 10 have been sampled, but in a random order. This is because the default is to sample without replacement, meaning that once a number has been sampled for the first element in rand_number_10, it is no longer available to be sampled again.\n\n\n\nWe can change this and allow for sampling with replacement:\n\n\nrand_number_10_r <- sample(x = 1:10, size = 10, replace = TRUE)\n\nprint(rand_number_10_r)\n\n [1]  4 10  1  9  3  4 10  9  6 10\n\n\n\nNote that the numbers {4, 9, 10} are now repeated in the set of randomly sampled values above."
  },
  {
    "objectID": "materials/slides/w5.html#random-sampling-using-sample-2",
    "href": "materials/slides/w5.html#random-sampling-using-sample-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Random sampling using sample\n",
    "text": "Random sampling using sample\n\n\nSo far, because we have not specified a probability vector prob, the function assumes that every element in 1:10 is sampled with equal probability\nHere’s an example in which the numbers 1-5 are sampled with a probability of 0.05, while the numbers 6-10 are sampled with a probability of 0.15, thereby biasing sampling toward larger numbers; we always need to ensure that these probabilities need to sum to 1.\n\n\nprob_vec      <- c( rep(x = 0.05, times = 5), rep(x = 0.15, times = 5))\n\nrand_num_bias <- sample(x = 1:10, size = 10, replace = TRUE, prob = prob_vec)\n\nprint(rand_num_bias)\n\n [1]  1  2  6  3  9  9  8  8  6 10"
  },
  {
    "objectID": "materials/slides/w5.html#sampling-random-characters-from-a-list",
    "href": "materials/slides/w5.html#sampling-random-characters-from-a-list",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling random characters from a list",
    "text": "Sampling random characters from a list\n\nWe can also sample characters from a list of elements; it is no different than sampling numbers\nFor example, if we want to create a simulated data set that includes three different species of some plant or animal, we could create a vector of species identities from which to sample:\n\n\nspecies <- c(\"species_A\", \"species_B\", \"species_C\");\n\n\nWe can then sample from these three possible categories. For example:\n\n\nsp_sample <- sample(x = species, size = 24, replace = TRUE, \n                    prob = c(0.5, 0.25, 0.25))\n\n\nWhat did the code above do?\n\n\n\n\n [1] \"species_B\" \"species_C\" \"species_A\" \"species_A\" \"species_B\" \"species_A\"\n [7] \"species_A\" \"species_C\" \"species_C\" \"species_B\" \"species_C\" \"species_A\"\n[13] \"species_C\" \"species_A\" \"species_A\" \"species_A\" \"species_A\" \"species_C\"\n[19] \"species_B\" \"species_A\" \"species_C\" \"species_B\" \"species_A\" \"species_B\""
  },
  {
    "objectID": "materials/slides/w5.html#simulating-data-with-known-correlations",
    "href": "materials/slides/w5.html#simulating-data-with-known-correlations",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Simulating data with known correlations",
    "text": "Simulating data with known correlations\n\nWe can generate variables \\(X_{1}\\) and \\(X_{2}\\) that have known correlations \\(\\rho\\) with with one another.\nFor example: two standard normal random variables with a sample size of 10000, and with correlation between them of 0.3:\n\n\nN   <- 10000\nrho <- 0.3\nx1  <- rnorm(n = N, mean = 0, sd = 1)\nx2  <- (rho * x1) + sqrt(1 - rho*rho) * rnorm(n = N, mean = 0, sd = 1)\n\n\nThese variables are generated by first simulating the sample \\(x_{1}\\) (x1 above) from a standard normal distribution. Then, \\(x_{2}\\) (x2 above) is calculated as\n\n\\(x_{2} = \\rho x_{1} + \\sqrt{1 - \\rho^{2}}x_{rand}\\),\nwhere \\(x_{rand}\\) is a sample from a normal distribution with the same variance as \\(x_{1}\\)."
  },
  {
    "objectID": "materials/slides/w5.html#simulating-data-with-known-correlations-1",
    "href": "materials/slides/w5.html#simulating-data-with-known-correlations-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Simulating data with known correlations",
    "text": "Simulating data with known correlations\n\nWe can generate variables \\(X_{1}\\) and \\(X_{2}\\) that have known correlations \\(\\rho\\) with with one another.\nFor example: two standard normal random variables with a sample size of 10000, and with correlation between them of 0.3:\n\n\nN   <- 10000\nrho <- 0.3\nx1  <- rnorm(n = N, mean = 0, sd = 1)\nx2  <- (rho * x1) + sqrt(1 - rho*rho) * rnorm(n = N, mean = 0, sd = 1)\n\n\nDoes the correlation equal rho (with some sampling error)?\n\n\ncor(x1, x2)\n\n[1] 0.2952028"
  },
  {
    "objectID": "materials/slides/w5.html#simulating-data-with-known-correlations-2",
    "href": "materials/slides/w5.html#simulating-data-with-known-correlations-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Simulating data with known correlations",
    "text": "Simulating data with known correlations\n\nThere is a more efficient way to generate any number of variables with different variances and correlations to one another.\nWe need to use the MASS library, which can be installed and loaded as below:\n\n\ninstall.packages(\"MASS\")\nlibrary(\"MASS\")\n\n\n\n\n\nIn the MASS library, the function mvrnorm can be used to generate any number of variables for a pre-specified covariance structure."
  },
  {
    "objectID": "materials/slides/w5.html#statistical-power",
    "href": "materials/slides/w5.html#statistical-power",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Statistical power",
    "text": "Statistical power\n\n\nStatistical power is defined as the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level (typically a p-value below 0.05), given some assumed true effect size\nIf a certain effect of interest exists (e.g. a difference between two groups) power is the chance that we actually find the effect in a given study\nA power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold\nThe conventional view is that you should avoid low-power studies because they are unlikely to succeed\nThere are several problems with this view, but it’s often required by research funding bodies"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nWe can use simulation to test rather complex study designs\nImagine you are interested in students attitude towards smoking and how it depends on the medium of the message and the focus of the message\nWe want to know whether people’s attitude is different after seeing a visual anti-smoking message (these pictures on the package) vs a text-message (the text belonging to that picture)\nWe are interested in whether the attitude that people report is different after seeing a message that regards the consequences on other people (e.g. smoking can harm your loved ones) as compared to yourself (smoking can cause cancer)"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design-1",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\nStudy design:\nDV: attitude towards smoking (0-100) IV1: medium (text vs. visual) IV2: focus (internal vs. external)\nThis is, there are 4 groups:\n\ngroup_TI will receive text-messages that are internal\ngroup_TE will receive text-messages that are external\ngroup_VI will receive visual messages that are internal\ngroup_VE will receive visual messages that are external"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design-2",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nassume that we expect that people’s attitude will be more negative after seeing a visual rather than text message if the focus is internal (i.e. the message is about yourself) because it might be difficult to imagine that oneself would get cancer after reading a text but seeing a picture might cause fear regardless\nfor the external focus on the other hand, we expect a more negative attitude after reading a text as compared to seeing a picture, as it might have more impact on attitude to imagine a loved one get hurt than seeing a stranger in a picture suffering from the consequences of second-hand smoking\nwe expect that the internal focus messages will be related to lower attitudes compared to the external focus messages on average but we expect no main-effect of picture vs. text-messages"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design-3",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nvisualize some rough means that show the desired behavior that we described in words earlier and see where we are going\nwe could make the overall mean of the internal focus groups (group_TI and group_VI) 20 and the mean of the external groups (group_TE and group_VE) 50 (this would already reflect the main-effect but also a belief that the smoking-attitudes are on average quite negative as we assume both means to be on the low end of the scale)\nassume that the mean of group_TI is 30 while the mean of group_VI is 10 and we could assume that the mean of group_TE is 40 and the mean of group_VE is 60"
  },
  {
    "objectID": "materials/slides/w5.html#section-2",
    "href": "materials/slides/w5.html#section-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "focus <- rep(c(\"internal\", \"external\"), each = 2)\nmedia <- rep(c(\"text\", \"visual\"), times = 2)\nmean_TI <- 50\nmean_VI <- 20\nmean_TE <- 30\nmean_VE <- 60\n\npd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)\n\ninteraction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))"
  },
  {
    "objectID": "materials/slides/w5.html#section-3",
    "href": "materials/slides/w5.html#section-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "focus <- rep(c(\"internal\", \"external\"), each = 2)\nmedia <- rep(c(\"text\", \"visual\"), times = 2)\nmean_TI <- 43\nmean_VI <- 40\nmean_TE <- 45\nmean_VE <- 47\n\npd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)\n\ninteraction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design-4",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design-4",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nin the new example there is a difference between the two media groups on average but it is only .50 points, so arguably it is small enough to represent the assumption of “no” effect, as in real-life “no” effect in terms of a difference being actually 0 is rather rare\ncome up with some reasonable standard-deviation; if we start at 50 and we want most people to be < 80, we can set the 2-SD bound at 80 to get a standard-deviation of 15 (80-50)/2.\nlet’s assume that each of our groups has a standard-deviation of 15 points.\n\ngroup_TI = normal(n, 43, 15)\ngroup_VI = normal(n, 40, 15)\ngroup_TE = normal(n, 45, 15)\ngroup_VE = normal(n, 47, 15)"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design-5",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design-5",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nn <- 1e5\ngroup_TI <-  rnorm(n, 43, 15)\ngroup_VI <-  rnorm(n, 40, 15)\ngroup_TE <-  rnorm(n, 45, 15)\ngroup_VE <-  rnorm(n, 47, 15)\n\nparticipant <- c(1:(n*4))\nfocus <- rep(c(\"internal\", \"external\"), each = n*2)\nmedia <- rep(c(\"text\", \"visual\"), each = n, times = 2)\n\ndata <- data.frame(participant = participant, focus = focus, media = media, score = c(group_TI, group_VI, group_TE, group_VE))\n\nsummary(data)\n\n  participant       focus              media               score       \n Min.   :1e+00   Length:400000      Length:400000      Min.   :-36.68  \n 1st Qu.:1e+05   Class :character   Class :character   1st Qu.: 33.48  \n Median :2e+05   Mode  :character   Mode  :character   Median : 43.73  \n Mean   :2e+05                                         Mean   : 43.74  \n 3rd Qu.:3e+05                                         3rd Qu.: 54.00  \n Max.   :4e+05                                         Max.   :117.63"
  },
  {
    "objectID": "materials/slides/w5.html#ready-for-power-analysis",
    "href": "materials/slides/w5.html#ready-for-power-analysis",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Ready for power-analysis",
    "text": "Ready for power-analysis\n\nSome additional assumptions: suppose we have enought funding for a sizeable data collection and the aim is to ensure that we do not draw unwarranted conclusions from the research\nWe should then set the alpha-level at a more conservative value (\\(\\alpha = .001\\)); with this, we expect to draw non-realistic conclusions in the interaction effect in only about 1 in every 1,000 experiments\nWe also want to be sure that we do detect an existing effect and keep our power high at 95%; with this, we expect that if there is an interaction effect, we would detect it in 19 out of 20 cases (only miss it in 1 out of 20, or 5%)\nRunning the power-simulation can be very memory-demanding and the code can run a very long time to complete; it’s advised to start from various “low-resolution” sample-sizes (e.g. n = 10, n = 100, n = 200, etc.) to get a rough idea of where we can expect our loop to end. Then, the search can be made more specific in order to identify a more precise sample size."
  },
  {
    "objectID": "materials/slides/w5.html#ready-for-power-analysis-1",
    "href": "materials/slides/w5.html#ready-for-power-analysis-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Ready for power-analysis",
    "text": "Ready for power-analysis\n\nset.seed(1)\nn_sims <- 1000 # we want 1000 simulations\np_vals <- c()\npower_at_n <- c(0) # this vector will contain the power for each sample-size (it needs the initial 0 for the while-loop to work)\nn <- 100 # sample-size and start at 100 as we can be pretty sure this will not suffice for such a small effect\nn_increase <- 100 # by which stepsize should n be increased\ni <- 2\n\npower_crit <- .95\nalpha <- .001\n\nwhile(power_at_n[i-1] < power_crit){\n  for(sim in 1:n_sims){\n    group_TI <-  rnorm(n, 43, 15)\n    group_VI <-  rnorm(n, 40, 15)\n    group_TE <-  rnorm(n, 45, 15)\n    group_VE <-  rnorm(n, 47, 15)\n    \n    participant <- c(1:(n*4))\n    focus <- rep(c(\"internal\", \"external\"), each = n*2)\n    media <- rep(c(\"text\", \"visual\"), each = n, times = 2)\n    \n    data <- data.frame(participant = participant, focus = focus, media = media, score = c(group_TI, group_VI, group_TE, group_VE))\n    data$media_sum_num <- ifelse(data$media == \"text\", 1, -1) # apply sum-to-zero coding\n    data$focus_sum_num <- ifelse(data$focus == \"external\", 1, -1) \n    lm_int <- lm(score ~ 1 + focus_sum_num + media_sum_num + focus_sum_num:media_sum_num, data = data) # fit the model with the interaction\n    lm_null <- lm(score ~ 1 + focus_sum_num + media_sum_num, data = data) # fit the model without the interaction\n    p_vals[sim] <- anova(lm_int, lm_null)$`Pr(>F)`[2] # put the p-values in a list\n  }\n    print(n)\n    power_at_n[i] <- mean(p_vals < alpha) # check power (i.e. proportion of p-values that are smaller than alpha-level of .10)\n    names(power_at_n)[i] <- n\n    n <- n+n_increase # increase sample-size by 100 for low-resolution testing first\n    i <- i+1 # increase index of the while-loop by 1 to save power and cohens d to vector\n}\n\n[1] 100\n[1] 200\n[1] 300\n[1] 400\n[1] 500\n[1] 600\n[1] 700\n[1] 800\n[1] 900\n\npower_at_n <- power_at_n[-1] # delete first 0 from the vector"
  },
  {
    "objectID": "materials/slides/w5.html#example-simulating-a-regression-design-6",
    "href": "materials/slides/w5.html#example-simulating-a-regression-design-6",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\nWe can plot the results form the power-simulation:\n\n\nAt roughly 900 participants we observe sufficient power"
  },
  {
    "objectID": "materials/slides/w6.html#the-uses-of-simulation-methods",
    "href": "materials/slides/w6.html#the-uses-of-simulation-methods",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "The uses of simulation methods",
    "text": "The uses of simulation methods\n\nSimulating data uses generating random data sets with known properties using code (or some other method). This can be useful in various contexts.\n\n\nTo better understand our models. Probability models mimic variation in the world, and the tools of simulation can help us better understand this variation. Patterns of randomness are contrary to normal human thinking and simulation helps in training our intuitions about averages and variation\n\nTo run statistical analyses (e.g., simulating a null distribution against which to compare a sample)\n\nTo approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures"
  },
  {
    "objectID": "materials/slides/w6.html#distribution-functions",
    "href": "materials/slides/w6.html#distribution-functions",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Distribution functions",
    "text": "Distribution functions\n\nBase R functions:\n\nrnorm(): sampling from a normal distribution\nrunif(): sampling from a uniform distribution\nrbinom(): sampling from a binomial distribution\nrpois(): sampling from a Poisson distribution\n\n(Other distributions are also available)\n\nsample(): sampling elements from an R object with or without replacement\nreplicate(): often plays a role in conjunction with sampling functions; it is used to evaluate an expression N number of times repeatedly\n\nFrom non-base packages:\n\n\nMASS::mvtnorm(): multivariate normal; sampling multiple variables with a known correlation structure (i.e., we can tell R how variables should be correlated with one another) and normally distributed errors"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-uniform-distribution",
    "href": "materials/slides/w6.html#sampling-from-a-uniform-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a uniform distribution",
    "text": "Sampling from a uniform distribution\nThe runif function returns some number (n) of random numbers from a uniform distribution with a range from \\(a\\) (min) to \\(b\\) (max) such that \\(X\\sim\\mathcal U(a,b)\\) (verbally, \\(X\\) is sampled from a uniform distribution with the parameters \\(a\\) and \\(b\\)), where \\(-\\infty < a < b < \\infty\\) (verbally, \\(a\\) is greater than negative infinity but less than \\(b\\), and \\(b\\) is finite). The default is to draw from a standard uniform distribution (i.e., \\(a = 0\\) and \\(b = 1\\)):\n\n\n# Sample a vector of ten numbers and store the results in the object `rand_unifs`\n# Note that the numbers will be different each time we re-run the `runif` function above.\n# If we want to recreate the same sample, we should set a `seed` number first\n\nrand_unifs <- runif(n = 10000, min = 0, max = 1);\n\n\n\nThe first 40 numbers from the sample are:\n\n\n [1] 0.99434864 0.96431541 0.30580586 0.33276507 0.84967627 0.81678374\n [7] 0.51459419 0.10484424 0.91966070 0.26868353 0.83214199 0.87764814\n[13] 0.34502670 0.10482143 0.62726192 0.78069416 0.28723441 0.99650014\n[19] 0.06301852 0.52260594 0.45916681 0.03622946 0.72827098 0.40192253\n[25] 0.77440006 0.15546010 0.35228083 0.07063814 0.56907472 0.29733538\n[31] 0.68845754 0.43638929 0.45369228 0.62800198 0.35717584 0.48973529\n[37] 0.68858431 0.76422131 0.94052166 0.86891697 0.10667781 0.67989207\n[43] 0.41068036 0.21645607 0.21990561 0.29829047 0.48076992 0.92049340\n[49] 0.55169980 0.02008806"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-uniform-distribution-1",
    "href": "materials/slides/w6.html#sampling-from-a-uniform-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a uniform distribution",
    "text": "Sampling from a uniform distribution\nTo visualise the entire sample, we can plot it on a histogram:"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-normal-distribution",
    "href": "materials/slides/w6.html#sampling-from-a-normal-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a normal distribution",
    "text": "Sampling from a normal distribution\nThe rnorm function returns some number (n) of randomly generated values given a set mean (\\(\\mu\\); mean) and standard deviation (\\(\\sigma\\); sd), such that \\(X\\sim\\mathcal N(\\mu,\\sigma^2)\\). The default is to draw from a standard normal (a.k.a., “Gaussian”) distribution (i.e., \\(\\mu = 0\\) and \\(\\sigma = 1\\)):\n\n\nrand_norms_10000 <- rnorm(n = 10000, mean = 0, sd = 1)\n\nprint(rand_norms_10000[1:20])\n\n\n\n [1] -1.4294749  0.7034161  0.2124047 -0.7159934  1.9414967  2.2186264\n [7]  0.9284274  0.5083624  0.4948380 -0.9719948 -0.4409321  1.5844225\n[13] -0.3116727  1.3008722 -2.1558888 -0.5306407 -0.5345091  1.5416112\n[19]  1.5387759 -1.0633193"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-normal-distribution-1",
    "href": "materials/slides/w6.html#sampling-from-a-normal-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a normal distribution",
    "text": "Sampling from a normal distribution\n\nHistograms allow us to check how samples from the same distribution might vary.\nExercise: Compare the above distribution with a normal distribution that had a standard deviation of 2 instead of 1.\nSample 10,000 new values in rnorm with sd = 2 instead of sd = 1 and create a new histogram with hist.\nTo see what the distribution of sampled data might look like given a low sample size (e.g., 10), repeat the process of sampling from rnorm(n = 10, mean = 0, sd = 1) multiple times and look at the shape of the resulting histogram."
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-poisson-distribution",
    "href": "materials/slides/w6.html#sampling-from-a-poisson-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a Poisson distribution",
    "text": "Sampling from a Poisson distribution\nA Poisson process describes events happening with some given probability over an area of time or space such that \\(X\\sim Poisson(\\lambda)\\), where the rate parameter \\(\\lambda\\) is both the mean and variance of the Poisson distribution (note that by definition, \\(\\lambda > 0\\), and although \\(\\lambda\\) can be any positive real number, data are always integers, as with count data).\n\nSampling from a Poisson distribution can be done in R with rpois, which takes only two arguments specifying the number of values to be returned (n) and the rate parameter (lambda). There are no default values for rpois.\n\n\nrand_poissons <- rpois(n = 10, lambda = 1.5)\n\nprint(rand_poissons)\n\n\n\n [1] 1 0 2 1 0 2 1 0 0 1"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-poisson-distribution-1",
    "href": "materials/slides/w6.html#sampling-from-a-poisson-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a Poisson distribution",
    "text": "Sampling from a Poisson distribution\nA histogram of a large number of values to see the distribution when \\(\\lambda = 4.5\\):\n\nrand_poissons_10000 <- rpois(n = 10000, lambda = 4.5)"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-binomial-distribution",
    "href": "materials/slides/w6.html#sampling-from-a-binomial-distribution",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\nA binomial distribution describes the number of ‘successes’ for some number of independent trials (\\(\\Pr(success) = p\\)).\nThe rbinom function returns the number of successes after size trials, in which the probability of success in each trial is prob.\nSampling from a binomial distribution in R with rbinom is a bit more complex than using runif, rnorm, or rpois.\nLike those previous functions, the rbinom function returns some number (n) of random numbers, but the arguments and output can be slightly confusing at first."
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-binomial-distribution-1",
    "href": "materials/slides/w6.html#sampling-from-a-binomial-distribution-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\nFor example, suppose we want to simulate the flipping of a fair coin 1000 times, and we want to know how many times that coin comes up heads (‘success’). We can do this with the following code:\n\n\ncoin_flips <- rbinom(n = 1, size = 1000, prob = 0.5)\n\ncoin_flips\n\n[1] 543\n\n\n\n\nThe above result shows that the coin came up heads 543 times. But note the (required) argument n. This allows us to set the number of sequences to run.\nIf we instead set n = 2, then this could simulate the flipping of a fair coin 1000 times once to see how many times heads comes up, then repeating the whole process a second time to see how many times heads comes up again (or, if it is more intuitive, the flipping of two separate fair coins 1000 times at the same time)."
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-binomial-distribution-2",
    "href": "materials/slides/w6.html#sampling-from-a-binomial-distribution-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\ncoin_flips_2 <- rbinom(n = 2, size = 1000, prob = 0.5)\n\ncoin_flips_2\n\n[1] 466 501\n\n\n\nA coin was flipped 1000 times and returned 466 heads, and then another fair coin was flipped 1000 times and returned 501 heads.\n\n\n\nAs with the rnorm and runif functions, we can check to see what the distribution of the binomial function looks like if we repeat this process."
  },
  {
    "objectID": "materials/slides/w6.html#sampling-from-a-binomial-distribution-3",
    "href": "materials/slides/w6.html#sampling-from-a-binomial-distribution-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling from a binomial distribution",
    "text": "Sampling from a binomial distribution\n\nSuppose that we want to see the distribution of the number of times heads comes up after 1000 flips. We can simulate the process of flipping 1000 times in a row with 10000 different coins:\n\n\ncoin_flips_10000 <- rbinom(n = 10000, size = 1000, prob = 0.5)"
  },
  {
    "objectID": "materials/slides/w6.html#random-sampling-using-sample",
    "href": "materials/slides/w6.html#random-sampling-using-sample",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Random sampling using sample\n",
    "text": "Random sampling using sample\n\n\nSometimes it is useful to sample a set of values from a vector or list. The R function sample is very flexible for sampling a subset of numbers or elements from some structure (x) in R according to some set probabilities (prob).\nElements can be sampled from x some number of times (size) with or without replacement (replace), though an error will be returned if the size of the sample is larger than x but replace = FALSE (default).\nSuppose we want to ask R to pick a random number from one to ten with equal probability:\n\n\nrand_number_1 <- sample(x = 1:10, size = 1)\n\nprint(rand_number_1)\n\n[1] 6"
  },
  {
    "objectID": "materials/slides/w6.html#random-sampling-using-sample-1",
    "href": "materials/slides/w6.html#random-sampling-using-sample-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Random sampling using sample\n",
    "text": "Random sampling using sample\n\n\nWe can increase the size of the sample to 10:\n\n\nrand_number_10 <- sample(x = 1:10, size = 10)\nprint(rand_number_10)\n\n [1]  5  4  2  7  8  9  6  1 10  3\n\n\n\nNote that all numbers from 1 to 10 have been sampled, but in a random order. This is because the default is to sample without replacement, meaning that once a number has been sampled for the first element in rand_number_10, it is no longer available to be sampled again.\n\n\n\nWe can change this and allow for sampling with replacement:\n\n\nrand_number_10_r <- sample(x = 1:10, size = 10, replace = TRUE)\n\nprint(rand_number_10_r)\n\n [1] 10  8  4  3  7  8  8  1  1  7\n\n\n\nNote that the numbers {1, 7, 8} are now repeated in the set of randomly sampled values above."
  },
  {
    "objectID": "materials/slides/w6.html#random-sampling-using-sample-2",
    "href": "materials/slides/w6.html#random-sampling-using-sample-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Random sampling using sample\n",
    "text": "Random sampling using sample\n\n\nSo far, because we have not specified a probability vector prob, the function assumes that every element in 1:10 is sampled with equal probability\nHere’s an example in which the numbers 1-5 are sampled with a probability of 0.05, while the numbers 6-10 are sampled with a probability of 0.15, thereby biasing sampling toward larger numbers; we always need to ensure that these probabilities need to sum to 1.\n\n\nprob_vec      <- c( rep(x = 0.05, times = 5), rep(x = 0.15, times = 5))\n\nrand_num_bias <- sample(x = 1:10, size = 10, replace = TRUE, prob = prob_vec)\n\nprint(rand_num_bias)\n\n [1]  7  5  4  6  4 10  8  7  8  9"
  },
  {
    "objectID": "materials/slides/w6.html#sampling-random-characters-from-a-list",
    "href": "materials/slides/w6.html#sampling-random-characters-from-a-list",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Sampling random characters from a list",
    "text": "Sampling random characters from a list\n\nWe can also sample characters from a list of elements; it is no different than sampling numbers\nFor example, if we want to create a simulated data set that includes three different species of some plant or animal, we could create a vector of species identities from which to sample:\n\n\nspecies <- c(\"species_A\", \"species_B\", \"species_C\");\n\n\nWe can then sample from these three possible categories. For example:\n\n\nsp_sample <- sample(x = species, size = 24, replace = TRUE, \n                    prob = c(0.5, 0.25, 0.25))\n\n\nWhat did the code above do?\n\n\n\n\n [1] \"species_B\" \"species_C\" \"species_B\" \"species_B\" \"species_C\" \"species_B\"\n [7] \"species_A\" \"species_B\" \"species_C\" \"species_A\" \"species_B\" \"species_A\"\n[13] \"species_B\" \"species_A\" \"species_A\" \"species_B\" \"species_A\" \"species_A\"\n[19] \"species_A\" \"species_B\" \"species_A\" \"species_C\" \"species_C\" \"species_A\""
  },
  {
    "objectID": "materials/slides/w6.html#simulating-data-with-known-correlations",
    "href": "materials/slides/w6.html#simulating-data-with-known-correlations",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Simulating data with known correlations",
    "text": "Simulating data with known correlations\n\nWe can generate variables \\(X_{1}\\) and \\(X_{2}\\) that have known correlations \\(\\rho\\) with with one another.\nFor example: two standard normal random variables with a sample size of 10000, and with correlation between them of 0.3:\n\n\nN   <- 10000\nrho <- 0.3\nx1  <- rnorm(n = N, mean = 0, sd = 1)\nx2  <- (rho * x1) + sqrt(1 - rho*rho) * rnorm(n = N, mean = 0, sd = 1)\n\n\nThese variables are generated by first simulating the sample \\(x_{1}\\) (x1 above) from a standard normal distribution. Then, \\(x_{2}\\) (x2 above) is calculated as\n\n\\(x_{2} = \\rho x_{1} + \\sqrt{1 - \\rho^{2}}x_{rand}\\),\nwhere \\(x_{rand}\\) is a sample from a normal distribution with the same variance as \\(x_{1}\\)."
  },
  {
    "objectID": "materials/slides/w6.html#simulating-data-with-known-correlations-1",
    "href": "materials/slides/w6.html#simulating-data-with-known-correlations-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Simulating data with known correlations",
    "text": "Simulating data with known correlations\n\nWe can generate variables \\(X_{1}\\) and \\(X_{2}\\) that have known correlations \\(\\rho\\) with with one another.\nFor example: two standard normal random variables with a sample size of 10000, and with correlation between them of 0.3:\n\n\nN   <- 10000\nrho <- 0.3\nx1  <- rnorm(n = N, mean = 0, sd = 1)\nx2  <- (rho * x1) + sqrt(1 - rho*rho) * rnorm(n = N, mean = 0, sd = 1)\n\n\nDoes the correlation equal rho (with some sampling error)?\n\n\ncor(x1, x2)\n\n[1] 0.3126728"
  },
  {
    "objectID": "materials/slides/w6.html#simulating-data-with-known-correlations-2",
    "href": "materials/slides/w6.html#simulating-data-with-known-correlations-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Simulating data with known correlations",
    "text": "Simulating data with known correlations\n\nThere is a more efficient way to generate any number of variables with different variances and correlations to one another.\nWe need to use the MASS library, which can be installed and loaded as below:\n\n\ninstall.packages(\"MASS\")\nlibrary(\"MASS\")\n\n\n\n\n\nIn the MASS library, the function mvrnorm can be used to generate any number of variables for a pre-specified covariance structure."
  },
  {
    "objectID": "materials/slides/w6.html#statistical-power",
    "href": "materials/slides/w6.html#statistical-power",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Statistical power",
    "text": "Statistical power\n\n\nStatistical power is defined as the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level (typically a p-value below 0.05), given some assumed true effect size\nIf a certain effect of interest exists (e.g. a difference between two groups) power is the chance that we actually find the effect in a given study\nA power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold\nThe conventional view is that you should avoid low-power studies because they are unlikely to succeed\nThere are several problems with this view, but it’s often required by research funding bodies"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nWe can use simulation to test rather complex study designs\nImagine you are interested in students attitude towards smoking and how it depends on the medium of the message and the focus of the message\nWe want to know whether people’s attitude is different after seeing a visual anti-smoking message (these pictures on the package) vs a text-message (the text belonging to that picture)\nWe are interested in whether the attitude that people report is different after seeing a message that regards the consequences on other people (e.g. smoking can harm your loved ones) as compared to yourself (smoking can cause cancer)"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design-1",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\nStudy design:\nDV: attitude towards smoking (0-100) IV1: medium (text vs. visual) IV2: focus (internal vs. external)\nThis is, there are 4 groups:\n\ngroup_TI will receive text-messages that are internal\ngroup_TE will receive text-messages that are external\ngroup_VI will receive visual messages that are internal\ngroup_VE will receive visual messages that are external"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design-2",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nassume that we expect that people’s attitude will be more negative after seeing a visual rather than text message if the focus is internal (i.e. the message is about yourself) because it might be difficult to imagine that oneself would get cancer after reading a text but seeing a picture might cause fear regardless\nfor the external focus on the other hand, we expect a more negative attitude after reading a text as compared to seeing a picture, as it might have more impact on attitude to imagine a loved one get hurt than seeing a stranger in a picture suffering from the consequences of second-hand smoking\nwe expect that the internal focus messages will be related to lower attitudes compared to the external focus messages on average but we expect no main-effect of picture vs. text-messages"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design-3",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nvisualize some rough means that show the desired behavior that we described in words earlier and see where we are going\nwe could make the overall mean of the internal focus groups (group_TI and group_VI) 20 and the mean of the external groups (group_TE and group_VE) 50 (this would already reflect the main-effect but also a belief that the smoking-attitudes are on average quite negative as we assume both means to be on the low end of the scale)\nassume that the mean of group_TI is 30 while the mean of group_VI is 10 and we could assume that the mean of group_TE is 40 and the mean of group_VE is 60"
  },
  {
    "objectID": "materials/slides/w6.html#section-2",
    "href": "materials/slides/w6.html#section-2",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "focus <- rep(c(\"internal\", \"external\"), each = 2)\nmedia <- rep(c(\"text\", \"visual\"), times = 2)\nmean_TI <- 50\nmean_VI <- 20\nmean_TE <- 30\nmean_VE <- 60\n\npd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)\n\ninteraction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))"
  },
  {
    "objectID": "materials/slides/w6.html#section-3",
    "href": "materials/slides/w6.html#section-3",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "focus <- rep(c(\"internal\", \"external\"), each = 2)\nmedia <- rep(c(\"text\", \"visual\"), times = 2)\nmean_TI <- 43\nmean_VI <- 40\nmean_TE <- 45\nmean_VE <- 47\n\npd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)\n\ninteraction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design-4",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design-4",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nin the new example there is a difference between the two media groups on average but it is only .50 points, so arguably it is small enough to represent the assumption of “no” effect, as in real-life “no” effect in terms of a difference being actually 0 is rather rare\ncome up with some reasonable standard-deviation; if we start at 50 and we want most people to be < 80, we can set the 2-SD bound at 80 to get a standard-deviation of 15 (80-50)/2.\nlet’s assume that each of our groups has a standard-deviation of 15 points.\n\ngroup_TI = normal(n, 43, 15)\ngroup_VI = normal(n, 40, 15)\ngroup_TE = normal(n, 45, 15)\ngroup_VE = normal(n, 47, 15)"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design-5",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design-5",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\n\nn <- 1e5\ngroup_TI <-  rnorm(n, 43, 15)\ngroup_VI <-  rnorm(n, 40, 15)\ngroup_TE <-  rnorm(n, 45, 15)\ngroup_VE <-  rnorm(n, 47, 15)\n\nparticipant <- c(1:(n*4))\nfocus <- rep(c(\"internal\", \"external\"), each = n*2)\nmedia <- rep(c(\"text\", \"visual\"), each = n, times = 2)\n\ndata <- data.frame(participant = participant, focus = focus, media = media, score = c(group_TI, group_VI, group_TE, group_VE))\n\nsummary(data)\n\n  participant       focus              media               score       \n Min.   :1e+00   Length:400000      Length:400000      Min.   :-27.34  \n 1st Qu.:1e+05   Class :character   Class :character   1st Qu.: 33.48  \n Median :2e+05   Mode  :character   Mode  :character   Median : 43.77  \n Mean   :2e+05                                         Mean   : 43.75  \n 3rd Qu.:3e+05                                         3rd Qu.: 54.01  \n Max.   :4e+05                                         Max.   :117.23"
  },
  {
    "objectID": "materials/slides/w6.html#ready-for-power-analysis",
    "href": "materials/slides/w6.html#ready-for-power-analysis",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Ready for power-analysis",
    "text": "Ready for power-analysis\n\nSome additional assumptions: suppose we have enought funding for a sizeable data collection and the aim is to ensure that we do not draw unwarranted conclusions from the research\nWe should then set the alpha-level at a more conservative value (\\(\\alpha = .001\\)); with this, we expect to draw non-realistic conclusions in the interaction effect in only about 1 in every 1,000 experiments\nWe also want to be sure that we do detect an existing effect and keep our power high at 95%; with this, we expect that if there is an interaction effect, we would detect it in 19 out of 20 cases (only miss it in 1 out of 20, or 5%)\nRunning the power-simulation can be very memory-demanding and the code can run a very long time to complete; it’s advised to start from various “low-resolution” sample-sizes (e.g. n = 10, n = 100, n = 200, etc.) to get a rough idea of where we can expect our loop to end. Then, the search can be made more specific in order to identify a more precise sample size."
  },
  {
    "objectID": "materials/slides/w6.html#ready-for-power-analysis-1",
    "href": "materials/slides/w6.html#ready-for-power-analysis-1",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Ready for power-analysis",
    "text": "Ready for power-analysis\n\nset.seed(1)\nn_sims <- 1000 # we want 1000 simulations\np_vals <- c()\npower_at_n <- c(0) # this vector will contain the power for each sample-size (it needs the initial 0 for the while-loop to work)\nn <- 100 # sample-size and start at 100 as we can be pretty sure this will not suffice for such a small effect\nn_increase <- 100 # by which stepsize should n be increased\ni <- 2\n\npower_crit <- .95\nalpha <- .001\n\nwhile(power_at_n[i-1] < power_crit){\n  for(sim in 1:n_sims){\n    group_TI <-  rnorm(n, 43, 15)\n    group_VI <-  rnorm(n, 40, 15)\n    group_TE <-  rnorm(n, 45, 15)\n    group_VE <-  rnorm(n, 47, 15)\n    \n    participant <- c(1:(n*4))\n    focus <- rep(c(\"internal\", \"external\"), each = n*2)\n    media <- rep(c(\"text\", \"visual\"), each = n, times = 2)\n    \n    data <- data.frame(participant = participant, focus = focus, media = media, score = c(group_TI, group_VI, group_TE, group_VE))\n    data$media_sum_num <- ifelse(data$media == \"text\", 1, -1) # apply sum-to-zero coding\n    data$focus_sum_num <- ifelse(data$focus == \"external\", 1, -1) \n    lm_int <- lm(score ~ 1 + focus_sum_num + media_sum_num + focus_sum_num:media_sum_num, data = data) # fit the model with the interaction\n    lm_null <- lm(score ~ 1 + focus_sum_num + media_sum_num, data = data) # fit the model without the interaction\n    p_vals[sim] <- anova(lm_int, lm_null)$`Pr(>F)`[2] # put the p-values in a list\n  }\n    print(n)\n    power_at_n[i] <- mean(p_vals < alpha) # check power (i.e. proportion of p-values that are smaller than alpha-level of .10)\n    names(power_at_n)[i] <- n\n    n <- n+n_increase # increase sample-size by 100 for low-resolution testing first\n    i <- i+1 # increase index of the while-loop by 1 to save power and cohens d to vector\n}\n\n[1] 100\n[1] 200\n[1] 300\n[1] 400\n[1] 500\n[1] 600\n[1] 700\n[1] 800\n[1] 900\n\npower_at_n <- power_at_n[-1] # delete first 0 from the vector"
  },
  {
    "objectID": "materials/slides/w6.html#example-simulating-a-regression-design-6",
    "href": "materials/slides/w6.html#example-simulating-a-regression-design-6",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "Example: simulating a regression design",
    "text": "Example: simulating a regression design\nWe can plot the results form the power-simulation:\n\n\nAt roughly 900 participants we observe sufficient power"
  },
  {
    "objectID": "materials/worksheets/index.html",
    "href": "materials/worksheets/index.html",
    "title": "Worksheets",
    "section": "",
    "text": "References\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html",
    "href": "materials/worksheets/worksheets_w01.html",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "",
    "text": "This lab is an introduction to R and RStudio for the purposes of this module. It is expected that those new to R will complete the R for Social Scientists online training course on their own (estimated to take around 5-6 hours), as well as read through the assigned chapters from the R4DS textbook. The aims of this session are more limited than the contents of those resources, while at the same time offering something additional to those already familiar with basic operations in R.\nBy the end of the session, you will:\n\nunderstand how to use the most important panels in the RStudio interface\ncreate an RStudio Project to store your work throughout the course\nbegin using R scripts (.R) and Quarto notebooks (.qmd) to record and document your coding progress\nunderstand data types and basic operations in the R language\nunderstand the principles behind functions\n\nknow how to install, load and use functions from user-written packages\ngain familiarity with some useful functions from packages included in the tidyverse ecosystem\nThese few tasks should be enough to get you started with R and RStudio. If you haven’t yet done so, complete the R for Social Scientists online training too sometime over the next week. From next week we will begin working actively with real data and address specific data management challenges that arise from there.\nThose of you who have worked on the advanced user exercise can check some optional solutions below."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#r-and-rstudio",
    "href": "materials/worksheets/worksheets_w01.html#r-and-rstudio",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "R and RStudio",
    "text": "R and RStudio\nIf you are working on university desktops in the IT labs, recent versions of both R and RStudio will already be installed. To install them on your personal computers, follow the steps outlined here based on your operating system.\nAlthough you will likely only interact directly with RStudio, R needs to be installed first. Think of the relationship between the two as that between the engine of a car (R) and the dashboard of a car (RStudio); or, imagine driving this (R) versus this (RStudio).\nYour first task is to take RStudio for a spin and get to know some of its more commonly used panes. The four main panes are:\n\nThe R Console Pane\nThe R Console, by default the left or lower-left pane in R Studio, is the home of the R “engine”. This is where the commands are actually run and non-graphic outputs and error/warning messages appear. The Console is the direct interface to the R software itself; it’s what we get if instead of RStudio we open the R software: a direct interface to the R programming language, where we can type commands and where results/messages are printed.\nYou can directly enter and run commands in the R Console, but realize that these commands are not saved as they are when running commands from a script. For this reason, we should not use the Console pane directly too much. For typing commands that we want R to execute, we should instead use an R script file, where everything we type can be saved for later and complex analyses can be built up.\nThe Source Pane\nThis pane, by default in the upper-left, is a space to work with scripts and other text files. This pane is also where datasets (data frames) open up for viewing.\n\n\n\n\n\n\nNote\n\n\n\nNote\nIf your RStudio displays only one pane on the left, it is because you have no scripts open yet. We can open an existing one or create a new one. We’ ll do that a bit later.\n\n\nThe Environment Pane\nThis pane, by default in the upper-right, is most often used to see brief summaries of “objects” that are available in an active session. Datasets loaded for analysis would appear here\n\n\n\n\n\n\nNote\n\n\n\nNote\nIf your Environment is empty, it means that you don’t have any “objects” loaded or created yet. We will be creating some objects later and we will also import an example dataset.\n\n\nFiles, Plots, Packages, Help, etc. The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).\n\n\n\n\n\n\nTip\n\n\n\nTip\nYou can arrange the panes in various ways, depending on your preferences, using Tools > Global Options in the top menu. So the arrangement of panes may look different on different computers.\n\n\nGeneral settings\nYou can personalise the look and feel of your RStudio setup in various ways using Tools > Global Options from the top menu, but setting some options as default from the very start is highly recommended. You can see these in the pictures below:\n\n\n\n\n\n\n\n\n\n\n\nThe most important setting in the picture on the left is the one to restore .RData at startup and saving the workspace as .RData on exit. Make sure these are un-ticked and set to ‘Never’, respectively, as shown in the picture. It’s always safer to start each RStudio session in a clean state, without anything automatically pre-loaded from a previos session. That could lead to serious and hard to trace complications.\nIn the picture on the right, you have the option to select that the new native pipe operator (we’ll talk about it later!) be inserted using the Ctrl+Shift+M keyboard shortcut instead of the older version of the pipe (|>).\n\nThese settings will make more sense later, but it’s a good idea to have them sorted at the very beginning."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#task-1-use-r-as-a-simple-calculator",
    "href": "materials/worksheets/worksheets_w01.html#task-1-use-r-as-a-simple-calculator",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Task 1: Use R as a simple calculator",
    "text": "Task 1: Use R as a simple calculator\nThe most elementary yet still handy task you can use R for is to perform basic arithmetic operations. This is useful for getting a first experience doing things in the R language. Let’s have a look at a few operations using the Console directly. Let’s say we want to know the result of adding up three numbers: 1, 3 and 5. In the Console pane, type the command below and then click Enter:\n\n1 + 3 + 5\n\nThis will print out the result (9) in the Console:\n\n\n[1] 9\n\n\nThe [1] in the result is just the line number; in this case, our result only consists of a single line.\nWe can also save the result of this operation as an object, so we can use it for further operations. We create objects by using the so-called assignment operator consisting of the characters <-. A command involving <- can be read as “assign the value of the result from the operation on the right hand side (some expression) to the object on the left hand side (short name of object, single word, with no spaces)”. For example, let’s save our result in an object called “nine”:\n\nnine <- 1 + 3 + 5\n\nNotice that there is no output printed in the Console this time. But there are also no error messages, so the operation must have run without problems. Instead, if we look at the Environment pane, we notice that it is no longer empty, but contains an object called “nine” that stores the value “9” in it. We can now use this object for other operations, such as:\n\nnine - 3\n\n[1] 6\n\nnine + 15\n\n[1] 24\n\nnine / 3\n\n[1] 3\n\nnine * 9\n\n[1] 81\n\n\nWe see the results of these operations printed out in the Console.\nWe can also check results of so-called relational operations. There are several relational operators that allow us to compare objects in R. The most useful of these are the following:\n\n\n> greater than, >= greater than or equal to\n\n< less than, <= less than or equal to\n\n== equal to\n\n!= not equal to\n\nWhen we use these to compare two objects in R, we end us with a logical object.\nFor example, let’s check whether 9 is greater than 5, and whether it is lower than 8:\n\n9 > 5\n\n[1] TRUE\n\n9 < 8\n\n[1] FALSE\n\n\nR treats our inputs as statements that we are asking it to evaluate, and we get the answers “TRUE” and “FALSE”, respectively, as we would expect. Let’s now check whether our object “nine” is equal to the number 9. We may assume that we can achieve this by typing “nine = 9”, but let’s see what that results in:\n\nnine = 9\n\nDid we get the result we expected? Nothing was printed in the output, so seemingly nothing happened… That’s because the “=” sign is also used as an assignment operator in R, just like “<-”. So we basically assigned the value “9” to the object “nine” again. To use the equal sign as a logical operator we must type it twice (==). Let’s see:\n\nnine == 9\n\n[1] TRUE\n\n\nNow we get the answer “TRUE”, as expected.\nThis distinction between “=” and “==” is important to keep in mind. What would have happened if we had tried to test whether our object “nine” equals value “5” or not, and instead of the logical operator (==) we used the assignment operator (=)? Let’s see:\n\nnine = 5\n\nIn the Console we again see no results printed, but if we check our Environment, we see that the value of the object “nine” was changed to 5. So it can be a dangerous business. We’ll be using the “<-” as assignment operator instead of “=” to avoid any confusion in this respect. The distinction between == and = will also emerge in other contexts later.\nSo, try out the following commands in turn now and check if the results are what you’d expect:\n\nnine == 9\n\n[1] FALSE\n\nnine == 5\n\n[1] TRUE\n\nfive <- 9\nnine == five\n\n[1] FALSE\n\nfive = nine\nnine == five\n\n[1] TRUE\n\nnine + five <= 10 # lower than or equal to ...\n\n[1] TRUE\n\n\nThe text following the hashtag (#) in the last line is a comment. If you’d like to comment on any code you write just add a hash (#) or series of hashes in front of it so that R knows it should not evaluate it as a command. This will be useful when writing your commands in an R script that you can save for later, rather than interacting with R live in the Console."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#scripts-markdown-documents-and-projects",
    "href": "materials/worksheets/worksheets_w01.html#scripts-markdown-documents-and-projects",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Scripts, markdown documents and projects",
    "text": "Scripts, markdown documents and projects\nBefore learning to do more with R, let’s learn about some further file types and complete our RStudio setup. Writing brief commands that you want to test out in the Console is okay, but what you really want is to save your commands as part of a workflow in a dedicated file that you can reuse, extend and share with others. In every quantitative analysis, we need to ensure that each step in our analysis is traceable and reproducible. This is increasingly a professional standard expected of all data analysts in the social sciences. This means that we need to have an efficient way in which to share our analysis code, as well as our outputs and interpretations of our findings. RStudio has an efficient way of handling this requirement with the use of R script files and versions of the Markdown markup language that allow the efficient combining of plain text (as in the main body of an article) with analysis code and outputs produced in R. The table below lists the main characteristics of these file types:\n\n\nFormat\nExtension\nDescription\n\n\n\nR Script\n.R\nUse an R script if you want to document a large amount of code used for a particular analysis project. Scripts should contain working R commands and human-readable comments explaining the code. Commands can be run line-by-line, or the whole R script can be run at once. For example, one can write an R script containing a few hundred or thousands of lines of code that gathers and prepares raw, unruly data for analysis; if this script can run without any errors, then it can be saved and sourced from within another script that contains code that undertakes the analysis using the cleansed dataset. Comments can be added by appending them with a hashtag (#).\n\n\nR Markdown\n.Rmd\n\nMarkdown is a simple markup language that allows the formatting of plain text documents. R Markdown is a version of this language written by the R Studio team, which also allows for R code to be included. Plain text documents having the .Rmd extension and containing R Markdown-specific code can be “knitted” (exported) directly into published output document formats such as HTML, PDF or Microsoft Word, which contain both normal text as well as tables and charts produced with the embedded R code. The code itself can also be printed to the output documents.\n\n\nQuarto document\n.qmd\nQuarto is a newer version of R Markdown which allows better compatibility with other programming languages. It is a broader ecosystem design for academic publishing and communication (for example, the course website was built using quarto), but you will be using only Quarto documents in this module. There isn’t much difference between .Rmd and .qmd documents for their uses-cases on this module, so one could easily change and .Rmd extension to .qmd and still produce the same output. .qmd documents are “rendered” instead of “knitted”, but for RStudio users the underlying engine doing the conversion from Quarto/R Markdown to standard Markdown to output file (HTML, PDF, Word, etc.) is the same. Read more about Quarto document in the TSD textbook.\n\n\n\nCreating new files can be done easily via the options File > New File >  from the top RStudio menu.\nThe best way to use these files are as part of R project folders, which allow for cross-references to documents and datasets to be made relative to the path of the project folder root. This makes sure that no absolute paths to files (i.e. things like “C:/Documents/Chris/my_article/data_files/my_dataset.rds”) need to be used (instead, you would write something like “~/data_files/my_dataset.rds” if the “my_article” folder was set up as an R Project). This allows for the same code file to be run on another computer too without an error, ensuring a minimal expected level of reproducibility in your workflow.\nSetting up an existing or a new folder as an R Project involves having a file with the extension .RProj saved in it. This can be done easily via the options File > New Project from the top RStudio menu."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#task-2-set-up-a-new-r-project-with-an-.r-script-and-a-.qmd-document-included",
    "href": "materials/worksheets/worksheets_w01.html#task-2-set-up-a-new-r-project-with-an-.r-script-and-a-.qmd-document-included",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Task 2: Set up a new R Project, with an .R script and a .qmd document included:",
    "text": "Task 2: Set up a new R Project, with an .R script and a .qmd document included:\n\nCreate a new folder set up as an R project; call the folder “HSS8005_labs”; when done, you should have an empty folder with a file called “HSS8005_labs.Rproj” in it\nCreate a new R script (.R); once created, save it as “Lab_1.R” within the “HSS8005_labs” folder\nCreate a new Quarto document (.qmd); once created, save it as “Lab_1.qmd” within the “HSS8005_labs” folder\n\nYou will work in each of these new documents in this lab to gain experience with them."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#data-types-and-structures",
    "href": "materials/worksheets/worksheets_w01.html#data-types-and-structures",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Data types and structures",
    "text": "Data types and structures\nThe basic elements of data in R are called vectors. The objects that we have in the Environment, the ones we created in Task 1 are simple numeric vectors of length 1. R has 6 basic data types that you should be aware of:\n\ncharacter: a text string, e.g. “name”\nnumeric: a real or decimal number\ninteger: non-decimal number; often represented by a number followed by the letter “L”, e.g. 5L\nlogical: TRUE or FALSE\ncomplex: complex numbers with real and imaginary parts\n\nR provides several functions to examine features of vectors and other objects, for example:\n\nclass() - what kind of object is it (high-level)?\ntypeof() - what is the object’s data type (low-level)?\nlength() - how long is it? What about two dimensional objects?\nattributes() - does it have any metadata?"
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#task-3-vector-operations-in-the-r-script",
    "href": "materials/worksheets/worksheets_w01.html#task-3-vector-operations-in-the-r-script",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Task 3: Vector operations in the R script",
    "text": "Task 3: Vector operations in the R script\nLet’s learn a few vector operations. Type/copy the code below to the R script file we created earlier, and save it at the end for your records.\nFirst, let’s use the c() function to concatenate vector elements:\n\nx <- c(2.2, 6.2, 1.2, 5.5, 20.1)\n\nTo run this line of code in an R script, place the cursor on the line you want to execute and either click on the small “Run” tab in the upper-right corner of the script’s task bar, or click Ctrl+Enter (on Windows PCs).\nThe vector called x that we just created appears in the Environment. We can examine some of its features:\n\nclass(x)\n\n[1] \"numeric\"\n\ntypeof(x)\n\n[1] \"double\"\n\nlength(x)\n\n[1] 5\n\nattributes(x)\n\nNULL\n\n\nThese tell us something about the characteristics of the object, but not much about its content (apart from the fact that it has a length of 5). Functions such as min, max, range, mean, median, sum or summary give us some summary statistics about the object:\n\nmin(x)\n\n[1] 1.2\n\nmax(x)\n\n[1] 20.1\n\nrange(x)\n\n[1]  1.2 20.1\n\nmean(x)\n\n[1] 7.04\n\nmedian(x)\n\n[1] 5.5\n\nsum(x)\n\n[1] 35.2\n\nsummary(x)\n\n   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n   1.20    2.20    5.50    7.04    6.20   20.10 \n\n\nThe seq() function lets us create a sequence from a starting point to an ending point. If you specify the by argument, you can skip values. For instance, if we wanted a vector of every 5th number between 0 and 100, we could write:\n\nnumbers <- seq(from = 0, to = 100, by = 5)\n\nTo print out the result in the console, we can simply type the name of the object:\n\nnumbers\n\n [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90\n[20]  95 100\n\n\nA shorthand version to get a sequence between two numbers counting by 1s is to use the : sign. For example, print out all the numbers between 200 and 250:\n\n200:250\n\n [1] 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218\n[20] 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237\n[39] 238 239 240 241 242 243 244 245 246 247 248 249 250\n\n\nTo access a single element of a vector by position in the vector, use the square brackets []:\n\nx[2]\n\n[1] 6.2\n\n\nIf you want to access more than one element of a vector, put a vector of the positions you want to access in the brackets:\n\nx[c(2, 5)]\n\n[1]  6.2 20.1\n\n\nIf you try to access an element past the length of the vector, it will return a missing value NA:\n\nx[10]\n\n[1] NA\n\n\nIf you accidentally subset a vector by NA (the missing value), you get the vector back with all its entries replaced by NA:\n\nx[NA]\n\n[1] NA NA NA NA NA\n\n\nLet’s say you want to modify one value in your vector. You can combine the square bracket subset [] with the assignment operator <- to replace a particular value:\n\nx\n\n[1]  2.2  6.2  1.2  5.5 20.1\n\nx[3] <- 50.3\nx\n\n[1]  2.2  6.2 50.3  5.5 20.1\n\n\nYou can replace multiple values at the same time by using a vector for subsetting:\n\nx\n\n[1]  2.2  6.2 50.3  5.5 20.1\n\nx[1:2] <- c(-1.3, 42)\nx\n\n[1] -1.3 42.0 50.3  5.5 20.1\n\n\nIf the replacement vector (the right-hand side) is shorter than what you are assigning to (the left-hand side), the values will “recycle” or repeat as necessary:\n\nx[1:2] <- 3.2\nx\n\n[1]  3.2  3.2 50.3  5.5 20.1\n\nx[1:4] <- c(1.2, 2.4)\nx\n\n[1]  1.2  2.4  1.2  2.4 20.1\n\n\nYou can also create a vector of characters (words, letters, punctuation, etc):\n\njedi <- c(\"Yoda\", \"Obi-Wan\", \"Luke\", \"Leia\", \"Rey\")\n\nNote for vectors, you cannot mix characters and numbers in the same vector. If you add a single character element, the whole vector gets converted.\n\n### output is numeric\nx\n\n[1]  1.2  2.4  1.2  2.4 20.1\n\n### output is now character\nc(x, \"hey\")\n\n[1] \"1.2\"  \"2.4\"  \"1.2\"  \"2.4\"  \"20.1\" \"hey\" \n\n\nLogical vectors are just vectors that only contain the special R values TRUE or FALSE.\n\nlogical <- c(TRUE, FALSE, TRUE, TRUE, FALSE)\nlogical\n\n[1]  TRUE FALSE  TRUE  TRUE FALSE\n\n\nYou could but never should shorten TRUE to T and FALSE to F. It’s easy for this shortening to go wrong so better just to spell out the full word. Also not that this is case-sensitive, and this will produce an error:\n\ntrue\n\nError in eval(expr, envir, enclos): object 'true' not found\n\nTrue\n\nError in eval(expr, envir, enclos): object 'True' not found\n\nfalse\n\nError in eval(expr, envir, enclos): object 'false' not found"
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#data-frames",
    "href": "materials/worksheets/worksheets_w01.html#data-frames",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Data frames",
    "text": "Data frames\nIt is useful to know about vectors, but we will use them primarily as part of larger data frames. Data frames are objects that contain several vectors of similar length. In a data frame each column is a variable and each row is a case. They look like spreadsheets containing data. There are several toy data frames built into R, and we can have a look at one to see how it looks like. For example, the cars data frame is built into R and so you can access it without loading any files. To get the dimensions, you can use dim(), nrow(), and ncol().\n\ndim(mtcars)\n\n[1] 32 11\n\nnrow(mtcars)\n\n[1] 32\n\nncol(mtcars)\n\n[1] 11\n\n\nWe can also load the dataset into our Environment and look at it manually:\n\nmtcars <- mtcars\n\nThe new object has appeared in the Environment under a new section called Data. We can click on it and the dataset will open up in the Source pane. What do you think this dataset is about?\nYou can select each column/variable from the data frame use the $, turning it into a vector:\n\nmtcars$wt\n\n [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070\n[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840\n[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780\n\n\nYou can now treat this just like a vector, with the subsets and all.\n\nmtcars$wt[1]\n\n[1] 2.62\n\n\nWe can subset to the first/last k rows of a data frame\n\nhead(mtcars)\n\n                   mpg cyl disp  hp drat    wt  qsec vs am gear carb\nMazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4\nMazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4\nDatsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1\nHornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1\nHornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2\nValiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1\n\ntail(mtcars)\n\n                mpg cyl  disp  hp drat    wt qsec vs am gear carb\nPorsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2\nLotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2\nFord Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4\nFerrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6\nMaserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8\nVolvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2\n\n\nThere are various ways in which one can further subset and wrangle vectors and data frames using base R functions, but the tidyverse and other user-written packages provide more functionality and ease of use. In this course, we will rely mostly on these."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#functions",
    "href": "materials/worksheets/worksheets_w01.html#functions",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Functions",
    "text": "Functions\nWe have already encountered some basic functions earlier. Most of the work in R is done using functions. It’s possible to create your own functions. This makes R extremely powerful and extendible. We’re not going to cover making your own functions in this course, but it’s important to be aware of this capability. There are plenty of good resources online for learning how to do this, including this one.\nAdvanced user exercise: leap year functions\nIf you have more advanced knowledge of R, here’s and exercise for you. Suppose you want to write a function that lists all the leap years between two specified years. How would you go about writing it? What are the information that you need first? What are the steps that you would take to build up the function? There are several ways of achieving such a function, and you can find three options at the bottom of this worksheet. Work individually or in a small group. Compare your results to the options given at the end."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#packages",
    "href": "materials/worksheets/worksheets_w01.html#packages",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Packages",
    "text": "Packages\nInstead of programming your own functions in the R language, you can rely on functions written by other people and bundled within a package that performs some set task. There are a large number of reliable, tested and oft-used packages containing functions that are particularly useful for social scientists.\nSome particularly useful packages: - the tidyverse bundle of packages, which includes the dplyr package (for data manipulation) and additional R packages for reading in (readr), transforming (tidyr) and visualizing (ggplot2) datasets. - to import datasets in non-native formats and to manage attached labels (a concept familiar from other statistical packages but foreign to R), load the sjlabelled package (an alternative to haven and labelled, which work in a similar way but provide less functionality) - the sjmisc package contains very useful functions for undertaking data transformations on labelled variables (recoding, grouping, missing values, etc); also has some useful tabulation functions - the sjPlot package contains functions for graphing and tabulating results from regression models\nPackages are often available from the Comprehensive R Archive Network (CRAN) or private repositories such as Bioconductor, GitHub etc. Packages made available on CRAN can be installed using the command install.packages(\"packagename\"). Once the package/library is installed (i.e. it is sitting somewhere on your computer), we then need to load it to the current R session using the command library(packagename).\nSo using a package/library is a two-stage process. We:\n\n\nInstall the package/library onto your computer (from the internet)\n\nLoad the package/library into your current session using the library command.\n\nLet’s start by installing the ‘tidyverse’ package, and then load it:\n\ninstall.packages(\"tidyverse\")  ## this command installs packages from CRAN; note the quotation marks around the package name\n\nYou can check the suite of packages that are loaded when you load the Tidyverse library using a command from the tidyverse itself:\n\ntidyverse_packages()\n\n\nQuestion\nWhy do you think we got an error message when we tried to run the above command?\n\nBecause tidyverse_packages() is itself a function from the tidyverse, in order to use that function we need not only to install the tidyverse but also to make its functions available. In other words, we did not yet load the tidyverse for use in our R session, we only just installed it on our computers.\nIf we don’t want to load a package that we have downloaded - because maybe we only want to use a single function once and we don’t want to burden our computer’s memory, we can state explicitly which package the function is from in the following way:\n\ntidyverse::tidyverse_packages()  # Here we state the package followed by two colons, then followed by the function we want\n\nBut in many cases we do want to use several functions at various points in an analysis session, so it is usually useful to load the entire package or set of packages:\n\nlibrary(tidyverse)\n\nNow we can use functions from that package without having to explicitly state the name of the package. We can still state the name explicitly, and that may be useful for readers of our code to understand what package a function come from. Also, it may happen that different packages have similarly named functions, and if all those packages are loaded, then the functions from a package loaded later will override that in the package loaded earlier. R will note in a comment whether any functions from a package are masked by another, so it’s worth paying attention to the comments and warnings printed by R when we load packages.\nThere are also convenience tools - e.g. the pacman package - that make it easier to load several packages at once, while at the same time downloading the package if it has not yet been downloaded on our computer.\nFor example, we can download a number of packages with the command below:\n\n# Install 'pacman' if not yet installed:\n\nif (!require(\"pacman\")) install.packages(\"pacman\") \n\n# Then load/install other packages using 'pacman':\n\npacman::p_load(\n  tidyverse,    # general data management tools ('dplyr', etc.)\n  sjlabelled,   # data import from other software (alternative to 'haven') and labels management\n  sjmisc        # data transformation on variables (recoding,grouping, missing values, etc)\n  )"
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#about-the-tidyverse",
    "href": "materials/worksheets/worksheets_w01.html#about-the-tidyverse",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "About the Tidyverse\n",
    "text": "About the Tidyverse\n\nData frames and ‘tibbles’\nThe Tidyverse is built around the basic concept that data in a table should have one observation per row, one variable per column, and only one value per cell. Once data is in this ‘tidy’ format, it can be transformed, visualized and modelled for an analysis.\nWhen using functions in the Tidyverse ecosystem, most data is returned as a tibble object. Tibbles are very similar to the data.frames (which are the basic types of object storing datasets in base R) and it is perfectly fine to use Tidyverse functions on a data.frame object. Just be aware that in most cases, the Tidyverse function will transform your data into a tibble. If you are unobservant, you won’t even notice a difference. However, there are a few differences between the two data types, most of which are just designed to make your life easier. For more info, check R4DS.\nSelected dplyr functions\nThe dplyr package is designed to make it easier to manipulate flat (2-D) data (i.e. the type of datasets we are most likely to use, which are laid out as in a standard spreadsheet, with rows referring to cases (observations; respondents) and columns referring to variables. dplyr provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code. Here are some of the most common functions in dplyr:\n\n\nfilter() chooses rows based on column values.\n\narrange() changes the order of the rows.\n\nselect() changes whether or not a column is included.\n\nrename() changes the name of columns.\n\nmutate()/transmute() changes the values of columns and creates new columns (variables)\n\nsummarise() compute statistical summaries (e.g., computing the mean or the sum)\n\ngroup_by() group data into rows with the same values\n\nungroup() remove grouping information from data frame.\n\ndistinct() remove duplicate rows.\n\nAll these functions work similarly as follows:\n\nThe first argument is a data frame/tibble\nThe subsequent arguments are comma separated list of unquoted variable names and the specification of what you want to do\nThe result is a new data frame\n\nFor more info, check R for Social Scientists\nThe forward-pipe (%>%/|>) workflow\nAll of the dplyr functions take a data frame or tibble as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the forward-pipe operator %>% from the magrittr package. This operator allows us to combine multiple operations into a single sequential chain of actions. As of R 4.1.0 there is also a native pipe operator in R (|>), and in RStudio one can set the shortcut to paste the new pipe operator instead (as we have done at the beginning of the lab). Going forward, we’ll use this version of the pipe operator for simplicity, but it’s likely that you will encounter the older version of the operator too in various scripts.\nLet’s start with a hypothetical example. Say you would like to perform a sequence of operations on data frame x using hypothetical functions f(), g(), and h():\n\nTake x then\n\nUse x as an input to a function f() then\n\nUse the output of f(x) as an input to a function g() then\n\nUse the output of g(f(x)) as an input to a function h()\n\nOne way to achieve this sequence of operations is by using nesting parentheses as follows:\nh(g(f(x)))\nThis code isn’t so hard to read since we are applying only three functions: f(), then g(), then h() and each of the functions is short in its name. Further, each of these functions also only has one argument. However, you can imagine that this will get progressively harder to read as the number of functions applied in your sequence increases and the arguments in each function increase as well. This is where the pipe operator |> comes in handy. |> takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read |> as “then” or “and then.” For example, you can obtain the same output as the hypothetical sequence of functions as follows:\nx |> \n  f() |> \n  g() |> \n  h()\nYou would read this sequence as:\n\nTake x then\n\nUse this output as the input to the next function f() then\n\nUse this output as the input to the next function g() then\n\nUse this output as the input to the next function h()\n\nSo while both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. Instead of typing out the three strange characters of the operator, one can use the keyboard shortcut Ctrl + Shift + M (Windows) or Cmd + Shift + M (MacOS) to paste the operator."
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#task-4-data-frame-operations-in-a-quarto-document",
    "href": "materials/worksheets/worksheets_w01.html#task-4-data-frame-operations-in-a-quarto-document",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Task 4: Data frame operations in a Quarto document",
    "text": "Task 4: Data frame operations in a Quarto document\nIn this task, let’s start using the other document we created, the .qmd file. This file format allows you to combine both longer written text (such as detailed descriptions of your data analysis process or the main body of a report or journal article) with code chunks. To get you started using this file format, read Chapter 3.2. in TSD. Below we will focus only on the code chunks.\nCompared to what you have done in the R script, in the main Quarto document a # refers to a heading level rather than a comment. If you want to include a code chunk, you can click on the +C tab in the upper-right corner of the .qmd document’s toolbar, or use the keyboard shortcut Ctrl+Alt+i. In the code chunk you would write in the same way as you did in the R script (they are basically mini-scripts). Within a code-chunk, therefore, the # still refers to a comment.\nTo execute a command withing a code chunk, you can either run each line/selection separately using Ctrl+Enter as in the R script, or you can run the entire content of the chunk with the green right-pointing triangle-arrow in the upper-right corner of the chunk.\nLet’s continue doing some operations on the mtcars dataset we looked at earlier, this time using some useful tidyverse functions.\nLet’s subset the data frame by selecting certain rows or columns. In tidyverse, you can do this with the filter() function for selecting rows and the select() function for selecting columns. Here we pipe the selections into head() to show the first few rows. You could also use the dplyr::slice_head function\n\nmtcars |>\n  select(mpg, wt) |>\n  head()\n\n                   mpg    wt\nMazda RX4         21.0 2.620\nMazda RX4 Wag     21.0 2.875\nDatsun 710        22.8 2.320\nHornet 4 Drive    21.4 3.215\nHornet Sportabout 18.7 3.440\nValiant           18.1 3.460\n\n\nTo select the cars with eight cylinders:\n\nmtcars |>\n  filter(cyl == 8)\n\n                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb\nHornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2\nDuster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4\nMerc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3\nMerc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3\nMerc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3\nCadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4\nLincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4\nChrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4\nDodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2\nAMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2\nCamaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4\nPontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2\nFord Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4\nMaserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8\n\n\nWe can use the slice() function. For example, to get the 5th through 10th rows:\n\nmtcars |>\n  slice(5:10)\n\n                   mpg cyl  disp  hp drat   wt  qsec vs am gear carb\nHornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2\nValiant           18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1\nDuster 360        14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4\nMerc 240D         24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2\nMerc 230          22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2\nMerc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4\n\n\nIf we pass a vector of integers to the select function, we will get the variables corresponding to those column positions. So to get the first through third columns:\n\nmtcars |>\n  select(1:3) |>\n  head()\n\n                   mpg cyl disp\nMazda RX4         21.0   6  160\nMazda RX4 Wag     21.0   6  160\nDatsun 710        22.8   4  108\nHornet 4 Drive    21.4   6  258\nHornet Sportabout 18.7   8  360\nValiant           18.1   6  225\n\n\nIf you call summary() a data frame, it produces applies the vector version of the summary command to each column:\n\nsummary(mtcars)\n\n      mpg             cyl             disp             hp       \n Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  \n 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  \n Median :19.20   Median :6.000   Median :196.3   Median :123.0  \n Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  \n 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  \n Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  \n      drat             wt             qsec             vs        \n Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  \n 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  \n Median :3.695   Median :3.325   Median :17.71   Median :0.0000  \n Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  \n 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  \n Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  \n       am              gear            carb      \n Min.   :0.0000   Min.   :3.000   Min.   :1.000  \n 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  \n Median :0.0000   Median :4.000   Median :2.000  \n Mean   :0.4062   Mean   :3.688   Mean   :2.812  \n 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  \n Max.   :1.0000   Max.   :5.000   Max.   :8.000"
  },
  {
    "objectID": "materials/worksheets/worksheets_w01.html#solutions-to-the-advanced-exercise-leap-year-functions",
    "href": "materials/worksheets/worksheets_w01.html#solutions-to-the-advanced-exercise-leap-year-functions",
    "title": "Week 1 Computer Lab Worksheet",
    "section": "Solutions to the advanced exercise: leap year functions",
    "text": "Solutions to the advanced exercise: leap year functions\n\nleap_year_v1 <- function(year1,year2) {\n    year <- year1:year2\n    year[(year%%4==0 & year%%100!=0) | year%%400==0]\n}\n\n\nleap_year_v2 <- function(year1,year2){\n    vector<-c()\n    for(year in year1:year2){\n        if((year %% 4 == 0) & (year %% 100 != 0) | (year  %% 400 == 0)){\n            vector<-c(vector,year)\n        }}\n    return(vector)}\n\n\nleap_year_v3 <- function(year1,year2){\n    #make a vector of all years\n    year<-year1:year2\n    #find the leap years (TRUE/FALSE)\n    leaps<-ifelse((year %% 4 == 0) & (year %% 100 != 0) | (year  %% 400 == 0), TRUE, FALSE)\n    year[leaps] #return the leap years\n}"
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#aims",
    "href": "materials/worksheets/worksheets_w02.html#aims",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "Aims",
    "text": "Aims\nThis session introduces simple and multiple linear regression models. You will be working with data from Österman (2021) to replicate parts their analysis. We will be covering only basic regression methods in this session, so the article will serve mainly as a broad background to the data here. We will be returning to this article in future weeks too, expanding our modelling strategy as we discover new methods. We will also practice some data management tasks and the basics of data visualisation using principles from ‘the grammar of graphics’ as implemented in the {ggplot2} package (see Kieran Healy’s Data Visualization: A practical introduction for an introduction with many practical examples).\nBy the end of the session, you will:\n\nlearn how to import data from foreign formats (e.g. SPSS, Stata, CSV)\nknow how to perform basic descriptive statistics on a dataset\nunderstand the basics of data visualisation\nknow how to fit linear regression models in R using different functions\nlearn a few options for presenting findings from regression models"
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#setup",
    "href": "materials/worksheets/worksheets_w02.html#setup",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "Setup",
    "text": "Setup\nIn Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.\nYou can create a new .R script and .qmd/.Rmd for this week’s work (e.g. “Lab_2”). Start working in the .R script initially, then switch to .qmd/.Rmd later in the session to report on your final analysis."
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#importing-data",
    "href": "materials/worksheets/worksheets_w02.html#importing-data",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "Importing data",
    "text": "Importing data\nAs we have seen in Week 1, small datasets that are included in R packages (including base R) for demonstration purposes can be used by simply invoking the name of the dataset. For example, the command head(mtcars) would print out the first 5 rows (cases) in the “mtcars” dataset included in base R (more specifically, in its “datasets” package):\n\nhead(mtcars)\n\n                   mpg cyl disp  hp drat    wt  qsec vs am gear carb\nMazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4\nMazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4\nDatsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1\nHornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1\nHornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2\nValiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1\n\n\n\n\n\n\n\n\nTip\n\n\n\nThe data() function lists all the built-in packages. You can further specify package = to list datasets included in a given package; e.g.:\n\ndata(package = \"dplyr\")\n\n# Note that {dplyr} is one of the data management packages included in the core {tidyverse}. Make sure the {tidyverse} is installed.\n\n\n\nWe can also import built-in datasets to our Environment in order to inspect them manually:\n\nmtcars_data <- mtcars\n\nIf we want to access a dataset from a non-base-R package, we need to ensure that the package is installed on our system and that we specify the name of the package; e.g.:\n\nhead(starwars) # gives an Error\n\n# A tibble: 6 × 14\n  name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵\n  <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  \n1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…\n2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…\n3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  \n4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…\n5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…\n6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…\n# … with 4 more variables: species <chr>, films <list>, vehicles <list>,\n#   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,\n#   ³​eye_color, ⁴​birth_year, ⁵​homeworld\n\nhead(dplyr::starwars)\n\n# A tibble: 6 × 14\n  name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵\n  <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  \n1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…\n2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…\n3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  \n4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…\n5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…\n6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…\n# … with 4 more variables: species <chr>, films <list>, vehicles <list>,\n#   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,\n#   ³​eye_color, ⁴​birth_year, ⁵​homeworld\n\n  # works, as long as {dplyr} or the whole {tidyverse} are installed\n\nReal-life datasets, however, need to be imported into R. Datasets come in various formats. R’s native data format has the extension .rds and can be imported with the readRDS() function. The counterpart function saveRDS() exports a dataset to the .rds format. The core-tidyverse {readr} package has similar functions (read_rds() / write_rds()).\nThe .rds format is useful because it can be compressed to various sizes to take up less space, but can only be directly opened in R. It is much more common to encounter data saved in a “delimited” text format, which can be easily opened in a spreadsheet viewer/editor such as Excel. This makes it very interchangeable and therefore very common. The most common is probably the “comma separated values” (.csv) format, which can be imported with the base-R function read.csv() or the tidyverse readr::read_csv() equivalent. Read Chapter 11 in R4DS for more on these functions.\nVery often, you will need to import data saved in the format of another proprietary statistical analysis package such as SPSS or Stata. Large survey projects usually make data available in these formats. The great advantage of these formats is that they can incorporate additional information about variables and the levels of categorical variables (e.g. value labels, specific values for different types of missing values). These additional information can be extremely valuable, but they are not handled straight-forwardly in text-based format, spreadsheets and R’s native data format. To make them operational in R, we need a few specially designed functions.\nThe {haven} package — part of the extended {tidyverse}, meaning that it is installed on your system as part of {tidyverse}, but the library(\"tidyverse\") command does not load it by default; it needs to be loaded explicitly — is one of the most commonly used for this purpose. Functions such as read_sas(), read_sav() and read_dta() import datasets specific to the SAS, SPSS and Stata programs, respectively.\nIt is highly recommended to read the documentation available for the {haven} package to understand how it operates. Fundamentally, it is designed to import data to a intermediary format which stores the additional labeling information in a special format that allows users to access them, but not making them easy to use directly. A suite of packages developed by Daniel Lüdecke from the University of Hamburg offer some additional functionality to work with labels directly when summarising and plotting data. These packages integrate well with the {tidyverse} and are actively maintained, and we will use them in this course to make our lives a bit easier. To install them, run:\n\n# We can install several packages at once by first creating a vector of their names\n\nsj_packages <- c(\"sjlabelled\", \"sjPlot\", \"sjstats\", \"ggeffects\", \"sjmisc\")\n\ninstall.packages(sj_packages)\n\nThe functions sjlabelled::read_sas(), sjlabelled::read_spss() and sjlabelled::read_stata() are the {sjlabelled} equivalents of the {haven} functions mentioned above. This vignette article included with the package explains the main differences between the two.\nAs a first step, let’s import the osterman dataset that underpins the Österman (2021) article (see the Data page on the course website for information about the datasets available in this course):\n\nosterman <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005-data/osterman.dta\")\n\n\n\n\n\nUsing functions learnt in Week 1, do the following:\n\ncheck the dimensions of the dataset; what does it tell you?\nprint a summary of the entire dataset; what have you learnt about the data?\n\n\nA very convenient way to create a codeplan for a dataset – especially if it has value-labelled categorical variables – is offered by the sjPlot::view_df() function, which produces a tables in HTML format that can be saved and consulted to get more information about the variables. With the default settings, we get the following:\n\nsjPlot::view_df(osterman)\n\nThe output opens up in the Viewer pane.\nThere are several additional options that can add useful complexity to the codeplan, as well as the option to restrict it to selected variables. It also works in a “piped” workflow, so it can be combined with {dplyr} verbs such as select to restrict variables beforehand in a more flexible way. Below we request a codeplan with extended options:\n\nsjPlot::view_df(osterman,\n                show.na = TRUE, \n                show.type = TRUE, \n                show.frq = TRUE, \n                show.prc = TRUE, \n                show.string.values = TRUE)\n\nThe output can also be opened in an external web browser window by clicking on the third (last) icon at the top of the Viewer toolbar. From the browser window, with a Ctrl + Right-click > Save as... we can save the table as an HTML document locally and use it as a reference.\n\nBefore moving forward, spend some time examining the codeplan that you have produced and if you haven’t yet had a chance to skim through the Österman (2021) article, have a quick read through their description of the dataset."
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#exercise-1-create-some-basic-descriptive-graphs-using-the-ggplot-command-from-the-ggplot2-tidyverse-package-for-the-associaton-between-the-following-variables",
    "href": "materials/worksheets/worksheets_w02.html#exercise-1-create-some-basic-descriptive-graphs-using-the-ggplot-command-from-the-ggplot2-tidyverse-package-for-the-associaton-between-the-following-variables",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "Exercise 1: Create some basic descriptive graphs using the ggplot() command from the {ggplot2} tidyverse package for the associaton between the following variables:",
    "text": "Exercise 1: Create some basic descriptive graphs using the ggplot() command from the {ggplot2} tidyverse package for the associaton between the following variables:\n\n‘trustindex3’ and ‘eduyrs25’\n‘trustindex3’ and ‘agea’\n‘trustindex3’ and ‘female’"
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#exercise-2-what-factors-affect-trust",
    "href": "materials/worksheets/worksheets_w02.html#exercise-2-what-factors-affect-trust",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "Exercise 2: What factors affect trust?",
    "text": "Exercise 2: What factors affect trust?\nFit three simple bivariate OLS regressions using the lm() function:\n\nRegress ‘trustindex3’ on ‘eduyrs25’ and interpret the results\nRegress ‘trustindex3’ on ‘agea’ and interpret the results\nRegress ‘trustindex3’ on ‘female’ and interpret the results\nRegress ‘trustindex3’ on all three predictors listed above and interpret the results"
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#advanced-exercise-3-apply-the-model-to-a-new-dataset",
    "href": "materials/worksheets/worksheets_w02.html#advanced-exercise-3-apply-the-model-to-a-new-dataset",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "(Advanced) Exercise 3: Apply the model to a new dataset",
    "text": "(Advanced) Exercise 3: Apply the model to a new dataset\nThe ostermann data originates from Waves 1-9 of the European Social Survey. The ESS data are accessible freely upon registration. As part of this exercise, access data from Wave 10 of the survey (from this site: https://www.europeansocialsurvey.org/data/) and perform the following tasks:\n\ndownload the dataset to the Rproject folder\nselect the variables required to recreate the data to fit the multiple regression model from the previous exercise\ncreate your version of the ‘trustindex3’ variable\nfit the models from Exercise 2 and compare the results.\n\nYou should already be familiar with the functions needed to complete each of these steps, but it may require some self-study. You will likely need to continue the task outside class. If you succeed in completing the Task, in Week 3 we can compare results.\n\n\n\n\n\n\n Download solutions"
  },
  {
    "objectID": "materials/worksheets/worksheets_w02.html#solutions",
    "href": "materials/worksheets/worksheets_w02.html#solutions",
    "title": "Week 2 Computer Lab Worksheet",
    "section": "SOLUTIONS",
    "text": "SOLUTIONS\nExercise 1: Create some basic descriptive graphs using the ggplot() command from the {ggplot2} tidyverse package for the associaton between the following variables:\n\n‘trustindex3’ and ‘eduyrs25’\n\nThe best way to approach this problem is by working through the first examples in Kieran Healy’s Data Visualization: A practical introduction, starting at Chapter 3, and applying them to your data. Outside class, you can develop these basic graphs into better looking ones by adding various extra layers. The ggplot() function is part of the ggplot2 package, which is included in the core tidyverse, so we don’t need to load it separately if we have already loaded the tidyverse.\nThe ggplot approach to graphs is to build them up step-by-step using various layers. The basic template code for a ggplot() graph is the following:\nggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()\nFor example, the code below sets up a totally blank canvas:\n\nggplot()\n\n\n\n\nTo start populating the canvas we need to add a first layer containing our dataset and the variables we want to ‘map’ using the aes() argument (for “aesthetic mapping”). This adds coordinates to the canvas based on the variables we want to graph (in our case, ‘trustindex3’ and ‘eduyrs25’). We are treating ‘trustindex3’ as the outcome variables in these exercises, so we will want to position it on the y axis:\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25))\n\n\n\n\nWe now have a basic layer, but no actual data. The final crucial move is to add another layer using the + operator the type of graph (called a “geom” in ggplot, short for “geometric object”, such as a bar, a line, a boxplot, histogram, etc.) that we want to use to represent the relationship between the two variables. In this case, given that both variables are measured on a numeric scale (or at least on an ordinal scale with seven or more categories), the best option is a scatterplot. In ggplot(), a scatterplot “geom” are called with the geom_point() function:\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_point()\n\n\n\n\nWe now have a scatterplot of the relationship between ‘trustindex3’ and ‘eduyrs25’. Because our scales are rather short and the data is spread out, this scatterplot is not very informative.\nWe can choose to add another “geom” that is better able to summarise the relationship. The function geom_jitter() (a shortcut to the specification geom_point(position = \"jitter\")) is helpful in such cases because it adds a small amount of random variation to the location of each point, making areas of overlapping points more visible. The commands below do the same thing:\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_point(position = \"jitter\")\n\n\n\n\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_point() + \n  geom_jitter()\n\n\n\n\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_jitter()\n\n\n\n\nAnother option is the geom_smooth() function, which provides a set of options, basically returning “smooth lines” representing various types of regression lines. The function fits a regression in the background and graphs the results. The default setting is to fit a generalized additive model that captures non-linearities in the data with a smoothing spline (the Wikipedia article on GAMs gives a maths-heavy outline of these models, but they are beyond our interests here). The smooth line produced is probably more informative about the general idea:\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_jitter() +\n  geom_smooth()\n\n\n\n\nBut we can also specify other regression methods, and because we are here aiming to model the relationship between trust and education as a linear model, we can specify the method as “lm”:\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_jitter() +\n  geom_smooth(method = \"lm\")\n\n\n\n\nNow we get a straight regression line, which is basically the visual representation of the bivariate linear regression model that we will fit in Exercise 2 below. There are numerous further specifications that can be added to improve the graph, and Healy’s book is a good resource for ideas that you can play around with. We won’t go into much more detail about these additional options here, but as a taster, let’s say that we want to make the regression line more pronounced by changing its colour to red and make the scatter dots slightly transparent by adjusting the colour’s “alpha” level:\n\nggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +\n  geom_jitter(alpha = 0.1) +\n  geom_smooth(method = \"lm\", colour = \"red\")\n\n\n\n\n\n‘trustindex3’ and ‘agea’\n\nWe can now do something similar for the relationship between trust and age (the ‘trustindex3’ and ‘agea’ variables in the dataset). Again, both variables are measured on a numeric scale, so a scatterplot should work best. Because we don’t know what to expect and therefore what additional settings would improve each individual graph, we start from the most basic informative layer and build up from there. To practice some alternative approaches to working with plots, here we will first save the basic plot as a ggplot object, to which we can later add further layers and specifications:\n\nage_plot <- ggplot(osterman, aes(y = trustindex3, x = agea)) +\n  geom_point()\n\nIf no output was generated from the command above, that’s as expected. The graph was produced, but we didn’t ask for it to be printed, we only asked for it to be saved as an object called “age_plot”. To see it, we can simply call “age_plot”. We can then make various additions to this plot.\n\nage_plot\n\n\n\n\nThis looks very similar to the previous graph, so we could add the same additional specificaitons as in the previous exercise, this time to the plot object that we saved:\n\nage_plot +\n  geom_jitter(alpha = 0.1) +\n  geom_smooth(method = \"lm\", colour = \"red\")\n\n\n\n\nThe association between age and trust appears very weak, something that we will explore further in Exercise 2.\n\n‘trustindex3’ and ‘female’\n\nWe can try a similar scatterplot here too, but there may be better options:\n\nggplot(osterman, aes(y = trustindex3, x = female)) +\n  geom_point() +\n  geom_jitter(alpha = 0.1) +\n  geom_smooth(method = \"lm\", colour = \"red\")\n\n\n\n\nThis graph can be confusing, are we are better off using another “geom” because the female variable is a dichotomous/binary factor (categorical) variable. A good visualisation tool in the this case is a boxplot, which can be called with the geom_boxplot() function. There will be a challenge, though:\n\nggplot(osterman, aes(y = trustindex3, x = female)) +\n  geom_boxplot()\n\n\n\n\nThe issue with this graph is that the female variable is recognised as numeric by R. What we need to do first - or, as part of the ggplot() function itself - is to tell R to treat *female as a factor. We could do the following:\n\nggplot(osterman, aes(y = trustindex3, x = factor(female))) +\n  geom_boxplot()\n\n\n\n\nOr as part of a piped workflow:\n\nosterman |> mutate(female = as_factor(female)) |> \n  ggplot(aes(y = trustindex3, x = female)) +\n  geom_boxplot()\n\n\n\n\nBut it may be useful to change the variable type in the dataset altogether by saving the mutation and then using the changed female variable:\n\n# We are overwriting the original dataset here, so we better not make a mistake:\nosterman <- osterman |> mutate(female = as_factor(female))\n\n# And from now on the 'female' variable will be treated as a factor:\nggplot(osterman, aes(y = trustindex3, x = female)) +\n  geom_boxplot()\n\n\n\n\nExercise 2: What factors affect trust?\nFit three simple bivariate OLS regressions using the lm() function:\n\nRegress ‘trustindex3’ on ‘eduyrs25’ and interpret the results\n\nWe will do just that, saving the regression as an object called “m1” (for model 1):\n\nm1 <- lm(trustindex3 ~ eduyrs25, data = osterman)\n\nThe model object has now been saved in the Environment, and we can inspect it manually if we want by opening it in the Source window. The object is a large list, with various components that we can call and print separately. The most basic information that we can obtain from the model is the coefficients:\n\ncoefficients(m1)\n\n(Intercept)    eduyrs25 \n  3.9085904   0.1054227 \n\n\nThis basic information is enough to solve the linear equation underpinning the model:\n\\[ y_i=b_0+b_1x_i \\] The coefficients correspond to the \\(b\\)’s in this simple model, and we can plug the values in to obtain\n\\[ trust_i=3.91 + 0.11 \\times education_i \\]\nWe find, thus, that the number of years spent in education has a positive outcome on social trust, with each additional year of education associated with a 0.11-points higher score on the trust index, above the baseline of 3.91 points in the case when education is equal to 0. With this formula we can calculate predictions of the trust score for any individual \\(i\\) from their years of education.\nWe can also get more information about the model with the summary() function. When applied to a linear model object, it provides the following output:\n\nsummary(m1)\n\n\nCall:\nlm(formula = trustindex3 ~ eduyrs25, data = osterman)\n\nResiduals:\n    Min      1Q  Median      3Q     Max \n-6.5442 -1.1737  0.1426  1.2992  6.0914 \n\nCoefficients:\n            Estimate Std. Error t value Pr(>|t|)    \n(Intercept) 3.908590   0.022630   172.7   <2e-16 ***\neduyrs25    0.105423   0.001692    62.3   <2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.853 on 68209 degrees of freedom\n  (585 observations deleted due to missingness)\nMultiple R-squared:  0.05384,   Adjusted R-squared:  0.05383 \nF-statistic:  3882 on 1 and 68209 DF,  p-value: < 2.2e-16\n\n\nThis output tells us a lot more about the fitted model, for example a summary table of the residuals and an analysis of variance (ANOVA) summary of the residuals, as well as estimates of variation for our coefficients (the standard errors and the p-values associated with t-tests - displayed as Pr(>|t|)).\nWhile these are informative, the format is not ideal for further manipulation and presentation. Several user-written functions exist to improve on this output. For example, the {broom} package - part of the {tidymodels} suite of packages - has functions to extract model information into “tidy” tibbles (data sets), which can then be further manipulated and plotted. This is especially useful when working with results from many models that would benefit from comparing in a standardised format.\nThe {sjPlot} package that we used before also has functions to export a publishable-quality table in HTML format:\n\nsjPlot::tab_model(m1)\n\n\n\n\n \ntrustindex 3\n\n\nPredictors\nEstimates\nCI\np\n\n\n(Intercept)\n3.91\n3.86 – 3.95\n<0.001\n\n\neduyrs25\n0.11\n0.10 – 0.11\n<0.001\n\n\nObservations\n68211\n\n\nR2 / R2 adjusted\n0.054 / 0.054\n\n\n\n\nBy default the output table shows 95% confidence intervals (CI) instead of standard errors (SE), which can be easier to interpret (CI are calculated as Estimate +/- (1.96 x Std. Error); you can try it out in the Console, replacing in the numeric values).\nThe best approach is to graph the model results and present them in a figure, but that’s not very informative in the case of a simple model with only one predictor, so we can leave it for later.\n\nRegress ‘trustindex3’ on ‘agea’ and interpret the results\n\nWe can do as above:\n\nm2 <- lm(trustindex3 ~ agea, data = osterman)\n\nsummary(m2)\n\n\nCall:\nlm(formula = trustindex3 ~ agea, data = osterman)\n\nResiduals:\n    Min      1Q  Median      3Q     Max \n-5.4185 -1.2439  0.1195  1.4047  4.9126 \n\nCoefficients:\n             Estimate Std. Error t value Pr(>|t|)    \n(Intercept) 4.9369533  0.0310778  158.86   <2e-16 ***\nagea        0.0060192  0.0005931   10.15   <2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.905 on 68794 degrees of freedom\nMultiple R-squared:  0.001495,  Adjusted R-squared:  0.00148 \nF-statistic:   103 on 1 and 68794 DF,  p-value: < 2.2e-16\n\n\n\nRegress ‘trustindex3’ on ‘female’ and interpret the results\n\n\nm3 <- lm(trustindex3 ~ female, data = osterman)\n\nsummary(m3)\n\n\nCall:\nlm(formula = trustindex3 ~ female, data = osterman)\n\nResiduals:\n   Min     1Q Median     3Q    Max \n-5.247 -1.247  0.094  1.419  4.761 \n\nCoefficients:\n            Estimate Std. Error t value Pr(>|t|)    \n(Intercept) 5.239336   0.010595 494.509   <2e-16 ***\nfemale1     0.008097   0.014560   0.556    0.578    \n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.906 on 68794 degrees of freedom\nMultiple R-squared:  4.495e-06, Adjusted R-squared:  -1.004e-05 \nF-statistic: 0.3093 on 1 and 68794 DF,  p-value: 0.5781\n\n\n\nRegress ‘trustindex3’ on all three predictors listed above and interpret the results\n\nFinally we can fit a multiple linear model with several predictors:\n\nm4 <- lm(trustindex3 ~ eduyrs25 + agea + female, data = osterman)\n\nsummary(m4)\n\n\nCall:\nlm(formula = trustindex3 ~ eduyrs25 + agea + female, data = osterman)\n\nResiduals:\n    Min      1Q  Median      3Q     Max \n-6.7460 -1.1756  0.1325  1.3123  6.0706 \n\nCoefficients:\n             Estimate Std. Error t value Pr(>|t|)    \n(Intercept) 2.9547287  0.0428943  68.884  < 2e-16 ***\neduyrs25    0.1163995  0.0017345  67.108  < 2e-16 ***\nagea        0.0155578  0.0005939  26.196  < 2e-16 ***\nfemale1     0.0411570  0.0141507   2.908  0.00363 ** \n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.844 on 68207 degrees of freedom\n  (585 observations deleted due to missingness)\nMultiple R-squared:  0.06336,   Adjusted R-squared:  0.06332 \nF-statistic:  1538 on 3 and 68207 DF,  p-value: < 2.2e-16\n\n\nOne interesting finding from Model 4 is to notice how radically the statistical significance of the female variable changes compared to Model 3. The impact of gender is still very weak in real terms: compared to men of similar age and education level, women score 0.04 points higher on the trust scale; but this is still a stronger effect than in the simple bivariate model (where \\(b_1\\) was 0.008), and our confidence intervals are much narrower.\nWe could tabulate these two models in a comparative table to better see the contrast:\n\nsjPlot::tab_model(m3, m4)\n\n\n\n\n \ntrustindex 3\ntrustindex 3\n\n\nPredictors\nEstimates\nCI\np\nEstimates\nCI\np\n\n\n(Intercept)\n5.24\n5.22 – 5.26\n<0.001\n2.95\n2.87 – 3.04\n<0.001\n\n\nfemale: female 1\n0.01\n-0.02 – 0.04\n0.578\n0.04\n0.01 – 0.07\n0.004\n\n\neduyrs 25\n\n\n\n0.12\n0.11 – 0.12\n<0.001\n\n\nAge ofrespondent,calculated\n\n\n\n0.02\n0.01 – 0.02\n<0.001\n\n\nObservations\n68796\n68211\n\n\nR2 / R2 adjusted\n0.000 / -0.000\n0.063 / 0.063\n\n\n\n\nIt’s worth noticing that the number of observations used in the two models is not the same, due to missing values in some variables. We could make the samples comparable by selecting out the sample of 68,211 included in Model 4, then refitting Model 3 on that sample only:\n\nsample <- m4$model\n\nm3_new <- lm(trustindex3 ~ female, data = sample)\n\nsjPlot::tab_model(m3_new, m4)\n\n\n\n\n \ntrustindex 3\ntrustindex 3\n\n\nPredictors\nEstimates\nCI\np\nEstimates\nCI\np\n\n\n(Intercept)\n5.24\n5.22 – 5.26\n<0.001\n2.95\n2.87 – 3.04\n<0.001\n\n\nfemale: female 1\n0.01\n-0.02 – 0.04\n0.582\n0.04\n0.01 – 0.07\n0.004\n\n\neduyrs 25\n\n\n\n0.12\n0.11 – 0.12\n<0.001\n\n\nAge ofrespondent,calculated\n\n\n\n0.02\n0.01 – 0.02\n<0.001\n\n\nObservations\n68211\n68211\n\n\nR2 / R2 adjusted\n0.000 / -0.000\n0.063 / 0.063\n\n\n\n\nWe see that this does not affect the overall picture.\n(Advanced) Exercise 3: Apply the model to a new dataset\nThe ostermann data originates from Waves 1-9 of the European Social Survey. The ESS data are accessible freely upon registration. As part of this exercise, access data from Wave 10 of the survey (from this site: https://www.europeansocialsurvey.org/data/) and perform the following tasks:\n\ndownload the dataset to the Rproject folder\nselect the variables required to recreate the data to fit the multiple regression model from the previous exercise\ncreate your version of the ‘trustindex3’ variable\nfit the models from Exercise 2 and compare the results.\n\nYou should already be familiar with the functions needed to complete each of these steps, but it may require some self-study. The most important missing information required to compete this task is to be found in the description on how the trustindex3 scale was computed provided by Osterman:\n\nTo study generalized social trust, I am following the established approach of using a validated three-item scale (Reeskens and Hooghe 2008; Zmerli and Newton 2008). This scale consists of the classic trust question, an item on whether people try to be fair, and an item on whether people are helpful:  - ‘Generally speaking, would you say that most people can be trusted, or that you can’t be too careful in dealing with people?’  - ‘Do you think that most people would try to take advantage of you if they got the chance, or would they try to be fair?’  - ‘Would you say that most of the time people try to be helpful or that they are mostly looking out for themselves?’  All of the items may be answered on a scale from 0 to 10 (where 10 represents the highest level of trust) and the scale is calculated as the mean of the three items. The three-item scale clearly improves measurement reliability and cross-country validity compared to using a single item, such as the classic trust question. Internal consistency for the three items is reasonably high (Cronbach’s alpha: 0.77). The scale ranges between 0 and 10 with a mean of 5.24 for my sample. See the Supplementary material for additional information on the construction of the social trust scale (Section A.1), as well as for models using the classic single-item measure of trust (Section A.9).\n\n\n<< the end >>"
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#aims",
    "href": "materials/worksheets/worksheets_w03.html#aims",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Aims",
    "text": "Aims\nIn this session we continue working with data from Österman (2021) to replicate parts their analysis. In Week 2 we practised building very basic bivariate and multiple linear regression models. In this session, we will expand on those models and attempt to replicate (some of) the models reported in Table 3 of Österman (2021):"
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#setup",
    "href": "materials/worksheets/worksheets_w03.html#setup",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Setup",
    "text": "Setup\nIn Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.\nYou can create a new .R script and .qmd/.Rmd for this week’s work (e.g. “Lab_3”). It may be useful to document more of your coding process and interpretations, so you want to work in a .qmd/.Rmd document rather than an .R script."
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#import-data",
    "href": "materials/worksheets/worksheets_w03.html#import-data",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Import data",
    "text": "Import data\nAs a first step, let’s import the osterman dataset that underpins the Österman (2021) article (see the Data page on the course website for information about the datasets available in this course):\n\nosterman <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005-data/osterman.dta\")"
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#basic-interactions",
    "href": "materials/worksheets/worksheets_w03.html#basic-interactions",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Basic interactions",
    "text": "Basic interactions\nTo continue on from the toy models that we fitted in Week 2, we may remember that we observed a change in the size of the effect and statistical significance of the gender variable (female) when also accounting for years of education and age. The two models compared as so:\n\nm_m <- lm(trustindex3 ~ eduyrs25 + agea + female, data = osterman)\n\nm_b <- lm(trustindex3 ~ female, data = m_m$model)\n\nsjPlot::tab_model(m_b, m_m)\n\n\n\n\n \ntrustindex 3\ntrustindex 3\n\n\nPredictors\nEstimates\nCI\np\nEstimates\nCI\np\n\n\n(Intercept)\n5.24\n5.22 – 5.26\n<0.001\n2.95\n2.87 – 3.04\n<0.001\n\n\nfemale\n0.01\n-0.02 – 0.04\n0.582\n0.04\n0.01 – 0.07\n0.004\n\n\neduyrs 25\n\n\n\n0.12\n0.11 – 0.12\n<0.001\n\n\nAge ofrespondent,calculated\n\n\n\n0.02\n0.01 – 0.02\n<0.001\n\n\nObservations\n68211\n68211\n\n\nR2 / R2 adjusted\n0.000 / -0.000\n0.063 / 0.063\n\n\n\n\nThe question arises whether the gender variable interacts with the other predictors, particularly with education as that is the main focus of the research. In other words, we may have reasons to assume that the linear effect of education may be different for males and females. We can interact predictor variables in regression models using the : and * operators. With :, we include in the regression model only the interaction term, but not the component variables (i.e. their main effects); to include the component variables as well, we need to add them in as usual with the + operator. To note that we should normally want to include both main effects and and the interaction effects in a regression model. With *, we include both the main effects and the interaction effects, so it provides a shortcut to using :, but it is often useful to combine the two in more complex interaction scenarios.\nIn the code below, we interact gender with both education and age, and the two commands have the same result:\n\nmxa <- lm(trustindex3 ~ eduyrs25 + agea +female + female*eduyrs25 + female*agea, data = osterman)\n\nmxb <- lm(trustindex3 ~ female*(eduyrs25 + agea), data = osterman)\n\n## Check that they are equivalent\n\nsjPlot::tab_model(mxa, mxb)\n\n\n\n\n \ntrustindex 3\ntrustindex 3\n\n\nPredictors\nEstimates\nCI\np\nEstimates\nCI\np\n\n\n(Intercept)\n3.25\n3.13 – 3.36\n<0.001\n3.25\n3.13 – 3.36\n<0.001\n\n\neduyrs 25\n0.10\n0.10 – 0.11\n<0.001\n0.10\n0.10 – 0.11\n<0.001\n\n\nAge ofrespondent,calculated\n0.01\n0.01 – 0.02\n<0.001\n0.01\n0.01 – 0.02\n<0.001\n\n\nfemale\n-0.53\n-0.69 – -0.36\n<0.001\n-0.53\n-0.69 – -0.36\n<0.001\n\n\neduyrs25:female\n0.03\n0.02 – 0.03\n<0.001\n\n\n\n\n\nagea:female\n0.00\n0.00 – 0.01\n<0.001\n\n\n\n\n\nfemale:eduyrs25\n\n\n\n0.03\n0.02 – 0.03\n<0.001\n\n\nfemale:agea\n\n\n\n0.00\n0.00 – 0.01\n<0.001\n\n\nObservations\n68211\n68211\n\n\nR2 / R2 adjusted\n0.064 / 0.064\n0.064 / 0.064\n\n\n\n\nWe find that both interaction terms (eduyrs25:female and agea:female) are statistically significant, which can tell us that there is an interaction that may be useful to include in the model. The greatest challenge is in interpreting results from interaction models, as the usual interpretation of the main effect terms no longer stands and the focus should be on the interaction terms instead. The best approach is to visualise the interactions using a graph, and there are various user-written functions that make this task easy. Just to keep it simple, we will use the plot_model() function from the {sjPlot} package.\nFirst, we can check the interaction between education and gender:\n\nsjPlot::plot_model(mxa, type = \"pred\", terms = c(\"eduyrs25\", \"female\"))\n\n\n\n\nThis visualization in now easier to interpret than the numerical output. We see a steeper line for females than for males and the lines cross each other at around 12 years of completed schooling. Years spent in education therefore have a stronger influence on social trust for females than for males after around 12 years (which is generally the length of compulsory education in most European countries from which the data is drawn). Among those with education below this level, we find that males have higher levels of social trust, but longer educational careers have a stronger impact for women.\nWe can check the same for the interaction of gender and age:\n\nsjPlot::plot_model(mxa, type = \"pred\", terms = c(\"agea\", \"female\"))\n\n\n\n\nThe general pattern is the same, however, the overlapping confidence intervals up until about the age of 53 tell us that the observed difference between the two genders may be due to random variation in our sample. The effect of this interaction is weaker and less reliable than the one with education, and unless we have a strong theoretically motivated reason to include it in our model, we may want to keep a simpler model without this interaction effect."
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#replicating-model-1",
    "href": "materials/worksheets/worksheets_w03.html#replicating-model-1",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Replicating Model 1",
    "text": "Replicating Model 1\nWith this basic understanding of interaction terms, we can now attempt to replicate Model 1 from the Osterman article. Based on the description provided in the article and the appended materials we can reconstruct the main components of their model as shown below. As a first step, run the command and compare our results to those presented in Model 1. Bear in mind that the effects most coefficients are not presented in the main table (see also the table’s footnotes and Appendix 4 in the Supplementary Materials). We also did not include any survey weights in our model and did not include any error corrections, which the author does; so the results will not be exactly the same.\nIn the footnotes to Table 3, the author tells us that in addition to what is reported in the table,\n\nAll models include reform FEs [Fixed Effects], a general quadratic birth year trend, and reform-specific trends for birth year (linear) and age (quadratic). Additional controls: foreign-born parents and ESS round dummies.\n\nThe version of the model summary presented in Table A.3 in the Appendix also includes the additional controls, but not the other terms.\nIn the main text, the author further explains some of the reasoning behind their modelling choices:\n\nOne dilemma for the design is that there has been a trend of increasing educational attainment throughout the studied time period, which means that the reform-windows of treated and non-treated cohorts will also pick up the effects of this trend. To counter this, [the list of covariates] includes a general quadratic birth year trend, reform-specific linear controls for birth year and reform-specific quadratic age trends. The quadratic terms are included to allow sufficient flexibility in absorbing possible non-linear trends of increasing education within the reform-window of seven treated and seven untreated birth year cohorts. … The reform-fixed effects are also essential because they imply that only the within-reform-window variation is used to estimate the effects and between-reform differences are factored out, such as pre-reform differences in social trust. (Österman 2021, 221–22)\n\nBefore we fit the model, some of the concepts in the quotation need unpacking. A quadratic term is a second-degree polynomial term - put simply, it’s the square of the variable concerned. The quadratic of a variable such as \\(age\\) is therefore \\(age^2\\), or \\(age \\times age\\). In other words, it is like a variable’s “interaction” with itself. Because of this, there are several ways in which the quadratic terms to be included in a model can be specified:\n\nWe could create the quadratic terms as new variables, and include those in the model, as below:\n\n\n# First, load the original dataset\nosterman <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005/data/osterman.dta\")\n\n# Create new quadratic variables by multiplication with themselves and add them to the dataset, saving it as a new data object\nosterman2 <- osterman |> \n  mutate(agea_quad = agea*agea,        # quadratic age variable\n         yrbrn_quad = yrbrn*yrbrn)     # quadratic birth-year variable\n\n# We now have two additional variables in the dataset; we can fit the model using those:\nm1_prequad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3\n                   fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +     # additional controls for foreign-born parents and ESS Round dummies\n                   agea + yrbrn + agea_quad + yrbrn_quad +                              # general quadratic birth year trend and quadratic age\n                   factor(reform_id_num) +                                              # reform fixed effects dummies\n                   yrbrn:factor(reform_id_num) +                                        # reform-specific birth year trend\n                   agea:factor(reform_id_num) +  agea_quad:factor(reform_id_num),       # reform-specific quadratic age trend\n               data = osterman2)                                                        # the new expanded dataset\n\n\nWe can get the same results by creating the quadratic terms directly as part of the modelling function. The one thing we should keep in mind is that if we want to include standard mathematical operations within a formula function, we need to isolate or insulate the operation from R’s formula parsing code using the I() function, which returns the contents of I(...) “as.is”. The model formula would then be:\n\n\nm1_funquad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3\n                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +       # additional controls for foreign-born parents and ESS Round dummies\n                 agea + yrbrn + I(agea^2) + I(yrbrn^2) +                                # general quadratic birth year trend and quadratic age\n                 factor(reform_id_num) +                                                # reform fixed effects dummies\n                 yrbrn:factor(reform_id_num) +                                          # reform-specific birth year trend\n                 agea:factor(reform_id_num) +  I(agea^2):factor(reform_id_num),         # reform-specific quadratic age trend\n              data = osterman)                                                          # the original dataset\n\n\nIn the two previous options, the quadratic terms will be correlated with the original variables. To avoid this by relying on so-called orthogonal polynomials we should use the poly() function. We can also fit the same correlated polynomial model as the ones above by adding the raw = TRUE option to the poly() function. In the code below, we will fit the correlated version first, then the orthogonal version (This stackoverflow discussion explains in more detail the difference between the two options):\n\n\nm1_polyraw <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + \n                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + \n                 poly(agea, 2, raw = TRUE) + poly(yrbrn, 2, raw = TRUE) +\n                 factor(reform_id_num) +           \n                 yrbrn:factor(reform_id_num) + \n                 agea:factor(reform_id_num) + poly(agea, 2, raw = TRUE):factor(reform_id_num),\n               data = osterman)\n\nm1_polyorth <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + \n                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + \n                 poly(agea, 2) + poly(yrbrn, 2) +\n                 factor(reform_id_num) +           \n                 yrbrn:factor(reform_id_num) + \n                 agea:factor(reform_id_num) + poly(agea, 2):factor(reform_id_num),\n               data = osterman)\n\nIt’s worth noting, however, that the Stata routine used by the author fitted correlated/raw coded polynomials, so the orthogonal version below is just for a comparison and we will not use it going forward. For a side-by-side overview comparison of the models we fitted so far, we could use a model summary tabulation function (e.g. sjPlot::tab_model(), which we have used before, or modelsummary::modelsummary(), stargazer::stargazer(), or jtools::export_summs(); you can read through the function documentation and test them out). Below we will use the modelsummary() function from modelsummary. The function takes a list() object as argument, and the list can include several models to be tabulated side-by-side. We can either build the list() object first and pass the object to modelsummary(), or we can build the list directly as part of the call to modelsummary(). We can also easily include more descriptive labels for the models that we want to tabulate within the lint() function, as we will do below:\n\n# First, install the package:\ninstall.packages(\"modelsummary\")\n\n# Second, we make a list of the models we want to summarise; we can even name them:\nmodels <- list(\n  \"Pre-calculated quadratic\" = m1_prequad,\n  \"Within-function quadratic\" = m1_funquad,\n  \"poly() with raw coding\" = m1_polyraw,\n  \"poly() with default orthogonal coding\" = m1_polyorth\n)\n\n# modelsummary table with stars for p-values added, as in the published article\nmodelsummary::modelsummary(models, stars = TRUE)\n\nThe results from the modelsummary() are not shown here because it’s a long and ugly table, but it’s useful for perusing to compare the results across the different models. For a cleaner table showing only the results included in Table A.3 in the Appendix to Österman (2021), we can use the coef_map or the coef_omit option in modelsummary() and only include m1_funquad:\n\n# It's cleaner to first make a vector of the coefficients we wish to include; we can name the coefficients as they appear in Table A.3; note that we also leave out the Intercept, as in the published table:\ncoefs <- c(\"reform1_7\" = \"Reform\",\n           \"female\" = \"Female\",\n           \"blgetmg_d\" = \"Ethnic minority\",\n           \"fbrneur\" = \"Foreign-born father, Europe\",\n           \"mbrneur\" = \"Foreign-born mother, Europe\",\n           \"fnotbrneur\" = \"Foreign-born father, outside Europe\",\n           \"mnotbrneur\" = \"Foreign-born mother, outside Europe\",\n           \"factor(essround)2\" = \"ESS Round 2\",\n           \"factor(essround)3\" = \"ESS Round 3\",\n           \"factor(essround)4\" = \"ESS Round 4\",\n           \"factor(essround)5\" = \"ESS Round 5\",\n           \"factor(essround)6\" = \"ESS Round 6\",\n           \"factor(essround)7\" = \"ESS Round 7\",\n           \"factor(essround)8\" = \"ESS Round 8\",\n           \"factor(essround)9\" = \"ESS Round 9\")\n\n# Then we pass the vector to coef_map to select the coefficients to print\nmodelsummary::modelsummary(list(\"M1\" = m1_funquad), stars = TRUE, coef_map = coefs)\n\n\n\n\n   \n    M1 \n  \n\n\n Reform \n    0.063* \n  \n\n  \n    (0.027) \n  \n\n Female \n    0.058*** \n  \n\n  \n    (0.013) \n  \n\n Ethnic minority \n    −0.241*** \n  \n\n  \n    (0.054) \n  \n\n Foreign-born father, Europe \n    −0.111** \n  \n\n  \n    (0.042) \n  \n\n Foreign-born mother, Europe \n    −0.108* \n  \n\n  \n    (0.044) \n  \n\n Foreign-born father, outside Europe \n    −0.065 \n  \n\n  \n    (0.073) \n  \n\n Foreign-born mother, outside Europe \n    −0.110 \n  \n\n  \n    (0.078) \n  \n\n ESS Round 2 \n    0.059 \n  \n\n  \n    (0.045) \n  \n\n ESS Round 3 \n    0.162* \n  \n\n  \n    (0.075) \n  \n\n ESS Round 4 \n    0.243* \n  \n\n  \n    (0.108) \n  \n\n ESS Round 5 \n    0.360* \n  \n\n  \n    (0.144) \n  \n\n ESS Round 6 \n    0.397* \n  \n\n  \n    (0.179) \n  \n\n ESS Round 7 \n    0.449* \n  \n\n  \n    (0.212) \n  \n\n ESS Round 8 \n    0.655** \n  \n\n  \n    (0.246) \n  \n\n ESS Round 9 \n    0.816** \n  \n\n  \n    (0.283) \n  \n\n Num.Obs. \n    68796 \n  \n\n R2 \n    0.200 \n  \n\n R2 Adj. \n    0.198 \n  \n\n AIC \n    268913.8 \n  \n\n BIC \n    270056.1 \n  \n\n Log.Lik. \n    −134331.877 \n  \n\n RMSE \n    1.71 \n  \n\n\n + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001\n\n\n\nWe can compare our table to the published one to confirm that the results are very close, but that there are some inconsistencies that we will address later in Worksheet 6."
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#exercise-1",
    "href": "materials/worksheets/worksheets_w03.html#exercise-1",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Exercise 1",
    "text": "Exercise 1\nThis model provides a lot to play around with. For this exercise, start from this complex model and start simplifying it in various ways. Try excluding some of the interaction terms and check the change in the results. Plot the interaction terms and attempt an interpretation. Note down your thoughts and models. Playing around with this model is a good way to learn about the effect of interactions."
  },
  {
    "objectID": "materials/worksheets/worksheets_w03.html#exercise-2",
    "href": "materials/worksheets/worksheets_w03.html#exercise-2",
    "title": "Week 3 Computer Lab Worksheet",
    "section": "Exercise 2",
    "text": "Exercise 2\nRead through the Osterman article and check your understanding of how the author interprets these results. Particularly, how does the interaction model help us elucidate causal factors in the regression model?"
  },
  {
    "objectID": "materials/worksheets/worksheets_w04.html",
    "href": "materials/worksheets/worksheets_w04.html",
    "title": "Week 4 Computer Lab Worksheet",
    "section": "",
    "text": "In this lab you will practice fitting various types of generalised linear models. These models generalise linear regression to situations where the outcome (dependent) variable is not drawn from a normal (Gaussian) distribution. We will look at examples of logit/probit and Poisson models. Continuing the idea of causal estimation from week 3, we start with data underpinning the article by Ladd and Lenz (2009) and will attempt to replicate a simpler version of their probit model (see their Table 1A). We then fit a Poisson model with data from Weiss et al. (2021), attempting to replicate their Table S6.\nBy the end of the session, you will:\n\nlearn how to fit logit, probit and Poisson models\ngain experience summarising, visualising and interpreting results from these models\npractice further data manipulation techniques"
  },
  {
    "objectID": "materials/worksheets/worksheets_w04.html#setup",
    "href": "materials/worksheets/worksheets_w04.html#setup",
    "title": "Week 4 Computer Lab Worksheet",
    "section": "Setup",
    "text": "Setup\nIn Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.\nYou can create a new .R script and .qmd/.Rmd for this week’s work (e.g. “Lab_4”).\nWe can also load the packages that we will be using in this lab. As the number of libraries needed for completing more advanced analyses increases, it’s worth using some method that avoids repetition, such as the p_load() function from pacman:\n\n## First, install {pacman} if not yet installed:\nif (!require(\"pacman\")) install.packages(\"pacman\")\n\n## Then use {pacman} to attach packages and if they are not yet installed, install them:\npacman::p_load(tidyverse, sjPlot, sjlabelled, sjmisc,\n               modelsummary, summarytools, gtsummary,\n               marginaleffects)\n\n\n\n\n\n\n\nIf install.packages() throws up an error in Rmd/qmd:\n\n\n\n\n\nIf you are working in Rmarkdown/Quarto and you get an error message that includes the description\n...Error in contrib.url(repos, \"source\") :\n  trying to use CRAN without setting a mirror...\nyou can specify the CRAN mirror to be used by adding\n\noptions(repos = list(CRAN = \"http://cran.rstudio.com/\"))\n\nbefore the call to install.packages()"
  },
  {
    "objectID": "materials/worksheets/worksheets_w04.html#exercise-1-estimating-the-persuasive-power-of-the-news-media",
    "href": "materials/worksheets/worksheets_w04.html#exercise-1-estimating-the-persuasive-power-of-the-news-media",
    "title": "Week 4 Computer Lab Worksheet",
    "section": "Exercise 1: Estimating the persuasive power of the news media",
    "text": "Exercise 1: Estimating the persuasive power of the news media\nIn this exercise we will focus on data from an article by Ladd and Lenz (2009) (follow the doi link in the citation to access the article; you can also download it here). In this article the authors aim to estimate whether (print) media has the power to persuade people to vote differently. In general terms, this is a very interesting question, even though the power of print media has definitely weakened over the past few decades and measuring the effect of alternative media sources would present additional challenges. Nevertheless, the authors attempt to take advantage of a unique natural experiment that arguably facilitates tackling this causal question: between the 1992 and 1997 UK general elections, four newspapers (the Sun, Daily Star, Independent, and Financial Times) changed their editorial stance and tone from supporting the Conservative Party to supporting Tony Blair’s Labour Party. Such radical shifts editorial stance are extremely rare. The data they use come from several waves of the British Election Panel Study from between 1992 and 1997 and include variables on voting behaviour in 1992 and 1997, as well as whether the respondent was a reader of one of the “switching” newspapers. These are the main variables that are useful for tackling the causal effect involved in the research question, but there are several other variable that provide various controls.\nData exploration\nWe can load a version of the dataset from the course website. The dataset is in Stata format (.dta), so we’ll use one of the import functions we’ve used before:\n\n# Let's call the data object \"news\" for simplicity\n\nnews <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005-data/LaddLenz.dta\")\n\n\n\n\nBecause it’s a Stata dataset, the variables are likely to have useful labels that can provide more information about them. We have already used the sjPlot::view_df() function, which is very useful for this:\n\nview_df(news)\n\nHave a look at the table to get a sense for the data. Table 1S in the Appendix to the article offers more details on the coding and meaning of the variables, which will be very useful later. You can download the Appendix here.\n\nQuestions\n\nWhat does the hhincq92 (“Prior Income”) variable measure, and how are its levels coded?\nWhat might the values coded as “9” in some of the variables mean?\n\n\n\nTo better understand the dataset and check whether any data management tasks are needed before the statistical analysis, let’s learn a few new functions that can come handy. There are various packages that include functions to summarise data, and we will look at three further options: modelsummary, gtsummary and summarytools.\nsummarytools::dfSummary()\nThis is a nice quick function that produces more complex “codeplans” than sjPlot::view_df() for labelled data in a similar HTML format. The easiest way to use it is to combine it with summarytools::view() as part of the same pipeline to get a nicely formatted HTML table in the Viewer pane:\n\nsummarytools::dfSummary(news) |> summarytools::view()\n\ngtsummary::tbl_summary()\nThis is another option that works well with labelled data and produces summaries of both numeric continuous and categorical variables in a publication-ready table style, but it can be very useful for quick descriptive checks. It’s a little bit slow:\n\ngtsummary::tbl_summary(news)\n\nmodelsummary::datasummary_skim()\nThis function doesn’t show variable labels and by default only summarises “numeric” variables; categorical variables can be tabulated separately by adding the argument type = \"categorical\". It works on our dataset because all the variables are currently recognised as numeric, but it’s not the most efficient for labelled data. We’ll look at it because the {modelsummary} package contains a number of other extremely useful functions that produce flexible tables, which we will use later:\n\nmodelsummary::datasummary_skim(news)\n\n\n\nQuestions\n\nWhat else have we learnt about the data?\nAre there any data transformation tasks necessary before we use the data for statistical modelling?\n\n\nData wrangling\nIt really appears that the values coded as 9 across several variables are in fact “missing” values. Instead of excluding them from the statistical analysis by setting them as NA, we will keep them, just as the authors of the original study have done. But we’ll relabel the categorical (factor) variables before further analysis.\nThere are some other strange inconsistencies that we should take care of. Let’s take them step-by step in the code-chunk below using some convenience functions from sjlabelled and sjmisc (note that everything could be done with some extra coding using core tidyverse packages):\n\nnews <- news |> \n  \n## wkclass should be an indicator dummy, but kept as numeric;\n## `dicho` function by default creates a new variable with \"_d\" as suffix in the name;\n## setting `suffix = \"\"` overwrites the original variable\n  sjmisc::dicho(wkclass, dich.by = 0.5, as.num = TRUE, suffix = \"\") |> \n  \n## categories (levels) can be labelled\n  sjlabelled::set_labels(know_3, labels = c(\"low\", \n                                           \"medium\", \n                                           \"high\")) |> \n  sjlabelled::set_labels(copemg92, labels = c(\"Mortgage Very Difficult\" = 1, \n                                              \"Mortgage a Bit Difficult\" = 0.5, \n                                              \"Not Really Difficult or No Mortgage\" = 0,\n                                              \"Not answered\" = 9)) |> \n  sjlabelled::set_labels(hedqul92, labels = c(\"Less than O level (or foreign qual)\" = 0,\n                                              \"O Level or Equivalent\" = .25,\n                                              \"A Level or Equivalent\" = .5,\n                                              \"Some Higher Education\" = .75,\n                                              \"College Degree\" = 1)) |> \n  sjlabelled::set_labels(hhincq92, labels = c(\"5999 or Less\", \n                                              \"6000-11,999\",\n                                              \"12,000-19,999\",\n                                              \"20,000 or More\",\n                                              \"Not answered\")) |> \n  sjlabelled::set_labels(ragect92, labels = c(\"18-24\", \n                                              \"25-34\",\n                                              \"35-44\", \n                                              \"45-54\", \n                                              \"55-59\",\n                                              \"60-64\", \n                                              \"65+\", \n                                              \"Not Given\")) |> \n\n## some variables should be categorical (factor)\n  \n  ### first, change to character type to overwrite the fractional numeric codes with the labels\n  sjlabelled::as_character(know_3, hedqul92, hhincq92, ragect92) |> \n  \n  ### then, change to factors; labels will show, levels are renumbered starting from 1\n  sjlabelled::as_factor(know_3, hedqul92, hhincq92, ragect92) |> \n\n##### instead of the two commands above, changing to `sjlabelled::as_label()` has the same effect\n##### note also that we did not set \"copemg92\" as categorical, just because the authors will treat is as numeric in their model, even if it doesn't make much sense, especially because of the \"Not answered\" category\n\n## Finally, some changes will make it easier to read the model as fit by the authors:\n\n  ### will change the label associated with the `tolabor` variable (the \"treatment\" variable)\n  sjlabelled::var_labels(tolabor = \"Treatment (read switching paper)\")\n\nWe can check some frequency tabulations to see the outcome:\n\n# Five frequency tables\nnews |> sjmisc::frq(know_3, copemg92, hedqul92, hhincq92, ragect92)\n\nOr we can check a whole summary table of the data frame again, which now looks like this:\n\ngtsummary::tbl_summary(news)\n\n\n\n\n\n\n\nThis is what your table should look like:\n\n\n\n\n\n\n\n\n\n\n Characteristic \n    N = 1,593 \n  \n\n\n outcome: voted labor in 1997 \n    718 (45%) \n  \n\n Treatment (read switching paper) \n    211 (13%) \n  \n\n Prior Conservative Identification \n    666 (42%) \n  \n\n Prior Labour Identification \n    506 (32%) \n  \n\n Prior Liberal Identification \n    241 (15%) \n  \n\n White \n    1,557 (98%) \n  \n\n Working-Class \n    955 (60%) \n  \n\n Parents Voted Labour \n    581 (36%) \n  \n\n Prior Ideological Moderation \n    0.69 (0.49, 0.83) \n  \n\n Prior Labour Vote \n    528 (33%) \n  \n\n Prior Conservative Vote \n    640 (40%) \n  \n\n Prior Liberal Vote \n    293 (18%) \n  \n\n Prior Labour Party Support \n    0.50 (0.25, 0.75) \n  \n\n Prior Conservative Party Support \n    0.50 (0.25, 0.75) \n  \n\n Prior Political Knowledge \n     \n  \n\n high \n    744 (47%) \n  \n\n low \n    252 (16%) \n  \n\n medium \n    597 (37%) \n  \n\n Prior Television Viewer \n    446 (28%) \n  \n\n Prior Daily Newspaper Reader \n    1,111 (70%) \n  \n\n Prior Ideology \n    0.55 (0.41, 0.68) \n  \n\n Authoritarianism \n    0.56 (0.48, 0.64) \n  \n\n Prior Trade Union Member \n    378 (24%) \n  \n\n Prior Coping Mortgage \n     \n  \n\n 0 \n    272 (17%) \n  \n\n 0.5 \n    511 (32%) \n  \n\n 1 \n    807 (51%) \n  \n\n 9 \n    3 (0.2%) \n  \n\n Prior Education \n     \n  \n\n A Level or Equivalent \n    187 (12%) \n  \n\n College Degree \n    636 (40%) \n  \n\n Less than O level (or foreign qual) \n    168 (11%) \n  \n\n O Level or Equivalent \n    271 (17%) \n  \n\n Some Higher Education \n    331 (21%) \n  \n\n Prior Income \n     \n  \n\n 12,000-19,999 \n    377 (24%) \n  \n\n 20,000 or More \n    528 (33%) \n  \n\n 5999 or Less \n    245 (15%) \n  \n\n 6000-11,999 \n    313 (20%) \n  \n\n Not answered \n    130 (8.2%) \n  \n\n Prior Age \n     \n  \n\n 18-24 \n    103 (6.5%) \n  \n\n 25-34 \n    289 (18%) \n  \n\n 35-44 \n    347 (22%) \n  \n\n 45-54 \n    344 (22%) \n  \n\n 55-59 \n    128 (8.0%) \n  \n\n 60-64 \n    118 (7.4%) \n  \n\n 65+ \n    243 (15%) \n  \n\n Not Given \n    21 (1.3%) \n  \n\n Male \n    864 (54%) \n  \n\n North West \n    143 (9.0%) \n  \n\n Yorks \n    120 (7.5%) \n  \n\n West Midlands \n    129 (8.1%) \n  \n\n East Midlands \n    109 (6.8%) \n  \n\n East Anglia \n    49 (3.1%) \n  \n\n SW England \n    126 (7.9%) \n  \n\n SE England \n    272 (17%) \n  \n\n Greater London \n    115 (7.2%) \n  \n\n Wales \n    59 (3.7%) \n  \n\n Scotland \n    396 (25%) \n  \n\n Profession: Large Employer \n    242 (15%) \n  \n\n Profession: Small Employer \n    73 (4.6%) \n  \n\n Profession: Self Employed \n    628 (39%) \n  \n\n Profession: Employee \n    64 (4.0%) \n  \n\n Profession: Temporary Worker \n    480 (30%) \n  \n\n Profession: Junior \n    77 (4.8%) \n  \n\n\n1 n (%); Median (IQR)\n\n\n\n\n\n\nComparing proportions\nThe aim of Ladd and Lenz (2009) is to estimate the effect of “reading a switching newspaper” on respondents’ voting behaviour change between the 1992 and 1997 elections. In terms of our variables in the dataset, they aim to estimate vote_l_97 (Voted Labour in 1997) from vote_l_92 (Prior Labour vote in 1992) as moderated by tolabor (treatment: indicator of whether reading a switching newspaper). All three variables are binary indicator variables.\nWe have what is called panel data, consisting of measurements on the same individuals at two different time points (1992 and 1997), which allows us to think in causal terms. But the question could be first broken down into two smaller questions exploring:\n\nthe average treatment effect of tolabour in a cross-sectional design (oblivious of prior vote)\na before/after comparison of the treated group, comparing their average vote in 1997 to their average vote in 1992\na differences-in-differences comparison of the average changes over time in the treatment group and average changes over time in the control group.\n\nOverall mean vote\nWhat is the overall proportion of those voting Labour in the 1997 elections in the sample?\n\nmean(news$vote_l_97)\n\n[1] 0.4507219\n\n\nAs with all proportions, this can be read as a percentage if multiplied by 100:\n\nmean(news$vote_l_97) * 100\n\n[1] 45.07219\n\n\nConditional mean vote\nWhat about the proportion of Labour voters among those who read/not read papers that shifted their editorial support?\n\n# We can first break down the dataset into two:\nreader <- news |>  filter(tolabor == 1) \nnot_reader <- news |>  filter(tolabor == 0) \n\n# Then calculate means within each:\nmean_reader_97 <- mean(reader$vote_l_97)\nmean_not_reader_97 <- mean(not_reader$vote_l_97)\n\n# With the result:\nmean_reader_97\nmean_not_reader_97\n\n[1] 0.5829384\n[1] 0.4305355\n\n\nAverage Treatment Effect\nThe average treatment effect would be the difference between the two groups:\n\nATE <- mean_reader_97 - mean_not_reader_97\n\nATE\n\n[1] 0.1524029\n\n\nBefore/After\n\nmean_reader_92 <- mean(reader$vote_l_92)\nmean_not_reader_92 <- mean(not_reader$vote_l_92)\n\nBA_reader <- mean_reader_97 - mean_reader_92\nBA_reader\n\nBA_not_reader <- mean_not_reader_97 - mean_not_reader_92\nBA_not_reader\n\n[1] 0.1943128\n[1] 0.1078148\n\n\nDifferences-in-Differences\n\nDD <- BA_reader - BA_not_reader\n\nDD\n\n[1] 0.08649803\n\n\n\nQuestions\n\nWhat have we learnt from these various comparisons of proportions?\n\n\nLinear probability model\nThe comparison of proportions that we calculated earlier rely on comparing average changes, so we could obtain the results using ordinary least squares linear regression. For example, we can check what the overall mean vote for Labour in 1997 was in the sample, and the average treatment effect we calculated earlier:\n\n## just the mean of Labour vote in 1997\nmean_l_97 <- lm(vote_l_97 ~ 1, data = news)\ncoefficients(mean_l_97)\n\n(Intercept) \n  0.4507219 \n\n## ATE\nATE_reg <- lm(vote_l_97 ~ tolabor, data = news)\ncoefficients(ATE_reg)\n\n(Intercept)     tolabor \n  0.4305355   0.1524029 \n\n\nThis ATE_reg model is a first step towards fitting the model presented by the authors in Table 1A. They report there on results from a probit regression, but we can start by building a linear regression model that we are familiar with (i.e. we can fit a “linear probability model”):\n\nm0_lpm <- lm(vote_l_97 ~ tolabor + vote_l_92, data = news)\nsummary(m0_lpm)\n\n\nCall:\nlm(formula = vote_l_97 ~ tolabor + vote_l_92, data = news)\n\nResiduals:\n    Min      1Q  Median      3Q     Max \n-0.9981 -0.2114 -0.2114  0.1095  0.7886 \n\nCoefficients:\n            Estimate Std. Error t value Pr(>|t|)    \n(Intercept)  0.21137    0.01208  17.495  < 2e-16 ***\ntolabor      0.10765    0.02800   3.844 0.000126 ***\nvote_l_92    0.67910    0.02016  33.678  < 2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3784 on 1590 degrees of freedom\nMultiple R-squared:  0.4226,    Adjusted R-squared:  0.4219 \nF-statistic: 581.9 on 2 and 1590 DF,  p-value: < 2.2e-16\n\n\n\nQuestions\nWhat does this simple model tell us?\n\nWhat about if we also include the interaction effect between the two predictors?\n\nm0_lpm_int <- lm(vote_l_97 ~ tolabor * vote_l_92, data = news)\nsummary(m0_lpm_int)\n\n\nCall:\nlm(formula = vote_l_97 ~ tolabor * vote_l_92, data = news)\n\nResiduals:\n     Min       1Q   Median       3Q      Max \n-0.92683 -0.20513 -0.20513  0.09641  0.79487 \n\nCoefficients:\n                  Estimate Std. Error t value Pr(>|t|)    \n(Intercept)        0.20513    0.01235  16.607  < 2e-16 ***\ntolabor            0.15921    0.03549   4.486 7.77e-06 ***\nvote_l_92          0.69846    0.02174  32.124  < 2e-16 ***\ntolabor:vote_l_92 -0.13597    0.05763  -2.359   0.0184 *  \n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3779 on 1589 degrees of freedom\nMultiple R-squared:  0.4247,    Adjusted R-squared:  0.4236 \nF-statistic: 390.9 on 3 and 1589 DF,  p-value: < 2.2e-16\n\n\n\nQuestion: How do we interpret the coefficients on this simple interaction model?\nIf the interpretation of the interaction model proves difficult (as it should), maybe it’s better to visualise the model by plotting the coefficients. We can do that using the sjPlot::plot_model() function:\n\nsjPlot::plot_model(m0_lpm_int, type = \"pred\", terms = c(\"tolabor\", \"vote_l_92\"))\n\n\n\n\nTo make the meaning of the plot even more straightforward for our interpretative purposes, we can make some changes to the y-axis to display vertical lines at the probability-levels that we got from the model coefficients, and label them with the name of the regression terms. Check the code below against the information from the model summary:\n\nsjPlot::plot_model(m0_lpm_int, type = \"pred\", terms = c(\"tolabor\", \"vote_l_92\")) + \n  scale_y_continuous(breaks = c(0.20513, \n                                0.20513 + 0.15921, \n                                0.20513 + 0.69846, \n                                0.20513 + 0.15921 + 0.69846-0.13597),\n                     labels = c(\"Intercept\",\n                                \"tolabor\",\n                                \"vote_l_92\",\n                                \"tolabor:vote_l_92\"))\n\n\n\n\n\nThe model we are interested in reproducing is that from the first column of Table 1A, and that model does not include any interaction terms, only main effects. We will leave out the interaction, bu instead let’s include all the predictors (independent variables) that were included in the original article. Since all the variables in the dataset are included in that model, we can take a shortcut from having to type out all the additive variable names by using the . identifier as a shorthand for “all the variables in the dataset not already mentioned in the regression formula”. This could then be combined with “-” to exclude some variables that we do not want included. In our case, we want to regress the 1997 vote for Labour on all the other variables in the dataset:\n\nm1_lpm <- lm(vote_l_97 ~ ., data = news)\n\nWe will be comparing the results from this model with those from other models, so instead of checking the model with the summary() function, let’s use the modelsummary() function from the modelsummary package. The function takes a list() object as argument, and the list can include several models to be tabulated side-by-side. For example, let’s include in the summary table three of the models that we built so far: mean_l_97, m0_lpm and m1_lpm. We can either build the list() object first and pass the object to modelsummary(), or we can build the list as part of the call to the latter:\n\nmodelsummary::modelsummary(list(mean_l_97, m0_lpm, m1_lpm))\n\n\n\n\n\n\n\nThis is what the modelsummary table should look like:\n\n\n\n\n\n\n\n\n\n\n   \n     (1) \n      (2) \n      (3) \n  \n\n\n (Intercept) \n    0.451 \n    0.211 \n    0.485 \n  \n\n  \n    (0.012) \n    (0.012) \n    (0.163) \n  \n\n tolabor \n     \n    0.108 \n    0.105 \n  \n\n  \n     \n    (0.028) \n    (0.028) \n  \n\n vote_l_92 \n     \n    0.679 \n    0.280 \n  \n\n  \n     \n    (0.020) \n    (0.046) \n  \n\n conservative \n     \n     \n    −0.003 \n  \n\n  \n     \n     \n    (0.043) \n  \n\n labor \n     \n     \n    0.156 \n  \n\n  \n     \n     \n    (0.042) \n  \n\n liberal \n     \n     \n    0.083 \n  \n\n  \n     \n     \n    (0.044) \n  \n\n white \n     \n     \n    −0.193 \n  \n\n  \n     \n     \n    (0.063) \n  \n\n wkclass \n     \n     \n    0.041 \n  \n\n  \n     \n     \n    (0.022) \n  \n\n parent_labor \n     \n     \n    0.032 \n  \n\n  \n     \n     \n    (0.021) \n  \n\n f_ideo92 \n     \n     \n    0.010 \n  \n\n  \n     \n     \n    (0.080) \n  \n\n vote_c_92 \n     \n     \n    −0.079 \n  \n\n  \n     \n     \n    (0.048) \n  \n\n vote_lib_92 \n     \n     \n    −0.076 \n  \n\n  \n     \n     \n    (0.047) \n  \n\n labfel92 \n     \n     \n    0.224 \n  \n\n  \n     \n     \n    (0.054) \n  \n\n confel92 \n     \n     \n    −0.059 \n  \n\n  \n     \n     \n    (0.052) \n  \n\n know_3low \n     \n     \n    0.085 \n  \n\n  \n     \n     \n    (0.030) \n  \n\n know_3medium \n     \n     \n    0.026 \n  \n\n  \n     \n     \n    (0.022) \n  \n\n TVnewseither \n     \n     \n    0.024 \n  \n\n  \n     \n     \n    (0.021) \n  \n\n read_paper \n     \n     \n    −0.030 \n  \n\n  \n     \n     \n    (0.021) \n  \n\n ideo92 \n     \n     \n    0.134 \n  \n\n  \n     \n     \n    (0.105) \n  \n\n auth92 \n     \n     \n    −0.002 \n  \n\n  \n     \n     \n    (0.077) \n  \n\n tusa92 \n     \n     \n    0.007 \n  \n\n  \n     \n     \n    (0.023) \n  \n\n copemg92 \n     \n     \n    −0.051 \n  \n\n  \n     \n     \n    (0.019) \n  \n\n hedqul92College Degree \n     \n     \n    −0.006 \n  \n\n  \n     \n     \n    (0.034) \n  \n\n hedqul92Less than O level (or foreign qual) \n     \n     \n    −0.028 \n  \n\n  \n     \n     \n    (0.042) \n  \n\n hedqul92O Level or Equivalent \n     \n     \n    −0.003 \n  \n\n  \n     \n     \n    (0.035) \n  \n\n hedqul92Some Higher Education \n     \n     \n    −0.035 \n  \n\n  \n     \n     \n    (0.034) \n  \n\n hhincq9220,000 or More \n     \n     \n    −0.001 \n  \n\n  \n     \n     \n    (0.026) \n  \n\n hhincq925999 or Less \n     \n     \n    0.001 \n  \n\n  \n     \n     \n    (0.033) \n  \n\n hhincq926000-11,999 \n     \n     \n    0.065 \n  \n\n  \n     \n     \n    (0.029) \n  \n\n hhincq92Not answered \n     \n     \n    0.003 \n  \n\n  \n     \n     \n    (0.039) \n  \n\n ragect9225-34 \n     \n     \n    −0.103 \n  \n\n  \n     \n     \n    (0.045) \n  \n\n ragect9235-44 \n     \n     \n    −0.068 \n  \n\n  \n     \n     \n    (0.045) \n  \n\n ragect9245-54 \n     \n     \n    −0.102 \n  \n\n  \n     \n     \n    (0.046) \n  \n\n ragect9255-59 \n     \n     \n    −0.155 \n  \n\n  \n     \n     \n    (0.053) \n  \n\n ragect9260-64 \n     \n     \n    −0.082 \n  \n\n  \n     \n     \n    (0.054) \n  \n\n ragect9265+ \n     \n     \n    −0.113 \n  \n\n  \n     \n     \n    (0.050) \n  \n\n ragect92Not Given \n     \n     \n    −0.168 \n  \n\n  \n     \n     \n    (0.088) \n  \n\n rsex92 \n     \n     \n    −0.026 \n  \n\n  \n     \n     \n    (0.022) \n  \n\n region2 \n     \n     \n    0.020 \n  \n\n  \n     \n     \n    (0.051) \n  \n\n region3 \n     \n     \n    −0.029 \n  \n\n  \n     \n     \n    (0.053) \n  \n\n region4 \n     \n     \n    −0.043 \n  \n\n  \n     \n     \n    (0.053) \n  \n\n region5 \n     \n     \n    0.021 \n  \n\n  \n     \n     \n    (0.054) \n  \n\n region6 \n     \n     \n    0.069 \n  \n\n  \n     \n     \n    (0.066) \n  \n\n region7 \n     \n     \n    −0.077 \n  \n\n  \n     \n     \n    (0.053) \n  \n\n region8 \n     \n     \n    −0.029 \n  \n\n  \n     \n     \n    (0.048) \n  \n\n region9 \n     \n     \n    −0.027 \n  \n\n  \n     \n     \n    (0.055) \n  \n\n region10 \n     \n     \n    −0.025 \n  \n\n  \n     \n     \n    (0.063) \n  \n\n region11 \n     \n     \n    −0.028 \n  \n\n  \n     \n     \n    (0.046) \n  \n\n occupation2 \n     \n     \n    0.003 \n  \n\n  \n     \n     \n    (0.077) \n  \n\n occupation3 \n     \n     \n    −0.024 \n  \n\n  \n     \n     \n    (0.086) \n  \n\n occupation4 \n     \n     \n    0.007 \n  \n\n  \n     \n     \n    (0.074) \n  \n\n occupation5 \n     \n     \n    0.037 \n  \n\n  \n     \n     \n    (0.084) \n  \n\n occupation6 \n     \n     \n    0.002 \n  \n\n  \n     \n     \n    (0.075) \n  \n\n occupation7 \n     \n     \n    −0.110 \n  \n\n  \n     \n     \n    (0.084) \n  \n\n Num.Obs. \n    1593 \n    1593 \n    1593 \n  \n\n R2 \n    0.000 \n    0.423 \n    0.512 \n  \n\n R2 Adj. \n    0.000 \n    0.422 \n    0.495 \n  \n\n AIC \n    2300.8 \n    1429.8 \n    1265.2 \n  \n\n BIC \n    2311.6 \n    1451.3 \n    1560.8 \n  \n\n Log.Lik. \n    −1148.411 \n    −710.907 \n    −577.622 \n  \n\n F \n     \n    581.949 \n    30.417 \n  \n\n RMSE \n    0.50 \n    0.38 \n    0.35 \n  \n\n\n\n\n\n\n\n\nQuestions\n\nHow would you interpret the main variables interest (reading a switching newspaper, and having voted Labour in 1992) in the context of this larger model?\nLook at a few other explanatory variables and briefly interpret the meaning of their coefficients.\n\n\nLogit and Probit\nWhile a linear probability model can be a quick way to check relationships in the data, the linear model function is not appropriate for modelling binary outcomes. One reason is that predictions from the linear model can fall outside the 0-1 probability boundary, which is meaningless (we’ll look at what this means, visually, below).\nTwo models are more appropriate in such contexts: logistic regression (or logit model) and probit models. The authors of the paper use a probit model, and we’ll fit both a probit and a logit so we can compare their results. In R we can fit these types of generalised linear models using the glm() function by specifying a family of distributions and the linear link to be used. Both logit and probit are part of the broader binomial family, and the default link function for this family in glm() is “logit”, so we don’t need to specify that information when fitting a logistic model:\n\n# A \"probit\" model that reproduces Table 1A results\nm1_probit <- glm(vote_l_97 ~ ., family = binomial(link = \"probit\"), data = news)\n\n# A \"logit\" model alternative\nm1_logit <- glm(vote_l_97 ~ ., family = binomial, data = news)\n\nBefore examining the models more closely, let’s visualise predictions from the linear probability, probit and logit model to demonstrate the prediction boundary problem posed by the linear probability model. The three graphs below plot the estimated probabilities for voting Labour in 1997 (vote_l_97) based on the respondents’ “leftist” ideological orientation in 1992 (ideo92) alongside the model-specific regression line (the reason for choosing the ideo92 variable is simply that it’s a numeric variable that is easier to visualise here). The code to produce the plots is not the focus here, but you can unfold the code command to check it if interested:\n\nCode# Extract predicted values as data-frames using the marginaleffects::predictions function\nlpm_effects <- marginaleffects::predictions(m1_lpm)\nlogit_effects <- marginaleffects::predictions(m1_logit)\nprobit_effects <- marginaleffects::predictions(m1_probit)\n\n# A package to combine plots together\nlibrary(patchwork)\n\n# lpm\nplot1 <- lpm_effects |> \n  ggplot(aes(\n    y = estimate, \n    x = ideo92)) +\n  geom_point(alpha = 0.1) +\n  stat_smooth(method=\"lm\", color=\"red\", se=FALSE) +\n  labs(\n    x = \"\",\n    y = \"Estimated probability of voting Labour in 1997\"\n  ) +\n  ylim(-0.25, 1.25) +\n  ggtitle(\"Linear probability model\")\n\n# probit\nplot2 <- probit_effects |> \n  ggplot(aes(\n  y = estimate, \n  x = ideo92)) +\n  geom_point(alpha = 0.1) +\n  stat_smooth(method=\"glm\", color=\"blue\", linetype=1, se=FALSE,\n                method.args = list(family=binomial)) +\n  stat_smooth(method=\"lm\", color=\"red\", se=FALSE) +\n    labs(\n    x = \"Leftist political ideology (in 1992)\",\n    y = \"\"\n  ) +\n  ylim(-0.25, 1.25) +\n  ggtitle(\"Probit model\")\n\n# logit\nplot3 <- logit_effects |> \n  ggplot(aes(\n  y = estimate, \n  x = ideo92)) +\n  geom_point(alpha = 0.1) +\n  stat_smooth(method=\"glm\", color=\"blue\", linetype=1, se=FALSE,\n                method.args = list(family=binomial)) +\n  stat_smooth(method=\"lm\", color=\"red\", se=FALSE) +\n    labs(\n    x = \"\",\n    y = \"\"\n  ) +\n  ylim(-0.25, 1.25) +\n  ggtitle(\"Logit model\")\n\n# Combine the three plots\nplot1 + plot2 + plot3\n\n\n\n\nWe can see from the plots that the predictions regression curve from the probit and logit models are nearly identical and are both constrained between the probability limits of 0 and 1, whereas predictions from the linear model can fall outside these limits, even though to speak of probabilities below 0 and above 1 is meaningless, so our estimates are biased. The graphs highlight well the difference between logit/probit and linear regression.\nWe can now compare our three models in more detail using the modelsummary::modelsummary() function. This time, we will first make a list() object in which we give more meaningful titles to our models, and add a further specification to the modelsummary command to request that the coefficients be renamed using their variable labels (which we conveniently already have in this dataset, as we imported it from Stata format):\n\nmodels <- list(\"OLS\" = m1_lpm,\n               \"Probit\" = m1_probit,\n               \"Logit\" = m1_logit)\nmodelsummary::modelsummary(models, coef_rename = TRUE)\n\n\n\n\n\n\n\nThis is what the table should look like:\n\n\n\n\n\n\n\n\n\n\n   \n    OLS \n    Probit \n    Logit \n  \n\n\n (Intercept) \n    0.485 \n    −0.209 \n    −0.592 \n  \n\n  \n    (0.163) \n    (0.845) \n    (1.585) \n  \n\n Treatment (read switching paper) \n    0.105 \n    0.484 \n    0.840 \n  \n\n  \n    (0.028) \n    (0.127) \n    (0.225) \n  \n\n Prior Conservative Identification \n    −0.003 \n    0.039 \n    0.076 \n  \n\n  \n    (0.043) \n    (0.182) \n    (0.320) \n  \n\n Prior Labour Identification \n    0.156 \n    0.569 \n    1.016 \n  \n\n  \n    (0.042) \n    (0.178) \n    (0.313) \n  \n\n Prior Liberal Identification \n    0.083 \n    0.368 \n    0.657 \n  \n\n  \n    (0.044) \n    (0.181) \n    (0.316) \n  \n\n White \n    −0.193 \n    −0.835 \n    −1.653 \n  \n\n  \n    (0.063) \n    (0.319) \n    (0.567) \n  \n\n Working-Class \n    0.041 \n    0.160 \n    0.300 \n  \n\n  \n    (0.022) \n    (0.097) \n    (0.173) \n  \n\n Parents Voted Labour \n    0.032 \n    0.121 \n    0.238 \n  \n\n  \n    (0.021) \n    (0.094) \n    (0.168) \n  \n\n Prior Ideological Moderation \n    0.010 \n    0.526 \n    0.985 \n  \n\n  \n    (0.080) \n    (0.452) \n    (0.873) \n  \n\n Prior Labour Vote \n    0.280 \n    0.879 \n    1.496 \n  \n\n  \n    (0.046) \n    (0.188) \n    (0.330) \n  \n\n Prior Conservative Vote \n    −0.079 \n    −0.251 \n    −0.436 \n  \n\n  \n    (0.048) \n    (0.202) \n    (0.353) \n  \n\n Prior Liberal Vote \n    −0.076 \n    −0.273 \n    −0.502 \n  \n\n  \n    (0.047) \n    (0.190) \n    (0.329) \n  \n\n Prior Labour Party Support \n    0.224 \n    0.878 \n    1.612 \n  \n\n  \n    (0.054) \n    (0.238) \n    (0.423) \n  \n\n Prior Conservative Party Support \n    −0.059 \n    −0.334 \n    −0.455 \n  \n\n  \n    (0.052) \n    (0.233) \n    (0.417) \n  \n\n Prior Political Knowledge [low] \n    0.085 \n    0.367 \n    0.710 \n  \n\n  \n    (0.030) \n    (0.136) \n    (0.244) \n  \n\n Prior Political Knowledge [medium] \n    0.026 \n    0.112 \n    0.233 \n  \n\n  \n    (0.022) \n    (0.101) \n    (0.183) \n  \n\n Prior Television Viewer \n    0.024 \n    0.141 \n    0.266 \n  \n\n  \n    (0.021) \n    (0.096) \n    (0.174) \n  \n\n Prior Daily Newspaper Reader \n    −0.030 \n    −0.178 \n    −0.314 \n  \n\n  \n    (0.021) \n    (0.096) \n    (0.173) \n  \n\n Prior Ideology \n    0.134 \n    1.076 \n    2.044 \n  \n\n  \n    (0.105) \n    (0.599) \n    (1.170) \n  \n\n Authoritarianism \n    −0.002 \n    −0.157 \n    −0.077 \n  \n\n  \n    (0.077) \n    (0.360) \n    (0.647) \n  \n\n Prior Trade Union Member \n    0.007 \n    0.051 \n    0.088 \n  \n\n  \n    (0.023) \n    (0.104) \n    (0.188) \n  \n\n Prior Coping Mortgage \n    −0.051 \n    −0.202 \n    −0.364 \n  \n\n  \n    (0.019) \n    (0.087) \n    (0.154) \n  \n\n Prior Education [College Degree] \n    −0.006 \n    −0.016 \n    0.038 \n  \n\n  \n    (0.034) \n    (0.157) \n    (0.282) \n  \n\n Prior Education [Less than O level (or foreign qual)] \n    −0.028 \n    −0.093 \n    −0.119 \n  \n\n  \n    (0.042) \n    (0.194) \n    (0.348) \n  \n\n Prior Education [O Level or Equivalent] \n    −0.003 \n    0.019 \n    0.049 \n  \n\n  \n    (0.035) \n    (0.160) \n    (0.286) \n  \n\n Prior Education [Some Higher Education] \n    −0.035 \n    −0.133 \n    −0.233 \n  \n\n  \n    (0.034) \n    (0.155) \n    (0.279) \n  \n\n Prior Income [20,000 or More] \n    −0.001 \n    0.019 \n    0.019 \n  \n\n  \n    (0.026) \n    (0.117) \n    (0.211) \n  \n\n Prior Income [5999 or Less] \n    0.001 \n    0.065 \n    0.041 \n  \n\n  \n    (0.033) \n    (0.153) \n    (0.277) \n  \n\n Prior Income [6000-11,999] \n    0.065 \n    0.356 \n    0.611 \n  \n\n  \n    (0.029) \n    (0.132) \n    (0.235) \n  \n\n Prior Income [Not answered] \n    0.003 \n    0.007 \n    0.072 \n  \n\n  \n    (0.039) \n    (0.177) \n    (0.315) \n  \n\n Prior Age [25-34] \n    −0.103 \n    −0.416 \n    −0.701 \n  \n\n  \n    (0.045) \n    (0.199) \n    (0.349) \n  \n\n Prior Age [35-44] \n    −0.068 \n    −0.274 \n    −0.456 \n  \n\n  \n    (0.045) \n    (0.198) \n    (0.348) \n  \n\n Prior Age [45-54] \n    −0.102 \n    −0.424 \n    −0.764 \n  \n\n  \n    (0.046) \n    (0.206) \n    (0.364) \n  \n\n Prior Age [55-59] \n    −0.155 \n    −0.743 \n    −1.283 \n  \n\n  \n    (0.053) \n    (0.243) \n    (0.435) \n  \n\n Prior Age [60-64] \n    −0.082 \n    −0.351 \n    −0.505 \n  \n\n  \n    (0.054) \n    (0.247) \n    (0.438) \n  \n\n Prior Age [65+] \n    −0.113 \n    −0.544 \n    −0.904 \n  \n\n  \n    (0.050) \n    (0.231) \n    (0.410) \n  \n\n Prior Age [Not Given] \n    −0.168 \n    −0.718 \n    −1.273 \n  \n\n  \n    (0.088) \n    (0.422) \n    (0.779) \n  \n\n Male \n    −0.026 \n    −0.130 \n    −0.253 \n  \n\n  \n    (0.022) \n    (0.101) \n    (0.183) \n  \n\n North West \n    0.020 \n    0.054 \n    0.072 \n  \n\n  \n    (0.051) \n    (0.250) \n    (0.451) \n  \n\n Yorks \n    −0.029 \n    −0.147 \n    −0.141 \n  \n\n  \n    (0.053) \n    (0.260) \n    (0.472) \n  \n\n West Midlands \n    −0.043 \n    −0.215 \n    −0.359 \n  \n\n  \n    (0.053) \n    (0.255) \n    (0.463) \n  \n\n East Midlands \n    0.021 \n    0.057 \n    0.063 \n  \n\n  \n    (0.054) \n    (0.260) \n    (0.470) \n  \n\n East Anglia \n    0.069 \n    0.305 \n    0.498 \n  \n\n  \n    (0.066) \n    (0.310) \n    (0.553) \n  \n\n SW England \n    −0.077 \n    −0.390 \n    −0.712 \n  \n\n  \n    (0.053) \n    (0.260) \n    (0.475) \n  \n\n SE England \n    −0.029 \n    −0.153 \n    −0.226 \n  \n\n  \n    (0.048) \n    (0.234) \n    (0.426) \n  \n\n Greater London \n    −0.027 \n    −0.194 \n    −0.260 \n  \n\n  \n    (0.055) \n    (0.269) \n    (0.488) \n  \n\n Wales \n    −0.025 \n    −0.239 \n    −0.308 \n  \n\n  \n    (0.063) \n    (0.299) \n    (0.543) \n  \n\n Scotland \n    −0.028 \n    −0.177 \n    −0.267 \n  \n\n  \n    (0.046) \n    (0.227) \n    (0.412) \n  \n\n Profession: Large Employer \n    0.003 \n    −0.078 \n    −0.079 \n  \n\n  \n    (0.077) \n    (0.368) \n    (0.655) \n  \n\n Profession: Small Employer \n    −0.024 \n    −0.209 \n    −0.388 \n  \n\n  \n    (0.086) \n    (0.413) \n    (0.742) \n  \n\n Profession: Self Employed \n    0.007 \n    −0.050 \n    −0.082 \n  \n\n  \n    (0.074) \n    (0.352) \n    (0.626) \n  \n\n Profession: Employee \n    0.037 \n    0.066 \n    0.143 \n  \n\n  \n    (0.084) \n    (0.395) \n    (0.699) \n  \n\n Profession: Temporary Worker \n    0.002 \n    −0.066 \n    −0.100 \n  \n\n  \n    (0.075) \n    (0.357) \n    (0.634) \n  \n\n Profession: Junior \n    −0.110 \n    −0.603 \n    −1.022 \n  \n\n  \n    (0.084) \n    (0.403) \n    (0.723) \n  \n\n Num.Obs. \n    1593 \n    1593 \n    1593 \n  \n\n R2 \n    0.512 \n     \n     \n  \n\n R2 Adj. \n    0.495 \n     \n     \n  \n\n AIC \n    1265.2 \n    1317.3 \n    1315.6 \n  \n\n BIC \n    1560.8 \n    1607.5 \n    1605.8 \n  \n\n Log.Lik. \n    −577.622 \n    −604.647 \n    −603.825 \n  \n\n F \n    30.417 \n    11.497 \n    8.800 \n  \n\n RMSE \n    0.35 \n    0.34 \n    0.34 \n  \n\n\n\n\n\n\n\n\n\nCompare the coefficients from our probit model to that presented in the first column of Table 1A in the article.\nHow do our three models compare? There are some easy rules-of-thumb for roughly converting the values of coefficients between the linear, logit and probit models. Try the following:\n\n\nDivide the probit coefficients by 1.6 to get an approximation of the logit coefficients\nDivide the logit coefficients by 4 to get an approximation of linear coefficients\nMultiply probit coefficients by 0.4 to get an approximation of linear coefficients\n\nThese rules are not very precise in the case of this large model, but try them out on results from smaller models too."
  },
  {
    "objectID": "materials/worksheets/worksheets_w04.html#exercise-2-predicting-the-frequency-of-social-interaction-using-poisson",
    "href": "materials/worksheets/worksheets_w04.html#exercise-2-predicting-the-frequency-of-social-interaction-using-poisson",
    "title": "Week 4 Computer Lab Worksheet",
    "section": "Exercise 2: Predicting the Frequency of Social Interaction using Poisson",
    "text": "Exercise 2: Predicting the Frequency of Social Interaction using Poisson\nFor this exercise, we will use data from Weiss et al. (2021) to reproduce the results they present in the first column of their Supplementary Table S6. The article can be accessed online here; the table of interest is on page 78, and a brief description of the results is offered on page 32:\n\n… we explored the possibility that trait differences may be associated with the frequency of social interactions … reported during the ESM [experience-sampling method] period. To this end, we conducted … Poisson regression analyses, controlling for overall participant response rate. As Supplementary Table S6 shows, only few measures predicted such possible selection effects: the overall number of social interactions was reliably predicted only by zero-sum beliefs, such that individuals holding more negative views regarding the antagonistic nature of social interactions and relationships reported a lower number of overall interactions.\n\nRead the article to get a better understanding of the data. For our immediate purposes, it’s enough to know that we want to regress a measurement of the number of social interactions on a set of demographic and dispositional predictors.\nWe can load the dataset called EverydayTrust.Rds from the course website. Note that if we are reading in data files from an internet source with the readRDS() function, we need to call it using the url() function. For simplicity, we’ll abreviate the name of the data object as et.\n\net <- readRDS(url(\"https://cgmoreh.github.io/HSS8005-data/EverydayTrust.Rds\"))\n\nHave a look at the dataset. Note that variables with the suffix “_c” are grand-mean-centred versions of the base variables; the Gender variable has a dummy-coded version and a so-called simple effect-coded version (Gender_e, coded as -1 for Female and 1 for Male), with which the intercept estimates the grand mean rather than the value 0. The authors use this latter version when fitting the model. The variable names are similar to the labels that appear in Table S6, so they are easy to identify.\n\n\nFit the Poisson model using the variables shown in Table S6 using the formula skeleton below:\n\n\n... <- glm(... ~ ..., family=\"poisson\", data = ...)\n\n\nCheck the model summary and compare the results with those in Table S6\n\n\n\n\n\nCall:\nglm(formula = Number_of_Social_Interactions ~ Gender_e + Age_c + \n    Political_Orientation_c + Religiosity_c + Trust_Scale_c + \n    Distrust_c + Moral_Identity_c + Zero_Sum_Belief_c + Social_Value_Orienation_c + \n    Signal_Response_Rate, family = \"poisson\", data = et)\n\nDeviance Residuals: \n    Min       1Q   Median       3Q      Max  \n-4.2050  -0.7750  -0.0146   0.7306   3.0191  \n\nCoefficients:\n                           Estimate Std. Error z value Pr(>|z|)    \n(Intercept)                1.017743   0.075622  13.458  < 2e-16 ***\nGender_e                  -0.024457   0.017591  -1.390 0.164427    \nAge_c                      0.002571   0.001550   1.659 0.097146 .  \nPolitical_Orientation_c   -0.004787   0.008308  -0.576 0.564502    \nReligiosity_c              0.014473   0.017348   0.834 0.404112    \nTrust_Scale_c              0.016144   0.021150   0.763 0.445264    \nDistrust_c                -0.037711   0.019505  -1.933 0.053187 .  \nMoral_Identity_c          -0.020356   0.017854  -1.140 0.254216    \nZero_Sum_Belief_c         -0.068133   0.018167  -3.750 0.000177 ***\nSocial_Value_Orienation_c  0.002162   0.001484   1.456 0.145263    \nSignal_Response_Rate       0.068348   0.003474  19.675  < 2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n    Null deviance: 1144.66  on 397  degrees of freedom\nResidual deviance:  602.97  on 387  degrees of freedom\n  (29 observations deleted due to missingness)\nAIC: 2272.9\n\nNumber of Fisher Scoring iterations: 4"
  },
  {
    "objectID": "materials/worksheets/worksheets_w05.html",
    "href": "materials/worksheets/worksheets_w05.html",
    "title": "Week 5 Computer Lab Worksheet",
    "section": "",
    "text": "References\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press."
  },
  {
    "objectID": "materials/worksheets/worksheets_w06.html",
    "href": "materials/worksheets/worksheets_w06.html",
    "title": "Week 6 Computer Lab Worksheet",
    "section": "",
    "text": "learn using the lmer() function from the lme4 package to fit multilevel models with two or more levels of nesting\ndealing with survey weights in R"
  },
  {
    "objectID": "materials/worksheets/worksheets_w06.html#exercise-1-economic-and-cultural-drivers-of-immigrant-support",
    "href": "materials/worksheets/worksheets_w06.html#exercise-1-economic-and-cultural-drivers-of-immigrant-support",
    "title": "Week 6 Computer Lab Worksheet",
    "section": "Exercise 1: Economic and Cultural Drivers of Immigrant Support",
    "text": "Exercise 1: Economic and Cultural Drivers of Immigrant Support\nIn this exercise we will replicate a model presented in Table 4 of Valentino et al. (2019) (first column, titled “ALL”), and extend its multilevel framework. Valentino et al. (2019) explore the causal determinants of public opposition to immigration in 11 countries using a series of survey treatments relative to type of employment, racial composition, family status, and country of origin. The data consists of repeated measures taken from the same survey respondents (subjects), and the authors employ “mixed models” (their term) to account for multiple observations per subject, as these are likely to be correlated given that they belong to the same respondent.\nSo, we have clustering at two levels: at the individual level and at a country level. The models fitted by the authors take into considerations and adjust for the individual-level clustering, but not the clustering at the country level. In the model that we will first replicate, the authors pool together respondents from all the countries (see the models named ALL in the regression tables), while also including separate regression models for each country. The “ALL” model allows the identification of global trends that persist regardless of country-specific features; however, it does not account for the fact that respondents - and therefore the errors - are also affected by country-level influences. The second part of the exercise consists of fitting a multilevel model that accounts for both clustering factors. The multilevel model contains information that summarises both the “ALL” and the various separately run country-level regression models.\n1.1. Two-level random intercepts model\nFirst, we will load the original dataset and perform a number of data-cleansing tasks to set up the data in a similar format to that used by the authors:\n\n# Load required packages\npacman::p_load(tidyverse, lme4, sjlabelled)\n\n# Import the dataset\nval <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005/data/Valentino17.dta\")\n\n\n\n\n\n# Filter out cases\nval <- val |> \n  filter(!is.na(uid)) |>      # remove rows with no uid\n  filter(!is.na(support))     # remove rows where dependent variable is not provided\n\n# Create country codes to match to country name key\ncountryt_key <- data.frame(countryt = c(\"AU\", \"CA\", \"CH\", \"DK\", \"ES\", \"FR\", \"JP\",\n                                        \"KR\", \"NO\", \"UK\", \"US\"),\n                           country_name = c(\"Australia\", \"Canada\", \"Switzerland\",\n                                            \"Denmark\", \"Spain\", \"France\", \"Japan\", \n                                            \"S.Korea\", \"Norway\", \"UK\", \"USA\"))\n\n# Creating new variables\nval <- val |> \n  mutate(candf = factor(cand),     # create factor versions of the treatment variables  \n         Job_f = factor(treat_hstat, levels = c(0, 1), labels = get_labels(treat_hstat)),\n         Complexion_f = factor(treat_drk, levels = c(0, 1), labels = get_labels(treat_drk)),\n         ME_f = factor(treat_nat_me, levels = c(0, 1), labels = get_labels(treat_nat_me)),\n         Children_f = factor(treat_kids, levels = c(0, 1), labels = get_labels(treat_kids)),\n         Gender_f = factor(treat_gender, levels = c(0, 1), labels = get_labels(treat_gender)))\n\n# Recode gender variable to ease interpretation, and add country name\nval <- val |> \n  mutate(female = factor(gender, levels = c(1,0), labels = c(\"female\", \"male\"))) |> \n  left_join(countryt_key, by = \"countryt\")\n\n# Set other categorical variables as factors\n\n## It's useful to first make a list of the variable names:\nfact_vars <- c(\"id\", 'uid', 'country', \"countryt\", 'cand', 'insample', \n               'gender', 'education2', 'occupation2', \"inc2\", 'inc3',\n               'treat_hstat', 'treat_drk', 'treat_nat_me', 'treat_kids', 'treat_gender')\n\n#  Then recode those listed variables as factor variables; \n# and we can also filter the data to treat_gender equals 0 to follow what the authors did; this should leave us with 37,579 cases/respondents in the dataset, the total N that appears in Table 4, Model \"ALL\"\nval_mod <- val |> \n  mutate_at(fact_vars, factor) |> \n  filter(treat_gender == 0)\n\n\nTo inspect and better understand the dataset, apply some of the summary functions we practised in previous weeks to this data (e.g. using modelsummary, gtsummary, summarytools or something else).\n\n...\n\n\nWe are now ready to fit a multilevel model with random/varying intercepts for each respondent to account for intra-class correlation within individuals:\n\n# Varying-intercept model with an individual-level predictors\n## Same model as ALL in Table 4, but with REML = T\nm_all <- lmer(support ~ Job_f + Children_f + ME_f + Complexion_f + candf + \n                (1|uid),                         # This is where we add the clustering variable coding individual ID numbers\n                 data = val_mod, \n                 weights = wt_country, REML = T) # Some additional settings to emulate results from models fit in Stata\n\n1.2. Fully nested model: repeated obs nested within countries\nWe can now add an additional level of clustering, including countryt as an additional level:\n\nm_ct <- lmer(support ~ Job_f + Children_f + ME_f + Complexion_f + candf + \n                    (1 | countryt / uid),         # This is where we add additional levels\n                  data = val_mod, \n                  weights = wt_country, REML = T) # Some additional settings to emulate results from models fit in Stata\n\nWe cna now tabulate the results and compare them. modelsummary::modelsummary() is again a good option, but sjPlot::tab_model() includes some additional transformations to the random effects components out of the box that can be rather useful for a quick glance (we can have more control over those computations and calculate them and include them manually in another table format):\n\n# Modelsummary table\nmodelsummary::modelsummary(list(m_all, m_ct), stars = TRUE)\n\n\n\n\n   \n     (1) \n      (2) \n  \n\n\n (Intercept) \n    0.523*** \n    0.528*** \n  \n\n  \n    (0.004) \n    (0.021) \n  \n\n Job_fHigh status \n    0.120*** \n    0.120*** \n  \n\n  \n    (0.004) \n    (0.004) \n  \n\n Children_fChildren \n    −0.002 \n    −0.002 \n  \n\n  \n    (0.002) \n    (0.002) \n  \n\n ME_fME \n    −0.020*** \n    −0.020*** \n  \n\n  \n    (0.002) \n    (0.002) \n  \n\n Complexion_fDark treatment \n    0.001 \n    0.002 \n  \n\n  \n    (0.004) \n    (0.004) \n  \n\n candf2 \n    −0.038*** \n    −0.038*** \n  \n\n  \n    (0.002) \n    (0.002) \n  \n\n SD (Observations) \n    0.137 \n    0.137 \n  \n\n SD (Intercept uid) \n    0.287 \n     \n  \n\n SD (Intercept uidcountryt) \n     \n    0.279 \n  \n\n SD (Intercept countryt) \n     \n    0.067 \n  \n\n Num.Obs. \n    37579 \n    37579 \n  \n\n R2 Marg. \n    0.039 \n    0.039 \n  \n\n R2 Cond. \n    0.822 \n    0.822 \n  \n\n AIC \n    7392.9 \n    6443.2 \n  \n\n BIC \n    7461.1 \n    6520.0 \n  \n\n ICC \n    0.8 \n    0.8 \n  \n\n RMSE \n    0.11 \n    0.11 \n  \n\n\n + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001\n\n\n# SJplot table\nsjPlot::tab_model(m_all, m_ct, digits = 3)\n\n\n\n\n \nDV:Support\nDV:Support\n\n\nPredictors\nEstimates\nCI\np\nEstimates\nCI\np\n\n\n(Intercept)\n0.523\n0.515 – 0.532\n<0.001\n0.528\n0.488 – 0.569\n<0.001\n\n\nJob f: High status\n0.120\n0.112 – 0.129\n<0.001\n0.120\n0.112 – 0.129\n<0.001\n\n\nChildren f: Children\n-0.002\n-0.006 – 0.002\n0.253\n-0.002\n-0.006 – 0.002\n0.255\n\n\nME f: ME\n-0.020\n-0.023 – -0.017\n<0.001\n-0.020\n-0.023 – -0.017\n<0.001\n\n\nComplexion f: Darktreatment\n0.001\n-0.007 – 0.010\n0.748\n0.002\n-0.006 – 0.011\n0.592\n\n\ncandf: candf 2\n-0.038\n-0.041 – -0.035\n<0.001\n-0.038\n-0.041 – -0.035\n<0.001\n\n\nRandom Effects\n\n\nσ2\n\n0.02\n0.02\n\n\nτ00\n\n0.08 uid\n\n0.08 uid:countryt\n\n\n\n\n\n \n0.00 countryt\n\n\n\n\nICC\n0.81\n0.81\n\n\n\nN\n19734 uid\n\n19734 uid\n\n\n\n\n\n \n11 countryt\n\n\n\nObservations\n37579\n37579\n\n\nMarginal R2 / Conditional R2\n\n0.039 / 0.822\n0.039 / 0.822"
  },
  {
    "objectID": "materials/worksheets/worksheets_w06.html#exercise-2-return-to-osterman",
    "href": "materials/worksheets/worksheets_w06.html#exercise-2-return-to-osterman",
    "title": "Week 6 Computer Lab Worksheet",
    "section": "Exercise 2: Return to Osterman",
    "text": "Exercise 2: Return to Osterman\nAt the end of Worksheet 3 we fitted the full Model 1 reported in Table 3 of Österman (2021) (see also Table A.3. in their Appendix for a more complete reporting on the models). Comparing the model we fitted to the one reported by the author, our results were very close, but not totally equivalent on several coefficients and standard errors. The main reasons for the divergence had to do with two aspects of the published model that we had disregarded: (1) we did not include survey weights to correct for sampling errors, and (2) we did not allow for intra-group correlation in the standard errors among respondents from the same country (and the same age cohort). We will first implement these two additional steps and compare our final results again to those reported in Österman (2021). Then, we will refit the model in a multilevel framework.\nRefitting Model 1\nAs a quick reminder, let’s refit the model as we have done on Worksheet 3.\n\n\n\n\n\n\nTo be noted\n\n\n\nThere was a mistake in the code used to fit the model on Worksheet 3. This has now been corrected there and below.\n\n\nIn the footnotes to Table 3, the author tells us that in addition to what is reported in the table,\n\nAll models include reform FEs [Fixed Effects], a general quadratic birth year trend, and reform-specific trends for birth year (linear) and age (quadratic). Additional controls: foreign-born parents and ESS round dummies.\n\nThe version of the model summary presented in Table A.3 in the Appendix also includes the additional controls, but not the other terms.\nIn the main text, the author further explains some of the reasoning behind their modelling choices:\n\nOne dilemma for the design is that there has been a trend of increasing educational attainment throughout the studied time period, which means that the reform-windows of treated and non-treated cohorts will also pick up the effects of this trend. To counter this, [the list of covariates] includes a general quadratic birth year trend, reform-specific linear controls for birth year and reform-specific quadratic age trends. The quadratic terms are included to allow sufficient flexibility in absorbing possible non-linear trends of increasing education within the reform-window of seven treated and seven untreated birth year cohorts. … The reform-fixed effects are also essential because they imply that only the within-reform-window variation is used to estimate the effects and between-reform differences are factored out, such as pre-reform differences in social trust. (Österman 2021, 221–22)\n\nBefore we fit the model, some of the concepts in the quotation need unpacking. A quadratic term is a second-degree polynomial term - put simply, it’s the square of the variable concerned. The quadratic of a variable such as \\(age\\) is therefore \\(age^2\\), or \\(age \\times age\\). In other words, it is like a variable’s “interaction” with itself. Because of this, there are several ways in which the quadratic terms to be included in a model can be specified:\n\nWe could create the quadratic terms as new variables, and include those in the model, as below:\n\n\n# First, load the original dataset\nosterman <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005/data/osterman.dta\")\n\n# Create new quadratic variables by multiplication with themselves and add them to the dataset, saving it as a new data object\nosterman2 <- osterman |> \n  mutate(agea_quad = agea*agea,        # quadratic age variable\n         yrbrn_quad = yrbrn*yrbrn)     # quadratic birth-year variable\n\n# We now have two additional variables in the dataset; we can fit the model using those:\nm1_prequad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3\n                   fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +     # additional controls for foreign-born parents and ESS Round dummies\n                   agea + yrbrn + agea_quad + yrbrn_quad +                              # general quadratic birth year trend and quadratic age\n                   factor(reform_id_num) +                                              # reform fixed effects dummies\n                   yrbrn:factor(reform_id_num) +                                        # reform-specific birth year trend\n                   agea:factor(reform_id_num) +  agea_quad:factor(reform_id_num),       # reform-specific quadratic age trend\n               data = osterman2)                                                        # the new expanded dataset\n\n\nWe can get the same results by creating the quadratic terms directly as part of the modelling function. The one thing we should keep in mind is that if we want to include standard mathematical operations within a formula function, we need to isolate or insulate the operation from R’s formula parsing code using the I() function, which returns the contents of I(...) “as.is”. The model formula would then be:\n\n\nm1_funquad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3\n                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +       # additional controls for foreign-born parents and ESS Round dummies\n                 agea + yrbrn + I(agea^2) + I(yrbrn^2) +                                # general quadratic birth year trend and quadratic age\n                 factor(reform_id_num) +                                                # reform fixed effects dummies\n                 yrbrn:factor(reform_id_num) +                                          # reform-specific birth year trend\n                 agea:factor(reform_id_num) +  I(agea^2):factor(reform_id_num),         # reform-specific quadratic age trend\n              data = osterman)                                                          # the original dataset\n\n\nIn the two previous options, the quadratic terms will be correlated with the original variables. To avoid this by relying on so-called orthogonal polynomials we should use the poly() function. We can also fit the same correlated polynomial model as the ones above by adding the raw = TRUE option to the poly() function. In the code below, we will fit the correlated version first, then the orthogonal version (This stackoverflow discussion explains in more detail the difference between the two options):\n\n\nm1_polyraw <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + \n                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + \n                 poly(agea, 2, raw = TRUE) + poly(yrbrn, 2, raw = TRUE) +\n                 factor(reform_id_num) +           \n                 yrbrn:factor(reform_id_num) + \n                 agea:factor(reform_id_num) + poly(agea, 2, raw = TRUE):factor(reform_id_num),\n               data = osterman)\n\nm1_polyorth <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + \n                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + \n                 poly(agea, 2) + poly(yrbrn, 2) +\n                 factor(reform_id_num) +           \n                 yrbrn:factor(reform_id_num) + \n                 agea:factor(reform_id_num) + poly(agea, 2):factor(reform_id_num),\n               data = osterman)\n\nIt’s worth noting, however, that the Stata routine used by the author fitted correlated/raw coded polynomials, so the orthogonal version below is just for a comparison and we will not use it going forward. For a side-by-side overview comparison of the models we fitted so far, we could use a model summary tabulation function (e.g. sjPlot::tab_model() or modelsummary::modelsummary(), which we are already familiar with from previous lab exercises; some other popular options include stargazer::stargazer() and jtools::export_summs(); you can read through the function documentation and test them out). Below we will use modelsummary():\n\n# It's cleaner to first make a list of the models we want to summarise; we can even name them:\nmodels <- list(\n  \"Pre-calculated quadratic\" = m1_prequad,\n  \"Within-function quadratic\" = m1_funquad,\n  \"poly() with raw coding\" = m1_polyraw,\n  \"poly() with default orthogonal coding\" = m1_polyorth\n)\n\n# modelsummary table with stars for p-values added\nmodelsummary::modelsummary(models, stars = TRUE)\n\nThe results from the modelsummary() are not shown here because it’s a long and ugly table, but it’s useful for perusing to compare the results across the different models. For a cleaner table showing only the results included in Table A.3 in the Appendix to Österman (2021), we can use the coef_map or the coef_omit option in modelsummary() and only include m1_funquad, which will be the polynomial fitting routine that we will use going forward:\n\n# It's cleaner to first make a vector of the coefficients we wish to include; we can name the coefficients as they appear in Table A.3; note that we also leave out the Intercept, as in the published table:\ncoefs <- c(\"reform1_7\" = \"Reform\",\n           \"female\" = \"Female\",\n           \"blgetmg_d\" = \"Ethnic minority\",\n           \"fbrneur\" = \"Foreign-born father, Europe\",\n           \"mbrneur\" = \"Foreign-born mother, Europe\",\n           \"fnotbrneur\" = \"Foreign-born father, outside Europe\",\n           \"mnotbrneur\" = \"Foreign-born mother, outside Europe\",\n           \"factor(essround)2\" = \"ESS Round 2\",\n           \"factor(essround)3\" = \"ESS Round 3\",\n           \"factor(essround)4\" = \"ESS Round 4\",\n           \"factor(essround)5\" = \"ESS Round 5\",\n           \"factor(essround)6\" = \"ESS Round 6\",\n           \"factor(essround)7\" = \"ESS Round 7\",\n           \"factor(essround)8\" = \"ESS Round 8\",\n           \"factor(essround)9\" = \"ESS Round 9\")\n\n# Then we pass the vector to coef_map to select the coefficients to print\nmodelsummary::modelsummary(list(m1_funquad), stars = TRUE, coef_map = coefs)\n\n\n\n\n   \n     (1) \n  \n\n\n Reform \n    0.063* \n  \n\n  \n    (0.027) \n  \n\n Female \n    0.058*** \n  \n\n  \n    (0.013) \n  \n\n Ethnic minority \n    −0.241*** \n  \n\n  \n    (0.054) \n  \n\n Foreign-born father, Europe \n    −0.111** \n  \n\n  \n    (0.042) \n  \n\n Foreign-born mother, Europe \n    −0.108* \n  \n\n  \n    (0.044) \n  \n\n Foreign-born father, outside Europe \n    −0.065 \n  \n\n  \n    (0.073) \n  \n\n Foreign-born mother, outside Europe \n    −0.110 \n  \n\n  \n    (0.078) \n  \n\n ESS Round 2 \n    0.059 \n  \n\n  \n    (0.045) \n  \n\n ESS Round 3 \n    0.162* \n  \n\n  \n    (0.075) \n  \n\n ESS Round 4 \n    0.243* \n  \n\n  \n    (0.108) \n  \n\n ESS Round 5 \n    0.360* \n  \n\n  \n    (0.144) \n  \n\n ESS Round 6 \n    0.397* \n  \n\n  \n    (0.179) \n  \n\n ESS Round 7 \n    0.449* \n  \n\n  \n    (0.212) \n  \n\n ESS Round 8 \n    0.655** \n  \n\n  \n    (0.246) \n  \n\n ESS Round 9 \n    0.816** \n  \n\n  \n    (0.283) \n  \n\n Num.Obs. \n    68796 \n  \n\n R2 \n    0.200 \n  \n\n R2 Adj. \n    0.198 \n  \n\n AIC \n    268913.8 \n  \n\n BIC \n    270056.1 \n  \n\n Log.Lik. \n    −134331.877 \n  \n\n RMSE \n    1.71 \n  \n\n\n + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001\n\n\n\nWe can compare our table to the published one to confirm that the results are similar, but there are some inconsistencies that we can address.\nSurvey weights\nOne of the differences between our coefficients and the ones in the published table is due to the fact that we did not account for survey weights, while in Österman (2021), p. 220:\n\nThe data are weighted using ESS design weights\n\nTo understand in more detail what this means, we need some understanding of how weights are constructed in the European Social Survey (ESS). In a nutshell, ESS data are distributed containing 3 types of weights:\n\ndweight: These are the so-called design weights. Quoting the ESS website: “the main purpose of the design weights is to correct for the fact that in some countries respondents have different probabilities to be part of the sample due to the sampling design used.” These weights have been built to correct for the coverage error, that is, the error created by the different chances that individuals from the target population are covered in the sample frame.\npspwght: These are the post-stratification weights. According the the ESS website, these “are a more sophisticated weighting strategy that uses auxiliary information to reduce the sampling error and potential non-response bias.” These weights have been computed after the data has been collected, to correct from differences between population frequencies observed in the sample and the “true” population frequencies (i.e. those provided by the Labour Force Survey funded by the EU and available on Eurostat). Unlike the design weights, which are based on the probability of inclusion of different groups of individuals in the sample frames, these have been calculated starting from variables that are there in the data, and are really an “adjustment” of the design weight made to reach observed frequencies that match those of the target population.\npweight: These are the population size weights. These weights have the purpose to match the numbers of observations collected in each country to the country populations. They are to be used only when we calculate statistics on multiple countries (for instance, unemployment in Scandinavia). Their value is the same for all observations within the same country.\n\nThere is a lot to be said about survey weights and options for dealing with them, which we will not cover in more detail in this course. But as part of this exercise we will get to know some functions that can help with including survey weights and can be extended to include more complex design weights as well. Specifically, we will look at the survey package:\n\n## Install and load required packages\n\noptions(repos = list(CRAN = \"http://cran.rstudio.com/\")) # Just in case\n\npacman::p_load(tidyverse, survey)     \n\n# For some reason installing/loading the tidyverse installed an earlier version of the dplyr package and \"srvyr\" complained; this installs the latest version of \"dplyr\":\n# install.packages(\"dplyr\")\n\nWe start by creating a weighted data object using the svydesign() function from survey, which includes the dweight as used in the original article:\n\n## Create weighted data\nosterman_w <- svydesign(id = ~1,                 # specifying cluster IDs is needed; ~0 or ~1 means no clusters\n                        weights = ~dweight,      # we apply the design weights\n                        data = osterman)\n\nWe then fit the model using the svyglm() function from the survey package and save the model object; note that we specify a design = option with the weighted data object we created earlier:\n\nm1_w <- svyglm(trustindex3 ~ reform1_7 + female + blgetmg_d + \n           fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + \n           agea + yrbrn + I(agea^2) + I(yrbrn^2) +\n           factor(reform_id_num) +           \n           yrbrn:factor(reform_id_num) + agea:factor(reform_id_num) +  \n           I(agea^2):factor(reform_id_num),\n       design = osterman_w, data = osterman) \n\nTo compare the results from the weighted model to the one we produced earlier, we can check them in a modelsummary() table; we can reuse the list of coefficients to display that we created earlier:\n\n# List and name the models\nmodels <- list(\n  \"Unweighted model\" = m1_funquad,\n  \"Weighted model\" = m1_w)\n\n# Tabulate the models\nmodelsummary::modelsummary(models, stars = TRUE, coef_map = coefs)\n\n\n\n\n   \n    Unweighted model \n    Weighted model \n  \n\n\n Reform \n    0.063* \n    0.063* \n  \n\n  \n    (0.027) \n    (0.029) \n  \n\n Female \n    0.058*** \n    0.061*** \n  \n\n  \n    (0.013) \n    (0.014) \n  \n\n Ethnic minority \n    −0.241*** \n    −0.261*** \n  \n\n  \n    (0.054) \n    (0.066) \n  \n\n Foreign-born father, Europe \n    −0.111** \n    −0.090* \n  \n\n  \n    (0.042) \n    (0.046) \n  \n\n Foreign-born mother, Europe \n    −0.108* \n    −0.092* \n  \n\n  \n    (0.044) \n    (0.046) \n  \n\n Foreign-born father, outside Europe \n    −0.065 \n    −0.053 \n  \n\n  \n    (0.073) \n    (0.079) \n  \n\n Foreign-born mother, outside Europe \n    −0.110 \n    −0.087 \n  \n\n  \n    (0.078) \n    (0.082) \n  \n\n ESS Round 2 \n    0.059 \n    0.052 \n  \n\n  \n    (0.045) \n    (0.047) \n  \n\n ESS Round 3 \n    0.162* \n    0.148+ \n  \n\n  \n    (0.075) \n    (0.080) \n  \n\n ESS Round 4 \n    0.243* \n    0.253* \n  \n\n  \n    (0.108) \n    (0.114) \n  \n\n ESS Round 5 \n    0.360* \n    0.364* \n  \n\n  \n    (0.144) \n    (0.153) \n  \n\n ESS Round 6 \n    0.397* \n    0.403* \n  \n\n  \n    (0.179) \n    (0.190) \n  \n\n ESS Round 7 \n    0.449* \n    0.458* \n  \n\n  \n    (0.212) \n    (0.224) \n  \n\n ESS Round 8 \n    0.655** \n    0.671* \n  \n\n  \n    (0.246) \n    (0.261) \n  \n\n ESS Round 9 \n    0.816** \n    0.835** \n  \n\n  \n    (0.283) \n    (0.299) \n  \n\n Num.Obs. \n    68796 \n    68796 \n  \n\n R2 \n    0.200 \n    0.206 \n  \n\n R2 Adj. \n    0.198 \n    0.204 \n  \n\n AIC \n    268913.8 \n     \n  \n\n BIC \n    270056.1 \n    283455.7 \n  \n\n Log.Lik. \n    −134331.877 \n    −141031.683 \n  \n\n RMSE \n    1.71 \n    1.71 \n  \n\n\n + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001\n\n\n\nWe can check our results again to those published in Table A.3 in the Appendix to Österman (2021), and we will see that all the coefficients are the same now. The remaining differences are in the standard error estimates, and those are due to the fact that we did not use robust standard errors to account for intra-group correlation.\nVariance estimation using robust clustered errors\nÖsterman (2021), p. 222 write that:\n\nAll models are estimated with OLS, … and apply country-by-birth year clustered robust standard errors. [Footnote: An alternative would be to cluster the standard errors on the country level but such an approach would risk to lead to biased standard errors because of too few clusters]\n\nApplying clustered robust standard errors is a more elementary and less flexible way to account for breaches of the iid assumptions of OLS (that our variables are “independent and identically distributed”) than fitting a mixed-effects (multilevel, hierarchical) model. There is a vast literature assessing whether one approach is better suited than the other in different contexts. To get a feel for the issues in question and for a deeper understanding of how these methods are useful, compare the analysis of Cheah (2009), whose results “suggest that modeling the clustering of the data using a multilevel methods is a better approach than fixing the standard errors of the OLS estimate”, to that of McNeish, Stapleton, and Silverman (2017), who discuss a number of cases (focusing on psychology literature) where cluster-robust standard error may be more advantageous than multilevel/hierarchical modelling.\nWe will first fit our model using clustered robust standard errors, as done by Österman (2021), then we will check the results against those we would obtain from a multilevel model design to see how the risk related to the low number of country clusters affects the results.\nThe standard was of applying error corrections is by using the sandwich and {lmtest} packages, but they do not handle well the summary objects produced by the survey package for weighted estimates. Using the unweighed model we fitted earlier, we could do the following:\n\n# Load required packages\npacman::p_load(sandwich, lmtest)\n\n# extract variance-covariance matrix with clustered correction\nvc_cl <- vcovCL(m1_funquad, type='HC1', cluster=~cntry_cohort)\n\n# get coefficients\nm1_funquad_cl <- coeftest(m1_funquad, vc_cl)\n\n## Or, the two steps above can be combined into one call:\n\nm1_funquad_cl2 <- coeftest(m1_funquad, \n                          vcovCL, type='HC1', cluster = ~cntry_cohort)\n\n# And we can check that the two have the same result:\nall.equal(m1_funquad_cl, m1_funquad_cl2)\n\n[1] TRUE\n\n\nLet’s tabulate the results for comparison:\n\n# List and name the models\nmodels <- list(\n  \"Unweighted model\" = m1_funquad,\n  \"Weighted model\" = m1_w,\n  \"Unweighted model with cluster-robust errors\" = m1_funquad_cl)\n\n# Tabulate the models\nmodelsummary::modelsummary(models, stars = TRUE, coef_map = coefs)\n\n\n\n\n   \n    Unweighted model \n    Weighted model \n    Unweighted model with cluster-robust errors \n  \n\n\n Reform \n    0.063* \n    0.063* \n    0.063* \n  \n\n  \n    (0.027) \n    (0.029) \n    (0.029) \n  \n\n Female \n    0.058*** \n    0.061*** \n    0.058*** \n  \n\n  \n    (0.013) \n    (0.014) \n    (0.015) \n  \n\n Ethnic minority \n    −0.241*** \n    −0.261*** \n    −0.241*** \n  \n\n  \n    (0.054) \n    (0.066) \n    (0.060) \n  \n\n Foreign-born father, Europe \n    −0.111** \n    −0.090* \n    −0.111* \n  \n\n  \n    (0.042) \n    (0.046) \n    (0.046) \n  \n\n Foreign-born mother, Europe \n    −0.108* \n    −0.092* \n    −0.108* \n  \n\n  \n    (0.044) \n    (0.046) \n    (0.047) \n  \n\n Foreign-born father, outside Europe \n    −0.065 \n    −0.053 \n    −0.065 \n  \n\n  \n    (0.073) \n    (0.079) \n    (0.077) \n  \n\n Foreign-born mother, outside Europe \n    −0.110 \n    −0.087 \n    −0.110 \n  \n\n  \n    (0.078) \n    (0.082) \n    (0.086) \n  \n\n ESS Round 2 \n    0.059 \n    0.052 \n    0.059 \n  \n\n  \n    (0.045) \n    (0.047) \n    (0.044) \n  \n\n ESS Round 3 \n    0.162* \n    0.148+ \n    0.162* \n  \n\n  \n    (0.075) \n    (0.080) \n    (0.079) \n  \n\n ESS Round 4 \n    0.243* \n    0.253* \n    0.243* \n  \n\n  \n    (0.108) \n    (0.114) \n    (0.115) \n  \n\n ESS Round 5 \n    0.360* \n    0.364* \n    0.360* \n  \n\n  \n    (0.144) \n    (0.153) \n    (0.154) \n  \n\n ESS Round 6 \n    0.397* \n    0.403* \n    0.397* \n  \n\n  \n    (0.179) \n    (0.190) \n    (0.190) \n  \n\n ESS Round 7 \n    0.449* \n    0.458* \n    0.449* \n  \n\n  \n    (0.212) \n    (0.224) \n    (0.224) \n  \n\n ESS Round 8 \n    0.655** \n    0.671* \n    0.655* \n  \n\n  \n    (0.246) \n    (0.261) \n    (0.262) \n  \n\n ESS Round 9 \n    0.816** \n    0.835** \n    0.816** \n  \n\n  \n    (0.283) \n    (0.299) \n    (0.298) \n  \n\n Num.Obs. \n    68796 \n    68796 \n    68796 \n  \n\n R2 \n    0.200 \n    0.206 \n     \n  \n\n R2 Adj. \n    0.198 \n    0.204 \n     \n  \n\n AIC \n    268913.8 \n     \n    406007.8 \n  \n\n BIC \n    270056.1 \n    283455.7 \n    1033594.4 \n  \n\n Log.Lik. \n    −134331.877 \n    −141031.683 \n     \n  \n\n RMSE \n    1.71 \n    1.71 \n     \n  \n\n\n + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001\n\n\n\nThe error-correction does not have an impact on the coefficients, but it did affect the standard errors. Notice that the standard errors came closer to those in the weighted model without any error correction, showing that the weighting procedure already applies an error correction, albeit not specifically on the cluster variable cntry_cohort.\nAn easy way to include information on the clustering directly in the svydesign() function is to add cluster IDs:\n\n# Re-specify the survey design\nosterman_w_cl <- svydesign(id = ~cntry_cohort,        # This time we add cntry_cohort as id\n                         weights = ~dweight, \n                         data = osterman)\n\n# Re-fit the model with the new survey design\nm1_w_cl <- svyglm(trustindex3 ~ reform1_7 + female + blgetmg_d + \n               fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + \n               agea + yrbrn + I(agea^2) + I(yrbrn^2) +\n               factor(reform_id_num) +           \n               yrbrn:factor(reform_id_num) + agea:factor(reform_id_num) +  \n               I(agea^2):factor(reform_id_num),\n           design = osterman_w_cl, data = osterman) \n\n# Tabulate the models\n\n# Add the latest model to the existing list using the append() function to save typing\nmodels <- append(models, \n                 list(\"Weighted with cluster IDs\" = m1_w_cl)\n                 )\n\n# Tabulate the models\nmodelsummary::modelsummary(models, stars = TRUE, coef_map = coefs)\n\n\n\n\n   \n    Unweighted model \n    Weighted model \n    Unweighted model with cluster-robust errors \n     Weighted with cluster IDs \n  \n\n\n Reform \n    0.063* \n    0.063* \n    0.063* \n    0.063* \n  \n\n  \n    (0.027) \n    (0.029) \n    (0.029) \n    (0.030) \n  \n\n Female \n    0.058*** \n    0.061*** \n    0.058*** \n    0.061*** \n  \n\n  \n    (0.013) \n    (0.014) \n    (0.015) \n    (0.015) \n  \n\n Ethnic minority \n    −0.241*** \n    −0.261*** \n    −0.241*** \n    −0.261*** \n  \n\n  \n    (0.054) \n    (0.066) \n    (0.060) \n    (0.067) \n  \n\n Foreign-born father, Europe \n    −0.111** \n    −0.090* \n    −0.111* \n    −0.090+ \n  \n\n  \n    (0.042) \n    (0.046) \n    (0.046) \n    (0.047) \n  \n\n Foreign-born mother, Europe \n    −0.108* \n    −0.092* \n    −0.108* \n    −0.092+ \n  \n\n  \n    (0.044) \n    (0.046) \n    (0.047) \n    (0.047) \n  \n\n Foreign-born father, outside Europe \n    −0.065 \n    −0.053 \n    −0.065 \n    −0.053 \n  \n\n  \n    (0.073) \n    (0.079) \n    (0.077) \n    (0.078) \n  \n\n Foreign-born mother, outside Europe \n    −0.110 \n    −0.087 \n    −0.110 \n    −0.087 \n  \n\n  \n    (0.078) \n    (0.082) \n    (0.086) \n    (0.081) \n  \n\n ESS Round 2 \n    0.059 \n    0.052 \n    0.059 \n    0.052 \n  \n\n  \n    (0.045) \n    (0.047) \n    (0.044) \n    (0.045) \n  \n\n ESS Round 3 \n    0.162* \n    0.148+ \n    0.162* \n    0.148+ \n  \n\n  \n    (0.075) \n    (0.080) \n    (0.079) \n    (0.088) \n  \n\n ESS Round 4 \n    0.243* \n    0.253* \n    0.243* \n    0.253* \n  \n\n  \n    (0.108) \n    (0.114) \n    (0.115) \n    (0.125) \n  \n\n ESS Round 5 \n    0.360* \n    0.364* \n    0.360* \n    0.364* \n  \n\n  \n    (0.144) \n    (0.153) \n    (0.154) \n    (0.169) \n  \n\n ESS Round 6 \n    0.397* \n    0.403* \n    0.397* \n    0.403+ \n  \n\n  \n    (0.179) \n    (0.190) \n    (0.190) \n    (0.209) \n  \n\n ESS Round 7 \n    0.449* \n    0.458* \n    0.449* \n    0.458+ \n  \n\n  \n    (0.212) \n    (0.224) \n    (0.224) \n    (0.246) \n  \n\n ESS Round 8 \n    0.655** \n    0.671* \n    0.655* \n    0.671* \n  \n\n  \n    (0.246) \n    (0.261) \n    (0.262) \n    (0.288) \n  \n\n ESS Round 9 \n    0.816** \n    0.835** \n    0.816** \n    0.835* \n  \n\n  \n    (0.283) \n    (0.299) \n    (0.298) \n    (0.326) \n  \n\n Num.Obs. \n    68796 \n    68796 \n    68796 \n    68796 \n  \n\n R2 \n    0.200 \n    0.206 \n     \n    0.206 \n  \n\n R2 Adj. \n    0.198 \n    0.204 \n     \n    −209.204 \n  \n\n AIC \n    268913.8 \n     \n    406007.8 \n     \n  \n\n BIC \n    270056.1 \n    283455.7 \n    1033594.4 \n    20460875.2 \n  \n\n Log.Lik. \n    −134331.877 \n    −141031.683 \n     \n    −10229741.419 \n  \n\n RMSE \n    1.71 \n    1.71 \n     \n    1.71 \n  \n\n\n + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001\n\n\n\nThis final model takes us the closest to the model reported in Österman (2021).\nTo take that a final step further, we will fit the model in a multilevel framework using the lme4 package. In the next exercises, you will rely on your knowledge of the lmer() function to fit the required models, then add them to the modelsummary() table and compare them to the OLS models we fitted earlier."
  },
  {
    "objectID": "materials/worksheets/worksheets_w06.html#exercise-3-fit-a-random-intercept-model",
    "href": "materials/worksheets/worksheets_w06.html#exercise-3-fit-a-random-intercept-model",
    "title": "Week 6 Computer Lab Worksheet",
    "section": "Exercise 3: Fit a random intercept model",
    "text": "Exercise 3: Fit a random intercept model\nComplete the code below in order to fit a random intercept model with cntry_cohort as the single grouping variable and disregarding survey weights:\n\n# Load the required packages\npacman::p_load(lme4)\n\n# Fit a random intercept model with `cntry_cohort` as the grouping variable\nm2 <- ...\n\n\n# Fit a random intercept model with only `cntry` as the grouping variable\nm2_cntry <- ...\n\n\n# Add the latest models to the existing list using the append() function\nmodels <- append(...)\n\n# Tabulate and compare the models\nmodelsummary::modelsummary(...)\n\n\nQuestions\n\nHow do the models compare?"
  },
  {
    "objectID": "materials/worksheets/worksheets_w07.html#setup",
    "href": "materials/worksheets/worksheets_w07.html#setup",
    "title": "Week 7 Computer Lab Worksheet",
    "section": "Setup",
    "text": "Setup\nIn Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.\nYou can create and work in either an .R script or a .qmd/.Rmd for this week’s exercises (e.g. “Lab_7”), but the .qmd/.Rmd is recommended as it allows you to record longer notes and to render/knit your work to an easily readable output format.\nAlso install and load the packages that we will use in this session. Apart from packages that we have already used before, there are a number of new packages that we will explore:\n\n# Load required packages\n# Assumes that the pacman package is already installed; if it's not, install it first\npacman::p_load(tidyverse, \n               sjlabelled, \n               sjmisc,\n               modelsummary,\n               marginaleffects,\n               lme4,\n               psych,\n               lavaan,\n               lavaanPlot,\n               semPlot)"
  },
  {
    "objectID": "materials/worksheets/worksheets_w07.html#exercise-1-exploratory-factor-analysis",
    "href": "materials/worksheets/worksheets_w07.html#exercise-1-exploratory-factor-analysis",
    "title": "Week 7 Computer Lab Worksheet",
    "section": "Exercise 1: (Exploratory) Factor Analysis",
    "text": "Exercise 1: (Exploratory) Factor Analysis\nIn the first exercises we will use a dataset containing data collected from members of a fitness centre in Norway in 2014 (Mehmetoglu and Mittner 2021) (The description of the methods and analysis also relies mostly on Chapters 13 and 14 in Mehmetoglu and Mittner (2021)). The club members were asked to indicate how well having an attractive face and being sexy described them as a person using an ordinal scale (1 = very badly to 6 = very well). Using a similar scale (1 = not at all important to 6 = very important), the members were also asked to indicate how important various motivation for working out were:\n\n\n\n\n\n Name \n    Label \n  \n\n\n lweight \n    How important is following to workout- to loose weight \n  \n\n calories \n    How important is following to workout- to burn calories \n  \n\n cweight \n    How important is following to workout- to control my weight \n  \n\n stress \n    How important is following to workout - to help manage stress \n  \n\n tension \n    How important is following to workout - to release tension \n  \n\n relax \n    How important is following to workout - to mentally relax \n  \n\n body \n    How important is following to workout- to have a good body \n  \n\n appear \n    How important is following to workout- to improve my appearance \n  \n\n attract \n    How important is following to workout- to look more attractive \n  \n\n muscle \n    How important is following to workout- to develop my muscles \n  \n\n strength \n    How important is following to workout- to get stronger \n  \n\n endur \n    How important is following to workout- to increase my endurance \n  \n\n face \n    How well does the following describe you as a person -  attractive face \n  \n\n sexy \n    How well does the following describe you as a person - sexy \n  \n\n\n\n\nThe workout dataset can be loaded from the course website:\n\n# Load data\nworkout <- readRDS(url(\"https://cgmoreh.github.io/HSS8005/data/workout.Rds\"))\n\n\nTo inspect and better understand the dataset, apply some of the summary functions we practised in previous weeks to this data (e.g. using modelsummary, gtsummary, summarytools or something else).\n\n...\n\n\n\n\n\nFor the first exercise, we’ll select a few variables from the dataset and keep only complete cases with non-missing values:\n\nefa_data <- workout |> \n  select(stress:attract) |> \n  drop_na()\n\nThe six selected variables are the following:\n\n\n\n\n\n Name \n    Label \n  \n\n\n stress \n    How important is following to workout - to help manage stress \n  \n\n tension \n    How important is following to workout - to release tension \n  \n\n relax \n    How important is following to workout - to mentally relax \n  \n\n body \n    How important is following to workout- to have a good body \n  \n\n appear \n    How important is following to workout- to improve my appearance \n  \n\n attract \n    How important is following to workout- to look more attractive \n  \n\n\n\n\nWe can notice that the selected variables can roughly be seen to refer to two conceptual dimensions (latent factors): mental well-being and physical appearance.\nA common motivation for an (exploratory) factor analysis is to reduce a large number of variables down to a meaningful and manageable number of “factors” - or latent variables - that can explain the covariance/correlation among that larger set of observed - measured - variables. Each factor corresponds to a subset of observed variables that are relatively highly correlated. The first steps in a factor analysis are, thus, deciding the number of factors to extract and factor extraction. Factor extraction revolves around inspecting the diagonals of a correlation matrix that includes all the correlations among the observed variables of interest. In a raw correlation matrix, the values on the diagonals are all 1s, because each variable is perfectly correlated with itself. For example, a raw correlation matrix for our sample data would look like this:\n\n# rounded to two decimals for readability\ncor(efa_data) |> \n  round(2)\n\n        stress tension relax body appear attract\nstress    1.00    0.83  0.75 0.02  -0.02   -0.05\ntension   0.83    1.00  0.82 0.00  -0.02   -0.08\nrelax     0.75    0.82  1.00 0.12   0.04    0.00\nbody      0.02    0.00  0.12 1.00   0.76    0.71\nappear   -0.02   -0.02  0.04 0.76   1.00    0.86\nattract  -0.05   -0.08  0.00 0.71   0.86    1.00\n\n\nIn the factor analysis process our aim is to modify these diagonal values to best suit our aims - the various factor extraction methods target these diagonal values (apart from the Maximum Likelihood method, which manipulates the off-diagonal values). Two of the most commonly employed extraction methods are the Principal Axis factor (PA) and the Principal Components Analysis (PCA) methods. The PA method inserts estimates of the common/shared variance (also called communality) in the diagonals of the starting correlation matrix instead of 1s. Communality values are obtained by estimating the squared multiple correlation (smc) of each variable with all the other variables in the matrix (i.e. the multiple R-squared values obtained from regressing each variable on the remaining variables). Manually, we could calculate these communalities like this:\n\ncor_table <- cor(efa_data)                               # Save the raw correlation matrix as an object\n\ndiag(cor_table) <- (1 - 1 / diag(solve(cor_table)))      # Calculate and replace the diagonal of  the matrix with the estimated communalities\n## or, alternatively, using the `smc()` function from the {psych} package:\ndiag(cor_table) <- smc(efa_data) \n\ncor_table |> round(2)                                    # Print the correlation matrix with two decimals precision\n\n        stress tension relax body appear attract\nstress    0.70    0.83  0.75 0.02  -0.02   -0.05\ntension   0.83    0.79  0.82 0.00  -0.02   -0.08\nrelax     0.75    0.82  0.71 0.12   0.04    0.00\nbody      0.02    0.00  0.12 0.61   0.76    0.71\nappear   -0.02   -0.02  0.04 0.76   0.79    0.86\nattract  -0.05   -0.08  0.00 0.71   0.86    0.75\n\n\n\n\n\n\n\n\nCompare\n\n\n\nCompare the values on the diagonal to those obtained from regressing each variables in the dataset on all the remaining variables; eg:\n\nlm_stress <- lm(stress ~ tension + relax + body + appear + attract, data = efa_data)\nlm_tension <- lm(tension ~ stress + relax + body + appear + attract, data = efa_data)\n\n... etc.\n\nsummary(lm_stress)$r.squared\nsummary(lm_tension)$r.squared\n\n... etc.\n\n\n\nThe reason why PA uses communalities in the diagonals is that it assumes that some of the variance in the variables is caused by some other unique sources which ideally should be removed from the analysis. Conversely, PCA uses 1s in the starting diagonals in the correlation matrix without sorting out the variance caused by other sources than the factors themselves. In this respect, PCA could be said to analyse variance (1s in the diagonals) whereas PA (and the other extraction methods) analyses covariance (communalities in the diagonals).\nThe next step in factor extraction is to compute eigenvalues and eigenvectors from this correlation matrix which are then used to compute factor loadings. Eigenvectors are sets of weights (\\(w\\)) that generate factors with the largest possible eigenvalues, while eigenvalues (\\(e\\)) are variances captured by these factors. Factor loadings reflect correlations between observed variables and their respective factors, unless factors are correlated. Factors can be assumed correlated when rotating the factor solution using a type of oblique rotation such as “promax” or “oblimin”; otherwise, PA and PCA assume and produce an orthogonal (no correlation between factors) solution initially. As a result, if we square these correlations, the results will show us how much variance of each observed variable is explained by each factor.\nThe eigenvalues are a common way of deciding on the number of factors to extract. In PCA, factors associated with eigenvalues greater than 1 are generally recommended to be retained for interpretation. The idea here is that the factor retained should explain at least as much variance as one observed variable contributes, a situation represented by 1s in the diagonal. In the case of PA, since we use communalities (which are less than 1) instead of 1s in the diagonal, a common strategy is to choose eigenvalues greater than the average of the initial communalities.\nOther strategies for choosing the number of factors to extract are scree tests, parallel analysis and theoretical motivations. The scree test is about examining a plot/curve, produced after factor extraction, containing the eigenvalues on the Y-axis and the factors on the X-axis. The idea of the scree test is that factors along the tail of the curve represent mostly random error variance (trivial eigenvalues), and consequently, the factor solution just before the levelling of the curve should be chosen. Parallel analysis is about estimating the same factor model as the original one using randomly simulated data, resembling the original data in terms of both the number of variables and observations. Eigenvalues obtained from the simulated data are subsequently averaged and contrasted with their counterparts from the original data. A common recommendation is that if the eigenvalue of a factor from the original data proves to be larger than the average (or, alternatively, the 95th percentile) of the eigenvalues from the simulated data, the factor should be retained. Otherwise, the factor would be considered no more substantial than a random factor and thus discarded.\nIn R the psychpackage provides a variety of functions for performing factor analysis. The fa.parallel() function runs a parallel analysis including a scree plot. In the command below we specify that the parallel analysis should be done using the PA extraction method (fm=\"pa\") and the same method should be used for the eigenvalues (fa=\"fa\"). With the argument SMC=\"TRUE\" we can also request that squared multiple correlations be used in the diagonal of the correlation matrix at the start of the estimation:\n\nparallel_analysis <- fa.parallel(efa_data,\n      fm=\"pa\", fa=\"fa\", SMC=\"TRUE\")\n\n\n\n\nParallel analysis suggests that the number of factors =  2  and the number of components =  NA \n\n\nAlternatively, to run a Principal Component Analysis instead, we can write:\n\nfa.parallel(efa_data, fa=\"pc\")\n\n\n\n\nParallel analysis suggests that the number of factors =  NA  and the number of components =  2 \n\n\nThe scree plot suggests the existence of two obvious factors before the curve levels off. In the code above we have saved the results to an object so that we can also print out more detailed numerical results using the print() function:\n\nprint(parallel_analysis)\n\nCall: fa.parallel(x = efa_data, fm = \"pa\", fa = \"fa\", SMC = \"TRUE\")\nParallel analysis suggests that the number of factors =  2  and the number of components =  NA \n\n Eigen Values of \n\n eigen values of factors\n[1]  2.34  2.28  0.00 -0.07 -0.08 -0.12\n\n eigen values of simulated factors\n[1]  0.28  0.15  0.05 -0.03 -0.11 -0.19\n\n eigen values of components \n[1] 2.61 2.56 0.32 0.23 0.16 0.12\n\n eigen values of simulated components\n[1] NA\n\n\nWe could also compare the average smc() scores to the eigenvalues:\n\nsmc(efa_data) |> mean()\n\n[1] 0.7246744\n\n\nAgain, only two of the eigenvalues are greater that the mean of the squared multiple correlations between the variables. Finally, a theoretical reasoning also supports the choice for two factors being drawn out from this set of six variables, given that, as noted earlier, they appear to roughly map onto motivations relating to either mental well-being or physical appearance.\nThe actual extraction of the factors/principal components, once the most appropriate number of factors has been decided, can be done using the fa() and principal() functions from psych. These functions also provide various extraction methods and rotation options. It is generally useful to rotate the initial factor solution in order to obtain a more easily interpretable factor solution. An easily interpretable factor solution is associated with an output that contains a factor loading (pattern) matrix with variables with the maximum possible loading (close to 1) on one factor and the minimum possible loading (close to 0) on the remaining factor(s); this is often referred to as a ‘simple structure’. Orthogonal rotation options impose that a 90-degree angle is kept between factor axes while rotating them. An alternative approach, oblique rotation, relaxes this restriction, and is often preferred given that arguably factors (latent variables) measuring behavioural phenomena common in social science research are somewhat correlated, and oblique rotation can take this correlation into account by first estimating the correlation between the factors and then generating a solution.\nThe most widely used orthogonal rotation technique is “varimax” rotation, which maximizes the variance of the squared loadings for each factor, thus polarizing loadings so that they are either high or low, making it easier to identify factors with specific variables. Rotation does not influence the total variance explained by the factors. However, the total variance explained gets distributed differently among the factors (i.e., eigenvalues change). Suppose that two factors explain 75 per cent variance together, with 48 per cent and 27 per cent of the variance attributable to the first and second factor, respectively. After rotation, the factors would together still explain 75 per cent variance, but now with, say, 44 per cent and 31 per cent of the total variance attributed to the first and second factor, respectively. Relatedly, the total amount of variance of each variable explained by the factors (i.e., communality) would also stay the same but the factor loadings would change after rotation.\nThe most common oblique rotation technique is “promax”, which starts with an orthogonal rotation (varimax) in which the loadings are, first, raised to a specific power (2, 3 or 4) and then the solution is rotated to allow for correlations among the factors.\nLet’s use the fa() function to extract two factors using the PA method and applying a “varimax” rotation. Since our two factors represent two different phenomena (mental well-being or physical appearance) we do not assume a strong correlation between the two factors, and therefore an orthogonal rotation is substantiated. The fa() function also uses an iterative version of PA by default, which we can turn off by specifying max.iter = 0 if we want to. Iterative PA is basically a PA run several times. While PA starts and completes the factor extraction process with one set of estimated communalities (squared multiple correlations), IPA replaces these estimated communalities by new estimates emerging from the factor extraction process each time until the difference between the two communalities (last inserted and last estimated) is minimized. The fa() function uses the default argument min.err = 0.001, meaning that the estimation iterates until the change in communality is less than 0.001.\n\nfactor_model <- fa(efa_data,\n              nfactors = 2, \n              fm=\"pa\",\n              rotate = \"varimax\")\n\nWe can inspect the complete results from the factor analysis by printing the model object:\n\nfactor_model\n\nWe can also extract more specific information rather than inspecting the entire output. For example, we can check the factor loadings and eigenvalues:\n\nprint(factor_model$loadings, digits=4, cutoff=0)\n\n\nLoadings:\n        PA1     PA2    \nstress   0.8662 -0.0287\ntension  0.9522 -0.0480\nrelax    0.8689  0.0465\nbody     0.0575  0.7990\nappear   0.0093  0.9550\nattract -0.0382  0.8943\n\n                  PA1    PA2\nSS loadings    2.4168 2.3554\nProportion Var 0.4028 0.3926\nCumulative Var 0.4028 0.7954\n\n\nWe can see in this output how the first three variables (stress, tension and relax) load strongly on the first factor, and the last three variables (body, appear and attract) load strongly on the second factor. The factor loadings table can be read both vertically and horizontally. When we interpret the matrix vertically, we locate the correlations between a factor and all the observed variables. For example, the loading of stress on factor 1 is 0.8662. Squaring this value - 0.8662^2 = 0.7503 - tells us that roughly 75% per cent of the variance of stress is explained by factor 1. When we interpret the table horizontally, we obtain item communalities. For instance, we see that the loadings of tension on factor 1 and factor 2 are 0.9522 and -0.0480, respectively. Squaring and adding these two values together - 0.9522^2 + (-0.0480)^2 = 0.9089 - would show us the proportion (about 90%) of the total variance of tension that is explained by factor 1 and factor 2 in tandem. The remaining 10% represents the unique or unexplained variance.\nWe can also visualise these results with the fa.diagram() and fa.plot() functions:\n\nfa.diagram(factor_model, digits = 2)\n\n\n\n\n\nfa.plot(factor_model)\n\n\n\n\nIn most cases the aim of an exploratory factor analysis is to use factors in subsequent analysis. For this, we first need a metric for the two hypothetical constructs. We can either use the estimated factor scores directly or we can generate factors by, for example, taking either the sum or average of the variables expressing each factor. We will choose to take the average to keep the factor metric on the same scale (1−6) as the original observed variables, using the scoreItems() function from psych. Then, we can test the reliability of this summed scale based on Cronbach’s \\(\\alpha\\) coefficient.\n\n# First, we compile a list of the variables:\nitemlist <- list(mental=c(\"stress\",\"tension\",\"relax\"),\n                 physical=c(\"body\",\"appear\",\"attract\"))\nsummateds <- scoreItems(itemlist, efa_data, \n                         min=1, max=6, totals = FALSE)\nfactordata <- as_data_frame(summateds$scores)\n\n# Merge the factordata variables with the dataset:\n\nefa_data <- bind_cols(efa_data, factordata)\n\n# Check the variables in the dataset:\nnames(efa_data)\n\n[1] \"stress\"   \"tension\"  \"relax\"    \"body\"     \"appear\"   \"attract\"  \"mental\"  \n[8] \"physical\"\n\n\nTo test the reliability of the factors created, we can use the alpha() function from `{psych}’:\n\n# We run the alpha check on the two items separately:\n\nmental <- efa_data |> select(stress:relax)\nphysical <- efa_data |> select(body:attract)\n\n# We can save the output to objects and inspect them\nalpha_mental <- alpha(mental)\nalpha_physical <- alpha(physical)\n\n# Or we can print out only the overall alpha values directly\nalpha(mental)$total$std.alpha\n\n[1] 0.9234774\n\nalpha(physical)$total$std.alpha\n\n[1] 0.9124331\n\n\nIn common practice an \\(\\alpha\\) level higher than 0.7 is taken as satisfactory, showing a good fit of the items. Both our factors have a high \\(\\alpha\\) value, so we have additional evidence that the variables chosen fit together well on each factor.\n\nEFA practice Try to redo the analysis above, but this time including all the variables relating to “How important is following to workout …” variables from the original workout dataset."
  },
  {
    "objectID": "materials/worksheets/worksheets_w07.html#exercise-2-latent-path-analysis",
    "href": "materials/worksheets/worksheets_w07.html#exercise-2-latent-path-analysis",
    "title": "Week 7 Computer Lab Worksheet",
    "section": "Exercise 2: Latent Path Analysis",
    "text": "Exercise 2: Latent Path Analysis\nConfirmatory Factor Analysis (CFA) and Latent Path Analysis (LPA) are two of the most commonly used Structural Equation Modelling techniques in the social sciences. Confirmatory factor analysis is used to assess a hypothesized latent factor structure containing a set of indicators and one or more latent variables. Latent path analysis is used not only to examine a factor structure but also to test hypothesized structural relationships among the latent variables. The factor structure is concerned with the relationships between indicators and latent variables, whereas the structural relationships concern links between latent variables. The former is referred to as the measurement part, while the latter is named the structural part; together they comprise LPA.\nIn this exercise we will use the same workout data we started off from in Exercise 1. The analysis presented by Mehmetoglu and Mittner (2021) begins by proposing a number of hypotheses concerning motivations for working out in a fitness club based on evolutionary psychology theories:\n\nH1: The more attractive one perceives herself/himself, the more the person wants to work out to improve her/his physical appearance (i.e., Attractive → Appearance).\nH2: The more the person wants to work out to improve her/his physical appearance, the more s/he wants to work out to build up muscles (i.e., Appearance → Muscle).\nH3: The more the person wants to work out to improve her/his physical appearance, the more s/he wants to work out to lose weight (i.e., Appearance → Weight).\nH4: The more attractive one perceives herself/himself will indirectly influence the person to work out more to build up muscles (i.e., Attractive → Appearance → Muscle).\nH5: The more attractive one perceives herself/himself will indirectly influence the person to work out more to lose weight (i.e., Attractive → Appearance → Weight).\n\nIn an SEM framework it is common to visualise this set of hypotheses in a path diagram. Using the lavaan package, we can write the measurement and structural model equations, fit the structural equation model and plot the model like this:\n\n# First, remove all objects from the Environment, except for the \"workout\" data\nrm(list=setdiff(ls(), \"workout\"))\n\n\n# Specify the model structure:\nfull.lpa.mod <- '\n              # Measurement model (latent variables)\n                Attractive =~ face + sexy\n                Appearance =~ body + appear + attract\n                Muscle =~ muscle + strength + endur\n                Weight =~ lweight + calories + cweight\n              # Structural model (regressions)\n                Appearance ~ Attractive\n                Muscle ~ Appearance\n                Weight ~ Appearance\n              '\n\n# Estimate the model\nest.full.lpa.mod <- sem(full.lpa.mod, data=workout)\n\n# Plot the model\n## with lavaanPlot()\nlavaanPlot(model = est.full.lpa.mod)\n\n\n\n\n## or semPaths()\nsemPlot::semPaths(est.full.lpa.mod, title = FALSE,  layout = \"tree2\", nCharNodes = 0)\n\n\n\n\nEstimating a LPA model proceeds in two steps: we first establish a psychometrically sound (i.e., valid and reliable) measurement model and subsequently test the structural model. The measurement part of the LPA model includes the relationship between each of our four latent variables (Attractive, Appearance, Muscle and Weight) and its respective indicators. The measurement model is a standard CFA, and can be fit separately like this:\n\nmeas.lpa.mod <- '\n                Attractive =~ face + sexy\n                Appearance =~ body + appear + attract\n                Muscle =~ muscle + strength + endur\n                Weight =~ lweight + calories + cweight\n                '\nest.meas.lpa.mod <- cfa(meas.lpa.mod, data = workout)\n\nWe can inspect the results using the summary() function, with some additional options included:\n\nsummary(est.meas.lpa.mod, fit.measures = TRUE, standardized = TRUE)\n\nSome of the model fit indices (RMSEA > 0.1, TLI < 0.9, etc.) could be improved. The root mean squared error of approximation (RMSEA) takes the model complexity and sample size into consideration and penalizes models with too many parameters to estimate (i.e., with low degrees of freedom, df) and accordingly favours simpler models (i.e., with higher df). As we decrease the df (i.e., include more parameters to estimate in our model), the RMSEA value increases. High RMSEA values will be a sign of poor model fit, and RMSEA values greater than 0.10 generally indicate poor model fit. The Tucker–Lewis index (TLI) also generally ranges from 0 to 1, and higher values are desirable. TLI values greater than 0.90 are generally associated with acceptable model fit.\nMehmetoglu and Mittner (2021) suggest that correlating the error variances of two different pairs of indicators (“muscle” and “endur”, “lweigh” and “body”) would improve the model fit. This modification involves:\n\nmeas.lpa.mod2 <- '\n                Attractive =~ face + sexy\n                Appearance =~ body + appear + attract\n                Muscle =~ muscle + strength + endur\n                Weight =~ lweight + calories + cweight\n                muscle ~~ endur \n                lweight ~~ body\n                '\nest.meas.lpa.mod2 <- cfa(meas.lpa.mod2, data = workout)\n\nThe revised model is an improvement, although the RMSA is still slihghtly above 0.10.\nThe structural part extends the measurement with the hypothesized relationship among the latent variables.\n\nfull.lpa.mod <- '\n              # Measurement model (latent variables)\n                Attractive =~ face + sexy\n                Appearance =~ body + appear + attract\n                Muscle =~ muscle + strength + endur\n                Weight =~ lweight + calories + cweight\n                muscle ~~ endur \n                lweight ~~ body\n              # Structural model (regressions)\n                Appearance ~ Attractive\n                Muscle ~ Appearance\n                Weight ~ Appearance\n                '\nest.full.lpa.mod <- sem(full.lpa.mod, data = workout)\n\nWe can inspect the full model:\n\nsummary(est.full.lpa.mod, fit.measures = TRUE, standardized = TRUE)\n\nSince the theory underlying the model also statest that the covariance between the Muscle and the Weight factors should be set to 0, we can incorporate this into the model:\n\nfull.lpa.mod2 <- '\n              # Measurement model (latent variables)\n                Attractive =~ face + sexy\n                Appearance =~ body + appear + attract\n                Muscle =~ muscle + strength + endur\n                Weight =~ lweight + calories + cweight\n                muscle ~~ endur \n                lweight ~~ body\n                # Set covariance to 0:\n                Muscle ~~ 0*Weight\n              # Structural model (regressions)\n                Appearance ~ Attractive\n                Muscle ~ Appearance\n                Weight ~ Appearance\n                '\nest.full.lpa.mod2 <- sem(full.lpa.mod2, data = workout)\n\n\nsummary(est.full.lpa.mod2, fit.measures = TRUE, standardized = TRUE)\n\nThe assessment of the structural part is similar to that from a linear regression analysis. First, the sign, significance and size of path coefficients should be considered. Path coefficients are estimates that help us to assess the hypothesized relationships in the structural part. These path coefficients are typically presented in a standardized form, which is equivalent to standardized betas in linear regression. Standardized coefficients range generally between −1 and 1. The closer a path coefficient is to ±1 the stronger the relationship (positive/negative) is. The closer a path coefficient is to 0, the weaker the relationship is. Standardized beta coefficients equal to or less than 0.09 indicate a small effect, coefficients between 0.09 and 0.2 indicate a moderate effect, and coefficients larger than 0.2 indicate large effect. In our model, all the signs of the coefficients are in the hypothesized direction. That is, Attractive has a large and positive effect on Appearance, and Appearance has a large and positive effect on both Muscle and Weight. Finally, all of the standardized coefficients are statistically significant at 0.01. All these findings provide clear support for the first three of our study hypotheses (H1, H2, and H3). To address H4 and H5, we need to estimate the indirect effect of Attractive (via Appearance) on Muscle and Weight. In lavaan, this involves setting labels for the path coefficients (e.g. a, b1, b2), then using these labels to create new parameters (e.g. ind1 and ind2) using the := operator:\n\nfull.lpa.mod3 <- '\n              #Measurement model (latent variables)\n                Attractive =~ face + sexy\n                Appearance =~ body + appear + attract\n                Muscle =~ muscle + strength + endur\n                Weight =~ lweight + calories + cweight\n                muscle ~~ endur \n                lweight ~~ body\n                Muscle ~~ 0*Weight #set covariance to 0\n              #Structural model (regressions)\n                Appearance ~ a*Attractive\n                Muscle ~ b1*Appearance\n                Weight ~ b2*Appearance\n              #Indirect effects\n                # of Attraction on Muscle\n                ind1 := a*b1 \n                # of Attraction on Weight\n                ind2 := a*b2 \n                '\nest.full.lpa.mod3 <- sem(full.lpa.mod3, data = workout)\n\n\nsummary(est.full.lpa.mod3, standardized=TRUE)\n\nFrom this final model, we find that Attractive has a moderate and positive indirect effect on both Muscle and Weight, and that these indirect effects are statistically significant at the 0.01 level."
  },
  {
    "objectID": "materials/worksheets/worksheets_w07.html#exercise-3-multilevel-path-model-coefficients-via-multilevel-modelling",
    "href": "materials/worksheets/worksheets_w07.html#exercise-3-multilevel-path-model-coefficients-via-multilevel-modelling",
    "title": "Week 7 Computer Lab Worksheet",
    "section": "Exercise 3: Multilevel path model coefficients via multilevel modelling",
    "text": "Exercise 3: Multilevel path model coefficients via multilevel modelling\nIn tis exercise we look at the analysis presented in Ejrnæs and Jensen (2021), who use cross-national data from 17 countries collected through the European Social Survey (Round 8, 2016) to identify scenarios, “pathways”, under which EU citizens would vote in a potential referendum for their country to leave the European Union. They begin by positing three theoretical models of “the micro-foundations of ‘hard Euroscepticism’/‘exit scepticism’”: utilitarian, identity and anti-elitist models. Their main method is a path analysis testing these theories, but they also employ multi-level logistic regression, principal component analysis (PCA) and cluster analysis at various points in their analysis. The main analysis is done in Stata (the PCA is done in SPSS).\nIn this exercise, we will look more closely at the following reported results:\n\nFigure 4: Multilevel path model \nAppendix 1: Multilevel logistic regression (marginal effects)\n\nWe will check how the results reported can be reproduced using multilevel modelling using the lme4() package, as practiced in Week 6. In R, lavaan has some limitations when it comes to fitting generalised multilevel SEM models.\nFirst, import the dataset:\n\n# Import the dataset\n# Also, drop unused value labels and select to keep only the variables that will be used\nejr <- sjlabelled::read_stata(\"https://cgmoreh.github.io/HSS8005/data/Ejrnaes21.dta\", drop.labels = TRUE) |> \n  select(c(vteurmmb, \n           trstprl, trstlgl, trstplc, trstplt, trstprt, \n           imueclt, imwbcnt, imbgeco, \n           eduyrs, hinctnta, country_num, \n           mnactic, agea, atchctr, gndr))\n\nprtvtfch, prtvtfee, prtvtbgb, prtvtcil, prtvtbis, prtvtbno, prtvtdru, prtclfch, prtclfee, prtclbgb, prtcldil, prtclbis, prtclbno, prtcldru, rlgdnach, rlgdngb, rlgdnis, rlgdnno, rlgdeach, rlgdegb, rlgdeis, rlgdeno, vteumbgb, vteubcmb, rshpsgb, marstgb, edlvdch, edlvdee, edubgb1, eduagb2, edagegb, edubil1, eduail2, edlvdis, edlvdno, edlvdru, edlvpdch, edlvpdee, edupbgb1, edupagb2, edagepgb, edupbil1, edupail2, edlvpdis, edlvpdno, edlvpdru, edlvfdch, edlvfdee, edufbgb1, edufagb2, edagefgb, edufbil1, edufail2, edlvfdis, edlvfdno, edlvfdru, edlvmdch, edlvmdee, edumbgb1, edumagb2, edagemgb, edumbil1, edumail2, edlvmdis, edlvmdno, edlvmdru\n\n\n\n\n\nData management\nOutcome variable\nThe outcome variable of interest originates from the survey question: “Imagine there were a referendum in [country] tomorrow about membership of the European Union. Would you vote for [country] to remain a member of the European Union or to leave the European Union?”. Responses are coded in the variable vteurmmb, which has five levels (tip: use the sjmisc::frq() function for a labelled frequency table of the variable), including also “Would submit a blank ballot paper”, “Would spoil the ballot paper” and “Would not vote” in addition to “Remain” and “Leave”. We will follow the authors in dichotomising the variable as “1” = Leave vs. “0” = all other non-missing values:\n\n# Recoding \"leave\" variable as dummy variable (with `dplyr`):\nejr <- ejr |> \n  mutate(leave = case_match(vteurmmb, \n                            \"2\" ~ 1,\n                            c(\"1\", \"33\", \"44\", \"55\") ~ 0)) \n\n### Notes: \n### - \"case_match\" has superseded the previous \"recode()\" function in dplyr 1.1.0. If the code above fails, make sure you have dplyr 1.1.0 installed; if not, install it first\n### - \"vteurmmb\" is a factor with 5 labelled levels; to be able to refer to the values when recoding, we need either to treat it as_numeric() or quote the levels using \"\". The resulting variable will be numeric in either case.\n\n## `sjmisc::rec()` is another useful option, especially with labelled data:\n\n# ejr <- ejr |>\n#   mutate(leave = rec(vteurmmb,\n#                      rec = \"2 = 1 [Leave];\n#                      1, 33, 44, 55 = 0\"))\n\nEndogenous variables\nIn the context of structural equation modelling, variables that are influenced by other variables in a model are called endogenous variables. Conversely, those not influenced by other variables are called exogenous. Additionally, we can differentiate manifest variables - those that we have directly observed and measured - from latent variables - those that we have not directly measured, but instead we estimate from other measured variables. Such latent variables can be, for example, the “factors” or “components” obtained from a factor analysis or PCA. A path analysis is a special version of a structural equation model in which all the variables are manifest (observed).\nThe two constructs used as endogenous variables in the Figure 4 model (“trust in the political establishment” and “support for multiculturalism”) are treated by the authors as manifest because they are entered into the model as indices created before the model-building, however, the indices are created as a result of a principal component analysis\n\n# Creating two trust-related composite variables\n\n\n## Select out the variables as data-frames         \ntrust_vars <- ejr |> \n  select(c(trstprl, trstlgl, trstplc, trstplt, trstprt)) |> \n  mutate(across(everything(), as_numeric))\n         \nmulticulti_vars <- ejr |> \n  select(c(imueclt, imwbcnt, imbgeco)) |> \n  mutate(across(everything(), as_numeric))\n\n\n## Create Cronbach's alpha coefficient scores using `alpha()` from the {psych} package \n## and add them to the \"data\" dataframe\n\nejr <- ejr |> \n  mutate(trust = alpha(trust_vars, check.keys = TRUE)$scores,\n         multi_culti = alpha(multiculti_vars, check.keys = TRUE)$scores)\n\nejr <- ejr |> \n  mutate(across(c(eduyrs, hinctnta, trust, multi_culti), as_numeric),\n         across(leave, as_factor))\n\nVariable standardisation\nTha authors also standardize the variables before modelling them. In R this can be done in various ways, including using the sjmisc::std() function:\n\n# Standardize variables using \"sjmisc::std()\"\nejr <- ejr |> \n  mutate(Zeducation1 = std(eduyrs),\n         Zincome1 = std(hinctnta),\n         Zmulti1 = std(multi_culti),\n         Ztrust1 = std(trust)\n         ) |> \n  mutate(across(c(Zeducation1, Zincome1, Zmulti1, Ztrust1, country_num), as_numeric))\n\nMultilevel model fitting\nWe now have all the variables in a format that can be modelled:\n\nF4_direct <- lme4::glmer(leave ~ Zeducation1 + Zincome1 + Ztrust1 + Zmulti1 + (1|country_num), family = binomial(link = \"logit\"), data = ejr)\n\nF4_direct_margins <- marginaleffects::marginaleffects(F4_direct, eps = 0.15)\n\n\nF4_trust <- lme4::lmer(Ztrust1 ~ Zeducation1 + Zincome1 +  (1|country_num), data = ejr)\n\nF4_multi <- lme4::lmer(Zmulti1 ~ Zeducation1 + Zincome1 +  (1|country_num), data = ejr)\n\n\nmodelsummary::modelsummary(list(\"Leave\" = F4_direct_margins, \n                                \"Trust in pol. est.\" = F4_trust, \n                                \"Support multiculturalism\" = F4_multi),\n                           coef_rename = c(\"Zeducation1\" = \"Education\",\n                                           \"Zincome1\" = \"Income\",\n                                           \"Zmulti1\" = \"Support for multiculturalism\",\n                                           \"Ztrust1\" = \"Trust in political establishment\"),\n                           statistic = NULL)\n\n\n\n\n   \n    Leave \n    Trust in pol. est. \n    Support multiculturalism \n  \n\n\n Education \n    −0.012 \n    0.064 \n    0.210 \n  \n\n Income \n    −0.012 \n    0.111 \n    0.074 \n  \n\n Support for multiculturalism \n    −0.075 \n     \n     \n  \n\n Trust in political establishment \n    −0.061 \n     \n     \n  \n\n (Intercept) \n     \n    −0.011 \n    0.008 \n  \n\n SD (Intercept country_num) \n     \n    0.372 \n    0.353 \n  \n\n SD (Observations) \n     \n    0.903 \n    0.888 \n  \n\n Num.Obs. \n    23850 \n    23923 \n    23865 \n  \n\n R2 Marg. \n    0.193 \n    0.022 \n    0.063 \n  \n\n R2 Cond. \n    0.280 \n    0.164 \n    0.191 \n  \n\n AIC \n    19642.0 \n    63101.9 \n    62184.1 \n  \n\n BIC \n    19690.5 \n    63142.3 \n    62224.5 \n  \n\n ICC \n    0.1 \n    0.1 \n    0.1 \n  \n\n RMSE \n    0.36 \n    0.90 \n    0.89 \n  \n\n\n\n\n\nCompare the results in our table to those reported in Ejrnæs and Jensen (2021)"
  },
  {
    "objectID": "plan8.html",
    "href": "plan8.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "Topic plan and materials\n\n\n\n\n\n\nWeek\n\n\nTW\n\n\nDate\n\n\nTopic\n\n\nInfo\n\n\nNotes\n\n\nLecture\n\n\nLabs\n\n\nHandouts\n\n\n\n\n\n\n\n\n22\n\n\n30 January\n\n\nIntroduction: module overview\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWeek 1\n\n\n22\n\n\n02 February\n\n\nGamblers, God, Guinness and peas  A brief history of statistics\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\nWeek 2\n\n\n23\n\n\n09 February\n\n\nRevisiting Flatland  A review of general linear models\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\nWeek 3\n\n\n24\n\n\n16 February\n\n\nDear Prudence, Help! I may be cheating with my X  Interactions and the logic of causal inference\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\nWeek 4\n\n\n25\n\n\n23 February\n\n\nThe Y question  Generalised linear models\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\nWeek 5\n\n\n26\n\n\n02 March\n\n\nDo we live in a simulation?  Basic data simulation for statistical inference and power analysis\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\nWeek 6\n\n\n27\n\n\n09 March\n\n\nChallenging hierarchies  Multilevel models\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\nWeek 7\n\n\n28\n\n\n16 March\n\n\nThe unobserved  Latent variables and structural models\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\nWeek 8\n\n\n29\n\n\n23 March\n\n\nWords, words, mere words…  Text as data\n\n\n\n\n\n  \n\n\n  \n\n\n  \n\n\n  \n\n\n\n\n\n\n29\n\n\n24 March\n\n\nConclusions: sum up, divide, and conquer\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTopic details\n\n\n\n\n\nTopic\n\n\nDescription\n\n\n\n\n\n\nWeek 1  Gamblers, God, Guinness and peas  A brief history of statistics\n\n\nIn the first contribution to a series of articles on the history of probability and statistics in the journal Biometrika, Florence Nightingale David (1955) (no linear relationship with the famous social reformer) paraphrased a contemporary archaeologist who quipped that “a symptom of decadence in a civilization is when men become interested in their own history”, giving the interest in his own discipline as proof of the validity of his statement. David, however, thought that this does not stand true also for scientists’ and statisticians’ own emerging interest in their disciplines. He was right, in that the critical examination of the intellectual development of statistics and probability theory that followed has improved the discipline by excavating ideas that had been buried by mainstream statistics, but he was also mistaken, in that this activity threw light on the decadence of mainstream statistical practice. In this lecture we will look back on the development of some basic statistical concepts and learn about the ideas and preoccupations that influenced them over the centuries. The aim of this overview is to build up essential intuition about the concepts and methods that we will learn later. Brains-on activities will include casting astragali, fighting Laplace’s Demon, tasting tea, and comparing peas in a pod. By the end, we will gain a clearer understanding of the limits of statistical analysis and the dangers of not acknowledging those limits.\nThe IT lab will provide a very hands-on practical introduction to the statistical software that will be used in the module.\n\n\n\n\nWeek 2  Revisiting Flatland  A review of general linear models\n\n\nIn Edwin Abbott’s 1884 novella, the inhabitants of Flatland are geometric shapes living in a two-dimensional world, incapable of imagining the existence of higher dimensions. A sphere passing through the plain of their world is a fascinating but incomprehensible event: Flatlanders can only see a dot becoming a circle, increasing in circumference, then shrinking back in size and disappearing. There are, in this universe, worlds with even more limited views, like the one-dimensional Lineland and the zero-dimensional Pointland. Any attempt to expand the perspective of their inhabitant(s) is doomed to failure. But as in any good adventure story, a chosen Flatland native embarks on a journey of discovery and revelation - and ostracism and imprisonment. The story is interpreted as an allegorical criticism of Victorian-age social structure, but can equally describe the limitations of inhabiting uncritically a methodological world in which all data are ‘normal’ and all relationships are linear. Moving beyond linearity and acquiring the statistical intuition needed to think in higher dimensions and perceive more complex relationships is indeed a matter of practice-induced revelation. It’s unlikely that we will reach statistical nirvana in this short course, but we’ll attempt to build some more substantial structures upon the arid plains of linear regression. We start by looking around in the Flat-, Line- and Point-lands of quantitative analysis. Incorrigible procrastinators may want to check out a full-length computer animated film version of Flatland on YouTube. Others may be better served by this brief TED-Ed animation.\n\n\n\n\nWeek 3  Dear Prudence, Help! I may be cheating with my X  Interactions and the logic of causal inference\n\n\nMuch of what we do in quantitative data analysis is about examining relationships. We are often interested in proposing and testing models of relationships between two or more variables. Sometimes our variables cry out to us begging for help, and we turn into agony aunts and uncles to our data. Other times we must psychoanalyse our data to uncover hidden associations and interactions. This is not an easy task. Do it carelessly, and you may unwittingly cheat yourself and the readers of your research. This week we’ll build some intuition for detecting complex and uneasy relationships within the design matrix X - that promiscuous commune on the right-hand-side of our regression equations. We’ll expand on the linear additive models that we looked at in the previous week by considering interactions among our predictor variables, we’ll explore the possibilities and challenges of asking causal questions of observational data, and we’ll think about ways to avoid what evolutionary anthropologist Richard McElreath calls ‘causal salad’. We may get an uncomfortable feeling that we may have cheated with our Xs in the past, but we’ll look towards the future. By the way, Dear Prudence is Slate magazine’s advice column; I like the name because being prudent really is essential in data analysis and interpretation. If you’re done with the readings for this week, you may indulge in some Prudie advice on matters more serious than statistics.\n\n\n\n\nWeek 4  The Y question  Generalised linear models\n\n\nIt wasn’t until the last quarter of the 20th century that a unified vision of statistical modelling emerged, allowing practitioners to see how the general linear model we have explored so far is only a specific case of a more general class of models. We could have had a fancy, memorable name for this class of models - as John Nelder, one of its inventors, acknowledged later in life (Senn 2003:127) - but back then academics were not required to undertake marketing training on the tweetabilty-factor of the chosen names for their theories; so we ended up with “generalised linear models”. These models can be applied to explananda (“explained”, “response”, “outcome”, “dependent” etc. variables, our ys) whose possible values have certain constraints (such as being limited by a lower bound or constrained to discreet choices) that makes the parameters of the Gaussian (‘normal’) distribution inefficient in describing them. Instead, they follow some of the other “exponential distributions” (and not only the exponential: cf. Gelman et al. (2020:264)), of which the Poisson, gamma, beta, binomial and multinomial are probably the most common in human and social sciences research. Their “generalised linear modelling” involves mapping them unto a linear model using a so-called “link function”. We will explore what all of this means in practice and how it can be applied to data that we are interested in most in our respective fields of study.\n\n\n\n\nWeek 5  Do we live in a simulation?  Basic data simulation for statistical inference and power analysis\n\n\nWe have known ever since science-fiction author Philip K. Dick’s memorable “Metz address” of 1977 that our world is a computer simulation. Of course, like some common-currency theories in the social sciences, this knowledge will never be truly verified. We won’t even attempt to get to the bottom of it in class; instead, we’ll practice some basic methods of computer simulation for statistical inference and for generating data that has some idealised characteristics. Such methods play an increasingly important role in computational statistics and are extremely useful for designing robust data collection and analysis plans. If you make a mistake in the code and end up in an infinite loop, but you’re afraid that stopping the process may cause the known universe to implode, you can watch Dick on YouTube while you wait. If something like this can happen to our data, who says it couldn’t happen to us?\n\n\n\n\nWeek 6  Challenging hierarchies  Multilevel models\n\n\nBy now we got a sense that every new thing we learn about turns out to be merely a specific case of a larger class of things. So, all the models we covered so far are specific, single-level, versions of multilevel models, in which our cases can be seen as clustered within larger entities. Sometimes they are part of several cross-cutting clusters and/or the clusters are themselves clustered. In general terms, we must acknowledge that there are dependencies in our data that may influence their behaviour. It turns out that data about humans living in societies look somewhat like humans living in societies. The importance of including information about hierarchical dependencies in our models is probably emphasised by no one else more than McElreath (2020:15), who wants “to convince the reader of something that appears unreasonable: multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach.” We will encounter some of the uses and challenges of multilevel modelling.\n\n\n\n\nWeek 7  The unobserved  Latent variables and structural models\n\n\nThe unobserved sounds like the title of a promising horror film; if we have achieved our aims in the module so far, our horror should be ‘merely’ metaphysical by now (Kołakowski anyone? No? Okay, never mind). We have already had to deal with various aspects of latency in our analyses. At the most fundamental level, we speak about population parameters, but we never actually observe them; even a sample statistic can be a purely imaginary case that doesn’t occur in real life. We have discussed the effects of omitted variables, which are thus unobserved by our model, but which we may have access to in our data. And, of course, our most interesting measurements are likely to be proxies of some unobservable theoretical construct (Mulvin (2021) has recently published a wonderfully rich book about proxies in general). This week we pick up an earlier thread from week 4, where we thought about binary and ordered multinomial variables as discretised manifestations of some continuous ‘latent variable’. We expand on this idea by exploring simple and then more complex latent variable models (factor analysis, structural equation modelling), as a further generalisation of the hierarchical perspective introduced earlier. This gives us a few more tools to deal with our radical uncertainty. (n.b. missing data points are another challenge that could fall under this heading, and learning how to deal with them is extremely important; but “The missing” is too good a title not to deserve a high-budget, weak-storyline, full-on special effects sequel somewhere else)\n\n\n\n\nWeek 8  Words, words, mere words…  Text as data\n\n\nAs researchers in humanities and the social sciences, we use words both as tools of analysis and as sources of data. Words, and more broadly, texts, are also increasingly important for quantitative research in an age of so-called ‘big data’, when the digital world is saturated with unstructured textual information. But the statistical inspection of text is neither new, nor restricted to the humanistic tail of the social sciences. For example, a documented interest in the statistical study of literary style for the purposes of attributing authorship dates back to the mid-1850s (see El-Shagi and Jung 2015); and investors can use textual data such as minutes from the Bank of England’s Monetary Policy Committee’s deliberations to estimate future monetary policy decisions before they are actually taken (cf. Lord 1958). Methods for the collection and quantitative analysis of large-scale textual data are increasingly available, but their technical implementation is complex and requires efficient combination of humanistic subject knowledge and statistical expertise. Faced with words, one is understandably caught between Shakespeare’s Troilus and Wilde’s Dorian Gray. “Words, words, mere words, no matter from the heart; th’ effect doth operate another way. … My love with words and errors still she feeds, but edifies another with her deeds” - believed the betrayed Troilus. “Words! Mere words! How terrible they were! How clear, and vivid, and cruel! One could not escape from them. And yet what a subtle magic there was in them! They seemed to be able to give a plastic form to formless things, and to have a music of their own as sweet as that of viol or of lute. Mere words! Was there anything so real as words?” - pondered Dorian.\n\n\n\n\n\n\n\n\n\n\nReferences\n\nDavid, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42(1/2):1–15. doi: 10.2307/2333419.\n\n\nEl-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39:222–34. doi: 10.1016/j.ejpoleco.2015.05.004.\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press.\n\n\nLord, R. D. 1958. “Studies in the History of Probability and Statistics.: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45(1/2):282–82. doi: 10.2307/2333072.\n\n\nMcElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. Boca Raton: Taylor and Francis, CRC Press.\n\n\nMulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Cambridge, Massachusetts: The MIT Press.\n\n\nSenn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18(1):118–31. doi: 10.1214/ss/1056397489."
  },
  {
    "objectID": "references.html",
    "href": "references.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "References"
  },
  {
    "objectID": "resources/index.html",
    "href": "resources/index.html",
    "title": "HSS8005 {{< iconify line-md plus >}}",
    "section": "",
    "text": "ReadingsSoftwareTrainingHelpOther\n\n\n\nTextbooks\nThe course does not strictly follow the content of a textbook, but the expectation is that students will read as much as possible of the assigned chapters from the following books:\n\n\n\n\n\n\n\n\n\nGelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press.  ROS\n\n\nFree to download PDF version from the book’s website: https://avehtari.github.io/ROS-Examples\n\n\n\n\n\n\n\nAlexander, Rohan. 2023. Telling Stories with Data. Chapman and Hall/CRC  TSD\n\n\nFree online book: https://tellingstorieswithdata.com\n\n\n\n\n\n\n\nGelman, A., and Hill, J. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.  ARM\n\n\nNote: ROS is the expanded and updated version of Part 1 (and some of Part 3) of this book. While everyone in the free world eagerly awaits the publication of ROS’s multilevel counterpart, we’ll use ARM as a reference work for the theory underpinning multilevel modelling.  Not freely available. Access it in print or online via the NU library\n\n\n\n\n\n\nRelatively large portions of text will be assigned for reading in each week from these books, referring to them by their acronyms. Don’t worry if you cannot read all the textbook content assigned in any given week! Those for whom the method covered by the assigned readings is new, will be able to refer back to them throughout the semester and beyond, reading thoroughly and completing the applied exercises. Those already familiar to some extent with the methods, should nonetheless read the text as a narrative and will discover hidden gems that will spectacularly improve their understanding and ability to interpret their statistical results.\n\n\nApplication\nIn the IT labs we will practice applying methods by reproducing small bits of published research, using the data and (critically) the modelling approaches used by the authors. To fully understand the context of these data and the methods used, you must read the original journal articles and the available supplementary materials provided alongside. These readings will be listed under each week’s outline (still work in progress!).\nThe articles come from a variety of different fields, so expect them to push you outside your disciplinary comfort zone. The point is to see how methods have been used in practice and learn how to reproduce (and potentially improve) those analyses. This will then enable you to apply this knowledge to your own research questions.\nWhen selecting the articles, the aim was to strike a fine balance between (a) the simplicity of the methods employed, (b) data and analytical transparency, and (c) the strength of the analysis. So don’t take them as examples of all-rounds best practice, but examples of research that gets published while being self-confident enough to open itself up for public scrutiny. Aim for this in your own research!\n\n\nTechnique\nThere will also be various readings relating more closely to the technicalities of coding in R and scientific writing, collaboration and communication in general. These readings will also be listed under each week’s outline as the semester progresses. The generic reading that students are advised to go through on their own is:\n\n\n\n\n\n\n\n\nWickham, Çetinkaya-Rundel and Grolemund. 2022. R for Data Science (2nd ed.)  R4DS\n\n\nFree online book: https://r4ds.hadley.nz/\n\n\n\n\n\n\n\nIntuition\nFinally, there will also be recommended readings listed under certain weeks that help place methods, statistics and probability theory in a broader frame. These are useful readings for everyone, regardless of whether you will be applying quantitative analysis in your research or future work.\n\n\n\n\nRequired software\nWe will use a number of open-source software for data analysis and scientific writing. You need to install these on your personal computers to be able to work away from campus:\n\n\n\n\nR\n(programming language)\nEssential\nR needs to be installed even if we will only use it via the RStudio interface.\nInstall the latest version from here\n\n\n\nRStudio\n(integrated development environment)\nEssential\nYou will need the free desktop version appropriate for your operating system. RStudio combines the R Console - the direct interface to R - with a number of other panels.\nInstall the latest version from here\n\n\n\nTidyverse\n(collection of R packages)\nEssential\nThe tidyverse is a collection of packages that make the R language easier to use by introducing a more consistent grammar. It provides functions that are particularly useful for data manipulation and visualisation. It is the most common ‘dialect’ used among social scientists.\nInstall from within RStudio by executing in the Console:\ninstall.packages(\"tidyverse\")\n\n\n\nQuarto\n(scientific publishing system)\nEssential\nWe will be using Quarto markdown documents (.qmd) throughout the course to document our data analysis. .qmd files extend the plain-text Markdown mark-up language (.md) to allow for data analysis code to be executed and results presented alongside the main text. This is an essential requirement for analytical transparency, reliability and reproducibility.The assignment will also be completed in .qmd.\nIncluded by default in the latest RStudio release; no need to install separately.\nYou can check your installation by executing in the RStudio Terminal :\nquarto check\n\n\n\nZotero\n(reference manager)\nRecommended\nIf you are not yet using a reference manager, I recommend giving Zotero a try. It will make your work much more efficient and it integrates (relatively well) with RMarkdown and Quarto using the the Better BibTeX add-on.\nInstall the latest version and add-ons from here\n\n\n\n\n\n\nStudents with no previous experience using R and/or RStudio are advised to complete the self-paced free online training course R for Social Scientists provided by Data Carpentry at https://datacarpentry.org/r-socialsci/\n\n\n\nThere are several ways to get help with R outside class. If you encounter an error message or are looking for a function to perform a specific task that we have not covered in class, you can do a Google search; for best results, use the https://rseek.org/ search engine, which limits the results to those relating to the R language.\nYou can also search for answers on Stack Overflow, which is a popular help and discussion website for programmers. You can also post a question there, but make sure to follow community standards and advice on how to ask a good question and how to provide a minimal reproducible example. You will need some experience using the site before being able to ask a good question, but it’s more than certain that any quesiton you have at this stage will have an answer already somewhere.\n\n\nAny further study resources will be listed here."
  }
]