Discernible #341

Merged · 8 commits · Oct 11, 2023
23 changes: 12 additions & 11 deletions 11-foundations-randomization.qmd
@@ -676,7 +676,7 @@ This is also the case with hypothesis testing: *even if we fail to reject the nu
Failing to find evidence in favor of the alternative hypothesis is not equivalent to finding evidence that the null hypothesis is true.
We will see this idea in greater detail in Section \@ref(decerr).

-### p-value and statistical significance
+### p-value and statistical discernibility

In Section \@ref(caseStudySexDiscrimination) we encountered a study from the 1970s that explored whether there was strong evidence that female candidates were less likely to be promoted than male candidates.
The research question -- are female candidates discriminated against in promotion decisions?
@@ -716,46 +716,47 @@ The test statistic in the opportunity cost study was the difference in the propo
In each of these examples, the **point estimate** of the difference in proportions was used as the test statistic.
:::

-When the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically significant**\index{statistically significant}.
-This means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.
+When the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically discernible**\index{statistically significant}\index{statistically discernible}.
+This means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.^[Many texts use the phrase "statistically significant" instead of "statistically discernible". We have chosen to use "discernible" to indicate that a precise statistical event has happened, as opposed to a notable effect which may or may not fit the statistical definition of discernible or significant.]
The threshold is called the **significance level**\index{hypothesis testing!significance level}\index{significance level} and is often represented by $\alpha$ (the Greek letter *alpha*).
The value of $\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected.
Historically, many fields have set $\alpha = 0.05,$ meaning that the results need to occur less than 5% of the time if the null hypothesis is to be rejected.
The value of $\alpha$ can vary depending on the field or the application.
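
As an illustrative aside (not part of this PR, and in Python rather than the book's R), the threshold logic above can be sketched with a stdlib permutation test. The promotion counts used here (21 of 24 "male" files and 14 of 24 "female" files promoted) are assumed from the sex discrimination case study referenced above:

```python
import random

random.seed(1)

# Assumed counts from the sex discrimination case study:
# 21 of 24 "male" files promoted, 14 of 24 "female" files promoted.
promoted = [1] * 21 + [0] * 3 + [1] * 14 + [0] * 10  # 48 files in total
observed = 21 / 24 - 14 / 24                         # observed difference in proportions

def permuted_diff(outcomes):
    """Shuffle promotion outcomes across the two groups of 24 files
    and recompute the difference in promotion proportions."""
    shuffled = outcomes[:]
    random.shuffle(shuffled)
    return sum(shuffled[:24]) / 24 - sum(shuffled[24:]) / 24

sims = [permuted_diff(promoted) for _ in range(10_000)]
p_value = sum(d >= observed for d in sims) / len(sims)

alpha = 0.05  # the previously set threshold
print(f"p-value about {p_value:.3f}; discernible at alpha = {alpha}? {p_value < alpha}")
```

With these counts the simulated p-value lands near 0.02, in line with the value quoted later in the section, so the result would be declared statistically discernible at $\alpha = 0.05$.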

```{r}
#| include: false
-terms_chp_11 <- c(terms_chp_11, "significance level", "statistically significant")
+terms_chp_11 <- c(terms_chp_11, "significance level", "statistically significant", "statistically discernible")
```

+Note that you may have heard the phrase "statistically significant" as a way to describe "statistically discernible."
Although in everyday language "significant" would indicate that a difference is large or meaningful, that is not necessarily the case here.
-The term "statistically significant" only indicates that the p-value from a study fell below the chosen significance level.
+The term "statistically discernible" indicates that the p-value from a study fell below the chosen significance level.
For example, in the sex discrimination study, the p-value was found to be approximately 0.02.
-Using a significance level of $\alpha = 0.05,$ we would say that the data provided statistically significant evidence against the null hypothesis.
+Using a significance level of $\alpha = 0.05,$ we would say that the data provided statistically discernible evidence against the null hypothesis.
However, this conclusion gives us no information regarding the size of the difference in promotion rates!

::: {.important data-latex=""}
-**Statistical significance.**
+**Statistical discernibility.**

-We say that the data provide **statistically significant**\index{hypothesis testing!statistically significant} evidence against the null hypothesis if the p-value is less than some predetermined threshold (e.g., 0.01, 0.05, 0.1).
+We say that the data provide **statistically discernible**\index{hypothesis testing!statistically discernible} evidence against the null hypothesis if the p-value is less than some predetermined threshold (e.g., 0.01, 0.05, 0.1).
:::

::: {.workedexample data-latex=""}
In the opportunity cost study in Section \@ref(caseStudyOpportunityCost), we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future.
We determined that such a large difference would only occur 6-in-1,000 times if the reminder actually had no influence on student decision-making.
What is the p-value in this study?
-Would you classify the result as "statistically significant"?
+Would you classify the result as "statistically discernible"?

------------------------------------------------------------------------

The p-value was 0.006.
-Since the p-value is less than 0.05, the data provide statistically significant evidence that US college students were actually influenced by the reminder.
+Since the p-value is less than 0.05, the data provide statistically discernible evidence that US college students were actually influenced by the reminder.
:::

::: {.important data-latex=""}
**What's so special about 0.05?**

-We often use a threshold of 0.05 to determine whether a result is statistically significant.
+We often use a threshold of 0.05 to determine whether a result is statistically discernible.
But why 0.05?
Maybe we should use a bigger number, or maybe a smaller number.
If you're a little puzzled, that probably means you're reading with a critical eye -- good job!
2 changes: 1 addition & 1 deletion 17-inference-two-props.qmd
@@ -406,7 +406,7 @@ for(i in 1:k){
}
```

-The choice of 95% or 90% or even 99% as a confidence level is admittedly somewhat arbitrary; however, it is related to the logic we used when deciding that a p-value should be declared as "significant" if it is lower than 0.05 (or 0.10 or 0.01, respectively).
+The choice of 95% or 90% or even 99% as a confidence level is admittedly somewhat arbitrary; however, it is related to the logic we used when deciding that a p-value should be declared as "discernible" if it is lower than 0.05 (or 0.10 or 0.01, respectively).
Indeed, one can show mathematically, that a 95% confidence interval and a two-sided hypothesis test at a cutoff of 0.05 will provide the same conclusion when the same data and mathematical tools are applied for the analysis.
A full derivation of the explicit connection between confidence intervals and hypothesis tests is beyond the scope of this text.
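
The claimed agreement can still be checked numerically. Below is a minimal stdlib Python sketch (illustrative only; the 60 successes out of 100 are invented, and the same $\hat{p}$-based standard error is deliberately used for both the interval and the test so the correspondence is exact, which is the sense in which "the same mathematical tools" are applied):

```python
import math

def ci_and_test(successes, n, p0, conf_z=1.96):
    """95% CI for a proportion and a two-sided z-test of p = p0.
    The same standard error (based on p-hat) is used for both, so a
    95% CI and a test at alpha = 0.05 must reach the same conclusion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    ci = (p_hat - conf_z * se, p_hat + conf_z * se)
    z = (p_hat - p0) / se
    # Two-sided p-value from the normal CDF, written via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return ci, p_value

ci, p = ci_and_test(60, 100, 0.5)
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}); p-value: {p:.3f}")
print("CI excludes 0.5:", not (ci[0] <= 0.5 <= ci[1]), "| p < 0.05:", p < 0.05)
```

Here the interval excludes the null value of 0.5 exactly when the p-value falls below 0.05.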

2 changes: 1 addition & 1 deletion 22-inference-many-means.qmd
@@ -218,7 +218,7 @@ Consider again the original hypotheses:
- $H_0:$ $\mu_{OF} = \mu_{IF} = \mu_{C}$
- $H_A:$ The average on-base percentage $(\mu_i)$ varies across some (or all) groups.

-Why might it be inappropriate to run the test by simply estimating whether the difference of $\mu_{C}$ and $\mu_{OF}$ is "statistically significant" at a 0.05 significance level?
+Why might it be inappropriate to run the test by simply estimating whether the difference of $\mu_{C}$ and $\mu_{OF}$ is "statistically discernible" at a 0.05 significance level?
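
The concern behind this question can be made concrete with a quick simulation (an illustrative stdlib Python sketch, not the book's R): three groups are drawn from the *same* population, so any pairwise rejection is a false one, yet running all pairwise tests at $\alpha = 0.05$ rejects far more than 5% of the time.

```python
import math
import random

random.seed(7)

def any_pairwise_rejection(n=30, z_crit=1.96):
    """Draw three groups of size n from the SAME normal(0, 1) population,
    then run all three pairwise z-tests (sd known to be 1).  Returns True
    if at least one test rejects at the 0.05 level -- a false rejection,
    since all three groups share one mean."""
    groups = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
    means = [sum(g) / n for g in groups]
    se = math.sqrt(2 / n)  # standard error of a difference of two means
    pairs = [(0, 1), (0, 2), (1, 2)]
    return any(abs(means[i] - means[j]) / se > z_crit for i, j in pairs)

reps = 2_000
family_error = sum(any_pairwise_rejection() for _ in range(reps)) / reps
print(f"chance of at least one false rejection: about {family_error:.2f}")
```

Even though each individual test uses $\alpha = 0.05$, the chance that at least one of the three comparisons rejects is well above 0.05, which is one reason to test all the groups at once.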

------------------------------------------------------------------------

10 changes: 5 additions & 5 deletions 25-inf-model-mlr.qmd
@@ -9,7 +9,7 @@ source("_common.R")

::: {.chapterintro data-latex=""}
In Chapter \@ref(model-mlr), the least squares regression method was used to estimate linear models which predicted a particular response variable given more than one explanatory variable.
-Here, we discuss whether each of the variables individually is a statistically significant predictor of the outcome or whether the model might be just as strong without that variable.
+Here, we discuss whether each of the variables individually is a statistically discernible predictor of the outcome or whether the model might be just as strong without that variable.
That is, as before, we apply inferential methods to ask whether a variable could have come from a population where the particular coefficient at hand was zero.
If one of the linear model coefficients is truly zero (in the population), then the estimate of the coefficient (using least squares) will vary around zero.
The inference task at hand is to decide whether the coefficient's difference from zero is large enough to decide that the data cannot possibly have come from a model where the true population coefficient is zero.
@@ -62,7 +62,7 @@ lm(interest_rate ~ debt_to_income + term + credit_checks, data = loans) %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(
linesep = "", booktabs = TRUE,
-caption = caption_helper("Summary of a linear model for predicting interest rate based on `debt_to_income`, `term`, and `credit_checks`. Each of the variables has its own coefficient estimate as well as p-value significance."),
+caption = caption_helper("Summary of a linear model for predicting interest rate based on `debt_to_income`, `term`, and `credit_checks`. Each of the variables has its own coefficient estimate as well as a p-value."),
digits = 2, align = "lrrrr"
) %>%
kable_styling(
@@ -126,7 +126,7 @@ Sometimes a set of predictor variables can impact the model in unusual ways, oft

In practice, there will almost always be some degree of correlation between the explanatory variables in a multiple regression model.
For regression models, it is important to understand the entire context of the model, particularly for correlated variables.
-Our discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the significance (i.e., the p-value) of each coefficient.
+Our discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the discernibility (i.e., the p-value) of each coefficient.

Consider an example where we would like to predict how much money is in a coin dish based only on the number of coins in the dish.
We ask 26 students to tell us about their individual coin dishes, collecting data on the total dollar amount, the total number of coins, and the total number of low coins.[^25-inf-model-mlr-2]
@@ -307,7 +307,7 @@ terms_chp_25 <- c(terms_chp_25, "multicollinearity")
```

Although diving into the details is beyond the scope of this text, we will provide one more reflection about multicollinearity.
-If the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of the coefficient or evaluate whether the variable is a statistically significant predictor of the outcome.
+If the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of the coefficient or evaluate whether the variable is a statistically discernible predictor of the outcome.
However, even a model that suffers from high multicollinearity will likely lead to unbiased predictions of the response variable.
So if the task at hand is only prediction, multicollinearity is unlikely to cause you substantial problems.
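
The two claims above (unstable coefficients, stable predictions) can be demonstrated with a small simulation, sketched here in stdlib Python rather than the book's R; the data-generating model $y = 2x_1 + 2x_2 + \text{noise}$ and the nearly collinear $x_2$ are invented for the illustration:

```python
import random

random.seed(42)

def fit_ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y, solved with
    Gaussian elimination (partial pivoting).  Rows of X are [1, x1, x2]."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef

def one_sim(n=50):
    """Simulate y = 2*x1 + 2*x2 + noise with x2 nearly identical to x1; return
    the fitted coefficient on x1 and the prediction at x1 = x2 = 1 (truth: 4)."""
    x1 = [random.gauss(0, 1) for _ in range(n)]
    x2 = [v + random.gauss(0, 0.05) for v in x1]  # nearly collinear with x1
    y = [2 * a + 2 * c + random.gauss(0, 1) for a, c in zip(x1, x2)]
    coef = fit_ols([[1.0, a, c] for a, c in zip(x1, x2)], y)
    return coef[1], coef[0] + coef[1] + coef[2]

b1s, preds = zip(*(one_sim() for _ in range(200)))
spread = lambda vals: max(vals) - min(vals)
print(f"spread of the x1 coefficient across sims: {spread(b1s):.1f}")
print(f"spread of the prediction at (1, 1): {spread(preds):.1f}")
```

Across the simulations the coefficient on $x_1$ swings over a wide range while the fitted prediction at a fixed point stays close to its true value of 4, mirroring the point in the text.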

@@ -393,7 +393,7 @@ We refer to the model given with only `bill_length_mm` as the **smaller** model.
It is seen in Table \@ref(tab:peng-lm-bill) with coefficient estimates of the parameters as well as standard errors and p-values.
We refer to the model given with `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` as the **larger** model.
It is seen in Table \@ref(tab:peng-lm-all) with coefficient estimates of the parameters as well as standard errors and p-values.
-Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables have low p-values, suggesting that each variable is a statistically significant predictor of `body_mass_g`, given all other variables in the model.
+Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables have low p-values, suggesting that each variable is a statistically discernible predictor of `body_mass_g`, given all other variables in the model.
However, in this section, we will go beyond the use of p-values to consider independent predictions of `body_mass_g` as a way to compare the smaller and larger models.

**The smaller model:**
2 changes: 1 addition & 1 deletion 26-inf-model-logistic.qmd
@@ -271,7 +271,7 @@ Using the example above and focusing on each of the variable p-values (here we w
- $H_0: \beta_3 = 0$ given `to_multiple`, `cc`, and `urgent_subj` are included in the model
- $H_0: \beta_4 = 0$ given `to_multiple`, `cc`, and `dollar` are included in the model

-The very low p-values from the software output tell us that three of the variables (that is, not `cc`) act as statistically significant predictors in the model at the significance level of 0.05, despite the inclusion of any of the other variables.
+The very low p-values from the software output tell us that three of the variables (that is, not `cc`) act as statistically discernible predictors in the model at the significance level of 0.05, despite the inclusion of any of the other variables.
Consider the p-value on $H_0: \beta_1 = 0$.
The low p-value says that it would be extremely unlikely to observe data that yield a coefficient on `to_multiple` at least as far from 0 as -1.91 (i.e., $|b_1| > 1.91$) if the true relationship between `to_multiple` and `spam` was non-existent (i.e., if $\beta_1 = 0$) **and** the model also included `cc` and `dollar` and `urgent_subj`.
Note also that the coefficient on `dollar` has a small associated p-value, but the magnitude of the coefficient is also seemingly small (0.07).
2 changes: 1 addition & 1 deletion exercises/_01-ex-data-hello.qmd
@@ -239,7 +239,7 @@ They randomly selected 6 of these daycare centers and instituted a monetary fine
In the remaining 4 daycare centers no fine was introduced.
The study period was divided into four: before the fine (weeks 1–4), the first 4 weeks with the fine (weeks 5-8), the last 8 weeks with fine (weeks 9–16), and the after fine period (weeks 17-20).
Throughout the study, the number of kids who were picked up late was recorded each week for each daycare.
-The study found that the number of late-coming parents increased significantly when the fine was introduced, and no reduction occurred after the fine was removed.^[The [`daycare_fines`](http://openintrostat.github.io/openintro/reference/daycare_fines.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Gneezy:2000]
+The study found that the number of late-coming parents increased discernibly when the fine was introduced, and no reduction occurred after the fine was removed.^[The [`daycare_fines`](http://openintrostat.github.io/openintro/reference/daycare_fines.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Gneezy:2000]

```{r}
library(openintro)
4 changes: 2 additions & 2 deletions exercises/_02-ex-data-design.qmd
@@ -226,7 +226,7 @@ To assess the effectiveness of taking large doses of vitamin C in reducing the d
A quarter of the patients were assigned a placebo, and the rest were evenly divided between 1g Vitamin C, 3g Vitamin C, or 3g Vitamin C plus additives to be taken at onset of a cold for the following two days.
All tablets had identical appearance and packaging.
The nurses who handed the prescribed pills to the patients knew which patient received which treatment, but the researchers assessing the patients when they were sick did not.
-No significant differences were observed in any measure of cold duration or severity between the four groups, and the placebo group had the shortest duration of symptoms. [@Audera:2001]
+No discernible differences were observed in any measure of cold duration or severity between the four groups, and the placebo group had the shortest duration of symptoms. [@Audera:2001]

a. Was this an experiment or an observational study? Why?

@@ -283,7 +283,7 @@ In one 2009 study, a team of researchers recruited 38 men and divided them rando
They also recruited 38 women, and they randomly placed half of these participants into the treatment group and the other half into the control group.
One group was given 25 grams of chia seeds twice a day, and the other was given a placebo.
The subjects volunteered to be a part of the study.
-After 12 weeks, the scientists found no significant difference between the groups in appetite or weight loss. [@Nieman:2009]
+After 12 weeks, the scientists found no discernible difference between the groups in appetite or weight loss. [@Nieman:2009]

a. What type of study is this?

4 changes: 2 additions & 2 deletions exercises/_13-ex-foundations-mathematical.qmd
@@ -9,7 +9,7 @@ In 2013, the Pew Research Foundation reported that "45% of U.S. adults report th

ii. If we repeated this study 1,000 times and constructed a 95% confidence interval for each study, then approximately 950 of those confidence intervals would contain the true fraction of U.S. adults who suffer from chronic illnesses.

-iii. The poll provides statistically significant evidence (at the $\alpha = 0.05$ level) that the percentage of U.S. adults who suffer from chronic illnesses is below 50%.
+iii. The poll provides statistically discernible evidence (at the $\alpha = 0.05$ level) that the percentage of U.S. adults who suffer from chronic illnesses is below 50%.

iv. Since the standard error is 1.2%, only 1.2% of people in the study communicated uncertainty about their answer.

@@ -20,7 +20,7 @@ A poll conducted in 2013 found that 52% of U.S. adult Twitter users get at least

b. Identify each of the following statements as true or false. Provide an explanation to justify each of your answers.

-i. The data provide statistically significant evidence that more than half of U.S. adult Twitter users get some news through Twitter. Use a significance level of $\alpha = 0.01$.
+i. The data provide statistically discernible evidence that more than half of U.S. adult Twitter users get some news through Twitter. Use a significance level of $\alpha = 0.01$.

ii. Since the standard error is 2.4%, we can conclude that 97.6% of all U.S. adult Twitter users were included in the study.

2 changes: 1 addition & 1 deletion exercises/_14-ex-foundations-errors.qmd
@@ -42,7 +42,7 @@ Determine if the following statements are true or false, and explain your reasoning

c. Suppose the null hypothesis is $p = 0.5$ and we fail to reject $H_0$. Under this scenario, the true population proportion is 0.5.

-d. With large sample sizes, even small differences between the null value and the observed point estimate, a difference often called the effect size, will be identified as statistically significant.
+d. With large sample sizes, even small differences between the null value and the observed point estimate, a difference often called the effect size, will be identified as statistically discernible.
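
The phenomenon in part (d) can be illustrated numerically with a stdlib Python sketch (the proportions 0.502 vs 0.500 and the per-group sample sizes are invented for the illustration):

```python
import math

def two_prop_p_value(p1, p2, n):
    """Two-sided z-test p-value for a difference in proportions p1 - p2,
    with n observations per group (normal approximation, pooled SE)."""
    p_pool = (p1 + p2) / 2
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = abs(p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Hold the same tiny effect fixed while the sample size grows:
for n in (100, 10_000, 1_000_000):
    print(n, round(two_prop_p_value(0.502, 0.500, n), 4))
```

The same tiny effect is nowhere near discernible at $n = 100$ but falls below 0.05 by $n = 1{,}000{,}000$.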

\clearpage
