Josh Betz - Biostatistics Consulting Center (jbetz@jhu.edu) 2024-02-13 10:52
This repository is for fitting Empirical Bayes Beta-Binomial models to prevalence values from two populations. Currently, a single-component model and a two-component mixture model are implemented; both are fit using maximum marginal likelihood. Estimation is done using the `optimx` package.
First, install the R environment for statistical computing and the RStudio IDE. This also requires additional packages to be installed from CRAN: this can be done using `install.packages()` and `update.packages()` or using the Packages tab of the RStudio IDE. Install the following packages, or check for updates if these packages are already installed:
- `callr` - Calling R sessions from within R
- `config` - Reading YAML configuration files
- `openssl` - SHA256 file hashes for identifying versions of dependencies
- `openxlsx` - Writing output in Excel format
- `optimx` - Optimization for fitting models
- `tidyverse` - Data wrangling and plotting
If you do not use the RStudio Packages pane, you can install and update packages from the command line: the `update_packages` and `ask_to_update` parameters can be commented or uncommented depending on your preferences.
```r
required_packages <-
  c("callr", "config", "openssl", "openxlsx", "optimx", "tidyverse")

update_packages <- TRUE # Update installed packages
# update_packages <- FALSE # Do not update any installed packages

ask_to_update <- FALSE # Update all packages without asking
# ask_to_update <- TRUE # Ask for each package

installed_packages <-
  as.data.frame(installed.packages()[, c(1, 3:4)])

packages_to_install <-
  setdiff(
    x = required_packages,
    y = installed_packages$Package
  )

packages_to_update <-
  setdiff(
    x = required_packages,
    y = packages_to_install
  )

if(length(packages_to_install) > 0){
  install.packages(
    pkgs = packages_to_install
  )
}

if(update_packages & (length(packages_to_update) > 0)){
  update.packages(
    ask = ask_to_update
  )
}
```
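As a quick sanity check after installation, you can confirm that each required package can be loaded. This small sketch uses only base R:

```r
required_packages <-
  c("callr", "config", "openssl", "openxlsx", "optimx", "tidyverse")

# requireNamespace() returns TRUE if the package can be loaded
load_ok <-
  vapply(
    X = required_packages,
    FUN = requireNamespace,
    FUN.VALUE = logical(1),
    quietly = TRUE
  )

# Any packages listed here still need to be installed
required_packages[!load_ok]
```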
In RStudio, go to File > New Project > Version Control > Git. In the repository URL, enter https://github.com/jbetz-jhu/BayesDLR. Leave the project directory name as `BayesDLR`. Select a directory to 'clone' (i.e. download) the repository: this directory will become the "project root directory", denoted as `project_root` in code. Click the "Create Project" button. RStudio will download the repository from GitHub into the folder you specified. In this folder, you should see the following folder structure:
- (project_root)
  - BayesDLR
    - analyses
    - code
    - data
    - results
Related data and analyses for a research project are grouped together in subfolders, specified by the variable `data_subfolder`. There is a project for demonstration entitled `"example_1"` that is built into the repository. In the `\BayesDLR\data` subfolder, you'll see an `"example_1"` subfolder: inside this is a `"raw_data"` subfolder with two example datasets in .CSV format:

```
BayesDLR\data\example_1\raw_data\
  example_1_component_v1_240110.csv
  example_1_component_v2_240110.csv
```
Once raw data has been put in the `\BayesDLR\data\(project name)\raw_data` subfolder, go to the `\BayesDLR` folder and open `1_configure_analysis.r`.

- Set `data_subfolder` to (project name).
  - For the first example, `data_subfolder <- "example_1"`.
- Set `input_data_file` to the file name that will be used as input. If the input is in an Excel format (.xls or .xlsx), specify the sheet to be loaded using `input_data_sheet`.
  - For the first example, `input_data_file <- "example_1_component_v1_240110.csv"` - since this is a .csv and not an Excel format, `input_data_sheet` is ignored.
- Set the data version name (`data_version_name`): this is a shortened name that is used to name processed data and analyses. Since this helps link the raw file with the processed data and analyses, make sure this label is descriptive.
  - For the first example, `data_version_name <- "V1-240110"` for Version 1 of the data, created on 2024-01-10.
- Set the threshold for relative sample sizes ($r_k$): for more information on this, see the role of sample sizes in the Beta-Binomial model below.
  - For the first example, `r_k_threshold <- 0.10` - the threshold is 10%.
- Set the names of the columns in the input file: this includes the `row_id`, a label for the data in each row, and the numerator and denominator for cases and controls.
  - For the first example:

    ```r
    row_id <- "event"
    case_denominator <- "n_1"
    case_count <- "y_1"
    control_denominator <- "n_0"
    control_count <- "y_0"
    ```

- (Optional) Set a name for the variable which flags data for inclusion, and the values indicating inclusion. If the data do not have such a flag, set these to `NULL`.
  - For the first example, `flag_column <- "flag"` and `flags_include <- "include"` - any rows where `flag` does not match `"include"` are removed from the analytic data.
- Specify random number generator (RNG) seeds: this allows you to reproduce the same random number streams used in an earlier analysis. If set to `NULL`, a new RNG seed is generated from the current state of the RNG and saved so analyses can be reproduced.
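Putting the `example_1` values from these steps together, a filled-in configuration might look like the following sketch. The `rng_seed` variable name is illustrative; the seed variable(s) in `1_configure_analysis.r` may be named differently.

```r
data_subfolder      <- "example_1"
input_data_file     <- "example_1_component_v1_240110.csv"
data_version_name   <- "V1-240110" # Version 1, created 2024-01-10
r_k_threshold       <- 0.10        # 10% threshold on relative sample sizes

# Column names in the raw data
row_id              <- "event"
case_denominator    <- "n_1"
case_count          <- "y_1"
control_denominator <- "n_0"
control_count       <- "y_0"

# Inclusion flag: rows where `flag` does not match "include" are dropped
flag_column         <- "flag"
flags_include       <- "include"

# NULL: generate a new RNG seed and save it for reproducibility
rng_seed            <- NULL
```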
Once these parameters have been set, run the code. If successful, you should see a message similar to:

```
Successfully wrote V1-240110_r_k_10_pct.yml to ..\BayesDLR\analyses\example_1.
Proceed to 2_run_analysis.R
```

This indicates that the configuration file `V1-240110_r_k_10_pct.yml` was created: this is a plain-text file that the workflow uses to run and summarize analyses.
Once the configuration file is created for the analysis, open `2_run_analysis.r`. Set the same values of `data_subfolder`, `data_version_name`, and `r_k_threshold` supplied earlier: this tells the workflow the path to the configuration file: `..\BayesDLR\analyses\(data_subfolder)\(data_version_name)_r_k_(r_k_threshold)_pct.yml`. This file contains all the information needed to carry out the analysis, so it does not need to be re-entered.
By default, the workflow will stop if errors are identified in data marked for inclusion (`stop_on_data_error <- TRUE`): if you identify any issues in the data, they should be checked before proceeding. To proceed with analyses after excluding erroneous data, make sure `stop_on_data_error <- FALSE` is uncommented.
The configuration file is used to read in the data, select the relevant columns, check for errors, and write out the processed data to the `..\BayesDLR\data\(data_subfolder)\processed_data` subfolder. The files will be named `(data_subfolder)-(data_version_name)_processed_v0.0.Rdata` (`Rdata` file) and `(data_subfolder)-(data_version_name)_processed_v0.0.xlsx` (Excel format). In the example analysis, this becomes `example_1-V1-240110_processed_v0.0.Rdata` and `example_1-V1-240110_processed_v0.0.xlsx`.
If any errors are encountered in the data, the workflow will either stop or issue a warning, depending on `stop_on_data_error`. The example dataset gives:

```
Warning message:
Data entry errors detected in input file.
File: ../BayesDLR/data/example_1/raw_data/example_1_component_v1_240110.csv
Rows: 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13
REVIEW DATA CAREFULLY BEFORE PROCEEDING.
```

This indicates that the data were successfully processed, but errors were found. These can be seen in the `4. Invalid data` tab of the Excel output.
Analysis results are stored in `../BayesDLR/results/(data_subfolder)` in R and Excel files named with `data_subfolder`, `data_version_name`, and `r_k_threshold`.
There is a second version of the `example_1` dataset in .csv format without the errors: it is marked `v2` for version 2. Try steps 2-3 for this new file. There is also an Excel-format dataset for `example_1`: try steps 2-3 for the Excel spreadsheet. There is also an `example_2` in .csv and .xlsx formats: try steps 2-3 on these files as well.

When you are ready to use your own data, rename the `new_project` folder and follow steps 1-3 to analyze it.
The prevalence ratio of exposure given disease status is a measure of association comparing the prevalence of an exposure in a population with a condition to the prevalence of an exposure in a population without that condition.
This can be obtained using a 2x2 table:

| Variant | Cases | Controls | Total |
|---|---|---|---|
| Variant | | | |
| Other Variants | | | |
| Total | | | |
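The prevalence ratio can be computed directly from the cell counts of such a table. A minimal sketch using the example column conventions (`y_1`/`n_1` for cases, `y_0`/`n_0` for controls); the function name is illustrative, not part of the workflow:

```r
# Prevalence ratio: P(variant | case) / P(variant | control)
# y_1, n_1: variant count and total among cases
# y_0, n_0: variant count and total among controls
prevalence_ratio <- function(y_1, n_1, y_0, n_0){
  (y_1/n_1)/(y_0/n_0)
}

# Example: variant seen in 30 of 200 cases vs. 10 of 400 controls,
# i.e. (30/200) / (10/400) = 0.15 / 0.025 = 6
prevalence_ratio(y_1 = 30, n_1 = 200, y_0 = 10, n_0 = 400)
```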
When we have a fixed sample size of
where
Note that since
Let
Parameterizing the model this way allows modeling of multiple variants, allowing for variation in sample sizes across different variants, as long as their ratio of sample sizes for any given variant
Since
So we infer about
If the relative sample sizes are all equal, this transformation is identical for all observations. When the relative sample sizes vary, the transformation is not the same for all observations. This is why observations with very different relative sample sizes are excluded from analyses.
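As an illustration of why such exclusions matter, the following sketch flags rows whose case-to-control sample-size ratio deviates from the median ratio by more than the threshold. This is a simplified stand-in for the workflow's actual $r_k$ screening, not its exact implementation:

```r
# Illustrative only: the workflow's exact r_k definition may differ.
r_k_threshold <- 0.10

example_data <- data.frame(
  event = paste0("variant_", 1:5),
  n_1 = c(200, 205, 198, 120, 201), # case sample sizes
  n_0 = c(400, 410, 395, 400, 402)  # control sample sizes
)

# Case:control sample-size ratio per row
r_k <- example_data$n_1 / example_data$n_0

# Relative deviation from the typical (median) ratio
relative_deviation <- abs(r_k / median(r_k) - 1)

# Rows kept for analysis: variant_4 (ratio 0.3 vs. median 0.5) is excluded
kept <- example_data[relative_deviation <= r_k_threshold, ]
```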
The full output is written to an .Rdata file, with detailed output in an .xlsx file. Users should check the model parameters before attempting to interpret model output. Particular attention should be paid to the two-component mixture model: unusually large values of the M-parameter (the 'prior sample size') or extremely small values of epsilon (the mixing coefficient) could indicate problems with model fitting. Parameters are constrained by default to be between -30 and 30 on the logit or log scale, except the mixing coefficient, which is constrained between -30 and 0 (i.e. below 50%). Constraints can be reset by the user. Users should check that parameters do not converge on the boundary of the parameter space.
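A simple way to check for boundary convergence is to compare each estimate against its bounds. The sketch below uses hypothetical parameter names and the default bounds described above; it is not part of the workflow:

```r
# Flag estimates that converged at or within `tol` of a bound
near_boundary <- function(estimates, lower, upper, tol = 1e-4){
  (estimates - lower < tol) | (upper - estimates < tol)
}

# Hypothetical fitted values on the log/logit scale; the mixing
# coefficient (logit scale) is bounded above by 0 (i.e. below 50%)
estimates <- c(logit_mu = 1.2, log_m = 29.99999, logit_epsilon = -12.5)
lower <- c(-30, -30, -30)
upper <- c( 30,  30,   0)

# log_m is flagged: it sits essentially on the upper bound of 30
near_boundary(estimates, lower, upper)
```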
Since the two-component model is fit from a grid of starting values, analysts should look to see whether there are large differences in model fits based on the starting values, and consider whether different starting values may be more appropriate.
Maximum marginal likelihood results can also be compared to a full Bayesian model fit using Stan.