Simulation code for Longitudinal TMLE (LTMLE) with multi-valued treatments.
Please cite the following paper if you use this repo:
```bibtex
@article{doi:10.1002/sim.10003,
  title={Targeted learning in observational studies with multi-valued treatments: An evaluation of antipsychotic drug treatment safety},
  author={Poulos, Jason and Horvitz-Lennon, Marcela and Zelevinsky, Katya and Cristea-Platon, Tudor and Huijskens, Thomas and Tyagi, Pooja and Yan, Jiaju and Diaz, Jordi and Normand, Sharon-Lise},
  journal={Statistics in Medicine},
  year={2024},
  publisher={Wiley Online Library}
}
```
- R (tested on 4.0.1, compiled with GCC 6.2.0)
- Required R packages are listed in `package_list.R`
- The output of `sessionInfo()` is in `session_info.txt`
- To use `'tmle-lstm'` as the estimator: R (4.3.3), Python 3 (3.10.12), and TensorFlow (2.18.0)
```bash
# Create virtual environment within directory
cd multi-ltmle

# Manually set up a virtual environment
python3.10 -m ensurepip --upgrade    # Ensure pip is installed/upgraded
python3.10 -m pip install virtualenv # Install virtualenv if not already available
virtualenv myenv                     # Create a virtual environment named 'myenv'
source myenv/bin/activate            # Activate the new environment

# Install TensorFlow
python3 -m pip install --upgrade pip # Upgrade pip to the latest version

# For GPU users:
pip install tensorflow[and-cuda]

# For CPU users:
pip install tensorflow

# Verify the GPU installation:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Verify the CPU installation:
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
```
- The following Python packages are required: numpy (tested on 1.19.5), pandas (1.1.5), and wandb (0.15.12). Install the latter two with:

  ```bash
  python3 -m pip install pandas
  python3 -m pip install wandb
  ```
- Additional R packages listed in `package_list.R` are required. Make sure to install these packages in the virtual environment where Python 3 and TensorFlow are installed.
- Ensure that the path in `use_python()` in `simulation.R` corresponds to the virtual environment's Python path, which you can find with `which python`.
- Test with R and reticulate: load the `reticulate` library and bind to the shared Python:

  ```r
  library(reticulate)
  use_python("./myenv/bin/python", required = TRUE)
  py_config()
  ```
- Important Notes:
  - If the shared library is still not found, adjust `LD_LIBRARY_PATH` as required: `export LD_LIBRARY_PATH=/path/to/libpython3.10.so:$LD_LIBRARY_PATH`
  - For consistent performance, ensure all dependencies align with the Python version used in your virtual environment.
- Installs required R packages (a sketch of how the flags gate the installs follows).
  - `doMPI`: Logical flag. When `TRUE`, installs packages required for MPI parallel processing (defaults to `FALSE`).
  - `keras`: Logical flag. When `TRUE`, installs packages required for Keras/TensorFlow (defaults to `TRUE`).
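A minimal sketch of how such flag-gated installs typically work; the specific package names below are illustrative placeholders, not the repo's full list:

```r
## Illustrative only: flag-gated package installation in the spirit of package_list.R
doMPI <- FALSE   # set TRUE to also install MPI packages
keras <- TRUE    # set TRUE to also install Keras/TensorFlow packages

pkgs <- c("SuperLearner", "simcausal", "data.table")          # core packages (placeholders)
if (doMPI) pkgs <- c(pkgs, "Rmpi", "doMPI")
if (keras) pkgs <- c(pkgs, "reticulate", "keras", "tensorflow")

new_pkgs <- setdiff(pkgs, rownames(installed.packages()))     # skip packages already present
if (length(new_pkgs) > 0) install.packages(new_pkgs)
```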
- Includes miscellaneous functions:
  - Bounding predicted probabilities (see the sketch below).
  - Generating various distributions.
  - A function for forest plot visualization.
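For example, bounding predicted probabilities usually truncates them to an interval such as the one supplied via `gbound`/`ybound`. A minimal sketch, with a hypothetical helper name and illustrative default bounds:

```r
## Hypothetical helper, not necessarily the function defined in the repo
bound_probs <- function(p, bounds = c(0.025, 0.975)) {
  pmin(pmax(p, min(bounds)), max(bounds))   # clamp probabilities to [min, max] of bounds
}

bound_probs(c(0.001, 0.5, 0.999))           # returns 0.025 0.500 0.975
```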
- Defines distribution functions for the `simcausal` package.
- Defines the data-generating process for the `simcausal` package.
- Defines treatment rule functions used for Targeted Maximum Likelihood Estimation (TMLE); a sketch of example static, dynamic, and stochastic rules follows this entry.
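For orientation, a minimal sketch of what static, dynamic, and stochastic rules can look like for a multi-valued treatment with `J = 6` levels; the function names and the covariate `L` below are illustrative assumptions, not the definitions used in the repo:

```r
## Illustrative rule functions for a multi-valued treatment (J = 6 levels)
J <- 6

static_rule <- function(data) {
  rep(1L, nrow(data))                                       # always assign treatment level 1
}

dynamic_rule <- function(data) {
  ifelse(data$L > 0, 2L, 1L)                                # switch level based on a covariate L
}

stochastic_rule <- function(data, probs = rep(1 / J, J)) {
  sample.int(J, nrow(data), replace = TRUE, prob = probs)   # draw a level at random
}

toy <- data.frame(L = rnorm(5))
cbind(static = static_rule(toy), dynamic = dynamic_rule(toy), stochastic = stochastic_rule(toy))
```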
- Implements influence curve (IC)-based methods for variance estimation in TMLE (see the sketch below).
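IC-based inference typically turns the estimated influence curve into a standard error and a Wald-type confidence interval. A minimal illustrative sketch, not the repo's implementation:

```r
## Standard error from the empirical influence curve: sd(IC) / sqrt(n)
ic_inference <- function(ic, psi_hat, alpha = 0.05) {
  se <- sqrt(var(ic) / length(ic))
  z  <- qnorm(1 - alpha / 2)
  c(estimate = psi_hat, se = se, lower = psi_hat - z * se, upper = psi_hat + z * se)
}
```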
- Provides custom learner definitions for use with the SuperLearner framework.
- TMLE implementation for incorporating Long Short-Term Memory (LSTM) neural networks into the TMLE workflow.
- Facilitates integration between R and Python for LSTM-based estimation. Used when `estimator='tmle-lstm'`.
- Python script for training LSTMs and predicting on the same dataset.
- Python script for inference using a trained LSTM model on new datasets.
- Simulates longitudinal data for comparing the performance of multinomial TMLE with LSTM-based approaches.
- Key parameters:
  - `estimator`: Choose between `'tmle'` or `'tmle-lstm'`.
  - `treatment.rule`: Specify `"static"`, `"dynamic"`, `"stochastic"`, or `"all"`.
  - `gbound`/`ybound`: Bounds for propensity scores and initial predictions, respectively.
  - `J`: Number of treatments (`J=6`).
  - `n`: Sample size (default is `12500`).
  - `t.end`: Number of time periods (must be between `4` and `36`).
  - `R`: Number of simulation runs (default is `325`).
  - `target.gwt`: Logical flag to adjust weights in the clever covariate (default is `TRUE`; a toy sketch of this option follows this list).
  - `use.SL`: Logical flag to enable Super Learner (default is `TRUE`).
  - `scale.continuous`: Logical flag for scaling continuous variables.
  - `n.folds`: Number of cross-validation folds for Super Learner (default is `5`).
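To make the `target.gwt` option concrete, here is a toy sketch of the usual distinction in the TMLE targeting step: when `target.gwt = TRUE` the inverse-probability term enters as a regression weight, and when `FALSE` it stays inside the clever covariate. All data and variable names below are simulated placeholders, not the repo's code:

```r
set.seed(1)
n <- 500
q_init  <- runif(n, 0.05, 0.95)                  # initial outcome predictions (toy values)
g_bound <- pmin(pmax(runif(n), 0.025), 0.975)    # bounded probability of following the rule
h <- rbinom(n, 1, 0.5)                           # indicator that observed treatment follows the rule
y <- rbinom(n, 1, q_init)                        # outcome (toy)
target.gwt <- TRUE

if (target.gwt) {                                # 1/g enters as a regression weight
  fit <- glm(y ~ -1 + h + offset(qlogis(q_init)),
             weights = h / g_bound, family = quasibinomial())
  q_star <- plogis(qlogis(q_init) + coef(fit)["h"] * h)
} else {                                         # 1/g stays inside the clever covariate
  clever <- h / g_bound
  fit <- glm(y ~ -1 + clever + offset(qlogis(q_init)), family = quasibinomial())
  q_star <- plogis(qlogis(q_init) + coef(fit)["clever"] * clever)
}
```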
- Aggregates and visualizes the output of `simulation.R`. Includes:
  - Counterfactual risk estimates.
  - Bias, coverage, and confidence interval widths (see the sketch below).
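The bias, coverage, and interval-width summaries are standard Monte Carlo quantities; a hedged sketch of how they are typically computed across simulation runs (the argument names are assumptions about what the saved results contain):

```r
## Illustrative summaries over R simulation runs; not the repo's actual plotting code
summarize_runs <- function(est, se, truth, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  lo <- est - z * se
  hi <- est + z * se
  c(bias     = mean(est - truth),                  # average estimation error
    coverage = mean(lo <= truth & truth <= hi),    # proportion of CIs covering the truth
    ci_width = mean(hi - lo))                      # average confidence interval width
}

summarize_runs(est = rnorm(325, 0.20, 0.02), se = rep(0.02, 325), truth = 0.20)
```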
Below is a list of files that require modification to match your environment or particular settings. Please make these changes before running the code.
- Update the Python path used by `use_python()`. Modify it to point to the Python interpreter you wish to use. For example: `use_python("./myenv/bin/python", required = TRUE)`
- If using a GPU, configure the GPU settings by updating the relevant CUDA paths to match your environment. The default settings are:

  ```python
  cuda_base = "/n/app/cuda/12.1-gcc-9.2.0"
  os.environ.update({
      'CUDA_HOME': cuda_base,
      'CUDA_ROOT': cuda_base,
      'CUDA_PATH': cuda_base,
      'CUDNN_PATH': f"{cuda_base}/lib64/libcudnn.so",
      'LD_LIBRARY_PATH': f"{cuda_base}/lib64:{cuda_base}/extras/CUPTI/lib64:{os.environ.get('LD_LIBRARY_PATH', '')}",
      'PATH': f"{cuda_base}/bin:{os.environ.get('PATH', '')}",
      'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
      'CUDA_VISIBLE_DEVICES': '0,1',
      'TF_FORCE_GPU_ALLOW_GROWTH': 'true',
      'TF_XLA_FLAGS': '--tf_xla_enable_xla_devices',
      'XLA_FLAGS': f'--xla_gpu_cuda_data_dir={cuda_base}',
      'TF_GPU_THREAD_MODE': 'gpu_private',
      'TF_GPU_THREAD_COUNT': '2',
      'TF_CPP_MIN_LOG_LEVEL': '3'
  })
  ```
- If no GPU is used, the script will automatically fall back to CPU-based execution. No additional changes are required in this case.
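A quick way to confirm from R which device TensorFlow will use is through `reticulate`; a minimal sketch, assuming the example virtualenv path used in this README:

```r
## Optional device check; './myenv/bin/python' is the example path from the setup steps above
library(reticulate)
use_python("./myenv/bin/python", required = TRUE)
tf <- import("tensorflow")
gpus <- tf$config$list_physical_devices("GPU")
if (length(gpus) == 0) message("No GPU visible; TensorFlow will run on CPU.") else print(gpus)
```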
- Run the following command to install the required R packages: `Rscript package_list.R`
- Follow the Python installation instructions provided in the Prerequisites section.
- Ensure you start R within the virtual environment where Python 3 and TensorFlow are installed.
To execute simulations, use the following command:

```bash
Rscript simulation.R [arg1] [arg2] [arg3] [arg4]
```

- `[arg1]`: Specifies the estimator. Options:
  - `"tmle"`: Targeted Maximum Likelihood Estimation.
  - `"tmle-lstm"`: Targeted Maximum Likelihood Estimation with Long Short-Term Memory.
- `[arg2]`: A numeric value indicating the treatment rule:
  - `1`: Use all treatment rules.
- `[arg3]`: Logical flag to indicate whether super learner estimation should be used: `"TRUE"` or `"FALSE"`.
- `[arg4]`: Logical flag for enabling MPI parallel programming: `"TRUE"` or `"FALSE"`.
- Using the `"tmle"` estimator with super learner enabled and no MPI: `Rscript simulation.R 'tmle' 1 'TRUE' 'FALSE'`
- Using the `"tmle-lstm"` estimator without super learner or MPI: `Rscript simulation.R 'tmle-lstm' 1 'FALSE' 'FALSE'`
To visualize simulation results, use the following command:

```bash
Rscript long_sim_plots.R [arg1]
```

- `[arg1]`: Path to the output directory containing the simulation results.

For example, to plot results from simulations saved in the `outputs/20240215` directory:

```bash
Rscript long_sim_plots.R 'outputs/20240215'
```
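Before plotting, a saved results file can also be inspected directly in R; a minimal sketch using the TMLE example file listed under Simulation Results below (no assumptions are made about the object's structure):

```r
## Inspect the top-level structure of a saved results object
res <- readRDS(file.path("ex_outputs",
  "longitudinal_simulation_results_estimator_tmle_treatment_rule_all_R_325_n_12500_J_6_n_folds_5_scale_continuous_FALSE_use_SL_TRUE.rds"))
str(res, max.level = 1)
```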
This section outlines the model weights, intermediate results, and visualizations available for analysis and evaluation. The results pertain to a single simulated longitudinal dataset (`r=1`) for 12,500 patients, estimating counterfactual diabetes risk under three regimes: static, dynamic, and stochastic. Data and results are saved in the `ex_outputs/` directory.
- Simulated Dataset
  - Long format dataset: `tmle_dat_long_R_1_n_12500_J_6.rds`
- Simulation Results
  - TMLE simulation: `longitudinal_simulation_results_estimator_tmle_treatment_rule_all_R_325_n_12500_J_6_n_folds_5_scale_continuous_FALSE_use_SL_TRUE.rds`
  - RNN-based model simulation: `longitudinal_simulation_results_estimator_tmle_treatment_rule_all_R_325_n_12500_J_6_n_folds_5_scale_continuous_FALSE_use_SL_TRUE.rds`
- Validation Predictions
  - Multiple binary and categorical treatments:
    - Binary `C` predictions: `lstm_bin_C_preds.npy`, `lstm_bin_C_preds_info.npz`
    - Categorical `A` predictions: `lstm_cat_A_preds.npy`, `lstm_cat_A_preds_info.npz`
    - Binary `A` predictions: `lstm_bin_A_preds.npy`, `lstm_bin_A_preds_info.npz`
- Test Predictions
  - Binary and categorical treatments:
    - Binary `C` predictions: `test_bin_C_preds.npy`, `test_bin_C_preds_info.npz`
    - Categorical `A` predictions: `test_cat_A_preds.npy`, `test_cat_A_preds_info.npz`
    - Binary `A` predictions: `test_bin_A_preds.npy`, `test_bin_A_preds_info.npz`
- Model Weights
  - Weights for RNN models with multiple binary and categorical treatments:
    - `lstm_bin_A_model.h5`
    - `lstm_cat_A_model.h5`
    - `lstm_bin_C_model.h5`
- Descriptive Plots