Commit aed27f1 (merge of 6c2ba9a and 0db7bd6)
mmcdermott committed Sep 9, 2024
Showing 60 changed files with 745 additions and 642 deletions.
.github/workflows/publish-to-pypi.yml (2 changes: 1 addition & 1 deletion)
@@ -36,7 +36,7 @@ jobs:
runs-on: ubuntu-latest
environment:
name: pypi
-url: https://pypi.org/p/<package-name> # Replace <package-name> with your PyPI project name
+url: https://pypi.org/p/meds-tab # Replace <package-name> with your PyPI project name
permissions:
id-token: write # IMPORTANT: mandatory for trusted publishing

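For orientation, a complete trusted-publishing job in this shape might read as the following sketch; the artifact-download step and action versions are assumptions, not part of this diff:

```yaml
# A sketch only: everything outside the environment/permissions lines
# shown in the diff above is assumed.
jobs:
  publish:
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/meds-tab
    permissions:
      id-token: write # mandatory for trusted publishing
    steps:
      - name: Download the built distributions
        uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
```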
.github/workflows/tests.yaml (6 changes: 4 additions & 2 deletions)
@@ -12,17 +12,19 @@ jobs:

strategy:
fail-fast: false
+matrix:
+python-version: ["3.11", "3.12"]

timeout-minutes: 30

steps:
- name: Checkout
uses: actions/checkout@v4

-- name: Set up Python
+- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: "3.11"
python-version: ${{ matrix.python-version }}

- name: Install packages
run: |
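Reassembled, the matrix portion of this job after the change reads as follows (a reconstruction from the hunk above; the install step and anything below it is elided):

```yaml
strategy:
  fail-fast: false
  matrix:
    python-version: ["3.11", "3.12"]

timeout-minutes: 30

steps:
  - name: Checkout
    uses: actions/checkout@v4

  - name: Set up Python ${{ matrix.python-version }}
    uses: actions/setup-python@v5
    with:
      python-version: ${{ matrix.python-version }}
```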
README.md (36 changes: 18 additions & 18 deletions)
@@ -84,12 +84,12 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
# Re-shard pipeline
-# $MIMICIV_MEDS_DIR is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
+# $MIMICIV_input_dir is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
# $MEDS_TAB_COHORT_DIR is the directory where the re-sharded MEDS dataset will be stored, and where your model
# will store cached files during processing by default.
# $N_PATIENTS_PER_SHARD is the number of patients per shard you want to use.
MEDS_transform-reshard_to_split \
input_dir="$MIMICIV_MEDS_DIR" \
input_dir="$MIMICIV_input_dir" \
cohort_dir="$MEDS_TAB_COHORT_DIR" \
'stages=["reshard_to_split"]' \
stage="reshard_to_split" \
@@ -103,14 +103,14 @@ By following these steps, you can seamlessly transform your dataset, define nece
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.
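A minimal invocation of this step under the renamed arguments, with placeholder paths, might be:

```console
meds-tab-describe input_dir="path_to_data" output_dir="path_to_output"
```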

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with the minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -127,19 +127,19 @@ By following these steps, you can seamlessly transform your dataset, define nece
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.
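That labeled dataset is just parquet shards with those three columns; a minimal polars sketch of writing one such shard (the file layout and values here are hypothetical; real labels would come from ACES) is:

```python
import datetime

import polars as pl

# Hypothetical labels in the (subject_id, timestamp, label) schema
# that meds-tab-cache-task expects.
labels = pl.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "timestamp": [
            datetime.datetime(2024, 1, 1),
            datetime.datetime(2024, 3, 5),
            datetime.datetime(2024, 2, 2),
        ],
        "label": [0, 1, 0],
    }
)
labels.write_parquet("path_to_labels/train/0.parquet")  # hypothetical path
```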

**Example: Align tabularized data** for a specific task `$TASK` and labels that have been pulled from [ACES](https://github.com/justin13601/ACES)

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -151,7 +151,7 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
@@ -436,7 +436,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -506,7 +506,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \
@@ -529,14 +529,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-+model_params.model.eta: tag(log, interval(0.001, 1))
-+model_params.model.lambda: tag(log, interval(0.001, 1))
-+model_params.model.alpha: tag(log, interval(0.001, 1))
-+model_params.model.subsample: interval(0.5, 1)
-+model_params.model.min_child_weight: interval(1e-2, 100)
-+model_params.model.max_depth: range(2, 16)
-model_params.num_boost_round: range(100, 1000)
-model_params.early_stopping_rounds: range(1, 10)
+model.eta: tag(log, interval(0.001, 1))
+model.lambda: tag(log, interval(0.001, 1))
+model.alpha: tag(log, interval(0.001, 1))
+model.subsample: interval(0.5, 1)
+model.min_child_weight: interval(1e-2, 100)
+model.max_depth: range(2, 16)
+num_boost_round: range(100, 1000)
+early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```
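Given these flattened names, a single value can presumably also be pinned from the command line rather than swept; a sketch with illustrative values, other required paths as in the earlier examples:

```console
meds-tab-xgboost \
  input_dir="path_to_data" \
  output_dir="output_directory" \
  task_name=$TASK \
  model.eta=0.1 \
  num_boost_round=500
```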

docs/source/overview.md (12 changes: 6 additions & 6 deletions)
@@ -38,14 +38,14 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with the minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -62,19 +62,19 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.

**Example: Align tabularized data** for a specific task `$TASK` and labels that have been pulled from [ACES](https://github.com/justin13601/ACES)

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -86,7 +86,7 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
docs/source/prediction.md (20 changes: 10 additions & 10 deletions)
@@ -14,7 +14,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -84,7 +84,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-permutations [1d,30d,365d,full]) \
@@ -107,14 +107,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-+model_params.model.eta: tag(log, interval(0.001, 1))
-+model_params.model.lambda: tag(log, interval(0.001, 1))
-+model_params.model.alpha: tag(log, interval(0.001, 1))
-+model_params.model.subsample: interval(0.5, 1)
-+model_params.model.min_child_weight: interval(1e-2, 100)
-+model_params.model.max_depth: range(2, 16)
-model_params.num_boost_round: range(100, 1000)
-model_params.early_stopping_rounds: range(1, 10)
+model.eta: tag(log, interval(0.001, 1))
+model.lambda: tag(log, interval(0.001, 1))
+model.alpha: tag(log, interval(0.001, 1))
+model.subsample: interval(0.5, 1)
+model.min_child_weight: interval(1e-2, 100)
+model.max_depth: range(2, 16)
+num_boost_round: range(100, 1000)
+early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```

pyproject.toml (4 changes: 3 additions & 1 deletion)
@@ -15,10 +15,12 @@ classifiers = [
"Operating System :: OS Independent",
]
dependencies = [
"polars", "pyarrow", "loguru", "hydra-core==1.3.2", "numpy", "scipy<1.14.0", "pandas", "tqdm", "xgboost",
"polars==1.6.0", "pyarrow", "loguru", "hydra-core==1.3.2", "numpy", "scipy<1.14.0", "pandas", "tqdm", "xgboost",
"scikit-learn", "hydra-optuna-sweeper", "hydra-joblib-launcher", "ml-mixins", "meds==0.3.3", "meds-transforms==0.0.7",
]

+[tool.setuptools_scm]

[project.scripts]
meds-tab-describe = "MEDS_tabular_automl.scripts.describe_codes:main"
meds-tab-tabularize-static = "MEDS_tabular_automl.scripts.tabularize_static:main"
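With `polars` now pinned exactly, a quick post-install check is cheap; a sketch, assuming the `meds-tab` PyPI name from the publish workflow above:

```console
pip install meds-tab
python -c "import polars; print(polars.__version__)"  # should print 1.6.0
```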
src/MEDS_tabular_automl/base_model.py (2 changes: 2 additions & 0 deletions)
@@ -9,6 +9,8 @@


class BaseModel(ABC, TimeableMixin):
"""Defines the interface for a model that can be trained and evaluated via the launch_model script."""

@abstractmethod
def __init__(self):
pass
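Only the abstract `__init__` is visible in this hunk; as a heavily hedged sketch, the smallest subclass of this interface (any further abstract training or evaluation methods are not shown here) would be:

```python
from MEDS_tabular_automl.base_model import BaseModel


class NoopModel(BaseModel):
    """Toy subclass sketch; a real model would also implement whatever
    training/evaluation methods the launch_model script relies on, which
    are not visible in this hunk."""

    def __init__(self):
        super().__init__()
```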
src/MEDS_tabular_automl/configs/default.yaml (8 changes: 4 additions & 4 deletions)
@@ -1,13 +1,13 @@
-MEDS_cohort_dir: ???
-output_cohort_dir: ???
+input_dir: ???
+output_dir: ???
do_overwrite: False
seed: 1
tqdm: False
worker: 0
loguru_init: False

-log_dir: ${output_cohort_dir}/.logs/
-cache_dir: ${output_cohort_dir}/.cache
+log_dir: ${output_dir}/.logs/
+cache_dir: ${output_dir}/.cache

hydra:
verbose: False
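Since every entry point layers on this config, the renamed keys and derived directories can be overridden uniformly; a sketch with placeholder paths (the cache override is illustrative, not from this diff):

```console
# Hypothetical invocation relocating the derived cache directory.
meds-tab-tabularize-static \
  input_dir="path_to_data" \
  output_dir="path_to_output" \
  cache_dir="/tmp/meds_tab_cache" \
  do_overwrite=True
```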
src/MEDS_tabular_automl/configs/describe_codes.yaml (3 changes: 1 addition & 2 deletions)
@@ -2,8 +2,7 @@ defaults:
- default
- _self_

-input_dir: ${output_cohort_dir}/data
# Where to store output code frequency data
-output_filepath: ${output_cohort_dir}/metadata/codes.parquet
+output_filepath: ${output_dir}/metadata/codes.parquet

name: describe_codes
src/MEDS_tabular_automl/configs/launch_autogluon.yaml (28 changes: 0 additions & 28 deletions)

This file was deleted.

src/MEDS_tabular_automl/configs/launch_model.yaml (41 changes: 16 additions & 25 deletions)
@@ -1,39 +1,30 @@
defaults:
+- _self_
- default
- tabularization: default
-- model: xgboost # This can be changed to sgd_classifier or any other model
- imputer: default
- normalization: default
-- override hydra/callbacks: evaluation_callback
+- model_launcher: xgboost
- override hydra/sweeper: optuna
- override hydra/sweeper/sampler: tpe
+- override hydra/callbacks: evaluation_callback
- override hydra/launcher: joblib
-- _self_

-task_name: task
+task_name: ???

-# Task cached data dir
-input_dir: ${output_cohort_dir}/${task_name}/task_cache
-# Directory with task labels
-input_label_dir: ${output_cohort_dir}/${task_name}/labels/
+# Location of task, split, and shard specific tabularized data
+input_tabularized_cache_dir: ${output_dir}/${task_name}/task_cache
+# Location of task, split, and shard specific label data
+input_label_cache_dir: ${output_dir}/${task_name}/labels
# Where to output the model and cached data
-model_saving:
-model_dir: ${output_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
-model_file_stem: model
-model_file_extension: .json
-delete_below_top_k: -1
-model_logging:
-model_log_dir: ${model_saving.model_dir}/.logs/
-performance_log_stem: performance
-config_log_stem: config
+output_model_dir: ???

+delete_below_top_k: -1

name: launch_model

hydra:
verbose: False
+job:
+name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S}
sweep:
-dir: ${model_log_dir}
+dir: ${output_model_dir}/sweeps/{now:%Y-%m-%d-%H-%M-%S}/
+subdir: "1"
run:
-dir: ${model_log_dir}
+dir: ${path.model_log_dir}
sweeper:
direction: "maximize"