Commit aed27f1 (merge of 6c2ba9a and 0db7bd6)
mmcdermott committed Sep 9, 2024
Showing 60 changed files with 745 additions and 642 deletions.
.github/workflows/publish-to-pypi.yml (2 changes: 1 addition & 1 deletion)
@@ -36,7 +36,7 @@ jobs:
runs-on: ubuntu-latest
environment:
name: pypi
-url: https://pypi.org/p/<package-name> # Replace <package-name> with your PyPI project name
+url: https://pypi.org/p/meds-tab # Replace <package-name> with your PyPI project name
permissions:
id-token: write # IMPORTANT: mandatory for trusted publishing

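For orientation, a complete trusted-publishing job in this shape might read as the following sketch; the artifact-download step and action versions are assumptions, not part of this diff:

```yaml
# A sketch only: everything outside the environment/permissions lines
# shown in the diff above is assumed.
jobs:
  publish:
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/meds-tab
    permissions:
      id-token: write # mandatory for trusted publishing
    steps:
      - name: Download the built distributions
        uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
```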
.github/workflows/tests.yaml (6 changes: 4 additions & 2 deletions)
@@ -12,17 +12,19 @@ jobs:

strategy:
fail-fast: false
+matrix:
+python-version: ["3.11", "3.12"]

timeout-minutes: 30

steps:
- name: Checkout
uses: actions/checkout@v4

-- name: Set up Python
+- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: "3.11"
python-version: ${{ matrix.python-version }}

- name: Install packages
run: |
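Reassembled, the matrix portion of this job after the change reads as follows (a reconstruction from the hunk above; the install step and anything below it is elided):

```yaml
strategy:
  fail-fast: false
  matrix:
    python-version: ["3.11", "3.12"]

timeout-minutes: 30

steps:
  - name: Checkout
    uses: actions/checkout@v4

  - name: Set up Python ${{ matrix.python-version }}
    uses: actions/setup-python@v5
    with:
      python-version: ${{ matrix.python-version }}
```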
README.md (36 changes: 18 additions & 18 deletions)
@@ -84,12 +84,12 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
# Re-shard pipeline
-# $MIMICIV_MEDS_DIR is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
+# $MIMICIV_input_dir is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
# $MEDS_TAB_COHORT_DIR is the directory where the re-sharded MEDS dataset will be stored, and where your model
# will store cached files during processing by default.
# $N_PATIENTS_PER_SHARD is the number of patients per shard you want to use.
MEDS_transform-reshard_to_split \
input_dir="$MIMICIV_MEDS_DIR" \
input_dir="$MIMICIV_input_dir" \
cohort_dir="$MEDS_TAB_COHORT_DIR" \
'stages=["reshard_to_split"]' \
stage="reshard_to_split" \
@@ -103,14 +103,14 @@ By following these steps, you can seamlessly transform your dataset, define nece
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.
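A minimal invocation of this step under the renamed arguments, with placeholder paths, might be:

```console
meds-tab-describe input_dir="path_to_data" output_dir="path_to_output"
```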

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with the minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -127,19 +127,19 @@ By following these steps, you can seamlessly transform your dataset, define nece
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.
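That labeled dataset is just parquet shards with those three columns; a minimal polars sketch of writing one such shard (the file layout and values here are hypothetical; real labels would come from ACES) is:

```python
import datetime

import polars as pl

# Hypothetical labels in the (subject_id, timestamp, label) schema
# that meds-tab-cache-task expects.
labels = pl.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "timestamp": [
            datetime.datetime(2024, 1, 1),
            datetime.datetime(2024, 3, 5),
            datetime.datetime(2024, 2, 2),
        ],
        "label": [0, 1, 0],
    }
)
labels.write_parquet("path_to_labels/train/0.parquet")  # hypothetical path
```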

**Example: Align tabularized data** for a specific task `$TASK` and labels that have been pulled from [ACES](https://github.com/justin13601/ACES)

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -151,7 +151,7 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
@@ -436,7 +436,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -506,7 +506,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \
@@ -529,14 +529,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-+model_params.model.eta: tag(log, interval(0.001, 1))
-+model_params.model.lambda: tag(log, interval(0.001, 1))
-+model_params.model.alpha: tag(log, interval(0.001, 1))
-+model_params.model.subsample: interval(0.5, 1)
-+model_params.model.min_child_weight: interval(1e-2, 100)
-+model_params.model.max_depth: range(2, 16)
-model_params.num_boost_round: range(100, 1000)
-model_params.early_stopping_rounds: range(1, 10)
+model.eta: tag(log, interval(0.001, 1))
+model.lambda: tag(log, interval(0.001, 1))
+model.alpha: tag(log, interval(0.001, 1))
+model.subsample: interval(0.5, 1)
+model.min_child_weight: interval(1e-2, 100)
+model.max_depth: range(2, 16)
+num_boost_round: range(100, 1000)
+early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```
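Given these flattened names, a single value can presumably also be pinned from the command line rather than swept; a sketch with illustrative values, other required paths as in the earlier examples:

```console
meds-tab-xgboost \
  input_dir="path_to_data" \
  output_dir="output_directory" \
  task_name=$TASK \
  model.eta=0.1 \
  num_boost_round=500
```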

docs/source/overview.md (12 changes: 6 additions & 6 deletions)
@@ -38,14 +38,14 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.

2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with the minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -62,19 +62,19 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.

**Example: Align tabularized data** for a specific task `$TASK` and labels that have been pulled from [ACES](https://github.com/justin13601/ACES)

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -86,7 +86,7 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
docs/source/prediction.md (20 changes: 10 additions & 10 deletions)
@@ -14,7 +14,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -84,7 +84,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
MEDS_cohort_dir="path_to_data" \
input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-permutations [1d,30d,365d,full]) \
@@ -107,14 +107,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-+model_params.model.eta: tag(log, interval(0.001, 1))
-+model_params.model.lambda: tag(log, interval(0.001, 1))
-+model_params.model.alpha: tag(log, interval(0.001, 1))
-+model_params.model.subsample: interval(0.5, 1)
-+model_params.model.min_child_weight: interval(1e-2, 100)
-+model_params.model.max_depth: range(2, 16)
-model_params.num_boost_round: range(100, 1000)
-model_params.early_stopping_rounds: range(1, 10)
+model.eta: tag(log, interval(0.001, 1))
+model.lambda: tag(log, interval(0.001, 1))
+model.alpha: tag(log, interval(0.001, 1))
+model.subsample: interval(0.5, 1)
+model.min_child_weight: interval(1e-2, 100)
+model.max_depth: range(2, 16)
+num_boost_round: range(100, 1000)
+early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```

pyproject.toml (4 changes: 3 additions & 1 deletion)
@@ -15,10 +15,12 @@ classifiers = [
"Operating System :: OS Independent",
]
dependencies = [
"polars", "pyarrow", "loguru", "hydra-core==1.3.2", "numpy", "scipy<1.14.0", "pandas", "tqdm", "xgboost",
"polars==1.6.0", "pyarrow", "loguru", "hydra-core==1.3.2", "numpy", "scipy<1.14.0", "pandas", "tqdm", "xgboost",
"scikit-learn", "hydra-optuna-sweeper", "hydra-joblib-launcher", "ml-mixins", "meds==0.3.3", "meds-transforms==0.0.7",
]

+[tool.setuptools_scm]

[project.scripts]
meds-tab-describe = "MEDS_tabular_automl.scripts.describe_codes:main"
meds-tab-tabularize-static = "MEDS_tabular_automl.scripts.tabularize_static:main"
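With `polars` now pinned exactly, a quick post-install check is cheap; a sketch, assuming the `meds-tab` PyPI name from the publish workflow above:

```console
pip install meds-tab
python -c "import polars; print(polars.__version__)"  # should print 1.6.0
```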
src/MEDS_tabular_automl/base_model.py (2 changes: 2 additions & 0 deletions)
@@ -9,6 +9,8 @@


class BaseModel(ABC, TimeableMixin):
"""Defines the interface for a model that can be trained and evaluated via the launch_model script."""

@abstractmethod
def __init__(self):
pass
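Only the abstract `__init__` is visible in this hunk; as a heavily hedged sketch, the smallest subclass of this interface (any further abstract training or evaluation methods are not shown here) would be:

```python
from MEDS_tabular_automl.base_model import BaseModel


class NoopModel(BaseModel):
    """Toy subclass sketch; a real model would also implement whatever
    training/evaluation methods the launch_model script relies on, which
    are not visible in this hunk."""

    def __init__(self):
        super().__init__()
```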
src/MEDS_tabular_automl/configs/default.yaml (8 changes: 4 additions & 4 deletions)
@@ -1,13 +1,13 @@
-MEDS_cohort_dir: ???
-output_cohort_dir: ???
+input_dir: ???
+output_dir: ???
do_overwrite: False
seed: 1
tqdm: False
worker: 0
loguru_init: False

-log_dir: ${output_cohort_dir}/.logs/
-cache_dir: ${output_cohort_dir}/.cache
+log_dir: ${output_dir}/.logs/
+cache_dir: ${output_dir}/.cache

hydra:
verbose: False
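Since every entry point layers on this config, the renamed keys and derived directories can be overridden uniformly; a sketch with placeholder paths (the cache override is illustrative, not from this diff):

```console
# Hypothetical invocation relocating the derived cache directory.
meds-tab-tabularize-static \
  input_dir="path_to_data" \
  output_dir="path_to_output" \
  cache_dir="/tmp/meds_tab_cache" \
  do_overwrite=True
```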
src/MEDS_tabular_automl/configs/describe_codes.yaml (3 changes: 1 addition & 2 deletions)
@@ -2,8 +2,7 @@ defaults:
- default
- _self_

-input_dir: ${output_cohort_dir}/data
# Where to store output code frequency data
-output_filepath: ${output_cohort_dir}/metadata/codes.parquet
+output_filepath: ${output_dir}/metadata/codes.parquet

name: describe_codes
src/MEDS_tabular_automl/configs/launch_autogluon.yaml (28 changes: 0 additions & 28 deletions)

This file was deleted.

src/MEDS_tabular_automl/configs/launch_model.yaml (41 changes: 16 additions & 25 deletions)
@@ -1,39 +1,30 @@
defaults:
+- _self_
- default
- tabularization: default
-- model: xgboost # This can be changed to sgd_classifier or any other model
- imputer: default
- normalization: default
-- override hydra/callbacks: evaluation_callback
+- model_launcher: xgboost
- override hydra/sweeper: optuna
- override hydra/sweeper/sampler: tpe
+- override hydra/callbacks: evaluation_callback
- override hydra/launcher: joblib
-- _self_

-task_name: task
+task_name: ???

-# Task cached data dir
-input_dir: ${output_cohort_dir}/${task_name}/task_cache
-# Directory with task labels
-input_label_dir: ${output_cohort_dir}/${task_name}/labels/
+# Location of task, split, and shard specific tabularized data
+input_tabularized_cache_dir: ${output_dir}/${task_name}/task_cache
+# Location of task, split, and shard specific label data
+input_label_cache_dir: ${output_dir}/${task_name}/labels
# Where to output the model and cached data
-model_saving:
-model_dir: ${output_cohort_dir}/model/model_${now:%Y-%m-%d_%H-%M-%S}
-model_file_stem: model
-model_file_extension: .json
-delete_below_top_k: -1
-model_logging:
-model_log_dir: ${model_saving.model_dir}/.logs/
-performance_log_stem: performance
-config_log_stem: config
+output_model_dir: ???

+delete_below_top_k: -1

name: launch_model

hydra:
verbose: False
+job:
+name: MEDS_TAB_${name}_${worker}_${now:%Y-%m-%d_%H-%M-%S}
sweep:
-dir: ${model_log_dir}
+dir: ${output_model_dir}/sweeps/{now:%Y-%m-%d-%H-%M-%S}/
+subdir: "1"
run:
-dir: ${model_log_dir}
+dir: ${path.model_log_dir}
sweeper:
direction: "maximize"