
added autogluon support, more models, more preprocessing strategies #81

Merged
merged 65 commits on Sep 10, 2024
Changes from 50 commits

Commits
5fde57a
added autogluon support
Oufattole Aug 19, 2024
d6832cb
updates for autogluon
teyaberg Aug 19, 2024
0612730
[wip] filtering features
teyaberg Aug 20, 2024
2feee79
[wip] filtering features
teyaberg Aug 20, 2024
f3c985a
[wip] sharing for updates only
teyaberg Aug 20, 2024
b65754c
[wip] sharing for updates only
teyaberg Aug 20, 2024
a8d8417
[wip] doctests
teyaberg Aug 20, 2024
d07f6a2
autogluon
teyaberg Aug 20, 2024
2aebd70
added logged warning for static data being empty and added support fo…
Oufattole Aug 20, 2024
8c54317
Merge branch 'generalized_load_model' into dev
Oufattole Aug 20, 2024
ecf9292
Added support via hydra for selecting among four imputation methods (…
Oufattole Aug 21, 2024
e6cf085
fixed xgboost model yaml to load imputer and normalization from the m…
Oufattole Aug 21, 2024
94dfde2
added autogluon test and cli support
Oufattole Aug 21, 2024
527eda5
added three more sklearn models and fixed bug with normalization and i…
Oufattole Aug 21, 2024
0d7ed27
fixed bugs so correlation code filters work now
Oufattole Aug 21, 2024
9c542ea
sweeper
teyaberg Aug 21, 2024
1a519ff
logging
teyaberg Aug 21, 2024
8fc8863
made task caching parallelizable and updated tests for configs
Oufattole Aug 21, 2024
5724d9b
Merge branch 'dev' of github.com:mmcdermott/MEDS_Tabular_AutoML into dev
Oufattole Aug 21, 2024
3e223bb
added more thorough tests for output file paths of task caching and …
Oufattole Aug 22, 2024
926732b
Merge branch 'main' into dev
Oufattole Aug 25, 2024
299bf6f
setup dynamic versioning
Oufattole Aug 25, 2024
8a7692a
version updates
teyaberg Sep 5, 2024
158b8fa
version updates
teyaberg Sep 6, 2024
e92049f
fix hydra-core version for experimental callback support
teyaberg Sep 6, 2024
0623aaa
eval callback logging
teyaberg Sep 6, 2024
e1be850
added script input args checks, reduced redundancy in model launcher …
Oufattole Sep 7, 2024
0e985ee
eval callback
teyaberg Sep 7, 2024
139870f
eval callback
teyaberg Sep 7, 2024
0d5e9e8
Updated pre-commit config too.
mmcdermott Sep 8, 2024
2563aaf
Removed a function that was not yet implemented.
mmcdermott Sep 8, 2024
2d80905
Removing unused function in evaluation callback.
mmcdermott Sep 8, 2024
d29ece9
eval callback
teyaberg Sep 8, 2024
81b022f
added yaml hierarchy for model_launcher
Oufattole Sep 8, 2024
57a4a81
updated configs, fixed most tests
Oufattole Sep 9, 2024
b704bba
Merged
mmcdermott Sep 9, 2024
2f564e6
Removed unused pass block.
mmcdermott Sep 9, 2024
6f68a4b
Removing unnecessary keys call
mmcdermott Sep 9, 2024
6c2ba9a
Fixed workflow files
mmcdermott Sep 9, 2024
e678145
fixed tabularize tests
Oufattole Sep 9, 2024
d64e237
added integration tests covering multirun for all launch_model models…
Oufattole Sep 9, 2024
8d12aed
merged dev
Oufattole Sep 9, 2024
c631e93
fixed tests
Oufattole Sep 9, 2024
2601fca
Merge pull request #90 from mmcdermott/configs
Oufattole Sep 9, 2024
a4ad03c
resolved review feedback. Added a based_model docstring. Added versio…
Oufattole Sep 9, 2024
0db7bd6
fixed min_code_inclusion_frequency kwarg
Oufattole Sep 9, 2024
b289033
added mimic iv tutorial
Oufattole Sep 9, 2024
9294920
updated tabularization script to fix bugs
Oufattole Sep 9, 2024
d71f9dc
reduced the number of workers for resharding
Oufattole Sep 9, 2024
aed27f1
Merged.
mmcdermott Sep 9, 2024
0dc2bc6
updated tabularize meds to take string input for tasks
Oufattole Sep 9, 2024
c981534
Merge pull request #91 from mmcdermott/improve_test_coverage
mmcdermott Sep 9, 2024
2aa4feb
Improved error handling per https://github.com/mmcdermott/MEDS_Tabula…
mmcdermott Sep 9, 2024
a6d9103
Update README.md
mmcdermott Sep 9, 2024
23eb4d4
added try except around loading 0 codes
Oufattole Sep 10, 2024
be5f723
fixed job name config bug where we were missing the $ so it was not …
Oufattole Sep 10, 2024
4c87e94
Merge branch 'dev' into MIMICIV
Oufattole Sep 10, 2024
d390658
fixed precommit issues
Oufattole Sep 10, 2024
b82ee6d
Merge branch 'dev' into MIMICIV
Oufattole Sep 10, 2024
a564886
fix paths for eval_callback and add check to test_integration
teyaberg Sep 10, 2024
430afba
fixing tests for delete_below_top_k
teyaberg Sep 10, 2024
6a89a9f
Merge pull request #92 from mmcdermott/MIMICIV
Oufattole Sep 10, 2024
9e6d99a
fix out of memory xgboost training and added test
teyaberg Sep 10, 2024
8316365
simplified pathing for results and evaluation callback
Oufattole Sep 10, 2024
f7e03dd
fixed doctest for deleting below top k models
Oufattole Sep 10, 2024
Files changed
6 changes: 4 additions & 2 deletions .github/workflows/code-quality-main.yaml
@@ -13,10 +13,12 @@ jobs:

    steps:
      - name: Checkout
-       uses: actions/checkout@v3
+       uses: actions/checkout@v4

      - name: Set up Python
-       uses: actions/setup-python@v3
+       uses: actions/setup-python@v5
+       with:
+         python-version: "3.11"

      - name: Run pre-commits
        uses: pre-commit/action@v3.0.1
6 changes: 4 additions & 2 deletions .github/workflows/code-quality-pr.yaml
@@ -16,10 +16,12 @@ jobs:

    steps:
      - name: Checkout
-       uses: actions/checkout@v3
+       uses: actions/checkout@v4

      - name: Set up Python
-       uses: actions/setup-python@v3
+       uses: actions/setup-python@v5
+       with:
+         python-version: "3.11"

      - name: Find modified files
        id: file_changes
28 changes: 2 additions & 26 deletions .github/workflows/publish-to-pypi.yml
@@ -12,7 +12,7 @@ jobs:
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
-         python-version: "3.x"
+         python-version: "3.11"
      - name: Install pypa/build
        run: >-
          python3 -m
@@ -36,7 +36,7 @@ jobs:
    runs-on: ubuntu-latest
    environment:
      name: pypi
-     url: https://pypi.org/p/<package-name> # Replace <package-name> with your PyPI project name
+     url: https://pypi.org/p/meds-tab # Replace <package-name> with your PyPI project name
    permissions:
      id-token: write # IMPORTANT: mandatory for trusted publishing

@@ -91,27 +91,3 @@ jobs:
          gh release upload
          '${{ github.ref_name }}' dist/**
          --repo '${{ github.repository }}'
-
-  publish-to-testpypi:
-    name: Publish Python 🐍 distribution 📦 to TestPyPI
-    needs:
-      - build
-    runs-on: ubuntu-latest
-
-    environment:
-      name: testpypi
-      url: https://test.pypi.org/p/<package-name>
-
-    permissions:
-      id-token: write # IMPORTANT: mandatory for trusted publishing
-
-    steps:
-      - name: Download all the dists
-        uses: actions/download-artifact@v3
-        with:
-          name: python-package-distributions
-          path: dist/
-      - name: Publish distribution 📦 to TestPyPI
-        uses: pypa/gh-action-pypi-publish@release/v1
-        with:
-          repository-url: https://test.pypi.org/legacy/
10 changes: 6 additions & 4 deletions .github/workflows/tests.yaml
@@ -12,17 +12,19 @@ jobs:

    strategy:
      fail-fast: false
+     matrix:
+       python-version: ["3.11", "3.12"]

    timeout-minutes: 30

    steps:
      - name: Checkout
-       uses: actions/checkout@v3
+       uses: actions/checkout@v4

-     - name: Set up Python 3.12
-       uses: actions/setup-python@v3
+     - name: Set up Python ${{ matrix.python-version }}
+       uses: actions/setup-python@v5
        with:
-         python-version: "3.12"
+         python-version: ${{ matrix.python-version }}

      - name: Install packages
        run: |
4 changes: 1 addition & 3 deletions .pre-commit-config.yaml
@@ -1,7 +1,5 @@
default_language_version:
-  python: python3.12
-
-exclude: "sample_data|docs/MIMIC_IV_tutorial/wandb_reports"
+  python: python3.11

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
44 changes: 22 additions & 22 deletions README.md
@@ -84,12 +84,12 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
# Re-shard pipeline
-# $MIMICIV_MEDS_DIR is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
+# $MIMICIV_input_dir is the directory containing the input, MEDS v0.3 formatted MIMIC-IV data
# $MEDS_TAB_COHORT_DIR is the directory where the re-sharded MEDS dataset will be stored, and where your model
# will store cached files during processing by default.
# $N_PATIENTS_PER_SHARD is the number of patients per shard you want to use.
MEDS_transform-reshard_to_split \
-    input_dir="$MIMICIV_MEDS_DIR" \
+    input_dir="$MIMICIV_input_dir" \
cohort_dir="$MEDS_TAB_COHORT_DIR" \
'stages=["reshard_to_split"]' \
stage="reshard_to_split" \
@@ -103,14 +103,14 @@ By following these steps, you can seamlessly transform your dataset, define nece
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.
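As a quick illustration of what this cache enables (not part of this diff), the frequency filter can be previewed straight from the parquet file; the `code`/`count` column names here are assumptions, not something this PR pins down:

```python
# Sketch only: preview which codes would survive min_code_inclusion_frequency=10.
# The file location follows the text above; the column names are assumptions.
import polars as pl

metadata = pl.read_parquet("path_to_data/code_metadata.parquet")
frequent = metadata.filter(pl.col("count") >= 10)
print(frequent.sort("count", descending=True).head())
```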

-2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.
+2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with the minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -119,27 +119,27 @@ By following these steps, you can seamlessly transform your dataset, define nece

- For the exhaustive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24)

-3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).
+3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `subject_id` x `timestamp`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).

**Example: Aggregate time-series data** on features across different `window_sizes`

```console
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.

-**Example: Align tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)
+**Example: Align tabularized data** for a specific task `$TASK` and labels that have been pulled from [ACES](https://github.com/justin13601/ACES)

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -151,7 +151,7 @@ By following these steps, you can seamlessly transform your dataset, define nece

```console
meds-tab-xgboost --multirun \
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
@@ -321,7 +321,7 @@ Now that we have generated tabular features for all the events in our dataset, w
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
- **Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.

-The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard file structure as the input meds data from step (1), and the label parquets need `patient_id`, `timestamp`, and `label` columns.
+The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard file structure as the input meds data from step (1), and the label parquets need `subject_id`, `timestamp`, and `label` columns.
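For illustration only (not part of this diff), a minimal label shard matching that schema might be built like this; the shard path and values are hypothetical:

```python
# Hypothetical sketch of one task-label shard with the required
# subject_id / timestamp / label columns described above.
from datetime import datetime

import polars as pl

labels = pl.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "timestamp": [
            datetime(2024, 1, 1),
            datetime(2024, 3, 5),
            datetime(2024, 2, 2),
        ],
        "label": [0, 1, 0],  # binary task label
    }
)
# Mirrors the shard structure of the input MEDS data, e.g. train/0.parquet.
labels.write_parquet("task_labels/train/0.parquet")
```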

## 4. XGBoost Training

@@ -436,7 +436,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -506,7 +506,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-subsets [1d,30d,365d,full]) \
@@ -529,14 +529,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-  +model_params.model.eta: tag(log, interval(0.001, 1))
-  +model_params.model.lambda: tag(log, interval(0.001, 1))
-  +model_params.model.alpha: tag(log, interval(0.001, 1))
-  +model_params.model.subsample: interval(0.5, 1)
-  +model_params.model.min_child_weight: interval(1e-2, 100)
-  +model_params.model.max_depth: range(2, 16)
-  model_params.num_boost_round: range(100, 1000)
-  model_params.early_stopping_rounds: range(1, 10)
+  model.eta: tag(log, interval(0.001, 1))
+  model.lambda: tag(log, interval(0.001, 1))
+  model.alpha: tag(log, interval(0.001, 1))
+  model.subsample: interval(0.5, 1)
+  model.min_child_weight: interval(1e-2, 100)
+  model.max_depth: range(2, 16)
+  num_boost_round: range(100, 1000)
+  early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```
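For readers unfamiliar with the sweeper syntax, `tag(log, interval(...))` denotes a log-uniform continuous range and `range(...)` an integer range. Assuming Hydra's Optuna sweeper backend, the draws correspond roughly to the following sketch (not code from this PR):

```python
# Rough Optuna equivalent of the search space above; shown only to
# clarify the interval / range / tag(log, ...) notation.
import optuna


def sample_params(trial: optuna.Trial) -> dict:
    return {
        "model.eta": trial.suggest_float("model.eta", 0.001, 1, log=True),
        "model.lambda": trial.suggest_float("model.lambda", 0.001, 1, log=True),
        "model.alpha": trial.suggest_float("model.alpha", 0.001, 1, log=True),
        "model.subsample": trial.suggest_float("model.subsample", 0.5, 1),
        "model.min_child_weight": trial.suggest_float("model.min_child_weight", 1e-2, 100),
        "model.max_depth": trial.suggest_int("model.max_depth", 2, 16),
        "num_boost_round": trial.suggest_int("num_boost_round", 100, 1000),
        "early_stopping_rounds": trial.suggest_int("early_stopping_rounds", 1, 10),
        "tabularization.min_code_inclusion_frequency": trial.suggest_int(
            "tabularization.min_code_inclusion_frequency", 10, 1_000_000, log=True
        ),
    }
```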

2 changes: 1 addition & 1 deletion docs/source/implementation.md
@@ -92,7 +92,7 @@ Now that we have generated tabular features for all the events in our dataset, w
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
- **Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.

-The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `patient_id`, `timestamp`, and `label` columns.
+The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `subject_id`, `timestamp`, and `label` columns.

## 4. XGBoost Training

16 changes: 8 additions & 8 deletions docs/source/overview.md
@@ -38,14 +38,14 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au
- static codes (codes without timestamps)
- static numerical codes (codes without timestamps but with numerical values).

-This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `MEDS_cohort_dir` argument specified as a hydra-style command line argument.
+This script further caches feature names and frequencies in a dataset stored in a `code_metadata.parquet` file within the `input_dir` argument specified as a hydra-style command line argument.

-2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `patient_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.
+2. **`meds-tab-tabularize-static`**: Filters and processes the dataset based on the frequency of codes, generating a tabular vector for each patient at each timestamp in the shards. Each row corresponds to a unique `subject_id` and `timestamp` combination, thus rows are duplicated across multiple timestamps for the same patient.

**Example: Tabularizing static data** with the minimum code frequency of 10, window sizes of `[1d, 30d, 365d, full]`, and value aggregation methods of `[static/present, static/first, code/count, value/count, value/sum, value/sum_sqd, value/min, value/max]`

```console
-meds-tab-tabularize-static MEDS_cohort_dir="path_to_data" \
+meds-tab-tabularize-static input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
do_overwrite=False \
@@ -54,27 +54,27 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au

- For the exhaustive examples of value aggregations, see [`/src/MEDS_tabular_automl/utils.py`](https://github.com/mmcdermott/MEDS_Tabular_AutoML/blob/main/src/MEDS_tabular_automl/utils.py#L24)

-3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `patient_id` x `timestamp`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).
+3. **`meds-tab-tabularize-time-series`**: Iterates through combinations of a shard, `window_size`, and `aggregation` to generate feature vectors that aggregate patient data for each unique `subject_id` x `timestamp`. This stage (and the previous stage) uses sparse matrix formats to efficiently handle the computational and storage demands of rolling window calculations on large datasets. We support parallelization through Hydra's [`--multirun`](https://hydra.cc/docs/intro/#multirun) flag and the [`joblib` launcher](https://hydra.cc/docs/plugins/joblib_launcher/#internaldocs-banner).

**Example: Aggregate time-series data** on features across different `window_sizes`

```console
meds-tab-tabularize-time-series --multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
```

-4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`patient_id`, `timestamp`, `label`) structured similarly to the `MEDS_cohort_dir`.
+4. **`meds-tab-cache-task`**: Aligns task-specific labels with the nearest prior event in the tabularized data. It requires a labeled dataset directory with three columns (`subject_id`, `timestamp`, `label`) structured similarly to the `input_dir`.

**Example: Align tabularized data** for a specific task `$TASK` and labels that has pulled from [ACES](https://github.com/justin13601/ACES)

```console
-meds-tab-cache-task MEDS_cohort_dir="path_to_data" \
+meds-tab-cache-task input_dir="path_to_data" \
task_name=$TASK \
tabularization.min_code_inclusion_frequency=10 \
do_overwrite=False \
@@ -86,7 +86,7 @@ See [`/tests/test_integration.py`](https://github.com/mmcdermott/MEDS_Tabular_Au

```console
meds-tab-xgboost --multirun \
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.min_code_inclusion_frequency=10 \
20 changes: 10 additions & 10 deletions docs/source/prediction.md
@@ -14,7 +14,7 @@ A single XGBoost run was completed to profile time and memory usage. This was do

```console
meds-tab-xgboost
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
do_overwrite=False \
@@ -84,7 +84,7 @@ The XGBoost sweep was run using the following command for each `$TASK`:

```console
meds-tab-xgboost --multirun \
-MEDS_cohort_dir="path_to_data" \
+input_dir="path_to_data" \
task_name=$TASK \
output_dir="output_directory" \
tabularization.window_sizes=$(generate-permutations [1d,30d,365d,full]) \
@@ -107,14 +107,14 @@ The hydra sweeper swept over the parameters:

```yaml
params:
-  +model_params.model.eta: tag(log, interval(0.001, 1))
-  +model_params.model.lambda: tag(log, interval(0.001, 1))
-  +model_params.model.alpha: tag(log, interval(0.001, 1))
-  +model_params.model.subsample: interval(0.5, 1)
-  +model_params.model.min_child_weight: interval(1e-2, 100)
-  +model_params.model.max_depth: range(2, 16)
-  model_params.num_boost_round: range(100, 1000)
-  model_params.early_stopping_rounds: range(1, 10)
+  model.eta: tag(log, interval(0.001, 1))
+  model.lambda: tag(log, interval(0.001, 1))
+  model.alpha: tag(log, interval(0.001, 1))
+  model.subsample: interval(0.5, 1)
+  model.min_child_weight: interval(1e-2, 100)
+  model.max_depth: range(2, 16)
+  num_boost_round: range(100, 1000)
+  early_stopping_rounds: range(1, 10)
tabularization.min_code_inclusion_frequency: tag(log, range(10, 1000000))
```
