Update docs for multishard and predicates (#126)
* Remove hardcoded versions in docs

* Update links to MEDS-DEV

* Partially update readme, update sample MEDS data

* Update README and MEDS sample data

* Updates --help per #131

* Information about override predicates using predicates-only files

* Fix issue when label and index_timestamp predicates are not referenced elsewhere

* Updated links

* Update links

* Fix the prior fix for cases where no label or index

* Restructured text links

* Index references are not applicable, only label

* Undo restructured text links

* Separate the link cells and update technical details of plain predicates (other_cols)

* Update links again

* Update links final

* Update README
justin13601 authored Sep 24, 2024
1 parent 4c066d7 commit 4472d9f
Showing 16 changed files with 152 additions and 109 deletions.
154 changes: 81 additions & 73 deletions README.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/source/conf.py
@@ -28,8 +28,8 @@
copyright = "2024, Justin Xu & Matthew McDermott"
author = "Justin Xu & Matthew McDermott"

release = "0.2.5"
version = "0.2.5"
# release = "0.2.5"
# version = "0.2.5"


def ensure_pandoc_installed(_):
@@ -256,7 +256,7 @@ def ensure_pandoc_installed(_):

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
html_title = f"ACES v{version} Documentation"
html_title = "ACES Documentation"

# A shorter title for the navigation bar. Default is the same as html_title.
html_short_title = "ACES Documentation"
@@ -386,7 +386,7 @@ def ensure_pandoc_installed(_):
# -- Options for EPUB output
epub_show_urls = "footnote"

print(f"loading configurations for {project} {version} ...", file=sys.stderr)
print(f"loading configurations for {project} ...", file=sys.stderr)


def setup(app):
4 changes: 2 additions & 2 deletions docs/source/configuration.md
@@ -63,6 +63,8 @@ These configs consist of the following four fields:
will be used).
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
+- `other_cols`: This optional field accepts a 1-to-1 dictionary of column names to column values, and can be
+  used to specify further constraints on other columns (i.e., not `code`) for this predicate.

A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
on its source format.
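For illustration, a plain predicate constrained on a second column via `other_cols` might look like the following sketch; the predicate name, code, and column value here are hypothetical, not taken from this diff:

```yaml
predicates:
  icu_admission:
    code: event_type//ADMISSION # hypothetical code value
    other_cols:
      unit: ICU # rows match only when the `unit` column equals "ICU"
```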
@@ -191,5 +193,3 @@ to achieve the result. Instead, this bound is always interpreted to be inclusive
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
to be included.
-
-______________________________________________________________________
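As a sketch of the inclusive bounds described above, such a count constraint sits under a window's `has` field; the window name and duration below are hypothetical:

```yaml
windows:
  observation:
    start: trigger
    end: start + 24h
    has:
      name: (1, 2) # satisfied when `name` occurs 1 or 2 times in the window (both bounds inclusive)
```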
10 changes: 5 additions & 5 deletions docs/source/index.md
@@ -38,19 +38,19 @@ If you have a dataset and want to leverage it for machine learning tasks, the AC

- Task-Specific Concepts: Identify the predicates (data concepts) required for your specific machine learning tasks.
- Pre-Defined Criteria: Utilize our pre-defined criteria across various tasks and clinical areas to expedite this process.
-- [PIE-MD](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/criteria): Access our repository of tasks to find relevant predicates!
+- [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main): Access our benchmark of tasks to find relevant predicates!

### III. Set Dataset-Agnostic Criteria

- Standardization: Combine the identified predicates with standardized, dataset-agnostic criteria files.
-- Examples: Refer to the [MIMIC-IV](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/MIMIC-IV) and [eICU](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/eICU) examples for guidance on how to structure your criteria files for your private datasets!
+- Examples: Refer to the [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main/src/MEDS_DEV/tasks/criteria) examples for guidance on how to structure your criteria files for your private datasets!

### IV. Run ACES

-- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html)!
+- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html) for more information!

### V. Run MEDS-Tab

-- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!
+- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

-By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
+By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES and MEDS ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
12 changes: 9 additions & 3 deletions docs/source/notebooks/examples.ipynb
@@ -6,8 +6,13 @@
"source": [
"# Task Examples\n",
"\n",
"Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository ([`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:\n",
"\n",
"Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository ([sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"aces-cli data.path='/path/to/MIMIC/ESGPT/schema/' data.standard='esgpt' cohort_dir='sample_configs/' cohort_name='...'\n",
"```"
@@ -269,6 +274,7 @@
"source": [
"imminent_mortality_cfg_path = f\"{config_path}/imminent_mortality.yaml\"\n",
"cfg = config.TaskExtractorConfig.load(config_path=imminent_mortality_cfg_path)\n",
"\n",
"tree = cfg.window_tree\n",
"print_tree(tree)"
]
@@ -279,7 +285,7 @@
"source": [
"## Other Examples\n",
"\n",
"A few other examples are provided in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) of the repository. We will continue to add task configurations to this folder or to a benchmarking effort for EHR representation learning. More information can be found [here](https://github.com/mmcdermott/PIE_MD/tree/main) - stay tuned!"
"A few other examples are provided in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) of the repository. We will continue to add task configurations to [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main), a benchmarking effort for EHR representation learning - stay tuned!"
]
}
],
6 changes: 3 additions & 3 deletions docs/source/notebooks/predicates.ipynb
@@ -71,7 +71,7 @@
"source": [
"## Sample Predicates DataFrame\n",
"\n",
"A sample predicates dataframe is provided in the repository ([`sample_data/sample_data.csv`](https://github.com/justin13601/ACES/blob/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)) could be directly extracted."
"A sample predicates dataframe is provided in the repository ([sample_data/sample_data.csv](https://github.com/justin13601/ACES/blob/main/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs)) could be directly extracted."
]
},
{
@@ -100,7 +100,7 @@
"\n",
"ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.\n",
"\n",
"Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#47](https://github.com/justin13601/ACES/issues/47)).\n",
"Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#66](https://github.com/justin13601/ACES/issues/66)).\n",
"\n",
"**Note**: When creating `plain` predicate columns directly, you must still define them in the configuration file (they could be with an arbitrary value in the `code` field) - ACES will verify their existence after data loading (ie., by validating that a column exists with the predicate name in your dataframe). You will also need them for referencing in your windows."
]
@@ -109,7 +109,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Example of the `derived` predicate `discharge_or_death`, expressed as an `or()` relationship between `plain` predicates `discharge` and `death, which have been directly defined (ie., arbitrary values for their codes are present).\n",
"Example of the `derived` predicate `discharge_or_death`, expressed as an `or()` relationship between `plain` predicates `discharge` and `death`, which have been directly defined (ie., arbitrary values for their codes, `defined in data`, are present).\n",
"\n",
"```yaml\n",
"predicates:\n",
2 changes: 1 addition & 1 deletion docs/source/notebooks/tutorial.ipynb
@@ -47,7 +47,7 @@
"source": [
"### Directories\n",
"\n",
"Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) and [`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data) folders in the project root, respectively."
"Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) and [sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data) folders in the project root, respectively."
]
},
{
46 changes: 28 additions & 18 deletions docs/source/usage.md
@@ -149,43 +149,47 @@ Hydra configuration files are leveraged for cohort extraction runs. All fields c

#### Data Configuration

-To set a data standard:
+**To set a data standard**:

-`data.standard`: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
+***`data.standard`***: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'

-To query from a single MEDS shard:
+**To query from a single MEDS shard**:

-`data.path`: Path to the `.parquet`shard file
+***`data.path`***: Path to the `.parquet` shard file

-To query from multiple MEDS shards, you must set `data=sharded`. Additionally:
+**To query from multiple MEDS shards**, you must set `data=sharded`. Additionally:

-`data.root`: Root directory of MEDS dataset containing shard directories
+***`data.root`***: Root directory of MEDS dataset containing shard directories

-`data.shard`: Expression specifying MEDS shards (`$(expand_shards <str>/<int>)`)
+***`data.shard`***: Expression specifying MEDS shards using [expand_shards](https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py) (`$(expand_shards <str>/<int>)`)

-To query from an ESGPT dataset:
+**To query from an ESGPT dataset**:

-`data.path`: Directory of the full ESGPT dataset
+***`data.path`***: Directory of the full ESGPT dataset

-To query from a direct predicates dataframe:
+**To query from a direct predicates dataframe**:

-`data.path` Path to the `.csv` or `.parquet` file containing the predicates dataframe
+***`data.path`***: Path to the `.csv` or `.parquet` file containing the predicates dataframe

-`data.ts_format`: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
+***`data.ts_format`***: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
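Putting these fields together, sketches of a single-shard and a multi-shard run might look as follows; the dataset paths and cohort name are placeholders, and the multi-shard form assumes Hydra's `--multirun` sweep over the values produced by `expand_shards`:

```bash
# Query a single MEDS shard (paths are placeholders)
aces-cli data.standard='meds' data.path='/data/meds/train/0.parquet' \
    cohort_dir='sample_configs/' cohort_name='inhospital_mortality'

# Query multiple MEDS shards; expand_shards emits one data.shard value per shard,
# assuming subdirectories like train/0..train/2 and held_out/0 under data.root
aces-cli --multirun data=sharded data.standard='meds' data.root='/data/meds/' \
    "data.shard=$(expand_shards train/3 held_out/1)" \
    cohort_dir='sample_configs/' cohort_name='inhospital_mortality'
```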

#### Task Configuration

-`cohort_dir`: Directory of your task configuration file
+***`cohort_dir`***: Directory of your task configuration file

-`cohort_name`: Name of the task configuration file
+***`cohort_name`***: Name of the task configuration file

-The above two fields are used for automatically loading task configurations, saving results, and logging:
+The above two fields are used below for automatically loading task configurations, saving results, and logging:

-`config_path`: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
+***`config_path`***: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`

-`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
+***`output_filepath`***: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise

-`log_dir`: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
+***`log_dir`***: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
+
+Additionally, predicates may be specified in a separate predicates configuration file and loaded for overrides:
+
+***`predicates_path`***: Path to the [separate predicates-only file](https://eventstreamaces.readthedocs.io/en/latest/usage.html#separate-predicates-only-file). Defaults to null
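With only `cohort_dir` and `cohort_name` supplied, the remaining paths resolve from the defaults above; a sketch with a hypothetical cohort name:

```bash
aces-cli data.standard='direct' data.path='sample_data/sample_data.csv' \
    cohort_dir='sample_configs' cohort_name='inhospital_mortality'
# The defaults then resolve to:
#   config_path     = sample_configs/inhospital_mortality.yaml
#   output_filepath = sample_configs/inhospital_mortality.parquet
#   log_dir         = sample_configs/inhospital_mortality/.logs
```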

#### Tab Completion

@@ -257,6 +261,8 @@ For example, to query an in-hospital mortality task on the sample data (both the
>>> query.query(cfg=cfg, predicates_df=predicates_df)
```

+### Separate Predicates-Only File
+
For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can
be created and passed into `TaskExtractorConfig.load()`. Only referenced predicates will have a predicate
column computed and evaluated, so one could create a dataset-specific deposit file with many predicates and
@@ -266,4 +272,8 @@ reference as needed to ensure the cleanliness of the dataset-agnostic task crite
>>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
```

+If the same predicates are defined in both the task configuration file and the predicates-only file, the
+predicates-only definition takes precedence and will be used to override previous definitions. As such, one may
+create a predicates-only "database" file for a particular dataset, and override accordingly for various tasks.
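A minimal sketch of such a predicates-only "database" file, using the `expr`-style derived predicates shown elsewhere in these docs; the codes are hypothetical:

```yaml
# predicates.yaml - dataset-specific predicate definitions
predicates:
  death:
    code: MEDS_DEATH # hypothetical code
  discharge:
    code: HOSPITAL_DISCHARGE # hypothetical code
  discharge_or_death:
    expr: or(discharge, death)
```

If `death` were also defined in `criteria.yaml`, the definition here would win, per the precedence rule above.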

______________________________________________________________________
Binary file added sample_data/meds_sample/held_out/0.parquet
Binary file modified sample_data/meds_sample/sample_shard.parquet
Binary file removed sample_data/meds_sample/test/0.parquet
Binary file modified sample_data/meds_sample/train/0.parquet
Binary file modified sample_data/meds_sample/train/1.parquet
3 changes: 3 additions & 0 deletions src/aces/config.py
@@ -1273,6 +1273,9 @@ def load(cls, config_path: str | Path, predicates_path: str | Path = None) -> Ta

referenced_predicates = {pred for w in windows.values() for pred in w.referenced_predicates}
referenced_predicates.add(trigger.predicate)
+label_reference = [w.label for w in windows.values() if w.label]
+if label_reference:
+    referenced_predicates.update(set(label_reference))
current_predicates = set(referenced_predicates)
special_predicates = {ANY_EVENT_COLUMN, START_OF_RECORD_KEY, END_OF_RECORD_KEY}
for pred in current_predicates - special_predicates:
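This fix matters for task configurations where a predicate appears only as a window's `label` and is never referenced in a trigger or window constraint; a sketch of such a configuration (codes and durations hypothetical):

```yaml
predicates:
  admission:
    code: ADMISSION # hypothetical code
  death:
    code: MEDS_DEATH # hypothetical code; referenced only via `label` below

trigger: admission

windows:
  target:
    start: trigger
    end: start + 48h
    start_inclusive: false
    end_inclusive: true
    label: death
```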
8 changes: 8 additions & 0 deletions src/aces/configs/__init__.py
@@ -37,9 +37,17 @@
(`.csv` or `.parquet`) if using `direct`
- standard (required): data standard, one of 'meds', 'esgpt', or 'direct'
- ts_format (required if data.standard is 'direct'): timestamp format for the data
+- root (required, applicable when data=sharded): root directory for the data shards
+- shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
+  Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+  https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
+  https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
+predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
8 changes: 8 additions & 0 deletions src/aces/configs/_aces.yaml
@@ -55,9 +55,17 @@ hydra:
(`.csv` or `.parquet`) if using `direct`
- standard (required): data standard, one of 'meds', 'esgpt', or 'direct'
- ts_format (required if data.standard is 'direct'): timestamp format for the data
+- root (required, applicable when data=sharded): root directory for the data shards
+- shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
+  Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+  https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
+  https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
+predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
