Update docs for multishard and predicates (#126)
* Remove hardcoded versions in docs

* Update links to MEDS-DEV

* Partially update readme, update sample MEDS data

* Update README and MEDS sample data

* Updates --help per #131

* Information about override predicates using predicates-only files

* Fix issue when label and index_timestamp predicates are not referenced elsewhere

* Updated links

* Update links

* Fix the prior fix for cases where no label or index

* Restructured text links

* Index references are not applicable, only label

* Undo restructured text links

* Separate the link cells and update technical details of plain predicates (other_cols)

* Update links again

* Update links final

* Update README
justin13601 authored Sep 24, 2024
1 parent 4c066d7 commit 4472d9f
Showing 16 changed files with 152 additions and 109 deletions.
154 changes: 81 additions & 73 deletions README.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/source/conf.py
@@ -28,8 +28,8 @@
copyright = "2024, Justin Xu & Matthew McDermott"
author = "Justin Xu & Matthew McDermott"

release = "0.2.5"
version = "0.2.5"
# release = "0.2.5"
# version = "0.2.5"


def ensure_pandoc_installed(_):
@@ -256,7 +256,7 @@ def ensure_pandoc_installed(_):

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
html_title = f"ACES v{version} Documentation"
html_title = "ACES Documentation"

# A shorter title for the navigation bar. Default is the same as html_title.
html_short_title = "ACES Documentation"
@@ -386,7 +386,7 @@ def ensure_pandoc_installed(_):
# -- Options for EPUB output
epub_show_urls = "footnote"

print(f"loading configurations for {project} {version} ...", file=sys.stderr)
print(f"loading configurations for {project} ...", file=sys.stderr)


def setup(app):
4 changes: 2 additions & 2 deletions docs/source/configuration.md
@@ -63,6 +63,8 @@ These configs consist of the following four fields:
will be used).
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
+- `other_cols`: This optional field accepts a 1-to-1 dictionary of column names to column values, and can be
+  used to specify further constraints on other columns (i.e., not `code`) for this predicate.

A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
on its source format.
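For illustration, a plain predicate constrained on a second column via `other_cols` might look like the following sketch; the predicate name, code, and column value here are hypothetical, not taken from this diff:

```yaml
predicates:
  icu_admission:
    code: event_type//ADMISSION # hypothetical code value
    other_cols:
      unit: ICU # rows match only when the `unit` column equals "ICU"
```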
@@ -191,5 +193,3 @@ to achieve the result. Instead, this bound is always interpreted to be inclusive
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
to be included.
-
-______________________________________________________________________
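As a sketch of the inclusive bounds described above, such a count constraint sits under a window's `has` field; the window name and duration below are hypothetical:

```yaml
windows:
  observation:
    start: trigger
    end: start + 24h
    has:
      name: (1, 2) # satisfied when `name` occurs 1 or 2 times in the window (both bounds inclusive)
```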
10 changes: 5 additions & 5 deletions docs/source/index.md
@@ -38,19 +38,19 @@ If you have a dataset and want to leverage it for machine learning tasks, the AC

- Task-Specific Concepts: Identify the predicates (data concepts) required for your specific machine learning tasks.
- Pre-Defined Criteria: Utilize our pre-defined criteria across various tasks and clinical areas to expedite this process.
-- [PIE-MD](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/criteria): Access our repository of tasks to find relevant predicates!
+- [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main): Access our benchmark of tasks to find relevant predicates!

### III. Set Dataset-Agnostic Criteria

- Standardization: Combine the identified predicates with standardized, dataset-agnostic criteria files.
-- Examples: Refer to the [MIMIC-IV](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/MIMIC-IV) and [eICU](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/eICU) examples for guidance on how to structure your criteria files for your private datasets!
+- Examples: Refer to the [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main/src/MEDS_DEV/tasks/criteria) examples for guidance on how to structure your criteria files for your private datasets!

### IV. Run ACES

-- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html)!
+- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html) for more information!

### V. Run MEDS-Tab

-- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!
+- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

-By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
+By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES and MEDS ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
12 changes: 9 additions & 3 deletions docs/source/notebooks/examples.ipynb
@@ -6,8 +6,13 @@
"source": [
"# Task Examples\n",
"\n",
"Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository ([`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:\n",
"\n",
"Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository ([sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"aces-cli data.path='/path/to/MIMIC/ESGPT/schema/' data.standard='esgpt' cohort_dir='sample_configs/' cohort_name='...'\n",
"```"
@@ -269,6 +274,7 @@
"source": [
"imminent_mortality_cfg_path = f\"{config_path}/imminent_mortality.yaml\"\n",
"cfg = config.TaskExtractorConfig.load(config_path=imminent_mortality_cfg_path)\n",
"\n",
"tree = cfg.window_tree\n",
"print_tree(tree)"
]
@@ -279,7 +285,7 @@
"source": [
"## Other Examples\n",
"\n",
"A few other examples are provided in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) of the repository. We will continue to add task configurations to this folder or to a benchmarking effort for EHR representation learning. More information can be found [here](https://github.com/mmcdermott/PIE_MD/tree/main) - stay tuned!"
"A few other examples are provided in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) of the repository. We will continue to add task configurations to [MEDS-DEV](https://github.com/mmcdermott/MEDS-DEV/tree/main), a benchmarking effort for EHR representation learning - stay tuned!"
]
}
],
6 changes: 3 additions & 3 deletions docs/source/notebooks/predicates.ipynb
@@ -71,7 +71,7 @@
"source": [
"## Sample Predicates DataFrame\n",
"\n",
"A sample predicates dataframe is provided in the repository ([`sample_data/sample_data.csv`](https://github.com/justin13601/ACES/blob/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)) could be directly extracted."
"A sample predicates dataframe is provided in the repository ([sample_data/sample_data.csv](https://github.com/justin13601/ACES/blob/main/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs)) could be directly extracted."
]
},
{
@@ -100,7 +100,7 @@
"\n",
"ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.\n",
"\n",
"Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#47](https://github.com/justin13601/ACES/issues/47)).\n",
"Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#66](https://github.com/justin13601/ACES/issues/66)).\n",
"\n",
"**Note**: When creating `plain` predicate columns directly, you must still define them in the configuration file (they could be with an arbitrary value in the `code` field) - ACES will verify their existence after data loading (ie., by validating that a column exists with the predicate name in your dataframe). You will also need them for referencing in your windows."
]
@@ -109,7 +109,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Example of the `derived` predicate `discharge_or_death`, expressed as an `or()` relationship between `plain` predicates `discharge` and `death, which have been directly defined (ie., arbitrary values for their codes are present).\n",
"Example of the `derived` predicate `discharge_or_death`, expressed as an `or()` relationship between `plain` predicates `discharge` and `death`, which have been directly defined (ie., arbitrary values for their codes, `defined in data`, are present).\n",
"\n",
"```yaml\n",
"predicates:\n",
2 changes: 1 addition & 1 deletion docs/source/notebooks/tutorial.ipynb
@@ -47,7 +47,7 @@
"source": [
"### Directories\n",
"\n",
"Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) and [`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data) folders in the project root, respectively."
"Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) and [sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data) folders in the project root, respectively."
]
},
{
46 changes: 28 additions & 18 deletions docs/source/usage.md
@@ -149,43 +149,47 @@ Hydra configuration files are leveraged for cohort extraction runs. All fields c

#### Data Configuration

-To set a data standard:
+**To set a data standard**:

-`data.standard`: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'
+***`data.standard`***: String specifying the data standard, must be 'meds' OR 'esgpt' OR 'direct'

-To query from a single MEDS shard:
+**To query from a single MEDS shard**:

-`data.path`: Path to the `.parquet`shard file
+***`data.path`***: Path to the `.parquet` shard file

-To query from multiple MEDS shards, you must set `data=sharded`. Additionally:
+**To query from multiple MEDS shards**, you must set `data=sharded`. Additionally:

-`data.root`: Root directory of MEDS dataset containing shard directories
+***`data.root`***: Root directory of MEDS dataset containing shard directories

-`data.shard`: Expression specifying MEDS shards (`$(expand_shards <str>/<int>)`)
+***`data.shard`***: Expression specifying MEDS shards using [expand_shards](https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py) (`$(expand_shards <str>/<int>)`)

-To query from an ESGPT dataset:
+**To query from an ESGPT dataset**:

-`data.path`: Directory of the full ESGPT dataset
+***`data.path`***: Directory of the full ESGPT dataset

-To query from a direct predicates dataframe:
+**To query from a direct predicates dataframe**:

-`data.path` Path to the `.csv` or `.parquet` file containing the predicates dataframe
+***`data.path`***: Path to the `.csv` or `.parquet` file containing the predicates dataframe

-`data.ts_format`: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
+***`data.ts_format`***: Timestamp format for predicates. Defaults to "%m/%d/%Y %H:%M"
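Putting these fields together, sketches of a single-shard and a multi-shard run might look as follows; the dataset paths and cohort name are placeholders, and the multi-shard form assumes Hydra's `--multirun` sweep over the values produced by `expand_shards`:

```bash
# Query a single MEDS shard (paths are placeholders)
aces-cli data.standard='meds' data.path='/data/meds/train/0.parquet' \
    cohort_dir='sample_configs/' cohort_name='inhospital_mortality'

# Query multiple MEDS shards; expand_shards emits one data.shard value per shard,
# assuming subdirectories like train/0..train/2 and held_out/0 under data.root
aces-cli --multirun data=sharded data.standard='meds' data.root='/data/meds/' \
    "data.shard=$(expand_shards train/3 held_out/1)" \
    cohort_dir='sample_configs/' cohort_name='inhospital_mortality'
```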

#### Task Configuration

-`cohort_dir`: Directory of your task configuration file
+***`cohort_dir`***: Directory of your task configuration file

-`cohort_name`: Name of the task configuration file
+***`cohort_name`***: Name of the task configuration file

-The above two fields are used for automatically loading task configurations, saving results, and logging:
+The above two fields are used below for automatically loading task configurations, saving results, and logging:

-`config_path`: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
+***`config_path`***: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`

-`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
+***`output_filepath`***: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise

-`log_dir`: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
+***`log_dir`***: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
+
+Additionally, predicates may be specified in a separate predicates configuration file and loaded for overrides:
+
+***`predicates_path`***: Path to the [separate predicates-only file](https://eventstreamaces.readthedocs.io/en/latest/usage.html#separate-predicates-only-file). Defaults to null
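With only `cohort_dir` and `cohort_name` supplied, the remaining paths resolve from the defaults above; a sketch with a hypothetical cohort name:

```bash
aces-cli data.standard='direct' data.path='sample_data/sample_data.csv' \
    cohort_dir='sample_configs' cohort_name='inhospital_mortality'
# The defaults then resolve to:
#   config_path     = sample_configs/inhospital_mortality.yaml
#   output_filepath = sample_configs/inhospital_mortality.parquet
#   log_dir         = sample_configs/inhospital_mortality/.logs
```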

#### Tab Completion

@@ -257,6 +261,8 @@ For example, to query an in-hospital mortality task on the sample data (both the
>>> query.query(cfg=cfg, predicates_df=predicates_df)
```

+### Separate Predicates-Only File
+
For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can
be created and passed into `TaskExtractorConfig.load()`. Only referenced predicates will have a predicate
column computed and evaluated, so one could create a dataset-specific deposit file with many predicates and
@@ -266,4 +272,8 @@ reference as needed to ensure the cleanliness of the dataset-agnostic task crite
>>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
```

+If the same predicates are defined in both the task configuration file and the predicates-only file, the
+predicates-only definition takes precedence and will be used to override previous definitions. As such, one may
+create a predicates-only "database" file for a particular dataset, and override accordingly for various tasks.
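A minimal sketch of such a predicates-only "database" file, using the `expr`-style derived predicates shown elsewhere in these docs; the codes are hypothetical:

```yaml
# predicates.yaml - dataset-specific predicate definitions
predicates:
  death:
    code: MEDS_DEATH # hypothetical code
  discharge:
    code: HOSPITAL_DISCHARGE # hypothetical code
  discharge_or_death:
    expr: or(discharge, death)
```

If `death` were also defined in `criteria.yaml`, the definition here would win, per the precedence rule above.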

______________________________________________________________________
Binary file added sample_data/meds_sample/held_out/0.parquet
Binary file modified sample_data/meds_sample/sample_shard.parquet
Binary file removed sample_data/meds_sample/test/0.parquet
Binary file modified sample_data/meds_sample/train/0.parquet
Binary file modified sample_data/meds_sample/train/1.parquet
3 changes: 3 additions & 0 deletions src/aces/config.py
@@ -1273,6 +1273,9 @@ def load(cls, config_path: str | Path, predicates_path: str | Path = None) -> Ta

referenced_predicates = {pred for w in windows.values() for pred in w.referenced_predicates}
referenced_predicates.add(trigger.predicate)
+label_reference = [w.label for w in windows.values() if w.label]
+if label_reference:
+    referenced_predicates.update(set(label_reference))
current_predicates = set(referenced_predicates)
special_predicates = {ANY_EVENT_COLUMN, START_OF_RECORD_KEY, END_OF_RECORD_KEY}
for pred in current_predicates - special_predicates:
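This fix matters for task configurations where a predicate appears only as a window's `label` and is never referenced in a trigger or window constraint; a sketch of such a configuration (codes and durations hypothetical):

```yaml
predicates:
  admission:
    code: ADMISSION # hypothetical code
  death:
    code: MEDS_DEATH # hypothetical code; referenced only via `label` below

trigger: admission

windows:
  target:
    start: trigger
    end: start + 48h
    start_inclusive: false
    end_inclusive: true
    label: death
```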
8 changes: 8 additions & 0 deletions src/aces/configs/__init__.py
@@ -37,9 +37,17 @@
(`.csv` or `.parquet`) if using `direct`
- standard (required): data standard, one of 'meds', 'esgpt', or 'direct'
- ts_format (required if data.standard is 'direct'): timestamp format for the data
+- root (required, applicable when data=sharded): root directory for the data shards
+- shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
+  Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+  https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
+  https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
+predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
8 changes: 8 additions & 0 deletions src/aces/configs/_aces.yaml
@@ -55,9 +55,17 @@ hydra:
(`.csv` or `.parquet`) if using `direct`
- standard (required): data standard, one of 'meds', 'esgpt', or 'direct'
- ts_format (required if data.standard is 'direct'): timestamp format for the data
+- root (required, applicable when data=sharded): root directory for the data shards
+- shard (required, applicable when data=sharded): shard number of specific shard from a MEDS dataset.
+  Note: data.shard can be expanded using the `expand_shards` function. Please refer to
+  https://eventstreamaces.readthedocs.io/en/latest/usage.html#multiple-shards and
+  https://github.com/justin13601/ACES/blob/main/src/aces/expand_shards.py for more information.
cohort_dir (required): cohort directory, used to automatically load configs, saving results, and logging
cohort_name (required): cohort name, used to automatically load configs, saving results, and logging
config_path (optional): path to the task configuration file, defaults to '<cohort_dir>/<cohort_name>.yaml'
+predicates_path (optional): path to a separate predicates-only configuration file for overriding
output_filepath (optional): path to the output file, defaults to '<cohort_dir>/<cohort_name>.parquet'
---------------- Default Config ----------------
