diff --git a/README.md b/README.md index 75d42b9..f9c0fe2 100644 --- a/README.md +++ b/README.md @@ -101,7 +101,7 @@ predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config) df_result = query.query(cfg=cfg, predicates_df=predicates_df) ``` -**Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort using the ESGPT standard: +4. **Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort using the ESGPT standard: ```log aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs" data.standard="esgpt" data.path="MIMIC_ESD_new_schema_08-31-23-1/" @@ -300,9 +300,21 @@ The `has` field specifies constraints relating to predicates within the window. Support for static data depends on your data standard and those variables are expressed. For instance, in MEDS, it is feasible to express static data as a predicate, and thus criteria can be set normally. However, this is not yet incorporated for ESGPT. If a predicates dataframe is directly used, you may create a predicate column that specifies your static variable. +### Complementary Tools + +ACES is an integral part of the MEDS ecosystem. 
To fully leverage its capabilities, you can utilize it alongside other complementary MEDS tools, such as: + +- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to transform various data schemas, including some common data models, into the MEDS format. +- [MEDS-TAB](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to generate automated tabular baseline methods (ie., XGBoost over ACES-defined tasks). +- [MEDS-Polars](https://github.com/Medical-Event-Data-Standard/meds_etl), which contains polars-based ETL scripts. + ### Alternative Tools -TODO +There are existing alternatives for cohort extraction that focus on specific common data models, such as [i2b2 PIC-SURE](https://pic-sure.org/) and [OHDSI ATLAS](https://atlas.ohdsi.org/). + +ACES serves as a middle ground between PIC-SURE and ATLAS. While it may offer less capability than PIC-SURE, it compensates with greater ease of use and improved communication value. Compared to ATLAS, ACES provides greater capability, though with slightly lower ease of use, yet it still maintains a higher communication value. + +Finally, ACES is not tied to a particular common data model. Built on a flexible event-stream format, ACES is a no-code solution with a descriptive input format, permitting easy and wide iteration over task definitions, and can be applied to a variety of schemas, making it a versatile tool suitable for diverse research needs. ## Future Roadmap diff --git a/docs/source/conf.py b/docs/source/conf.py index 9c1cddb..ab4cef1 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -321,7 +321,7 @@ def ensure_pandoc_installed(_): # -- Options for LaTeX output - +# latex_engine = "xelatex" latex_elements = { # type: ignore # The paper size ("letterpaper" or "a4paper").
"papersize": "letterpaper", diff --git a/docs/source/configuration.md b/docs/source/configuration.md index 0dbef4d..aae4dc6 100644 --- a/docs/source/configuration.md +++ b/docs/source/configuration.md @@ -1,6 +1,4 @@ -# Configuration Language Specification - -## Introduction and Terminology +## Configuration Language Specification This document specifies the configuration language for the automatic extraction of task dataframes and cohorts from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) @@ -27,9 +25,7 @@ contain events that satisfy certain aggregation functions over predicates for th ______________________________________________________________________ -## Machine Form (ACES) - -In the machine form, the configuration file consists of three parts: +In the machine form used by ACES, the configuration file consists of three parts: - `predicates`, stored as a dictionary from string predicate names (which must be unique) to either `PlainPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or @@ -38,7 +34,9 @@ In the machine form, the configuration file consists of three parts: - `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig` objects. -Next, we will detail each of these configuration objects. +Below, we will detail each of these configuration objects. + +______________________________________________________________________ ### Predicates: `PlainPredicateConfig` and `DerivedPredicateConfig` @@ -68,11 +66,13 @@ on its source format. 1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format (recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min` - and `value_max` constraints will be compared against MEDS' `numerical_value` field. 
**Note**: This syntax - does not currently support defining predicates that also rely on matching other, optional fields in the - MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull - request or upvoting any existing issue/PR that requests/implements this feature, and we will add support - for this capability. + and `value_max` constraints will be compared against MEDS' `numerical_value` field. + + **Note**: This syntax does not currently support defining predicates that also rely on matching other, + optional fields in the MEDS syntax; if this is a desired feature for you, please let us know by filing a + GitHub issue or pull request or upvoting any existing issue/PR that requests/implements this feature, + and we will add support for this capability. + 2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the `code` will be interpreted in the following manner: a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the @@ -95,7 +95,7 @@ accepted operations that can be applied to other predicates, containing precisel - `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true. - `or(pred_1_name, pred_2_name, ...)`: Asserts that any of the specified predicates must be true. -Note that, currently, `and`'s and `or`'s cannot be nested. Upon user request, we may support further advanced +**Note**: Currently, `and`'s and `or`'s cannot be nested. Upon user request, we may support further advanced analytic operations over predicates. ______________________________________________________________________ @@ -138,17 +138,22 @@ following rules: In this case, the referencing event (either the start or end of the window) will be defined as occurring exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the end or start of the window). 
- Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if + + **Note**: If `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if `$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of the window fields. + 2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE` In this case, the referencing event will be defined as the next or previous event satisfying the - predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate + predicate, `$PREDICATE`. + + **Note**: If the `$REFERENCED` is the `start` field, then the "next predicate ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the "previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of - the window fields. Note that these forms can lead to windows being defined as single pointe vents, if the + the window fields. These forms can lead to windows being defined as single point events, if the `$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and inclusive values are set. + 3. `$REFERENCING = $REFERENCED` In this case, the referencing event will be defined as the same event as the referenced event. @@ -175,8 +180,9 @@ the `start` event itself. The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)` that define the valid range the count of observations of the named predicate that must be found in a window for it to be considered valid. Either `min_valid` or `max_valid` constraints can be `None`, in which case -those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. 
Note -that as predicate counts are always integral, this specification does not need an additional +those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. + +**Note**: As predicate counts are always integral, this specification does not need an additional inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate diff --git a/docs/source/index.md b/docs/source/index.md index 4cd300f..a052972 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -11,14 +11,13 @@ ACES is a library designed for the automatic extraction of cohorts from event-st glob: maxdepth: 2 --- -GitHub README +README Usage Guide Task Examples -Sample Data Tutorial Predicates DataFrame -Configuration Language -Algorithm & Terminology -Profiling +Sample Data Tutorial +Technical Details +Computational Profile Module API Reference License ``` @@ -29,29 +28,29 @@ ______________________________________________________________________ If you have a dataset and want to leverage it for machine learning tasks, the ACES ecosystem offers a streamlined and user-friendly approach. Here's how you can easily transform, prepare, and utilize your dataset with MEDS and ACES for efficient and effective machine learning: -### 1. Transform to MEDS +### I. Transform to MEDS - Simplicity: Converting your dataset to the Medical Event Data Standard (MEDS) is straightforward and user-friendly compared to other Common Data Models (CDMs). - Minimal Bias: This conversion process ensures that your data remains as close to its raw form as possible, minimizing the introduction of biases. 
- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl): Follow this link for detailed instructions and ETLs to transform your dataset into the MEDS format! -### 2. Identify Predicates +### II. Identify Predicates - Task-Specific Concepts: Identify the predicates (data concepts) required for your specific machine learning tasks. - Pre-Defined Criteria: Utilize our pre-defined criteria across various tasks and clinical areas to expedite this process. - [PIE-MD](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/criteria): Access our repository of tasks to find relevant predicates! -### 3. Set Dataset-Agnostic Criteria +### III. Set Dataset-Agnostic Criteria - Standardization: Combine the identified predicates with standardized, dataset-agnostic criteria files. - Examples: Refer to the [MIMIC-IV](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/MIMIC-IV) and [eICU](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/eICU) examples for guidance on how to structure your criteria files for your private datasets! -### 4. Run ACES +### IV. Run ACES - Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html)! -### 5. Run MEDS-Tab +### V. Run MEDS-Tab - Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space! -By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform steps 1-5 on new datasets in reasonable raw formulations! 
+By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. For new datasets in reasonable raw formulations, Steps I-V can reliably be completed with no more than a week of full-time human effort! diff --git a/docs/source/license.md b/docs/source/license.md index a97e0e0..442c1d5 100644 --- a/docs/source/license.md +++ b/docs/source/license.md @@ -5,5 +5,3 @@ language: text --- ``` - -______________________________________________________________________ diff --git a/docs/source/notebooks/examples.ipynb b/docs/source/notebooks/examples.ipynb index a1a52d3..1ccbdd1 100644 --- a/docs/source/notebooks/examples.ipynb +++ b/docs/source/notebooks/examples.ipynb @@ -138,11 +138,11 @@ "\n", "The windows section contains the remaining three windows we defined previously - `input`, `gap`, and `target`.\n", "\n", - "`input` begins at the start of a patient's record (ie., `NULL`), and ends 24 hours past `trigger` (ie., `admission`). As we'd like to include the events specified at both the start and end of `input`, if present, we can set both `start_inclusive` and `end_inclusive` as `True`. Our constraint on the number of records is specified in `has` using the `_ANY_EVENT` predicate, with its value set to be greater or equal to 5 (ie., unbounded parameter on the right as seen in `(5, None)`). **Note**: since we'd like to make a prediction at the end of `input`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `trigger + 24h`.\n", + "`input` begins at the start of a patient's record (ie., `NULL`), and ends 24 hours past `trigger` (ie., `admission`).
As we'd like to include the events specified at both the start and end of `input`, if present, we can set both `start_inclusive` and `end_inclusive` as `True`. Our constraint on the number of records is specified in `has` using the `_ANY_EVENT` predicate, with its value set to be greater or equal to 5 (ie., unbounded parameter on the right as seen in `(5, None)`). **Note**: Since we'd like to make a prediction at the end of `input`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `trigger + 24h`.\n", "\n", "`gap` also begins at `trigger`, and ends 48 hours after. As we have included included the left boundary event in `trigger` (ie., `admission`), it would be reasonable to not include it again as it should not play a role in `gap`. As such, we set `start_inclusive` to `False`. As we'd like our admission to be at least 48 hours long, we can place constraints specifying that there cannot be any `admission`, `discharge`, or `death` in `gap` (ie., right-bounded parameter at `0` as seen in `(None, 0)`).\n", "\n", - "`target` beings at the end of `gap`, and ends at the next discharge or death event (ie., `discharge_or_death` predicate). We can use this arrow notation which ACES recognizes as event references (ie., `->` and `<-`; see [Time Range Fields](https://eventstreamaces--39.org.readthedocs.build/en/39/configuration.html#time-range-fields)). In our case, we end `target` at the next `discharge_or_death`. Similarly, as we included the event at the end of `gap`, if any, already in `gap`, we can set `start_inclusive` to `False`. **Note**: since we'd like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`." + "`target` beings at the end of `gap`, and ends at the next discharge or death event (ie., `discharge_or_death` predicate). 
We can use this arrow notation, which ACES recognizes as event references (ie., `->` and `<-`; see [Time Range Fields](https://eventstreamaces--39.org.readthedocs.build/en/39/configuration.html#time-range-fields)). In our case, we end `target` at the next `discharge_or_death`. Similarly, as we included the event at the end of `gap`, if any, already in `gap`, we can set `start_inclusive` to `False`. **Note**: Since we'd like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`." ] }, { @@ -237,9 +237,9 @@ "\n", "### Windows\n", "\n", - "The windows section contains the two windows we defined - `gap` and `target`. In this case, the `gap` and `target` windows are defined relative to every single event (ie., `_ANY_EVENT`).\n", + "The windows section contains the two windows we defined - `gap` and `target`. In this case, the `gap` and `target` windows are defined relative to every single event (ie., `_ANY_EVENT`). `gap` begins at `trigger`, and ends 2 hours after. `target` begins at the end of `gap`, and ends 24 hours after. \n", "\n", - "`gap` begins at `trigger`, and ends 2 hours after. `target` beings at the end of `gap`, and ends 24 hours after. **Note**: since we'd again like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`. Additionally, since a prediction would be made at the end of each `gap`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `_ANY_EVENT + 24h`." + "**Note**: Since we'd again like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`. Additionally, since a prediction would be made at the end of each `gap`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `_ANY_EVENT + 24h`."
] }, { diff --git a/docs/source/notebooks/predicates.ipynb b/docs/source/notebooks/predicates.ipynb index 6c367b3..08f062c 100644 --- a/docs/source/notebooks/predicates.ipynb +++ b/docs/source/notebooks/predicates.ipynb @@ -26,19 +26,19 @@ "\n", "| subject_id | timestamp | code | value |\n", "|------------|---------------------|-------------------------|-------------------------|\n", - "| 1 | 1989-01-01 00:00:00 | ADMISSION | |\n", + "| 1 | 1989-01-01 00:00:00 | ADMISSION | null |\n", "| 1 | 1989-01-01 01:00:00 | LAB//HR | 90 |\n", - "| 1 | 1989-01-01 01:00:00 | PROCEDURE_START | |\n", - "| 1 | 1989-01-01 02:00:00 | DISCHARGE | |\n", - "| 1 | 1989-01-01 02:00:00 | PROCEDURE_END | |\n", - "| 2 | 1991-05-06 12:00:00 | ADMISSION | |\n", - "| 2 | 1991-05-06 20:00:00 | DEATH | |\n", - "| 3 | 1980-10-17 22:00:00 | ADMISSION | |\n", + "| 1 | 1989-01-01 01:00:00 | PROCEDURE_START | null |\n", + "| 1 | 1989-01-01 02:00:00 | DISCHARGE | null |\n", + "| 1 | 1989-01-01 02:00:00 | PROCEDURE_END | null |\n", + "| 2 | 1991-05-06 12:00:00 | ADMISSION | null |\n", + "| 2 | 1991-05-06 20:00:00 | DEATH | null |\n", + "| 3 | 1980-10-17 22:00:00 | ADMISSION | null |\n", "| 3 | 1980-10-17 22:00:00 | LAB//HR | 120 |\n", "| 3 | 1980-10-18 01:00:00 | LAB//temp | 37 |\n", - "| 3 | 1980-10-18 09:00:00 | DISCHARGE | |\n", - "| 3 | 1982-02-02 02:00:00 | ADMISSION | |\n", - "| 3 | 1982-02-02 04:00:00 | DEATH | |\n", + "| 3 | 1980-10-18 09:00:00 | DISCHARGE | null |\n", + "| 3 | 1982-02-02 02:00:00 | ADMISSION | null |\n", + "| 3 | 1982-02-02 04:00:00 | DEATH | null |\n", "\n", "The `code` column contains a string of an event that occurred at the given `timestamp` for a given `subject_id`. You may then create a series of predicate columns depending on what suits your needs. 
For instance, here are some plausible predicate columns that could be created:\n", "\n", @@ -55,7 +55,7 @@ "| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n", "| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |\n", "\n", - "Note that this set of predicates are all `plain` predicates (ie., explicitly expressed as a value in the dataset), with the exception of the `derived` predicate `discharge_or_death`, which can be expressed by applying boolean logic on the `discharge` and `death` predicates (ie., `or(discharge, death)`). You may choose to create these columns for `derived` predicates explicitly (as you would `plain` predicates). Or, ACES can automatically create them from `plain` predicates if the boolean logic is provided in the task configuration file. Please see [Predicates](https://eventstreamaces.readthedocs.io/en/latest/configuration.html#predicates-plainpredicateconfig-and-derivedpredicateconfig) for more information.\n", + "**Note**: These predicates are all `plain` predicates (ie., explicitly expressed as a value in the dataset), with the exception of the `derived` predicate `discharge_or_death`, which can be expressed by applying boolean logic on the `discharge` and `death` predicates (ie., `or(discharge, death)`). You may choose to create these columns for `derived` predicates explicitly (as you would `plain` predicates). Or, ACES can automatically create them from `plain` predicates if the boolean logic is provided in the task configuration file. Please see [Predicates](https://eventstreamaces.readthedocs.io/en/latest/configuration.html#predicates-plainpredicateconfig-and-derivedpredicateconfig) for more information.\n", "\n", "Additionally, you may notice that the tables differ in shape. In the original raw data, (`subject_id`, `timestamp`) is not unique. However, a final predicates dataframe must have unique (`subject_id`, `timestamp`) pairs.
If the MEDS or ESGPT standard is used, ACES will automatically collapse rows down into unique per-patient per-timestamp levels (ie., grouping by these two columns and aggregating by summing predicate counts). However, if creating predicate columns directly, please ensure your dataframe is unique over (`subject_id`, `timestamp`)." ] @@ -95,9 +95,9 @@ "\n", "ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.\n", "\n", - "Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be difficult to express using the simple boolean formulas in the configuration file, we recommend also creating them manually prior to using ACES.\n", + "Again, it is acceptable if your own predicates dataframe only contains `plain` predicate columns, as ACES can automatically create `derived` predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of `and/or`) in the configuration file, we direct you to create them manually prior to using ACES. 
Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see [#47](https://github.com/justin13601/ACES/issues/47)).\n", "\n", - "**Note**: when creating `plain` predicate columns directly, you must still define them in the configuration file (they could be with an arbitrary value in the `code` field) - ACES will verify their existence after data loading (ie., by validating that a column exists with the predicate name in your dataframe). You will also need them for referencing in your windows." + "**Note**: When creating `plain` predicate columns directly, you must still define them in the configuration file (the `code` field can be set to an arbitrary value) - ACES will verify their existence after data loading (ie., by validating that a column exists with the predicate name in your dataframe). You will also need them for referencing in your windows." ] }, { @@ -108,11 +108,10 @@ "\n", "```yaml\n", "predicates:\n", - " ...\n", " death:\n", - " code: foo\n", + " code: defined in data\n", " discharge:\n", - " code: bar\n", + " code: defined in data\n", " discharge_or_death:\n", " expr: or(discharge, death)\n", " ...\n", diff --git a/docs/source/notebooks/tutorial.ipynb b/docs/source/notebooks/tutorial.ipynb index 34fefb9..381a451 100644 --- a/docs/source/notebooks/tutorial.ipynb +++ b/docs/source/notebooks/tutorial.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Code Example with ESGPT Synthetic Data" + "# Code Example with Synthetic Data" ] }, { @@ -137,7 +137,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## ESGPT Data" + "## Data" ] }, { @@ -209,7 +209,7 @@ "\n", "Each row of the resulting dataframe is a valid realization of our task tree. Hence, each instance can be included in our cohort used for the prediction of in-hospital mortality as defined in our task configuration file.
The output contains:\n", "\n", - "- `subject_id`: subject IDs of our cohort (note: since we'd like to treat individual admissions as separate samples, there will be duplicate subject IDs)\n", + "- `subject_id`: subject IDs of our cohort (since we'd like to treat individual admissions as separate samples, there will be duplicate subject IDs)\n", "- `index_timestamp`: timestamp of when a prediction is made, which coincides with the `end` timestamp of the `input` window (as specified in our task configuration)\n", "- `label`: binary label of mortality, which is derived from the `death` predicate of the `target` window (as specified in our task configuration)\n", "- `trigger`: timestamp of the `trigger` event, which is the `admission` predicate (as specified in our task configuration)\n", diff --git a/docs/source/profiling.md b/docs/source/profiling.md index 0447161..77ec4ed 100644 --- a/docs/source/profiling.md +++ b/docs/source/profiling.md @@ -1 +1,16 @@ -# TODO - include the table from supplementary +# Computational Profile + +To establish an overview of the computational profile of ACES, a collection of common tasks was queried on the MIMIC-IV dataset in MEDS format. + +The MIMIC-IV MEDS dataset comprises five shards, each containing approximately 50,000 patients and an average of approximately 80,500,000 event rows. + +All tests were executed on a Linux server with 36 cores and 340 GB of RAM available. A single MEDS shard was used, which provides a bounded computational overview of ACES. For instance, if one shard costs $M$ memory and $T$ time, then $N$ shards may be executed in parallel with $N*M$ memory and $T$ time, or in series with $M$ memory and $T*N$ time.
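The series/parallel bound described above is simple arithmetic; the following throwaway sketch (a hypothetical helper, not part of the ACES codebase, with illustrative numbers) makes it explicit:

```python
def shard_cost(
    n_shards: int, mem_per_shard: float, time_per_shard: float, parallel: bool = True
) -> tuple[float, float]:
    """Return (memory, time) to process `n_shards` shards, given single-shard costs.

    Parallel execution: all shards run at once, so memory scales as N*M while time stays T.
    Serial execution: shards run one after another, so memory stays M while time scales as N*T.
    """
    if parallel:
        return n_shards * mem_per_shard, time_per_shard
    return mem_per_shard, n_shards * time_per_shard


# Illustrative numbers only: 5 shards at 100 memory units and 360 s each.
print(shard_cost(5, 100.0, 360.0, parallel=True))   # (500.0, 360.0)
print(shard_cost(5, 100.0, 360.0, parallel=False))  # (100.0, 1800.0)
```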
+ +| Task | # Patients | # Samples | Total Time (secs) | Max Memory (MiBs) | +| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | --------- | ----------------- | ----------------- | +| [First 24h in-hospital mortality](https://github.com/mmcdermott/PIE_MD/blob/e94189864080f957fcf2b7416c1dde401dfe4c15/tasks/MIMIC-IV/mortality/in_hospital/first_24h.yaml) | 20,971 | 58,823 | 363.09 | 106,367.14 | +| [First 48h in-hospital mortality](https://github.com/mmcdermott/PIE_MD/blob/e94189864080f957fcf2b7416c1dde401dfe4c15/tasks/MIMIC-IV/mortality/in_hospital/first_48h.yaml) | 18,847 | 60,471 | 364.62 | 108,913.95 | +| [First 24h in-ICU mortality](https://github.com/mmcdermott/PIE_MD/blob/e94189864080f957fcf2b7416c1dde401dfe4c15/tasks/MIMIC-IV/mortality/in_icu/first_24h.yaml) | 4,768 | 7,156 | 216.81 | 39,594.37 | +| [First 48h in-ICU mortality](https://github.com/mmcdermott/PIE_MD/blob/e94189864080f957fcf2b7416c1dde401dfe4c15/tasks/MIMIC-IV/mortality/in_icu/first_48h.yaml) | 4,093 | 7,112 | 217.98 | 39,451.86 | +| [30d post-hospital-discharge mortality](https://github.com/mmcdermott/PIE_MD/blob/e94189864080f957fcf2b7416c1dde401dfe4c15/tasks/MIMIC-IV/mortality/post_hospital_discharge/30d.yaml) | 28,416 | 68,547 | 182.91 | 30,434.86 | +| [30d re-admission](https://github.com/mmcdermott/PIE_MD/blob/e94189864080f957fcf2b7416c1dde401dfe4c15/tasks/MIMIC-IV/readmission/30d.yaml) | 18,908 | 464,821 | 367.41 | 106,064.04 | diff --git a/docs/source/technical.md b/docs/source/technical.md new file mode 100644 index 0000000..28b3dcd --- /dev/null +++ b/docs/source/technical.md @@ -0,0 +1,7 @@ +# ACES Technical Details + +```{include} configuration.md +``` + +```{include} terminology.md +``` diff --git a/docs/source/terminology.md b/docs/source/terminology.md index f99ebf6..a74f1e1 100644 --- a/docs/source/terminology.md +++ 
b/docs/source/terminology.md @@ -1,6 +1,4 @@ -# Algorithm & Design - -## Introduction +## Terminology & Design We will assume that we are given a dataframe `df` which details events that have happened to subjects. Each row in the dataframe will have a `subject_id` column which identifies the subject, and a `timestamp` column @@ -63,21 +61,25 @@ follows: ```yaml trigger: admission -gap: - start: trigger - end: trigger + 48h - excludes: - - discharge - - death - - covid_dx -target: - start: gap.end - end: discharge | death - label: death - excludes: - - covid_dx -input: - end: trigger + 24h + +windows: + input: + start: + end: trigger + 24h + gap: + start: trigger + end: start + 48h + has: + admission: (None, 0) + discharge: (None, 0) + death: (None, 0) + covid_dx: (None, 0) + target: + start: gap.end + end: start -> discharge_or_death + has: + covid_dx: (None, 0) + label: death ``` Given that our machine learning model seeks to predict in-hospital mortality, our dataset should include both @@ -85,14 +87,13 @@ positive and negative samples (patients that died in the hospital and patients t `target` "window" concludes at either a `"death"` event (patients that died) or a`"discharge"` event (patients that didn't die). -Note that this is conceptual pseudocode and not the actual configuration language, which may look slightly -different. Nevertheless, we can see that this set of specifications can be realized in a "valid" form for a +We can see that this set of specifications can be realized in a "valid" form for a patient if there exist a set of time points such that, within 48 hours after an admission, there are no discharges, deaths, or COVID diagnoses, and that there exists a discharge or death event after the first 48 hours of an admission where there were no COVID diagnoses between the end of that first 48 hours and the subsequent discharge or death event. 
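To make this hierarchy concrete, here is a toy sketch (a hypothetical nested-dict model, not ACES's internal representation) of the nodes inferred for such a configuration:

```python
# Toy model of the inferred node hierarchy (hypothetical; not ACES internals).
# Each node is an event boundary; its children are events defined relative to it.
tree = {
    "trigger": {               # the admission event
        "input.end": {},       # trigger + 24h
        "gap.end": {           # trigger + 48h
            "target.end": {},  # next discharge-or-death event after gap.end
        },
    },
}


def count_nodes(node: dict) -> int:
    """Total number of events (nodes) in the nested-dict tree."""
    return len(node) + sum(count_nodes(child) for child in node.values())


print(count_nodes(tree))  # 4
```

The recursion over subtrees mirrors how the algorithm described below traverses the tree, realizing each child boundary relative to its parent.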
-Note that these windows form a naturally hierarchical, tree-based structure based on their relative +These windows form a naturally hierarchical, tree-based structure based on their relative dependencies on one another. In particular, we can realize the following tree structure constructed by nodes inferred for the above configuration: @@ -120,7 +121,7 @@ about the true configuration language that is used in practice to specify "windo ______________________________________________________________________ -## Algorithm Terminology +## Terminology #### Event @@ -177,9 +178,9 @@ perform temporal and event-based aggregations to determine whether windows satis ______________________________________________________________________ -## Algorithm Design +## Design -### Initialization +### I. Initialization #### Inputs @@ -212,7 +213,7 @@ With this dataframe, we can proceed to traverse the tree and recurse over each s ______________________________________________________________________ -### Recursive Step +### II. Recursive Step #### Inputs @@ -304,7 +305,7 @@ subtree. ______________________________________________________________________ -### Clean-Up +### III. Clean-Up #### Inputs diff --git a/docs/source/usage.md b/docs/source/usage.md index 7ed4e1f..83102ca 100644 --- a/docs/source/usage.md +++ b/docs/source/usage.md @@ -18,7 +18,7 @@ pip install es-aces **Example: `inhospital_mortality.yaml`** -Please see the [Task Configuration File Overview](https://eventstreamaces.readthedocs.io/en/latest/overview.html#task-configuration-file) for details on how to create this configuration for your own task! More examples are available [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html). +Please see the [Task Configuration File Overview](https://eventstreamaces.readthedocs.io/en/latest/overview.html#task-configuration-file) for details on how to create this configuration for your own task! 
More examples are available [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and in the [GitHub repository](https://github.com/justin13601/ACES/tree/main/sample_configs). This particular task configuration defines a cohort for the binary prediction of in-hospital mortality 48 hours after admission. Patients with 5 or more records between the start of their record and 24 hours after the admission will be included. The cohort includes both those that have been discharged (label=`0`) and those that have died (label=`1`). @@ -61,6 +61,8 @@ windows: label: death ``` +**Note**: Each configuration file contains [`predicates`](https://eventstreamaces.readthedocs.io/en/latest/readme.html#predicates), a [`trigger`](https://eventstreamaces.readthedocs.io/en/latest/readme.html#trigger-event), and [`windows`](https://eventstreamaces.readthedocs.io/en/latest/readme.html#windows). Additionally, the `label` field is used to extract the predicate count from the window in which it is defined, and this count acts as the task label. It has been set to the `death` predicate from the `target` window in this example. The `index_timestamp` is used to specify the timestamp at which a prediction is made and can be set to `start` or `end` of a particular window. In most tasks, including this one, it can be set to `end` in the window containing input data (`input` in this example). + ### Run the CLI You can now run `aces-cli` in your terminal. Suppose we have a directory structure like the following: diff --git a/src/aces/configs/__init__.py b/src/aces/configs/__init__.py index e69de29..6f2de57 100644 --- a/src/aces/configs/__init__.py +++ b/src/aces/configs/__init__.py @@ -0,0 +1,58 @@ +"""This subpackage contains the Hydra configuration groups for ACES, which can be used for `aces-cli`. + +Configuration Group File Structure: + +.. code-block:: text + + config/ + ├─ data/ + │ ├─ single_file.yaml + │ ├─ defaults.yaml + │ ├─ sharded.yaml + ├─ aces.yaml + + +`aces-cli` help message: + +.. code-block:: text + + ================== aces-cli =================== + Welcome to the command-line interface for ACES! + + This end-to-end tool extracts a cohort from an external dataset based on a defined task configuration + file and saves the output file(s). Several data standards are supported, including `meds` (requires a + dataset in the MEDS format, either with a single shard or multiple shards), `esgpt` (requires a dataset + in the ESGPT format), and `direct` (requires a pre-computed predicates dataframe as well as a timestamp + format). Hydra multi-run (`-m`) and sweep capabilities are supported, and launchers can be configured. + + ------------- Configuration Groups ------------ + $APP_CONFIG_GROUPS + `data` is defaulted to `data=single_file`. Use `data=sharded` to enable extraction with multiple shards + on MEDS. + + ------------------ Arguments ------------------ + data.*: + - path (required): path to the data directory if using MEDS with multiple shards or ESGPT, or path to + the data `.parquet` if using MEDS with a single shard, or path to the predicates dataframe + (`.csv` or `.parquet`) if using `direct` + - standard (required): data standard, one of 'meds', 'esgpt', or 'direct' + - ts_format (required if data.standard is 'direct'): timestamp format for the data + cohort_dir (required): cohort directory, used for automatically loading configs, saving results, and logging + cohort_name (required): cohort name, used for automatically loading configs, saving results, and logging + config_path (optional): path to the task configuration file, defaults to '/.yaml' + output_filepath (optional): path to the output file, defaults to '/.parquet' + + ---------------- Default Config ---------------- + $CONFIG + ------------------------------------------------ + All fields may be overridden via the command-line 
interface. For example: + + aces-cli cohort_name="..." cohort_dir="..." data.standard="..." data="..." data.root="..." + "data.shard=$$(expand_shards .../...)" ... + + For more information, visit: https://eventstreamaces.readthedocs.io/en/latest/usage.html + + Powered by Hydra (https://hydra.cc) + Use --hydra-help to view Hydra specific help + =============================================== +"""
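The hierarchical, tree-based window structure described in the `terminology.md` changes above can be illustrated in code. The sketch below is NOT ACES's actual implementation; it is a simplified heuristic that infers each window's parent in the in-hospital mortality example by checking whether a window's `start`/`end` fields reference another window (e.g., `gap.end`), falling back to the trigger event otherwise:

```python
# Illustrative sketch only -- NOT ACES's actual tree-construction code.
# A window whose `start`/`end` references another window hangs off that
# window; otherwise it hangs off the trigger event.

windows = {
    "input": {"start": None, "end": "trigger + 24h"},
    "gap": {"start": "trigger", "end": "start + 48h"},
    "target": {"start": "gap.end", "end": "start -> discharge_or_death"},
}


def parent_of(name: str, spec: dict, known: dict) -> str:
    """Return the window this window depends on, or 'trigger'."""
    for bound in (spec.get("start"), spec.get("end")):
        if not bound:
            continue
        # e.g. 'gap.end' -> 'gap'; 'trigger + 24h' -> 'trigger'
        ref = bound.split(".")[0].split()[0]
        if ref in known and ref != name:
            return ref
    return "trigger"


tree = {name: parent_of(name, spec, windows) for name, spec in windows.items()}
print(tree)  # {'input': 'trigger', 'gap': 'trigger', 'target': 'gap'}
```

This reproduces the dependency tree implied by the example configuration: `input` and `gap` are anchored directly on the trigger admission, while `target` hangs off `gap` because its start is `gap.end`.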