Doc changes for supplementary (#57)
* unicode?

* DUC looks better I think

* Remove lines

* Docs

* Reorg titles

* Add new title

* Remove eICU tasks and change title in technical

* Docstring with subpackage?

* profiling numbers except memory

* Update init config docstring

* Changes

* Reformat init

* Add memory

* Small fixes
justin13601 authored Jun 13, 2024
1 parent 664122c commit 1bc9f2e
Showing 13 changed files with 182 additions and 85 deletions.
16 changes: 14 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,7 @@ predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config)
df_result = query.query(cfg=cfg, predicates_df=predicates_df)
```

**Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort using the ESGPT standard:
4. **Results**: The output will be a dataframe of subjects who satisfy the conditions defined in your task configuration file. Timestamps for the start/end boundaries of each window specified in the task configuration, as well as predicate counts for each window, are also provided. Below are sample logs for the successful extraction of an in-hospital mortality cohort using the ESGPT standard:

```log
aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs" data.standard="esgpt" data.path="MIMIC_ESD_new_schema_08-31-23-1/"
Expand Down Expand Up @@ -300,9 +300,21 @@ The `has` field specifies constraints relating to predicates within the window.

Support for static data depends on your data standard and how those variables are expressed. For instance, in MEDS, it is feasible to express static data as a predicate, and thus criteria can be set normally. However, this is not yet incorporated for ESGPT. If a predicates dataframe is directly used, you may create a predicate column that specifies your static variable.

### Complementary Tools

ACES is an integral part of the MEDS ecosystem. To fully leverage its capabilities, you can utilize it alongside other complementary MEDS tools, such as:

- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to transform various data schemas, including some common data models, into the MEDS format.
- [MEDS-TAB](https://github.com/Medical-Event-Data-Standard/meds_etl), which can be used to generate automated tabular baseline methods (ie., XGBoost over ACES-defined tasks).
- [MEDS-Polars](https://github.com/Medical-Event-Data-Standard/meds_etl), which contains polars-based ETL scripts.

### Alternative Tools

TODO
There are existing alternatives for cohort extraction that focus on specific common data models, such as [i2b2 PIC-SURE](https://pic-sure.org/) and [OHDSI ATLAS](https://atlas.ohdsi.org/).

ACES serves as a middle ground between PIC-SURE and ATLAS. While it may offer less capability than PIC-SURE, it compensates with greater ease of use and improved communication value. Compared to ATLAS, ACES provides greater capability, though with slightly lower ease of use, yet it still maintains a higher communication value.

Finally, ACES is not tied to a particular common data model. Built on a flexible event-stream format, ACES is a no-code solution with a descriptive input format, permitting easy and wide iteration over task definitions, and can be applied to a variety of schemas, making it a versatile tool suitable for diverse research needs.

## Future Roadmap

Expand Down
2 changes: 1 addition & 1 deletion docs/source/conf.py
Expand Up @@ -321,7 +321,7 @@ def ensure_pandoc_installed(_):


# -- Options for LaTeX output

# latex_engine = "xelatex"
latex_elements = { # type: ignore
# The paper size ("letterpaper" or "a4paper").
"papersize": "letterpaper",
Expand Down
42 changes: 24 additions & 18 deletions docs/source/configuration.md
@@ -1,6 +1,4 @@
# Configuration Language Specification

## Introduction and Terminology
## Configuration Language Specification

This document specifies the configuration language for the automatic extraction of task dataframes and cohorts
from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds)
Expand All @@ -27,9 +25,7 @@ contain events that satisfy certain aggregation functions over predicates for th

______________________________________________________________________

## Machine Form (ACES)

In the machine form, the configuration file consists of three parts:
In the machine form used by ACES, the configuration file consists of three parts:

- `predicates`, stored as a dictionary from string predicate names (which must be unique) to either
`PlainPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
Expand All @@ -38,7 +34,9 @@ In the machine form, the configuration file consists of three parts:
- `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
objects.

Next, we will detail each of these configuration objects.
Below, we will detail each of these configuration objects.

______________________________________________________________________

### Predicates: `PlainPredicateConfig` and `DerivedPredicateConfig`

Expand Down Expand Up @@ -68,11 +66,13 @@ on its source format.

1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format
(recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
and `value_max` constraints will be compared against MEDS' `numerical_value` field. **Note**: This syntax
does not currently support defining predicates that also rely on matching other, optional fields in the
MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull
request or upvoting any existing issue/PR that requests/implements this feature, and we will add support
for this capability.
and `value_max` constraints will be compared against MEDS' `numerical_value` field.

**Note**: This syntax does not currently support defining predicates that also rely on matching other,
optional fields in the MEDS syntax; if this is a desired feature for you, please let us know by filing a
GitHub issue or pull request or upvoting any existing issue/PR that requests/implements this feature,
and we will add support for this capability.

2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
`code` will be interpreted in the following manner:
a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the
Expand All @@ -95,7 +95,7 @@ accepted operations that can be applied to other predicates, containing precisel
- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
- `or(pred_1_name, pred_2_name, ...)`: Asserts that any of the specified predicates must be true.

Note that, currently, `and`'s and `or`'s cannot be nested. Upon user request, we may support further advanced
**Note**: Currently, `and`'s and `or`'s cannot be nested. Upon user request, we may support further advanced
analytic operations over predicates.
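
As a rough illustration of the `and`/`or` semantics described above, the following sketch parses a flat derived-predicate expression and evaluates it against per-event boolean predicate values. The helper names (`parse_derived`, `evaluate_derived`) are hypothetical and are not part of the ACES API; nested expressions are rejected, mirroring the restriction noted above.

```python
import re

# Hypothetical helpers for illustration only; this is not the ACES implementation.
# The pattern accepts exactly one `and(...)` or `or(...)` with no inner parentheses.
_EXPR_RE = re.compile(r"^(and|or)\(([^()]*)\)$")


def parse_derived(expr: str) -> tuple[str, list[str]]:
    """Split a flat expression such as 'or(discharge, death)' into (op, names)."""
    match = _EXPR_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"Unsupported or nested expression: {expr!r}")
    op, args = match.groups()
    names = [name.strip() for name in args.split(",") if name.strip()]
    if not names:
        raise ValueError(f"Expression has no predicate names: {expr!r}")
    return op, names


def evaluate_derived(expr: str, event: dict[str, bool]) -> bool:
    """Evaluate a flat derived predicate against one event's predicate values."""
    op, names = parse_derived(expr)
    values = [event[name] for name in names]
    # `and` requires every referenced predicate to hold; `or` requires at least one.
    return all(values) if op == "and" else any(values)


event = {"discharge": False, "death": True}
print(evaluate_derived("or(discharge, death)", event))
print(evaluate_derived("and(discharge, death)", event))
```

With this sketch, an input such as `and(a, or(b, c))` raises a `ValueError` rather than being silently misread.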

______________________________________________________________________
Expand Down Expand Up @@ -138,17 +138,22 @@ following rules:
In this case, the referencing event (either the start or end of the window) will be defined as occurring
exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
end or start of the window).
Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if

**Note**: If `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
`$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
the window fields.

2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
In this case, the referencing event will be defined as the next or previous event satisfying the
predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
predicate, `$PREDICATE`.

**Note**: If the `$REFERENCED` is the `start` field, then the "next predicate
ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
"previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
the window fields. Note that these forms can lead to windows being defined as single pointe vents, if the
the window fields. These forms can lead to windows being defined as single point events, if the
`$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
inclusive values are set.

3. `$REFERENCING = $REFERENCED`
In this case, the referencing event will be defined as the same event as the referenced event.
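
The three forms above can be sketched as a small classifier over the right-hand side of a `start` or `end` definition. This is an illustrative sketch only: the `Boundary` class and both functions are hypothetical names, the `+`/`-` offset syntax is assumed from expressions such as `trigger + 24h`, and event names are assumed not to contain `+` or `-`.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Boundary:
    kind: str                # "offset", "next", "previous", or "same"
    referenced: str          # the event or window field being referenced
    argument: Optional[str]  # signed time delta or predicate name, if any


def parse_boundary(expr: str) -> Boundary:
    """Classify the right-hand side of a `start` or `end` definition."""
    expr = expr.strip()
    # Check the arrow forms first, since "->" and "<-" both contain "-".
    for symbol, kind in (("->", "next"), ("<-", "previous")):
        if symbol in expr:
            referenced, predicate = (part.strip() for part in expr.split(symbol, 1))
            return Boundary(kind, referenced, predicate)
    # Assumed offset syntax: "<referenced> + <delta>" or "<referenced> - <delta>".
    match = re.match(r"^(\S+)\s*([+-])\s*(.+)$", expr)
    if match:
        referenced, sign, delta = match.groups()
        return Boundary("offset", referenced, sign + delta.strip())
    return Boundary("same", expr, None)


def check_ordering(boundary: Boundary) -> None:
    """Reject definitions that would place `end` before `start`."""
    forward = boundary.kind == "next" or (
        boundary.kind == "offset" and boundary.argument.startswith("+")
    )
    backward = boundary.kind == "previous" or (
        boundary.kind == "offset" and boundary.argument.startswith("-")
    )
    if boundary.referenced == "start" and backward:
        raise ValueError("a boundary referencing `start` must look forward in time")
    if boundary.referenced == "end" and forward:
        raise ValueError("a boundary referencing `end` must look backward in time")


for text in ("start + 24h", "start -> discharge_or_death", "end <- admission", "gap.end"):
    boundary = parse_boundary(text)
    check_ordering(boundary)
    print(boundary)
```

Checking the arrow forms before the offset form is a deliberate choice here, because a naive search for `-` would otherwise misread `->` and `<-` as negative offsets.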

Expand All @@ -175,8 +180,9 @@ the `start` event itself.
The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
that define the valid range for the count of observations of the named predicate that must be found in a window
for it to be considered valid. Either `min_valid` or `max_valid` constraints can be `None`, in which case
those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. Note
that as predicate counts are always integral, this specification does not need an additional
those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained.

**Note**: As predicate counts are always integral, this specification does not need an additional
inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
Expand Down
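
The inclusive `(min_valid, max_valid)` semantics described in this hunk can be sketched as follows. This is a minimal illustration, not the ACES implementation: the function names `satisfies` and `window_is_valid` are hypothetical, and the assumption that a predicate missing from the counts has a count of zero is ours.

```python
from typing import Optional

# A constraint is a pair of inclusive bounds; None leaves that side unconstrained.
Bounds = tuple[Optional[int], Optional[int]]


def satisfies(count: int, bounds: Bounds) -> bool:
    """Return True if `count` lies within the inclusive bounds."""
    min_valid, max_valid = bounds
    if min_valid is not None and count < min_valid:
        return False
    if max_valid is not None and count > max_valid:
        return False
    return True


def window_is_valid(counts: dict[str, int], has: dict[str, Bounds]) -> bool:
    """Check every constrained predicate; unreferenced predicates are ignored."""
    # Assumption: a predicate absent from `counts` was observed zero times.
    return all(satisfies(counts.get(name, 0), bounds) for name, bounds in has.items())


has = {"_ANY_EVENT": (5, None), "death": (None, 0)}
print(window_is_valid({"_ANY_EVENT": 7, "death": 0, "admission": 3}, has))
print(window_is_valid({"_ANY_EVENT": 7, "death": 1}, has))
```

Because counts are integers, an exclusive bound such as "more than 4" is written as the inclusive bound `(5, None)`, which is why no separate inclusive/exclusive flag is needed.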
21 changes: 10 additions & 11 deletions docs/source/index.md
Expand Up @@ -11,14 +11,13 @@ ACES is a library designed for the automatic extraction of cohorts from event-st
glob:
maxdepth: 2
---
GitHub README <readme>
README <readme>
Usage Guide <usage>
Task Examples <notebooks/examples>
Sample Data Tutorial <notebooks/tutorial>
Predicates DataFrame <notebooks/predicates>
Configuration Language <configuration>
Algorithm & Terminology <terminology>
Profiling <profiling>
Sample Data Tutorial <notebooks/tutorial>
Technical Details <technical>
Computational Profile <profiling>
Module API Reference <api/modules>
License <license>
```
Expand All @@ -29,29 +28,29 @@ ______________________________________________________________________

If you have a dataset and want to leverage it for machine learning tasks, the ACES ecosystem offers a streamlined and user-friendly approach. Here's how you can easily transform, prepare, and utilize your dataset with MEDS and ACES for efficient and effective machine learning:

### 1. Transform to MEDS
### I. Transform to MEDS

- Simplicity: Converting your dataset to the Medical Event Data Standard (MEDS) is straightforward and user-friendly compared to other Common Data Models (CDMs).
- Minimal Bias: This conversion process ensures that your data remains as close to its raw form as possible, minimizing the introduction of biases.
- [MEDS-ETL](https://github.com/Medical-Event-Data-Standard/meds_etl): Follow this link for detailed instructions and ETLs to transform your dataset into the MEDS format!

### 2. Identify Predicates
### II. Identify Predicates

- Task-Specific Concepts: Identify the predicates (data concepts) required for your specific machine learning tasks.
- Pre-Defined Criteria: Utilize our pre-defined criteria across various tasks and clinical areas to expedite this process.
- [PIE-MD](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/criteria): Access our repository of tasks to find relevant predicates!

### 3. Set Dataset-Agnostic Criteria
### III. Set Dataset-Agnostic Criteria

- Standardization: Combine the identified predicates with standardized, dataset-agnostic criteria files.
- Examples: Refer to the [MIMIC-IV](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/MIMIC-IV) and [eICU](https://github.com/mmcdermott/PIE_MD/tree/main/tasks/eICU) examples for guidance on how to structure your criteria files for your private datasets!

### 4. Run ACES
### IV. Run ACES

- Run the ACES Command-Line Interface tool (`aces-cli`) to extract cohorts based on your task - check out the [Usage Guide](https://eventstreamaces.readthedocs.io/en/latest/usage.html)!

### 5. Run MEDS-Tab
### V. Run MEDS-Tab

- Painless Reproducibility: Use [MEDS-Tab](https://github.com/mmcdermott/MEDS_TAB_MIMIC_IV/tree/main/tasks) to obtain comparable, reproducible, and well-tuned XGBoost results tailored to your dataset-specific feature space!

By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform steps 1-5 on new datasets in reasonable raw formulations!
By following these steps, you can seamlessly transform your dataset, define necessary criteria, and leverage powerful machine learning tools within the ACES ecosystem. This approach not only simplifies the process but also ensures high-quality, reproducible results for your machine learning for health projects. It can reliably take no more than a week of full-time human effort to perform Steps I-V on new datasets in reasonable raw formulations!
2 changes: 0 additions & 2 deletions docs/source/license.md
Expand Up @@ -5,5 +5,3 @@
language: text
---
```

______________________________________________________________________
8 changes: 4 additions & 4 deletions docs/source/notebooks/examples.ipynb
Expand Up @@ -138,11 +138,11 @@
"\n",
"The windows section contains the remaining three windows we defined previously - `input`, `gap`, and `target`.\n",
"\n",
"`input` begins at the start of a patient's record (ie., `NULL`), and ends 24 hours past `trigger` (ie., `admission`). As we'd like to include the events specified at both the start and end of `input`, if present, we can set both `start_inclusive` and `end_inclusive` as `True`. Our constraint on the number of records is specified in `has` using the `_ANY_EVENT` predicate, with its value set to be greater than or equal to 5 (ie., unbounded parameter on the right as seen in `(5, None)`). **Note**: since we'd like to make a prediction at the end of `input`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `trigger + 24h`.\n",
"`input` begins at the start of a patient's record (ie., `NULL`), and ends 24 hours past `trigger` (ie., `admission`). As we'd like to include the events specified at both the start and end of `input`, if present, we can set both `start_inclusive` and `end_inclusive` as `True`. Our constraint on the number of records is specified in `has` using the `_ANY_EVENT` predicate, with its value set to be greater than or equal to 5 (ie., unbounded parameter on the right as seen in `(5, None)`). **Note**: Since we'd like to make a prediction at the end of `input`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `trigger + 24h`.\n",
"\n",
"`gap` also begins at `trigger`, and ends 48 hours after. As we have included the left boundary event in `trigger` (ie., `admission`), it would be reasonable to not include it again as it should not play a role in `gap`. As such, we set `start_inclusive` to `False`. As we'd like our admission to be at least 48 hours long, we can place constraints specifying that there cannot be any `admission`, `discharge`, or `death` in `gap` (ie., right-bounded parameter at `0` as seen in `(None, 0)`).\n",
"\n",
"`target` begins at the end of `gap`, and ends at the next discharge or death event (ie., `discharge_or_death` predicate). We can use this arrow notation which ACES recognizes as event references (ie., `->` and `<-`; see [Time Range Fields](https://eventstreamaces--39.org.readthedocs.build/en/39/configuration.html#time-range-fields)). In our case, we end `target` at the next `discharge_or_death`. Similarly, as we included the event at the end of `gap`, if any, already in `gap`, we can set `start_inclusive` to `False`. **Note**: since we'd like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`."
"`target` begins at the end of `gap`, and ends at the next discharge or death event (ie., `discharge_or_death` predicate). We can use this arrow notation which ACES recognizes as event references (ie., `->` and `<-`; see [Time Range Fields](https://eventstreamaces--39.org.readthedocs.build/en/39/configuration.html#time-range-fields)). In our case, we end `target` at the next `discharge_or_death`. Similarly, as we included the event at the end of `gap`, if any, already in `gap`, we can set `start_inclusive` to `False`. **Note**: Since we'd like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`."
]
},
{
Expand Down Expand Up @@ -237,9 +237,9 @@
"\n",
"### Windows\n",
"\n",
"The windows section contains the two windows we defined - `gap` and `target`. In this case, the `gap` and `target` windows are defined relative to every single event (ie., `_ANY_EVENT`).\n",
"The windows section contains the two windows we defined - `gap` and `target`. In this case, the `gap` and `target` windows are defined relative to every single event (ie., `_ANY_EVENT`). `gap` begins at `trigger`, and ends 2 hours after. `target` beings at the end of `gap`, and ends 24 hours after. \n",
"\n",
"`gap` begins at `trigger`, and ends 2 hours after. `target` beings at the end of `gap`, and ends 24 hours after. **Note**: since we'd again like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`. Additionally, since a prediction would be made at the end of each `gap`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `_ANY_EVENT + 24h`."
"**Note**: Since we'd again like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`. Additionally, since a prediction would be made at the end of each `gap`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `_ANY_EVENT + 24h`."
]
},
{
Expand Down