From 2645664be49232849fc26e0d5dadca2bd78f4e5e Mon Sep 17 00:00:00 2001
From: Justin Xu
Date: Thu, 13 Jun 2024 09:10:23 +0100
Subject: [PATCH 1/5] Attempt reference to module api?

---
 docs/source/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/usage.md b/docs/source/usage.md
index 83102ca..d2425a5 100644
--- a/docs/source/usage.md
+++ b/docs/source/usage.md
@@ -237,7 +237,7 @@ You can also use the `aces.query.query()` function to extract a cohort in Python
 .. autofunction:: aces.query.query
 ```
-The `cfg` parameter must be of type `config.TaskExtractorConfig`, and the `predicates_df` parameter must be of type `polars.DataFrame`.
+The `cfg` parameter must be of type :py:mod:`aces.config.TaskExtractorConfig`, and the `predicates_df` parameter must be of type `polars.DataFrame`.
 Details about the configuration language used to define the `cfg` parameter can be found in {doc}`/configuration`.

From 9a6433a2b13f4a83272796cc774f04bfac9585a6 Mon Sep 17 00:00:00 2001
From: Justin Xu
Date: Thu, 13 Jun 2024 09:20:18 +0100
Subject: [PATCH 2/5] Using mystparser

---
 docs/source/usage.md | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/docs/source/usage.md b/docs/source/usage.md
index d2425a5..9f7e042 100644
--- a/docs/source/usage.md
+++ b/docs/source/usage.md
@@ -67,22 +67,13 @@ windows:
 You can now run `aces-cli` in your terminal. Suppose we have a directory structure like the following:
-```
-ACES/
-├── sample_data/
-│ ├── esgpt_sample/
-│ │ ├── ...
-│ │ ├── events_df.parquet
-│ │ └── dynamic_measurements_df.parquet
-│ ├── meds_sample/
-│ │ ├── shards/
-│ │ │ ├── 0.parquet
-│ │ │ └── 1.parquet
-│ │ └── sample_shard.parquet
-│ └── sample_data.csv
-├── sample_configs/
-│ └── inhospital_mortality.yaml
-└── ...
+```yaml
+ACES/ ├── sample_data/ │ ├── esgpt_sample/ │ │ ├── ... │ │ ├── events_df.parquet
+│ │ └── dynamic_measurements_df.parquet │ ├── meds_sample/ │ │ ├── shards/
+│ │ │ ├── 0.parquet │ │ │ └── 1.parquet │ │ └── sample_shard.parquet
+│ └── sample_data.csv ├── sample_configs/ │ └── inhospital_mortality.yaml └──
+...
+...
 ```
 **To query from a single MEDS shard**:
@@ -175,7 +166,7 @@ To query from a direct predicates dataframe:
 #### Task Configuration
-`cohort_dir`: Directory the your task configuration file
+`cohort_dir`: Directory of your task configuration file
 `cohort_name`: Name of the task configuration file
@@ -237,7 +228,7 @@ You can also use the `aces.query.query()` function to extract a cohort in Python
 .. autofunction:: aces.query.query
 ```
-The `cfg` parameter must be of type :py:mod:`aces.config.TaskExtractorConfig`, and the `predicates_df` parameter must be of type `polars.DataFrame`.
+The `cfg` parameter must be of type {py:class}`aces.config.TaskExtractorConfig`, and the `predicates_df` parameter must be of type `polars.DataFrame`.
 Details about the configuration language used to define the `cfg` parameter can be found in {doc}`/configuration`.

From da12084065644742e69ec2fc5e892e1f9306f330 Mon Sep 17 00:00:00 2001
From: Justin Xu
Date: Thu, 13 Jun 2024 09:36:20 +0100
Subject: [PATCH 3/5] Add reference to module api

---
 docs/source/{terminology.md => algorithm.md} | 2 +-
 docs/source/configuration.md | 16 ++++++++--------
 docs/source/technical.md | 2 +-
 3 files changed, 10 insertions(+), 10 deletions(-)
 rename docs/source/{terminology.md => algorithm.md} (99%)

diff --git a/docs/source/terminology.md b/docs/source/algorithm.md
similarity index 99%
rename from docs/source/terminology.md
rename to docs/source/algorithm.md
index 45ce804..1fb90ab 100644
--- a/docs/source/terminology.md
+++ b/docs/source/algorithm.md
@@ -188,7 +188,7 @@ During initialization, we will be given the following inputs:
 ##### `cfg`
-`cfg` is a `TaskExtractorConfig` object containing our task definition, include all information about
+`cfg` is a {py:class}`aces.config.TaskExtractorConfig` object containing our task definition, include all information about
 predicates, the trigger event, and windows.
 ##### `predicates_df`
diff --git a/docs/source/configuration.md b/docs/source/configuration.md
index aae4dc6..6d51a26 100644
--- a/docs/source/configuration.md
+++ b/docs/source/configuration.md
@@ -6,7 +6,7 @@ format (recommended) or the [ESGPT](https://eventstreamml.readthedocs.io/en/late
 system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more
-details on the recursive algorithm, see the `terminology.md` file.
+details on the recursive algorithm, see [Algorithm Design](https://eventstreamaces.readthedocs.io/en/latest/technical.html#algorithm-design).
 As indicated above, these cohorts are specified through a combination of concepts (realized as event _predicate_ functions, _aka_ "predicates") which are _dataset specific_ and inclusion/exclusion/labeling
@@ -28,10 +28,10 @@ ______________________________________________________________________
 In the machine form used by ACES, the configuration file consists of three parts:
 - `predicates`, stored as a dictionary from string predicate names (which must be unique) to either
- `PlainPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
- `DerivedPredicateConfig` objects, which store predicates that build on other predicates.
+ {py:class}`aces.config.PlainPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
+ {py:class}`aces.config.DerivedPredicateConfig` objects, which store predicates that build on other predicates.
 - `trigger`, stored as a string to `EventConfig`
-- `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
+- `windows`, stored as a dictionary from string window names (which must be unique) to {py:class}`aces.config.WindowConfig`
 objects.
 Below, we will detail each of these configuration objects.
@@ -40,7 +40,7 @@ ______________________________________________________________________
 ### Predicates: `PlainPredicateConfig` and `DerivedPredicateConfig`
-#### `PlainPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data
+#### {py:class}`aces.config.PlainPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data
 These configs consist of the following four fields:
@@ -87,7 +87,7 @@ on its source format.
 be of the univariate regression type and its value, if needed, will be pulled from the corresponding column.
-#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates
+#### {py:class}`aces.config.DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates
 These configuration objects consist of only a single string field--`expr`--which contains a limited grammar of accepted operations that can be applied to other predicates, containing precisely the following:
@@ -100,7 +100,7 @@ analytic operations over predicates.
 ______________________________________________________________________
-### Events: `EventConfig`
+### Events: {py:class}`aces.config.EventConfig`
 The event config consists of only a single field, `predicate`, which specifies the predicate that must be observed with value greater than one to satisfy the event. There can only be one defined "event" with an
@@ -110,7 +110,7 @@ The value of its field can be any defined predicate.
 ______________________________________________________________________
-### Windows: `WindowConfig`
+### Windows: {py:class}`aces.config.WindowConfig`
 Windows contain a tracking `name` field, and otherwise are specified with two parts: (1) A set of four parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
diff --git a/docs/source/technical.md b/docs/source/technical.md
index 28b3dcd..9afb124 100644
--- a/docs/source/technical.md
+++ b/docs/source/technical.md
@@ -3,5 +3,5 @@
 ```{include} configuration.md
 ```
-```{include} terminology.md
+```{include} algorithm.md
 ```

From 457ffcf62310bf78321dc1109a37fd334fbb2781 Mon Sep 17 00:00:00 2001
From: Justin Xu
Date: Thu, 13 Jun 2024 09:56:22 +0100
Subject: [PATCH 4/5] More reference and fixes

---
 docs/source/notebooks/examples.ipynb | 14 ++++++++-----
 docs/source/notebooks/predicates.ipynb | 2 +-
 docs/source/notebooks/tutorial.ipynb | 4 ++--
 docs/source/usage.md | 27 ++++++++++++++++++--------
 4 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/docs/source/notebooks/examples.ipynb b/docs/source/notebooks/examples.ipynb
index 1ccbdd1..1d52248 100644
--- a/docs/source/notebooks/examples.ipynb
+++ b/docs/source/notebooks/examples.ipynb
@@ -6,10 +6,10 @@
 "source": [
 "# Task Examples\n",
 "\n",
- "Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository (`../../../sample_data/`), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository (`../../../sample_configs`), and cohorts can be extracted using the `aces-cli` tool:\n",
+ "Provided below are two examples of mortality prediction tasks that ACES could easily extract subject cohorts for. The configurations have been tested all the provided synthetic data in the repository ([`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data)), as well as the MIMIC-IV dataset loaded using MEDS & ESGPT (with very minor changes to the below predicate definition). The configuration files for both of these tasks are provided in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)), and cohorts can be extracted using the `aces-cli` tool:\n",
 "\n",
 "```bash\n",
- "aces-cli data.path='/path/to/MIMIC/ESGPT/schema/' data.standard='esgpt' cohort_dir='../../../sample_configs' cohort_name='...'\n",
+ "aces-cli data.path='/path/to/MIMIC/ESGPT/schema/' data.standard='esgpt' cohort_dir='sample_configs/' cohort_name='...'\n",
 "```"
 ]
 },
@@ -138,11 +138,15 @@
 "\n",
 "The windows section contains the remaining three windows we defined previously - `input`, `gap`, and `target`.\n",
 "\n",
- "`input` begins at the start of a patient's record (ie., `NULL`), and ends 24 hours past `trigger` (ie., `admission`). As we'd like to include the events specified at both the start and end of `input`, if present, we can set both `start_inclusive` and `end_inclusive` as `True`. Our constraint on the number of records is specified in `has` using the `_ANY_EVENT` predicate, with its value set to be greater or equal to 5 (ie., unbounded parameter on the right as seen in `(5, None)`). **Note**: Since we'd like to make a prediction at the end of `input`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `trigger + 24h`.\n",
+ "`input` begins at the start of a patient's record (ie., `NULL`), and ends 24 hours past `trigger` (ie., `admission`). As we'd like to include the events specified at both the start and end of `input`, if present, we can set both `start_inclusive` and `end_inclusive` as `True`. Our constraint on the number of records is specified in `has` using the `_ANY_EVENT` predicate, with its value set to be greater or equal to 5 (ie., unbounded parameter on the right as seen in `(5, None)`). \n",
+ "\n",
+ "**Note**: Since we'd like to make a prediction at the end of `input`, we can set `index_timestamp` to be `end`, which corresponds to the timestamp of `trigger + 24h`.\n",
 "\n",
 "`gap` also begins at `trigger`, and ends 48 hours after. As we have included included the left boundary event in `trigger` (ie., `admission`), it would be reasonable to not include it again as it should not play a role in `gap`. As such, we set `start_inclusive` to `False`. As we'd like our admission to be at least 48 hours long, we can place constraints specifying that there cannot be any `admission`, `discharge`, or `death` in `gap` (ie., right-bounded parameter at `0` as seen in `(None, 0)`).\n",
 "\n",
- "`target` beings at the end of `gap`, and ends at the next discharge or death event (ie., `discharge_or_death` predicate). We can use this arrow notation which ACES recognizes as event references (ie., `->` and `<-`; see [Time Range Fields](https://eventstreamaces--39.org.readthedocs.build/en/39/configuration.html#time-range-fields)). In our case, we end `target` at the next `discharge_or_death`. Similarly, as we included the event at the end of `gap`, if any, already in `gap`, we can set `start_inclusive` to `False`. **Note**: Since we'd like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`."
+ "`target` beings at the end of `gap`, and ends at the next discharge or death event (ie., `discharge_or_death` predicate). We can use this arrow notation which ACES recognizes as event references (ie., `->` and `<-`; see [Time Range Fields](https://eventstreamaces.readthedocs.io/en/latest/technical.html#time-range-fields)). In our case, we end `target` at the next `discharge_or_death`. Similarly, as we included the event at the end of `gap`, if any, already in `gap`, we can set `start_inclusive` to `False`. \n",
+ "\n",
+ "**Note**: Since we'd like to make a binary mortality prediction, we can extract the `death` predicate as a label from `target`, by specifying the `label` field to be `death`."
 ]
 },
 {
@@ -269,7 +273,7 @@
 "source": [
 "## Other Examples\n",
 "\n",
- "A few other examples are provided in `../../../sample_configs/` of the repository. We will continue to add task configurations to this folder or to a benchmarking effort for EHR representation learning. More information can be found [here](https://github.com/mmcdermott/PIE_MD/tree/main) - stay tuned!"
+ "A few other examples are provided in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) of the repository. We will continue to add task configurations to this folder or to a benchmarking effort for EHR representation learning. More information can be found [here](https://github.com/mmcdermott/PIE_MD/tree/main) - stay tuned!"
 ]
 }
 ],
diff --git a/docs/source/notebooks/predicates.ipynb b/docs/source/notebooks/predicates.ipynb
index 08f062c..c6be5d1 100644
--- a/docs/source/notebooks/predicates.ipynb
+++ b/docs/source/notebooks/predicates.ipynb
@@ -66,7 +66,7 @@
 "source": [
 "## Sample Predicates DataFrame\n",
 "\n",
- "A sample predicates dataframe is provided in the repository (`../../../sample_data/sample_data.csv`). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository (`../../../sample_configs`) could be directly extracted."
+ "A sample predicates dataframe is provided in the repository ([`sample_data/sample_data.csv`](https://github.com/justin13601/ACES/blob/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data/sample_data.csv)). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository ([`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs)) could be directly extracted."
 ]
 },
 {
diff --git a/docs/source/notebooks/tutorial.ipynb b/docs/source/notebooks/tutorial.ipynb
index 381a451..622ccff 100644
--- a/docs/source/notebooks/tutorial.ipynb
+++ b/docs/source/notebooks/tutorial.ipynb
@@ -47,7 +47,7 @@
 "source": [
 "### Directories\n",
 "\n",
- "Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in `sample_configs` and `sample_data` folders in the project root, respectively."
+ "Next, let's specify our paths and directories. In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the ESGPT synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in [`sample_configs/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_configs) and [`sample_data/`](https://github.com/justin13601/ACES/tree/5cf0261ad22c22972b0bd553ab5bb826cb9e637d/sample_data) folders in the project root, respectively."
 ]
 },
 {
@@ -102,7 +102,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "We now load our configuration file by passing its path (`str`) into `config.TaskExtractorConfig.load()`. This parses the configuration file for each of the three key sections indicated above and prepares ACES for extraction based on our defined constraints (inclusion/exclusion criteria for each window)."
+ "We now load our configuration file by passing its path (`str`) into {py:func}`aces.config.TaskExtractorConfig.load()`. This parses the configuration file for each of the three key sections indicated above and prepares ACES for extraction based on our defined constraints (inclusion/exclusion criteria for each window)."
 ]
 },
 {
diff --git a/docs/source/usage.md b/docs/source/usage.md
index 9f7e042..9f514a1 100644
--- a/docs/source/usage.md
+++ b/docs/source/usage.md
@@ -67,13 +67,22 @@ windows:
 You can now run `aces-cli` in your terminal. Suppose we have a directory structure like the following:
-```yaml
-ACES/ ├── sample_data/ │ ├── esgpt_sample/ │ │ ├── ... │ │ ├── events_df.parquet
-│ │ └── dynamic_measurements_df.parquet │ ├── meds_sample/ │ │ ├── shards/
-│ │ │ ├── 0.parquet │ │ │ └── 1.parquet │ │ └── sample_shard.parquet
-│ └── sample_data.csv ├── sample_configs/ │ └── inhospital_mortality.yaml └──
-...
-...
+```
+ACES/
+├── sample_data/
+│ ├── esgpt_sample/
+│ │ ├── ...
+│ │ ├── events_df.parquet
+│ │ └── dynamic_measurements_df.parquet
+│ ├── meds_sample/
+│ │ ├── shards/
+│ │ │ ├── 0.parquet
+│ │ │ └── 1.parquet
+│ │ └── sample_shard.parquet
+│ └── sample_data.csv
+├── sample_configs/
+│ └── inhospital_mortality.yaml
+└── ...
 ```
 **To query from a single MEDS shard**:
@@ -174,7 +183,9 @@ The above two fields are used for automatically loading task configurations, sav
 `config_path`: Path to the task configuration file. Defaults to `${cohort_dir}/${cohort_name}.yaml`
-`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise.
+`output_filepath`: Path to store the outputs. Defaults to `${cohort_dir}/${cohort_name}/${data.shard}.parquet` for MEDS with multiple shards, and `${cohort_dir}/${cohort_name}.parquet` otherwise
+
+`log_dir`: Path to store logs. Defaults to `${cohort_dir}/${cohort_name}/.logs`
 #### Tab Completion

From 62e6878ccb4beea46841ae24513f2947c0751fc7 Mon Sep 17 00:00:00 2001
From: Justin Xu
Date: Thu, 13 Jun 2024 10:03:28 +0100
Subject: [PATCH 5/5] Undo reference in notebooks, fix links

---
 docs/source/algorithm.md | 4 ++--
 docs/source/notebooks/tutorial.ipynb | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/algorithm.md b/docs/source/algorithm.md
index 1fb90ab..f916102 100644
--- a/docs/source/algorithm.md
+++ b/docs/source/algorithm.md
@@ -114,8 +114,8 @@ In the rest of this document, we will detail how our algorithm automatically ext
 these criteria and the terminology we use to describe our algorithm (both here and in the raw source code and code comments). There are certain limitations of this algorithm where some kinds of tasks cannot yet be expressed directly (more information available in the
-[FAQs](https://eventstreamaces.readthedocs.io/en/latest/overview.html#faqs) and the
-[Future Roadmap](https://eventstreamaces.readthedocs.io/en/latest/overview.html#future-roadmap)). Details
+[FAQs](https://eventstreamaces.readthedocs.io/en/latest/readme.html#faqs) and the
+[Future Roadmap](https://eventstreamaces.readthedocs.io/en/latest/readme.html#future-roadmap)). Details
 about the true configuration language that is used in practice to specify "windows" can be found in {doc}`/configuration`. Some task examples are available in {doc}`/notebooks/examples`.
diff --git a/docs/source/notebooks/tutorial.ipynb b/docs/source/notebooks/tutorial.ipynb
index 622ccff..97f2eb5 100644
--- a/docs/source/notebooks/tutorial.ipynb
+++ b/docs/source/notebooks/tutorial.ipynb
@@ -102,7 +102,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "We now load our configuration file by passing its path (`str`) into {py:func}`aces.config.TaskExtractorConfig.load()`. This parses the configuration file for each of the three key sections indicated above and prepares ACES for extraction based on our defined constraints (inclusion/exclusion criteria for each window)."
+ "We now load our configuration file by passing its path (`str`) into `config.TaskExtractorConfig.load()`. This parses the configuration file for each of the three key sections indicated above and prepares ACES for extraction based on our defined constraints (inclusion/exclusion criteria for each window)."
 ]
 },
 {