diff --git a/README.md b/README.md index 4e0ba70..eb335ea 100644 --- a/README.md +++ b/README.md @@ -178,7 +178,7 @@ windows: ... ``` -Sample task configuration files for 6 common tasks are provided in `sample_configs/`. All task configurations can be directly extracted using `'direct'` mode on `sample_data/sample_data.csv` as this predicates dataframe was designed specifically to capture concepts needed for all tasks. However, only `inhospital_mortality.yaml` and `imminent-mortality.yaml` would be able to be extracted on `sample_data/esgpt_sample` and `sample_data/meds_sample` due to a lack of required concepts in the datasets. +Sample task configuration files for 6 common tasks are provided in `sample_configs/`. All task configurations can be directly extracted using `'direct'` mode on `sample_data/sample_data.csv`, as this predicates dataframe was designed specifically to capture the concepts needed for all tasks. However, only `inhospital_mortality.yaml` and `imminent_mortality.yaml` can be extracted on `sample_data/esgpt_sample` and `sample_data/meds_sample`, as those datasets lack some of the required concepts (predicates are defined per the MEDS sample data by default; modifications will be needed for ESGPT). ### Predicates diff --git a/docs/source/index.md b/docs/source/index.md index ab38c37..69a381f 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -15,7 +15,7 @@ README Usage Guide Task Examples Predicates DataFrame -Sample Data Tutorial +MEDS Data Tutorial Technical Details Computational Profile Module API Reference diff --git a/docs/source/notebooks/tutorial.ipynb b/docs/source/notebooks/tutorial_esgpt.ipynb similarity index 99% rename from docs/source/notebooks/tutorial.ipynb rename to docs/source/notebooks/tutorial_esgpt.ipynb index 67feef4..b46ca31 100644 --- a/docs/source/notebooks/tutorial.ipynb +++ b/docs/source/notebooks/tutorial_esgpt.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Code Example with Synthetic Data" + "# Code Example with Synthetic ESGPT Data" ] }, { diff --git a/docs/source/notebooks/tutorial_meds.ipynb b/docs/source/notebooks/tutorial_meds.ipynb new file mode 100644 index 0000000..a96d096 --- /dev/null +++ b/docs/source/notebooks/tutorial_meds.ipynb @@ -0,0 +1,252 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Code Tutorial with Synthetic MEDS Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Set-up" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Imports\n", + "\n", + "First, let's import ACES! Three modules, `config`, `predicates`, and `query`, are required to execute an end-to-end cohort extraction. `omegaconf` is also required to express the data configuration parameters needed to load our `MEDS` dataset. Other imports are only needed for visualization!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pathlib import Path\n", + "\n", + "import pandas as pd\n", + "import yaml\n", + "from bigtree import print_tree\n", + "from IPython.display import display\n", + "from omegaconf import DictConfig\n", + "\n", + "from aces import config, predicates, query" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Directories\n", + "\n", + "Next, let's specify our paths and directories. 
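If you are running this notebook from a location other than its place in the repository, the relative paths below may need adjusting; a quick sanity check (a minimal sketch, assuming the repository's default layout) could look like:\n\n```python\nfrom pathlib import Path\n\n# Illustrative check that the sample files shipped with the repository are reachable\nassert Path(\"../../../sample_configs/inhospital_mortality.yaml\").exists()\nassert Path(\"../../../sample_data/meds_sample/\").is_dir()\n```\n\n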
In this tutorial, we will extract a cohort for a typical in-hospital mortality prediction task from the MEDS synthetic sample dataset. The task configuration file and sample data are both shipped with the repository in the [sample_configs/](https://github.com/justin13601/ACES/tree/main/sample_configs) and [sample_data/](https://github.com/justin13601/ACES/tree/main/sample_data) folders in the project root, respectively." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "config_path = \"../../../sample_configs/inhospital_mortality.yaml\"\n", + "data_path = \"../../../sample_data/meds_sample/\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configuration File" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The task configuration file is the core configuration language that ACES uses to extract cohorts. Details about this configuration language are available in [Configuration Language](https://eventstreamaces.readthedocs.io/en/latest/configuration.html). In brief, the configuration file contains `predicates`, `patient_demographics`, `trigger`, and `windows` sections. \n", + "\n", + "The `predicates` section is used to define dataset-specific concepts that are needed for the task. In our case of binary mortality prediction, we are interested in extracting a cohort of patients who have been admitted into the hospital and who were subsequently discharged or died. As such, `admission`, `discharge`, `death`, and `discharge_or_death` would be handy predicates.\n", + "\n", + "The `patient_demographics` section is used to define static concepts that remain constant for subjects over time. For instance, sex is a common static variable. Should we want to filter our cohort to patients of a specific sex, we can do so here in the same way as defining predicates. For more information on predicates, please refer to this [guide](https://eventstreamaces.readthedocs.io/en/latest/technical.html#predicates-plainpredicateconfig-and-derivedpredicateconfig). In this example, let's say we are only interested in male patients.\n", + "\n", + "We'd also like to make a prediction of mortality for each admission. Hence, a reasonable `trigger` event would be the `admission` predicate.\n", + "\n", + "Suppose that, in our task, we'd like to set a constraint that the admission must have been more than 48 hours long. Additionally, for our prediction inputs, we'd like to use all information in the patient record up until 24 hours after admission, and this window must contain at least 5 event records (as we'd want to ensure there is sufficient input data). These clauses are captured in the `windows` section, where each window is defined relative to another." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with open(config_path, \"r\") as stream:\n", + " data_loaded = yaml.safe_load(stream)\n", + " print(json.dumps(data_loaded, indent=4))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the `input` window begins at `null` (the start of the patient record) and ends 24 hours after `trigger` (`admission`). A `gap` window is defined for 24 hours after the end of the `input` window, constraining the admission to be longer than 48 hours at minimum. Finally, a `target` window is specified from the end of the `gap` window to either the next `discharge` or `death` event (i.e., `discharge_or_death`). 
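\n\nTo make the timing concrete, here is a small illustration (not ACES code; the admission time is made up) of how these window boundaries relate for a single hypothetical admission:\n\n```python\nfrom datetime import datetime, timedelta\n\nadmission_time = datetime(2024, 1, 1, 8, 0)  # trigger: a hypothetical admission event\ninput_end = admission_time + timedelta(hours=24)  # end of the `input` window\ngap_end = input_end + timedelta(hours=24)  # end of the `gap` window\n\n# The `target` window runs from gap_end to the next discharge_or_death event,\n# so any admission shorter than 48 hours cannot yield a valid realization.\nprint(input_end, gap_end)\n```\n\n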
This `target` window specification allows us to extract a binary label for each patient in our cohort to be used in the prediction task (i.e., the field `label` in the `target` window, which will extract `0`: discharged, `1`: died). Additionally, an `index_timestamp` field is set as the `end` of the `input` window to denote when a prediction is made (i.e., at the end of the `input` window when all input data is fed into the model), and can be used to index extraction results." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now load our configuration file by passing its path (`str`) into `config.TaskExtractorConfig.load()`. This parses the configuration file for each of the key sections indicated above and prepares ACES for extraction based on our defined constraints (inclusion/exclusion criteria for each window)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cfg = config.TaskExtractorConfig.load(config_path=config_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Task Tree\n", + "\n", + "With the configuration file loaded and parsed, we can access a visualization of a tree structure that represents our task of interest. As seen, the tree nodes are the `start` and `end` time points of the windows defined in the configuration file, and the tree edges express the relationships between these windows. ACES will traverse this tree and recursively compute aggregated predicate counts for each subtree. This allows us to filter our dataset to valid realizations of this task tree, which then make up our task cohort." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tree = cfg.window_tree\n", + "print_tree(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This tutorial uses a synthetic dataset of 100 patients stored in the MEDS standard. For more information about this data, please refer to the [ESGPT Documentation](https://eventstreamml.readthedocs.io/en/latest/_collections/local_tutorial_notebook.html), which describes how the synthetic data was generated (it was separately converted to MEDS). Here is what the data looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pd.read_parquet(f\"{data_path}/train/0.parquet\").head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Predicate Columns\n", + "\n", + "The next step in our cohort extraction is the generation of predicate columns. Our defined dataset-agnostic windows (i.e., complex task logic) are linked to dataset-specific predicates (i.e., dataset observations and concepts), which facilitates the sharing of tasks across datasets. As such, the predicates dataframe is the foundational unit on which ACES operates.\n", + "\n", + "A predicate column is simply a column containing numerical counts (often just `0`'s and `1`'s), representing the number of times a given predicate (concept) occurs at a given timestamp for a given patient.\n", + "\n", + "In the case of MEDS (and ESGPT), ACES supports the automatic generation of these predicate columns from the configuration file. However, some fields need to be provided via a `DictConfig` object. 
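If `omegaconf` is unfamiliar: a `DictConfig` is simply a dictionary-like container that also supports attribute access. A minimal illustration (the values here are placeholders):\n\n```python\nfrom omegaconf import DictConfig\n\n# Keys can be read either as attributes or as dictionary items\ndc = DictConfig({\"path\": \"some/dir\", \"standard\": \"meds\"})\nassert dc.path == dc[\"path\"]\n```\n\n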
The fields required here include the path to the directory of the MEDS dataset (`str`) and the data standard (which is `meds` in this case).\n", + "\n", + "Given this data configuration, we then call `predicates.get_predicates_df()` to generate the relevant predicate columns for our task. Due to the nature of the specified predicates, the resulting dataframe simply contains the unique (`subject_id`, `timestamp`) pairs and binary columns for each predicate. An additional predicate, `_ANY_EVENT`, is also generated; this will be used to enforce our constraint on the number of events in the `input` window. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_config = DictConfig({\"path\": data_path, \"standard\": \"meds\"})\n", + "\n", + "predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config)\n", + "display(predicates_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## End-to-End Query" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, with our task configuration object and the computed predicates dataframe, we can call `query.query()` to execute the extraction of our cohort.\n", + "\n", + "Each row of the resulting dataframe is a valid realization of our task tree. Hence, each instance can be included in the cohort used for the prediction of in-hospital mortality as defined in our task configuration file. The output contains:\n", + "\n", + "- `subject_id`: subject IDs of our cohort (since we'd like to treat individual admissions as separate samples, there will be duplicate subject IDs)\n", + "- `index_timestamp`: timestamp of when a prediction is made, which coincides with the `end` timestamp of the `input` window (as specified in our task configuration)\n", + "- `label`: binary label of mortality, which is derived from the `death` predicate of the `target` window (as specified in our task configuration)\n", + "- `trigger`: timestamp of the `trigger` event, which is the `admission` predicate (as specified in our task configuration)\n", + "\n", + "Additionally, it includes a column for each node of our task tree, in pre-order traversal order. Each column contains a `pl.Struct` object holding the name of the node, the start and end times of the window it represents, and the counts of all defined predicates in that window." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_result = query.query(cfg=cfg, predicates_df=predicates_df)\n", + "display(df_result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "... and that's a wrap! We have used ACES to perform an end-to-end extraction on a MEDS dataset for a cohort that can be used to predict in-hospital mortality. Similar pipelines can be built for other tasks, as well as for data in the ESGPT standard. You may also pre-compute predicate columns and use the `direct` flag when loading in `.csv` or `.parquet` data files. 
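As a rough sketch of that workflow (the file name and column layout below are assumptions for illustration; check the linked documentation for the exact `direct` configuration):\n\n```python\nimport pandas as pd\nfrom omegaconf import DictConfig\n\n# Hypothetical pre-computed predicates file: one row per (subject_id, timestamp)\n# pair and one count column per predicate named in the task configuration.\npd.DataFrame(\n    {\n        \"subject_id\": [1, 1],\n        \"timestamp\": pd.to_datetime([\"2024-01-01 08:00\", \"2024-01-03 09:00\"]),\n        \"admission\": [1, 0],\n        \"discharge\": [0, 1],\n        \"death\": [0, 0],\n    }\n).to_parquet(\"my_predicates.parquet\")\n\n# Sketch only: point ACES at the pre-computed file in 'direct' mode\ndirect_config = DictConfig({\"path\": \"my_predicates.parquet\", \"standard\": \"direct\"})\n```\n\n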
More information about pre-computed predicates dataframes is available in [Predicates DataFrame](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html).\n", + "\n", + "As always, please don't hesitate to reach out should you have any questions about ACES!\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "esgpt", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/source/usage.md b/docs/source/usage.md index 462282b..cac6545 100644 --- a/docs/source/usage.md +++ b/docs/source/usage.md @@ -75,10 +75,13 @@ ACES/ │ │ ├── events_df.parquet │ │ └── dynamic_measurements_df.parquet │ ├── meds_sample/ -│ │ ├── shards/ +│ │ ├── held_out/ +│ │ │ └── 0.parquet +│ │ ├── train/ │ │ │ ├── 0.parquet │ │ │ └── 1.parquet -│ │ └── sample_shard.parquet +│ │ └── tuning/ +│ │ └── 0.parquet │ └── sample_data.csv ├── sample_configs/ │ └── inhospital_mortality.yaml @@ -88,7 +91,7 @@ **To query from a single MEDS shard**: ```bash -aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs/" data.standard=meds data.path="sample_data/meds_sample/sample_shard.parquet" +aces-cli cohort_name="inhospital_mortality" cohort_dir="sample_configs/" data.standard=meds data.path="sample_data/meds_sample/train/0.parquet" ``` **To query from multiple MEDS shards**: diff --git a/sample_configs/imminent_mortality.yaml b/sample_configs/imminent_mortality.yaml index 3901ba0..8c9a934 100644 --- a/sample_configs/imminent_mortality.yaml +++ b/sample_configs/imminent_mortality.yaml @@ -1,7 +1,7 @@ # Task: 24-hour Imminent Mortality Prediction predicates: death: - code: event_type//DEATH + code: DEATH trigger: _ANY_EVENT diff --git a/sample_configs/inhospital_mortality.yaml b/sample_configs/inhospital_mortality.yaml index a8f7eae..68758f7 100644 --- a/sample_configs/inhospital_mortality.yaml +++ b/sample_configs/inhospital_mortality.yaml @@ -1,11 +1,11 @@ # Task: 24-hour In-hospital Mortality Prediction predicates: admission: - code: event_type//ADMISSION + code: { regex: "ADMISSION//.*" } discharge: - code: event_type//DISCHARGE + code: { regex: "DISCHARGE//.*" } death: - code: event_type//DEATH + code: DEATH discharge_or_death: expr: or(discharge, death) diff --git a/sample_configs/intervention_weaning.yaml b/sample_configs/intervention_weaning.yaml index 683b684..d3f63ab 100644 --- a/sample_configs/intervention_weaning.yaml +++ b/sample_configs/intervention_weaning.yaml @@ -1,9 +1,9 @@ # Task: Ventilation Weaning Prediction predicates: procedure_start: - code: event_type//PROCEDURE_START + code: PROCEDURE_START procedure_end: - code: event_type//PROCEDURE_END + code: PROCEDURE_END ventilation: code: procedure//Invasive Ventilation ventilation_start: diff --git a/sample_configs/long_term_recurrence.yaml b/sample_configs/long_term_recurrence.yaml index 79da762..71f6f99 100644 --- a/sample_configs/long_term_recurrence.yaml +++ b/sample_configs/long_term_recurrence.yaml @@ -1,9 +1,9 @@ # Task: Long-term (3 Years) MI Recurrence Prediction predicates: admission: - code: event_type//ADMISSION + code: { regex: "ADMISSION//.*" } discharge: - code: event_type//DISCHARGE + code: { regex: "DISCHARGE//.*" } diagnosis_ICD9CM_41071: code: diagnosis//ICD9CM_41071 diagnosis_ICD10CM_I214: diff --git 
a/sample_configs/readmission_risk.yaml b/sample_configs/readmission_risk.yaml index 0d9f81a..c77e68b 100644 --- a/sample_configs/readmission_risk.yaml +++ b/sample_configs/readmission_risk.yaml @@ -1,9 +1,9 @@ # Task: 30-day Readmission Risk Prediction predicates: admission: - code: event_type//ADMISSION + code: { regex: "ADMISSION//.*" } discharge: - code: event_type//DISCHARGE + code: { regex: "DISCHARGE//.*" } trigger: admission diff --git a/sample_data/meds_sample/held_out/0.parquet b/sample_data/meds_sample/held_out/0.parquet index 5c71d98..cc71ffb 100644 Binary files a/sample_data/meds_sample/held_out/0.parquet and b/sample_data/meds_sample/held_out/0.parquet differ diff --git a/sample_data/meds_sample/sample_shard.parquet b/sample_data/meds_sample/sample_shard.parquet deleted file mode 100644 index 5c71d98..0000000 Binary files a/sample_data/meds_sample/sample_shard.parquet and /dev/null differ diff --git a/sample_data/meds_sample/train/0.parquet b/sample_data/meds_sample/train/0.parquet index 2f90ac3..19f222a 100644 Binary files a/sample_data/meds_sample/train/0.parquet and b/sample_data/meds_sample/train/0.parquet differ diff --git a/sample_data/meds_sample/train/1.parquet b/sample_data/meds_sample/train/1.parquet index 98e4ee7..a5c0437 100644 Binary files a/sample_data/meds_sample/train/1.parquet and b/sample_data/meds_sample/train/1.parquet differ diff --git a/sample_data/meds_sample/tuning/0.parquet b/sample_data/meds_sample/tuning/0.parquet new file mode 100644 index 0000000..063eef4 Binary files /dev/null and b/sample_data/meds_sample/tuning/0.parquet differ