Updated documentation (#99)
* expand_shards note

* MEDS label schema and predicates only file

* Code workflows

* Static variables

* Briefly mention trigger, explode meds, regex/any for predicates
justin13601 authored Aug 12, 2024
1 parent a5390f0 commit d0c28ce
Showing 9 changed files with 147 additions and 22 deletions.
95 changes: 95 additions & 0 deletions .github/workflows/python-building.yaml
@@ -0,0 +1,95 @@
name: Publish Python 🐍 Distribution 📦 to PyPI and TestPyPI

on: push

jobs:
  build:
    name: Build distribution 📦
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install pypa/build
        run: >-
          python3 -m
          pip install
          build
          --user
      - name: Build a binary wheel and a source tarball
        run: python3 -m build
      - name: Store the distribution packages
        uses: actions/upload-artifact@v4
        with:
          name: python-package-distributions
          path: dist/

  publish-to-pypi:
    name: >-
      Publish Python 🐍 distribution 📦 to PyPI
    if: startsWith(github.ref, 'refs/tags/') # only publish to PyPI on tag pushes
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/es-aces # Replace <package-name> with your PyPI project name
    permissions:
      id-token: write # IMPORTANT: mandatory for trusted publishing

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v4
        with:
          name: python-package-distributions
          path: dist/

      - name: Publish distribution 📦 to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  github-release:
    name: >-
      Sign the Python 🐍 Distribution 📦 with Sigstore
      and upload them to GitHub Release
    needs:
      - publish-to-pypi
    runs-on: ubuntu-latest

    permissions:
      contents: write # IMPORTANT: mandatory for making GitHub Releases
      id-token: write # IMPORTANT: mandatory for sigstore

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v4
        with:
          name: python-package-distributions
          path: dist/

      - name: Sign the dists with Sigstore
        uses: sigstore/gh-action-sigstore-python@v2.1.1
        with:
          inputs: >-
            ./dist/*.tar.gz
            ./dist/*.whl
      - name: Create GitHub Release
        env:
          GITHUB_TOKEN: ${{ github.token }}
        run: >-
          gh release create
          '${{ github.ref_name }}'
          --repo '${{ github.repository }}'
          --notes ""
      - name: Upload artifact signatures to GitHub Release
        env:
          GITHUB_TOKEN: ${{ github.token }}
        # Upload to GitHub Release using the `gh` CLI.
        # `dist/` contains the built packages, and the
        # sigstore-produced signatures and certificates.
        run: >-
          gh release upload
          '${{ github.ref_name }}' dist/**
          --repo '${{ github.repository }}'
8 changes: 4 additions & 4 deletions README.md
@@ -4,7 +4,7 @@

<p align="center">
<a href="https://www.python.org/downloads/release/python-3100/"><img alt="Python" src="https://img.shields.io/badge/-Python_3.10+-blue?logo=python&logoColor=white"></a>
<a href="https://pypi.org/project/es-aces/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.2.5-orange?logoColor=orange"></a>
<a href="https://pypi.org/project/es-aces/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.3.0-orange?logoColor=orange"></a>
<a href="https://hydra.cc/"><img alt="Hydra" src="https://img.shields.io/badge/Config-Hydra_1.3-89b8cd"></a>
<a href="https://codecov.io/gh/justin13601/ACES"><img alt="Codecov" src="https://codecov.io/gh/justin13601/ACES/graph/badge.svg?token=6EA84VFXOV"></a>
<a href="https://github.com/justin13601/ACES/actions/workflows/tests.yml"><img alt="Tests" src="https://github.com/justin13601/ACES/actions/workflows/tests.yml/badge.svg"></a>
@@ -63,7 +63,7 @@ pip install es-aces
## Instructions for Use

1. **Prepare a Task Configuration File**: Define your predicates and task windows according to your research needs. Please see below or [here](https://eventstreamaces.readthedocs.io/en/latest/configuration.html) for details regarding the configuration language.
2. **Get Predicates DataFrame**: Process your dataset according to instructions for the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) or [ESGPT](https://github.com/mmcdermott/EventStreamGPT) standard so you can leverage ACES to automatically create the predicates dataframe. You can also create your own predicates dataframe directly (more information below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html)).
2. **Get Predicates DataFrame**: Process your dataset according to the instructions for the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) (single-nested or un-nested) or [ESGPT](https://github.com/mmcdermott/EventStreamGPT) standard so you can leverage ACES to automatically create the predicates dataframe. You can also create your own predicates dataframe directly (more information below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html)).
3. **Execute Query**: A query may be executed using either the command-line interface or by importing the package in Python:

### Command-Line Interface:
@@ -256,7 +256,7 @@ There are also a few special predicates that you can use. These *do not* need to

### Trigger Event

The trigger event is a simple field with a value of a predicate name. For each trigger event, a prediction by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (after extraction according to other task specifications).
The trigger event is a simple field with a value of a predicate name. For each trigger event, a prediction by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (after extraction according to other task specifications). You can also simply filter to a cohort of a single event (i.e., just the trigger event) should you not have any further criteria in your task.

```yaml
predicates:
@@ -298,7 +298,7 @@ The `has` field specifies constraints relating to predicates within the window.

### Static Data

Support for static data depends on your data standard and how those variables are expressed. For instance, in MEDS, it is feasible to express static data as a predicate, and thus criteria can be set normally. However, this is not yet incorporated for ESGPT. If a predicates dataframe is directly used, you may create a predicate column that specifies your static variable.
Static data is now supported. In MEDS, static variables are simply stored in rows with `null` timestamps. In ESGPT, static variables are stored in a separate `subjects_df` table. In either case, static variables can be expressed as predicates, and the associated criteria can be applied normally using the `patient_demographics` heading of a configuration file. Please see [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html) for examples and details.
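For illustration, a minimal sketch of a static predicate declared under `patient_demographics` (the predicate name and code string are taken from the example notebooks and may not match your dataset):

```yaml
patient_demographics:
  # illustrative static predicate; the code must match how eye color is recorded in your data
  eye_color:
    code: EYE//blue
```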

### Complementary Tools

7 changes: 7 additions & 0 deletions docs/source/algorithm.md
@@ -335,6 +335,13 @@ will be created to serve as an index for the output cohort. This timestamp can b
start or end timestamp of any desired window; however, it should represent the timestamp at which point a
prediction can be made (i.e., at the end of the `input` windows).

##### Matching Input Schemas

For queries on MEDS-formatted datasets, ACES will automatically typecast columns and filter dataframes
appropriately to match the
[label schema](https://github.com/Medical-Event-Data-Standard/meds/blob/main/src/meds/schema.py#L68) defined
in MEDS v0.3.

##### Re-order & Return

Finally, given this dataframe, the algorithm will sort the columns by placing `subject_id`, `index_timestamp`,
5 changes: 4 additions & 1 deletion docs/source/configuration.md
@@ -44,8 +44,11 @@ ______________________________________________________________________

These configs consist of the following four fields:

- `code`: The string value for the categorical code object that is relevant for this predicate. An
- `code`: The string expression for the code object that is relevant for this predicate. An
observation will only satisfy this predicate if there is an occurrence of this code in the observation.
The field can additionally be a dictionary with either a `regex` key whose value is a regular
expression (satisfied if the regular expression matches the code), or an `any` key whose value is a
list of strings (satisfied if there is an occurrence of any code in the list); a brief sketch of both
forms follows this excerpt.
- `value_min`: If specified, an observation will only satisfy this predicate if the occurrence of the
underlying `code` has a reported numerical value that is either greater than or greater than or equal to
`value_min` (with these options being decided on the basis of `value_min_inclusive`, where
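A minimal sketch of the two dictionary forms of `code` described above (the predicate names are hypothetical; the code strings follow the conventions used elsewhere in these docs):

```yaml
predicates:
  any_lab:
    code:
      regex: "LAB//.*"  # satisfied if this regular expression matches the observation's code
  discharge_or_death_code:
    code:
      any:  # satisfied if any code in this list occurs
        - DISCHARGE
        - DEATH
```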
8 changes: 7 additions & 1 deletion docs/source/notebooks/examples.ipynb
@@ -69,7 +69,9 @@
"\n",
"Next, suppose we'd like to only include hospital admissions that were longer than 48 hours. To represent this clause, we can specify `gap` as above with a length of 48 hours (overlapping the initial 24 hours of `input`). If we then place constraints on `gap`, preventing it to have any discharge or death events, then the admission must then be at least 48 hours.\n",
"\n",
"Finally, we specify `target`, which is our prediction horizon and lasts until the immediately next discharge or death event. This allows us to extract a cohort that includes both patients who have died and those who did not (ie., successfully discharged)."
"Finally, we specify `target`, which is our prediction horizon and lasts until the immediately next discharge or death event. This allows us to extract a cohort that includes both patients who have died and those who did not (ie., successfully discharged).\n",
"\n",
"In addition to constructing a cohort based on dynamic variables, we can also place constraints on static variables (ie., eye color). Suppose we'd like to filter our cohort to only those with blue eyes."
]
},
{
@@ -89,6 +91,10 @@
" discharge_or_death:\n",
" expr: or(discharge, death)\n",
"\n",
"patient_demographics:\n",
" eye_color:\n",
" code: EYE//blue\n",
"\n",
"trigger: admission\n",
"\n",
"windows:\n",
31 changes: 18 additions & 13 deletions docs/source/notebooks/predicates.ipynb
@@ -26,34 +26,39 @@
"\n",
"| subject_id | timestamp | code | value |\n",
"|------------|---------------------|-------------------------|-------------------------|\n",
"| 1 | null | SEX//male | null |\n",
"| 1 | 1989-01-01 00:00:00 | ADMISSION | null |\n",
"| 1 | 1989-01-01 01:00:00 | LAB//HR | 90 |\n",
"| 1 | 1989-01-01 01:00:00 | PROCEDURE_START | null |\n",
"| 1 | 1989-01-01 02:00:00 | DISCHARGE | null |\n",
"| 1 | 1989-01-01 02:00:00 | PROCEDURE_END | null |\n",
"| 2 | null | SEX//female | null |\n",
"| 2 | 1991-05-06 12:00:00 | ADMISSION | null |\n",
"| 2 | 1991-05-06 20:00:00 | DEATH | null |\n",
"| 3 | null | SEX//male | null |\n",
"| 3 | 1980-10-17 22:00:00 | ADMISSION | null |\n",
"| 3 | 1980-10-17 22:00:00 | LAB//HR | 120 |\n",
"| 3 | 1980-10-18 01:00:00 | LAB//temp | 37 |\n",
"| 3 | 1980-10-18 09:00:00 | DISCHARGE | null |\n",
"| 3 | 1982-02-02 02:00:00 | ADMISSION | null |\n",
"| 3 | 1982-02-02 04:00:00 | DEATH | null |\n",
"\n",
"The `code` column contains a string of an event that occurred at the given `timestamp` for a given `subject_id`. You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:\n",
"The `code` column contains a string of an event that occurred at the given `timestamp` for a given `subject_id`. **Note**: Static variables are shown as rows with `null` timestamps. \n",
"\n",
"| subject_id | timestamp | admission | discharge | death | discharge_or_death | lab | procedure_start| HR_over_100 |\n",
"|------------|---------------------|-----------|-----------|-------|--------------------|-----|----------------|----------------|\n",
"| 1 | 1989-01-01 00:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 1 | 1989-01-01 01:00:00 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |\n",
"| 1 | 1989-01-01 02:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |\n",
"| 2 | 1991-05-06 12:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 2 | 1991-05-06 20:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |\n",
"| 3 | 1980-10-17 22:00:00 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |\n",
"| 3 | 1980-10-18 01:00:00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |\n",
"| 3 | 1980-10-18 09:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |\n",
"| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |\n",
"You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:\n",
"\n",
"| subject_id | timestamp | admission | discharge | death | discharge_or_death | lab | procedure_start| HR_over_100 | male |\n",
"|------------|---------------------|-----------|-----------|-------|--------------------|-----|----------------|----------------|----------------|\n",
"| 1 | 1989-01-01 00:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n",
"| 1 | 1989-01-01 01:00:00 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |\n",
"| 1 | 1989-01-01 02:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |\n",
"| 2 | 1991-05-06 12:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 2 | 1991-05-06 20:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |\n",
"| 3 | 1980-10-17 22:00:00 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |\n",
"| 3 | 1980-10-18 01:00:00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |\n",
"| 3 | 1980-10-18 09:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |\n",
"| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n",
"| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |\n",
"\n",
"**Note**: This set of predicates are all `plain` predicates (ie., explicitly expressed as a value in the dataset), with the exception of the `derived` predicate `discharge_or_death`, which can be expressed by applying boolean logic on the `discharge` and `death` predicates (ie., `or(discharge, death)`). You may choose to create these columns for `derived` predicates explicitly (as you would `plain` predicates). Or, ACES can automatically create them from `plain` predicates if the boolean logic is provided in the task configuration file. Please see [Predicates](https://eventstreamaces.readthedocs.io/en/latest/configuration.html#predicates-plainpredicateconfig-and-derivedpredicateconfig) for more information.\n",
"\n",
11 changes: 10 additions & 1 deletion docs/source/usage.md
@@ -209,7 +209,7 @@ aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data.path="baz.p

#### Multiple Shards

A MEDS dataset can have multiple shards, each stored as a `.parquet` file containing subsets of the full dataset. We can make use of Hydra's launchers and multi-run (`-m`) capabilities to start an extraction job for each shard (`data=sharded`), either in series or in parallel (e.g., using `joblib`, or `submitit` for Slurm). To load data with multiple shards, a data root needs to be provided, along with an expression containing a comma-delimited list of files for each shard. We provide a function `expand_shards` to do this, which accepts a sequence representing `<shards_location>/<number_of_shards>`.
A MEDS dataset can have multiple shards, each stored as a `.parquet` file containing subsets of the full dataset. We can make use of Hydra's launchers and multi-run (`-m`) capabilities to start an extraction job for each shard (`data=sharded`), either in series or in parallel (e.g., using `joblib`, or `submitit` for Slurm). To load data with multiple shards, a data root needs to be provided, along with an expression containing a comma-delimited list of files for each shard. We provide a function `expand_shards` to do this, which accepts a sequence representing `<shards_location>/<number_of_shards>`. It also accepts a directory path, in which case all `.parquet` files in that directory and its subdirectories will be included.

```bash
aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data=sharded data.root="baz/" "data.shard=$(expand_shards qux/#)" -m
@@ -257,4 +257,13 @@ For example, to query an in-hospital mortality task on the sample data (both the
>>> query.query(cfg=cfg, predicates_df=predicates_df)
```

For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can
be created and passed into `TaskExtractorConfig.load()`. Only predicates referenced in the task criteria will
have a predicate column computed and evaluated, so one could maintain a dataset-specific predicates file
containing many predicate definitions and reference them as needed, keeping the dataset-agnostic task
criteria file clean.

```python
>>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
```
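As a rough sketch, assuming the predicates-only file mirrors the `predicates` section of a task configuration (the file name `predicates.yaml` above and the predicate definitions below are illustrative):

```yaml
# predicates.yaml — dataset-specific predicate "database" (illustrative contents)
predicates:
  admission:
    code: ADMISSION
  discharge:
    code: DISCHARGE
  death:
    code: DEATH
  discharge_or_death:
    expr: or(discharge, death)
```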

______________________________________________________________________
2 changes: 1 addition & 1 deletion src/aces/__init__.py
@@ -4,7 +4,7 @@
"""
from importlib.metadata import PackageNotFoundError, version

__package_name__ = "MEDS_polars_functions"
__package_name__ = "es-aces"
try:
__version__ = version(__package_name__)
except PackageNotFoundError:
