Updated documentation (#99)
* expand_shards note

* MEDS label schema and predicates only file

* Code workflows

* Static variables

* Briefly mention trigger, explode meds, regex/any for predicates
justin13601 authored Aug 12, 2024
1 parent a5390f0 commit d0c28ce
Showing 9 changed files with 147 additions and 22 deletions.
95 changes: 95 additions & 0 deletions .github/workflows/python-building.yaml
@@ -0,0 +1,95 @@
name: Publish Python 🐍 Distribution 📦 to PyPI and TestPyPI

on: push

jobs:
  build:
    name: Build distribution 📦
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install pypa/build
        run: >-
          python3 -m
          pip install
          build
          --user
      - name: Build a binary wheel and a source tarball
        run: python3 -m build
      - name: Store the distribution packages
        uses: actions/upload-artifact@v4
        with:
          name: python-package-distributions
          path: dist/

  publish-to-pypi:
    name: >-
      Publish Python 🐍 distribution 📦 to PyPI
    if: startsWith(github.ref, 'refs/tags/') # only publish to PyPI on tag pushes
    needs:
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/es-aces # Replace <package-name> with your PyPI project name
    permissions:
      id-token: write # IMPORTANT: mandatory for trusted publishing

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v4
        with:
          name: python-package-distributions
          path: dist/

      - name: Publish distribution 📦 to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1

  github-release:
    name: >-
      Sign the Python 🐍 Distribution 📦 with Sigstore
      and upload them to GitHub Release
    needs:
      - publish-to-pypi
    runs-on: ubuntu-latest

    permissions:
      contents: write # IMPORTANT: mandatory for making GitHub Releases
      id-token: write # IMPORTANT: mandatory for sigstore

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@v4
        with:
          name: python-package-distributions
          path: dist/

      - name: Sign the dists with Sigstore
        uses: sigstore/gh-action-sigstore-python@v2.1.1
        with:
          inputs: >-
            ./dist/*.tar.gz
            ./dist/*.whl
      - name: Create GitHub Release
        env:
          GITHUB_TOKEN: ${{ github.token }}
        run: >-
          gh release create
          '${{ github.ref_name }}'
          --repo '${{ github.repository }}'
          --notes ""
      - name: Upload artifact signatures to GitHub Release
        env:
          GITHUB_TOKEN: ${{ github.token }}
        # Upload to GitHub Release using the `gh` CLI.
        # `dist/` contains the built packages, and the
        # sigstore-produced signatures and certificates.
        run: >-
          gh release upload
          '${{ github.ref_name }}' dist/**
          --repo '${{ github.repository }}'
8 changes: 4 additions & 4 deletions README.md
@@ -4,7 +4,7 @@

<p align="center">
<a href="https://www.python.org/downloads/release/python-3100/"><img alt="Python" src="https://img.shields.io/badge/-Python_3.10+-blue?logo=python&logoColor=white"></a>
<a href="https://pypi.org/project/es-aces/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.2.5-orange?logoColor=orange"></a>
<a href="https://pypi.org/project/es-aces/"><img alt="PyPI" src="https://img.shields.io/badge/PyPI-v0.3.0-orange?logoColor=orange"></a>
<a href="https://hydra.cc/"><img alt="Hydra" src="https://img.shields.io/badge/Config-Hydra_1.3-89b8cd"></a>
<a href="https://codecov.io/gh/justin13601/ACES"><img alt="Codecov" src="https://codecov.io/gh/justin13601/ACES/graph/badge.svg?token=6EA84VFXOV"></a>
<a href="https://github.com/justin13601/ACES/actions/workflows/tests.yml"><img alt="Tests" src="https://github.com/justin13601/ACES/actions/workflows/tests.yml/badge.svg"></a>
@@ -63,7 +63,7 @@ pip install es-aces
## Instructions for Use

1. **Prepare a Task Configuration File**: Define your predicates and task windows according to your research needs. Please see below or [here](https://eventstreamaces.readthedocs.io/en/latest/configuration.html) for details regarding the configuration language.
2. **Get Predicates DataFrame**: Process your dataset according to instructions for the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) or [ESGPT](https://github.com/mmcdermott/EventStreamGPT) standard so you can leverage ACES to automatically create the predicates dataframe. You can also create your own predicates dataframe directly (more information below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html)).
2. **Get Predicates DataFrame**: Process your dataset according to the instructions for the [MEDS](https://github.com/Medical-Event-Data-Standard/meds) (single-nested or un-nested) or [ESGPT](https://github.com/mmcdermott/EventStreamGPT) standard so you can leverage ACES to automatically create the predicates dataframe. You can also create your own predicates dataframe directly (more information below and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html)).
3. **Execute Query**: A query may be executed using either the command-line interface or by importing the package in Python:

### Command-Line Interface:
@@ -256,7 +256,7 @@ There are also a few special predicates that you can use. These *do not* need to

### Trigger Event

The trigger event is a simple field with a value of a predicate name. For each trigger event, a prediction by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (after extraction according to other task specifications).
The trigger event is a simple field with a value of a predicate name. For each trigger event, a prediction by a model can be made. For instance, in the following example, the trigger event is an admission. Therefore, in your task, a prediction by a model can be made for each valid admission (after extraction according to other task specifications). You can also simply filter to a cohort of a single event (i.e., just the trigger event) should you not have any further criteria in your task.

```yaml
predicates:
@@ -298,7 +298,7 @@ The `has` field specifies constraints relating to predicates within the window.

### Static Data

Support for static data depends on your data standard and how those variables are expressed. For instance, in MEDS, it is feasible to express static data as a predicate, and thus criteria can be set normally. However, this is not yet incorporated for ESGPT. If a predicates dataframe is directly used, you may create a predicate column that specifies your static variable.
Static data is now supported. In MEDS, static variables are simply stored in rows with `null` timestamps. In ESGPT, static variables are stored in a separate `subjects_df` table. In either case, static variables can be expressed as predicates, and the associated criteria can be applied normally using the `patient_demographics` heading of a configuration file. Please see [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/examples.html) and [here](https://eventstreamaces.readthedocs.io/en/latest/notebooks/predicates.html) for examples and details.
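For illustration, a minimal sketch of a static predicate declared under `patient_demographics` (the predicate name and code string are taken from the example notebooks and may not match your dataset):

```yaml
patient_demographics:
  # illustrative static predicate; the code must match how eye color is recorded in your data
  eye_color:
    code: EYE//blue
```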

### Complementary Tools

7 changes: 7 additions & 0 deletions docs/source/algorithm.md
@@ -335,6 +335,13 @@ will be created to serve as an index for the output cohort. This timestamp can b
start or end timestamp of any desired window; however, it should represent the timestamp at which point a
prediction can be made (i.e., at the end of the `input` windows).

##### Matching Input Schemas

For queries on MEDS-formatted datasets, ACES will automatically typecast columns and filter dataframes
appropriately to match the
[label schema](https://github.com/Medical-Event-Data-Standard/meds/blob/main/src/meds/schema.py#L68) defined
in MEDS v0.3.

##### Re-order & Return

Finally, given this dataframe, the algorithm will sort the columns by placing `subject_id`, `index_timestamp`,
5 changes: 4 additions & 1 deletion docs/source/configuration.md
@@ -44,8 +44,11 @@ ______________________________________________________________________

These configs consist of the following four fields:

- `code`: The string value for the categorical code object that is relevant for this predicate. An
- `code`: The string expression for the code object that is relevant for this predicate. An
observation will only satisfy this predicate if there is an occurrence of this code in the observation.
The field can additionally be a dictionary with either a `regex` key whose value is a regular
expression (satisfied if the regular expression matches the code), or an `any` key whose value is a
list of strings (satisfied if there is an occurrence of any code in the list); a brief sketch of both
forms follows this excerpt.
- `value_min`: If specified, an observation will only satisfy this predicate if the occurrence of the
underlying `code` has a reported numerical value that is either greater than or greater than or equal to
`value_min` (with these options being decided on the basis of `value_min_inclusive`, where
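A minimal sketch of the two dictionary forms of `code` described above (the predicate names are hypothetical; the code strings follow the conventions used elsewhere in these docs):

```yaml
predicates:
  any_lab:
    code:
      regex: "LAB//.*"  # satisfied if this regular expression matches the observation's code
  discharge_or_death_code:
    code:
      any:  # satisfied if any code in this list occurs
        - DISCHARGE
        - DEATH
```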
8 changes: 7 additions & 1 deletion docs/source/notebooks/examples.ipynb
@@ -69,7 +69,9 @@
"\n",
"Next, suppose we'd like to only include hospital admissions that were longer than 48 hours. To represent this clause, we can specify `gap` as above with a length of 48 hours (overlapping the initial 24 hours of `input`). If we then place constraints on `gap`, preventing it to have any discharge or death events, then the admission must then be at least 48 hours.\n",
"\n",
"Finally, we specify `target`, which is our prediction horizon and lasts until the immediately next discharge or death event. This allows us to extract a cohort that includes both patients who have died and those who did not (ie., successfully discharged)."
"Finally, we specify `target`, which is our prediction horizon and lasts until the immediately next discharge or death event. This allows us to extract a cohort that includes both patients who have died and those who did not (ie., successfully discharged).\n",
"\n",
"In addition to constructing a cohort based on dynamic variables, we can also place constraints on static variables (ie., eye color). Suppose we'd like to filter our cohort to only those with blue eyes."
]
},
{
@@ -89,6 +91,10 @@
" discharge_or_death:\n",
" expr: or(discharge, death)\n",
"\n",
"patient_demographics:\n",
" eye_color:\n",
" code: EYE//blue\n",
"\n",
"trigger: admission\n",
"\n",
"windows:\n",
31 changes: 18 additions & 13 deletions docs/source/notebooks/predicates.ipynb
@@ -26,34 +26,39 @@
"\n",
"| subject_id | timestamp | code | value |\n",
"|------------|---------------------|-------------------------|-------------------------|\n",
"| 1 | null | SEX//male | null |\n",
"| 1 | 1989-01-01 00:00:00 | ADMISSION | null |\n",
"| 1 | 1989-01-01 01:00:00 | LAB//HR | 90 |\n",
"| 1 | 1989-01-01 01:00:00 | PROCEDURE_START | null |\n",
"| 1 | 1989-01-01 02:00:00 | DISCHARGE | null |\n",
"| 1 | 1989-01-01 02:00:00 | PROCEDURE_END | null |\n",
"| 2 | null | SEX//female | null |\n",
"| 2 | 1991-05-06 12:00:00 | ADMISSION | null |\n",
"| 2 | 1991-05-06 20:00:00 | DEATH | null |\n",
"| 3 | null | SEX//male | null |\n",
"| 3 | 1980-10-17 22:00:00 | ADMISSION | null |\n",
"| 3 | 1980-10-17 22:00:00 | LAB//HR | 120 |\n",
"| 3 | 1980-10-18 01:00:00 | LAB//temp | 37 |\n",
"| 3 | 1980-10-18 09:00:00 | DISCHARGE | null |\n",
"| 3 | 1982-02-02 02:00:00 | ADMISSION | null |\n",
"| 3 | 1982-02-02 04:00:00 | DEATH | null |\n",
"\n",
"The `code` column contains a string of an event that occurred at the given `timestamp` for a given `subject_id`. You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:\n",
"The `code` column contains a string of an event that occurred at the given `timestamp` for a given `subject_id`. **Note**: Static variables are shown as rows with `null` timestamps. \n",
"\n",
"| subject_id | timestamp | admission | discharge | death | discharge_or_death | lab | procedure_start| HR_over_100 |\n",
"|------------|---------------------|-----------|-----------|-------|--------------------|-----|----------------|----------------|\n",
"| 1 | 1989-01-01 00:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 1 | 1989-01-01 01:00:00 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |\n",
"| 1 | 1989-01-01 02:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |\n",
"| 2 | 1991-05-06 12:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 2 | 1991-05-06 20:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |\n",
"| 3 | 1980-10-17 22:00:00 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |\n",
"| 3 | 1980-10-18 01:00:00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |\n",
"| 3 | 1980-10-18 09:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |\n",
"| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |\n",
"You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:\n",
"\n",
"| subject_id | timestamp | admission | discharge | death | discharge_or_death | lab | procedure_start| HR_over_100 | male |\n",
"|------------|---------------------|-----------|-----------|-------|--------------------|-----|----------------|----------------|----------------|\n",
"| 1 | 1989-01-01 00:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n",
"| 1 | 1989-01-01 01:00:00 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |\n",
"| 1 | 1989-01-01 02:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |\n",
"| 2 | 1991-05-06 12:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
"| 2 | 1991-05-06 20:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |\n",
"| 3 | 1980-10-17 22:00:00 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |\n",
"| 3 | 1980-10-18 01:00:00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |\n",
"| 3 | 1980-10-18 09:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |\n",
"| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n",
"| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |\n",
"\n",
"**Note**: This set of predicates are all `plain` predicates (ie., explicitly expressed as a value in the dataset), with the exception of the `derived` predicate `discharge_or_death`, which can be expressed by applying boolean logic on the `discharge` and `death` predicates (ie., `or(discharge, death)`). You may choose to create these columns for `derived` predicates explicitly (as you would `plain` predicates). Or, ACES can automatically create them from `plain` predicates if the boolean logic is provided in the task configuration file. Please see [Predicates](https://eventstreamaces.readthedocs.io/en/latest/configuration.html#predicates-plainpredicateconfig-and-derivedpredicateconfig) for more information.\n",
"\n",
11 changes: 10 additions & 1 deletion docs/source/usage.md
@@ -209,7 +209,7 @@ aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data.path="baz.p

#### Multiple Shards

A MEDS dataset can have multiple shards, each stored as a `.parquet` file containing subsets of the full dataset. We can make use of Hydra's launchers and multi-run (`-m`) capabilities to start an extraction job for each shard (`data=sharded`), either in series or in parallel (e.g., using `joblib`, or `submitit` for Slurm). To load data with multiple shards, a data root needs to be provided, along with an expression containing a comma-delimited list of files for each shard. We provide a function `expand_shards` to do this, which accepts a sequence representing `<shards_location>/<number_of_shards>`.
A MEDS dataset can have multiple shards, each stored as a `.parquet` file containing subsets of the full dataset. We can make use of Hydra's launchers and multi-run (`-m`) capabilities to start an extraction job for each shard (`data=sharded`), either in series or in parallel (e.g., using `joblib`, or `submitit` for Slurm). To load data with multiple shards, a data root needs to be provided, along with an expression containing a comma-delimited list of files for each shard. We provide a function `expand_shards` to do this, which accepts a sequence representing `<shards_location>/<number_of_shards>`. It also accepts a directory path, in which case all `.parquet` files in that directory and its subdirectories will be included.

```bash
aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data=sharded data.root="baz/" "data.shard=$(expand_shards qux/#)" -m
@@ -257,4 +257,13 @@ For example, to query an in-hospital mortality task on the sample data (both the
>>> query.query(cfg=cfg, predicates_df=predicates_df)
```

For more complex tasks involving a large number of predicates, a separate predicates-only "database" file can
be created and passed into `TaskExtractorConfig.load()`. Only predicates referenced in the task criteria will
have a predicate column computed and evaluated, so one could maintain a dataset-specific predicates file
containing many predicate definitions and reference them as needed, keeping the dataset-agnostic task
criteria file clean.

```python
>>> cfg = config.TaskExtractorConfig.load(config_path="criteria.yaml", predicates_path="predicates.yaml")
```
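As a rough sketch, assuming the predicates-only file mirrors the `predicates` section of a task configuration (the file name `predicates.yaml` above and the predicate definitions below are illustrative):

```yaml
# predicates.yaml — dataset-specific predicate "database" (illustrative contents)
predicates:
  admission:
    code: ADMISSION
  discharge:
    code: DISCHARGE
  death:
    code: DEATH
  discharge_or_death:
    expr: or(discharge, death)
```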

______________________________________________________________________
2 changes: 1 addition & 1 deletion src/aces/__init__.py
@@ -4,7 +4,7 @@
"""
from importlib.metadata import PackageNotFoundError, version

__package_name__ = "MEDS_polars_functions"
__package_name__ = "es-aces"
try:
__version__ = version(__package_name__)
except PackageNotFoundError:
