Adding more comprehensive tests across the API. #28

Merged
merged 50 commits into from
May 22, 2024
9937b82
Updated summarize API to simplify summarize_window so it can be ignor…
mmcdermott May 9, 2024
ac2bc27
Split summarize_temporal_window into two functions; added namedtuple …
mmcdermott May 9, 2024
9a66d9e
Fix the odd test error. This fix is very strange.
mmcdermott May 9, 2024
5a76ddb
Expanded aggregate's doctest to cover other edge cases.
mmcdermott May 9, 2024
cb4a6e7
Added a (currently unused) named tuple for event bound parameters; ad…
mmcdermott May 10, 2024
e985ef9
updated docstrings and fixed doctest given proper understand of summa…
mmcdermott May 10, 2024
0abbf35
added another test and marginal refinement to the summarize_event_bou…
mmcdermott May 10, 2024
ae92ed1
Corrected small typo, used ToEventWindowBound in summarizer.
mmcdermott May 10, 2024
36cb8eb
Added a terminology description file.
mmcdermott May 10, 2024
276c481
Updated aggregate_temporal_window tests.
mmcdermott May 11, 2024
2bc74bc
Some edits
justin13601 May 11, 2024
668965b
Merge branch 'tests' of https://github.com/justin13601/ESGPTTaskQuery…
justin13601 May 11, 2024
e230dd6
Edits to the terminology
justin13601 May 11, 2024
eb61d90
Small edits to the recursion description
justin13601 May 11, 2024
ff6a82c
Added a set of doctest cases for an aggregate event bound window func…
mmcdermott May 11, 2024
bc3d33c
Got aggregate_event_bound working for all cases where offset is 0
mmcdermott May 11, 2024
6f117ba
With offset doctest partially passing; need to subtract temporal grou…
mmcdermott May 11, 2024
bfecd5c
Got it working in the case with a positive offset too.
mmcdermott May 11, 2024
f2e3dea
Separating out functions a bit and adding doctests for check_constrai…
mmcdermott May 11, 2024
462846c
Added forggoten files.
mmcdermott May 12, 2024
b9c0032
Include timestamp_at_start in aggregate functions to future proof and…
mmcdermott May 12, 2024
f59c810
Some in-progress updates.
mmcdermott May 12, 2024
6500db0
Corrected some more of the doctests.
mmcdermott May 12, 2024
5373800
Fixed the doctests up to the case with offset
mmcdermott May 12, 2024
6cc12d4
Fixed at least one test case's worth of the with offset period.
mmcdermott May 13, 2024
2090f63
Found a bug in the aggregate function; test is failing where (I belie…
mmcdermott May 13, 2024
312dc89
Yep, it was a double subtraction issue. Fixed the error.
mmcdermott May 13, 2024
538c69c
Another, I think, proper failing test case.
mmcdermott May 13, 2024
c521166
Corrected that issue too.
mmcdermott May 13, 2024
b436e1f
Another test failure; likely the counterpart to the prior one.
mmcdermott May 13, 2024
68f7d51
Corrected one more bug. It was a slightly different issue than I anti…
mmcdermott May 13, 2024
4837e49
Corrected the issue.
mmcdermott May 13, 2024
724b82a
Added another passing test case.
mmcdermott May 13, 2024
37a4737
All test cases passing!
mmcdermott May 13, 2024
fa635fc
Removed malformed constraint checking test case and corrected typo in…
mmcdermott May 13, 2024
76361c8
Trying to eliminate the other event bound functions to use the new ge…
mmcdermott May 13, 2024
3d57b1f
Ok, this change reflects a change in API -- basically, stating that i…
mmcdermott May 13, 2024
2b837a6
Added a tiny additional comment pointing to additional testing in the…
mmcdermott May 13, 2024
47ef545
extract subtree (partial test only) appears to be working.
mmcdermott May 13, 2024
118d0a7
Added more tests for extract_subtree. It is a little awkward as somet…
mmcdermott May 13, 2024
fbe0cd8
Removed old API files and functions. Tests are passing but query_subt…
mmcdermott May 13, 2024
734a708
Removing duplicated tests.
mmcdermott May 13, 2024
0fea925
fixed query script
mmcdermott May 13, 2024
33c6ebb
Config language updates and doc (#29)
mmcdermott May 20, 2024
7b0d59f
Working pipeline
justin13601 May 22, 2024
dfcbfbd
Remove outdated unit test file
justin13601 May 22, 2024
afd2653
Fix whitespace?
justin13601 May 22, 2024
141a4e2
didn't work, undo - ask matthew
justin13601 May 22, 2024
e35d5a4
Run script working on ESGPT / CSV / presumably MEDS parquet
justin13601 May 22, 2024
5ddfc1e
Update sample configs
justin13601 May 22, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -163,3 +163,4 @@ cython_debug/
.vscode/
passwords.txt
outputs/
result.csv
178 changes: 178 additions & 0 deletions config_str_language.md
@@ -0,0 +1,178 @@
# Configuration Language Specification

## Introduction and Terminology

This document specifies the configuration language for the automatic extraction of task dataframes and cohorts
from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds)
format (recommended) or the [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format. This extraction
system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and
labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all
realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more
details on the recursive algorithm, see the `terminology.md` file. **TODO** better integrate, name, and link
to these documentation files.

As indicated above, these cohorts are specified through a combination of concepts (realized as event
_predicate_ functions, _aka_ "predicates") which are _dataset specific_ and inclusion/exclusion/labeling
criteria which, conditioned on a set of predicate definitions, are _dataset agnostic_.

Predicates are currently limited to "count" predicates, which count the number of times a boolean condition
is satisfied over a given time window. That window can be either a single timepoint, in which case the
predicate tracks how many observations satisfied the boolean condition in that event (_aka_ at that
timepoint), or a 1-dimensional window. In the future, predicates may expand to include other notions of
functional characterization, such as tracking the average/min/max value a concept takes on over a
time-period.

Constraints are specified in terms of time-points that can be bounded by events that satisfy predicates or
by temporal relationships on said events. The windows between these time-points can then be constrained to
contain events that satisfy certain aggregation functions over predicates for these time frames.

## Machine Form (what is used by the algorithm)

In the machine form, the configuration file consists of two parts:

- `predicates`, stored as a dictionary from string predicate names (which must be unique) to either
`DirectPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
`DerivedPredicateConfig` objects, which store predicates that build on other predicates.
- `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
objects.

Next, we will detail each of these configuration objects.

### Predicates: `DirectPredicateConfig` and `DerivedPredicateConfig`

#### `DirectPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data

These configs consist of the following five fields:

- `code`: The string value for the categorical code object that is relevant for this predicate. An
observation will only satisfy this predicate if there is an occurrence of this code in the observation.
- `value_min`: If specified, an observation will only satisfy this predicate if the underlying `code`
occurs with a reported numerical value that is either greater than or greater than or equal to
`value_min`. Which comparison is used is decided by `value_min_inclusive`:
`value_min_inclusive=True` indicates that an observation satisfies this predicate if its value is greater
than or equal to `value_min`, and `value_min_inclusive=False` indicates that a strict greater-than
comparison will be used.
- `value_max`: If specified, an observation will only satisfy this predicate if the underlying `code`
occurs with a reported numerical value that is either less than or less than or equal to
`value_max`. Which comparison is used is decided by `value_max_inclusive`:
`value_max_inclusive=True` indicates that an observation satisfies this predicate if its value is less
than or equal to `value_max`, and `value_max_inclusive=False` indicates that a strict less-than
comparison will be used.
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
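
The bound semantics above can be sketched in a few lines of Python. This is an illustration only: the class name `DirectPredicateSketch`, the `matches` method, the default of `True` for both inclusive flags, and the choice to reject observations that carry no numerical value when a bound is set are all assumptions, not part of the specification or of the project's `DirectPredicateConfig`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DirectPredicateSketch:
    """Hypothetical stand-in for DirectPredicateConfig (illustration only)."""

    code: str
    value_min: Optional[float] = None
    value_max: Optional[float] = None
    value_min_inclusive: bool = True  # assumed default
    value_max_inclusive: bool = True  # assumed default

    def matches(self, code: str, value: Optional[float] = None) -> bool:
        """Return True if one observation (code, value) satisfies the predicate."""
        if code != self.code:
            return False
        bounded = self.value_min is not None or self.value_max is not None
        if bounded and value is None:
            # Assumption: a bounded predicate cannot match a value-less observation.
            return False
        if self.value_min is not None:
            # Inclusive bound rejects value < min; exclusive bound also rejects value == min.
            below = value < self.value_min if self.value_min_inclusive else value <= self.value_min
            if below:
                return False
        if self.value_max is not None:
            # Inclusive bound rejects value > max; exclusive bound also rejects value == max.
            above = value > self.value_max if self.value_max_inclusive else value >= self.value_max
            if above:
                return False
        return True
```

For example, `DirectPredicateSketch("HR", value_min=60, value_max=100, value_max_inclusive=False)` would accept a value of 60 but reject a value of 100.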

A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
on its source format.

1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format
(recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
and `value_max` constraints will be compared against MEDS' `numerical_value` field. **Note**: This syntax
does not currently support defining predicates that also rely on matching other, optional fields in the
MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull
request or upvoting any existing issue/PR that requests/implements this feature, and we will add support
for this capability.
2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
`code` will be interpreted in the following manner:
a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the
`"//"` separator, with the first element specifying the name of the ESGPT measurement under
consideration, which should either be of the multi-label classification or multivariate regression
type, and the second element being the name of the categorical key corresponding to the code in
question within the underlying measurement specified. If either `value_min` or `value_max` is
present, then this measurement must be of a multivariate regression type, and the corresponding
`values_column` for extracting numerical observations from ESGPT's `dynamic_measurements_df` will be
sourced from the ESGPT dataset configuration object.
b. If the code does not contain a `"//"`, it will be interpreted as a direct measurement name that must
be of the univariate regression type and its value, if needed, will be pulled from the corresponding
column.
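
As a rough illustration of the ESGPT code interpretation described above, the following Python sketch splits a `code` string into a measurement name and an optional categorical key. The function name `split_esgpt_code` is hypothetical, and splitting on the first `"//"` only is an assumption; the specification does not say how codes containing more than one `"//"` are handled.

```python
from typing import Optional, Tuple


def split_esgpt_code(code: str) -> Tuple[str, Optional[str]]:
    """Split an ESGPT-style code into (measurement, key).

    Returns (measurement, None) when the code has no "//" separator, which the
    text above treats as a direct (univariate regression) measurement name.
    """
    if "//" in code:
        # Assumption: only the first "//" separates measurement from key.
        measurement, key = code.split("//", 1)
        return measurement, key
    return code, None
```

Under these assumptions, a code such as `"lab//HR"` would name the key `HR` within the measurement `lab`, while `"age"` would be read as a measurement name on its own.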

#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates

These configuration objects consist of only a single string field--`expr`--which contains a limited grammar of
accepted operations that can be applied to other predicates, containing precisely the following:

- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
- `or(pred_1_name, pred_2_name, ...)`: Asserts that at least one of the specified predicates must be true.

Note that, currently, `and`s and `or`s cannot be nested. Upon user request, we may support further advanced
analytic operations over predicates.
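
A minimal sketch of how such an `expr` string might be parsed and evaluated is shown below. The function names `parse_expr` and `evaluate_expr` are hypothetical, and evaluating over per-predicate booleans for a single event is an assumption; the specification does not state how derived predicates combine the underlying counts.

```python
import re
from typing import Dict, List, Tuple

# Matches "and(...)" or "or(...)" spanning the whole string.
_EXPR_PATTERN = re.compile(r"^\s*(and|or)\(\s*(.*?)\s*\)\s*$")


def parse_expr(expr: str) -> Tuple[str, List[str]]:
    """Parse 'and(a, b, ...)' or 'or(a, b, ...)' into (operator, predicate names)."""
    match = _EXPR_PATTERN.match(expr)
    if match is None:
        raise ValueError(f"Unsupported expression: {expr!r}")
    operator, arguments = match.group(1), match.group(2)
    if "(" in arguments or ")" in arguments:
        # Nested and/or expressions are not allowed by the grammar above.
        raise ValueError(f"Nested expressions are not supported: {expr!r}")
    names = [name.strip() for name in arguments.split(",")]
    if not all(names):
        raise ValueError(f"Empty predicate name in: {expr!r}")
    return operator, names


def evaluate_expr(expr: str, truth: Dict[str, bool]) -> bool:
    """Evaluate the expression given whether each named predicate holds."""
    operator, names = parse_expr(expr)
    values = [truth[name] for name in names]
    return all(values) if operator == "and" else any(values)
```

Rejecting nested calls at parse time keeps the sketch aligned with the restriction that `and`s and `or`s cannot currently be nested.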

### Windows and Events:

#### Windows: `WindowConfig`

Windows contain a tracking `name` field, and otherwise are specified with two parts: (1) a set of four
parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
and (2) a dictionary of constraints (the `has` field) that must be satisfied over the defined predicates for
a possible realization of this window to be valid.

##### The Time Range Fields

###### `start` and `end`

Valid windows always progress in time from the `start` field to the `end` field. These two fields define, in
symbolic form, the relationship between the start and end time of the window. These two fields must obey the
following rules:

_Linkage to other windows_: Firstly, exactly one of these two fields must reference an external event, as
specified either through the name of the trigger event or the start or end event of another window. The other
field must either be `null`/`None`/omitted (which has a very specific meaning, to be explained shortly) or
must reference the field that references the external event.

_Linkage reference language_: Secondly, for both fields, regardless of whether they reference an external
event or an internal event, that reference must be expressed in one of the following ways:

1. `$REFERENCING = $REFERENCED + $TIME_DELTA`, `$REFERENCING = $REFERENCED - $TIME_DELTA`, etc.
In this case, the referencing event (either the start or end of the window) will be defined as occurring
exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
end or start of the window).
Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
`$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
the window fields.
2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
In this case, the referencing event will be defined as the next or previous event satisfying the
predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
"previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
the window fields. Note that these forms can lead to windows being defined as single point events, if the
`$REFERENCED` event itself satisfies `$PREDICATE`, the appropriate constraints are satisfied, and the
inclusive values are set accordingly.
3. `$REFERENCING = $REFERENCED`
In this case, the referencing event will be defined as the same event as the referenced event.

_`null`/`None`/omitted_: If `start` is `null`/`None`/omitted, then the window will start at the beginning of
the patient's record. If `end` is `null`/`None`/omitted, then the window will end at the end of the patient's
record. In either of these cases, the other field must reference an external event, per rule 1.
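
To make the three reference forms concrete, here is a hedged Python sketch that classifies the right-hand side of a `start` or `end` expression. The names `Reference` and `parse_reference` are hypothetical. The sketch assumes that event and predicate names contain no `+`, `-`, `<`, or `>` characters, and it leaves the time delta as an unparsed string because the specification does not define a concrete time-delta syntax.

```python
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    """Hypothetical parsed form of a `start`/`end` reference expression."""

    referenced: str          # name of the event or window endpoint referenced
    kind: str                # "offset", "next", "previous", or "same"
    argument: Optional[str]  # signed time delta, predicate name, or None


def parse_reference(expr: str) -> Reference:
    """Classify '$REFERENCED + $TIME_DELTA', '$REFERENCED -> $PREDICATE', etc."""
    text = expr.strip()
    # Check the arrows first, because "->" and "<-" both contain "-".
    for token, kind in (("->", "next"), ("<-", "previous")):
        if token in text:
            referenced, predicate = (part.strip() for part in text.split(token, 1))
            return Reference(referenced, kind, predicate)
    for sign in ("+", "-"):
        if sign in text:
            referenced, delta = (part.strip() for part in text.split(sign, 1))
            # Keep the sign with the delta; the delta itself is left unparsed.
            return Reference(referenced, "offset", sign + delta)
    return Reference(text, "same", None)
```

A `None` or omitted field would be handled before calling such a parser, since it denotes the start or end of the patient's record rather than a reference.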

###### `start_inclusive` and `end_inclusive`

These two fields specify whether the start and end of the window are inclusive or exclusive, respectively.
This applies both to whether the endpoints are included in the calculation of the predicate values over the
window and, in the `$REFERENCING = $REFERENCED -> $PREDICATE` and `$REFERENCING = $REFERENCED <- $PREDICATE`
cases, to which events may be used as valid next or prior `$PREDICATE` events. E.g., suppose that
`start_inclusive=False`, the `end` field is equal to `start -> $PREDICATE`, and the `start` event itself
satisfies `$PREDICATE`. Because `start_inclusive=False`, we think of the window as truly starting an iota
after the timestamp of the `start` event. The `start` event's timestamp, when considered as the
`$PREDICATE` event that would mark the window end, therefore falls "before" the effective window start, so
the `start` event cannot serve as the end of a window that starts at that same event.
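
The effect of `start_inclusive` on the "next predicate" search in the example above can be sketched as follows. The function name `next_predicate_index` is hypothetical, and the sketch assumes the events of one patient are already sorted by time with one entry per distinct timestamp.

```python
from typing import Optional, Sequence


def next_predicate_index(
    satisfies: Sequence[bool], start_idx: int, start_inclusive: bool
) -> Optional[int]:
    """Find the index of the event that ends a `start -> $PREDICATE` window.

    `satisfies[i]` says whether the i-th event (in time order) satisfies the
    predicate. With an exclusive start, the start event itself is skipped.
    """
    first_candidate = start_idx if start_inclusive else start_idx + 1
    for idx in range(first_candidate, len(satisfies)):
        if satisfies[idx]:
            return idx
    return None  # no valid window end exists for this start event
```

With `satisfies = [True, False, True]` and the window starting at event 0, an exclusive start ends the window at event 2, whereas an inclusive start allows the window to end at event 0 itself.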

##### The Constraints Field

The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
that define the valid range for the count of observations of the named predicate that must be found in a
window for it to be considered valid. Either the `min_valid` or the `max_valid` constraint can be `None`, in
which case that endpoint is left unconstrained. Likewise, unreferenced predicates are also left
unconstrained. Note that as predicate counts are always integral, this specification does not need an
additional inclusive/exclusive endpoint field, as one can simply increment the bound by one in the
appropriate direction to achieve the same result. Instead, the bounds are always interpreted as inclusive, so
a window would satisfy the constraint `name: (1, 2)` for predicate `name` if the count of observations of
predicate `name` in the window was either 1 or 2. All constraints in the dictionary must be satisfied on a
window for it to be included.
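
The constraint check described above amounts to an inclusive range test per predicate. The Python sketch below illustrates it; the function name `window_satisfies` is hypothetical, and treating a predicate with no recorded count as having a count of zero is an assumption.

```python
from typing import Mapping, Optional, Tuple

Bound = Tuple[Optional[int], Optional[int]]  # (min_valid, max_valid), both inclusive


def window_satisfies(counts: Mapping[str, int], has: Mapping[str, Bound]) -> bool:
    """Return True if every constraint in `has` holds for the window's counts."""
    for name, (min_valid, max_valid) in has.items():
        count = counts.get(name, 0)  # assumption: a missing count means zero
        if min_valid is not None and count < min_valid:
            return False
        if max_valid is not None and count > max_valid:
            return False
    # Predicates that are not referenced in `has` are unconstrained.
    return True
```

For instance, the constraint `{"name": (1, 2)}` accepts windows in which `name` is observed once or twice and rejects windows with zero or three observations.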

#### Events: `EventConfig`

The event config consists of only a single field, `predicate`, which specifies the predicate that must be
observed at least once to satisfy the event. There can only be one defined "event" with an
`EventConfig` in a valid configuration, and it will define the "trigger" event of the cohort.
Comment on lines +1 to +178

The config_str_language.md document is well-written and provides a clear specification of the configuration language. Consider revising the document to address loose punctuation marks for enhanced readability.

- predicates`, stored as a dictionary from string predicate names
+ predicates, stored as a dictionary from string predicate names
- windows`, stored as a dictionary from string window names
+ windows, stored as a dictionary from string window names

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
Configuration Language Specification
## Introduction and Terminology
This document specifies the configuration language for the automatic extraction of task dataframes and cohorts
from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds)
format (recommended) or the [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format. This extraction
system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and
labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all
realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more
details on the recursive algorithm, see the `terminology.md` file. **TODO** better integrate, name, and link
to these documentation files.
As indicated above, these cohorts are specified through a combination of concepts (realized as event
_predicate_ functions, _aka_ "predicates") which are _dataset specific_ and inclusion/exclusion/labeling
criteria which, conditioned on a set of predicate definitions, are _dataset agnostic_.
Predicates are currently limited to "count" predicates, which are predicates that count the number of times a
boolean condition is satisfied over a given time window, which can either be a single timepoint, thus tracking
whether how many observations there were that satisfied the boolean condition in that event (_aka_ at that
timepoint) or over 1-dimensional windows. In the future, predicates may expand to include other notions of
functional characterization, such as tracking the average/min/max value a concept takes on over a time-period,
etc.
Constraints are specified in terms of time-points that can be bounded by events that satisfy predicates or
temporal relationships on said events. The windows between these time-points can then either be constrained to
contain events that satisfy certain aggregation functions over predicates for these time frames.
## Machine Form (what is used by the algorithm)
In the machine form, the configuration file consists of two parts:
- `predicates`, stored as a dictionary from string predicate names (which must be unique) to either
`DirectPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
`DerivedPredicateConfig` objects, which store predicates that build on other predicates.
- `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
objects.
Next, we will detail each of these configuration objects.
### Predicates: `DirectPredicateConfig` and `DerivedPredicateConfig`
#### `DirectPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data
These configs consist of the following four fields:
- `code`: The string value for the categorical code object that is relevant for this predicate. An
observation will only satisfy this predicate if there is an occurrence of this code in the observation.
- `value_min`: If specified, an observation will only satisfy this predicate if the occurrence of the
underlying `code` with a reported numerical value that is either greater than or greater than or equal to
`value_min` (with these options being decided on the basis of `value_min_inclusive`, where
`value_min_incusive=True` indicating that an observation satisfies this predicate if its value is greater
than or equal to `value_min`, and `value_min_inclusive=False` indicating a greater than but not equal to
will be used.
- `value_max`: If specified, an observation will only satisfy this predicate if the occurrence of the
underlying `code` with a reported numerical value that is either less than or less than or equal to
`value_max` (with these options being decided on the basis of `value_max_inclusive`, where
`value_max_incusive=True` indicating that an observation satisfies this predicate if its value is less
than or equal to `value_max`, and `value_max_inclusive=False` indicating a less than but not equal to
will be used.
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
on its source format.
1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format
(recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
and `value_max` constraints will be compared against MEDS' `numerical_value` field. **Note**: This syntax
does not currently support defining predicates that also rely on matching other, optional fields in the
MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull
request or upvoting any existing issue/PR that requests/implements this feature, and we will add support
for this capability.
2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
`code` will be interpreted in the following manner:
a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the
`"//"` character, with the first element specifying the name of the ESGPT measurement under
consideration, which should either be of the multi-label classification or multivariate regression
type, and the second element being the name of the categorical key corresponding to the code in
question within the underlying measurement specified. If either of `value_min` and `value_max` are
present, then this measurement must be of a multivariate regression type, and the corresponding
`values_column` for extracting numerical observations from ESGPT's `dynamic_measurements_df` will be
sourced from the ESGPT dataset configuration object.
b. If the code does not contain a `"//"`, it will be interpreted as a direct measurement name that must
be of the univariate regression type and its value, if needed, will be pulled from the corresponding
column.
#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates
These confiuration objects consist of only a single string field--`expr`--which contains a limited grammar of
accepted operations that can be applied to other predicates, containing precisely the following:
- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
- `or(pred_1_name, pred_2_name, ...)`: Asserts that any of the specified predicates must be true.
Note that, currently, `and`s and `or`s cannot be nested. Upon user request, we may support further advanced
analytic operations over predicates.
### Windows and Events:
#### Windows: `WindowConfig`
Windows contain a tracking `name` field, and otherwise are specified with two parts: (1) A set of four
parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
and (2) a set of constraints specified through two fields, dictionary of constraints (the `has` field) that
specify the constraints that must be satisfied over the defined predicates for a possible realization of this
window to be valid.
##### The Time Range Fields
###### `start` and `end`
Valid windows always progress in time from the `start` field to the `end` field. These two fields define, in
symbolic form, the relationship between the start and end time of the window. These two fields must obey the
following rules:
_Linkage to other windows_: Firstly, exactly one of these two fields must reference an external event, as
specified either through the name of the trigger event or the start or end event of another window. The other
field must either be `null`/`None`/omitted (which has a very specific meaning, to be explained shortly) or
must reference the field that references the external event.
_Linkage reference language_: Secondly, for both events, regardless of whether they reference an external
event or an internal event, that reference must be expressed in one of the following ways.
1. `$REFERENCING = $REFERENCED + $TIME_DELTA`, `$REFERENCING = $REFERENCED - $TIME_DELTA`, etc.
In this case, the referencing event (either the start or end of the window) will be defined as occurring
exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
end or start of the window).
Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
`$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
the window fields.
2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
In this case, the referencing event will be defined as the next or previous event satisfying the
predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
"previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
the window fields. Note that these forms can lead to windows being defined as single pointe vents, if the
`$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
inclusive values are set.
3. `$REFERENCING = $REFERENCED`
In this case, the referencing event will be defined as the same event as the referenced event.
_`null`/`None`/omitted_: If `start` is `null`/`None`/omitted, then the window will start at the beginning of
the patient's record. If `end` is `null`/`None`/omitted, then the window will end at the end of the patient's
record. In either of these cases, the other field must reference an external event, per rule 1.
###### `start_inclusive` and `end_inclusive`
These two fields specify whether the start and end of the window are inclusive or exclusive, respectively.
This applies both to whether they are included in the calculation of the predicate values over the windows,
but also, in the `$REFERENCING = $REFERENCED -> $PREDICATE` and `$REFERENCING = $PREDICATE -> $REFERENCED`
cases, to which events are possible to use for valid next or prior `$PREDCIATE` events. E.g., if we have that
`start_inclusive=False` and the `end` field is equal to `start -> $PREDICATE`, and it so happens that the
`start` event itself satisfies `$PREDICATE`, the fact that `start_inclusive=False` will mean that we do not
consider the `start` event itself to be a valid start to any window that ends at the same `start` event, as
its timestamp when considered as the prospective "window start timestamp" occurs "after" the effective
timestamp of itself when considered as the `$PREDICATE` event that marks the window end given that
`start_inclusive=False` and thus we will think of the window as truly starting an iota after the timestamp of
the `start` event itself.
##### The Constraints Field
The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
that define the valid range the count of observations of the named predicate that must be found in a window
for it to be considered valid. Either `min_valid` or `max_valid` constraints can be `None`, in which case
those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. Note
that as predicate counts are always integral, this specification does not need an additional
inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
to be included.
#### Events: `EventConfig`
The event config consists of only a single field, `predicate`, which specifies the predicate that must be
observed with value greater than one to satisfy the event. There can only be one defined "event" with an
"EventConfig" in a valid configuration, and it will define the "trigger" event of the cohort.
Configuration Language Specification
## Introduction and Terminology
This document specifies the configuration language for the automatic extraction of task dataframes and cohorts
from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds)
format (recommended) or the [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format. This extraction
system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and
labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all
realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more
details on the recursive algorithm, see the `terminology.md` file. **TODO** better integrate, name, and link
to these documentation files.
As indicated above, these cohorts are specified through a combination of concepts (realized as event
_predicate_ functions, _aka_ "predicates") which are _dataset specific_ and inclusion/exclusion/labeling
criteria which, conditioned on a set of predicate definitions, are _dataset agnostic_.
Predicates are currently limited to "count" predicates, which count the number of times a boolean condition is
satisfied over a given time window. That window can either be a single timepoint, in which case the predicate
tracks how many observations satisfied the boolean condition in that event (_aka_ at that timepoint), or a
1-dimensional window. In the future, predicates may expand to include other notions of functional
characterization, such as tracking the average/min/max value a concept takes on over a time-period.
Constraints are specified in terms of time-points that can be bounded by events that satisfy predicates or
temporal relationships on said events. The windows between these time-points can then be constrained to
contain events that satisfy certain aggregation functions over predicates for these time frames.
## Machine Form (what is used by the algorithm)
In the machine form, the configuration file consists of two parts:
- predicates, stored as a dictionary from string predicate names (which must be unique) to either
`DirectPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
`DerivedPredicateConfig` objects, which store predicates that build on other predicates.
- windows, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
objects.
Next, we will detail each of these configuration objects.
### Predicates: `DirectPredicateConfig` and `DerivedPredicateConfig`
#### `DirectPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data
These configs consist of the following five fields:
- `code`: The string value for the categorical code object that is relevant for this predicate. An
observation will only satisfy this predicate if there is an occurrence of this code in the observation.
- `value_min`: If specified, an observation will only satisfy this predicate if the occurrence of the
underlying `code` has a reported numerical value that is either greater than or greater than or equal to
`value_min`. Which comparison is used is decided by `value_min_inclusive`: `value_min_inclusive=True`
indicates that an observation satisfies this predicate if its value is greater than or equal to `value_min`,
and `value_min_inclusive=False` indicates that a strict greater than comparison will be used.
- `value_max`: If specified, an observation will only satisfy this predicate if the occurrence of the
underlying `code` has a reported numerical value that is either less than or less than or equal to
`value_max`. Which comparison is used is decided by `value_max_inclusive`: `value_max_inclusive=True`
indicates that an observation satisfies this predicate if its value is less than or equal to `value_max`,
and `value_max_inclusive=False` indicates that a strict less than comparison will be used.
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
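As a rough illustration of how these five fields interact, the sketch below checks a single observation against a predicate. The class name `DirectPredicateSketch`, the `matches` method, the default values for the inclusive flags, and the choice to reject observations with a missing value when a bound is set are all assumptions made for this example; they are not taken from the actual `DirectPredicateConfig` implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectPredicateSketch:
    """Hypothetical stand-in for DirectPredicateConfig (illustration only)."""

    code: str
    value_min: Optional[float] = None
    value_max: Optional[float] = None
    value_min_inclusive: bool = True  # assumed default
    value_max_inclusive: bool = True  # assumed default

    def matches(self, obs_code: str, obs_value: Optional[float] = None) -> bool:
        # The code must match before any value bound is considered.
        if obs_code != self.code:
            return False
        if self.value_min is None and self.value_max is None:
            return True
        # Assumption: an observation without a value cannot satisfy a bound.
        if obs_value is None:
            return False
        if self.value_min is not None:
            if self.value_min_inclusive:
                above = obs_value >= self.value_min
            else:
                above = obs_value > self.value_min
            if not above:
                return False
        if self.value_max is not None:
            if self.value_max_inclusive:
                below = obs_value <= self.value_max
            else:
                below = obs_value < self.value_max
            if not below:
                return False
        return True
```

For example, with `value_min=100` and `value_min_inclusive=False`, an observation with value exactly 100 would not satisfy the predicate under this sketch, while one with value 100.5 would.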
A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
on its source format.
1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format
(recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
and `value_max` constraints will be compared against MEDS' `numerical_value` field. **Note**: This syntax
does not currently support defining predicates that also rely on matching other, optional fields in the
MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull
request or upvoting any existing issue/PR that requests/implements this feature, and we will add support
for this capability.
2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
`code` will be interpreted in the following manner:
a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the
`"//"` character, with the first element specifying the name of the ESGPT measurement under
consideration, which should either be of the multi-label classification or multivariate regression
type, and the second element being the name of the categorical key corresponding to the code in
question within the underlying measurement specified. If either of `value_min` and `value_max` are
present, then this measurement must be of a multivariate regression type, and the corresponding
`values_column` for extracting numerical observations from ESGPT's `dynamic_measurements_df` will be
sourced from the ESGPT dataset configuration object.
b. If the code does not contain a `"//"`, it will be interpreted as a direct measurement name that must
be of the univariate regression type and its value, if needed, will be pulled from the corresponding
column.
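The ESGPT code interpretation described above can be sketched as a small helper. The function name `split_esgpt_code` and its return shape are hypothetical, and the sketch only separates the measurement name from the categorical key; it does not check measurement types or look up the `values_column`.

```python
from typing import Optional, Tuple


def split_esgpt_code(code: str) -> Tuple[str, Optional[str]]:
    """Split an ESGPT-style code into (measurement_name, categorical_key).

    Hypothetical helper: the key is None when the code has no "//", which
    corresponds to the direct (univariate regression) measurement case.
    """
    if "//" in code:
        # Assumption: only the first "//" separates the two elements.
        measurement, key = code.split("//", 1)
        return measurement, key
    return code, None
```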
#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates
These configuration objects consist of only a single string field--`expr`--which contains a limited grammar of
accepted operations that can be applied to other predicates, containing precisely the following:
- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
- `or(pred_1_name, pred_2_name, ...)`: Asserts that at least one of the specified predicates must be true.
Note that, currently, `and`s and `or`s cannot be nested. Upon user request, we may support further advanced
analytic operations over predicates.
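To make the grammar concrete, here is a minimal sketch of parsing and evaluating an `expr` string. The functions `parse_expr` and `evaluate` are hypothetical names, and evaluating against a dictionary of boolean truth values is a simplification chosen for this example (the real system works with predicate counts). Nested expressions are rejected, matching the current restriction.

```python
import re
from typing import Dict, List, Tuple

# Matches "and(...)" or "or(...)" with no parentheses inside the argument list.
_EXPR = re.compile(r"^\s*(and|or)\(\s*([^()]*?)\s*\)\s*$")


def parse_expr(expr: str) -> Tuple[str, List[str]]:
    """Return (operator, predicate_names) for a flat and/or expression."""
    match = _EXPR.match(expr)
    if match is None:
        raise ValueError(f"Unsupported or nested expression: {expr!r}")
    names = [name.strip() for name in match.group(2).split(",") if name.strip()]
    if not names:
        raise ValueError(f"No predicate names given in: {expr!r}")
    return match.group(1), names


def evaluate(expr: str, truth: Dict[str, bool]) -> bool:
    """Evaluate the expression given per-predicate truth values."""
    op, names = parse_expr(expr)
    values = [truth[name] for name in names]
    return all(values) if op == "and" else any(values)
```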
### Windows and Events:
#### Windows: `WindowConfig`
Windows contain a tracking `name` field, and otherwise are specified with two parts: (1) A set of four
parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
and (2) a dictionary of constraints (the `has` field) that specifies the constraints that must be satisfied
over the defined predicates for a possible realization of this window to be valid.
##### The Time Range Fields
###### `start` and `end`
Valid windows always progress in time from the `start` field to the `end` field. These two fields define, in
symbolic form, the relationship between the start and end time of the window. These two fields must obey the
following rules:
_Linkage to other windows_: Firstly, exactly one of these two fields must reference an external event, as
specified either through the name of the trigger event or the start or end event of another window. The other
field must either be `null`/`None`/omitted (which has a very specific meaning, to be explained shortly) or
must reference the field that references the external event.
_Linkage reference language_: Secondly, for both events, regardless of whether they reference an external
event or an internal event, that reference must be expressed in one of the following ways.
1. `$REFERENCING = $REFERENCED + $TIME_DELTA`, `$REFERENCING = $REFERENCED - $TIME_DELTA`, etc.
In this case, the referencing event (either the start or end of the window) will be defined as occurring
exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
end or start of the window).
Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
`$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
the window fields.
2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
In this case, the referencing event will be defined as the next or previous event satisfying the
predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
"previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
the window fields. Note that these forms can lead to windows being defined as single point events, if the
`$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
inclusive values are set.
3. `$REFERENCING = $REFERENCED`
In this case, the referencing event will be defined as the same event as the referenced event.
_`null`/`None`/omitted_: If `start` is `null`/`None`/omitted, then the window will start at the beginning of
the patient's record. If `end` is `null`/`None`/omitted, then the window will end at the end of the patient's
record. In either of these cases, the other field must reference an external event, per rule 1.
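The three reference forms above could be parsed along the following lines. This is only a sketch: it assumes the field value is the right-hand side of the reference (for example `start + 2 days`), that the referenced name, the operator, and the argument are separated by whitespace, and that the time delta is kept as raw text. The names `Bound` and `parse_bound` are hypothetical.

```python
from typing import NamedTuple, Optional

# Operator token -> kind of reference (labels chosen for this sketch).
_KINDS = {"+": "offset_after", "-": "offset_before", "->": "next", "<-": "previous"}


class Bound(NamedTuple):
    referenced: str          # name of the referenced event or window field
    kind: str                # one of the _KINDS values, or "same"
    argument: Optional[str]  # time delta text or predicate name, if any


def parse_bound(spec: str) -> Bound:
    """Parse the right-hand side of a start/end reference."""
    parts = spec.split(None, 2)
    if len(parts) == 1:
        # Form 3: the referencing event is the referenced event itself.
        return Bound(parts[0], "same", None)
    if len(parts) == 3 and parts[1] in _KINDS:
        referenced, op, argument = parts
        return Bound(referenced, _KINDS[op], argument)
    raise ValueError(f"Unrecognized reference: {spec!r}")
```

A `null`/`None`/omitted field would be handled before calling a parser like this, since it carries no reference at all.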
###### `start_inclusive` and `end_inclusive`
These two fields specify whether the start and end of the window are inclusive or exclusive, respectively.
This applies both to whether they are included in the calculation of the predicate values over the windows
and, in the `$REFERENCING = $REFERENCED -> $PREDICATE` and `$REFERENCING = $REFERENCED <- $PREDICATE` cases,
to which events can be used as valid next or prior `$PREDICATE` events. E.g., suppose that
`start_inclusive=False`, the `end` field is equal to `start -> $PREDICATE`, and the `start` event itself
satisfies `$PREDICATE`. Because `start_inclusive=False`, we think of the window as truly starting an iota
after the timestamp of the `start` event, so the `start` event cannot also serve as the `$PREDICATE` event
that marks the end of that window: its timestamp as the window end would fall before the effective window
start.
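The effect of `start_inclusive` on the `start -> $PREDICATE` case can be illustrated with a small sketch. The function `next_predicate_event` is hypothetical and assumes event times are given as ascending numbers with a parallel list of flags marking which events satisfy the predicate.

```python
from typing import Optional, Sequence


def next_predicate_event(
    times: Sequence[float],
    satisfies: Sequence[bool],
    start: float,
    start_inclusive: bool,
) -> Optional[float]:
    """Return the time of the first predicate-satisfying event at or after start.

    Assumes `times` is sorted ascending. The start event itself is only
    eligible when start_inclusive is True.
    """
    for time, ok in zip(times, satisfies):
        if not ok:
            continue
        if time > start or (start_inclusive and time == start):
            return time
    return None
```

With events at times 0, 1, and 2, where the events at 0 and 2 satisfy the predicate, a window starting at 0 would end at 0 when `start_inclusive=True` and at 2 when `start_inclusive=False`.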
##### The Constraints Field
The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
that define the valid range for the count of observations of the named predicate that must be found in a
window for it to be considered valid. Either `min_valid` or `max_valid` constraints can be `None`, in which case
those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. Note
that as predicate counts are always integral, this specification does not need an additional
inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
to be included.
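A minimal sketch of this check follows. The function name `window_is_valid` is hypothetical, and treating a predicate that is missing from the observed counts as having a count of zero is an assumption of this example.

```python
from typing import Dict, Optional, Tuple

Bounds = Tuple[Optional[int], Optional[int]]


def window_is_valid(counts: Dict[str, int], has: Dict[str, Bounds]) -> bool:
    """Check predicate counts for one window against inclusive (min, max) bounds."""
    for name, (min_valid, max_valid) in has.items():
        count = counts.get(name, 0)  # assumption: unobserved predicates count as 0
        if min_valid is not None and count < min_valid:
            return False
        if max_valid is not None and count > max_valid:
            return False
    return True
```

Under this sketch, the constraint `name: (1, 2)` accepts counts of 1 or 2 and rejects 0 or 3, and a predicate not listed in `has` never affects the result.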
#### Events: `EventConfig`
The event config consists of only a single field, `predicate`, which specifies the predicate that must be
observed with value greater than one to satisfy the event. There can only be one defined "event" with an
"EventConfig" in a valid configuration, and it will define the "trigger" event of the cohort.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -24,7 +24,8 @@ dependencies = [
"pandas == 2.2.2",
"loguru == 0.7.2",
"hydra-core == 1.3.2",
-"pytimeparse == 1.1.8"
+"pytimeparse == 1.1.8",
+"networkx == 3.3",
]

[project.optional-dependencies]