Adding more comprehensive tests across the API. (#28)

* Updated summarize API to simplify summarize_window so it can be ignored in testing.
* Split summarize_temporal_window into two functions; added namedtuple for type checking to aggregate function, aggregate function not working...
* Fix the odd test error. This fix is very strange.
* Expanded aggregate's doctest to cover other edge cases.
* Added a (currently unused) named tuple for event bound parameters; added a (currently failing) doctest for event bound summarizer
* updated docstrings and fixed doctest given proper understand of summarize event bound window; many more tests are needed.
* added another test and marginal refinement to the summarize_event_bound docstring.
* Corrected small typo, used ToEventWindowBound in summarizer.
* Added a terminology description file.
* Updated aggregate_temporal_window tests.
* Some edits
* Edits to the terminology
* Small edits to the recursion description
* Added a set of doctest cases for an aggregate event bound window function.
* Got aggregate_event_bound working for all cases where offset is 0
* With offset doctest partially passing; need to subtract temporal group by stage as well.
* Got it working in the case with a positive offset too.
* Separating out functions a bit and adding doctests for check_constraints.
* Added forggoten files.
* Include timestamp_at_start in aggregate functions to future proof and to make property testing easier.
* Some in-progress updates.
* Corrected some more of the doctests.
* Fixed the doctests up to the case with offset
* Fixed at least one test case's worth of the with offset period.
* Found a bug in the aggregate function; test is failing where (I believe) the test case has the correct output. We are likely double subtracting the row's counts through the offset aggregation and the cum_sum
* Yep, it was a double subtraction issue. Fixed the error.
* Another, I think, proper failing test case.
* Corrected that issue too.
* Another test failure; likely the counterpart to the prior one.
* Corrected one more bug. It was a slightly different issue than I anticipated.
* Corrected the issue.
* Added another passing test case.
* All test cases passing!
* Removed malformed constraint checking test case and corrected typo in extract subtree doctest.
* Trying to eliminate the other event bound functions to use the new general aggregation function; some test differences that warrant further assessment
* Ok, this change reflects a change in API -- basically, stating that if we have an event bound window that is right inclusive, then events that are the bounding event will end up with singleton row windows. That is fine, and in my opinion better than the alternative, but it is slightly different.
* Added a tiny additional comment pointing to additional testing in the event bound aggregation function.
* extract subtree (partial test only) appears to be working.
* Added more tests for extract_subtree. It is a little awkward as sometimes start and end are swapped unexpectedly, but the tests do seem to pass.
* Removed old API files and functions. Tests are passing but query_subtree may not be.
* Removing duplicated tests.
* fixed query script
* Config language updates and doc (#29)
* Some starting code for config file updates
* Some precommit fixes and documentation updates
* Added basic implementation of the polars expression logic for MEDS and ESGPT to the PredicateConfig object
* Expanded tests and added event_type case.
* Started filling out more of the logic in the configuration objects. still in progress
* More of the logic in the configuration objects. still in progress
* Added doctests and more window logic so windows can naturally be parsed into the appropriate endpoint expressions to facilitate object oriented tree building.
* Added (yet untested) code for initialization and validation of windows, which also includes tree building.
* Added some basic doctests to the config class.
* Removed references to the simplified parsing language.
* Working pipeline
* Remove outdated unit test file
* Fix whitespace?
* didn't work, undo - ask matthew
* Run script working on ESGPT / CSV / presumably MEDS parquet
* Update sample configs

---------

Co-authored-by: Justin Xu <justin13601@hotmail.com>
justin13601 · May 22, 2024 · 6939fcd
1 parent 8532167
commit 6939fcd

Showing 25 changed files with 3,455 additions and 2,130 deletions.
diff --git a/.gitignore b/.gitignore
@@ -163,3 +163,4 @@ cython_debug/
 .vscode/
 passwords.txt
 outputs/
+result.csv
diff --git a/config_str_language.md b/config_str_language.md
@@ -0,0 +1,178 @@
+# Configuration Language Specification
+
+## Introduction and Terminology
+
+This document specifies the configuration language for the automatic extraction of task dataframes and cohorts
+from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds)
+format (recommended) or the [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format. This extraction
+system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and
+labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all
+realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more
+details on the recursive algorithm, see the `terminology.md` file. **TODO** better integrate, name, and link
+to these documentation files.
+
+As indicated above, these cohorts are specified through a combination of concepts (realized as event
+_predicate_ functions, _aka_ "predicates") which are _dataset specific_ and inclusion/exclusion/labeling
+criteria which, conditioned on a set of predicate definitions, are _dataset agnostic_.
+
+Predicates are currently limited to "count" predicates, which count the number of times a boolean condition
+is satisfied over a given time window. That window can be either a single timepoint, in which case the
+predicate tracks how many observations satisfied the boolean condition in that event (_aka_ at that
+timepoint), or a 1-dimensional window. In the future, predicates may expand to include other notions of
+functional characterization, such as tracking the average/min/max value a concept takes on over a time-period,
+etc.
+
+Constraints are specified in terms of time-points that can be bounded by events that satisfy predicates or
+temporal relationships on said events. The windows between these time-points can then be constrained to
+contain events that satisfy certain aggregation functions over predicates for these time frames.
+
+## Machine Form (what is used by the algorithm)
+
+In the machine form, the configuration file consists of two parts:
+
+- `predicates`, stored as a dictionary from string predicate names (which must be unique) to either
+  `DirectPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
+  `DerivedPredicateConfig` objects, which store predicates that build on other predicates.
+- `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
+  objects.
+
+Next, we will detail each of these configuration objects.
+
+### Predicates: `DirectPredicateConfig` and `DerivedPredicateConfig`
+
+#### `DirectPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data
+
+These configs consist of the following five fields:
+
+- `code`: The string value for the categorical code object that is relevant for this predicate. An
+  observation will only satisfy this predicate if there is an occurrence of this code in the observation.
+- `value_min`: If specified, an observation will only satisfy this predicate if the underlying `code` occurs
+  with a reported numerical value that is either greater than or greater than or equal to `value_min`. Which
+  comparison applies is decided by `value_min_inclusive`: `value_min_inclusive=True` indicates that an
+  observation satisfies this predicate if its value is greater than or equal to `value_min`, and
+  `value_min_inclusive=False` indicates that a strict greater-than comparison will be used.
+- `value_max`: If specified, an observation will only satisfy this predicate if the underlying `code` occurs
+  with a reported numerical value that is either less than or less than or equal to `value_max`. Which
+  comparison applies is decided by `value_max_inclusive`: `value_max_inclusive=True` indicates that an
+  observation satisfies this predicate if its value is less than or equal to `value_max`, and
+  `value_max_inclusive=False` indicates that a strict less-than comparison will be used.
+- `value_min_inclusive`: See `value_min`
+- `value_max_inclusive`: See `value_max`
+
+A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
+on its source format.
+
+1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format
+   (recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
+   and `value_max` constraints will be compared against MEDS' `numerical_value` field. **Note**: This syntax
+   does not currently support defining predicates that also rely on matching other, optional fields in the
+   MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull
+   request or upvoting any existing issue/PR that requests/implements this feature, and we will add support
+   for this capability.
+2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
+   `code` will be interpreted in the following manner:
+   a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the
+   `"//"` separator, with the first element specifying the name of the ESGPT measurement under
+   consideration, which should either be of the multi-label classification or multivariate regression
+   type, and the second element being the name of the categorical key corresponding to the code in
+   question within the underlying measurement specified. If either `value_min` or `value_max` is
+   present, then this measurement must be of a multivariate regression type, and the corresponding
+   `values_column` for extracting numerical observations from ESGPT's `dynamic_measurements_df` will be
+   sourced from the ESGPT dataset configuration object.
+   b. If the code does not contain a `"//"`, it will be interpreted as a direct measurement name that must
+   be of the univariate regression type and its value, if needed, will be pulled from the corresponding
+   column.
+
+#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates
+
+These configuration objects consist of only a single string field--`expr`--which contains a limited grammar of
+accepted operations that can be applied to other predicates, containing precisely the following:
+
+- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
+- `or(pred_1_name, pred_2_name, ...)`: Asserts that at least one of the specified predicates must be true.
+
+Note that, currently, `and`s and `or`s cannot be nested. Upon user request, we may support further advanced
+analytic operations over predicates.
+
+### Windows and Events
+
+#### Windows: `WindowConfig`
+
+Windows contain a tracking `name` field, and otherwise are specified with two parts: (1) a set of four
+parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
+and (2) a dictionary of constraints (the `has` field) that specifies the constraints that must be satisfied
+over the defined predicates for a possible realization of this window to be valid.
+
+##### The Time Range Fields
+
+###### `start` and `end`
+
+Valid windows always progress in time from the `start` field to the `end` field. These two fields define, in
+symbolic form, the relationship between the start and end time of the window. These two fields must obey the
+following rules:
+
+_Linkage to other windows_: Firstly, exactly one of these two fields must reference an external event, as
+specified either through the name of the trigger event or the start or end event of another window. The other
+field must either be `null`/`None`/omitted (which has a very specific meaning, to be explained shortly) or
+must reference the field that references the external event.
+
+_Linkage reference language_: Secondly, for both events, regardless of whether they reference an external
+event or an internal event, that reference must be expressed in one of the following ways.
+
+1. `$REFERENCING = $REFERENCED + $TIME_DELTA`, `$REFERENCING = $REFERENCED - $TIME_DELTA`, etc.
+   In this case, the referencing event (either the start or end of the window) will be defined as occurring
+   exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
+   end or start of the window).
+   Note that if `$REFERENCED` is the `start` field, then the offset must be positive (the `+ $TIME_DELTA`
+   form), and if `$REFERENCED` is the `end` field, then the offset must be negative (the `- $TIME_DELTA`
+   form), to preserve the time ordering of the window fields.
+2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
+   In this case, the referencing event will be defined as the next or previous event satisfying the
+   predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
+   ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
+   "previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
+   the window fields. Note that these forms can lead to windows being defined as single point events, if the
+   `$REFERENCED` event itself satisfies `$PREDICATE`, the appropriate constraints are satisfied, and the
+   inclusive values are set accordingly.
+3. `$REFERENCING = $REFERENCED`
+   In this case, the referencing event will be defined as the same event as the referenced event.
+
+_`null`/`None`/omitted_: If `start` is `null`/`None`/omitted, then the window will start at the beginning of
+the patient's record. If `end` is `null`/`None`/omitted, then the window will end at the end of the patient's
+record. In either of these cases, the other field must reference an external event, per the _Linkage to
+other windows_ rule above.
+
+###### `start_inclusive` and `end_inclusive`
+
+These two fields specify whether the start and end of the window, respectively, are inclusive or exclusive.
+This applies both to whether the endpoints are included in the calculation of the predicate values over the
+window and, in the `$REFERENCING = $REFERENCED -> $PREDICATE` and `$REFERENCING = $REFERENCED <- $PREDICATE`
+cases, to which events may serve as valid next or prior `$PREDICATE` events. E.g., suppose that
+`start_inclusive=False`, that the `end` field is equal to `start -> $PREDICATE`, and that the `start` event
+itself satisfies `$PREDICATE`. Because `start_inclusive=False`, we think of the window as truly starting an
+iota after the timestamp of the `start` event itself. The `start` event therefore cannot also serve as the
+`$PREDICATE` event that marks the window end: its timestamp, when considered as the prospective "window start
+timestamp", occurs "after" its own effective timestamp when considered as the window end.
+
+##### The Constraints Field
+
+The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
+that define the valid range for the count of observations of the named predicate that must be found in a
+window for it to be considered valid. Either of `min_valid` or `max_valid` can be `None`, in which case
+those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. Note
+that as predicate counts are always integral, this specification does not need an additional
+inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
+to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
+the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
+`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
+to be included.
+
+#### Events: `EventConfig`
+
+The event config consists of only a single field, `predicate`, which specifies the predicate that must be
+observed with a count of at least one to satisfy the event. There can only be one defined "event" with an
+`EventConfig` in a valid configuration, and it will define the "trigger" event of the cohort.
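The value-bound semantics of `DirectPredicateConfig`, the `and(...)`/`or(...)` grammar of `DerivedPredicateConfig`, and the inclusive `(min_valid, max_valid)` constraint ranges specified in `config_str_language.md` above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code added in this commit: the class `DirectPredicate` and the helpers `eval_derived` and `satisfies_constraints` are hypothetical names, the default of `True` for both inclusive flags is an assumption (the specification does not state a default), and treating a missing numerical value as a failed match when a bound is set is likewise an assumption.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class DirectPredicate:
    """Hypothetical stand-in mirroring the five DirectPredicateConfig fields."""

    code: str
    value_min: Optional[float] = None
    value_max: Optional[float] = None
    value_min_inclusive: bool = True  # assumed default, not stated in the spec
    value_max_inclusive: bool = True  # assumed default, not stated in the spec

    def matches(self, code: str, value: Optional[float] = None) -> bool:
        """Return True if one observation (code, value) satisfies this predicate."""
        if code != self.code:
            return False
        has_bound = self.value_min is not None or self.value_max is not None
        if has_bound and value is None:
            return False  # assumption: a bound cannot be met without a value
        if self.value_min is not None:
            if value < self.value_min:
                return False
            if value == self.value_min and not self.value_min_inclusive:
                return False
        if self.value_max is not None:
            if value > self.value_max:
                return False
            if value == self.value_max and not self.value_max_inclusive:
                return False
        return True


def eval_derived(expr: str, truth: Dict[str, bool]) -> bool:
    """Evaluate a non-nested `and(...)` / `or(...)` expression over named predicates."""
    m = re.fullmatch(r"\s*(and|or)\((.*)\)\s*", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    op = m.group(1)
    names = [name.strip() for name in m.group(2).split(",")]
    values = [truth[name] for name in names]
    return all(values) if op == "and" else any(values)


def satisfies_constraints(
    counts: Dict[str, int],
    has: Dict[str, Tuple[Optional[int], Optional[int]]],
) -> bool:
    """Check predicate counts against inclusive (min_valid, max_valid) ranges."""
    for name, (min_valid, max_valid) in has.items():
        count = counts.get(name, 0)
        if min_valid is not None and count < min_valid:
            return False
        if max_valid is not None and count > max_valid:
            return False
    return True
```

With these definitions, a predicate with `value_min=60` and `value_min_inclusive=False` rejects an observation whose value is exactly 60, and a constraint `name: (1, 2)` accepts a window only when the count for `name` is 1 or 2. Predicates that do not appear in the constraint dictionary are never checked, which matches the statement that unreferenced predicates are left unconstrained.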
diff --git a/pyproject.toml b/pyproject.toml
@@ -24,7 +24,8 @@ dependencies = [
     "pandas == 2.2.2",
     "loguru == 0.7.2",
     "hydra-core == 1.3.2",
-    "pytimeparse == 1.1.8"
+    "pytimeparse == 1.1.8",
+    "networkx == 3.3",
 ]
 
 [project.optional-dependencies]