
Adding more comprehensive tests across the API. #28

Merged
merged 50 commits into from
May 22, 2024

Conversation

mmcdermott
Collaborator

@mmcdermott mmcdermott commented May 9, 2024

Still in draft mode. Some minor changes to the API expected.

Summary by CodeRabbit

  • New Features

    • Introduced detailed specifications for configuration language in config_str_language.md.
    • Added terminology and design details for algorithm extraction in terminology.md.
    • Added a parse_timedelta function to handle time string parsing.
  • Enhancements

    • Updated run.py to improve data loading from various file formats.
    • Improved predicate handling and logging in predicates.py.
    • Enhanced query.py to validate parameter types and utilize new modules.
  • Dependencies

    • Added networkx == 3.3 dependency.
  • Configuration

    • Restructured sample_configs/01-inhospital-mortality.yaml for better predicate and window definitions.
    • Updated run.yaml to change data_path configuration.
  • Chores

    • Added result.csv to .gitignore.
  • Bug Fixes

    • No specific bug fixes mentioned.

Contributor

coderabbitai bot commented May 9, 2024

Warning

Rate Limit Exceeded

@justin13601 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 39 minutes and 56 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.
Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.
Please see our FAQ for further information.

Commits: files that changed from the base of the PR, between e35d5a4 and 5ddfc1e.

Walkthrough

This update introduces a comprehensive overhaul of the configuration and querying system for extracting data from structured EHR data. Key changes include new dependencies, enhanced constraint validation, and significant refactoring of the predicate generation and query functions. The update also includes new utility functions, updated sample configurations, and improved documentation.

Changes

| File/Path | Change Summary |
| --- | --- |
| `config_str_language.md` | Introduced detailed specification for the configuration language for EHR data extraction. |
| `pyproject.toml` | Added `networkx == 3.3` to the list of dependencies. |
| `src/esgpt_task_querying/constraints.py` | Added `check_constraints` function with new parameters and logic for constraint validation. |
| `src/esgpt_task_querying/extract_subtree.py` | Updated `extract_subtree` function with new parameter `subtree_root_offset`. |
| `src/esgpt_task_querying/predicates.py` | Refactored `generate_predicate_columns` to `generate_predicates_df` with new parameters and logic. |
| `src/esgpt_task_querying/query.py` | Updated `query` function with new parameter names and imports; added type validation. |
| `src/esgpt_task_querying/types.py` | Introduced `TemporalWindowBounds` and `ToEventWindowBounds` named tuples with methods for Polars ops. |
| `sample_configs/01-inhospital-mortality.yaml` | Restructured predicates and windows sections; updated admission, discharge, death, and gap definitions. |
| `src/esgpt_task_querying/utils.py` | Added `parse_timedelta` function for parsing time strings into timedelta objects. |
| `terminology.md` | Introduced terminology and design details for the algorithm extracting valid time windows from event data. |
| `.gitignore` | Added `result.csv` to ignored files. |
| `run.py` | Renamed and restructured functions for loading data; updated logic for different file formats. |
| `run.yaml` | Updated `data_path` configuration from `sample_data/esgpt_sample` to `sample_data/sample.csv`. |

🐇
Code changes come and go, like springtime's gentle breeze,
With predicates and windows, we query with such ease.
Constraints now stand tall, in the forest of our code,
And data flows like rivers, down a well-defined road.
So here's a rabbit's cheer, for the updates that we see,
May our queries be swift, and our data ever free! 🌸



@codecov-commenter

codecov-commenter commented May 9, 2024

Codecov Report

Attention: Patch coverage is 94.44444%, with 12 lines in your changes missing coverage. Please review.

| Files | Coverage Δ |
| --- | --- |
| `src/esgpt_task_querying/constraints.py` | 100.00% <100.00%> (ø) |
| `src/esgpt_task_querying/types.py` | 100.00% <100.00%> (ø) |
| `src/esgpt_task_querying/extract_subtree.py` | 94.11% <94.11%> (ø) |
| `src/esgpt_task_querying/aggregate.py` | 93.97% <93.97%> (ø) |
| `src/esgpt_task_querying/summarize.py` | 37.50% <64.28%> (-3.07%) ⬇️ |

mmcdermott and others added 26 commits May 9, 2024 08:51
…for type checking to aggregate function, aggregate function not working...
…ded a (currently failing) doctest for event bound summarizer
…rize event bound window; many more tests are needed.
…ve) the test case has the correct output. We are likely double subtracting the row's counts through the offset aggregation and the cum_sum
* Some starting code for config file updates

* Some precommit fixes and documentation updates

* Added basic implementation of the polars expression logic for MEDS and ESGPT to the PredicateConfig object

* Expanded tests and added event_type case.

* Started filling out more of the logic in the configuration objects. still in progress

* More of the logic in the configuration objects. still in progress

* Added doctests and more window logic so windows can naturally be parsed into the appropriate endpoint expressions to facilitate object oriented tree building.

* Added (yet untested) code for initialization and validation of windows, which also includes tree building.

* Added some basic doctests to the config class.

* Removed references to the simplified parsing language.
@justin13601 justin13601 marked this pull request as ready for review May 20, 2024 05:50
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits: files that changed from the base of the PR, between 3f9f40e and 33c6ebb.
Files selected for processing (10)
  • config_str_language.md (1 hunks)
  • pyproject.toml (1 hunks)
  • src/esgpt_task_querying/aggregate.py (1 hunks)
  • src/esgpt_task_querying/config.py (1 hunks)
  • src/esgpt_task_querying/constraints.py (1 hunks)
  • src/esgpt_task_querying/extract_subtree.py (1 hunks)
  • src/esgpt_task_querying/query.py (3 hunks)
  • src/esgpt_task_querying/types.py (1 hunks)
  • src/esgpt_task_querying/utils.py (1 hunks)
  • terminology.md (1 hunks)
Files not summarized due to errors (2)
  • src/esgpt_task_querying/aggregate.py: Error: Message exceeds token limit
  • src/esgpt_task_querying/config.py: Error: Message exceeds token limit
Files not reviewed due to errors (1)
  • src/esgpt_task_querying/extract_subtree.py (no review received)
Files skipped from review due to trivial changes (1)
  • pyproject.toml
Additional Context Used
LanguageTool (23)
config_str_language.md (16)

Near line 33: Loose punctuation mark.
Context: ...e consists of two parts: - predicates, stored as a dictionary from string pred...


Near line 36: Loose punctuation mark.
Context: ...t build on other predicates. - windows, stored as a dictionary from string wind...


Near line 43: Loose punctuation mark.
Context: ...ateConfig ####DirectPredicateConfig`: Configuration of Predicates that can be...


Near line 47: Loose punctuation mark.
Context: ... of the following four fields: - code: The string value for the categorical co...


Near line 49: Loose punctuation mark.
Context: ...s code in the observation. - value_min: If specified, an observation will only ...


Near line 51: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ..._min(with these options being decided on the basis ofvalue_min_inclusive, where value_m...


Near line 55: Loose punctuation mark.
Context: ...t equal to will be used. - value_max: If specified, an observation will only ...


Near line 57: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ..._max(with these options being decided on the basis ofvalue_max_inclusive, where value_m...


Near line 61: Loose punctuation mark.
Context: ... will be used. - value_min_inclusive: See value_min - value_max_inclusive...


Near line 88: Loose punctuation mark.
Context: ... column. #### DerivedPredicateConfig: Configuration of Predicates that Depend...


Near line 93: Loose punctuation mark.
Context: ... - and(pred_1_name, pred_2_name, ...): Asserts that all of the specified predi...


Near line 93: Consider removing “of” to be more concise
Context: ..._name, pred_2_name, ...): Asserts that all of the specified predicates must be true. - o...


Near line 94: Loose punctuation mark.
Context: ...e. - or(pred_1_name, pred_2_name, ...): Asserts that any of the specified predi...


Near line 125: Loose punctuation mark.
Context: ...REFERENCING = $REFERENCED + $TIME_DELTA, $REFERENCING = $REFERENCED - $TIME_DEL...


Near line 132: Loose punctuation mark.
Context: ...REFERENCING = $REFERENCED -> $PREDICATE, $REFERENCING = $REFERENCED <- $PREDICA...


Near line 177: You might be missing the article “a” here.
Context: ...he predicate that must be observed with value greater than one to satisfy the event. ...

terminology.md (7)

Near line 5: The verb after “to” should be in the base form as part of the to-infinitive. A verb can take many forms, but the base form is always used in the to-infinitive.
Context: ...ch details events that have happened to subjects. Each row in the dataframe will have a ...


Near line 10: The phrase “a variety of” may be wordy. To make your writing clearer, consider replacing it.
Context: ... of columns which describe the event in a variety of ways. These columns can either have a b...


Near line 45: The phrase “a variety of” may be wordy. To make your writing clearer, consider replacing it.
Context: ...in this particular case could represent a variety of other events in the medical record, suc...


Near line 92: Did you mean “exists”?
Context: ...n a "valid" form for a patient if there exist a set of time points such that, within ...


Near line 142: The abbreviation “i.e.” (= that is) requires two periods.
Context: ...s strictly increasing in our algorithm (ie. the start of a "window" will always be ...


Near line 218: ‘In the event that’ might be wordy. Consider a shorter alternative.
Context: ...ng the recursive step. ##### offset In the event that the subtree root timestamp is not the s...


Near line 221: The modal verb ‘will’ requires the verb’s base form.
Context: ... If the two are not the same, they will guaranteed to be separated by a constant offset ...

Additional comments not posted (9)
src/esgpt_task_querying/utils.py (1)

6-32: The implementation of parse_timedelta is robust and well-documented. Good job on handling edge cases and providing detailed examples.
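The actual function body isn't reproduced in this thread. As a rough illustration only, a time-string parser of this general shape (the function name comes from the PR summary; the supported unit spellings and error behavior here are assumptions, not the repo's actual code) could look like:

```python
import re
from datetime import timedelta

# Hypothetical sketch; the real parse_timedelta in utils.py may differ.
_UNITS = {
    "d": "days", "day": "days", "days": "days",
    "h": "hours", "hour": "hours", "hours": "hours",
    "m": "minutes", "min": "minutes", "minute": "minutes", "minutes": "minutes",
    "s": "seconds", "sec": "seconds", "second": "seconds", "seconds": "seconds",
}

def parse_timedelta(time_str: str) -> timedelta:
    """Parse strings like '1 day 2 hours' into a datetime.timedelta."""
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", time_str)
    if not matches:
        raise ValueError(f"Could not parse time string: {time_str!r}")
    kwargs = {}
    for number, unit in matches:
        key = _UNITS.get(unit.lower())
        if key is None:
            raise ValueError(f"Unknown time unit: {unit!r}")
        # Accumulate so repeated units ("1h 30m 1h") still parse sensibly.
        kwargs[key] = kwargs.get(key, 0.0) + float(number)
    return timedelta(**kwargs)
```

Under these assumptions, `parse_timedelta("1 day 2 hours")` yields `timedelta(days=1, hours=2)`.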

src/esgpt_task_querying/query.py (1)

11-13: The integration of new imports and the addition of type checks in the query function are well-implemented. Ensure that the rest of the function correctly utilizes these new functionalities.

Also applies to: 26-27

Verification successful

The integration of check_constraints and extract_subtree in the query function is correctly implemented, as evidenced by their usage with appropriate parameters.

  • check_constraints is used with tree.constraints and predicates_df.
  • extract_subtree is used with tree, prospective_root_anchors, and predicates_df.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the integration of `check_constraints` and `extract_subtree` in the `query` function.

# Test: Search for the usage of these functions within the `query.py` file. Expect: Proper usage with correct parameters.
rg --type python $'check_constraints|extract_subtree' src/esgpt_task_querying/query.py

Length of output: 118



Script:

#!/bin/bash
# Description: Verify the integration of `check_constraints` and `extract_subtree` in the `query` function.

# Test: Search for the usage of these functions within the `query.py` file. Expect: Proper usage with correct parameters.
rg --type py 'check_constraints|extract_subtree' src/esgpt_task_querying/query.py

Length of output: 337

src/esgpt_task_querying/constraints.py (1)

9-102: The updates to check_constraints significantly improve its robustness and clarity. The detailed documentation and comprehensive error handling are particularly commendable.

src/esgpt_task_querying/types.py (1)

23-138: The new data structures TemporalWindowBounds and ToEventWindowBounds are well-implemented and documented. The use of dataclasses enhances their usability and maintainability.

Also applies to: 140-307
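The review doesn't reproduce the actual definitions. As a hedged sketch of the described shape (field names are borrowed from the Hypothesis example later in this thread; the property name and the mapping onto a single Polars `closed` keyword are guesses, not the repo's API):

```python
from dataclasses import dataclass
from datetime import timedelta

# Illustrative only; the real TemporalWindowBounds in types.py may differ.
@dataclass(frozen=True)
class TemporalWindowBounds:
    """Bounds for a fixed-size temporal window relative to each row."""
    left_inclusive: bool
    window_size: timedelta
    right_inclusive: bool
    offset: timedelta = timedelta(0)

    @property
    def polars_gp_rolling_kwargs(self) -> dict:
        # Collapse the two inclusivity flags into Polars' single "closed" value.
        if self.left_inclusive and self.right_inclusive:
            closed = "both"
        elif self.left_inclusive:
            closed = "left"
        elif self.right_inclusive:
            closed = "right"
        else:
            closed = "none"
        return {"period": self.window_size, "offset": self.offset, "closed": closed}
```

A frozen dataclass keeps the bounds hashable and immutable, which suits their role as pure window descriptors.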

src/esgpt_task_querying/config.py (2)

135-198: Review the logic in __post_init__ and eval_expr to ensure it handles all edge cases correctly.


421-537: Ensure error handling in WindowConfig is robust, especially in _parse_boundary and __post_init__.

src/esgpt_task_querying/aggregate.py (3)

198-199: Ensure type checking for endpoint_expr parameter.

The type checking here is robust, ensuring that endpoint_expr is always an instance of TemporalWindowBounds, which helps in maintaining type safety throughout the function.


274-275: Proper error handling for negative offsets.

The function correctly raises a ValueError if the offset is negative, which is crucial for preventing logical errors in time calculations.


836-839: Validation of mode and closed parameters.

The validation checks for mode and closed parameters are essential for ensuring that the function handles only expected values, thereby avoiding unexpected behaviors.

) -> pl.DataFrame:
"""Aggregates the predicates dataframe into the specified temporal buckets.

TODO: Use https://hypothesis.readthedocs.io/en/latest/quickstart.html to add extra tests.
Contributor

@coderabbitai coderabbitai bot May 20, 2024


Consider implementing the suggested tests using Hypothesis as indicated by the TODO comment.

Would you like assistance in setting up property-based tests using Hypothesis for this function?

Collaborator Author


yes please!

Contributor


@mmcdermott, here's a basic setup for property-based testing using Hypothesis for the aggregate_temporal_window function in your aggregate.py module. This will help ensure that your function behaves as expected across a wide range of input scenarios.

from datetime import timedelta

import polars as pl
from hypothesis import given, strategies as st
from polars.testing.parametric import column, dataframes

from src.esgpt_task_querying.aggregate import aggregate_temporal_window
from src.esgpt_task_querying.types import TemporalWindowBounds

# Strategy to generate random TemporalWindowBounds
temporal_window_bounds_strategy = st.builds(
    TemporalWindowBounds,
    left_inclusive=st.booleans(),
    window_size=st.timedeltas(min_value=timedelta(seconds=1), max_value=timedelta(days=365)),
    right_inclusive=st.booleans(),
    offset=st.timedeltas(min_value=timedelta(seconds=0), max_value=timedelta(days=1)),
)

# Strategy to generate random DataFrames, using Polars' own parametric testing helpers
dataframe_strategy = dataframes(
    cols=[
        column("timestamp", dtype=pl.Datetime),
        column("value", dtype=pl.Float64),
    ],
)

@given(predicates_df=dataframe_strategy, endpoint_expr=temporal_window_bounds_strategy)
def test_aggregate_temporal_window(predicates_df, endpoint_expr):
    result_df = aggregate_temporal_window(predicates_df, endpoint_expr)
    # Example property: a non-empty input should yield a non-empty output.
    if not predicates_df.is_empty():
        assert not result_df.is_empty()

# Run with pytest, or invoke directly (Hypothesis will drive the inputs):
# test_aggregate_temporal_window()

This test setup uses Hypothesis to generate random dataframes and temporal window bounds, and then checks that the output dataframe meets certain conditions (as an example, it checks that the output is not empty if the input isn't). You can expand this by adding more specific properties that you expect from your function.

dataframe, but the values in the predicate columns of the output dataframe will be the sum of the
values in the predicate columns of the input dataframe from the timestamp of the event in the row of
the output dataframe spanning the specified temporal window.
Returns:
Contributor


Duplicate return statement in the function documentation.

Consider removing the duplicate return statement to avoid confusion and maintain clarity in the documentation.

) -> pl.DataFrame:
"""Aggregates ``predicates_df`` between each row plus an offset and the next per-subject matching event.

TODO: Use https://hypothesis.readthedocs.io/en/latest/quickstart.html to test this function.
Contributor


Add comprehensive tests for aggregate_event_bound_window.

Would you like help in creating detailed test cases for this function, especially given its complexity and the critical nature of its functionality?

Comment on lines +841 to +870
if offset != timedelta(0):
if offset > timedelta(0):
left_inclusive = False
if mode == "row_to_bound":
# Here, we'll be taking cumsum_at_bound - cumsum_at_row - aggd_over_offset
if closed in ("left", "both"):
right_inclusive = False
else:
right_inclusive = True
else:
# Here, we'll be taking cumsum_at_row - cumsum_at_bound + aggd_over_offset
if closed in ("right", "both"):
right_inclusive = True
else:
right_inclusive = False
else:
right_inclusive = False
if mode == "row_to_bound":
# Here, we'll be taking cumsum_at_bound - cumsum_at_row + aggd_over_offset
if closed in ("left", "both"):
left_inclusive = True
else:
left_inclusive = False
else:
# Here, we'll be taking cumsum_at_row - cumsum_at_bound - aggd_over_offset
if closed in ("right", "both"):
left_inclusive = False
else:
left_inclusive = True

Contributor


Complex conditional logic for handling offsets.

This block contains complex conditional logic that could be simplified or broken down into smaller helper functions to improve readability and maintainability.

# Suggested refactor to simplify conditional logic (behavior-preserving)
def adjust_inclusivity_based_on_offset(mode, closed, offset):
    if offset > timedelta(0):
        left_inclusive = False
        if mode == "row_to_bound":
            # cumsum_at_bound - cumsum_at_row - aggd_over_offset
            right_inclusive = closed not in ("left", "both")
        else:
            # cumsum_at_row - cumsum_at_bound + aggd_over_offset
            right_inclusive = closed in ("right", "both")
    else:
        right_inclusive = False
        if mode == "row_to_bound":
            # cumsum_at_bound - cumsum_at_row + aggd_over_offset
            left_inclusive = closed in ("left", "both")
        else:
            # cumsum_at_row - cumsum_at_bound - aggd_over_offset
            left_inclusive = closed not in ("right", "both")
    return left_inclusive, right_inclusive

if offset != timedelta(0):
    left_inclusive, right_inclusive = adjust_inclusivity_based_on_offset(mode, closed, offset)

Comment on lines +1 to +178
type, and the second element being the name of the categorical key corresponding to the code in
question within the underlying measurement specified. If either of `value_min` and `value_max` are
present, then this measurement must be of a multivariate regression type, and the corresponding
`values_column` for extracting numerical observations from ESGPT's `dynamic_measurements_df` will be
sourced from the ESGPT dataset configuration object.
b. If the code does not contain a `"//"`, it will be interpreted as a direct measurement name that must
be of the univariate regression type and its value, if needed, will be pulled from the corresponding
column.

#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates

These configuration objects consist of only a single string field--`expr`--which contains a limited grammar of
accepted operations that can be applied to other predicates, containing precisely the following:

- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
- `or(pred_1_name, pred_2_name, ...)`: Asserts that any of the specified predicates must be true.

Note that, currently, `and`s and `or`s cannot be nested. Upon user request, we may support further advanced
analytic operations over predicates.
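As a toy illustration of this non-nested grammar (the function name and the evaluation over plain booleans are assumptions for clarity; the repo's `DerivedPredicateConfig` presumably builds Polars expressions instead):

```python
import re

def eval_derived_predicate(expr: str, predicates: dict) -> bool:
    """Evaluate a non-nested and(...)/or(...) expression over boolean predicate values."""
    match = re.fullmatch(r"(and|or)\(\s*([\w\s,]+?)\s*\)", expr)
    if match is None:
        raise ValueError(f"Unsupported expression: {expr!r}")
    op = match.group(1)
    names = [name.strip() for name in match.group(2).split(",")]
    values = [predicates[name] for name in names]
    # and(...) requires every predicate; or(...) requires at least one.
    return all(values) if op == "and" else any(values)
```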

### Windows and Events:

#### Windows: `WindowConfig`

Windows contain a tracking `name` field, and otherwise are specified with two parts: (1) A set of four
parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
and (2) a set of constraints specified through two fields, dictionary of constraints (the `has` field) that
specify the constraints that must be satisfied over the defined predicates for a possible realization of this
window to be valid.

##### The Time Range Fields

###### `start` and `end`

Valid windows always progress in time from the `start` field to the `end` field. These two fields define, in
symbolic form, the relationship between the start and end time of the window. These two fields must obey the
following rules:

_Linkage to other windows_: Firstly, exactly one of these two fields must reference an external event, as
specified either through the name of the trigger event or the start or end event of another window. The other
field must either be `null`/`None`/omitted (which has a very specific meaning, to be explained shortly) or
must reference the field that references the external event.

_Linkage reference language_: Secondly, for both events, regardless of whether they reference an external
event or an internal event, that reference must be expressed in one of the following ways.

1. `$REFERENCING = $REFERENCED + $TIME_DELTA`, `$REFERENCING = $REFERENCED - $TIME_DELTA`, etc.
In this case, the referencing event (either the start or end of the window) will be defined as occurring
exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
end or start of the window).
Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
`$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
the window fields.
2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
In this case, the referencing event will be defined as the next or previous event satisfying the
predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
"previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
the window fields. Note that these forms can lead to windows being defined as single point events, if the
`$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
inclusive values are set.
3. `$REFERENCING = $REFERENCED`
In this case, the referencing event will be defined as the same event as the referenced event.

_`null`/`None`/omitted_: If `start` is `null`/`None`/omitted, then the window will start at the beginning of
the patient's record. If `end` is `null`/`None`/omitted, then the window will end at the end of the patient's
record. In either of these cases, the other field must reference an external event, per rule 1.
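The reference grammar above can be sketched as a small parser. This is illustrative only: the class and function names and the returned shape are assumptions, not the configuration module's actual API.

```python
import re
from typing import NamedTuple, Optional

class BoundaryRef(NamedTuple):
    referenced: str               # e.g. "trigger", "start", or "other_window.end"
    kind: str                     # "same", "offset", or "event"
    detail: Optional[str] = None  # signed time-delta string or predicate name

def parse_boundary(ref: str) -> BoundaryRef:
    """Parse '$REFERENCED', '$REFERENCED +/- $TIME_DELTA', or '$REFERENCED ->/<- $PREDICATE'."""
    # Event-bound form: next (->) or previous (<-) event satisfying a predicate.
    event = re.fullmatch(r"(\S+)\s*(->|<-)\s*(\w+)", ref)
    if event is not None:
        return BoundaryRef(event.group(1), "event", event.group(3))
    # Time-delta form: a fixed signed offset from the referenced event.
    offset = re.fullmatch(r"(\S+)\s*([+-])\s*(.+)", ref)
    if offset is not None:
        return BoundaryRef(offset.group(1), "offset", offset.group(2) + offset.group(3))
    # Bare form: the same event as the referenced one.
    return BoundaryRef(ref.strip(), "same")
```

For example, `parse_boundary("trigger + 24h")` would return an "offset" reference to `trigger`, while `parse_boundary("start -> discharge")` would return an "event" reference.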

###### `start_inclusive` and `end_inclusive`

These two fields specify whether the start and end of the window are inclusive or exclusive, respectively.
This affects both whether the endpoint events are included in the calculation of the predicate values over the
window and, in the `$REFERENCING = $REFERENCED -> $PREDICATE` and `$REFERENCING = $REFERENCED <- $PREDICATE`
cases, which events may serve as valid next or prior `$PREDICATE` events. E.g., suppose
`start_inclusive=False`, the `end` field equals `start -> $PREDICATE`, and the `start` event itself happens to
satisfy `$PREDICATE`. Then `start_inclusive=False` means we do not consider the `start` event to be a valid
start for a window that ends at that same event: the window is treated as truly starting an iota after the
timestamp of the `start` event, so its prospective "window start timestamp" occurs "after" its own effective
timestamp when considered as the `$PREDICATE` event marking the window end.

##### The Constraints Field

The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
that define the valid range the count of observations of the named predicate that must be found in a window
for it to be considered valid. Either `min_valid` or `max_valid` constraints can be `None`, in which case
those endpoints are left unconstrained. Likewise, unreferenced predicates are also left unconstrained. Note
that as predicate counts are always integral, this specification does not need an additional
inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction
to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy
the constraint for predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate
`name` in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it
to be included.
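A minimal sketch of this inclusive-bounds semantics (the function name here is hypothetical, and the repo's `check_constraints` likely operates on Polars dataframes rather than plain dicts):

```python
from typing import Dict, Optional, Tuple

def window_satisfies_constraints(
    counts: Dict[str, int],
    constraints: Dict[str, Tuple[Optional[int], Optional[int]]],
) -> bool:
    """Return True iff every (min_valid, max_valid) bound holds, inclusively."""
    for name, (min_valid, max_valid) in constraints.items():
        n = counts.get(name, 0)  # unreferenced predicates stay unconstrained
        if min_valid is not None and n < min_valid:
            return False
        if max_valid is not None and n > max_valid:
            return False
    return True
```

Under this reading, `{"name": (1, 2)}` accepts a window with 1 or 2 observations of `name`, and a `None` endpoint leaves that side unbounded.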

#### Events: `EventConfig`

The event config consists of only a single field, `predicate`, which specifies the predicate that must be
observed with a value greater than one to satisfy the event. There can only be one defined "event" with an
"EventConfig" in a valid configuration, and it will define the "trigger" event of the cohort.
Contributor


The config_str_language.md document is well-written and provides a clear specification of the configuration language. Consider revising the document to address loose punctuation marks for enhanced readability.

- predicates`, stored as a dictionary from string predicate names
+ predicates, stored as a dictionary from string predicate names
- windows`, stored as a dictionary from string window names
+ windows, stored as a dictionary from string window names


Suggested change
# Configuration Language Specification
## Introduction and Terminology
This document specifies the configuration language for the automatic extraction of task dataframes and cohorts
from structured EHR data organized either via the [MEDS](https://github.com/Medical-Event-Data-Standard/meds)
format (recommended) or the [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format. This extraction
system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and
labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all
realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more
details on the recursive algorithm, see the `terminology.md` file. **TODO** better integrate, name, and link
to these documentation files.
As indicated above, these cohorts are specified through a combination of concepts (realized as event
_predicate_ functions, _aka_ "predicates") which are _dataset specific_ and inclusion/exclusion/labeling
criteria which, conditioned on a set of predicate definitions, are _dataset agnostic_.
Predicates are currently limited to "count" predicates: predicates that count the number of times a
boolean condition is satisfied over a given time window. That window can either be a single timepoint, in
which case the predicate tracks how many observations satisfied the boolean condition in that event (_aka_ at
that timepoint), or a 1-dimensional window. In the future, predicates may expand to include other notions of
functional characterization, such as tracking the average/min/max value a concept takes on over a time period,
etc.
Constraints are specified in terms of time-points that can be bounded by events that satisfy predicates or by
temporal relationships on said events. The windows between these time-points can then be constrained to
contain events that satisfy certain aggregation functions over predicates within those time frames.
## Machine Form (what is used by the algorithm)
In the machine form, the configuration file consists of two parts:
- `predicates`, stored as a dictionary from string predicate names (which must be unique) to either
`DirectPredicateConfig` objects, which store raw predicates with no dependencies on other predicates, or
`DerivedPredicateConfig` objects, which store predicates that build on other predicates.
- `windows`, stored as a dictionary from string window names (which must be unique) to `WindowConfig`
objects.
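As a sketch, a minimal machine-form configuration in YAML might look like the following. The predicate and window names are invented, and the exact surface syntax for the `start`/`end` expressions and time-delta strings is an assumption based on the descriptions below:

```yaml
predicates:
  admission:
    code: "ADMISSION"
  death:
    code: "DEATH"

windows:
  observation:
    start: null          # beginning of the patient record
    end: "trigger + 24h" # 24 hours after the trigger event
    start_inclusive: true
    end_inclusive: true
```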
Next, we will detail each of these configuration objects.
### Predicates: `DirectPredicateConfig` and `DerivedPredicateConfig`
#### `DirectPredicateConfig`: Configuration of Predicates that can be Computed Directly from Raw Data
These configs consist of the following four fields:
- `code`: The string value for the categorical code object that is relevant for this predicate. An
observation will only satisfy this predicate if there is an occurrence of this code in the observation.
- `value_min`: If specified, an observation will only satisfy this predicate if it contains an occurrence of
the underlying `code` with a reported numerical value that is greater than (or greater than or equal to)
`value_min`, with the choice governed by `value_min_inclusive`: `value_min_inclusive=True` indicates that an
observation satisfies this predicate if its value is greater than or equal to `value_min`, and
`value_min_inclusive=False` indicates that a strict greater-than comparison will be used.
- `value_max`: If specified, an observation will only satisfy this predicate if it contains an occurrence of
the underlying `code` with a reported numerical value that is less than (or less than or equal to)
`value_max`, with the choice governed by `value_max_inclusive`: `value_max_inclusive=True` indicates that an
observation satisfies this predicate if its value is less than or equal to `value_max`, and
`value_max_inclusive=False` indicates that a strict less-than comparison will be used.
- `value_min_inclusive`: See `value_min`
- `value_max_inclusive`: See `value_max`
A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending
on its source format.
1. If the source data is in [MEDS](https://github.com/Medical-Event-Data-Standard/meds) format
(recommended), then the `code` will be checked directly against MEDS' `code` field and the `value_min`
and `value_max` constraints will be compared against MEDS' `numerical_value` field. **Note**: This syntax
does not currently support defining predicates that also rely on matching other, optional fields in the
MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull
request or upvoting any existing issue/PR that requests/implements this feature, and we will add support
for this capability.
2. If the source data is in [ESGPT](https://eventstreamml.readthedocs.io/en/latest/) format, then the
`code` will be interpreted in the following manner:
a. If the code contains a `"//"`, it will be interpreted as being a two element list joined by the
`"//"` character, with the first element specifying the name of the ESGPT measurement under
consideration, which should either be of the multi-label classification or multivariate regression
type, and the second element being the name of the categorical key corresponding to the code in
question within the underlying measurement specified. If either of `value_min` and `value_max` are
present, then this measurement must be of a multivariate regression type, and the corresponding
`values_column` for extracting numerical observations from ESGPT's `dynamic_measurements_df` will be
sourced from the ESGPT dataset configuration object.
b. If the code does not contain a `"//"`, it will be interpreted as a direct measurement name that must
be of the univariate regression type and its value, if needed, will be pulled from the corresponding
column.
#### `DerivedPredicateConfig`: Configuration of Predicates that Depend on Other Predicates
These configuration objects consist of only a single string field, `expr`, which contains a limited grammar of
accepted operations that can be applied to other predicates, consisting of precisely the following:
- `and(pred_1_name, pred_2_name, ...)`: Asserts that all of the specified predicates must be true.
- `or(pred_1_name, pred_2_name, ...)`: Asserts that any of the specified predicates must be true.
Note that, currently, `and`s and `or`s cannot be nested. Upon user request, we may support further advanced
analytic operations over predicates.
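For example, assuming predicates named `icu_admission` and `invasive_ventilation` are defined elsewhere in the configuration (both names invented), a derived predicate could be:

```yaml
ventilated_in_icu:
  expr: "and(icu_admission, invasive_ventilation)"
```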
### Windows and Events:
#### Windows: `WindowConfig`
Windows contain a tracking `name` field and are otherwise specified in two parts: (1) a set of four
parameters (`start`, `end`, `start_inclusive`, and `end_inclusive`) that specify the time range of the window,
and (2) a dictionary of constraints (the `has` field) that specifies the constraints that must be satisfied
over the defined predicates for a possible realization of this window to be valid.
##### The Time Range Fields
###### `start` and `end`
Valid windows always progress in time from the `start` field to the `end` field. These two fields define, in
symbolic form, the relationship between the start and end time of the window. These two fields must obey the
following rules:
_Linkage to other windows_: Firstly, exactly one of these two fields must reference an external event, as
specified either through the name of the trigger event or the start or end event of another window. The other
field must either be `null`/`None`/omitted (which has a very specific meaning, to be explained shortly) or
must reference the field that references the external event.
_Linkage reference language_: Secondly, for both events, regardless of whether they reference an external
event or an internal event, that reference must be expressed in one of the following ways.
1. `$REFERENCING = $REFERENCED + $TIME_DELTA`, `$REFERENCING = $REFERENCED - $TIME_DELTA`, etc.
In this case, the referencing event (either the start or end of the window) will be defined as occurring
exactly `$TIME_DELTA` either after or before the event being referenced (either the external event or the
end or start of the window).
Note that if `$REFERENCED` is the `start` field, then `$TIME_DELTA` must be positive, and if
`$REFERENCED` is the `end` field, then `$TIME_DELTA` must be negative to preserve the time ordering of
the window fields.
2. `$REFERENCING = $REFERENCED -> $PREDICATE`, `$REFERENCING = $REFERENCED <- $PREDICATE`
In this case, the referencing event will be defined as the next or previous event satisfying the
predicate, `$PREDICATE`. Note that if the `$REFERENCED` is the `start` field, then the "next predicate
ordering" (`$REFERENCED -> $PREDICATE`) must be used, and if the `$REFERENCED` is the `end` field, then the
"previous predicate ordering" (`$REFERENCED <- $PREDICATE`) must be used to preserve the time ordering of
the window fields. Note that these forms can lead to windows being defined as single-point events, if the
`$REFERENCED` event itself satisfies `$PREDICATE` and the appropriate constraints are satisfied and
inclusive values are set.
3. `$REFERENCING = $REFERENCED`
In this case, the referencing event will be defined as the same event as the referenced event.
_`null`/`None`/omitted_: If `start` is `null`/`None`/omitted, then the window will start at the beginning of
the patient's record. If `end` is `null`/`None`/omitted, then the window will end at the end of the patient's
record. In either of these cases, the other field must reference an external event, per rule 1.
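The three reference forms might look like this across two window definitions (a sketch: `gap` and `target` are invented window names, `discharge` an invented predicate, and the time-delta string format and `window.end` reference syntax are assumptions):

```yaml
gap:
  start: "trigger"           # form 3: same event as the trigger
  end: "start + 24h"         # form 1: fixed time delta after `start`
target:
  start: "gap.end"           # linkage to another window's end event
  end: "start -> discharge"  # form 2: next event satisfying `discharge`
```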
###### `start_inclusive` and `end_inclusive`
These two fields specify whether the start and end of the window are inclusive or exclusive, respectively.
This applies both to whether they are included in the calculation of the predicate values over the windows
and, in the `$REFERENCING = $REFERENCED -> $PREDICATE` and `$REFERENCING = $REFERENCED <- $PREDICATE` cases,
to which events may be used as valid next or prior `$PREDICATE` events. E.g., if `start_inclusive=False` and
the `end` field equals `start -> $PREDICATE`, and it so happens that the `start` event itself satisfies
`$PREDICATE`, then `start_inclusive=False` means the `start` event cannot itself serve as the `$PREDICATE`
event that marks the window end: with `start_inclusive=False`, the window is treated as truly starting an
iota after the timestamp of the `start` event, so the `start` event's timestamp effectively falls "before"
the window it would both start and end.
##### The Constraints Field
The constraints field is a dictionary that maps predicate names to tuples of the form `(min_valid, max_valid)`
that define the valid range for the count of observations of the named predicate that must be found in a
window for it to be considered valid. Either `min_valid` or `max_valid` can be `None`, in which case that
endpoint is left unconstrained. Likewise, unreferenced predicates are left unconstrained. Note that, as
predicate counts are always integral, this specification does not need an additional inclusive/exclusive
endpoint field: one can simply increment the bound by one in the appropriate direction to achieve the same
result. Bounds are therefore always interpreted as inclusive, so a window satisfies the constraint for
predicate `name` with constraint `name: (1, 2)` if the count of observations of predicate `name` in the
window is either 1 or 2. All constraints in the dictionary must be satisfied for a window to be included.
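For instance, a `has` field like the following (with invented predicate names; the exact YAML rendering of the `(min, max)` tuples is an assumption) requires at least one lab observation and forbids any death event within the window:

```yaml
has:
  lab: (1, null)    # at least 1 lab observation; no upper bound
  death: (null, 0)  # no death events allowed in the window
```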
#### Events: `EventConfig`
The event config consists of only a single field, `predicate`, which specifies the predicate that must be
observed with a positive count to satisfy the event. There can only be one "event" defined with an
`EventConfig` in a valid configuration, and it defines the "trigger" event of the cohort.
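A trigger event definition is then just a single mapping; the `trigger_event` key appears in this PR's validation code, while the `admission` predicate name is invented for illustration:

```yaml
trigger_event:
  predicate: admission
```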

Comment on lines +218 to +221

In the event that the subtree root timestamp is not the same as the subtree anchor timestamp (there may be a
temporal offset between the two), the `offset` will be the difference between the two timestamps. If the two
are not the same, they will guaranteed to be separated by a constant `offset` because the subtree root will

Simplify the phrase for clarity.

- In the event that the subtree root timestamp is not the same as the subtree anchor timestamp (there may be a temporal offset between the two), the `offset` will be the difference between the two timestamps. If the two are not the same, they will guaranteed to be separated by a constant `offset` because the subtree root will either correspond to a fixed time delta from the subtree anchor or will be an actual event itself, in which case it will be the subtree anchor.
+ If the subtree root timestamp differs from the subtree anchor timestamp, the `offset` will be the difference between the two timestamps. If they differ, they will be separated by a constant `offset` because the subtree root will either correspond to a fixed time delta from the subtree anchor or will be an actual event itself, in which case it will be the subtree anchor.

Suggested change
In the event that the subtree root timestamp is not the same as the subtree anchor timestamp (there may be a
temporal offset between the two), the `offset` will be the difference between the two timestamps. If the two
are not the same, they will guaranteed to be separated by a constant `offset` because the subtree root will
If the subtree root timestamp differs from the subtree anchor timestamp, the `offset` will be the difference between the two timestamps. If they differ, they will be separated by a constant `offset` because the subtree root will
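The constant-offset invariant described here can be illustrated with a toy computation using the standard library `datetime` module (the 24-hour delta and timestamps are invented for the example):

```python
from datetime import datetime, timedelta

# Hypothetical example: the subtree root is defined as a fixed time delta
# (here, 24 hours) after the subtree anchor event.
anchor_ts = datetime(2024, 5, 1, 8, 30)
offset = timedelta(hours=24)
root_ts = anchor_ts + offset

# The two timestamps are separated by exactly `offset`, whatever the anchor is.
assert root_ts - anchor_ts == offset
```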


## Algorithm Terminology

#### Event

Correct heading level for document structure consistency.

- #### Event
+ ### Event

Suggested change
#### Event
### Event

dependencies on one another. In particular, we can realize the following tree structure constructed by nodes
inferred for the above configuration:

```

Specify language for code block for proper formatting.

- ```
+ ```yaml

Comment on lines 27 to 71
@dataclasses.dataclass
class PlainPredicateConfig:
    code: str
    value_min: float | None = None
    value_max: float | None = None
    value_min_inclusive: bool | None = None
    value_max_inclusive: bool | None = None

    def MEDS_eval_expr(self) -> pl.Expr:
        """Returns a Polars expression that evaluates this predicate for a MEDS formatted dataset.

        Examples:
            >>> expr = PlainPredicateConfig("BP//systolic", 120, 140, True, False).MEDS_eval_expr()
            >>> print(expr) # doctest: +NORMALIZE_WHITESPACE
            [([([(col("code")) == (String(BP//systolic))]) &
            ([(col("value")) >= (120)])]) &
            ([(col("value")) < (140)])]
            >>> cfg = PlainPredicateConfig("BP//systolic", value_min=120, value_min_inclusive=False)
            >>> expr = cfg.MEDS_eval_expr()
            >>> print(expr) # doctest: +NORMALIZE_WHITESPACE
            [([(col("code")) == (String(BP//systolic))]) & ([(col("value")) > (120)])]
            >>> cfg = PlainPredicateConfig("BP//diastolic")
            >>> expr = cfg.MEDS_eval_expr()
            >>> print(expr) # doctest: +NORMALIZE_WHITESPACE
            [(col("code")) == (String(BP//diastolic))]
        """

        criteria = [pl.col("code") == self.code]

        if self.value_min is not None:
            if self.value_min_inclusive:
                criteria.append(pl.col("value") >= self.value_min)
            else:
                criteria.append(pl.col("value") > self.value_min)
        if self.value_max is not None:
            if self.value_max_inclusive:
                criteria.append(pl.col("value") <= self.value_max)
            else:
                criteria.append(pl.col("value") < self.value_max)

        if len(criteria) == 1:
            return criteria[0]
        else:
            return pl.all_horizontal(criteria)



Codebase Verification

Add unit tests for the MEDS_eval_expr method in the PlainPredicateConfig class to ensure comprehensive coverage.

Analysis chain

Ensure comprehensive unit tests cover the MEDS_eval_expr method.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify unit tests for `MEDS_eval_expr` method in `PlainPredicateConfig`.

# Test: Search for unit tests related to `MEDS_eval_expr`.
rg --type python $'def test_MEDS_eval_expr'

Length of output: 75



Script:

#!/bin/bash
# Description: Verify unit tests for `MEDS_eval_expr` method in `PlainPredicateConfig`.

# Test: Search for occurrences of `MEDS_eval_expr` in test files.
rg 'MEDS_eval_expr' --type py

Length of output: 394
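To address this, a first unit test could pin down the boundary-inclusivity semantics. The sketch below uses a plain-Python stand-in (`PlainPredicate`, a hypothetical class mirroring `PlainPredicateConfig`'s comparison logic) so it runs without `polars`; a real test would instead construct `PlainPredicateConfig` and evaluate the expression from `MEDS_eval_expr()` against a small DataFrame:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlainPredicate:
    """Hypothetical stand-in mirroring PlainPredicateConfig's comparison semantics."""

    code: str
    value_min: float | None = None
    value_max: float | None = None
    value_min_inclusive: bool | None = None
    value_max_inclusive: bool | None = None

    def matches(self, code: str, value: float | None = None) -> bool:
        """True iff an observation with this code/value satisfies the predicate."""
        if code != self.code:
            return False
        if self.value_min is not None:
            # Inclusive bound rejects value < min; exclusive rejects value <= min.
            if value is None or (
                value < self.value_min
                if self.value_min_inclusive
                else value <= self.value_min
            ):
                return False
        if self.value_max is not None:
            # Inclusive bound rejects value > max; exclusive rejects value >= max.
            if value is None or (
                value > self.value_max
                if self.value_max_inclusive
                else value >= self.value_max
            ):
                return False
        return True


# Mirrors the doctest case: BP//systolic in [120, 140)
pred = PlainPredicate("BP//systolic", 120, 140, True, False)
assert pred.matches("BP//systolic", 120)       # inclusive lower bound
assert not pred.matches("BP//systolic", 140)   # exclusive upper bound
assert not pred.matches("BP//diastolic", 130)  # wrong code
```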

Comment on lines 735 to 945
"""

for node in nodes.values():
if node.parent:
node.parent = nodes[node.parent.name]
for name in self.windows:
if re.match(r"^\w+$", name) is None:
raise ValueError(
f"Window name '{name}' is invalid; must be composed of alphanumeric or '_' characters."
)

root = next(iter(nodes.values())).root
if self.trigger_event.predicate not in self.predicates:
raise KeyError(
f"Trigger event predicate '{self.trigger_event.predicate}' not found in predicates: "
f"{', '.join(self.predicates.keys())}"
)

return root
trigger_node = Node("trigger")

window_nodes = {"trigger": trigger_node}
for name, window in self.windows.items():
start_node = Node(f"{name}.start", endpoint_expr=window.start_endpoint_expr)
end_node = Node(f"{name}.end", endpoint_expr=window.end_endpoint_expr)

if window.root_node == "end":
# In this case, the end_node will bound an unconstrained window, as it is the window between
# a prior window and the defined anchor for this window, so it has no constraints. But the
# start_node will have the constraints corresponding to this window, as it is defined relative
# to the end node.
end_node.constraints = {}
start_node.constraints = window.has
start_node.parent = end_node
else:
            # In this case, the start_node will bound an unconstrained window, as it is the window
            # between a prior window and the defined anchor for this window, so it has no
            # constraints. But the end_node will have the constraints corresponding to this window,
            # as it is defined relative to the start node.
            end_node.constraints = window.has
            start_node.constraints = {}
            end_node.parent = start_node

            window_nodes[f"{name}.start"] = start_node
            window_nodes[f"{name}.end"] = end_node

        for name, window in self.windows.items():
            for predicate in window.referenced_predicates:
                if predicate not in self.predicates:
                    raise KeyError(
                        f"Window '{name}' references undefined predicate '{predicate}'.\n"
                        f"Window predicates: {', '.join(window.referenced_predicates)}\n"
                        f"Defined predicates: {', '.join(self.predicates.keys())}"
                    )

            if len(window.referenced_event) == 1:
                event = window.referenced_event[0]
                if event != "trigger":
                    raise KeyError(
                        f"Window '{name}' references undefined trigger event '{event}' -- must be trigger!"
                    )

                window_nodes[f"{name}.{window.root_node}"].parent = window_nodes[event]

            elif len(window.referenced_event) == 2:
                referenced_window, referenced_event = window.referenced_event
                if referenced_window not in self.windows:
                    raise KeyError(
                        f"Window '{name}' references undefined window '{referenced_window}' "
                        f"for event '{referenced_event}'. Allowed windows: {', '.join(self.windows.keys())}"
                    )
                if referenced_event not in {"start", "end"}:
                    raise KeyError(
                        f"Window '{name}' references undefined event '{referenced_event}' "
                        f"for window '{referenced_window}'. Allowed events: 'start', 'end'"
                    )

                parent_node = f"{referenced_window}.{referenced_event}"
                window_nodes[f"{name}.{window.root_node}"].parent = window_nodes[parent_node]
            else:
                raise ValueError(
                    f"Window '{name}' references invalid event '{window.referenced_event}' -- "
                    "must be of length 1 or 2."
                )

        # Clean up the tree
        final_node_names = []
        for n, node in window_nodes.items():
            if n == "trigger" or node.endpoint_expr is not None:
                final_node_names.append(n)
            else:
                for child in node.children:
                    child.parent = node.parent

        self.window_nodes = {n: window_nodes[n] for n in final_node_names}

    def __post_init__(self):
        self._initialize_predicates()
        self._initialize_windows()
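The "Clean up the tree" step above can be sketched in isolation with a minimal stand-in node class (the real code uses the project's tree nodes; `Node` and `prune` here are hypothetical illustrations, not the PR's API):

```python
class Node:
    # Minimal stand-in for the tree nodes used in config.py; tracks only a parent pointer.
    def __init__(self, name, endpoint_expr=None, parent=None):
        self.name, self.endpoint_expr, self.parent = name, endpoint_expr, parent


def prune(window_nodes: dict) -> dict:
    """Keep the trigger and any node with an endpoint_expr; splice the rest
    out of the tree by reparenting their children, as the loop above does."""
    final = {}
    for name, node in window_nodes.items():
        if name == "trigger" or node.endpoint_expr is not None:
            final[name] = node
        else:
            for child in window_nodes.values():
                if child.parent is node:
                    child.parent = node.parent
    return final


trigger = Node("trigger")
gap_start = Node("gap.start", parent=trigger)  # no endpoint_expr -> spliced out
gap_end = Node("gap.end", endpoint_expr="24h", parent=gap_start)
pruned = prune({"trigger": trigger, "gap.start": gap_start, "gap.end": gap_end})
# gap.start is removed and gap.end now hangs directly off the trigger
```

This mirrors the key invariant of the cleanup pass: constraint-free intermediate nodes disappear, but the ancestry of constrained nodes is preserved.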

Contributor

Check the implementation of load, save, _initialize_predicates, and _initialize_windows in TaskExtractorConfig for potential improvements.

Consider optimizing the initialization process to reduce complexity and improve maintainability.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (3)
sample_configs/01-inhospital-mortality.yaml (1)

14-31: Windows are correctly defined with appropriate constraints and boundaries. Consider adding comments to clarify the 'has' field usage in the 'gap' window.

src/esgpt_task_querying/predicates.py (1)

Line range hint 11-83: Function generate_predicates_df is well-implemented with clear documentation and handles different data standards effectively. Consider adding more detailed logging at each significant step for better traceability.

src/esgpt_task_querying/config.py (1)

656-1004: Class TaskExtractorConfig is robustly implemented with comprehensive methods for managing configuration data. Consider enhancing some error messages to provide more detailed guidance to users, especially in methods like _initialize_predicates and _initialize_windows.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits
Files that changed from the base of the PR and between 33c6ebb and 7b0d59f.
Files selected for processing (7)
  • sample_configs/01-inhospital-mortality.yaml (1 hunks)
  • src/esgpt_task_querying/config.py (1 hunks)
  • src/esgpt_task_querying/constraints.py (1 hunks)
  • src/esgpt_task_querying/extract_subtree.py (1 hunks)
  • src/esgpt_task_querying/predicates.py (2 hunks)
  • src/esgpt_task_querying/query.py (2 hunks)
  • src/esgpt_task_querying/types.py (1 hunks)
Files not summarized due to errors (1)
  • src/esgpt_task_querying/config.py: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (4)
  • src/esgpt_task_querying/constraints.py
  • src/esgpt_task_querying/extract_subtree.py
  • src/esgpt_task_querying/query.py
  • src/esgpt_task_querying/types.py
Additional comments not posted (2)
sample_configs/01-inhospital-mortality.yaml (2)

3-10: Predicates are well-defined and correctly structured.


11-11: Trigger event is appropriately set to 'admission'.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits
Files that changed from the base of the PR and between 7b0d59f and afd2653.
Files selected for processing (1)
  • src/esgpt_task_querying/config.py (1 hunks)
Files not summarized due to errors (1)
  • src/esgpt_task_querying/config.py: Error: Message exceeds token limit
Additional comments not posted (4)
src/esgpt_task_querying/config.py (4)

38-74: The implementation of MEDS_eval_expr method appears robust and handles different configurations of predicates effectively.


75-129: The ESGPT_eval_expr method is well-implemented, providing flexible expression evaluation based on the provided predicate configuration.


748-791: The load method is well-implemented with robust error handling and clear logging, ensuring that configuration files are correctly parsed and loaded.


792-816: The save method correctly handles file writing operations with checks for file existence and format. Good use of detailed error messages.

Comment on lines +163 to +179
    def __post_init__(self):
        if not self.expr:
            raise ValueError("Derived predicates must have a non-empty expression field.")

        self.is_and = self.expr.startswith("and(") and self.expr.endswith(")")
        self.is_or = self.expr.startswith("or(") and self.expr.endswith(")")
        if not (self.is_and or self.is_or):
            raise ValueError(f"Derived predicate expression must start with 'and(' or 'or(', got {self.expr}")

        if self.is_and:
            self.input_predicates = [x.strip() for x in self.expr[4:-1].split(",")]
        elif self.is_or:
            self.input_predicates = [x.strip() for x in self.expr[3:-1].split(",")]

        if len(self.input_predicates) < 2:
            raise ValueError(
                "Derived predicate expression must have at least two input predicates (comma separated), "

def parse_timedelta(time_str: str) -> timedelta:
    """Parse a time string and return a timedelta object.

    Using time expression parser: https://github.com/wroberts/pytimeparse
Contributor

Consider enhancing the error messages in DerivedPredicateConfig to provide more context about the validation failures, which could aid in debugging.

- raise ValueError(f"Derived predicate expression must start with 'and(' or 'or(', got {self.expr}")
+ raise ValueError(f"Derived predicate expression must start with 'and(' or 'or(', but received: {self.expr}. Please check the syntax.")
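The `and()`/`or()` parsing and validation discussed in this comment can be sketched as a standalone helper (`parse_expr` is a hypothetical name for illustration, not part of the PR):

```python
def parse_expr(expr: str) -> tuple[str, list[str]]:
    # Mirrors the validation in DerivedPredicateConfig.__post_init__:
    # the expression must be and(...) or or(...) with >= 2 comma-separated inputs.
    if not expr:
        raise ValueError("Derived predicates must have a non-empty expression field.")
    if expr.startswith("and(") and expr.endswith(")"):
        op, inner = "and", expr[4:-1]
    elif expr.startswith("or(") and expr.endswith(")"):
        op, inner = "or", expr[3:-1]
    else:
        raise ValueError(f"Derived predicate expression must start with 'and(' or 'or(', got {expr}")
    input_predicates = [x.strip() for x in inner.split(",")]
    if len(input_predicates) < 2:
        raise ValueError(
            "Derived predicate expression must have at least two input predicates (comma separated)"
        )
    return op, input_predicates
```

For example, `parse_expr("and(icu_admission, invasive_vent)")` yields the operator and the two referenced predicate names, while a malformed expression such as `not(a, b)` raises `ValueError`.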

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Outside diff range and nitpick comments (5)
src/esgpt_task_querying/predicates.py (3)

11-11: Ensure the function documentation is updated to reflect the new parameters and behavior.


17-17: The description of the standard parameter should clarify the expected values and their effects.


87-87: The final sorting and selection of columns is a critical operation. Ensure that it is optimized for performance, especially with large datasets.

run.py (1)

113-113: The final processing of the query results involves JSON encoding and writing to a CSV file. Ensure that this process handles large datasets efficiently and that the JSON encoding does not introduce data integrity issues.

Also applies to: 115-115

src/esgpt_task_querying/config.py (1)

75-131: Ensure consistent documentation for ESGPT_eval_expr.

The docstring for ESGPT_eval_expr mentions that it evaluates a predicate for a "MEDS formatted dataset," which seems to be a copy-paste error from MEDS_eval_expr. Please update the documentation to reflect the correct functionality.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits
Files that changed from the base of the PR and between afd2653 and e35d5a4.
Files ignored due to path filters (1)
  • result.csv is excluded by !**/*.csv
Files selected for processing (6)
  • .gitignore (1 hunks)
  • run.py (5 hunks)
  • run.yaml (1 hunks)
  • src/esgpt_task_querying/config.py (1 hunks)
  • src/esgpt_task_querying/extract_subtree.py (1 hunks)
  • src/esgpt_task_querying/predicates.py (1 hunks)
Files not summarized due to errors (1)
  • src/esgpt_task_querying/config.py: Error: Message exceeds token limit
Files skipped from review due to trivial changes (2)
  • .gitignore
  • run.yaml
Files skipped from review as they are similar to previous changes (1)
  • src/esgpt_task_querying/extract_subtree.py
Additional comments not posted (3)
src/esgpt_task_querying/predicates.py (1)

31-31: The handling of different standards ('CSV', 'MEDS', 'ESGPT') is a good approach, but ensure that the configuration for each standard is robust and well-tested, especially since the logic branches significantly.

Also applies to: 44-44, 60-60

run.py (1)

95-95: The configuration loading using TaskExtractorConfig.load is a key operation. Verify that this method handles all edge cases, such as missing or malformed configuration files.

src/esgpt_task_querying/config.py (1)

978-985: Ensure initialization order in __post_init__.

The __post_init__ method in TaskExtractorConfig initializes predicates before windows. Verify that this order does not cause issues, especially since windows might depend on predicates being fully initialized.

Verification successful

The initialization order in __post_init__ is correct.

The _initialize_windows method references self.predicates and checks for the existence of predicates before proceeding with window initialization. This confirms that predicates must be initialized before windows to avoid issues.

  • _initialize_windows references self.predicates at lines 879 and 914.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Check for any dependencies in window initialization on predicates.

# Test: Search for dependencies in the initialization order.
rg --type python 'self._initialize_windows()'

Length of output: 78



Script:

#!/bin/bash
# Correcting the previous command to search for the initialization of windows and predicates.

# Search for the definition and usage of the _initialize_windows method.
rg '_initialize_windows'

# Search for the definition and usage of the _initialize_predicates method.
rg '_initialize_predicates'

Length of output: 334



Script:

#!/bin/bash
# Extract the full definition of the _initialize_windows method.
ast-grep --lang python --pattern 'def _initialize_windows(self) { $$$ }'

# Extract the full definition of the _initialize_predicates method.
ast-grep --lang python --pattern 'def _initialize_predicates(self) { $$$ }'

Length of output: 1075



Script:

#!/bin/bash
# Correcting the pattern syntax to extract the full definition of the _initialize_windows method.
ast-grep --lang python --pattern 'def _initialize_windows(self): $$$'

# Correcting the pattern syntax to extract the full definition of the _initialize_predicates method.
ast-grep --lang python --pattern 'def _initialize_predicates(self): $$$'

Length of output: 13581

Comment on lines +38 to +74
    def MEDS_eval_expr(self) -> pl.Expr:
        """Returns a Polars expression that evaluates this predicate for a MEDS formatted dataset.

        Examples:
            >>> expr = PlainPredicateConfig("BP//systolic", 120, 140, True, False).MEDS_eval_expr()
            >>> print(expr) # doctest: +NORMALIZE_WHITESPACE
            [([([(col("code")) == (String(BP//systolic))]) &
               ([(col("value")) >= (120)])]) &
              ([(col("value")) < (140)])]
            >>> cfg = PlainPredicateConfig("BP//systolic", value_min=120, value_min_inclusive=False)
            >>> expr = cfg.MEDS_eval_expr()
            >>> print(expr) # doctest: +NORMALIZE_WHITESPACE
            [([(col("code")) == (String(BP//systolic))]) & ([(col("value")) > (120)])]
            >>> cfg = PlainPredicateConfig("BP//diastolic")
            >>> expr = cfg.MEDS_eval_expr()
            >>> print(expr) # doctest: +NORMALIZE_WHITESPACE
            [(col("code")) == (String(BP//diastolic))]
        """
        criteria = [pl.col("code") == self.code]

        if self.value_min is not None:
            if self.value_min_inclusive:
                criteria.append(pl.col("value") >= self.value_min)
            else:
                criteria.append(pl.col("value") > self.value_min)
        if self.value_max is not None:
            if self.value_max_inclusive:
                criteria.append(pl.col("value") <= self.value_max)
            else:
                criteria.append(pl.col("value") < self.value_max)

        if len(criteria) == 1:
            return criteria[0]
        else:
            return pl.all_horizontal(criteria)

Contributor

Refactor MEDS_eval_expr to improve readability and maintainability.

The method MEDS_eval_expr in PlainPredicateConfig class is quite complex. Consider breaking down the method into smaller, more manageable functions. This will not only improve readability but also make the code easier to maintain and test. For example, separate the criteria building and the final expression assembly into different methods.
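One way to see the effect of the inclusive/exclusive bound flags without Polars is a plain-Python analogue (`in_range` is a hypothetical helper for illustration, not the PR's API):

```python
def in_range(value, value_min=None, value_max=None,
             value_min_inclusive=True, value_max_inclusive=True):
    # Same bound logic as MEDS_eval_expr, applied to a single scalar value:
    # the *_inclusive flags switch between >=/> and <=/< comparisons.
    if value_min is not None:
        if not (value >= value_min if value_min_inclusive else value > value_min):
            return False
    if value_max is not None:
        if not (value <= value_max if value_max_inclusive else value < value_max):
            return False
    return True
```

With the docstring's systolic-BP example (`value_min=120` inclusive, `value_max=140` exclusive), a reading of 120 satisfies the predicate while a reading of 140 does not.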

Comment on lines +494 to +550
    def __post_init__(self):
        # Parse the has constraints from the string representation to the tuple representation
        if self.has is not None:
            for each_constraint in self.has:
                elements = self.has[each_constraint].strip("()").split(",")
                elements = [element.strip() for element in elements]
                self.has[each_constraint] = tuple(
                    int(element) if element != "None" else None for element in elements
                )

        if self.start is None and self.end is None:
            raise ValueError("Window cannot progress from the start of the record to the end of the record.")

        if self.start is None:
            self._parsed_start = {
                "referenced": "end",
                "offset": None,
                "event_bound": f"-{START_OF_RECORD_KEY}",
                "occurs_before": True,
            }
        else:
            self._parsed_start = self._parse_boundary(self.start)

        if self.end is None:
            self._parsed_end = {
                "referenced": "start",
                "offset": None,
                "event_bound": END_OF_RECORD_KEY,
                "occurs_before": False,
            }
        else:
            self._parsed_end = self._parse_boundary(self.end)

        if self._parsed_start["referenced"] == "end" and self._parsed_end["referenced"] == "start":
            raise ValueError(
                "Exactly one of the start or end of the window must reference the other. "
                f"Got: {self.start} -> {self.end}"
            )
        elif self._parsed_start["referenced"] == "end":
            self._start_references_end = True
            # We use `is False` because it may be None, which is distinct from True or False
            if self._parsed_start["occurs_before"] is False:
                raise ValueError(
                    f"Window start will not occur before window end! Got: {self.start} -> {self.end}"
                )
        elif self._parsed_end["referenced"] == "start":
            self._start_references_end = False
            # We use `is True` because it may be None, which is distinct from True or False
            if self._parsed_end["occurs_before"] is True:
                raise ValueError(
                    f"Window start will not occur before window end! Got: {self.start} -> {self.end}"
                )
        else:
            raise ValueError(
                "Exactly one of the start or end of the window must reference the other. "
                f"Got: {self.start} -> {self.end}"
            )
Contributor

Review the complexity of __post_init__ in WindowConfig.

The __post_init__ method in WindowConfig is quite complex and handles multiple validation scenarios. Consider refactoring this method to separate the validation logic into smaller, more focused methods. This will improve the readability and maintainability of the code.
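The `has`-constraint parsing performed at the top of this `__post_init__` can be exercised in isolation (`parse_has_constraint` is a hypothetical helper that mirrors the quoted loop):

```python
def parse_has_constraint(s: str) -> tuple:
    # "(1, None)" -> (1, None); "(None, 0)" -> (None, 0); "(2, 5)" -> (2, 5)
    elements = [element.strip() for element in s.strip("()").split(",")]
    return tuple(int(element) if element != "None" else None for element in elements)
```

This makes the string-to-tuple convention concrete: each bound is either an integer or `None` for "unbounded".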

Comment on lines +794 to +818
    def save(self, config_path: str | Path, do_overwrite: bool = False):
        """Save the calling object to the given filepath in YAML form.

        Args:
            config_path: The path to which the calling object will be saved in YAML form.
            do_overwrite: Whether or not to overwrite any existing saved configuration file at that filepath.

        Raises:
            FileExistsError: If there exists a file at the given location and ``do_overwrite`` is not `True`.
            ValueError: If the filepath is not a ".yaml" file.
        """
        if isinstance(config_path, str):
            config_path = Path(config_path)

        if config_path.is_file() and not do_overwrite:
            raise FileExistsError(
                f"Can't overwrite extant {str(config_path.resolve())} as do_overwrite={do_overwrite}"
            )

        if config_path.suffix == ".yaml":
            yaml = ruamel.yaml.YAML(typ="safe", pure=True)
            config_path.write_text(yaml.dump(dataclasses.asdict(self)))
        else:
            raise ValueError(f"Only supports writing to '.yaml' files currently. Got {config_path.suffix}")

Contributor

Clarify exception handling in save.

The save method in TaskExtractorConfig class raises a FileExistsError if the file exists and do_overwrite is not True. However, the error message could be more informative by explicitly stating that the file cannot be overwritten because do_overwrite is set to False.

- f"Can't overwrite extant {str(config_path.resolve())} as do_overwrite={do_overwrite}"
+ f"Cannot overwrite existing file {str(config_path.resolve())} because 'do_overwrite' is set to {do_overwrite}. To overwrite, set 'do_overwrite' to True."


Comment on lines +752 to +793
"""Load a configuration file from the given path and return it as a dict.

Args:
config_path: The path to which a configuration object will be read from in YAML form.

Raises:
FileNotFoundError: If the file does not exist.
ValueError: If the file is not a ".yaml" file.
"""
if isinstance(config_path, str):
config_path = Path(config_path)

if not config_path.is_file():
raise FileNotFoundError(f"Cannot load missing configuration file {str(config_path.resolve())}!")

if config_path.suffix == ".yaml":
yaml = ruamel.yaml.YAML(typ="safe", pure=True)
loaded_dict = yaml.load(config_path.read_text())
else:
raise ValueError(f"Only supports reading from '.yaml' files currently. Got {config_path.suffix}")

predicates = loaded_dict.pop("predicates")
trigger_event = loaded_dict.pop("trigger_event")
windows = loaded_dict.pop("windows")

if loaded_dict:
raise ValueError(f"Unrecognized keys in configuration file: {', '.join(loaded_dict.keys())}")

logger.debug("Parsing predicates...")
predicates = {
n: DerivedPredicateConfig(**p) if "expr" in p else PlainPredicateConfig(**p)
for n, p in predicates.items()
}

logger.debug("Parsing trigger event...")
trigger_event = EventConfig(trigger_event)

logger.debug("Parsing windows...")
windows = {n: WindowConfig(**w) for n, w in windows.items()}

return cls(predicates=predicates, trigger_event=trigger_event, windows=windows)

Contributor

Validate YAML file extension more robustly in load.

The method load in TaskExtractorConfig class only checks for the .yaml extension in a case-sensitive manner. Consider enhancing this check to be case-insensitive to handle cases where the file extension might be in uppercase (e.g., .YAML).

- if config_path.suffix == ".yaml":
+ if config_path.suffix.lower() == ".yaml":


cfg: The configuration object containing the predicate information.
ESD: The Polars DataFrame to add the predicate columns to.
cfg: The TaskExtractorConfig object containing the predicates information.
df_data: The Polars DataFrame containing the original external data.
Contributor

Typographical error in the argument name in the docstring; it should be data instead of df_data.

│ 3 ┆ 1 ┆ 0 ┆ 1 ┆ … ┆ 0 ┆ 1 ┆ 0 ┆ 1 │
│ 3 ┆ 3 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 1 │
└────────────┴───────────┴──────┴──────┴───┴──────┴───────────┴──────────────────┴────────┘
Examples: TODO
Contributor

Complete the TODO for examples in the docstring to help users understand how to use this function.

Would you like me to help by providing some examples here?

predicate_cols.append(name)
case "ESGPT":
for name, plain_predicate in cfg.plain_predicates.items():
if "event_type" in plain_predicate.code:
Contributor

The conditional handling within the 'ESGPT' case seems to assume a specific structure of data. This could lead to errors if data does not meet these expectations. Consider adding checks or clearer documentation on the expected structure of data.

Also applies to: 50-50

pl.first("event_id").over(["subject_id", "timestamp"]).rank("dense") - 1,
)
events = data.select(
data = data[0].join(data[1], on="event_id", how="left")
Contributor

The join operation on event_id should handle potential mismatches or missing IDs gracefully. Consider adding error handling or a fallback mechanism.
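A defensive left-join pattern can be illustrated with plain dicts (this is an illustration only, not the actual Polars join in the PR; `left_join` is a hypothetical helper):

```python
def left_join(events, measurements, key="event_id"):
    # Keep every event row even when its event_id has no match in measurements,
    # instead of silently dropping or erroring on unmatched IDs.
    lookup = {row[key]: row for row in measurements}
    joined = []
    for row in events:
        match = lookup.get(row[key])  # None when the ID is missing
        merged = dict(row)
        merged.update({k: v for k, v in (match or {}).items() if k != key})
        joined.append(merged)
    return joined


events = [{"event_id": 1, "t": 0}, {"event_id": 2, "t": 1}]
measurements = [{"event_id": 1, "hr": 80}]
joined_rows = left_join(events, measurements)
```

A left join already preserves unmatched rows by construction; the remaining design question is whether downstream code treats the resulting missing values explicitly rather than assuming every event has a measurement.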

@@ -25,24 +26,29 @@ def load_using_esgpt(cfg, path):
dynamic_measurements_df = ESD.dynamic_measurements_df

try:
df_predicates = predicates.generate_predicate_columns(cfg, [events_df, dynamic_measurements_df])
predicates_df = predicates.generate_predicates_df(
Contributor

The calls to generate_predicates_df are crucial. Ensure that the error handling is robust and that the configuration files are validated before this point to prevent runtime errors.

Also applies to: 64-64, 73-73

@@ -12,7 +12,8 @@
from esgpt_task_querying import config, predicates, query


def load_using_esgpt(cfg, path):
def load_using_directory(cfg, path):
Contributor

Tip

Codebase Verification

The functions load_using_directory and load_using_file lack proper docstrings and dedicated test cases.

  • run.py: Both functions have comments but no detailed docstrings.
  • No test files or test cases were found for these functions.

Please ensure that these functions are well-documented and adequately tested.

Analysis chain

The functions load_using_directory and load_using_file have been updated to handle specific data formats. Ensure that these functions are well-documented and tested, especially since they are critical for data loading.

Also applies to: 41-41

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the documentation and tests for data loading functions.

# Test: Search for documentation and tests for `load_using_directory` and `load_using_file`. Expect: Adequate documentation and tests.
rg --type python $'load_using_directory|load_using_file'

Length of output: 88



Script:

#!/bin/bash
# Description: Verify the documentation and tests for data loading functions.

# Test: Search for documentation and tests for `load_using_directory` and `load_using_file`. Expect: Adequate documentation and tests.
rg 'def load_using_directory|def load_using_file' --glob '*.py'

Length of output: 148



Script:

#!/bin/bash
# Description: Verify the presence of docstrings for `load_using_directory` and `load_using_file`.

# Search for docstrings in `load_using_directory` and `load_using_file`.
rg -A 3 'def load_using_directory|def load_using_file' --glob '*.py'

Length of output: 416


Script:

#!/bin/bash
# Description: Verify the presence of test cases for `load_using_directory` and `load_using_file`.

# Search for test cases related to `load_using_directory` and `load_using_file`.
rg 'load_using_directory|load_using_file' --glob '*test*.py'

Length of output: 211



Script:

#!/bin/bash
# Description: List all test files.

# List all test files to identify where tests might be located.
fd --glob '*test*.py'

Length of output: 23


Script:

#!/bin/bash
# Description: Verify the presence of test cases for `load_using_directory` and `load_using_file`.

# Search for test cases related to `load_using_directory` and `load_using_file` within identified test files.
fd --glob '*test*.py' --exec rg 'load_using_directory|load_using_file'

Length of output: 72



Script:

#!/bin/bash
# Description: List all Python files to identify potential test files.

# List all Python files to identify where tests might be located.
fd --glob '*.py'

Length of output: 413


Script:

#!/bin/bash
# Description: Verify the presence of test cases for `load_using_directory` and `load_using_file`.

# Search for test cases related to `load_using_directory` and `load_using_file` within identified Python files.
fd --glob '*.py' --exec rg 'load_using_directory|load_using_file'

Length of output: 263

@justin13601 justin13601 merged commit 6939fcd into master May 22, 2024
1 of 2 checks passed
@justin13601 justin13601 deleted the tests branch May 22, 2024 16:41