
added autogluon support, more models, more preprocessing strategies #81

Merged
65 commits, merged Sep 10, 2024

Conversation

@Oufattole (Collaborator) commented Aug 19, 2024

Summary by CodeRabbit

  • New Features

    • Introduced the EvaluationCallback class for enhanced evaluation of machine learning models based on performance metrics.
    • Added methods for logging model performance and managing top-performing models, improving model management efficiency.
  • Bug Fixes

    • Improved error handling for missing data columns, enhancing robustness in data processing.
  • Refactor

    • Simplified data handling logic in model configurations to enhance maintainability.
    • Streamlined test setup by utilizing tmp_path for temporary directory management.
    • Updated terminology throughout the codebase from patient_id to subject_id for consistency.
    • Consolidated script functionality under a more generic name for improved clarity.

coderabbitai bot (Contributor) commented Aug 19, 2024

Walkthrough

The recent updates enhance the CI/CD pipeline by implementing a matrix strategy for managing Python versions, allowing testing against both Python 3.11 and 3.12. Workflow files have been updated to reflect these changes, including version upgrades for actions used in GitHub Actions. A new BaseModel class has been introduced to standardize model implementations, and several tests have been added or modified to validate model configurations and functionality. Additionally, the testing framework has been streamlined for improved efficiency and clarity.
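As a rough illustration of the new abstract base class pattern (a minimal sketch; the method names here are assumptions, not necessarily the PR's actual API):

from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Abstract interface that concrete model wrappers implement."""

    @abstractmethod
    def train(self) -> None:
        """Train the model on the configured data."""

    @abstractmethod
    def evaluate(self) -> float:
        """Return a performance metric (e.g. AUC) on held-out data."""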

Changes

Files and change summaries:
  • .github/workflows/code-quality-main.yaml, .github/workflows/code-quality-pr.yaml: Upgraded actions/checkout from v3 to v4 and actions/setup-python from v3 to v5; specified Python version as "3.11".
  • .github/workflows/publish-to-pypi.yml: Updated Python version from "3.x" to "3.11"; updated PyPI URL; removed TestPyPI job.
  • .github/workflows/tests.yaml: Introduced a matrix strategy for Python versions (3.11, 3.12) in the GitHub Actions workflow.
  • .pre-commit-config.yaml: Changed default Python version from 3.12 to 3.11; removed exclusion pattern for certain directories.
  • README.md: Replaced patient_id with subject_id in various sections to align with updated terminology.
  • src/MEDS_tabular_automl/base_model.py: Added BaseModel class as an abstract base class for model training and evaluation with defined abstract methods.
  • src/MEDS_tabular_automl/evaluation_callback.py: Introduced EvaluationCallback class for evaluating models based on performance metrics logged in CSV files.
  • src/MEDS_tabular_automl/configs/launch_model.yaml: Introduced a new YAML configuration for launching machine learning models, specifying default settings and paths.
  • src/MEDS_tabular_automl/configs/default.yaml: Renamed directory paths for input and output operations for clarity.
  • tests/test_integration.py, tests/test_tabularize.py: Updated test functions to utilize tmp_path for temporary directories; modified tests to reflect changes in terminology and configuration.

Possibly related issues

  • Make compatible with MEDS v0.3.2 #80: The changes address the transition from patient_id to subject_id, aligning with the issue's objective of making the code compatible with MEDS v0.3.2.

Possibly related PRs

  • Dev #78: The changes in this PR involve updates to GitHub Actions workflow configurations, specifically upgrading action versions and specifying Python versions, which aligns with the main PR's focus on enhancing workflow configurations.

Poem

🐰 In the meadow, changes bloom,
Python hops, dispelling gloom.
With dependencies anew,
And types that fit just right, it's true!
Let's celebrate with joy and cheer,
A brighter path is drawing near! 🌼


Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8316365 and f7e03dd.

Files selected for processing (1)
  • src/MEDS_tabular_automl/evaluation_callback.py (1 hunks)
Additional comments not posted (2)
src/MEDS_tabular_automl/evaluation_callback.py (2)

39-50: LGTM!

The log_performance method is well-implemented and follows good practices such as using a logger instead of printing.
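As a rough illustration only (the field names and signature below are assumptions, not the PR's code), such a method might look like:

from loguru import logger

def log_performance(self, trial_dir, auc):
    # record each trial's directory and score so top models can be ranked later
    logger.info(f"Trial {trial_dir} achieved AUC {auc:.4f}")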


52-90: Add error handling for file deletion.

The delete_below_top_k_models method can be improved by handling exceptions that might occur during file operations, such as permission issues or files being in use.

Consider adding a try-except block around the shutil.rmtree call to handle such exceptions:

try:
    shutil.rmtree(trial_dir)
except PermissionError as e:
    logger.error(f"Permission denied: {trial_dir}, error: {e}")
except FileNotFoundError as e:
    logger.error(f"File not found: {trial_dir}, error: {e}")

This will make the method more robust and prevent the program from crashing if an exception occurs during file deletion.



@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0ec527a and 5fde57a.

Files selected for processing (2)
  • pyproject.toml (3 hunks)
  • src/MEDS_tabular_automl/mapper.py (2 hunks)
Additional comments not posted (5)
pyproject.toml (3)

Line range hint 15-15: Verify impact of removing MEDS-transforms.

The dependency MEDS-transforms==0.0.5 has been removed. Ensure that its removal does not affect any functionality that relied on it.

Verification successful

No direct usage of MEDS-transforms found in the codebase.

The search did not reveal any instances of MEDS-transforms being directly imported or used in the Python files. This suggests that its removal is unlikely to impact the current functionality. However, ensure that there are no indirect dependencies or configurations that might rely on it.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the removal of `MEDS-transforms` does not affect functionality.

# Test: Search for any usage of `MEDS-transforms` in the codebase.
rg --type python "from MEDS_transforms|import MEDS_transforms"

Length of output: 95


Script:

#!/bin/bash
# Description: Verify that the removal of `MEDS-transforms` does not affect functionality.

# Test: Search for any usage of `MEDS-transforms` in Python files.
rg --glob '*.py' "from MEDS_transforms|import MEDS_transforms"

Length of output: 64


35-35: Ensure correct usage of autogluon with environment marker.

The autogluon dependency has been added with a restriction to Python 3.11. Verify that this environment marker is correctly applied and that autogluon is necessary for the project.


11-11: Verify compatibility with Python 3.11.

The Python version requirement has been downgraded from 3.12 to 3.11. Ensure that all features and dependencies are compatible with Python 3.11.

src/MEDS_tabular_automl/mapper.py (2)

8-13: Good use of TypeVar for type safety.

The introduction of DF_T as a type variable enhances type safety and flexibility in the code.
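For reference, the underlying pattern is roughly (a generic sketch, not the file's exact code):

from typing import TypeVar

# a stand-in for "some dataframe type"; a function annotated with DF_T returns
# the same concrete type it was given
DF_T = TypeVar("DF_T")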


87-89: Simplified function signature enhances readability.

The changes to the wrap function signature improve clarity and maintain its intended functionality.

@codecov-commenter commented Aug 19, 2024

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 72.74119% with 178 lines in your changes missing coverage. Please review.

Project coverage is 81.91%. Comparing base (0ec527a) to head (f7e03dd).
Report is 66 commits behind head on main.

Files with missing lines | Patch % | Lines
src/MEDS_tabular_automl/tabular_dataset.py 60.76% 82 Missing ⚠️
...rc/MEDS_tabular_automl/scripts/launch_autogluon.py 35.84% 34 Missing ⚠️
src/MEDS_tabular_automl/evaluation_callback.py 43.24% 21 Missing ⚠️
src/MEDS_tabular_automl/sklearn_model.py 91.34% 9 Missing ⚠️
src/MEDS_tabular_automl/utils.py 77.50% 9 Missing ⚠️
...rc/MEDS_tabular_automl/generate_static_features.py 60.00% 4 Missing ⚠️
src/MEDS_tabular_automl/xgboost_model.py 94.73% 4 Missing ⚠️
src/MEDS_tabular_automl/base_model.py 86.36% 3 Missing ⚠️
src/MEDS_tabular_automl/generate_ts_features.py 57.14% 3 Missing ⚠️
src/MEDS_tabular_automl/scripts/cache_task.py 91.89% 3 Missing ⚠️
... and 3 more
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #81       +/-   ##
===========================================
- Coverage   92.18%   81.91%   -10.28%     
===========================================
  Files          14       20        +6     
  Lines         819     1255      +436     
===========================================
+ Hits          755     1028      +273     
- Misses         64      227      +163     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 5

Outside diff range, codebase verification and nitpick comments (2)
src/MEDS_tabular_automl/configs/launch_basemodel.yaml (2)

24-26: Clarify the use of n_iter.

The commented line # n_iter: ${model_params.epochs} suggests uncertainty about its behavior. If this parameter is needed, ensure it is correctly configured and documented.

# n_iter: ${model_params.epochs} # Uncomment if iterative behavior is desired

12-14: Verify variable initialization for directory paths.

The paths for input_dir and input_label_dir are not correctly resolved, resulting in //task_cache and //labels/. Please ensure that the output_cohort_dir and task_name variables are properly initialized in the runtime environment before using this configuration. This will ensure that the directories are correctly specified.

  • Verify the initialization of output_cohort_dir and task_name in the environment.
Analysis chain

Verify directory paths.

Ensure that the paths specified for input_dir and input_label_dir are correctly resolved in the runtime environment. These paths depend on the output_cohort_dir and task_name variables.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the paths for input_dir and input_label_dir are correctly resolved.

# Test: Check if the directories exist. Expect: Directories should exist or be created during runtime.
echo "${output_cohort_dir}/${task_name}/task_cache"
echo "${output_cohort_dir}/${task_name}/labels/"

Length of output: 59

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5fde57a and d6832cb.

Files selected for processing (7)
  • src/MEDS_tabular_automl/base_model.py (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_basemodel.yaml (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_basemodel.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_xgboost.py (2 hunks)
  • src/MEDS_tabular_automl/tabular_dataset.py (1 hunks)
  • src/MEDS_tabular_automl/xgboost_model.py (1 hunks)
  • tests/test_tabularize.py (2 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/tabular_dataset.py

78-78: Remove unnecessary True if ... else False

(SIM210)

Additional comments not posted (20)
src/MEDS_tabular_automl/xgboost_model.py (2)

39-53: Verify iterator initialization.

Ensure that the XGBIterator is correctly initialized with valid data shards and labels. The constructor raises a ValueError if no labels are found, which is a good practice.


135-171: Confirm model training parameters.

Verify that the parameters passed to xgb.train are correctly configured and that early stopping rounds are appropriately set. The use of OmegaConf.to_container is a good practice for passing configuration parameters.

Verification successful

Model Training Parameters Confirmed

The model training parameters, including num_boost_round and early_stopping_rounds, are correctly configured in the launch_xgboost.yaml configuration file. The use of OmegaConf.to_container is appropriate for passing these parameters to xgb.train.

  • Configuration file: src/MEDS_tabular_automl/configs/launch_xgboost.yaml
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the model training parameters are correctly configured.

# Test: Check for the presence of model parameters in the configuration. Expect: Parameters should be present.
rg --type yaml 'model_params' configs/

Length of output: 87


Script:

#!/bin/bash
# Search for 'model_params' across the entire codebase to locate the configuration source.
rg 'model_params'

Length of output: 3389
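To make the approved pattern concrete, here is a hedged sketch (it assumes cfg, dtrain, and dtuning already exist; num_boost_round and early_stopping_rounds come from launch_xgboost.yaml as noted above, though the exact config nesting is an assumption):

import xgboost as xgb
from omegaconf import OmegaConf

# convert the OmegaConf node into a plain dict before handing it to XGBoost
params = OmegaConf.to_container(cfg.model_params.model, resolve=True)
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=cfg.model_params.num_boost_round,
    early_stopping_rounds=cfg.model_params.early_stopping_rounds,
    evals=[(dtuning, "tuning")],
)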

src/MEDS_tabular_automl/base_model.py (2)

104-125: Verify model initialization.

Ensure that the model initialized from cfg.model_params.model is valid and has the necessary methods like fit. The check for the fit method is a good validation step.


135-171: Check partial fit support.

The _fit_from_partial method requires models to support partial_fit. Ensure that the models used are compatible with this method, or handle unsupported models appropriately.
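One defensive option for unsupported models (a sketch, assuming a scikit-learn-style estimator stored on self.model):

def _fit_from_partial(self):
    # fail fast with a clear message when the estimator cannot learn incrementally
    if not hasattr(self.model, "partial_fit"):
        raise ValueError(
            f"{type(self.model).__name__} does not support partial_fit; "
            "use the standard fit path instead."
        )
    ...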

tests/test_tabularize.py (3)

332-336: Configuration setup looks good.

The basemodel_config_kwargs setup is consistent and well-structured for the launch_basemodel module.


346-348: Execution and assertions are correct.

The execution of launch_basemodel.main(cfg) and the subsequent assertions ensure that the expected output files are generated.


350-368: Additional configurations and execution are well-implemented.

The second set of configurations and the execution of launch_basemodel.main(cfg) are appropriately set up, with assertions ensuring expected outcomes.

src/MEDS_tabular_automl/tabular_dataset.py (13)

40-59: Initialization logic is correct, but review commented-out code.

The __init__ method correctly initializes the TabularDataset class. Consider reviewing the commented-out line for loading IDs and labels to ensure it's either necessary or can be removed.


83-100: Matrix loading logic is correct.

The _load_matrix method correctly loads a sparse matrix and handles unexpected formats appropriately.


160-178: Logic for determining the code set is correct.

The _get_code_set method accurately retrieves feature columns and determines the codes set, masks, and feature count.


181-198: Shard loading and filtering logic is correct.

The _load_dynamic_shard_from_file method accurately loads a data shard and applies filtering based on aggregation type.


201-227: Dynamic shard retrieval and error handling logic is correct.

The _get_dynamic_shard_by_index method correctly retrieves a shard, applies filtering, and handles missing files with a ValueError.


230-242: Shard combination logic is correct.

The _get_shard_by_index method correctly retrieves and combines feature data and labels for a given shard.


270-293: Data retrieval and error handling logic is correct.

The get_data_shards method accurately retrieves feature data and labels for specified shards and handles empty data with a ValueError.


295-302: Data retrieval logic is correct.

The get_data method correctly delegates data retrieval to get_data_shards.


304-314: Event ID setting logic is correct, but consider implementing custom lists.

The set_event_ids method correctly loads or sets event IDs. Consider implementing the logic for handling custom event lists.


316-326: Label setting logic is correct, but consider implementing custom lists.

The set_labels method correctly loads or sets labels. Consider implementing the logic for handling custom label lists.


328-335: Code setting and mask updating logic is correct.

The set_codes method correctly updates the codes set and redeclares code masks.


337-345: Code addition and mask updating logic is correct.

The add_code method correctly adds a code to the set and updates the masks if the code is not already present.


347-355: Code removal and mask updating logic is correct.

The remove_code method correctly removes a code from the set and updates the masks if the code is present.

Resolved (outdated) review comments on: src/MEDS_tabular_automl/scripts/launch_xgboost.py; src/MEDS_tabular_automl/scripts/launch_basemodel.py (2); src/MEDS_tabular_automl/tabular_dataset.py (2)
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 6

Outside diff range, codebase verification and nitpick comments (7)
src/MEDS_tabular_automl/base_model.py (1)

9-23: Consider adding docstrings to abstract methods.

Adding docstrings to the abstract methods will provide clarity on their intended use and improve code documentation.

Add docstrings to each abstract method to describe its purpose and expected behavior.

src/MEDS_tabular_automl/configs/models/sgd_classifier.yaml (1)

19-19: Add a newline at the end of the file.

Adding a newline at the end of the file is a best practice that helps avoid issues with some text processing tools.

Add a newline character at the end of the file.

+ 
Tools
yamllint

[error] 19-19: no new line character at the end of file

(new-line-at-end-of-file)

src/MEDS_tabular_automl/configs/launch_model.yaml (1)

22-22: Add a newline at the end of the file.

Adding a newline at the end of the file is a best practice that helps avoid issues with some text processing tools.

Add a newline character at the end of the file.

+ 
Tools
yamllint

[error] 22-22: no new line character at the end of file

(new-line-at-end-of-file)

src/MEDS_tabular_automl/configs/models/xgboost.yaml (1)

27-27: Add a newline at the end of the file.

Adding a newline at the end of the file is a best practice that helps avoid issues with some text processing tools.

Add a newline character at the end of the file.

+ 
Tools
yamllint

[error] 27-27: no new line character at the end of file

(new-line-at-end-of-file)

src/MEDS_tabular_automl/configs/launch_sklearnmodel.yaml (1)

26-26: Clarify the commented-out n_iter line.

The n_iter parameter is commented out with a note. Clarify whether this behavior is intended or if it should be included in the configuration.

-    # n_iter: ${model_params.epochs} # not sure if we want this behaviour
+    # n_iter: ${model_params.epochs} # Clarify if this should be included
src/MEDS_tabular_automl/scripts/launch_xgboost.py (1)

51-53: Re-enable and enhance exception handling.

The exception handling block is commented out. Consider re-enabling it and handling specific exceptions to improve robustness and debugging.

-    # except Exception as e:
-    #     logger.error(f"Error occurred: {e}")
-    #     auc = 0.0
+    except Exception as e:
+        logger.error(f"Unexpected error occurred: {e}")
+        auc = 0.0
src/MEDS_tabular_automl/sklearn_model.py (1)

61-78: Clarify docstring for SklearnMatrix.

The docstring for SklearnMatrix should include a description of the labels parameter for completeness.

Update the docstring as follows:

    """Initializes the SklearnMatrix with the provided configuration and data split.

    Args:
        data: The feature data matrix.
+       labels: The array of labels corresponding to the data.
    """
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d6832cb and f3c985a.

Files selected for processing (18)
  • src/MEDS_tabular_automl/base_model.py (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_model.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_sklearnmodel.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/models/sgd_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/models/xgboost.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/tabularization/default.yaml (2 hunks)
  • src/MEDS_tabular_automl/dense_iterator.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_model.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_sklearnmodel.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_xgboost.py (2 hunks)
  • src/MEDS_tabular_automl/scripts/tabularize_static.py (1 hunks)
  • src/MEDS_tabular_automl/sklearn_model.py (1 hunks)
  • src/MEDS_tabular_automl/tabular_dataset.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (3 hunks)
  • src/MEDS_tabular_automl/xgboost_model.py (1 hunks)
  • tests/test_integration.py (3 hunks)
  • tests/test_tabularize.py (6 hunks)
Files skipped from review due to trivial changes (1)
  • tests/test_integration.py
Files skipped from review as they are similar to previous changes (3)
  • src/MEDS_tabular_automl/tabular_dataset.py
  • src/MEDS_tabular_automl/xgboost_model.py
  • tests/test_tabularize.py
Additional context used
Ruff
src/MEDS_tabular_automl/base_model.py

1-1: typing.Dict imported but unused

Remove unused import

(F401)


1-1: typing.Type imported but unused

Remove unused import

(F401)


4-4: omegaconf.DictConfig imported but unused

Remove unused import: omegaconf.DictConfig

(F401)

src/MEDS_tabular_automl/dense_iterator.py

1-1: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


3-3: hydra imported but unused

Remove unused import: hydra

(F401)


6-6: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)


9-9: sklearn.metrics.roc_auc_score imported but unused

Remove unused import: sklearn.metrics.roc_auc_score

(F401)


12-12: .base_model.BaseModel imported but unused

Remove unused import: .base_model.BaseModel

(F401)

src/MEDS_tabular_automl/scripts/launch_autogluon.py

2-2: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


56-56: Local variable predictor is assigned to but never used

Remove assignment to unused variable predictor

(F841)

yamllint
src/MEDS_tabular_automl/configs/models/sgd_classifier.yaml

[error] 19-19: no new line character at the end of file

(new-line-at-end-of-file)

src/MEDS_tabular_automl/configs/launch_model.yaml

[error] 22-22: no new line character at the end of file

(new-line-at-end-of-file)

src/MEDS_tabular_automl/configs/models/xgboost.yaml

[error] 27-27: no new line character at the end of file

(new-line-at-end-of-file)

Additional comments not posted (15)
src/MEDS_tabular_automl/configs/tabularization/default.yaml (1)

Line range hint 4-24: Verify the impact of parameter changes on code inclusion logic.

The introduction of min_code_inclusion_count and changes to _resolved_codes suggest a shift in how codes are included. Ensure these changes align with the intended functionality and do not disrupt existing processes.

Consider reviewing the logic in the codebase where these parameters are used to confirm they are correctly implemented.

Verification successful

Parameter Changes Verified: No Disruption Detected

The introduction and usage of min_code_inclusion_count and _resolved_codes are consistent with their intended functionality. They are correctly integrated into the codebase, with appropriate filtering logic and code management. The presence of related tests further supports their correct implementation.

  • min_code_inclusion_count is used for filtering based on frequency in utils.py.
  • _resolved_codes is used for managing allowed codes in multiple scripts.

These changes do not disrupt existing processes.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `min_code_inclusion_count` and `_resolved_codes` in the codebase.

# Test: Search for the usage of these parameters. Expect: Correct implementation in relevant code sections.
rg --type yaml 'min_code_inclusion_count|_resolved_codes'

Length of output: 447


Script:

#!/bin/bash
# Description: Search for the usage of `min_code_inclusion_count` and `_resolved_codes` in the codebase.

# Test: Search for these parameters in all code files to understand their usage.
rg 'min_code_inclusion_count|_resolved_codes' --type py --type js --type ts --type java --type go --type rs

Length of output: 136


Script:

#!/bin/bash
# Description: Search for the usage of `min_code_inclusion_count` and `_resolved_codes` in the codebase.

# Test: Search for these parameters in all code files to understand their usage.
rg 'min_code_inclusion_count|_resolved_codes' --type py --type js --type ts --type java --type go --type rust

Length of output: 2039

src/MEDS_tabular_automl/configs/launch_sklearnmodel.yaml (2)

12-14: Verify the use of placeholders in paths.

Ensure that the placeholders like ${output_cohort_dir} and ${now:%Y-%m-%d_%H-%M-%S} are correctly set in the environment or configuration to avoid runtime errors.

Also applies to: 16-17, 31-31


1-33: Configuration structure looks good.

The overall structure of the YAML configuration is clear and well-organized.

src/MEDS_tabular_automl/scripts/launch_xgboost.py (1)

30-50: Refactoring improves clarity.

The refactoring to directly use the XGBoostModel class and streamline the model saving process is a positive change.

src/MEDS_tabular_automl/dense_iterator.py (1)

15-52: Class structure is well-designed.

The DenseIterator class is well-structured and effectively handles data densification.

src/MEDS_tabular_automl/scripts/launch_model.py (1)

14-59: Script structure is well-organized.

The script is well-organized and effectively manages model launching with error handling.

src/MEDS_tabular_automl/scripts/launch_sklearnmodel.py (1)

16-54: Script structure is well-organized.

The script is well-organized and effectively manages Sklearn model launching with error handling.

src/MEDS_tabular_automl/scripts/launch_autogluon.py (2)

13-15: Ensure configuration file existence.

The code checks for the existence of the configuration file and raises an error if not found. This is a good practice to ensure the necessary configuration is available.


31-35: Handle missing AutoGluon installation gracefully.

The code correctly logs an error if AutoGluon is not installed, guiding the user to install it.

src/MEDS_tabular_automl/scripts/tabularize_static.py (1)

94-98: Verify parameter changes in filter_to_codes.

The parameters passed to filter_to_codes have been altered. Ensure these changes align with the intended logic and that the new parameters are correctly configured in the configuration file.

Run the following script to verify the parameter usage in the configuration:

src/MEDS_tabular_automl/sklearn_model.py (4)

15-38: Ensure proper initialization and error handling in SklearnIterator.

The constructor initializes data and raises an error if no labels are found, which is a good practice to prevent downstream errors.


81-126: Ensure model compatibility in SklearnModel.

The constructor checks if the model has a fit method, which is essential for compatibility with the training process.


137-163: Implement early stopping logic in _fit_from_partial.

The method implements early stopping based on AUC, which is a good practice for preventing overfitting.


195-234: Handle empty predictions in evaluate.

The method raises an error if predictions or true labels are empty, ensuring the evaluation process is valid.

src/MEDS_tabular_automl/utils.py (1)

Line range hint 50-90: Review parameter changes in filter_to_codes.

The function's parameters have been restructured for better clarity and flexibility. Ensure that the logic aligns with the intended filtering criteria.

Resolved (outdated) review comments on: src/MEDS_tabular_automl/base_model.py; src/MEDS_tabular_automl/dense_iterator.py; src/MEDS_tabular_automl/scripts/launch_model.py; src/MEDS_tabular_automl/scripts/launch_sklearnmodel.py
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f3c985a and b65754c.

Files selected for processing (11)
  • src/MEDS_tabular_automl/base_model.py (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_model.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/models/sgd_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/models/xgboost.yaml (1 hunks)
  • src/MEDS_tabular_automl/dense_iterator.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_model.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_xgboost.py (2 hunks)
  • src/MEDS_tabular_automl/sklearn_model.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (2 hunks)
  • src/MEDS_tabular_automl/xgboost_model.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • src/MEDS_tabular_automl/configs/launch_model.yaml
Files skipped from review as they are similar to previous changes (8)
  • src/MEDS_tabular_automl/base_model.py
  • src/MEDS_tabular_automl/configs/models/sgd_classifier.yaml
  • src/MEDS_tabular_automl/configs/models/xgboost.yaml
  • src/MEDS_tabular_automl/dense_iterator.py
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py
  • src/MEDS_tabular_automl/scripts/launch_model.py
  • src/MEDS_tabular_automl/sklearn_model.py
  • src/MEDS_tabular_automl/xgboost_model.py
Additional comments not posted (1)
src/MEDS_tabular_automl/scripts/launch_xgboost.py (1)

50-53: Enhance exception handling for better debugging.

The current exception handling logs a generic error message. Consider handling specific exceptions to provide more detailed error information, which can aid in debugging.

except FileNotFoundError as e:
    logger.error(f"Configuration file not found: {e}")
    auc = 0.0
except Exception as e:
    logger.error(f"Unexpected error occurred: {e}")
    auc = 0.0

Resolved (outdated) review comments on: src/MEDS_tabular_automl/utils.py (2)
Repository owner deleted a comment from coderabbitai bot Aug 20, 2024
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b65754c and d07f6a2.

Files selected for processing (6)
  • src/MEDS_tabular_automl/configs/launch_autogluon.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_xgboost.yaml (1 hunks)
  • src/MEDS_tabular_automl/dense_iterator.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py (1 hunks)
  • src/MEDS_tabular_automl/tabular_dataset.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (4 hunks)
Files skipped from review as they are similar to previous changes (2)
  • src/MEDS_tabular_automl/dense_iterator.py
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py
Additional comments not posted (14)
src/MEDS_tabular_automl/configs/launch_autogluon.yaml (4)

1-7: Verify the correctness of default overrides.

Ensure that the overrides for hydra/sweeper, hydra/sweeper/sampler, and hydra/launcher are correctly set up for your use case. These settings dictate how the sweeper and launcher will behave, which can impact the performance and results of your AutoGluon tasks.


12-14: Check the directory paths for task data and labels.

The paths for input_dir and input_label_dir are constructed using output_cohort_dir and task_name. Verify that these variables are correctly defined in your environment to avoid path resolution issues.


19-24: Review model parameters for AutoGluon.

The model_params section includes parameters like keep_data_in_memory and binarize_task. Ensure these settings align with your intended model configuration and data handling requirements.


25-26: Ensure logging paths are correctly set.

The log_dir and log_filepath are set under model_dir. Confirm that these paths are writable and correctly set up to capture logs for debugging and analysis.

src/MEDS_tabular_automl/configs/launch_xgboost.yaml (2)

56-56: Ensure correct usage of min_code_inclusion_count.

The parameter min_code_inclusion_frequency has been renamed to min_code_inclusion_count. Verify that this change is reflected throughout the codebase to prevent potential mismatches or errors.


56-56: Review the addition of max_depth parameter.

The max_depth parameter has been added to the model configuration. Ensure that this parameter is necessary for your model's complexity and performance requirements.

src/MEDS_tabular_automl/utils.py (3)

50-53: Ensure parameter restructuring aligns with intended functionality.

The filter_to_codes function has been restructured with new parameters. Verify that the new structure supports the intended filtering logic and that all parameters are used correctly in the function.


79-84: Implement logic for min_code_inclusion_frequency.

The logic for handling min_code_inclusion_frequency is currently commented out. Ensure that this functionality is implemented correctly to filter codes based on their frequency relative to the dataset size.
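One way the commented-out logic could be realized, sketched with polars (the frame, column names, and variables below are assumptions):

import polars as pl

# keep only codes whose relative frequency across the dataset clears the threshold
allowed_codes = (
    feature_freqs.with_columns((pl.col("count") / total_rows).alias("freq"))
    .filter(pl.col("freq") >= min_code_inclusion_frequency)
    .get_column("code")
    .to_list()
)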


88-89: Optimize sorting logic for max_include_codes.

The current sorting logic sorts the entire dataset before taking the top max_include_codes. Consider using a more efficient approach if the dataset is large, such as using a partial sort.
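For example, numpy's argpartition selects the top k in linear time without fully sorting (counts is an assumed array of per-code counts):

import numpy as np

# indices of the max_include_codes largest counts, without an O(n log n) full sort
top_idx = np.argpartition(counts, -max_include_codes)[-max_include_codes:]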

src/MEDS_tabular_automl/tabular_dataset.py (5)

41-49: Verify initialization logic for TabularDataset.

Ensure that the initialization logic correctly sets up the dataset based on the configuration and data split. Verify that the paths and configurations are correctly resolved and that any dependencies are met.


64-83: Review the logic for creating code masks.

The _get_code_masks method creates boolean masks for filtering features. Ensure that the logic correctly reflects the intended filtering criteria and that it efficiently handles large datasets.


106-134: Consider addressing the TODO comment regarding NaN values.

The _load_ids_and_labels method functions correctly, but the TODO comment suggests potential issues with NaN values.


162-194: Review feature code inclusion logic.

The _get_code_set method determines which feature codes to include. Ensure that the logic aligns with your feature selection criteria and that it handles edge cases appropriately.


401-412: Simplify boolean mask creation.

The True if ... else False construct is unnecessary. Simplify the mask creation as follows:

Resolved (outdated) review comment on: src/MEDS_tabular_automl/tabular_dataset.py
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (8)
src/MEDS_tabular_automl/configs/tabularization/default.yaml (1)

6-6: Consider setting a default value for max_included_codes.

While max_included_codes is introduced, it is currently set to null. Consider setting a default value to prevent potential issues if this parameter is used without being explicitly defined.

- max_included_codes: null
+ max_included_codes: 100  # Example default value
src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml (1)

14-15: Consider parameterizing epochs and early_stopping_rounds.

The values for epochs and early_stopping_rounds are hardcoded. Consider parameterizing these values to allow for more flexible configuration.

- epochs: 20
- early_stopping_rounds: 5
+ epochs: ${epochs}
+ early_stopping_rounds: ${early_stopping_rounds}
src/MEDS_tabular_automl/configs/model/xgboost.yaml (1)

14-16: Consider parameterizing num_boost_round and early_stopping_rounds.

The values for num_boost_round and early_stopping_rounds are hardcoded. Consider parameterizing these values to allow for more flexible configuration.

- num_boost_round: 1000
- early_stopping_rounds: 5
+ num_boost_round: ${num_boost_round}
+ early_stopping_rounds: ${early_stopping_rounds}
src/MEDS_tabular_automl/scripts/launch_model.py (1)

12-12: Consider making the configuration path more flexible.

The configuration file path is hardcoded, which might limit flexibility. Consider allowing this path to be specified via command-line arguments or environment variables.

- config_yaml = files("MEDS_tabular_automl").joinpath("configs/launch_model.yaml")
+ config_yaml = Path(os.getenv("CONFIG_PATH", "src/MEDS_tabular_automl/configs/launch_model.yaml"))
tests/test_configs.py (1)

20-27: Consider using a more robust method for running commands.

Using subprocess.run with shell=True can be risky. Consider using a list of arguments instead to avoid shell injection vulnerabilities.

- command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
+ command_out = subprocess.run(command_parts, capture_output=True)
src/MEDS_tabular_automl/generate_static_features.py (3)

Line range hint 14-38: Consider optimizing the iteration for sparse matrix conversion.

The current implementation uses nested loops to iterate over the DataFrame, which can be inefficient for large datasets. Consider using vectorized operations or built-in functions to optimize this process.

-  for row in range(dense_matrix.shape[0]):
-      for col in range(dense_matrix.shape[1]):
-          data = dense_matrix[row, col]
-          if (data is not None) and (data != 0):
-              data_list.append(data)
-              rows.append(row)
-              cols.append(col)
+  non_zero_indices = np.nonzero(dense_matrix)
+  data_list = dense_matrix[non_zero_indices]
+  rows, cols = non_zero_indices

Line range hint 41-71: Replace assertions with proper error handling.

Assertions are used to check data integrity, but they should be replaced with proper error handling to ensure robustness in production environments.

-  assert static_df.select(pl.col("patient_id")).collect().to_series().is_sorted()
-  assert (
-      static_df.select(pl.len()).collect().item()
-      == static_df.select(pl.col("patient_id").n_unique()).collect().item()
-  )
+  if not static_df.select(pl.col("patient_id")).collect().to_series().is_sorted():
+      raise ValueError("The static_df is not sorted by patient_id.")
+  if static_df.select(pl.len()).collect().item() != static_df.select(pl.col("patient_id").n_unique()).collect().item():
+      raise ValueError("The patient_id column in static_df is not unique.")

Ensure proper error handling for get_flat_static_rep usage

The function get_flat_static_rep is used in src/MEDS_tabular_automl/scripts/tabularize_static.py, but there is no error handling for the potential ValueError. Please add appropriate error handling to manage this exception.

  • File: src/MEDS_tabular_automl/scripts/tabularize_static.py
  • Lines: Around the call to get_flat_static_rep
Analysis chain

Line range hint 158-194: Validation step enhances robustness.

The addition of a validation step to check for empty static features is a good practice to prevent downstream errors.

Ensure that all calls to get_flat_static_rep handle the potential ValueError appropriately.

Run the following script to verify the function usage:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `get_flat_static_rep` handle the potential ValueError.

# Test: Search for the function usage. Expect: Proper error handling around the calls.
rg --type python -A 5 $'get_flat_static_rep'

Length of output: 74


Script:

#!/bin/bash
# Description: Verify all function calls to `get_flat_static_rep` handle the potential ValueError.

# Test: Search for the function usage. Expect: Proper error handling around the calls.
rg --type py -A 5 $'get_flat_static_rep'

Length of output: 2031
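A hedged sketch of what that handling might look like at a call site (the argument names are assumptions; the real signature may differ):

from loguru import logger

try:
    static_rep = get_flat_static_rep(agg, feature_columns, shard_df)
except ValueError as e:
    # surface the empty-static-features case instead of crashing the pipeline
    logger.error(f"Static representation failed for shard: {e}")
    raise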

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d07f6a2 and 8c54317.

Files selected for processing (11)
  • src/MEDS_tabular_automl/base_model.py (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_model.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model/xgboost.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/tabularization/default.yaml (2 hunks)
  • src/MEDS_tabular_automl/generate_static_features.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_model.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_sklearnmodel.py (1 hunks)
  • src/MEDS_tabular_automl/sklearn_model.py (1 hunks)
  • tests/test_configs.py (1 hunks)
  • tests/test_tabularize.py (5 hunks)
Files skipped from review due to trivial changes (1)
  • src/MEDS_tabular_automl/configs/launch_model.yaml
Files skipped from review as they are similar to previous changes (4)
  • src/MEDS_tabular_automl/base_model.py
  • src/MEDS_tabular_automl/scripts/launch_sklearnmodel.py
  • src/MEDS_tabular_automl/sklearn_model.py
  • tests/test_tabularize.py
Additional comments not posted (7)
src/MEDS_tabular_automl/configs/tabularization/default.yaml (2)

4-5: Clarify the rationale for replacing min_code_inclusion_frequency with min_code_inclusion_count.

The change from min_code_inclusion_frequency to min_code_inclusion_count suggests a shift in how inclusion criteria are defined. Ensure that this change aligns with the intended logic and update any documentation or dependent logic accordingly.


24-24: Ensure _resolved_codes logic is updated to reflect configuration changes.

The _resolved_codes entry now includes min_code_inclusion_count and max_included_codes. Verify that the logic for resolving codes correctly handles these parameters.

src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml (1)

23-30: Verify hyperparameter sweep ranges.

Ensure that the specified ranges for hyperparameter sweeps are appropriate for the model and dataset. Adjust if necessary to optimize performance.

src/MEDS_tabular_automl/configs/model/xgboost.yaml (1)

27-38: Verify hyperparameter sweep ranges.

Ensure that the specified ranges for hyperparameter sweeps are appropriate for the model and dataset. Adjust if necessary to optimize performance.

tests/test_configs.py (1)

37-65: Ensure test coverage for all configurations.

Verify that all model configurations are adequately tested. Add tests for any missing configurations to ensure comprehensive coverage.

src/MEDS_tabular_automl/generate_static_features.py (2)

Line range hint 74-155: LGTM! The function is well-structured.

The summarize_static_measurements function correctly handles different aggregation types and raises an appropriate error for invalid inputs.


Line range hint 1-12: LGTM! The file-level docstring and imports are appropriate.

The docstring provides a clear overview of the module's purpose and functionality, and the imports are relevant and necessary.

Resolved (outdated) review comments on: src/MEDS_tabular_automl/configs/model/xgboost.yaml; src/MEDS_tabular_automl/scripts/launch_model.py
…none, mean, median, mode), and three normalization methods (none, standard_scaler, and min_max_scaler)
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8c54317 and e6cf085.

Files selected for processing (12)
  • src/MEDS_tabular_automl/configs/imputer/default.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/imputer/mean_imputer.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/imputer/median_imputer.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/imputer/mode_imputer.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_model.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model/xgboost.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/normalization/default.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/normalization/min_max_scaler.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/normalization/standard_scaler.yaml (1 hunks)
  • src/MEDS_tabular_automl/tabular_dataset.py (1 hunks)
  • tests/test_configs.py (1 hunks)
Files skipped from review due to trivial changes (7)
  • src/MEDS_tabular_automl/configs/imputer/default.yaml
  • src/MEDS_tabular_automl/configs/imputer/mean_imputer.yaml
  • src/MEDS_tabular_automl/configs/imputer/mode_imputer.yaml
  • src/MEDS_tabular_automl/configs/launch_model.yaml
  • src/MEDS_tabular_automl/configs/normalization/default.yaml
  • src/MEDS_tabular_automl/configs/normalization/min_max_scaler.yaml
  • src/MEDS_tabular_automl/configs/normalization/standard_scaler.yaml
Files skipped from review as they are similar to previous changes (4)
  • src/MEDS_tabular_automl/configs/model/sgd_classifier.yaml
  • src/MEDS_tabular_automl/configs/model/xgboost.yaml
  • src/MEDS_tabular_automl/tabular_dataset.py
  • tests/test_configs.py
Additional comments not posted (1)
src/MEDS_tabular_automl/configs/imputer/median_imputer.yaml (1)

1-3: Configuration looks good!

The YAML configuration for the median imputer is correctly set up using sklearn.impute.SimpleImputer with the median strategy. This is a common approach for handling missing values in datasets.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 5

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0db7bd6 and a6d9103.

Files selected for processing (12)
  • .github/workflows/code-quality-main.yaml (1 hunks)
  • .github/workflows/code-quality-pr.yaml (1 hunks)
  • .github/workflows/publish-to-pypi.yml (3 hunks)
  • .github/workflows/tests.yaml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (8 hunks)
  • src/MEDS_tabular_automl/evaluation_callback.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py (1 hunks)
  • src/MEDS_tabular_automl/sklearn_model.py (1 hunks)
  • src/MEDS_tabular_automl/tabular_dataset.py (1 hunks)
  • src/MEDS_tabular_automl/xgboost_model.py (1 hunks)
  • tests/test_tabularize.py (6 hunks)
Files skipped from review as they are similar to previous changes (5)
  • .github/workflows/publish-to-pypi.yml
  • .github/workflows/tests.yaml
  • src/MEDS_tabular_automl/evaluation_callback.py
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py
  • src/MEDS_tabular_automl/tabular_dataset.py
Additional context used
Ruff
tests/test_tabularize.py

317-317: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

LanguageTool
README.md

[uncategorized] ~108-~108: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~108-~108: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique subject_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)


[uncategorized] ~122-~122: Loose punctuation mark.
Context: ...3. meds-tab-tabularize-time-series: Iterates through combinations of a shar...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~137-~137: Loose punctuation mark.
Context: ...ax] ``` 4. meds-tab-cache-task: Aligns task-specific labels with the ne...

(UNLIKELY_OPENING_PUNCTUATION)

Additional comments not posted (11)
.github/workflows/code-quality-main.yaml (2)

16-16: Approve the upgrade to actions/checkout@v4.

Ensure to review the release notes for actions/checkout@v4 to confirm compatibility and understand any new features or breaking changes.


19-21: Approve the upgrade to actions/setup-python@v5 and the specification of Python version 3.11.

This change ensures that the workflow uses a consistent and possibly more stable Python environment. It's important to verify that all dependencies are compatible with Python 3.11.

.github/workflows/code-quality-pr.yaml (2)

19-19: Approve the upgrade to actions/checkout@v4.

Ensure to review the release notes for actions/checkout@v4 to confirm compatibility and understand any new features or breaking changes.


22-24: Approve the upgrade to actions/setup-python@v5 and the specification of Python version 3.11.

This change ensures that the workflow uses a consistent and possibly more stable Python environment. It's important to verify that all dependencies are compatible with Python 3.11.

.pre-commit-config.yaml (2)

2-2: Approve the change to the default Python version to 3.11.

This change may address compatibility issues. Ensure that all hooks are tested to confirm they function correctly with Python 3.11.


2-2: Approve the removal of the exclusion pattern.

This change will increase the scope of pre-commit checks, potentially improving code quality. It's important to verify that the newly included files meet the standards expected by the pre-commit hooks.

src/MEDS_tabular_automl/sklearn_model.py (1)

14-32: Well-implemented data handling class for SciKit-Learn models.

The SklearnMatrix class is appropriately designed to handle data matrices, using types compatible with SciKit-Learn. The methods are straightforward and effectively encapsulate the functionality needed for data retrieval.

README.md (1)

108-108: Correct punctuation and grammar.

The existing comment from the bot about the punctuation and grammatical error is still valid. The use of "thus" as a conjunction and the unnecessary colon should be corrected to improve readability and grammatical correctness.

The existing comment from the bot is still applicable. Please refer to the suggested changes to correct the punctuation and grammar issues.

Tools
LanguageTool

[uncategorized] ~108-~108: Loose punctuation mark.
Context: ...nt. 2. meds-tab-tabularize-static: Filters and processes the dataset based...

(UNLIKELY_OPENING_PUNCTUATION)


[typographical] ~108-~108: The word “thus” is an adverb that can’t be used like a conjunction, and therefore needs to be separated from the sentence.
Context: ...o a unique subject_id and timestamp combination, thus rows are duplicated across multiple tim...

(THUS_SENTENCE)

src/MEDS_tabular_automl/xgboost_model.py (2)

14-73: Well-implemented XGBIterator class.

The XGBIterator class is well-documented and correctly implements the functionality required for iterating through data shards in XGBoost models. The inheritance from xgb.DataIter and TabularDataset is appropriately used to leverage functionalities from both bases.


75-190: Well-implemented XGBoostModel class.

The XGBoostModel class is well-documented and correctly implements the functionality required for configuring, training, and evaluating an XGBoost model. The methods are well-organized, and the separation of concerns is evident, enhancing both readability and maintainability.

tests/test_tabularize.py (1)

147-421: Comprehensive and well-documented test_tabularize function.

The test_tabularize function is comprehensive and covers various aspects of the tabularization process in the MEDS_tabular_automl project. The use of hydra for configuration management and polars for data handling is appropriately implemented. The function is well-documented, enhancing both readability and maintainability.

Tools
Ruff

317-317: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (3)
src/MEDS_tabular_automl/scripts/launch_model.py (2)

13-15: Consider making the configuration path more flexible.

The configuration file path is hardcoded, which might limit flexibility. Consider allowing this path to be specified via command-line arguments or environment variables.

- config_yaml = files("MEDS_tabular_automl").joinpath("configs/launch_model.yaml")
+ config_yaml = Path(os.getenv("CONFIG_PATH", "src/MEDS_tabular_automl/configs/launch_model.yaml"))

46-51: Consider improving the model file naming scheme.

The current method of using timestamps might lead to issues with reproducibility or collisions in filenames. Consider using a more robust naming scheme that includes a unique identifier or a version number.

tests/test_tabularize.py (1)

Line range hint 5-32: Refactor imports for clarity and maintainability.

Consider grouping standard library imports together and separating them from third-party imports for better readability and maintainability. This is a common best practice that helps in quickly identifying dependencies of the module.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a6d9103 and 23eb4d4.

Files selected for processing (2)
  • src/MEDS_tabular_automl/scripts/launch_model.py (1 hunks)
  • tests/test_tabularize.py (6 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/scripts/launch_model.py

36-36: Found useless expression. Either assign it to a variable or remove it.

(B018)

tests/test_tabularize.py

317-317: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


346-346: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)
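
Per the later review rounds, the flagged condition is 0.0 == launch_model.main(cfg); assuming it sits inside an assert, the fix is simply to swap the operands:

- assert 0.0 == launch_model.main(cfg)
+ assert launch_model.main(cfg) == 0.0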

Additional comments not posted (2)
src/MEDS_tabular_automl/scripts/launch_model.py (1)

28-30: Resolve the TODO regarding tabularization.

The TODO comment indicates that tabularization should be handled in the YAML configuration. Address this to improve code maintainability.

@mmcdermott, I've created a GitHub issue to track the task of moving tabularization to the YAML configuration. You can find the issue here: Move tabularization to YAML configuration. You can link this issue in the code to keep track of the task.

tests/test_tabularize.py (1)

147-277: Refactor suggestion: Improve readability and maintainability of test_tabularize.

The test_tabularize function is comprehensive but lengthy and complex. Consider refactoring it into smaller, more manageable functions or methods to improve readability and maintainability. This could involve separating setup, execution, and validation phases into distinct methods or functions.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
src/MEDS_tabular_automl/scripts/launch_model.py (1)

46-56: Improve the model file naming scheme for better reproducibility.

The current method of using timestamps might lead to issues with reproducibility or collisions in filenames. Consider using a more robust naming scheme that includes a unique identifier or a version number.

Here's a suggestion to improve the model file naming scheme:

import uuid

model_filename = f"{path_cfg.model_file_stem}_{auc:.4f}_{uuid.uuid4().hex}{path_cfg.model_file_extension}"

This will generate a unique identifier for each model file, avoiding filename collisions and making individual runs easier to trace.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 23eb4d4 and d390658.

Files selected for processing (1)
  • src/MEDS_tabular_automl/scripts/launch_model.py (1 hunks)
Additional context used
Ruff
src/MEDS_tabular_automl/scripts/launch_model.py

36-36: Found useless expression. Either assign it to a variable or remove it.

(B018)

Additional comments not posted (1)
src/MEDS_tabular_automl/scripts/launch_model.py (1)

19-57: LGTM!

The code changes in the main function are approved. The function handles the model training and evaluation flow correctly.

Tools
Ruff

36-36: Found useless expression. Either assign it to a variable or remove it.

(B018)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d390658 and a564886.

Files selected for processing (2)
  • src/MEDS_tabular_automl/evaluation_callback.py (1 hunks)
  • tests/test_integration.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_tabular_automl/evaluation_callback.py
Additional context used
Ruff
tests/test_integration.py

199-199: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


251-251: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (4)
tests/test_integration.py (4)

45-45: LGTM!

The change to the test_integration function signature, taking a tmp_path argument, is approved. This is a positive change that enhances the test's efficiency and clarity by directly leveraging the temporary path for directory creation.


47-50: LGTM!

The changes to the directory setup logic, replacing tempfile.TemporaryDirectory() with direct assignments using tmp_path, are approved. This change simplifies the setup process for the test environment and is a logical continuation of the transition to using tmp_path.


52-302: LGTM!

The remaining changes to the test function, including adjustments to the test logic and assertions, are approved. The changes maintain the core functionality and validation logic of the test while enhancing clarity and efficiency.

Tools
Ruff

199-199: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


251-251: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


81-81: Verify the impact of the key change.

The key has been changed from "patient_id" to "subject_id" in the final data selection. Ensure that this change is consistent with the rest of the codebase and doesn't introduce any issues.

Run the following script to verify the usage of "subject_id" across the codebase:

Verification successful

The change from "patient_id" to "subject_id" is consistent across the codebase.

The key "subject_id" is used consistently in both test and source files, indicating a well-integrated change. There is no evidence of "patient_id" in the search results, suggesting a complete transition. This change should not introduce any issues related to inconsistent key usage.

  • Files with "subject_id" usage:
    • tests/test_integration.py
    • tests/test_tabularize.py
    • src/MEDS_tabular_automl/utils.py
    • src/MEDS_tabular_automl/generate_static_features.py
    • src/MEDS_tabular_automl/describe_codes.py
    • src/MEDS_tabular_automl/generate_summarized_reps.py
    • src/MEDS_tabular_automl/generate_ts_features.py
    • src/MEDS_tabular_automl/scripts/cache_task.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `"subject_id"` across the codebase.

# Test: Search for the usage of `"subject_id"`. Expect: Consistent usage across the codebase.
rg --type python $'"subject_id"'

Length of output: 64


Script:

#!/bin/bash
# Description: Verify the usage of `"subject_id"` across Python files in the codebase.

# Test: Search for the usage of `"subject_id"` in Python files.
rg '"subject_id"' --glob '*.py'

Length of output: 3589

f"Time-Series Data matrix Should have {expected_num_rows}" f" rows but has {ts_matrix.shape[0]}!"
)
output_files = list_subdir_files(str(Path(cfg.output_tabularized_dir).resolve()), "npz")
for split in split_json.keys():
Contributor

Apply Ruff suggestions to simplify dictionary key checks.

Ruff suggests removing .keys() from dictionary key checks at lines 199 and 251. This can be applied to simplify the code without changing the functionality.

Apply this diff to fix the issues:

-for split in split_json.keys():
+for split in split_json:
-for split in split_json.keys():
+for split in split_json:

Also applies to: 251-251

Tools
Ruff

199-199: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (4)
tests/test_integration.py (2)

199-199: Remove redundant .keys() in dictionary access.

The use of .keys() when checking for membership in a dictionary is unnecessary and can be omitted for cleaner and more efficient code.

Apply this diff to remove .keys():

- for split in split_json.keys():
+ for split in split_json:
Tools
Ruff

199-199: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


251-251: Remove redundant .keys() in dictionary access.

The use of .keys() when checking for membership in a dictionary is unnecessary and can be omitted for cleaner and more efficient code.

Apply this diff to remove .keys():

- for split in split_json.keys():
+ for split in split_json:
Tools
Ruff

251-251: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

tests/test_tabularize.py (2)

278-278: Remove redundant .keys() in dictionary access.

The use of .keys() when checking for membership in a dictionary is unnecessary and can be omitted for cleaner and more efficient code.

Apply this diff to remove .keys():

- for split in split_json.keys():
+ for split in split_json:

317-317: Remove redundant .keys() in dictionary access.

The use of .keys() when checking for membership in a dictionary is unnecessary and can be omitted for cleaner and more efficient code.

Apply this diff to remove .keys():

- for split in split_json.keys():
+ for split in split_json:
Tools
Ruff

317-317: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a564886 and 430afba.

Files selected for processing (4)
  • src/MEDS_tabular_automl/evaluation_callback.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (7 hunks)
  • tests/test_integration.py (3 hunks)
  • tests/test_tabularize.py (6 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_tabular_automl/evaluation_callback.py
Additional context used
Ruff
tests/test_integration.py

199-199: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


251-251: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

tests/test_tabularize.py

317-317: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


346-346: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)

Additional comments not posted (9)
tests/test_integration.py (1)

45-308: Approve updates to test_integration with suggestions for additional testing and refactoring.

The updates to the test_integration function reflect improvements in test setup and configuration handling:

  1. Use of tmp_path: The adoption of tmp_path for directory management is a best practice that enhances the clarity and reliability of the test environment.
  2. Schema and Configuration Changes: The updates to accommodate changes in data schema (e.g., patient_id to subject_id) and configuration parameters are well-handled. Ensure that these changes are consistently applied across all tests.
  3. Additional Test Cases: Consider adding more test cases to cover edge cases and error conditions, especially for new configurations and data schema changes.
  4. Refactor for Readability: The function is comprehensive but lengthy and complex. Consider refactoring it into smaller, more manageable functions or methods to improve readability and maintainability. This could involve separating setup, execution, and validation phases into distinct methods or functions.
Tools
Ruff

199-199: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


251-251: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

src/MEDS_tabular_automl/utils.py (5)

Line range hint 44-113: Approve updates to filter_to_codes function.

The updates to the filter_to_codes function enhance its functionality and provide more flexibility in filtering codes based on various criteria:

  1. Additional Parameters: The function now accepts min_code_inclusion_count, min_code_inclusion_frequency, and max_include_codes as optional parameters, allowing for more granular control over the filtering process.
  2. Frequency-based Filtering: The logic for handling min_code_inclusion_frequency has been implemented correctly, ensuring that codes are filtered based on their frequency relative to the dataset size.
  3. Count-based Filtering: The min_code_inclusion_count parameter is used to filter codes based on their absolute count, providing another layer of filtering.
  4. Top Codes Selection: The max_include_codes parameter allows selecting the top N most frequent codes, which can be useful for focusing on the most prevalent codes in the dataset.

The implementation is well-structured, with appropriate error handling and informative error messages. The docstring and examples have been updated to reflect the changes in functionality.
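
A hypothetical call combining the three filters (the keyword names come from this review; the positional argument and exact signature are assumptions):

allowed_codes = filter_to_codes(
    code_metadata_fp,                   # assumed: path to the code metadata
    min_code_inclusion_count=10,        # keep codes observed at least 10 times
    min_code_inclusion_frequency=0.01,  # keep codes present in >=1% of rows
    max_include_codes=256,              # then keep the 256 most frequent codes
)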


340-355: Approve updates to get_unique_time_events_df docstring.

The docstring for the get_unique_time_events_df function has been updated to reflect the change in terminology from patient_id to subject_id. This change aligns the function's documentation with the updated data model and ensures consistency throughout the codebase.


421-446: Approve addition of log_to_logfile utility function.

The new log_to_logfile function is a useful utility for logging model hyperparameters and performance to log files. It takes the model, configuration, and output file path as input and writes the relevant information to separate log files for configuration and performance.

The function is well-structured and follows good logging practices. It creates the necessary directories if they don't exist and writes the configuration as YAML and the performance metrics in a CSV format. The use of descriptive file names and the inclusion of both configuration and performance logs enhance the traceability and reproducibility of the model training process.


448-462: Approve addition of current_script_name utility function.

The new current_script_name function is a handy utility that returns the name of the module that called it. It attempts to retrieve the name of the main function's module, falling back to the script name if the main function is not found.

The implementation is robust, handling different scenarios such as when the main function is defined in the __main__ module or imported from another module. The use of a warning log statement when falling back to the script name provides visibility into potential edge cases.

This utility can be useful for logging, error reporting, or dynamically determining the context in which the code is being executed.
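
A rough sketch consistent with the described behavior (not the actual implementation):

import inspect
import sys
from pathlib import Path

from loguru import logger


def current_script_name() -> str:
    """Sketch: returns the name of the module whose main() is on the call
    stack, falling back to the invoked script's file name."""
    for frame_info in inspect.stack():
        if frame_info.function == "main":
            module = inspect.getmodule(frame_info.frame)
            if module is not None:
                return module.__name__.split(".")[-1]
    logger.warning("No main function found on the stack; using the script name.")
    return Path(sys.argv[0]).stem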


464-494: Approve addition of stage_init utility function.

The new stage_init function is a convenient utility for initializing a stage by logging the configuration and stage-specific paths. It takes the global configuration object and a list of keys as input and logs the configuration as YAML along with the status and paths of the specified keys.

The function is well-designed and provides useful functionality for stage initialization:

  1. Configuration Logging: It logs the entire configuration as YAML, which is valuable for debugging and reproducibility.
  2. Path Checking: It checks the existence of paths specified by the provided keys and logs their status and resolved paths. The use of checkboxes to indicate the existence of paths enhances readability.
  3. Flexibility: The function allows specifying the keys of interest, making it adaptable to different stages and configurations.

The log statements are well-formatted and informative, making it easy to understand the stage setup and identify potential issues with missing paths.

Overall, this utility function promotes good practices for stage initialization, configuration management, and logging.

tests/test_tabularize.py (3)

147-412: Approve updates to test_tabularize with suggestions for additional testing and refactoring.

The updates to the test_tabularize function reflect improvements in test setup and configuration handling:

  1. Use of tmp_path: The adoption of tmp_path for directory management is a best practice that enhances the clarity and reliability of the test environment.
  2. Schema and Configuration Changes: The updates to accommodate changes in data schema (e.g., patient_id to subject_id) and configuration parameters are well-handled. Ensure that these changes are consistently applied across all tests.
  3. Additional Test Cases: Consider adding more test cases to cover edge cases and error conditions, especially for new configurations and data schema changes.
  4. Refactor for Readability: The function is comprehensive but lengthy and complex. Consider refactoring it into smaller, more manageable functions or methods to improve readability and maintainability. This could involve separating setup, execution, and validation phases into distinct methods or functions.
Tools
Ruff

317-317: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


346-346: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)


19-19: Approve renaming of launch_xgboost to launch_model.

The renaming of the launch_xgboost function to launch_model reflects a broader approach to model launching that supports multiple model types. This change improves the flexibility and extensibility of the code, allowing for the integration of different models beyond just XGBoost.

Ensure that the corresponding changes in configuration handling and function calls are consistently applied throughout the codebase to match the new naming convention.


414-435: Approve conditional import and testing of autogluon.

The conditional import and testing of the autogluon library demonstrate the flexibility of the code to support additional libraries when available. This approach allows for the integration of AutoGluon models without introducing a hard dependency on the library.

The test configuration and launching of the AutoGluon model are well-structured and follow a similar pattern to the other model types. The loading and validation of the trained model ensure that the end-to-end functionality works as expected.

Consider adding more specific assertions to validate the performance and predictions of the AutoGluon model, similar to the other model types, to ensure comprehensive testing coverage.
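
The approved pattern is roughly the following sketch (the flag name and test wiring are illustrative, not the test's actual code):

try:
    import autogluon  # noqa: F401

    AUTOGLUON_AVAILABLE = True
except ImportError:
    AUTOGLUON_AVAILABLE = False

if AUTOGLUON_AVAILABLE:
    # Compose the AutoGluon launcher config, train, then reload the
    # saved predictor and validate it like the other model types.
    ...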

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
MIMICIV_TUTORIAL/README.MD (1)

32-32: Remove trailing punctuation from the heading.

The heading at line 32 ends with a colon. Remove the trailing punctuation to adhere to the markdown formatting best practices.

Apply this diff to remove the trailing punctuation:

-### Download pre-extracted labels from gcp:
+### Download pre-extracted labels from gcp
Tools
Markdownlint

32-32: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 430afba and 6a89a9f.

Files selected for processing (4)
  • MIMICIV_TUTORIAL/README.MD (1 hunks)
  • MIMICIV_TUTORIAL/tabularize_meds.sh (1 hunks)
  • src/MEDS_tabular_automl/configs/default.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/launch_model.yaml (1 hunks)
Files skipped from review due to trivial changes (1)
  • src/MEDS_tabular_automl/configs/default.yaml
Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_tabular_automl/configs/launch_model.yaml
Additional context used
Markdownlint
MIMICIV_TUTORIAL/README.MD

32-32: Punctuation: ':'
Trailing punctuation in heading

(MD026, no-trailing-punctuation)

Additional comments not posted (8)
MIMICIV_TUTORIAL/tabularize_meds.sh (8)

6-19: LGTM!

The help function is well-structured and provides clear usage instructions. It lists all the required arguments and their descriptions.


22-56: LGTM!

The argument handling logic is robust and handles the help flag and the minimum number of required arguments. It assigns the arguments to variables with clear names and prints them for debugging.


60-71: LGTM!

The data resharding logic is clear and uses the appropriate command for resharding. It passes the required arguments to the command, including the input and output directories. The number of subjects per shard and the polling time are hardcoded, but they can be overridden by passing them as additional arguments to the script.


73-76: LGTM!

The code description logic is clear and uses the appropriate command for describing the codes. It passes the required arguments to the command, including the input and output directories.


78-81: LGTM!

The static data tabularization logic is clear and uses the appropriate command for tabularizing the static data. It passes the required arguments to the command, including the input and output directories. Setting the do_overwrite flag to False is a good practice to prevent accidental overwriting of existing files. Passing any additional arguments to the command using "$@" allows for flexibility and customization.


83-88: LGTM!

The time series data tabularization logic is clear and uses the appropriate command for tabularizing the time series data. Using the --multirun flag and the hydra/launcher=joblib flag allows for parallel execution, which can speed up the process. It passes the required arguments to the command, including the input and output directories. Setting the do_overwrite flag to False is a good practice to prevent accidental overwriting of existing files. Passing any additional arguments to the command using "$@" allows for flexibility and customization.


90-97: LGTM!

The task-specific caching logic is clear and uses the appropriate command for caching the task-specific data. Looping over the tasks allows for caching multiple tasks in a single run. It passes the required arguments to the command, including the input and output directories, the task-specific label directory, and the task name. Setting the do_overwrite flag to False is a good practice to prevent accidental overwriting of existing files. Passing any additional arguments to the command using "$@" allows for flexibility and customization.


98-105: LGTM!

The XGBoost training logic is clear and uses the appropriate command for training XGBoost models. Looping over the tasks allows for training models for multiple tasks in a single run. Using the --multirun flag allows for parallel execution, which can speed up the training process. It passes the required arguments to the command, including the input and output directories, the output model directory, and the task name. Setting the do_overwrite flag to False is a good practice to prevent accidental overwriting of existing files. Passing the hydra.sweeper.n_trials and hydra.sweeper.n_jobs arguments allows for controlling the hyperparameter search. Passing any additional arguments to the command using "$@" allows for flexibility and customization.
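
Taken together, each task's training call has roughly this shape (a sketch: the command name and argument keys are assumptions based on the description above, not copied from the script):

for TASK in "${TASKS[@]}"; do
    meds-tab-xgboost --multirun \
        "output_model_dir=${MODEL_DIR}/${TASK}" \
        "task_name=${TASK}" \
        do_overwrite=False \
        hydra.sweeper.n_trials=100 \
        hydra.sweeper.n_jobs=8 \
        "$@"
done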

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (3)
tests/test_integration.py (2)

46-342: Approve updates to test_integration with suggestions for additional testing and refactoring.

The updates to the test_integration function reflect improvements in test setup and configuration handling:

  1. Use of tmp_path: The adoption of tmp_path for directory management is a best practice that enhances the clarity and reliability of the test environment.
  2. Schema and Configuration Changes: The updates to accommodate changes in data schema (e.g., patient_id to subject_id) and configuration parameters are well-handled. Ensure that these changes are consistently applied across all tests.

Suggestions for Improvement:

  1. Additional Test Cases: Consider adding more test cases to cover edge cases and error conditions, especially for new configurations and data schema changes.
  2. Refactor for Readability: The function is comprehensive but lengthy and complex. Consider refactoring it into smaller, more manageable functions or methods to improve readability and maintainability. This could involve separating setup, execution, and validation phases into distinct methods or functions.
Tools
Ruff

200-200: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


252-252: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


200-200: Apply Ruff suggestions to simplify dictionary key checks.

Ruff suggests removing .keys() from dictionary key checks at lines 200 and 252. This can be applied to simplify the code without changing the functionality.

Apply this diff to fix the issues:

-for split in split_json.keys():
+for split in split_json:

Also applies to: 252-252

Tools
Ruff

200-200: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

tests/test_tabularize.py (1)

Line range hint 1-456: Offer assistance in addressing previous review comments.

The previous review comments suggest several improvements to the codebase:

  1. Refactor imports for clarity and maintainability: Grouping standard library imports together and separating them from third-party imports can improve readability and maintainability.
  2. Refactor configuration loading for better error handling: Adding error handling to the configuration loading process using hydra can ensure robustness and graceful handling of missing or incorrect configurations.
  3. Improve file existence checks and assertions: Adding more descriptive error messages to file existence checks and assertions can help in debugging when tests fail.
  4. Refactor repeated code for configuration composition: Refactoring the repeated pattern of initializing hydra, composing configurations, and overriding settings into a helper function can reduce code duplication and improve maintainability.
  5. Address the useless expression issue: The useless expression at line 299 should be either removed or replaced with a meaningful operation if it's intended to have side effects.

Do you want me to assist in addressing these improvements? I can provide guidance or propose changes to refactor the codebase.

Tools
Ruff

318-318: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


347-347: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 6a89a9f and 9e6d99a.

Files selected for processing (3)
  • src/MEDS_tabular_automl/xgboost_model.py (1 hunks)
  • tests/test_integration.py (3 hunks)
  • tests/test_tabularize.py (6 hunks)
Additional context used
Ruff
tests/test_integration.py

200-200: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


252-252: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

tests/test_tabularize.py

318-318: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


347-347: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)

Additional comments not posted (5)
src/MEDS_tabular_automl/xgboost_model.py (2)

14-72: LGTM!

The XGBIterator class is well-designed and implemented. It provides a clear interface for loading and processing data shards for XGBoost models. The class is properly documented with a clear purpose and usage instructions. The implementation follows best practices and looks correct.


75-190: LGTM!

The XGBoostModel class is well-designed and implemented. It provides a clear interface for configuring, training, and evaluating an XGBoost model. The class is properly documented with a clear purpose and usage instructions. The implementation follows best practices and looks correct.

The class provides flexibility in data handling by supporting both in-memory data handling and direct streaming from disk using XGBIterators. This allows for efficient training and evaluation of XGBoost models.

tests/test_tabularize.py (3)

148-333: Approve updates to test_tabularize with suggestions for additional testing and refactoring.

The updates to the test_tabularize function reflect improvements in test setup and configuration handling:

  1. Use of tmp_path: The adoption of tmp_path for directory management is a best practice that enhances the clarity and reliability of the test environment.
  2. Schema and Configuration Changes: The updates to accommodate changes in data schema (e.g., patient_id to subject_id) and configuration parameters are well-handled. Ensure that these changes are consistently applied across all tests.

Suggestions for Improvement:

  1. Additional Test Cases: Consider adding more test cases to cover edge cases and error conditions, especially for new configurations and data schema changes.
  2. Refactor for Readability: The function is comprehensive but lengthy and complex. Consider refactoring it into smaller, more manageable functions or methods to improve readability and maintainability. This could involve separating setup, execution, and validation phases into distinct methods or functions.
Tools
Ruff

318-318: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


20-20: Approve the renaming of launch_xgboost to launch_model.

The renaming of launch_xgboost to launch_model is a good change that makes the function more generic and reusable for launching different types of models. It improves the modularity and maintainability of the codebase by accommodating various model types.


318-318: Remove redundant .keys() in dictionary access.

The use of .keys() when checking for membership in a dictionary is unnecessary and can be omitted for cleaner and more efficient code.

Apply this diff to remove .keys():

-if key in dict.keys():
+if key in dict:

Also applies to: 324-324

Tools
Ruff

318-318: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (6)
src/MEDS_tabular_automl/scripts/launch_model.py (1)

46-70: Consider refactoring the model saving process to improve readability and maintainability.

The model saving process can be refactored to improve readability and maintainability. Consider extracting the code for creating the output directory, saving the model, and saving the model performance into separate functions.

For example, you can create the following functions:

def create_output_dir(path_cfg: DictConfig, cfg: DictConfig) -> Path:
    model_config_hash = abs(hash(json.dumps(OmegaConf.to_container(cfg), sort_keys=True)))
    trial_output_dir = Path(path_cfg.sweep_results_dir) / str(model_config_hash)
    trial_output_dir.mkdir(parents=True, exist_ok=True)
    return trial_output_dir

def save_model(model_launcher: BaseModel, trial_output_dir: Path, path_cfg: DictConfig):
    model_filename = f"{path_cfg.model_file_stem}{path_cfg.model_file_extension}"
    model_launcher.save_model(trial_output_dir / model_filename)

def save_model_config(cfg: DictConfig, trial_output_dir: Path):
    config_fp = trial_output_dir / f"{cfg.path.config_log_stem}.log"
    with open(config_fp, "w") as f:
        f.write(OmegaConf.to_yaml(cfg))

def save_model_performance(model_launcher: BaseModel, trial_output_dir: Path, cfg: DictConfig):
    model_performance_fp = trial_output_dir / f"{cfg.path.performance_log_stem}.log"
    with open(model_performance_fp, "w") as f:
        f.write("trial_name,tuning_auc,test_auc\n")
        f.write(
            f"{trial_output_dir.stem},{model_launcher.evaluate()},"
            f"{model_launcher.evaluate(split='held_out')}\n"
        )

Then, you can use these functions in the main function as follows:

trial_output_dir = create_output_dir(path_cfg, cfg)
save_model(model_launcher, trial_output_dir, path_cfg)
save_model_config(cfg, trial_output_dir)
save_model_performance(model_launcher, trial_output_dir, cfg)
src/MEDS_tabular_automl/sklearn_model.py (2)

89-110: Enhance the error messages in the _fit_from_partial method.

The error handling in the _fit_from_partial method can be enhanced by providing more descriptive error messages. Consider including what is expected vs. what is received in the error message.

For example, you can update the error message as follows:

raise ValueError(
    f"Data is loaded in shards, but {self.model.__class__.__name__} does not support partial_fit. "
    f"Expected a model with partial_fit method, but received {self.model.__class__.__name__}."
)

81-87: Add more detailed comments to the _build_data and _fit_from_partial methods.

The _build_data and _fit_from_partial methods could benefit from more detailed comments explaining the steps involved, especially when handling different data loading strategies.

Consider adding comments to explain the purpose and logic of each step in these methods. For example:

def _build_data(self):
    """Builds necessary data structures for training."""
    if self.keep_data_in_memory:
        # Build iterators and load all data into memory
        self._build_iterators()
        self._build_matrix_in_memory()
    else:
        # Build iterators for streaming data from disk
        self._build_iterators()

def _fit_from_partial(self):
    """Fits model incrementally until convergence or maximum epochs."""
    if not hasattr(self.model, "partial_fit"):
        raise ValueError(
            f"Data is loaded in shards, but {self.model.__class__.__name__} does not support partial_fit."
        )
    classes = self.itrain.get_classes()
    best_auc = 0
    best_epoch = 0
    for epoch in range(self.cfg.training_params.epochs):
        # Train on each data shard
        for shard_idx in range(len(self.itrain._data_shards)):
            data, labels = self.itrain.get_data_shards(shard_idx)
            self.model.partial_fit(data, labels, classes=classes)
        # Evaluate on tuning set
        auc = self.evaluate()
        # Update best model based on tuning AUC
        if auc > best_auc:
            best_auc = auc
            best_epoch = epoch
        # Early stopping if no improvement for specified rounds
        if epoch - best_epoch > self.cfg.training_params.early_stopping_rounds:
            break

Also applies to: 89-110

tests/test_integration.py (2)

47-347: Refactor the test_integration function for better readability and maintainability.

The test_integration function is quite large and can be broken down into smaller, more manageable functions. This can improve readability and maintainability.

Consider extracting logical sections of the test into separate functions. For example:

def setup_environment(tmp_path):
    # Create directories and store MEDS outputs
    ...

def run_describe_codes(cfg):
    # Run the describe_codes script and validate the output
    ...

def run_static_tabularization(cfg):
    # Run the static data tabularization script and validate the output
    ...

def run_time_series_tabularization(cfg):
    # Run the time series tabularization script and validate the output
    ...

def run_task_specific_caching(cfg, split_json):
    # Run the task_specific_caching script and validate the output
    ...

def run_model_training(cfg, output_model_dir, stdout_agg):
    # Run model training for various models and validate the output
    ...

def test_integration(tmp_path):
    # Setup environment
    setup_environment(tmp_path)
    
    # Run describe_codes
    run_describe_codes(cfg)
    
    # Run static tabularization
    run_static_tabularization(cfg)
    
    # Run time series tabularization
    run_time_series_tabularization(cfg)
    
    # Run task specific caching
    run_task_specific_caching(cfg, split_json)
    
    # Run model training
    run_model_training(cfg, output_model_dir, stdout_agg)

This refactoring makes the test more modular and easier to understand and maintain.

Tools
Ruff

201-201: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


253-253: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


47-347: Enhance the comments to explain each step of the test more clearly.

The comments in the test_integration function can be enhanced to provide more details about each step of the test. This can help future maintainers understand the test's flow and purpose more quickly.

Consider adding more detailed comments explaining the purpose and expected outcome of each section of the test. For example:

def test_integration(tmp_path):
    # Step 1: set up the temporary directories and store the MEDS inputs
    # Step 2: run each pipeline script and validate its expected outputs
    ...

Tools
Ruff

201-201: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


253-253: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

tests/test_tabularize.py (1)

148-478: Approve updates to test_tabularize with suggestions for additional testing and refactoring.

The updates to the test_tabularize function reflect improvements in test setup and configuration handling:

  1. Use of tmp_path: The adoption of tmp_path for directory management is a best practice that enhances the clarity and reliability of the test environment.
  2. Schema and Configuration Changes: The updates to accommodate changes in data schema (e.g., patient_id to subject_id) and configuration parameters are well-handled. Ensure that these changes are consistently applied across all tests.
  3. Additional Test Cases: Consider adding more test cases to cover edge cases and error conditions, especially for new configurations and data schema changes.
  4. Refactor for Readability: The function is comprehensive but lengthy and complex. Consider refactoring it into smaller, more manageable functions or methods to improve readability and maintainability. This could involve separating setup, execution, and validation phases into distinct methods or functions.

Tools
Ruff

347-347: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9e6d99a and 8316365.

Files selected for processing (15)
  • src/MEDS_tabular_automl/configs/launch_model.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/autogluon.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/knn_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/logistic_regression.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/path/default.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/random_forest_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/sgd_classifier.yaml (1 hunks)
  • src/MEDS_tabular_automl/configs/model_launcher/xgboost.yaml (1 hunks)
  • src/MEDS_tabular_automl/evaluation_callback.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py (1 hunks)
  • src/MEDS_tabular_automl/scripts/launch_model.py (1 hunks)
  • src/MEDS_tabular_automl/sklearn_model.py (1 hunks)
  • src/MEDS_tabular_automl/utils.py (7 hunks)
  • tests/test_integration.py (3 hunks)
  • tests/test_tabularize.py (6 hunks)
Files skipped from review due to trivial changes (1)
  • src/MEDS_tabular_automl/configs/model_launcher/logistic_regression.yaml
Files skipped from review as they are similar to previous changes (9)
  • src/MEDS_tabular_automl/configs/launch_model.yaml
  • src/MEDS_tabular_automl/configs/model_launcher/autogluon.yaml
  • src/MEDS_tabular_automl/configs/model_launcher/knn_classifier.yaml
  • src/MEDS_tabular_automl/configs/model_launcher/path/default.yaml
  • src/MEDS_tabular_automl/configs/model_launcher/random_forest_classifier.yaml
  • src/MEDS_tabular_automl/configs/model_launcher/sgd_classifier.yaml
  • src/MEDS_tabular_automl/configs/model_launcher/xgboost.yaml
  • src/MEDS_tabular_automl/evaluation_callback.py
  • src/MEDS_tabular_automl/scripts/launch_autogluon.py
Additional context used
Ruff
src/MEDS_tabular_automl/scripts/launch_model.py

36-36: Found useless expression. Either assign it to a variable or remove it.

(B018)

tests/test_integration.py

201-201: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


253-253: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

tests/test_tabularize.py

347-347: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)

Additional comments not posted (4)
src/MEDS_tabular_automl/sklearn_model.py (1)

14-31: LGTM!

The SklearnMatrix class is implemented correctly. The code changes are approved.

src/MEDS_tabular_automl/utils.py (2)

Line range hint 44-113: LGTM!

The changes to the filter_to_codes function enhance its filtering capabilities by adding new parameters and implementing their corresponding logic. The updated error message provides clear information about the filtering criteria.

The code changes are approved.


340-355: LGTM!

The changes to the get_unique_time_events_df function reflect a shift in terminology from patient_id to subject_id. The docstring and logic have been updated consistently to use subject_id.

The code changes are approved.

tests/test_tabularize.py (1)

347-347: Nitpick: Yoda condition detected.

Ruff suggests rewriting the condition 0.0 == launch_model.main(cfg) as launch_model.main(cfg) == 0.0 to avoid a Yoda condition. However, this is a matter of personal preference and coding style. The current condition is functionally correct.

Tools
Ruff

347-347: Yoda condition detected

Rewrite as launch_model.main(cfg) == 0.0

(SIM300)

Comment on lines +35 to +39
try:
cfg.tabularization._resolved_codes
except ValueError as e:
logger.warning(f"No codes meet loading criteria, trial returning 0 AUC: {str(e)}")
return 0.0
Contributor

Handle the ValueError exception effectively.

The try block is using the cfg.tabularization._resolved_codes attribute without handling the ValueError exception effectively. If the attribute is not found, the function simply returns 0.0 without logging any error message or raising an exception.

Consider logging an error message and raising an exception to handle the ValueError exception effectively. For example:

try:
    cfg.tabularization._resolved_codes
except ValueError as e:
    logger.error(f"No codes meet loading criteria: {str(e)}")
    raise
Tools
Ruff

36-36: Found useless expression. Either assign it to a variable or remove it.

(B018)

Comment on lines +136 to +176
def evaluate(self, split: str = "tuning") -> float:
"""Evaluates the model on the tuning set.

Returns:
The evaluation metric as the ROC AUC score.
"""
# depending on split point to correct data
if split == "tuning":
dsplit = self.dtuning
isplit = self.ituning
elif split == "held_out":
dsplit = self.dheld_out
isplit = self.iheld_out
elif split == "train":
dsplit = self.dtrain
isplit = self.itrain
else:
raise ValueError(f"Split {split} is not valid.")

# check if model has predict_proba method
if not hasattr(self.model, "predict_proba"):
raise ValueError(f"Model {self.model.__class__.__name__} does not have a predict_proba method.")

# two cases: data is in memory or data is streamed
if self.keep_data_in_memory:
y_pred = self.model.predict_proba(dsplit.get_data())[:, 1]
y_true = dsplit.get_label()
else:
y_pred = []
y_true = []
for shard_idx in range(len(isplit._data_shards)):
data, labels = isplit.get_data_shards(shard_idx)
y_pred.extend(self.model.predict_proba(data)[:, 1])
y_true.extend(labels)
y_pred = np.array(y_pred)
y_true = np.array(y_true)

# check if y_pred and y_true are not empty
if len(y_pred) == 0 or len(y_true) == 0:
raise ValueError("Predictions or true labels are empty.")
return roc_auc_score(y_true, y_pred)
Contributor

Simplify the conditional logic in the evaluate method.

The conditional logic in the evaluate method can be simplified by using a dictionary to map the split values to the corresponding data and iterator attributes.

Consider refactoring the code as follows:

def evaluate(self, split: str = "tuning") -> float:
    """Evaluates the model on the specified dataset split."""
    split_map = {
        "tuning": (self.dtuning, self.ituning),
        "held_out": (self.dheld_out, self.iheld_out),
        "train": (self.dtrain, self.itrain),
    }
    if split not in split_map:
        raise ValueError(f"Split {split} is not valid.")
    
    dsplit, isplit = split_map[split]
    
    if not hasattr(self.model, "predict_proba"):
        raise ValueError(f"Model {self.model.__class__.__name__} does not have a predict_proba method.")
    
    if self.keep_data_in_memory:
        y_pred = self.model.predict_proba(dsplit.get_data())[:, 1]
        y_true = dsplit.get_label()
    else:
        y_pred = []
        y_true = []
        for shard_idx in range(len(isplit._data_shards)):
            data, labels = isplit.get_data_shards(shard_idx)
            y_pred.extend(self.model.predict_proba(data)[:, 1])
            y_true.extend(labels)
        y_pred = np.array(y_pred)
        y_true = np.array(y_true)
    
    if len(y_pred) == 0 or len(y_true) == 0:
        raise ValueError("Predictions or true labels are empty.")
    return roc_auc_score(y_true, y_pred)

This refactoring eliminates the need for multiple conditional branches and makes the code more concise and readable.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8316365 and f7e03dd.

Files selected for processing (1)
  • src/MEDS_tabular_automl/evaluation_callback.py (1 hunks)
Additional comments not posted (2)
src/MEDS_tabular_automl/evaluation_callback.py (2)

39-50: LGTM!

The log_performance method is well-implemented and follows good practices such as using a logger instead of printing.


52-90: Add error handling for file deletion.

The delete_below_top_k_models method can be improved by handling exceptions that might occur during file operations, such as permissions issues or files being in use.

Consider adding a try-except block around the shutil.rmtree call to handle such exceptions:

try:
    shutil.rmtree(trial_dir)
except PermissionError as e:
    logger.error(f"Permission denied: {trial_dir}, error: {e}")
except FileNotFoundError as e:
    logger.error(f"File not found: {trial_dir}, error: {e}")

This will make the method more robust and prevent the program from crashing if an exception occurs during file deletion.

Comment on lines +11 to +37
def on_multirun_end(self, config: DictConfig, **kwargs):
"""Find best model based on log files and logger.info its performance and hyperparameters."""
log_fp = Path(config.path.sweep_results_dir)

try:
performance = pl.read_csv(log_fp / f"*/*{config.path.performance_log_stem}.log")
except Exception as e:
raise FileNotFoundError(f"Log files incomplete or not found at {log_fp}") from e

performance = performance.sort("tuning_auc", descending=True, nulls_last=True)
logger.info(f"\nPerformance of the top 10 models:\n{performance.head(10)}")

self.log_performance(performance[0, :])
if hasattr(config, "delete_below_top_k") and config.delete_below_top_k >= 0:
self.delete_below_top_k_models(
performance, config.delete_below_top_k, config.path.sweep_results_dir
)
else:
logger.info(
"All models were saved. To automatically delete models, set delete_below_top_k in config."
)
best_trial_dir = Path(config.path.sweep_results_dir) / performance["trial_name"].cast(pl.String)[0]
output_best_trial_dir = Path(config.path.best_trial_dir)
shutil.copytree(best_trial_dir, output_best_trial_dir)
performance.write_parquet(config.time_output_model_dir / "sweep_results_summary.parquet")

return performance.head(1)
Contributor

Consider breaking down the on_multirun_end method into smaller methods.

The on_multirun_end method is quite long and performs multiple tasks. Consider breaking it down into smaller methods, each performing a single task. This will make the code more readable and maintainable.

For example, you can extract the following tasks into separate methods (a skeleton sketch follows the list):

  • Finding the best model based on log files.
  • Logging the performance and hyperparameters of the best model.
  • Deleting models below the top k models if specified in the config.
  • Copying the best model to the output directory.
  • Writing the sweep results summary to a parquet file.
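
A hedged skeleton of that decomposition (the helper names are illustrative, not from the codebase):

def on_multirun_end(self, config: DictConfig, **kwargs):
    performance = self._load_sorted_performance(config)  # read and sort logs
    self.log_performance(performance[0, :])              # report the best trial
    self._maybe_delete_below_top_k(performance, config)  # optional pruning
    self._copy_best_trial(performance, config)           # persist best model
    self._write_summary(performance, config)             # parquet summary
    return performance.head(1)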

@Oufattole Oufattole merged commit 07001be into main Sep 10, 2024
7 checks passed