Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix split calculation and allow for not embedding #120

Merged
merged 1 commit into from
Sep 29, 2024

Conversation

talmo
Copy link
Contributor

@talmo talmo commented Sep 29, 2024

This pull request fixes the split calculation in the make_training_splits function and adds the option to not embed user-labeled frame images in the saved files.

  • Fix: The function now correctly splits the labels into training, validation, and test sets based on the specified proportions (fixes Error with make_training_splits method #117).
  • Enhancement: Labels.make_training_splits(..., embed=False)
    Previously, the function would always embed the images, which could be slow for large projects. With this change, the embed parameter is introduced, allowing the user to choose whether to embed the images or save the labels with references to the source video files.

Copy link
Contributor

coderabbitai bot commented Sep 29, 2024

Walkthrough

The changes introduced in this pull request involve modifications to the write_labels and make_training_splits functions in the Sleap library. The write_labels function now checks the truthiness of the embed parameter, refining when video embedding occurs. The make_training_splits function has been updated to include an embed parameter, allowing control over whether user-labeled images are embedded in saved files. Additionally, the test suite has been enhanced to validate these functionalities with new assertions and parameterized tests.

Changes

File Change Summary
sleap_io/io/slp.py Updated the conditional check for the embed parameter in the write_labels function to refine when embedding occurs.
sleap_io/model/labels.py Added a new embed parameter to the make_training_splits function, allowing control over embedding images in saved files, with updated logic for file saving.
tests/model/test_labels.py Modified the test_make_training_splits function to remove tmp_path and added assertions for split lengths. Introduced test_make_training_splits_save to validate embedding behavior.

Possibly related PRs

🐇 In the garden where changes bloom,
A new embed brings light to the room.
Labels saved with care and delight,
Splits now shine, oh what a sight!
With tests that dance and verify,
Our code hops high, reaching the sky! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (4)
sleap_io/io/slp.py (2)

Line range hint 1-1119: Consider modularizing the file for improved maintainability

While the changes made are appropriate, this file (sleap_io/io/slp.py) is quite large and contains multiple complex functions for reading and writing different components of SLEAP labels. To improve maintainability and readability, consider splitting this file into smaller, more focused modules.

For example:

  1. slp_reader.py: Contains functions for reading SLEAP labels.
  2. slp_writer.py: Contains functions for writing SLEAP labels.
  3. slp_utils.py: Contains utility functions used by both reader and writer.

This modularization would make the codebase easier to navigate, test, and maintain in the long run.


Line range hint 1-1119: Consider performance optimizations for large datasets

While the current implementation is functional, there are potential areas for performance improvement, especially when dealing with large datasets:

  1. Batch Processing: In functions like write_lfs, consider implementing batch processing for writing large datasets to HDF5 files.
  2. Vectorization: Look for opportunities to replace loops with vectorized operations using NumPy, especially in data processing functions.
  3. Caching: Implement caching mechanisms for frequently accessed data, such as video metadata or skeleton information.
  4. Lazy Loading: Consider implementing lazy loading for large datasets to reduce memory usage.
  5. Parallel Processing: For CPU-bound operations, explore the use of multiprocessing to leverage multiple cores.

Implementing these optimizations could significantly improve the performance of the library, especially for large SLEAP label files.

To identify potential bottlenecks, we can search for loops and list comprehensions:

#!/bin/bash
# Description: Identify potential performance bottlenecks

# Search for for loops
echo "For loops:"
rg --type python 'for\s+\w+\s+in' -A 2

# Search for list comprehensions
echo "\nList comprehensions:"
rg --type python '\[.*for.*in.*\]'
sleap_io/model/labels.py (2)

687-690: Avoid comparing floating-point numbers for exact equality.

Using n_val == 1.0 can be unreliable due to floating-point precision issues.

Consider checking if n_val is greater than or equal to the size of labels_rest instead:

-if isinstance(n_val, float) and n_val == 1.0:
+if n_val >= len(labels_rest):
    labels_val = labels_rest
else:
    labels_val, _ = labels_rest.split(n=n_val, seed=seed)

This approach is more robust and handles cases where n_val may be slightly less than 1.0 due to floating-point representation.


674-690: Consider edge cases when splitting the dataset.

The current logic does not account for cases where the number of labels is less than the desired split sizes, which could lead to empty splits or errors.

Add checks to ensure that the split sizes do not exceed the available number of labels:

total_labels = len(labels)
+if n_train >= total_labels:
+    labels_train = labels
+    labels_val = Labels()
+    labels_test = Labels()
+    return labels_train, labels_val, labels_test

This addition preemptively handles scenarios with insufficient data.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 391df6e and 1385de5.

📒 Files selected for processing (3)
  • sleap_io/io/slp.py (1 hunks)
  • sleap_io/model/labels.py (5 hunks)
  • tests/model/test_labels.py (3 hunks)
🧰 Additional context used
🪛 Ruff
tests/model/test_labels.py

598-598: Redefinition of unused test_make_training_splits_save from line 577

(F811)

🔇 Additional comments (3)
sleap_io/io/slp.py (3)

Line range hint 1-1119: Summary of review findings and recommendations

Overall, the changes made to the write_labels function improve the flexibility of frame embedding in SLEAP label files. However, there are several areas for potential improvement:

  1. Modularization: Consider splitting this large file into smaller, focused modules for better maintainability.
  2. Security: Enhance security measures for file operations and user input processing.
  3. Performance: Implement optimizations for handling large datasets, such as batch processing and vectorization.
  4. Error Handling: Improve error handling and resource management throughout the file.

These recommendations aim to enhance the robustness, performance, and maintainability of the codebase. Consider prioritizing these improvements in future iterations of the project.


Line range hint 1-1119: Enhance security measures for file operations and user input

While the current implementation seems functional, there are areas where security could be improved:

  1. Path Traversal: Ensure that all file paths (e.g., labels_path) are properly sanitized to prevent path traversal attacks.
  2. Input Validation: Implement strict input validation for user-provided data, especially when processing JSON or creating HDF5 datasets.
  3. Error Handling: Improve error handling to avoid exposing sensitive information in case of exceptions.
  4. Resource Management: Ensure proper closing of file handles, especially when using h5py.File().

Consider implementing these security enhancements to make the file operations more robust against potential attacks or misuse.

To check for proper error handling and resource management, we can search for try-except blocks and context managers:

#!/bin/bash
# Description: Check for error handling and resource management

# Search for try-except blocks
echo "Try-Except blocks:"
rg --type python 'try:' -A 5

# Search for context managers (with statements)
echo "\nContext managers:"
rg --type python 'with.*:' -A 2

Line range hint 1115-1119: Improved handling of embedding in write_labels function

The changes to the write_labels function enhance the flexibility of embedding frames in the SLEAP labels file. The new implementation correctly handles the embed parameter, allowing for more granular control over which frames are embedded.

To ensure that this change doesn't introduce any regressions, let's verify the usage of the embed parameter throughout the codebase:

✅ Verification successful

embed Parameter Usage in write_labels Verified Successfully

All instances of the embed parameter in write_labels function calls across the codebase have been correctly implemented and handle embedding as intended.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for usage of the `embed` parameter in `write_labels` function calls

# Search for calls to write_labels with the embed parameter
rg --type python 'write_labels\s*\([^)]*embed\s*=' -A 5

Length of output: 220


Script:

#!/bin/bash
# Description: Check for usage of the `embed` parameter in `write_labels` function calls with corrected file type.

# Search for calls to write_labels with the embed parameter using 'py' type
rg --type py 'write_labels\s*\([^)]*embed\s*=' -A 5

Length of output: 1421

Comment on lines +571 to +575
train, val, test = labels.make_training_splits(n_train=0.4, n_val=0.4, n_test=0.2)
assert len(train) == 2
assert len(val) == 2
assert len(test) == 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Potential rounding issues with split proportions

Using floating-point numbers for n_train=0.4, n_val=0.4, and n_test=0.2 may lead to unexpected split sizes due to floating-point precision errors. Ensure that the sum of these proportions equals the total number of frames in the dataset.

Consider using integer values for n_train, n_val, and n_test or adjusting the split logic to handle floating-point precision.



@pytest.mark.parametrize("embed", [True, False])
def test_make_training_splits_save(slp_real_data, tmp_path, embed):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Redefinition of function test_make_training_splits_save

The function test_make_training_splits_save is redefined at line 598, which duplicates the previous definition and will cause a NameError.

Apply the following diff to rename the second function and avoid the redefinition:

 @pytest.mark.parametrize("embed", [True, False])
-def test_make_training_splits_save(slp_real_data, tmp_path, embed):
+def test_make_training_splits_save_with_embed(slp_real_data, tmp_path, embed):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_make_training_splits_save(slp_real_data, tmp_path, embed):
def test_make_training_splits_save_with_embed(slp_real_data, tmp_path, embed):
🧰 Tools
🪛 Ruff

598-598: Redefinition of unused test_make_training_splits_save from line 577

(F811)

Comment on lines +699 to +706
if embed:
labels_train.save(save_dir / "train.pkg.slp", embed="user")
labels_val.save(save_dir / "val.pkg.slp", embed="user")
labels_test.save(save_dir / "test.pkg.slp", embed="user")
else:
labels_train.save(save_dir / "train.slp", embed=False)
labels_val.save(save_dir / "val.slp", embed=False)
labels_test.save(save_dir / "test.slp", embed=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure labels_test is defined before attempting to save.

If n_test is None, labels_test may not be defined, leading to a NameError when trying to save it. The code should check whether labels_test exists before saving.

Modify the saving logic to include a condition for labels_test:

if save_dir is not None:
    save_dir = Path(save_dir)
    save_dir.mkdir(exist_ok=True, parents=True)

    if embed:
        labels_train.save(save_dir / "train.pkg.slp", embed="user")
        labels_val.save(save_dir / "val.pkg.slp", embed="user")
+       if n_test is not None:
            labels_test.save(save_dir / "test.pkg.slp", embed="user")
    else:
        labels_train.save(save_dir / "train.slp", embed=False)
        labels_val.save(save_dir / "val.slp", embed=False)
+       if n_test is not None:
            labels_test.save(save_dir / "test.slp", embed=False)

This ensures that labels_test.save() is only called when labels_test is defined.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if embed:
labels_train.save(save_dir / "train.pkg.slp", embed="user")
labels_val.save(save_dir / "val.pkg.slp", embed="user")
labels_test.save(save_dir / "test.pkg.slp", embed="user")
else:
labels_train.save(save_dir / "train.slp", embed=False)
labels_val.save(save_dir / "val.slp", embed=False)
labels_test.save(save_dir / "test.slp", embed=False)
if embed:
labels_train.save(save_dir / "train.pkg.slp", embed="user")
labels_val.save(save_dir / "val.pkg.slp", embed="user")
if n_test is not None:
labels_test.save(save_dir / "test.pkg.slp", embed="user")
else:
labels_train.save(save_dir / "train.slp", embed=False)
labels_val.save(save_dir / "val.slp", embed=False)
if n_test is not None:
labels_test.save(save_dir / "test.slp", embed=False)

Comment on lines +677 to +682
# Make test split.
if n_test is not None:
if n_test < 1:
n_test = (n_test * len(labels)) / len(labels_rest)
labels_test, labels_rest = labels_rest.split(n=n_test, seed=seed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Correct the calculation of n_test when it's a fraction.

The current calculation of n_test when n_test < 1 may not produce the intended number of test samples. Dividing by len(labels_rest) can lead to incorrect scaling, especially after the training split has been removed.

Consider revising the calculation to correctly scale n_test based on the total number of labels:

if n_test < 1:
-    n_test = (n_test * len(labels)) / len(labels_rest)
+    n_test = n_test * len(labels)
n_test = int(n_test)
labels_test, labels_rest = labels_rest.split(n=n_test, seed=seed)

This adjustment ensures that n_test reflects the correct fraction of the total dataset.

Committable suggestion was skipped due to low confidence.

Comment on lines +683 to +690
# Make val split.
if n_val is not None:
if n_val < 1:
n_val = (n_val * len(labels)) / len(labels_rest)
labels_val, _ = labels_rest.split(n=n_val, seed=seed)
if isinstance(n_val, float) and n_val == 1.0:
labels_val = labels_rest
else:
labels_val, _ = labels_rest.split(n=n_val, seed=seed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Adjust the calculation of n_val when it's a fraction.

Similarly, the calculation of n_val may not yield the correct number of validation samples. Using len(labels_rest) in the scaling can cause inconsistencies, especially if n_test has altered the size of labels_rest.

Update the calculation as follows to base n_val on the total number of labels:

if n_val < 1:
-    n_val = (n_val * len(labels)) / len(labels_rest)
+    n_val = n_val * len(labels)
n_val = int(n_val)
if n_val >= len(labels_rest):
    labels_val = labels_rest
else:
    labels_val, _ = labels_rest.split(n=n_val, seed=seed)

This ensures that n_val represents the intended fraction of the entire dataset.

Committable suggestion was skipped due to low confidence.

Copy link

codecov bot commented Sep 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.15%. Comparing base (391df6e) to head (1385de5).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
+ Coverage   96.13%   96.15%   +0.01%     
==========================================
  Files          15       15              
  Lines        2020     2026       +6     
==========================================
+ Hits         1942     1948       +6     
  Misses         78       78              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error with make_training_splits method
1 participant