Fix split calculation and allow for not embedding #120
Conversation
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (4)
sleap_io/io/slp.py (2)
Line range hint 1-1119: Consider modularizing the file for improved maintainability

While the changes made are appropriate, this file (`sleap_io/io/slp.py`) is quite large and contains multiple complex functions for reading and writing different components of SLEAP labels. To improve maintainability and readability, consider splitting this file into smaller, more focused modules. For example:

- `slp_reader.py`: functions for reading SLEAP labels.
- `slp_writer.py`: functions for writing SLEAP labels.
- `slp_utils.py`: utility functions shared by the reader and writer.

This modularization would make the codebase easier to navigate, test, and maintain in the long run.
Line range hint 1-1119: Consider performance optimizations for large datasets

While the current implementation is functional, there are potential areas for performance improvement, especially when dealing with large datasets:

- **Batch processing:** In functions like `write_lfs`, consider batch processing when writing large datasets to HDF5 files.
- **Vectorization:** Look for opportunities to replace loops with vectorized NumPy operations, especially in data processing functions.
- **Caching:** Implement caching for frequently accessed data, such as video metadata or skeleton information.
- **Lazy loading:** Consider lazy loading for large datasets to reduce memory usage.
- **Parallel processing:** For CPU-bound operations, explore multiprocessing to leverage multiple cores.

Implementing these optimizations could significantly improve the performance of the library, especially for large SLEAP label files.
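The vectorization point can be illustrated with a small standalone sketch (hypothetical data shapes, not code from sleap-io): a Python-level loop over frames is replaced by a single broadcasted NumPy operation.

```python
import numpy as np

# Hypothetical pose data: (frames, nodes, xy) plus a per-video crop offset.
points = np.random.rand(1000, 15, 2)
offset = np.array([12.0, 34.0])

# Loop version: one Python-level iteration per frame.
shifted_loop = np.empty_like(points)
for i in range(points.shape[0]):
    shifted_loop[i] = points[i] + offset

# Vectorized version: broadcasting applies the offset to every
# frame and node in a single C-level operation.
shifted_vec = points + offset

assert np.allclose(shifted_loop, shifted_vec)
```

The two versions produce identical results, but the broadcasted form avoids the per-frame interpreter overhead.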
To identify potential bottlenecks, we can search for loops and list comprehensions:

```shell
#!/bin/bash
# Description: Identify potential performance bottlenecks

# Search for for loops
echo "For loops:"
rg --type py 'for\s+\w+\s+in' -A 2

# Search for list comprehensions
echo -e "\nList comprehensions:"
rg --type py '\[.*for.*in.*\]'
```

sleap_io/model/labels.py (2)
687-690: Avoid comparing floating-point numbers for exact equality.

Using `n_val == 1.0` can be unreliable due to floating-point precision issues. Consider checking whether `n_val` is greater than or equal to the size of `labels_rest` instead:

```diff
-if isinstance(n_val, float) and n_val == 1.0:
+if n_val >= len(labels_rest):
     labels_val = labels_rest
 else:
     labels_val, _ = labels_rest.split(n=n_val, seed=seed)
```

This approach is more robust and handles cases where `n_val` may be slightly less than `1.0` due to floating-point representation.
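As a standalone illustration of why exact float equality is fragile (unrelated to sleap-io's own code):

```python
import math

# 0.1 and 0.2 have no exact binary representation, so their sum is
# not exactly 0.3 — an exact equality check silently fails.
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True

# A rescaled split fraction can land just below or above the exact
# value in the same way, which is why `n_val == 1.0` is unreliable.
```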
674-690: Consider edge cases when splitting the dataset.

The current logic does not account for cases where the number of labels is less than the desired split sizes, which could lead to empty splits or errors. Add checks to ensure that the split sizes do not exceed the available number of labels:

```diff
 total_labels = len(labels)
+if n_train >= total_labels:
+    labels_train = labels
+    labels_val = Labels()
+    labels_test = Labels()
+    return labels_train, labels_val, labels_test
```

This addition preemptively handles scenarios with insufficient data.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- sleap_io/io/slp.py (1 hunks)
- sleap_io/model/labels.py (5 hunks)
- tests/model/test_labels.py (3 hunks)
🧰 Additional context used
🪛 Ruff
tests/model/test_labels.py
598-598: Redefinition of unused `test_make_training_splits_save` from line 577 (F811)
🔇 Additional comments (3)
sleap_io/io/slp.py (3)
Line range hint 1-1119: Summary of review findings and recommendations

Overall, the changes made to the `write_labels` function improve the flexibility of frame embedding in SLEAP label files. However, there are several areas for potential improvement:

- **Modularization:** Consider splitting this large file into smaller, focused modules for better maintainability.
- **Security:** Enhance security measures for file operations and user input processing.
- **Performance:** Implement optimizations for handling large datasets, such as batch processing and vectorization.
- **Error handling:** Improve error handling and resource management throughout the file.

These recommendations aim to enhance the robustness, performance, and maintainability of the codebase. Consider prioritizing these improvements in future iterations of the project.
Line range hint 1-1119: Enhance security measures for file operations and user input

While the current implementation seems functional, there are areas where security could be improved:

- **Path traversal:** Ensure that all file paths (e.g., `labels_path`) are properly sanitized to prevent path traversal attacks.
- **Input validation:** Implement strict input validation for user-provided data, especially when processing JSON or creating HDF5 datasets.
- **Error handling:** Improve error handling to avoid exposing sensitive information in case of exceptions.
- **Resource management:** Ensure proper closing of file handles, especially when using `h5py.File()`.

Consider implementing these security enhancements to make the file operations more robust against potential attacks or misuse.

To check for proper error handling and resource management, we can search for `try`-`except` blocks and context managers:

```shell
#!/bin/bash
# Description: Check for error handling and resource management

# Search for try-except blocks
echo "Try-Except blocks:"
rg --type py 'try:' -A 5

# Search for context managers (with statements)
echo -e "\nContext managers:"
rg --type py 'with.*:' -A 2
```
Line range hint 1115-1119: Improved handling of embedding in `write_labels` function

The changes to the `write_labels` function enhance the flexibility of embedding frames in the SLEAP labels file. The new implementation correctly handles the `embed` parameter, allowing for more granular control over which frames are embedded.

To ensure that this change doesn't introduce any regressions, let's verify the usage of the `embed` parameter throughout the codebase:

✅ Verification successful

`embed` parameter usage in `write_labels` verified successfully. All instances of the `embed` parameter in `write_labels` function calls across the codebase have been correctly implemented and handle embedding as intended.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Check for usage of the `embed` parameter in `write_labels` function calls

# Search for calls to write_labels with the embed parameter
rg --type python 'write_labels\s*\([^)]*embed\s*=' -A 5
```

Length of output: 220

Script:

```shell
#!/bin/bash
# Description: Check for usage of the `embed` parameter in `write_labels` function calls with corrected file type.

# Search for calls to write_labels with the embed parameter using 'py' type
rg --type py 'write_labels\s*\([^)]*embed\s*=' -A 5
```

Length of output: 1421
```python
train, val, test = labels.make_training_splits(n_train=0.4, n_val=0.4, n_test=0.2)
assert len(train) == 2
assert len(val) == 2
assert len(test) == 1
```
Potential rounding issues with split proportions

Using floating-point numbers for `n_train=0.4`, `n_val=0.4`, and `n_test=0.2` may lead to unexpected split sizes due to floating-point precision errors. Ensure that the sum of these proportions equals the total number of frames in the dataset.

Consider using integer values for `n_train`, `n_val`, and `n_test`, or adjusting the split logic to handle floating-point precision.
```python
@pytest.mark.parametrize("embed", [True, False])
def test_make_training_splits_save(slp_real_data, tmp_path, embed):
```
Redefinition of function `test_make_training_splits_save`

The function `test_make_training_splits_save` is redefined at line 598, which shadows the earlier definition at line 577 so that the first test never runs (Ruff F811).

Apply the following diff to rename the second function and avoid the redefinition:

```diff
 @pytest.mark.parametrize("embed", [True, False])
-def test_make_training_splits_save(slp_real_data, tmp_path, embed):
+def test_make_training_splits_save_with_embed(slp_real_data, tmp_path, embed):
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def test_make_training_splits_save_with_embed(slp_real_data, tmp_path, embed):
```

🧰 Tools
🪛 Ruff
598-598: Redefinition of unused `test_make_training_splits_save` from line 577 (F811)
```python
if embed:
    labels_train.save(save_dir / "train.pkg.slp", embed="user")
    labels_val.save(save_dir / "val.pkg.slp", embed="user")
    labels_test.save(save_dir / "test.pkg.slp", embed="user")
else:
    labels_train.save(save_dir / "train.slp", embed=False)
    labels_val.save(save_dir / "val.slp", embed=False)
    labels_test.save(save_dir / "test.slp", embed=False)
```
Ensure `labels_test` is defined before attempting to save.

If `n_test` is `None`, `labels_test` may not be defined, leading to a `NameError` when trying to save it. The code should check whether `labels_test` exists before saving.

Modify the saving logic to include a condition for `labels_test`:

```diff
 if save_dir is not None:
     save_dir = Path(save_dir)
     save_dir.mkdir(exist_ok=True, parents=True)
     if embed:
         labels_train.save(save_dir / "train.pkg.slp", embed="user")
         labels_val.save(save_dir / "val.pkg.slp", embed="user")
-        labels_test.save(save_dir / "test.pkg.slp", embed="user")
+        if n_test is not None:
+            labels_test.save(save_dir / "test.pkg.slp", embed="user")
     else:
         labels_train.save(save_dir / "train.slp", embed=False)
         labels_val.save(save_dir / "val.slp", embed=False)
-        labels_test.save(save_dir / "test.slp", embed=False)
+        if n_test is not None:
+            labels_test.save(save_dir / "test.slp", embed=False)
```

This ensures that `labels_test.save()` is only called when `labels_test` is defined.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if embed:
    labels_train.save(save_dir / "train.pkg.slp", embed="user")
    labels_val.save(save_dir / "val.pkg.slp", embed="user")
    if n_test is not None:
        labels_test.save(save_dir / "test.pkg.slp", embed="user")
else:
    labels_train.save(save_dir / "train.slp", embed=False)
    labels_val.save(save_dir / "val.slp", embed=False)
    if n_test is not None:
        labels_test.save(save_dir / "test.slp", embed=False)
```
```python
# Make test split.
if n_test is not None:
    if n_test < 1:
        n_test = (n_test * len(labels)) / len(labels_rest)
    labels_test, labels_rest = labels_rest.split(n=n_test, seed=seed)
```
Correct the calculation of `n_test` when it's a fraction.

The current calculation of `n_test` when `n_test < 1` may not produce the intended number of test samples. Dividing by `len(labels_rest)` can lead to incorrect scaling, especially after the training split has been removed.

Consider revising the calculation to correctly scale `n_test` based on the total number of labels:

```diff
 if n_test < 1:
-    n_test = (n_test * len(labels)) / len(labels_rest)
+    n_test = n_test * len(labels)
+    n_test = int(n_test)
 labels_test, labels_rest = labels_rest.split(n=n_test, seed=seed)
```

This adjustment ensures that `n_test` reflects the correct fraction of the total dataset.

Committable suggestion was skipped due to low confidence.
```diff
 # Make val split.
 if n_val is not None:
     if n_val < 1:
         n_val = (n_val * len(labels)) / len(labels_rest)
-    labels_val, _ = labels_rest.split(n=n_val, seed=seed)
+    if isinstance(n_val, float) and n_val == 1.0:
+        labels_val = labels_rest
+    else:
+        labels_val, _ = labels_rest.split(n=n_val, seed=seed)
```
Adjust the calculation of `n_val` when it's a fraction.

Similarly, the calculation of `n_val` may not yield the correct number of validation samples. Using `len(labels_rest)` in the scaling can cause inconsistencies, especially if `n_test` has altered the size of `labels_rest`.

Update the calculation as follows to base `n_val` on the total number of labels:

```diff
 if n_val < 1:
-    n_val = (n_val * len(labels)) / len(labels_rest)
+    n_val = n_val * len(labels)
+    n_val = int(n_val)
 if n_val >= len(labels_rest):
     labels_val = labels_rest
 else:
     labels_val, _ = labels_rest.split(n=n_val, seed=seed)
```

This ensures that `n_val` represents the intended fraction of the entire dataset.

Committable suggestion was skipped due to low confidence.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
+ Coverage   96.13%   96.15%   +0.01%
==========================================
  Files          15       15
  Lines        2020     2026       +6
==========================================
+ Hits         1942     1948       +6
  Misses         78       78
```

☔ View full report in Codecov by Sentry.
This pull request fixes the split calculation in the `make_training_splits` function and adds the option to not embed user-labeled frame images in the saved files (see the `make_training_splits` method issue, #117): `Labels.make_training_splits(..., embed=False)`.

Previously, the function would always embed the images, which could be slow for large projects. With this change, the `embed` parameter is introduced, allowing the user to choose whether to embed the images or save the labels with references to the source video files.