Handle empty dataset from output of sdg leaf node without raising error #272

relyt0925 · 2024-09-12T01:34:35Z

Previously, an EmptyDatasetError was raised when the dataset was empty after running the sdg pipeline of a leaf node. This change logs a warning and continues processing instead, allowing the function to handle empty datasets more gracefully and process other leaf nodes in the taxonomy. Fixes #240
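A minimal sketch of the behavior change described above, not the actual code in generate_data.py. The function name `generate_all`, the `fail_on_empty` flag, and the local `EmptyDatasetError` definition are assumptions made for illustration, and plain lists stand in for datasets.

```python
import logging

logger = logging.getLogger("sdg_sketch")


class EmptyDatasetError(Exception):
    """Stand-in for the error the old behavior raised (hypothetical definition)."""


def generate_all(leaf_nodes, run_pipeline, fail_on_empty=False):
    """Run a pipeline for every leaf node (hypothetical helper).

    leaf_nodes maps a leaf node path to its seed samples; run_pipeline is
    any callable that returns a sized collection of generated samples.
    """
    results = {}
    for path, samples in leaf_nodes.items():
        generated = run_pipeline(samples)
        if len(generated) == 0:
            if fail_on_empty:
                # Old behavior: the first empty result aborts the whole run.
                raise EmptyDatasetError(f"Empty dataset for leaf node {path}")
            # New behavior: warn and move on to the next leaf node.
            logger.warning("Empty dataset for leaf node %s; skipping", path)
            continue
        results[path] = generated
    return results
```

With `fail_on_empty=False`, the leaf nodes that do produce data are still returned even when one of them comes back empty.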

bbrowning · 2024-09-12T19:26:21Z

Thanks for the pull request! Are there valid use-cases where a user expects an empty dataset from a leaf node? In other words, are we saving the user trouble by logging but ignoring any empty datasets? Or are we potentially masking a problem that they need to fix and re-run the entire generation? This might be the right way to address this, but I'm just trying to think through the user workflow here, how the user knows they have a problem, how they fix the problem, and whether a fatal error or just a warning log message makes that easier to spot and fix.

If it is appropriate to just log a message, perhaps we should also keep track of any taxonomy leaf nodes that resulted in empty datasets and log a final warning message at the very end of the generate run summarizing those? The single log message from each leaf node may be easy to miss during the generation loop if multiple leaf nodes are involved.
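One way the end-of-run summary suggested here could look, as a rough sketch. The helper name `summarize_empty_leaf_nodes` and the message wording are assumptions, not taken from the project.

```python
import logging

logger = logging.getLogger("sdg_summary_sketch")


def summarize_empty_leaf_nodes(empty_paths):
    """Emit one closing warning naming every leaf node with no data.

    Returns the message (or None when nothing was empty) so callers and
    tests can inspect it. Hypothetical helper, not part of instructlab-sdg.
    """
    if not empty_paths:
        return None
    message = "{} leaf node(s) produced an empty dataset: {}".format(
        len(empty_paths), ", ".join(empty_paths)
    )
    logger.warning(message)
    return message
```

The generation loop would append each empty leaf node path to a list as it goes and call this once after the last leaf node, so the warning is the final thing in the log rather than buried mid-run.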

relyt0925 · 2024-09-13T13:59:55Z

I think logging a summary at the end is a great idea!

I do believe it's a valid case: when a user is generating sdg against a full taxonomy (using the taxonomy-base empty parameter), one empty leaf node out of roughly 140 (that is about the number of leaf nodes in the community taxonomy) shouldn't cause a complete failure of the sdg run.

bbrowning · 2024-09-13T17:29:11Z

Your example of the entire community taxonomy is a good one, and I agree that even if we can't generate data for all leaf nodes, generating what we can and logging the leaf nodes that failed is better than aborting entirely after multiple hours of generation.

Previously, an EmptyDatasetError was raised when the dataset was empty after running the sdg pipeline of a leaf node. This change logs a warning and continues processing instead, allowing the function to handle empty datasets more gracefully and process other leaf nodes in the taxonomy.

Fixes instructlab#240

Signed-off-by: Tyler Lisowski <lisowski@us.ibm.com>

relyt0925 · 2024-09-15T15:41:34Z

/hold

ready for review but want to do an e2e test before merging.

bbrowning · 2024-09-18T18:27:38Z

I was going to ask you to add a unit test in test_generate_data.py verifying that, when a mocked generate block returns an empty dataset, the generation still runs to completion instead of erroring out. However, the code in generate_data.py needs a bit of refactoring to make that easy, and until that's done such a test requires mocking out quite a few things in generate_data. There are a few examples of this in test_generate_data.py if you want to take a stab at it. If not, this looks reasonable to merge as-is, and we can create a separate issue for code cleanup and refactoring to make it easier to write new unit tests for the main generation loop.
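The shape of such a test might look roughly like the sketch below. It does not use the real test_generate_data.py fixtures or the real generation loop: `generate_data` here is a simplified, hypothetical stand-in, and the `pipeline` object with a `generate` method is an assumption.

```python
import logging
from unittest import mock

logger = logging.getLogger("sdg_test_sketch")


def generate_data(leaf_nodes, pipeline):
    """Hypothetical stand-in for the main generation loop."""
    empty = []
    for path, samples in leaf_nodes.items():
        if len(pipeline.generate(samples)) == 0:
            logger.warning("Empty dataset for leaf node %s", path)
            empty.append(path)
    return empty


def test_empty_dataset_does_not_abort():
    pipeline = mock.Mock()
    # The mocked generate step yields nothing for every leaf node.
    pipeline.generate.return_value = []
    leaf_nodes = {"knowledge/a": ["q1"], "skills/b": ["q2"]}

    empty = generate_data(leaf_nodes, pipeline)

    # The loop visited every leaf node instead of stopping at the first.
    assert pipeline.generate.call_count == 2
    assert empty == ["knowledge/a", "skills/b"]
```

The key assertion is the call count: it shows the loop kept going after the first empty result rather than raising.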

bbrowning

I ran this manually locally and hit an error that my eyes-only review didn't catch. I added a comment inline to the code with the error, but it looks like we're trying to concatenate a string with a Dataset.

src/instructlab/sdg/generate_data.py

This adds a test and fixes a bug with logging of the empty sdg leaf nodes, as it was trying to log the actual empty dataset instead of the leaf node path.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
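The exact offending line is not shown in this thread, so the following only illustrates the class of bug described: concatenating a string with a dataset object raises a TypeError, while logging the leaf node path works. `FakeDataset`, `describe_buggy`, and `describe_fixed` are hypothetical names.

```python
class FakeDataset:
    """Stand-in for a dataset object; it defines no string concatenation."""

    def __len__(self):
        return 0


def describe_buggy(leaf_node_path, dataset):
    # Bug pattern: the dataset object, not the path, is concatenated.
    return "Empty dataset for leaf node: " + dataset


def describe_fixed(leaf_node_path, dataset):
    # Fix pattern: report the leaf node path, which is a plain string.
    return f"Empty dataset for leaf node: {leaf_node_path}"
```

Because the buggy branch only runs when a dataset is actually empty, it is easy to miss in review and surfaces only on a run that hits the empty case.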

relyt0925 · 2024-09-21T02:02:58Z

Ben: Thank you so much for taking the time to illustrate that unit test example: I will note that general pattern for future PRs as well. These changes look great and I approve!!!!

relyt0925 · 2024-09-21T02:03:16Z

/unhold

testing complete

relyt0925 · 2024-10-03T18:59:31Z

@instructlab/sdg-maintainers do we think this is a potential candidate to merge? It seems to be impacting some client workflows that run generation against the full community taxonomy.

aakankshaduggal

LGTM, Thanks @relyt0925 for your contribution!!

mergify bot added the ci-failure label Sep 12, 2024

relyt0925 force-pushed the issue-240 branch from db5a5ce to 81bca55 September 15, 2024 03:51

mergify bot removed the ci-failure label Sep 15, 2024

bbrowning requested changes Sep 19, 2024

src/instructlab/sdg/generate_data.py (resolved)

Log the empty leaf node path (as opposed to its dataset)

59ba8a5

mergify bot added the testing Relates to testing label Sep 19, 2024

bbrowning approved these changes Sep 25, 2024

bbrowning requested a review from a team September 25, 2024 00:15

mergify bot added the one-approval label Sep 25, 2024

cdoern mentioned this pull request Sep 30, 2024

Invalid generated model. Bad training instructlab/instructlab#2303

Open

aakankshaduggal approved these changes Oct 7, 2024

aakankshaduggal merged commit 4bd07f5 into instructlab:main Oct 7, 2024
18 checks passed

mergify bot removed the one-approval label Oct 7, 2024
