Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle empty dataset from output of sdg leaf node without raising error #272

Merged
merged 2 commits into from
Oct 7, 2024

Conversation

relyt0925
Copy link
Contributor

Previously, an EmptyDatasetError was raised when the dataset was empty after running the sdg pipeline of a leaf node. This change logs a warning and continues processing instead, allowing the function to handle empty datasets more gracefully and process other leaf nodes in the taxonomy. Fixes #240

@mergify mergify bot added the ci-failure label Sep 12, 2024
@bbrowning
Copy link
Contributor

Thanks for the pull request! Are there valid use-cases where a user expects an empty dataset from a leaf node? In other words, are we saving the user trouble by logging but ignoring any empty datasets? Or are we potentially masking a problem that they need to fix and re-run the entire generation? This might be the right way to address this, but I'm just trying to think through the user workflow here, how the user knows they have a problem, how they fix the problem, and whether a fatal error or just a warning log message makes that easier to spot and fix.

If it is appropriate to just log a message, perhaps we should also keep track of any taxonomy leaf nodes that resulted in empty datasets and log a final warning message at the very end of the generate run summarizing those? The single log message from each leaf node may be easy to miss during the generation loop if multiple leaf nodes are involved.

@relyt0925
Copy link
Contributor Author

I think logging a summary at the end is a great idea!

I do believe it’s valid for if a user is generating sdg against a full taxonomy (using the taxonomy-base empty parameter) that one leaf node out of 140 for example (that is about the number of leaf nodes in the community taxonomy) shouldn’t cause a complete failure of the sdg run.

@bbrowning
Copy link
Contributor

Your example of the entire community taxonomy is a good one, and I agree that even if we can't generate data for all leaf nodes, generating what we can and logging the leaf nodes that failed is better than aborting entirely after multiple hours of generation.

Previously, an EmptyDatasetError was raised when the dataset was empty after running the sdg pipeline of a leaf node. This change logs a warning and continues processing instead, allowing the function to handle empty datasets more gracefully and process other leaf nodes in the taxonomy. Fixes instructlab#240

Signed-off-by: Tyler Lisowski <lisowski@us.ibm.com>
@relyt0925
Copy link
Contributor Author

/hold

ready for review but want to do an e2e test before merging.

@bbrowning
Copy link
Contributor

I was going to ask you to please add a unit test in test_generate_data.py to verify that when a mocked generate block returns an empty dataset that the generation still runs to completion instead of erroring out. However, the code in generate_data.py needs a bit of refactoring to make that easy, and until that's done it requires quite a bit of mocking out things in generate_data to do that test. There are a few examples of this in test_generate_data.py if you want to take a stab at that, but if not this looks reasonable to merge as-is and we can create a separate issue to do some code cleanup and refactoring to make it easier to write new unit tests for the main generation loop.

Copy link
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this manually locally and hit an error that my eyes-only review didn't catch. I added a comment inline to the code with the error, but it looks like we're trying to concatenate a string with a Dataset.

src/instructlab/sdg/generate_data.py Show resolved Hide resolved
This adds a test and fixes a bug with logging of the empty sdg leaf
nodes, as it was trying to log the actual empty dataset instead of the
leaf node path.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@mergify mergify bot added the testing Relates to testing label Sep 19, 2024
@relyt0925
Copy link
Contributor Author

Ben: Thank you so much for taking the time to illustrate that unit test example: I will note that general pattern for future PRs as well. These changes look great and I approve!!!!

@relyt0925
Copy link
Contributor Author

/unhold

testing complete

@relyt0925
Copy link
Contributor Author

relyt0925 commented Oct 3, 2024

@instructlab/sdg-maintainers do we think this is a potential candidate to merge? It seems like it is impacting some client workflows when they are utilizing generation against the full community taxonomy

Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @relyt0925 for your contribution!!

@aakankshaduggal aakankshaduggal merged commit 4bd07f5 into instructlab:main Oct 7, 2024
18 checks passed
@mergify mergify bot removed the one-approval label Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty dataset error kills the workflow
3 participants