-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle empty dataset from output of sdg leaf node without raising error #272
Conversation
Thanks for the pull request! Are there valid use-cases where a user expects an empty dataset from a leaf node? In other words, are we saving the user trouble by logging but ignoring any empty datasets? Or are we potentially masking a problem that they need to fix and re-run the entire generation? This might be the right way to address this, but I'm just trying to think through the user workflow here, how the user knows they have a problem, how they fix the problem, and whether a fatal error or just a warning log message makes that easier to spot and fix. If it is appropriate to just log a message, perhaps we should also keep track of any taxonomy leaf nodes that resulted in empty datasets and log a final warning message at the very end of the generate run summarizing those? The single log message from each leaf node may be easy to miss during the generation loop if multiple leaf nodes are involved. |
I think logging a summary at the end is a great idea! I do believe it’s valid for if a user is generating sdg against a full taxonomy (using the taxonomy-base empty parameter) that one leaf node out of 140 for example (that is about the number of leaf nodes in the community taxonomy) shouldn’t cause a complete failure of the sdg run. |
Your example of the entire community taxonomy is a good one, and I agree that even if we can't generate data for all leaf nodes, generating what we can and logging the leaf nodes that failed is better than aborting entirely after multiple hours of generation. |
Previously, an EmptyDatasetError was raised when the dataset was empty after running the sdg pipeline of a leaf node. This change logs a warning and continues processing instead, allowing the function to handle empty datasets more gracefully and process other leaf nodes in the taxonomy. Fixes instructlab#240 Signed-off-by: Tyler Lisowski <lisowski@us.ibm.com>
db5a5ce
to
81bca55
Compare
/hold ready for review but want to do an e2e test before merging. |
I was going to ask you to please add a unit test in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran this manually locally and hit an error that my eyes-only review didn't catch. I added a comment inline to the code with the error, but it looks like we're trying to concatenate a string with a Dataset.
This adds a test and fixes a bug with logging of the empty sdg leaf nodes, as it was trying to log the actual empty dataset instead of the leaf node path. Signed-off-by: Ben Browning <bbrownin@redhat.com>
Ben: Thank you so much for taking the time to illustrate that unit test example: I will note that general pattern for future PRs as well. These changes look great and I approve!!!! |
/unhold testing complete |
@instructlab/sdg-maintainers do we think this is a potential candidate to merge? It seems like it is impacting some client workflows when they are utilizing generation against the full community taxonomy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks @relyt0925 for your contribution!!
Previously, an EmptyDatasetError was raised when the dataset was empty after running the sdg pipeline of a leaf node. This change logs a warning and continues processing instead, allowing the function to handle empty datasets more gracefully and process other leaf nodes in the taxonomy. Fixes #240