fix: convert run_pipeline_no_finalize from recursive to iterative in order to avoid stack overflows #11898
Conversation
    if let Some(SinkResult::Finished) = sink_result {
        sink_finished = true;
        break;
I found this a bit strange: sink_finished is defined outside the "main" for loop, but here we only break the inner while let loop. Is that intentional?
We should break the outer loop indeed.
OK, something is not quite right; trying to figure it out, but I'm struggling.
Found it! I forgot to reset some iterator-local state 🤦🏻
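For reference, Rust's labeled loops are the standard way to make an inner while let terminate the outer loop directly. Below is a minimal, self-contained sketch of that pattern; SinkResult's variants, the stages data, and the loop shape are assumptions chosen to mirror the excerpt above, not the actual pipeline code.

    // Hedged sketch of the labeled-break pattern under discussion; none of
    // these names are taken from the real polars pipeline internals.
    enum SinkResult {
        Finished,
        NeedsMoreInput,
    }

    fn main() {
        let stages: Vec<Vec<Option<SinkResult>>> = vec![
            vec![Some(SinkResult::NeedsMoreInput), None],
            vec![Some(SinkResult::Finished)], // should stop *all* stages
            vec![Some(SinkResult::NeedsMoreInput)],
        ];

        let mut sink_finished = false;
        // The 'outer label makes `break` leave the outer `for`,
        // not just the enclosing `while let`.
        'outer: for stage in stages {
            let mut items = stage.into_iter();
            while let Some(sink_result) = items.next() {
                if let Some(SinkResult::Finished) = sink_result {
                    sink_finished = true;
                    break 'outer;
                }
            }
        }
        assert!(sink_finished);
    }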
Wow, this hasn't gotten any easier to follow. :') Any idea what query caused the stack overflow?
The same query I mentioned in #11829, minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.
Yes, I suspected that. This is now fixed by #11922, where instead of a union per file, we create a single source that can handle all those files. So the stack overflow should be resolved by better handling of multi-file datasets. This is currently only done for Parquet files, but I want to implement it for all the file types we have. Can you give it a try with a compiled version of main?
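To sketch the plan-shape difference being described here (Plan and MultiFileScan below are invented types, not polars' actual IR): a union per file produces one plan node per file, giving a recursive executor thousands of nodes to traverse, which can translate into deep call stacks depending on how the traversal recurses, while a single multi-file source keeps the plan at one node no matter how many files it covers.

    use std::path::PathBuf;

    // Hypothetical plan representation, for illustration only.
    enum Plan {
        Scan(PathBuf),
        Union(Vec<Plan>),
    }

    // Before: one Scan per file, glued together with a Union. Thousands
    // of files mean thousands of plan nodes for the executor to walk.
    fn union_per_file(paths: Vec<PathBuf>) -> Plan {
        Plan::Union(paths.into_iter().map(Plan::Scan).collect())
    }

    // After (the #11922 idea, roughly): a single source node that owns
    // the whole file list and iterates over it internally.
    struct MultiFileScan {
        paths: Vec<PathBuf>,
    }

    fn main() {
        let paths: Vec<PathBuf> = (0..1000)
            .map(|i| PathBuf::from(format!("part-{i}.parquet")))
            .collect();
        let _wide = union_per_file(paths.clone()); // 1001 plan nodes
        let flat = MultiFileScan { paths };        // always a single node
        assert_eq!(flat.paths.len(), 1000);
    }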
Since the original issue that this PR was trying to address seems to have been resolved, I will close this. Thanks for the initiative though, @LukeMathWalker!
I ran into a stack overflow using polars in one of our projects. I was able to pin it down to run_pipeline_no_finalize, which is indeed recursive. I've converted the algorithm to be iterative, which was not super-straightforward. Happy to iterate (pun intended) on the code to clean it up however you think it's best.
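As background on the general technique the PR applies, here is a minimal sketch of turning a recursive tree walk into an iterative one that keeps its pending work in an explicit, heap-allocated stack. The Node type and the sum-of-leaves task are invented for illustration; this is not the actual run_pipeline_no_finalize.

    // Invented example: the recursion-to-iteration rewrite in miniature.
    enum Node {
        Leaf(i64),
        Branch(Vec<Node>),
    }

    // Recursive walk: call-stack depth grows with tree depth, so a deep
    // enough tree overflows the stack.
    fn sum_recursive(node: &Node) -> i64 {
        match node {
            Node::Leaf(v) => *v,
            Node::Branch(children) => children.iter().map(sum_recursive).sum(),
        }
    }

    // Iterative walk: the same traversal, but pending nodes live in a Vec
    // on the heap. Consuming nodes as they are popped also sidesteps the
    // recursive Drop that a deep tree would otherwise trigger.
    fn sum_iterative(root: Node) -> i64 {
        let mut total = 0;
        let mut stack = vec![root];
        while let Some(node) = stack.pop() {
            match node {
                Node::Leaf(v) => total += v,
                Node::Branch(children) => stack.extend(children),
            }
        }
        total
    }

    fn main() {
        // Shallow tree: both versions agree.
        let small = Node::Branch(vec![Node::Leaf(2), Node::Leaf(3)]);
        assert_eq!(sum_recursive(&small), 5);

        // Degenerate, deeply nested tree, loosely analogous to a plan
        // built from thousands of per-file nodes.
        let mut deep = Node::Leaf(1);
        for _ in 0..100_000 {
            deep = Node::Branch(vec![deep, Node::Leaf(1)]);
        }
        // sum_recursive(&deep) would risk overflowing the call stack here;
        // the iterative version handles the same tree fine.
        assert_eq!(sum_iterative(deep), 100_001);
    }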