Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: convert run_pipeline_no_finalize from recursive to iterative in order to avoid stack overflows #11898

Closed
wants to merge 5 commits into from

Conversation

LukeMathWalker
Copy link

@LukeMathWalker LukeMathWalker commented Oct 20, 2023

I ran into a stack overflow using polars in one of our projects—I was able to pin it down to run_pipeline_no_finalize, which is indeed recursive.

image

I've converted the algorithm to be iterative, which was not super-straightforward. Happy to iterate (pun intended) on the code to clean it up however you think it's best.


if let Some(SinkResult::Finished) = sink_result {
sink_finished = true;
break;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this a bit strange—sink_finished is defined outside the "main" for loop, but here we only break the inner while let loop. Is that intentional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should break the outer loop indeed.

@LukeMathWalker
Copy link
Author

OK, something is not quite right—trying to figure it out, but I'm struggling.

@LukeMathWalker
Copy link
Author

Found it—I forgot to reset some iterator-local state 🤦🏻

crates/polars-lazy/src/tests/streaming.rs Outdated Show resolved Hide resolved

if let Some(SinkResult::Finished) = sink_result {
sink_finished = true;
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should break the outer loop indeed.

@ritchie46 ritchie46 changed the title Convert run_pipeline_no_finalize from recursive to iterative in order to avoid stack overflows fix: convert run_pipeline_no_finalize from recursive to iterative in order to avoid stack overflows Oct 20, 2023
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 20, 2023
@ritchie46
Copy link
Member

Wow, this hasn't gotten any easier to follow. :')

Any idea what query caused the stackoverflow?

@LukeMathWalker
Copy link
Author

LukeMathWalker commented Oct 23, 2023

The same query I mentioned in #11829 minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.

@ritchie46
Copy link
Member

The same query I mentioned in #11829 minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.

Yes, I estimated that. This is now fixed by #11922. Where instead of a union per file, we create a single source that can handle all those files. So the stackoverflow should be resolved by better handling multi-file datasets.

This currently is only done for parquet files, but I want to implement this for all file types we have. Can you give it a try with a compiled version of main?

@stinodego
Copy link
Member

Since the original issue that this PR was trying address seems to have been resolved, I will close this. Thanks for the initiative though @LukeMathWalker !

@stinodego stinodego closed this Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants