fix: convert run_pipeline_no_finalize from recursive to iterative in order to avoid stack overflows #11898
Conversation
    if let Some(SinkResult::Finished) = sink_result {
        sink_finished = true;
        break;
I found this a bit strange: sink_finished is defined outside the "main" for loop, but here we only break the inner while let loop. Is that intentional?
We should break the outer loop indeed.
OK, something is not quite right; trying to figure it out, but I'm struggling.
Found it! I forgot to reset some iterator-local state 🤦🏻
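For reference, Rust's labeled loops are the standard way to make an inner while let terminate the outer loop directly. Below is a minimal, self-contained sketch of that pattern; SinkResult's variants, the stages data, and the loop shape are assumptions chosen to mirror the excerpt above, not the actual pipeline code.

    // Hedged sketch of the labeled-break pattern under discussion; none of
    // these names are taken from the real polars pipeline internals.
    enum SinkResult {
        Finished,
        NeedsMoreInput,
    }

    fn main() {
        let stages: Vec<Vec<Option<SinkResult>>> = vec![
            vec![Some(SinkResult::NeedsMoreInput), None],
            vec![Some(SinkResult::Finished)], // should stop *all* stages
            vec![Some(SinkResult::NeedsMoreInput)],
        ];

        let mut sink_finished = false;
        // The 'outer label makes `break` leave the outer `for`,
        // not just the enclosing `while let`.
        'outer: for stage in stages {
            let mut items = stage.into_iter();
            while let Some(sink_result) = items.next() {
                if let Some(SinkResult::Finished) = sink_result {
                    sink_finished = true;
                    break 'outer;
                }
            }
        }
        assert!(sink_finished);
    }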
Wow, this hasn't gotten any easier to follow. :') Any idea what query caused the stack overflow?
The same query I mentioned in #11829, minus the streaming part, with the important caveat that it is running over a large Parquet dataset that's (Hive-)partitioned into thousands of small files.
Yes, I suspected that. This is now fixed by #11922, where instead of a union per file, we create a single source that can handle all those files. So the stack overflow should be resolved by better handling of multi-file datasets. This is currently only done for Parquet files, but I want to implement it for all the file types we have. Can you give it a try with a compiled version of main?
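To sketch the plan-shape difference being described here (Plan and MultiFileScan below are invented types, not polars' actual IR): a union per file produces one plan node per file, giving a recursive executor thousands of nodes to traverse, which can translate into deep call stacks depending on how the traversal recurses, while a single multi-file source keeps the plan at one node no matter how many files it covers.

    use std::path::PathBuf;

    // Hypothetical plan representation, for illustration only.
    enum Plan {
        Scan(PathBuf),
        Union(Vec<Plan>),
    }

    // Before: one Scan per file, glued together with a Union. Thousands
    // of files mean thousands of plan nodes for the executor to walk.
    fn union_per_file(paths: Vec<PathBuf>) -> Plan {
        Plan::Union(paths.into_iter().map(Plan::Scan).collect())
    }

    // After (the #11922 idea, roughly): a single source node that owns
    // the whole file list and iterates over it internally.
    struct MultiFileScan {
        paths: Vec<PathBuf>,
    }

    fn main() {
        let paths: Vec<PathBuf> = (0..1000)
            .map(|i| PathBuf::from(format!("part-{i}.parquet")))
            .collect();
        let _wide = union_per_file(paths.clone()); // 1001 plan nodes
        let flat = MultiFileScan { paths };        // always a single node
        assert_eq!(flat.paths.len(), 1000);
    }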
Since the original issue that this PR was trying to address seems to have been resolved, I will close this. Thanks for the initiative though, @LukeMathWalker!
I ran into a stack overflow using polars in one of our projects. I was able to pin it down to run_pipeline_no_finalize, which is indeed recursive. I've converted the algorithm to be iterative, which was not super-straightforward. Happy to iterate (pun intended) on the code to clean it up however you think it's best.
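As background on the general technique the PR applies, here is a minimal sketch of turning a recursive tree walk into an iterative one that keeps its pending work in an explicit, heap-allocated stack. The Node type and the sum-of-leaves task are invented for illustration; this is not the actual run_pipeline_no_finalize.

    // Invented example: the recursion-to-iteration rewrite in miniature.
    enum Node {
        Leaf(i64),
        Branch(Vec<Node>),
    }

    // Recursive walk: call-stack depth grows with tree depth, so a deep
    // enough tree overflows the stack.
    fn sum_recursive(node: &Node) -> i64 {
        match node {
            Node::Leaf(v) => *v,
            Node::Branch(children) => children.iter().map(sum_recursive).sum(),
        }
    }

    // Iterative walk: the same traversal, but pending nodes live in a Vec
    // on the heap. Consuming nodes as they are popped also sidesteps the
    // recursive Drop that a deep tree would otherwise trigger.
    fn sum_iterative(root: Node) -> i64 {
        let mut total = 0;
        let mut stack = vec![root];
        while let Some(node) = stack.pop() {
            match node {
                Node::Leaf(v) => total += v,
                Node::Branch(children) => stack.extend(children),
            }
        }
        total
    }

    fn main() {
        // Shallow tree: both versions agree.
        let small = Node::Branch(vec![Node::Leaf(2), Node::Leaf(3)]);
        assert_eq!(sum_recursive(&small), 5);

        // Degenerate, deeply nested tree, loosely analogous to a plan
        // built from thousands of per-file nodes.
        let mut deep = Node::Leaf(1);
        for _ in 0..100_000 {
            deep = Node::Branch(vec![deep, Node::Leaf(1)]);
        }
        // sum_recursive(&deep) would risk overflowing the call stack here;
        // the iterative version handles the same tree fine.
        assert_eq!(sum_iterative(deep), 100_001);
    }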