Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Remove duplicated content in error messages #8107

Merged
merged 14 commits into from
Feb 7, 2024

Conversation

rben01
Copy link
Contributor

@rben01 rben01 commented Apr 10, 2023

Should fix #8079 , but please avoid merging for now, as I have only tested a small subset of the things this might affect (and I have not yet added any formal tests at all; what's below is the extent of my testing). I would also like input on this to see if this the ideal way to handle this problem and to report errors in general.

Changes

  • Prevented exponential growth of error messages by avoiding nesting of error messages
  • Instead, used a counter to track how many operations had been applied to a LazyFrame after an error
  • Error message contains stack depth and original error
  • Renamed ErrorState to ErrorStateEncounters and made it crate-private
  • Renamed ErrorStateSync to ErrorState and removed its Deref impl so that the only way to access its data is through (crate-private) accessors, thus reducing the size of its public API

Error messages are now O(1) in length (well, technically O(log(n)), if you take into account that the depth counter has log10(n) digits, but this is effectively O(1)).

Test case (singular)

Input

In each case we print > title of test case, = df.describe_plan(), and # df.collect().unwrap_err()

fn main() -> Result<(), Box<dyn std::error::Error>> {
    fn report_err(df: LazyFrame, ops_after: usize) {
        println!("> select xyz, then {ops_after} ops after");
        println!("= {}", df.describe_plan());
        println!("# {:?}", df.collect().unwrap_err());
        println!();
    }

    let df = df! [
        "c1" => [0, 1],
    ]
    .unwrap()
    .lazy();

    println!("> initial df");
    println!("= {}", df.describe_plan());
    println!("# NO ERROR");
    println!();

    let df0 = df.clone().select([col("xyz")]);
    report_err(df0, 0);

    let df1 = df.clone().select([col("xyz")]).select([col("c1")]);
    report_err(df1, 1);

    let df2 = df
        .clone()
        .select([col("xyz")])
        .select([col("c1")])
        .select([col("c2")]);
    report_err(df2, 2);

    let df3 = df
        .clone()
        .select([col("xyz")])
        .select([col("c1")])
        .select([col("c2")])
        .select([col("c3")]);
    report_err(df3, 3);

    Ok(())
}

Output

In each case we print > title of test case, = df.describe_plan(), and # df.collect().unwrap_err()

> initial df
= DF ["c1"]; PROJECT */1 COLUMNS; SELECTION: "None"
# NO ERROR

> select xyz, then 0 ops after
= ErrorState(NotYetEncountered { err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) })
# ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\""))

> select xyz, then 1 ops after
= ErrorState(AlreadyEncountered { n_times: 1, prev_err_msg: "not found: xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"" })
# ComputeError(ErrString("LogicalPlan already failed (depth: 1) with error: 'not found: xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

> select xyz, then 2 ops after
= ErrorState(AlreadyEncountered { n_times: 2, prev_err_msg: "not found: xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"" })
# ComputeError(ErrString("LogicalPlan already failed (depth: 2) with error: 'not found: xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

> select xyz, then 3 ops after
= ErrorState(AlreadyEncountered { n_times: 3, prev_err_msg: "not found: xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"" })
# ComputeError(ErrString("LogicalPlan already failed (depth: 3) with error: 'not found: xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

@ritchie46
Copy link
Member

Any news on this @rben01?

@rben01
Copy link
Contributor Author

rben01 commented May 14, 2023

Thanks for the reminder. I'll circle back on this this week and if nothing else write some tests. But first I'd like to see if there's a better way to include multiple error messages than just the depth.

@rben01
Copy link
Contributor Author

rben01 commented May 18, 2023

@ritchie46 I'm thinking about implementing a system where subsequent operations are tracked in some detail — not just the number of them, but an entire Vec indicating the operations attempted after the error, e.g., [Projection, Sort, Filter]. The idea is that it should make tracking down the source of an error much easier. I've experimented with it a little and it's feasible to store the names of the operations (corresponding to the methods of LogicalPlanBuilder) and some of the data that would be relevant to a user, such as the columns used in the projection. Do you think this is a good idea, or should we just stick with the count for now?

@rben01
Copy link
Contributor Author

rben01 commented Jun 15, 2023

So I think this is fine to merge in its current state. It's better than the current behavior. I'll think more about what improving the error messages might look like in the future, but for now this PR is a strict improvement on the existing behavior.

@svaningelgem
Copy link
Contributor

@rben01 , if the maintainers of polars like your PR as much as I do: could you update it with the main branch and add a test? This can be based on your example with an added assertion that the contents looks correct.

@rben01 rben01 requested review from ritchie46 and orlp as code owners October 17, 2023 02:43
@rben01
Copy link
Contributor Author

rben01 commented Oct 17, 2023

@rben01 , if the maintainers of polars like your PR as much as I do: could you update it with the main branch and add a test? This can be based on your example with an added assertion that the contents looks correct.

I pulled in main and wrote some tests, but I'm unable to run them because it won't build now (looks like some upstream compilation errors). I'll check back on this periodically and keep pulling from main and hopefully eventually I can run the new tests and have them pass.

@rben01
Copy link
Contributor Author

rben01 commented Oct 18, 2023

Ok, test added, should be ok to merge now.

@ritchie46
Copy link
Member

Ok, test added, should be ok to merge now.

Can you ensure CI is green? :)

@rben01
Copy link
Contributor Author

rben01 commented Oct 18, 2023

Can you ensure CI is green? :)

Nothing I can do about python but rust CI should be all green now :)

@svaningelgem
Copy link
Contributor

Hi @rben01 , I think it is because the way the errors are now handled.
You do some kind of "compression", and likely this is loosing the Duplicate error that is being thrown.

What if you add in your tests something like this:

    let df4 = df
        .clone()
        .select([col("c1"), col("c1")])
        .collect();

That should give you the DuplicateError that Python is complaining about. And likely fix that specific test as well. Because Python expects a DuplicateError (on rust side it's called Duplicate or polars_bail!(duplicate=), but it gets a ComputeError instead.

So I think it has to do with the "compression"/"unwrapping" that you implemented.

@rben01
Copy link
Contributor Author

rben01 commented Oct 20, 2023

Thanks @svaningelgem , you're absolutely right. I changed things so that instead of wrapping everything in ComputeError, I wrap them in their original error variant. Here is the new output:

> initial df
= DF ["c1"]; PROJECT */1 COLUMNS; SELECTION: "None"
# NO ERROR

> select xyz, then 0 ops after
= ErrorState(NotYetEncountered { err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) })
# ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\""))

> select xyz, then 1 ops after
= ErrorState(AlreadyEncountered { n_times: 1, orig_err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) })
# ColumnNotFound(ErrString("LogicalPlan already failed (depth: 1) with error: 'xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

> select xyz, then 2 ops after
= ErrorState(AlreadyEncountered { n_times: 2, orig_err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) })
# ColumnNotFound(ErrString("LogicalPlan already failed (depth: 2) with error: 'xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

> select xyz, then 3 ops after
= ErrorState(AlreadyEncountered { n_times: 3, orig_err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) })
# ColumnNotFound(ErrString("LogicalPlan already failed (depth: 3) with error: 'xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

And now we're all green.

@rben01
Copy link
Contributor Author

rben01 commented Oct 23, 2023

Updated again to simplify the implementation and remove some of the pointless stuff in debug output. Hopefully this remains all green.

> initial df
= DF ["c1"]; PROJECT */1 COLUMNS; SELECTION: "None"
# NO ERROR

> select xyz, then 0 ops after
= ErrorState { n_times: 0, err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) }
# ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\""))

> select xyz, then 1 ops after
= ErrorState { n_times: 1, err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) }
# ColumnNotFound(ErrString("LogicalPlan already failed (depth: 1) with error: 'xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

> select xyz, then 2 ops after
= ErrorState { n_times: 2, err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) }
# ColumnNotFound(ErrString("LogicalPlan already failed (depth: 2) with error: 'xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

> select xyz, then 3 ops after
= ErrorState { n_times: 3, err: ColumnNotFound(ErrString("xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"")) }
# ColumnNotFound(ErrString("LogicalPlan already failed (depth: 3) with error: 'xyz\n\nError originated just after this operation:\nDF [\"c1\"]; PROJECT */1 COLUMNS; SELECTION: \"None\"'"))

@rben01
Copy link
Contributor Author

rben01 commented Nov 1, 2023

@ritchie46 does this look good to you?

@rben01
Copy link
Contributor Author

rben01 commented Feb 5, 2024

@ritchie46 Any update on merging this?

@stinodego
Copy link
Member

stinodego commented Feb 5, 2024

@rben01 could you rebase this? Then I'll have a look or ask Ritchie to look at it.

Also, the linked issue has been closed. Is there a MRE we can run to see the fix in action? Adding a Python test with such an example would be preferable.

@stinodego stinodego changed the title Removed duplicated content in error messages fix: Remove duplicated content in error messages Feb 5, 2024
@stinodego stinodego added the A-exceptions Area: exception handling label Feb 5, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 5, 2024
Prevented exponential growth of error messages by avoiding nesting of error messages
Instead, used a counter to track how many operations had been applied to a LazyFrame after an error
Renamed `ErrorState` to `ErrorStateEncounters` and made it crate-private
Renamed `ErrorStateSync` to `ErrorState` and removed its Deref impl so that the only way to access its data is through (crate-private) accessors, thus reducing the size of its public API
…slashes in output

Minor cleanup + commenting
Prevented exponential growth of error messages by avoiding nesting of error messages
Instead, used a counter to track how many operations had been applied to a LazyFrame after an error
Renamed `ErrorState` to `ErrorStateEncounters` and made it crate-private
Renamed `ErrorStateSync` to `ErrorState` and removed its Deref impl so that the only way to access its data is through (crate-private) accessors, thus reducing the size of its public API
Instead of leaving a `String` in place of `PolarsError`, leave in place
the original `PolarsError`. Then, when it's time to produce a new error,
we clone the original error if possible; and it's almost always
possible, as all `PolarsError` variants contain ErrString as their data
except one: `Io`. So just convert `Io` to `ComputeError` with an error
message saying so, otherwise reuse the `PolarsError` variant with a
cloned message. (This is based off, and utilizes, the method
`PolarsError::wrap_msg`.)

This way we can retain the "shape" of the original error and just
augment its message to indicate that it's already failed n times.

Also cleaned up the `err_msg.rs` test in anticipation of adding a bit
more to that file.
Now that errors aren't taken and replaced by a String - the original is
stored and cloned as needed - we can simplify the implementation.
Instead of NotYetEncountered and AlreadyEncountered, we can just have a
single struct with a counteer (no enum needed). This massively
simplifies the implementation
Also, simplified the implemtnation of tests
Also cleaned up ErrorState's Debug implementation so that users don't
see the nuts and bolts inside the struct - they only see the important
debugging info
@rben01 rben01 force-pushed the fix-error-fmt-exponential-length branch from f801640 to 7058c89 Compare February 6, 2024 03:37
@rben01
Copy link
Contributor Author

rben01 commented Feb 6, 2024

Also, the linked issue has been closed. Is there a MRE we can run to see the fix in action? Adding a Python test with such an example would be preferable.

I closed it with a smaller PR that prevented the escaping of backslashes leading to O(2^n) error message length, but there was still duplication within error messages, as all old messages were printed — that's what this PR fixes.

Here is an example from Python of compounding error messages (edited after posting to replace c, c, c with c, d, e to make it clear that the failure from the first select destroys the LazyFrame):

In [10]: pl.__version__
Out[10]: '0.20.7'

In [11]: df = pl.DataFrame({'a': 1, 'b': range(2)})

In [12]: df.lazy().select('c').select('d').select('e')
Out[12]: ---------------------------------------------------------------------------
ColumnNotFoundError                       Traceback (most recent call last)
<<< backtrace omitted >>>
ColumnNotFoundError: c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"

Error originated just after this operation:
ErrorStateSync(AlreadyEncountered(not found: c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"))
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"

Error originated just after this operation:
ErrorStateSync(AlreadyEncountered(not found: c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"

Error originated just after this operation:
ErrorStateSync(AlreadyEncountered(not found: c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"))
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"))
ErrorStateSync(AlreadyEncountered(not found: c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"))
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for this.

However, if we look at the resulting exception message now, it's not quite correct:

import polars as pl

lf = pl.LazyFrame({"a": 1, "b": range(2)})
lf.select("c").select("d").select("e").collect()
Traceback (most recent call last):
  File "/home/stijn/code/polars/py-polars/repro.py", line 29, in <module>
    df.select("c").select("d").select("e").collect()
  File "/home/stijn/code/polars/py-polars/polars/lazyframe/frame.py", line 1927, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: LogicalPlan already failed (depth: 2) with error: 'c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"'

Two things:

  • The quote around the c column name ends up at the end (e.g. we have 'c now with the last quote at the end of this message.
  • I would expect a simple ColumnNotFound: 'c' - not something about LogicalPlan already failed.

@rben01
Copy link
Contributor Author

rben01 commented Feb 7, 2024

Oh, I see. Looks like the Rust error messages have been specially crafted so they look nice in Python. The error is not ColumnNotFoundError: c, but rather ColumnNotFoundError: c\n\nError originated just after….

On the Rust side, the current (v0.37.0) error message looks like this:

Err(ColumnNotFound(ErrString("c\n\nError originated just after this operation:\nDF
[\"a\", \"b\"]; PROJECT */2 COLUMNS; SELECTION: \"None\"")))

When it makes it to Python, I guess everything but the message itself is stripped off and converted to a ColumnNotFoundError and then the message is printed. So if I muck with the message I have to make sure it has the property that it looks nice in Python. Shouldn't be too hard to clean this up.

@rben01 rben01 requested a review from stinodego February 7, 2024 04:41
@rben01
Copy link
Contributor Author

rben01 commented Feb 7, 2024

So now, the error messages should look like:

polars.exceptions.ColumnNotFoundError: c

Error originated just after this operation:
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"'

LogicalPlan had already failed with the above error; after failure, 3 additional operations were attempted on the LazyFrame

I wasn't able to test in Python; I'm just going off my understanding of how the error messages on the rust side are translated into Python.

@ritchie46
Copy link
Member

If @stinodego is happy with the error messages, this can be merged. Thanks a lot for this @rben01

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@stinodego stinodego merged commit 9dfc9fd into pola-rs:main Feb 7, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-exceptions Area: exception handling fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exponentially large error strings when running operations on an already-errored LazyFrame
4 participants