Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix whitespace in number parsing #3195

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

thaliaarchi
Copy link
Contributor

@thaliaarchi thaliaarchi commented Nov 2, 2024

This is two separate number parsing bug fixes:

Do not skip leading whitespace in jvp_strtod

jvp_strtod skips leading whitespace, but its decnum counterpart decNumberFromString (called within jv_number_with_literal) does not. Those two are called interchangeably, so it leads to inconsistent behavior depending on whether the decnum feature is enabled. Additionally, classify, used in the token scanner, only considers [ \t\n\r] to be whitespace, but jvp_strtod consumes the larger set [ \t\n\v\f\r], so those extra characters are considered literals.

Commit ce0e788 (improve tonumber/0 performance by parsing input as number literal, 2024-03-02) changed tonumber to not accept leading or trailing whitespace. However, this case was missed.

Changing this deviates from the behavior of strdod from <stdlib.h> and is technically a breaking API change, since it is a public symbol.

$ ./configure
$ make -j8
$ ./jq -n '" 123" | tonumber'
jq: error (at <unknown>): string (" 123") cannot be parsed as a number
$ ./jq -n '"123 " | tonumber'
jq: error (at <unknown>): string ("123 ") cannot be parsed as a number

$ make clean
$ ./configure --disable-decnum
$ make -j8
$ ./jq -n '" 123" | tonumber'
123
$ ./jq -n '"123 " | tonumber'
jq: error (at <unknown>): string ("123 ") cannot be parsed as a number

Handle input errors for --indent

Handle malformed and overflowing arguments. Also, forbid leading and trailing spaces to match the behavior of tonumber.

This will produce a merge conflict in conjunction with PR#3194 Parse short options in order given, because they modify the same area. However, they are logically independent and can be merged in either order. After the first merge, I will rebase the second onto the first. Done

Handle malformed and overflowing arguments. Also, forbid leading and
trailing spaces to match the behavior of tonumber from commit ce0e788
(improve tonumber/0 performance by parsing input as number literal,
2024-03-02).
`jvp_strtod` skips leading whitespace, but its decnum counterpart
`decNumberFromString` (called within `jv_number_with_literal`) does not.
Those two are called interchangeably, so it leads to inconsistent
behavior depending on whether the decnum feature is enabled.
Additionally, `classify`, used in the token scanner, only considers
[ \t\n\r] to be whitespace, but `jvp_strtod` consumes the larger set
[ \t\n\v\f\r], so those extra characters are considered literals.

Changing this deviates from the behavior of `strdod` from <stdlib.h> and
is technically a breaking API change, since it is a public symbol.
tests/jq.test Show resolved Hide resolved
@wader wader requested review from emanuele6 and itchyny November 25, 2024 13:10
@emanuele6
Copy link
Member

The patch is fine, but I have missed why we want to make tonumber/0, in the next version of jq, no longer ignore trailing and leading whitespace instead of just making skip whitespace before and after also when using decNum to not change the behaviour.

Copy link
Member

@emanuele6 emanuele6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that is what gojq also does, I guess

@pkoppstein
Copy link
Contributor

emanuele6 wrote:

I have missed why we want to make tonumber/0, in the next version of jq, no longer ignore trailing and leading whitespace instead of just making skip whitespace before

This is a breaking change, and in my view quite a serious one.

Since there is no underlying bug at issue (in particular, the JSON spec for numbers does not make it so), I think the existing behavior is more than acceptable, and indeed quite conventional, e.g. in DuckDB
' 12 '::INT => 12, and I'm sure there are countless others like it.

@thaliaarchi
Copy link
Contributor Author

If the consensus is that whitespace should still be stripped, maintaining the released and earlier behavior, I'll rework this PR in the opposite direction. I personally don't see a good reason for the breaking change to forbid leading and trailing whitespace, so am in favor of this.

@wader
Copy link
Member

wader commented Dec 3, 2024

Trimming leading/trailing whitespace is fine with me 👍

@pkoppstein
Copy link
Contributor

@thaliaarchi wrote:

I personally don't see a good reason for the breaking change to forbid leading and trailing whitespace, so am in favor of this.

Thanks for your understanding.

It looks like there might not be absolute consensus regarding the change to tonumber/0 in the immediate future, so I hope you'll understand that in such cases where there is no compelling reason for a breaking change, the norm would be to defer it to the next "major" release, which in this case I believe would be 2.0.

If the performance difference between the two versions of tonumber/0 is judged to be so significant, then my suggestion would be to leave tonumber() alone and introduce a new filter, say tonumber/1, that would support the optimization. (E.g. tonumber(null) could disallow all extraneous characters, and tonumber(regex) could allow stripping of characters as determined by regex.)

errno = 0;
long indent = strtol(argv[i+1], &end, 10);
if (errno || indent < -1 || indent > 7 ||
isspace(*argv[i+1]) || end == NULL || *end) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have noticed two problems here:

  1. end == NULL doesn't make sense, it seems entirely unnecessary
  2. I think there should also be end == argv[i+1] as error condition otherwise --indent '' is accepted and treated as equivalent of --indent 0. Did you do this intentionally?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants