-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix whitespace in number parsing #3195
base: master
Are you sure you want to change the base?
Conversation
7671872
to
f6fdfee
Compare
Handle malformed and overflowing arguments. Also, forbid leading and trailing spaces to match the behavior of tonumber from commit ce0e788 (improve tonumber/0 performance by parsing input as number literal, 2024-03-02).
`jvp_strtod` skips leading whitespace, but its decnum counterpart `decNumberFromString` (called within `jv_number_with_literal`) does not. Those two are called interchangeably, so it leads to inconsistent behavior depending on whether the decnum feature is enabled. Additionally, `classify`, used in the token scanner, only considers [ \t\n\r] to be whitespace, but `jvp_strtod` consumes the larger set [ \t\n\v\f\r], so those extra characters are considered literals. Changing this deviates from the behavior of `strdod` from <stdlib.h> and is technically a breaking API change, since it is a public symbol.
f6fdfee
to
f1bfe95
Compare
The patch is fine, but I have missed why we want to make tonumber/0, in the next version of jq, no longer ignore trailing and leading whitespace instead of just making skip whitespace before and after also when using decNum to not change the behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because that is what gojq also does, I guess
emanuele6 wrote:
This is a breaking change, and in my view quite a serious one. Since there is no underlying bug at issue (in particular, the JSON spec for numbers does not make it so), I think the existing behavior is more than acceptable, and indeed quite conventional, e.g. in DuckDB |
If the consensus is that whitespace should still be stripped, maintaining the released and earlier behavior, I'll rework this PR in the opposite direction. I personally don't see a good reason for the breaking change to forbid leading and trailing whitespace, so am in favor of this. |
Trimming leading/trailing whitespace is fine with me 👍 |
@thaliaarchi wrote:
Thanks for your understanding. It looks like there might not be absolute consensus regarding the change to tonumber/0 in the immediate future, so I hope you'll understand that in such cases where there is no compelling reason for a breaking change, the norm would be to defer it to the next "major" release, which in this case I believe would be 2.0. If the performance difference between the two versions of tonumber/0 is judged to be so significant, then my suggestion would be to leave tonumber() alone and introduce a new filter, say tonumber/1, that would support the optimization. (E.g. tonumber(null) could disallow all extraneous characters, and tonumber(regex) could allow stripping of characters as determined by regex.) |
errno = 0; | ||
long indent = strtol(argv[i+1], &end, 10); | ||
if (errno || indent < -1 || indent > 7 || | ||
isspace(*argv[i+1]) || end == NULL || *end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have noticed two problems here:
end == NULL
doesn't make sense, it seems entirely unnecessary- I think there should also be
end == argv[i+1]
as error condition otherwise--indent ''
is accepted and treated as equivalent of--indent 0
. Did you do this intentionally?
This is two separate number parsing bug fixes:
Do not skip leading whitespace in
jvp_strtod
jvp_strtod
skips leading whitespace, but its decnum counterpartdecNumberFromString
(called withinjv_number_with_literal
) does not. Those two are called interchangeably, so it leads to inconsistent behavior depending on whether the decnum feature is enabled. Additionally,classify
, used in the token scanner, only considers[ \t\n\r]
to be whitespace, butjvp_strtod
consumes the larger set[ \t\n\v\f\r]
, so those extra characters are considered literals.Commit ce0e788 (improve tonumber/0 performance by parsing input as number literal, 2024-03-02) changed
tonumber
to not accept leading or trailing whitespace. However, this case was missed.Changing this deviates from the behavior of
strdod
from <stdlib.h> and is technically a breaking API change, since it is a public symbol.Handle input errors for --indent
Handle malformed and overflowing arguments. Also, forbid leading and trailing spaces to match the behavior of
tonumber
.This will produce a merge conflict in conjunction with PR#3194 Parse short options in order given, because they modify the same area. However, they are logically independent and can be merged in either order. After the first merge, I will rebase the second onto the first.Done