Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ecma ranges with set terminator #55

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

stevenh
Copy link

@stevenh stevenh commented Dec 6, 2022

Fix ECMAScript un-escaped literal '-' when followed by predefined character sets.

Also:

  • Fixed missing error check on parseProperty() call.
  • Use addChar(ch) helper instead of addRange(ch, ch).

Fixes #54

Fix ECMAScript un-escaped literal '-' when followed by predefined
character sets.

Also:
* Fixed missing error check on parseProperty() call.
* Use addChar(ch) helper instead of addRange(ch, ch).

Fixes dlclark#54
@stevenh stevenh force-pushed the fix/ecmascript-ranges branch from 00a2f4c to 7c81fa4 Compare December 6, 2022 16:15
@dlclark
Copy link
Owner

dlclark commented Dec 12, 2022

Hey, thanks for this. I have a few notes:

  • I'd like to see tests for all the impacted shortcuts code paths, \d \s \w \p
  • The inRange flag is not reset when we figure out we're not actually in a range, what are the implications of this? It seems like our parsing state would be messed up and could lead to bad parsing. [a-\s\b] should parse fine, but currently fails, I suspect related to this parsing state issue.
  • For the behavior I did some research into javascript regex engines and it appears:
    • [\w-z] should fail to parse (currently doesn't fail)
    • The current regexp2 behavior matches the unicode option in a javascript engine. If we're changing this behavior we need to maintain the original compatibility with unicode option. This research opened my eyes to a whole class of bugs that likely exist in regexp2.
    • If you don't use the unicode option in a javascript then \p shouldn't be a special case at all! (it just means p)

@stevenh
Copy link
Author

stevenh commented Dec 18, 2022

Nice catches, working on fixing these.

Can you confirmed the desired compatibility is it PCRE or PCRE2?

I ask as testoutput1 contains some patterns which are valid for PCRE but not PCRE2, namely:

  • /[\d-z]+/
  • /^[\d-a]/

I'm guessing PCRE2 as that's what .NET seems to match as per https://regex101.com/

@dlclark
Copy link
Owner

dlclark commented Dec 18, 2022

Desired base compatibility is the .NET engine.

Ignore the files created by VSCode.
Handle Unicode code points in ECMA mode with both the flag enabled and
disabled.

This removes four tests which were incorrectly passing due to missing
detection of invalid ranges, aligning with the behaviour of .NET and
PCRE2 as opposed to PCRE.
@stevenh
Copy link
Author

stevenh commented Dec 18, 2022

I think I've covered most the bases with this update, but definitely needs a second pair of eyes.

@stevenh
Copy link
Author

stevenh commented Jan 13, 2023

@dlclark in case you missed the notification.

@dlclark
Copy link
Owner

dlclark commented Jan 13, 2023

Thanks for the reminder @stevenh!

In reviewing the commit I see several regex's that used to work but now return errors. I know the new behavior matches C#, but I don't want to break regex's that may be out in the wild already working. It's fine if new regex patterns suddenly stop throwing errors during compile, but the corpus of existing unit tests should still pass as-is.

I'm more willing to make breaking changes behind the ECMAScript flag since people use that option for compatibility with a standard.

Thoughts?

@stevenh
Copy link
Author

stevenh commented Jan 16, 2023

That's what I thought, but testing some more it seems that regexp101 C# implementation actually has a bug and reports this as failure incorrectly as confirmed by this fiddler share

So I need to update so these patterns are once again valid anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ecmascript: cannot include class \s in character range
2 participants