-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use IndexOfAnyValues in the RegexCompiler and source gen #78927
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsCloses #78203 For the source generator, we emit file static class Utilities
{
internal static readonly IndexOfAnyValues<char> AsciiHexDigit =
IndexOfAnyValues.Create("0123456789ABCDEFabcdef");
internal static readonly IndexOfAnyValues<char> Ascii_800600008006 =
IndexOfAnyValues.Create("WYZwyz");
} Whereas for the RegexCompiler, I added a new span.IndexOfAny(this._indexOfAnyValues[42]); A few issues I came across while working on this:
@stephentoub Can you please take a look at this to see if it makes any sense? How do you go about benchmarking a change like this? Just pick a few patterns that you know will be affected?
|
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs
Outdated
Show resolved
Hide resolved
Is this something you plan to address in this PR, or we should address it subsequently?
Probably worth opening an issue for this. That said, I've seen diminishing returns in the past around the negated sets, as it's often (not always) the case that the thing you're then searching for occurs really soon.
It's lovely. Nice work.
Can you run outerloop locally as well? That'll run another ~18,000 patterns through the source generator and compiler.
For a sweeping change like this, I'll typically run all the regex tests in dotnet/performance, in particular https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs, https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Common.cs, and https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/runtime/BenchmarksGame/regex-redux-5.cs. |
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
807d926
to
48179aa
Compare
Cases like I'd have to check what this means for source-generated cases where there's less per-call overhead for I'll keep digging... |
Here are all the Regex benchmarks from The largest regression is for the |
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
…alues primary sets
75e8c8d
to
768a5eb
Compare
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Improvements on Windows-x64: |
Closes #78203
For the source generator, we emit
IndexOfAnyValues
definitions underUtilities
Whereas for the RegexCompiler, I added a new
IndexOfAnyValues<char>[]
field toCompiledRegexRunner
and the callers end up looking something likeA few issues I came across while working on this:
IndexOfAnyValues
instead ofInRange('a', 'd')
Chars
for theFixedDistanceSet
optimization don't support negated sets[^ab]
now end up usingIndexOfAnyValues
instead ofIndexOfAnyExcept('a', 'b')
IndexOfAnyValues
is probably inflated because of this@stephentoub Can you please take a look at this to see if it makes any sense?
The tests are passing, but I don't have in-depth knowledge about all the tradeoffs being made in Regex.
How do you go about benchmarking a change like this? Just pick a few patterns that you know will be affected?