Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to match known issues using regex #11209

Closed
markwilkie opened this issue Oct 10, 2022 · 30 comments
Closed

Add the ability to match known issues using regex #11209

markwilkie opened this issue Oct 10, 2022 · 30 comments
Assignees

Comments

@markwilkie
Copy link
Member

markwilkie commented Oct 10, 2022

Today, known issues are matched using simply string matching only. There's a request to expand this to include regex. See below for a specific example from @jkotas and short discussion.

cc/ @missymessa @ulisesh

====================
From Jan K:

It would work for this specific case, but I would not want to do the math to try to split the search into two stages. Figuring out the right classification is difficult enough already.

What is the problem with just allowing regexes? The regex engines are pretty good at figuring out how to search for the set of strings or set of string patterns in the most efficient way.

 it might be possible to decorate exit code 1 with the real error? At least in some cases, we’ve been successful with that approach.
This is again about making changes is msbuild with 1 – 2 months turnaround.

From: Mark Wilkie <mawilkie@microsoft.com>
Sent: Thursday, October 6, 2022 2:51 PM
To: Jan Kotas <jkotas@microsoft.com>
Subject: RE: Known issue classification limitations

Also – it might be possible to decorate exit code 1 with the real error? At least in some cases, we’ve been successful with that approach.

From: Mark Wilkie
Sent: Thursday, October 6, 2022 2:49 PM
To: Jan Kotas <jkotas@microsoft.com>
Subject: RE: Known issue classification limitations

I see – better error messages would be great, but it’ll take a while (if ever).

I wonder if a “2 stage” search might work? For example, “exit code 1” for initial search, then “foo bar” to further refine. Do you think that’s right?

From: Jan Kotas <jkotas@microsoft.com>
Sent: Thursday, October 6, 2022 2:40 PM
To: Mark Wilkie <mawilkie@microsoft.com>
Subject: RE: Known issue classification limitations

It would be nice, but I am not sure what it would mean here.

The underlying problem are unexpected msbuild server crashes. Improvements in error messages for msbuild server crashes would need to be first done in msbuild (if it is even possible), then shipped in preview/RC, and then the preview/RC would need to promoted to LKG that is used for dotnet/runtime build. It is 1 – 2 month turnaround.

From: Mark Wilkie <mawilkie@microsoft.com>
Sent: Thursday, October 6, 2022 2:10 PM
To: Jan Kotas <jkotas@microsoft.com>
Subject: RE: Known issue classification limitations

Is this the kind of thing that would be better served by improving the error message itself?

From: Jan Kotas <jkotas@microsoft.com>
Sent: Thursday, October 6, 2022 1:58 PM
To: Mark Wilkie <mawilkie@microsoft.com>
Subject: Known issue classification limitations

https://github.com/dotnet/runtime/issues/74328

This issue is matching: ErrorMessage:"Build process exited with non-zero exit code: 1" today.

The problem is that this issue has multiple failure modes. For example, it can manifest itself as this https://github.com/dotnet/runtime/issues/75429#issue-1369121436

The workaround would be to have a separate active issue on each of the possible failures modes, however it is not something that devs like. The standard operating procedure is to have one tracking issue for one underlying problem and close the rest as dups.

Thanks,
Jan

(I will keep replying to this email as I hit more of these.)

Release Note Description

Known issues support for error matching using regex

@missymessa
Copy link
Member

@ulisesh when you get back from PTO, could you share what challenges you ran into regarding regex (if any) previously?

@ulisesh
Copy link
Contributor

ulisesh commented Oct 11, 2022

There are two concerns about regex:

  • Processing a regex can be considerably slower than String.Contains. Performance is very important here because we process thousands of lines trying to find a match
  • Security implications of using a regex provided in a public repo

@ulisesh
Copy link
Contributor

ulisesh commented Oct 11, 2022

Based on the feedback we have received, there is still work to be done to improve the error message people can use for known issues but I'm not sure if regex is the solution

@jkotas
Copy link
Member

jkotas commented Oct 11, 2022

Processing a regex can be considerably slower than String.Contains

We have a very fast regex engine in .NET Core. Have you looked into using that?

You can also build one composite regex that scans for all Known Build Errors at once. I expect that scanning each log using composite regex is going to faster than calling String.Contains 100+ times. Given the current trend, the number of Known Build Error issues in dotnet/runtime is typically going to in hundreds.

Security implications of using a regex provided in a public repo

I assume that you meant denial-of-service attacks. The regex engines that come with .NET Core have timeout option to mitigate this risk. Also, you can also do mitigations like verifying that the source of Known Build Error text came from a person with enough privileges.

@markwilkie
Copy link
Member Author

For your example @jkotas, what's an example regex you'd use? We have historically shied away from regex, but it sounds like we can put up guard rails and perhaps be ok?

@alexperovich
Copy link
Member

If we really want to be safe about it we could use the new "NonBacktracking" regex option that is in .net today. That one doesn't support some weird constructs in regex, but guarantees linear performance.

@ChadNedzlek
Copy link
Member

Based on my understanding, any quantifier (like, which is incredibly common, *) is "backtracking", so I'm not sure we could realistically disable that. Setting some really, really small timeout might work?

@alexperovich
Copy link
Member

the non-backtracking engine works fine with any quantifiers, its just things like backreferences and lookarounds that aren't supported. These are the not-supported things from https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/

The new RegexOptions.NonBacktracking option doesn’t support everything the other built-in engines support. In particular, the option can’t be used in conjunction with RegexOptions.RightToLeft or RegexOptions.ECMAScript, and it doesn’t allow for the following constructs in the pattern:

Atomic groups
Backreferences
Balancing groups
Conditional
Lookarounds
Start anchors (\G)

@alexperovich
Copy link
Member

A pattern like stuff(a|b)*pizza works fine in a non-backtracking engine. The only difference is non-backtracking is O(n) best and worst case, and backtracking is O(1) best and O(2^n) worse case.

@jkotas
Copy link
Member

jkotas commented Oct 11, 2022

For your example @jkotas, what's an example regex you'd use? We have historically shied away from regex, but it sounds like we can put up guard rails and perhaps be ok?

I would use (Build process exited with non-zero exit code: 1|BuildFailureException : Command failed with exit code 1)

A better example for regex would be dotnet/runtime#76759 . The error message contains line number that it is fragile. It is going to stop matching if somebody adds or removes a line from the file. I would like to match any number in cases like this.

I think limiting the regexes to non-backtracking ones is fine.

@ChadNedzlek
Copy link
Member

ChadNedzlek commented Oct 11, 2022

There's a whole bunch of interesting advice here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices

A few things that look really innocuous are called out as actually shockingly dangerous. It's actually really easy to create regexs that take really long to fail to find a match (generally, positive matching is pretty good, it's the ones that don't match, but have lots of backtracking that can get dicey).

Related to the trust problem. I don't think we can actually tell who made the text be what it is: https://docs.github.com/en/developers/webhooks-and-events/events/issue-event-types doesn't list anything that looks like "description changed", and anyone can modify the description of an issue, since they are in a public repo. We'd have to use a different mechanism that allows auditing, I think.

@alexperovich
Copy link
Member

The non-backtracking stuff removes most of those concerns. The processing time is guaranteed to be linear in the size of the input.

@alexperovich
Copy link
Member

well, technically its O(n * m^2) but m is the size of the pattern, and that can be restricted to a small number.

@hoyosjs
Copy link
Member

hoyosjs commented Oct 19, 2022

Honestly the more I look at runtime, the more I feel like this could really help. A bunch of the times it's things like generic type names, line number, thread IDs, and such that are embedded in the middle of the message.

@markwilkie
Copy link
Member Author

Yep - there's momentum growing for doing this. Thoughts @ulisesh ?

@ulisesh
Copy link
Contributor

ulisesh commented Oct 19, 2022

Based on what @alexperovich said, it sounds like we can give regex a try. I think I can implement this next week 😄

@alexperovich
Copy link
Member

I think it would be useful to disallow '.' in the regexes. Its a lazy alternative to something like [0-9] and 99/100 times its not needed.

@ChadNedzlek
Copy link
Member

It really makes them a lot easier to read though. Like, if it's a package name or a file name, do you really want to read "(?:\d|\w|_|-|\.|)* and try to figure out that's just . with extra steps?

@ulisesh ulisesh self-assigned this Oct 24, 2022
@ulisesh
Copy link
Contributor

ulisesh commented Oct 25, 2022

Security concerns were brought up in the stand up.

I was told @blowdart could help us here.

@blowdart, the scenario is:

We have automation that takes error messages from GitHub issues and looks for matches in the logs of our public builds. Right now, the matching happens with String.Contains but we would like to use RegEx to increase the number of matches the automation can find.

Any concerns or suggestions?

@blowdart
Copy link

Oh boy do I 😈

So, who creates the issues? Is it a bot, or is it anyone who can? If it's a bot, who controls the bot and writes the code? What type of error messages? Is there a consistent format? And so on...

The problem here is that regular expressions, typically ones that contain backtracking, but not always, can lead to Denial of Service attacks. If you consider the issue contents untrusted, and you should, then using regular expressions, unless very simple, on untrusted input should be avoided. Any use of regular expressions must be accompanied by a timeout, which, unfortunately is not the default constructor.

I leave discussions of readability to everyone else as I agree they're hard to read and maintain and lower the pool of contributors as people who don't know them well would have no confidence in updating or maintaining them.

@ulisesh
Copy link
Contributor

ulisesh commented Oct 25, 2022

Thanks for your response. Anyone can add/change this kind of issues. #11369 is a good example. We read the json blob and the automation uses the ErrorMessage string to match build failures with the issue.

@alexperovich
Copy link
Member

We were going to use the non-backtracking regex engine that was added in .net 7, that should reduce issues, and combined with a timeout is even better.

@hoyosjs
Copy link
Member

hoyosjs commented Oct 25, 2022

Another thing that I don't know if possible is to add tokenization e.g.: (?:\d|\w|_|-|.|)* is packageName or something. Essentially, this is about dev experience and there's ways that are perhaps doable without compromising security and readability (and not all that much perf cost).

@jkotas
Copy link
Member

jkotas commented Oct 25, 2022

So, who creates the issues?

It is the same set of people who can submit PRs to our repos. Bad actors can DoS our engineering system by submitting malicious PRs, that will cause a lot more damage that a bad regex here. In case this ever happens, we will block the bad actors github account and move on.

I assume that the issue regex matching runs on a single machine, so the blast radius of malicious regex is very small (one machine, team not blocked from doing work). Blast radius of malicious PRs is much larger (hundreds of machines, team severely blocked).

@blowdart
Copy link

A smaller radius that can still be mitigated should be. So, yes, timeouts and the newer not-backtracking will work.

@jkotas
Copy link
Member

jkotas commented Oct 25, 2022

I agree that we should not be leaving doors open needlessly.

@markwilkie
Copy link
Member Author

thanks all

@ulisesh
Copy link
Contributor

ulisesh commented Oct 26, 2022

Thanks for your input, I'll work on implementing this and build some examples of how the new error messages will look like if we use RegEx.

@ulisesh
Copy link
Contributor

ulisesh commented Dec 5, 2022

@ulisesh ulisesh closed this as completed Dec 5, 2022
@jkotas
Copy link
Member

jkotas commented Dec 8, 2022

@ulisesh Could you please add a documentation for how to use this to https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#how-to-fill-out-a-known-issue-error-message-section ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants