
Templatize enregisterLocalVars in various methods of LSRA #85144

Merged
merged 13 commits into from
May 12, 2023

Conversation

kunalspathak
Member

Let's see how much TP gain we get and if it is even worth pursuing.

Fixes: #82848

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 21, 2023
@ghost ghost assigned kunalspathak Apr 21, 2023
@ghost

ghost commented Apr 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Let's see how much TP gain we get and if it is even worth pursuing.

Fixes: #82848

Author: kunalspathak
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -


@kunalspathak
Member Author

TP diffs seem good for MinOpts, as expected, but I need to investigate why there are regressions, especially for Linux.

[image]

Here is the analysis summary for coreclr_tests collection of unix_x64_x64.dll:

??$resolveRegisters@$0A@@LinearScan@@QEAAXXZ                    : 5040463141  : NA       : 16.51% : +0.6076%
?initVarRegMaps@LinearScan@@AEAAXXZ                             : 3877517487  : NA       : 12.70% : +0.4674%
??$resolveRegisters@$00@LinearScan@@QEAAXXZ                     : 3765504006  : NA       : 12.33% : +0.4539%
?allocateRegisters@LinearScan@@QEAAXXZ                          : 2631515439  : +8.26%   : 8.62%  : +0.3172%
??$identifyCandidates@$00@LinearScan@@AEAAXXZ                   : 335048503   : NA       : 1.10%  : +0.0404%
?recordMaxSpill@LinearScan@@QEAAXXZ                             : 89011904    : NA       : 0.29%  : +0.0107%
?buildIntervals@LinearScan@@QEAAXXZ                             : 46684771    : +1.33%   : 0.15%  : +0.0056%
?identifyCandidates@LinearScan@@AEAAXXZ                         : -420751896  : -100.00% : 1.38%  : -0.0507%
?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z : -2041237366 : -30.87%  : 6.69%  : -0.2461%
?doLinearScan@LinearScan@@UEAA?AW4PhaseStatus@@XZ               : -3250564252 : -98.94%  : 10.65% : -0.3918%
?resolveRegisters@LinearScan@@QEAAXXZ                           : -9014105293 : -100.00% : 29.53% : -1.0866%

initVarRegMaps

From the disassembly, it seems that we were previously inlining this method, but with this PR we are not. I will revert this particular change.

[image]

identifyCandidates

From the disassembly, identifyCandidates<false> now gets inlined, hence we do not see its entry in the diff.

[image]

resolveRegisters

From the disassembly, resolveRegisters<false> removes a good chunk of code with this PR:

[image]

recordMaxSpill

From the disassembly, for some reason the recordMaxSpill() call has stopped being inlined in both the resolveRegisters<false> and resolveRegisters<true> cases.

[image]

Again, this analysis is based on what VC++ produced for clrjit_unix_x64_x64.dll and in no way represents how clang would compile clrjit.so for Unix.

As an experiment, I will push a change that templatizes buildIntervals() to see how different the TP improvements are. With that change, we see a 3KB increase in binary size.

@kunalspathak
Member Author

Will need to wait until we have a SuperPMI collection ready.

@kunalspathak
Member Author

Here are the latest numbers. I have no idea why the linux-x64 numbers are the exact opposite of the windows-x64 numbers, but I am not even sure how much we can rely on the linux-x64 numbers, as I mentioned in #85144 (comment). I will try templatizing buildIntervals() to see how much it impacts the TP.

[image]

@kunalspathak
Member Author

> I will try out templatizing buildIntervals() to see how much it impacts the TP.

It gets us an extra 0.05%.

[image]

@kunalspathak kunalspathak marked this pull request as ready for review April 27, 2023 04:23
@kunalspathak
Member Author

@dotnet/jit-contrib

@BruceForstall
Member

Diffs

The TP diffs are quite mixed, with big improvements and big regressions.

Is it worth it?

@kunalspathak
Member Author

> The TP diffs are quite mixed, with big improvements and big regressions.

From my understanding, the ones shown for Windows are real; the rest are artificial, IMO, because those binaries are not even compiled with VC++, and it is very hard to believe the numbers would flip between the Windows and altjit binaries.

> Is it worth it?

Probably. We are still eliminating branches in a lot of hot code (e.g. #85144 (comment)), and some of that is eliminated from the iterations over BasicBlocks and local variables, which unfortunately can't be measured using TP's instruction count.

@kunalspathak
Member Author


@BruceForstall - any thoughts on this?

@kunalspathak
Member Author

> which unfortunately can't be measured using TP's instruction count.

Reminds me of @tannergooding's comment in #83648 (comment).

@BruceForstall
Member

[There's some bug with the Diffs Extension page where it's showing multiple diff outputs. I guess that's because the diffs were run twice in the same PR.]

The MinOpts TP numbers look good for all platforms except linux-x64.

The FullOpts TP numbers are mixed for all platforms (regression for coreclr_tests especially, mostly improvements otherwise), but pure regression for linux-x64.

So:

  1. Why is it ok to accept the FullOpts regressions for all platforms? It's not clear why there should be any regression. Is there some particularly anomalous test in coreclr_tests?
  2. W.r.t. linux-x64, you seem to be arguing that the TP numbers are not valid because we're actually measuring TP using the RyuJIT cross-compiler on Windows, which was built with VC++. But I'm not sure why they are not valid. We are certainly assuming that they are a proxy for native TP, and that could be incorrect. However, VC++ might do a better job than clang/LLVM, just as it might do a worse one. The code path run by RyuJIT should be the same in both the native and cross-compiler cases.

@jakobbotsch There's an interesting point here: can we create a linux-x64 build of our instruction counting PIN tool, and add native linux-x64 TP measurement, such that we're measuring the clang-built RyuJIT? PIN does support linux-x64 builds. We could keep doing the linux-x64 cross-compiler TP which we could use to validate how representative our cross-compiler TP measurements are (by manually comparing the two runs). We currently don't do any SPMI AzDO pipeline runs on Linux, but it should work.

@kunalspathak
Member Author

kunalspathak commented May 4, 2023

> The MinOpts TP numbers look good for all platforms except linux-x64.

Yes, and I don't have any explanation for why that would be happening, especially since the Windows numbers show the opposite result and both the Windows and Linux (cross-compiled) binaries are produced by the same VC++ compiler.

> We are certainly assuming that they are a proxy for native TP

Yes, I think so too. My point is that the VC++ compiler could have made decisions that increase the number of instructions executed, and the eliminated branch instructions are part of that count. Even if the number of instructions has increased, the number of branches (which have higher latency and suffer branch mispredictions) is reduced (as shown in my analysis in #85144 (comment)), and that should lead to better execution performance. The current TP diff can in no way tell whether execution speed has improved, because we don't take into account the details of the instruction pipeline and instruction latency/throughput. I feel the TP diff is a good yardstick for whether a change could have a TP impact, but we need to interpret the numbers on a case-by-case basis.

EDIT:

> Is there some particularly anomalous test in coreclr_tests?

I am not sure how to find that out. Is there a way to do that?

@jakobbotsch
Member

> @jakobbotsch There's an interesting point here: can we create a linux-x64 build of our instruction counting PIN tool, and add native linux-x64 TP measurement, such that we're measuring the clang-built RyuJIT? PIN does support linux-x64 builds. We could keep doing the linux-x64 cross-compiler TP which we could use to validate how representative our cross-compiler TP measurements are (by manually comparing the two runs).

Yes, it would be possible (and we have talked about it before); it's just work. I've opened #85749 to track it.

@jakobbotsch
Member

> EDIT:
>
> Is there some particularly anomalous test in coreclr_tests?
>
> I am not sure how to find that out. Is there a way to do that?

My typical trick is the following:

diff --git a/src/coreclr/tools/superpmi/superpmi/superpmi.cpp b/src/coreclr/tools/superpmi/superpmi/superpmi.cpp
index c0d64df1acc..e31e423bf15 100644
--- a/src/coreclr/tools/superpmi/superpmi/superpmi.cpp
+++ b/src/coreclr/tools/superpmi/superpmi/superpmi.cpp
@@ -607,6 +607,12 @@ int __cdecl main(int argc, char* argv[])
 
                             break;
                         case NearDifferResult::SuccessWithoutDiff:
+                            PrintDiffsCsvRow(
+                                diffCsv,
+                                reader->GetMethodContextIndex(),
+                                mcb.size,
+                                baseMetrics.NumCodeBytes, diffMetrics.NumCodeBytes,
+                                baseMetrics.NumExecutedInstructions, diffMetrics.NumExecutedInstructions);
                             break;
                         case NearDifferResult::Failure:
                             if (o.mclFilename != nullptr)

Then you can invoke superpmi.exe with -diffsInfo -- you can get the command from superpmi.py tpdiff -f coreclr_tests and adjust it; e.g., it will end up being something like:

C:\dev\dotnet\spmi\pintools\1.0\windows\pin.exe -follow_execv -t C:\dev\dotnet\spmi\pintools\1.0\windows\clrjit_inscount_x64\clrjit_inscount.dll -quiet -- C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\superpmi.exe -applyDiff -diffsInfo diffs.csv -p C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\clrjit.dll C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\clrjit.dll C:\dev\dotnet\spmi\mch\387bcec3-9a71-4422-a11c-e7ce3b73c592.windows.x64\benchmarks.run_pgo.windows.x64.checked.mch

You can open diffs.csv in Excel and sort by a new column that computes the relative diff between the instruction numbers.

It would be nice to make this easier. I opened #85755 to track that.

@kunalspathak
Member Author

kunalspathak commented May 10, 2023

@BruceForstall @jakobbotsch - how do we fix this? I assume the jitrollingbuild should be using Clang 16, but I am not sure why it is on Clang 12.

[image]

@kunalspathak
Member Author

[23:27:18] SuperPMI encountered missing data. Missing with base JIT: 5557. Missing with diff JIT: 5557. Total contexts: 396321.

[23:27:18] Asm diffs summary:


Traceback (most recent call last):

  File "C:\h\w\AF5D093D\p\superpmi.py", line 4661, in <module>

    sys.exit(main(args))

  File "C:\h\w\AF5D093D\p\superpmi.py", line 4552, in main

    success = asm_diffs.replay_with_asm_diffs()

  File "C:\h\w\AF5D093D\p\superpmi.py", line 2019, in replay_with_asm_diffs

    self.write_asmdiffs_markdown_summary(write_fh, asm_diffs)

  File "C:\h\w\AF5D093D\p\superpmi.py", line 2170, in write_asmdiffs_markdown_summary

    write_row(*t)

  File "C:\h\w\AF5D093D\p\superpmi.py", line 2165, in write_row

    num_missed_base / num_contexts * 100,

ZeroDivisionError: division by zero

Also hitting this, although num_contexts seems to be > 0.

@BruceForstall
Member

> @BruceForstall @jakobbotsch - how do we fix this? I assume the jitrollingbuild should be using Clang 16, but not sure why it is on Clang 12?

It will fix itself. #84676 changed builds to use clang 16, but the rolling build hasn't run since then (it's running right now). After it builds, you'll need to merge & push (I think) so your baseline will be a newly built JIT.

@BruceForstall
Member

While we should protect against division by zero in the code, there's probably something more fundamental failing. What run was the python crash from?

I see your diffs run for win-x64 has:

[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] General fatal error

@kunalspathak
Member Author

Yes, this is the same error I am seeing in superpmi-replay, and it repros on main. Opened #86082 to track it.

@BruceForstall
Member

The TP diffs look like a pure win on linux-x64 native, which doesn't match the linux-x64 cross TP. Odd. The only remaining regression is the slight win-x86 one.

@kunalspathak
Member Author

> The TP diffs look like a pure win on linux-x64 native, which doesn't match the linux-x64 cross TP. Odd. The only remaining regression is the slight win-x86 one.

Yes, it seems like a good win on linux-x64 (native) MinOpts of 0.7%. Thanks for the review; it's great that this PR motivated us to build native Linux TP measurement, which is a good thing to have.

[image]

@kunalspathak kunalspathak merged commit 93134de into dotnet:main May 12, 2023
@kunalspathak kunalspathak deleted the enregisterLocalVars branch May 12, 2023 16:53
@ghost ghost locked as resolved and limited conversation to collaborators Jun 11, 2023
Development

Successfully merging this pull request may close these issues.

Consider passing enregisterLocalVars as a template argument to LinearScan