
Templatize enregisterLocalVars in various methods of LSRA #85144

Merged
merged 13 commits into from
May 12, 2023

Conversation

kunalspathak
Member

Let's see how much TP gain we get and if it is even worth pursuing.

Fixes: #82848

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 21, 2023
@ghost ghost assigned kunalspathak Apr 21, 2023
@ghost

ghost commented Apr 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Let's see how much TP gain we get and if it is even worth pursuing.

Fixes: #82848

Author: kunalspathak
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -


@kunalspathak
Member Author

TP diffs seem good for MinOpts, as expected, but I need to investigate why there are regressions, especially for Linux.

[image]

Here is the analysis summary for coreclr_tests collection of unix_x64_x64.dll:

??$resolveRegisters@$0A@@LinearScan@@QEAAXXZ                    : 5040463141  : NA       : 16.51% : +0.6076%
?initVarRegMaps@LinearScan@@AEAAXXZ                             : 3877517487  : NA       : 12.70% : +0.4674%
??$resolveRegisters@$00@LinearScan@@QEAAXXZ                     : 3765504006  : NA       : 12.33% : +0.4539%
?allocateRegisters@LinearScan@@QEAAXXZ                          : 2631515439  : +8.26%   : 8.62%  : +0.3172%
??$identifyCandidates@$00@LinearScan@@AEAAXXZ                   : 335048503   : NA       : 1.10%  : +0.0404%
?recordMaxSpill@LinearScan@@QEAAXXZ                             : 89011904    : NA       : 0.29%  : +0.0107%
?buildIntervals@LinearScan@@QEAAXXZ                             : 46684771    : +1.33%   : 0.15%  : +0.0056%
?identifyCandidates@LinearScan@@AEAAXXZ                         : -420751896  : -100.00% : 1.38%  : -0.0507%
?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z : -2041237366 : -30.87%  : 6.69%  : -0.2461%
?doLinearScan@LinearScan@@UEAA?AW4PhaseStatus@@XZ               : -3250564252 : -98.94%  : 10.65% : -0.3918%
?resolveRegisters@LinearScan@@QEAAXXZ                           : -9014105293 : -100.00% : 29.53% : -1.0866%

initVarRegMaps

From the disassembly, it seems that we were previously inlining this method, but with this PR we are not. I will revert this particular change.

[image]

identifyCandidates

From the disassembly, identifyCandidates<false> now gets inlined, hence we do not see its entry in the diff.

[image]

resolveRegisters

From the disassembly, resolveRegisters<false> removes a good chunk of code with this PR:

[image]

recordMaxSpill

From the disassembly, for some reason the recordMaxSpill() call has stopped being inlined in both the resolveRegisters<false> and resolveRegisters<true> cases.

[image]

Again, this analysis is based on what VC++ produced for clrjit_unix_x64_x64.dll and in no way represents how clang would compile clrjit.so for Unix.

As an experiment, I will push a change that templatizes buildIntervals() to see how different the TP improvements are. With that change, we see a 3KB increase in binary size.

@kunalspathak
Member Author

Will need to wait until we have a SuperPMI collection ready.

@kunalspathak
Member Author

Here are the latest numbers. I have no idea why the linux-x64 numbers are the exact opposite of the windows-x64 numbers, but I am not even sure how much we can rely on the linux-x64 numbers, as I mentioned in #85144 (comment). I will try templatizing buildIntervals() to see how much it impacts the TP.

[image]

@kunalspathak
Member Author

> I will try out templatizing buildIntervals() to see how much it impacts the TP.

It gets us an extra 0.05%.

[image]

@kunalspathak kunalspathak marked this pull request as ready for review April 27, 2023 04:23
@kunalspathak
Member Author

@dotnet/jit-contrib

@BruceForstall
Member

Diffs

The TP diffs are quite mixed, with big improvements and big regressions.

Is it worth it?

@kunalspathak
Member Author

> The TP diffs are quite mixed, with big improvements and big regressions.

From my understanding, the ones shown for Windows are real; the rest are artificial, IMO, because those binaries are not even compiled with VC++, and it is very hard to believe the numbers would flip between the Windows and altjit binaries.

> Is it worth it?

Probably. We are still eliminating branches in a lot of hot code (e.g. #85144 (comment)), and some of that is eliminated from the iterations over BasicBlocks and local variables, which unfortunately can't be measured using TP's instruction count.

@kunalspathak
Member Author


@BruceForstall - any thoughts on this?

@kunalspathak
Member Author

> which unfortunately can't be measured using TP's instruction count.

Reminds me of @tannergooding's comment in #83648 (comment).

@BruceForstall
Member

[There's some bug with the Diffs Extension page where it's showing multiple diff outputs. I guess that's because the diffs were run twice in the same PR.]

The MinOpts TP numbers look good for all platforms except linux-x64.

The FullOpts TP numbers are mixed for all platforms (regression for coreclr_tests especially, mostly improvements otherwise), but pure regression for linux-x64.

So:

  1. Why is it ok to accept the FullOpts regressions for all platforms? It's not clear why there should be any regression. Is there some particularly anomalous test in coreclr_tests?
  2. W.r.t. linux-x64, you seem to be arguing that the TP numbers are not valid because we're actually measuring TP using the RyuJIT cross-compiler on Windows, which was built with VC++. But I'm not sure why they are not valid. We are certainly assuming that they are a proxy for native TP, and that could be incorrect. However, VC++ might do a better job than clang/LLVM, just as it might do a worse one. The code path run by RyuJIT should be the same in both the native and cross-compiler cases.

@jakobbotsch There's an interesting point here: can we create a linux-x64 build of our instruction counting PIN tool, and add native linux-x64 TP measurement, such that we're measuring the clang-built RyuJIT? PIN does support linux-x64 builds. We could keep doing the linux-x64 cross-compiler TP which we could use to validate how representative our cross-compiler TP measurements are (by manually comparing the two runs). We currently don't do any SPMI AzDO pipeline runs on Linux, but it should work.

@kunalspathak
Member Author

kunalspathak commented May 4, 2023

> The MinOpts TP numbers look good for all platforms except linux-x64.

Yes, and I don't have any explanation for why that would be happening, especially since the Windows numbers show the opposite result and both the Windows and Linux (cross-compiled) binaries are produced by the same VC++ compiler.

> We are certainly assuming that they are a proxy for native TP

Yes, I think so too. My point is that the VC++ compiler could have made decisions that increase the number of instructions executed, and the eliminated branch instructions are part of that count. Even if the number of instructions has increased, the number of branches (which have higher latency and suffer branch mispredictions) is reduced (as shown in my analysis in #85144 (comment)), and that should lead to better execution performance. The current TP diff can in no way tell whether execution speed has improved, because we don't take into account the details of the instruction pipeline and instruction latency/throughput. I feel the TP diff is a good yardstick for whether a change could have a TP impact, but we need to interpret the numbers on a case-by-case basis.

EDIT:

> Is there some particularly anomalous test in coreclr_tests?

I am not sure how to find that out. Is there a way to do that?

@jakobbotsch
Member

> @jakobbotsch There's an interesting point here: can we create a linux-x64 build of our instruction counting PIN tool, and add native linux-x64 TP measurement, such that we're measuring the clang-built RyuJIT? PIN does support linux-x64 builds. We could keep doing the linux-x64 cross-compiler TP which we could use to validate how representative our cross-compiler TP measurements are (by manually comparing the two runs).

Yes, it would be possible (and we have talked about it before); it's just work. I've opened #85749 to track it.

@jakobbotsch
Member

> EDIT:
>
> Is there some particularly anomalous test in coreclr_tests?
>
> I am not sure how to find that out. Is there a way to do that?

My typical trick is the following:

diff --git a/src/coreclr/tools/superpmi/superpmi/superpmi.cpp b/src/coreclr/tools/superpmi/superpmi/superpmi.cpp
index c0d64df1acc..e31e423bf15 100644
--- a/src/coreclr/tools/superpmi/superpmi/superpmi.cpp
+++ b/src/coreclr/tools/superpmi/superpmi/superpmi.cpp
@@ -607,6 +607,12 @@ int __cdecl main(int argc, char* argv[])
 
                             break;
                         case NearDifferResult::SuccessWithoutDiff:
+                            PrintDiffsCsvRow(
+                                diffCsv,
+                                reader->GetMethodContextIndex(),
+                                mcb.size,
+                                baseMetrics.NumCodeBytes, diffMetrics.NumCodeBytes,
+                                baseMetrics.NumExecutedInstructions, diffMetrics.NumExecutedInstructions);
                             break;
                         case NearDifferResult::Failure:
                             if (o.mclFilename != nullptr)

Then you can invoke superpmi.exe with -diffsInfo -- you can get the command from superpmi.py tpdiff -f coreclr_tests and adjust it; e.g., it will end up being something like:

C:\dev\dotnet\spmi\pintools\1.0\windows\pin.exe -follow_execv -t C:\dev\dotnet\spmi\pintools\1.0\windows\clrjit_inscount_x64\clrjit_inscount.dll -quiet -- C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\superpmi.exe -applyDiff -diffsInfo diffs.csv -p C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\clrjit.dll C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\clrjit.dll C:\dev\dotnet\spmi\mch\387bcec3-9a71-4422-a11c-e7ce3b73c592.windows.x64\benchmarks.run_pgo.windows.x64.checked.mch

You can open diffs.csv in Excel and sort by a new column that computes the relative diff between the instruction numbers.

It would be nice to make this easier. I opened #85755 to track that.

@kunalspathak
Member Author

kunalspathak commented May 10, 2023

@BruceForstall @jakobbotsch - how do we fix this? I assume the jitrollingbuild should be using Clang 16, but I am not sure why it is on Clang 12.

[image]

@kunalspathak
Member Author

[23:27:18] SuperPMI encountered missing data. Missing with base JIT: 5557. Missing with diff JIT: 5557. Total contexts: 396321.

[23:27:18] Asm diffs summary:


Traceback (most recent call last):

  File "C:\h\w\AF5D093D\p\superpmi.py", line 4661, in <module>

    sys.exit(main(args))

  File "C:\h\w\AF5D093D\p\superpmi.py", line 4552, in main

    success = asm_diffs.replay_with_asm_diffs()

  File "C:\h\w\AF5D093D\p\superpmi.py", line 2019, in replay_with_asm_diffs

    self.write_asmdiffs_markdown_summary(write_fh, asm_diffs)

  File "C:\h\w\AF5D093D\p\superpmi.py", line 2170, in write_asmdiffs_markdown_summary

    write_row(*t)

  File "C:\h\w\AF5D093D\p\superpmi.py", line 2165, in write_row

    num_missed_base / num_contexts * 100,

ZeroDivisionError: division by zero

Also hitting this, although num_contexts seems to be > 0.

@BruceForstall
Member

> @BruceForstall @jakobbotsch - how do we fix this? I assume the jitrollingbuild should be using Clang 16, but not sure why it is on Clang 12?

It will fix itself. #84676 changed builds to use clang 16, but the rolling build hasn't run since then (it's running right now). After it builds, you'll need to merge & push (I think) so your baseline will be a newly built JIT.

@BruceForstall
Member

While we should protect against division by zero in the code, there's probably something more fundamental failing. What run was the python crash from?

I see your diffs run for win-x64 has:

[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] ERROR: Couldn't load base metrics summary created by child process
[22:56:59] General fatal error

@kunalspathak
Member Author

Yes, this is the same error I am seeing in superpmi-replay, and it repros on main. Opened #86082 to track it.

@BruceForstall
Member

The TP diffs look like a pure win on linux-x64 native, which doesn't match the linux-x64 cross TP. Odd. The only remaining regression is the slight win-x86 one.

@kunalspathak
Member Author

> The TP diffs look like a pure win on linux-x64 native, which doesn't match the linux-x64 cross TP. Odd. The only remaining regression is the slight win-x86 one.

Yes, it seems like a good win on linux-x64 (native) MinOpts of 0.7%. Thanks for the review; it's great that this PR motivated us to build native Linux TP measurement, which is a good thing to have.

[image]

@kunalspathak kunalspathak merged commit 93134de into dotnet:main May 12, 2023
@kunalspathak kunalspathak deleted the enregisterLocalVars branch May 12, 2023 16:53
@ghost ghost locked as resolved and limited conversation to collaborators Jun 11, 2023
Development

Successfully merging this pull request may close these issues.

Consider passing enregisterLocalVars as a template argument to LinearScan