Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Neoverse V2 and Armv9 #79

Merged
merged 4 commits into from
Nov 6, 2023
Merged

Add Neoverse V2 and Armv9 #79

merged 4 commits into from
Nov 6, 2023

Conversation

AdhocMan
Copy link
Contributor

Add support for Armv9 and Neoverse v2, as required for the Nvidia Grace CPU.

@alalazo alalazo self-assigned this Oct 28, 2023
@alalazo alalazo added the enhancement New feature or request label Oct 28, 2023
@alalazo
Copy link
Member

alalazo commented Oct 28, 2023

I'll follow with a PR referencing this on the archspec side and check whether tests passes. @fspiga Do you (or anybody from NVIDIA) want to double check this?

@fspiga
Copy link

fspiga commented Oct 30, 2023

I have staged changes for the similar type of enhancements in my local system and I have been testing things for few weeks.

For GCC/GNU, re use of -mtune and -march, see Compiler flags across architectures: -march, -mtune, and -mcpu. As long as someone is not cross compiling, we advise to just use -mcpu for 12.2 onward. It is possible to compile anything with a older GCC but we strongly advise to use a modern version of the compiler. On any Grace platform, a modern software stack equals higher chances of better generated code and performance.

Re NVHPC, IIRC neoverse-v2 supports in 23.3 was like a beta. We advise to go from 23.5 where we added a first wave of bug fixes and enhahncements. I will verify this.

Re CLANG, I am not sure the -msve-vector-bits=128 is necessary. The -mcpu=neoverse-v should already imply 128bits SIMD (it is a fixed uarch property). I will verify this.

It is worth to add Arm HPC Compiler in this mix. Support for V2 starts from version 23.04.1. We can have someone from Arm Ltd double-check this.

In my working copy, anything that is below the reccomended versions for Grace goes to neoverse_n1 or armv8.4a. This is because, at least in my mind, whoever gets a Grace system will also have modern Linux Kernel (6.2+) and a recent OS and take the step to use modern compilers. Internally we are not testing for performancer or optimal code generation on older GNU on Grace. We know that, thanks to Arm ecosystem, any generic armv8 binary with NEON enabled will run.

@AdhocMan
Copy link
Contributor Author

For GCC/GNU, re use of -mtune and -march, see Compiler flags across architectures: -march, -mtune, and -mcpu. As long as someone is not cross compiling, we advise to just use -mcpu for 12.2 onward. It is possible to compile anything with a older GCC but we strongly advise to use a modern version of the compiler. On any Grace platform, a modern software stack equals higher chances of better generated code and performance.

As far as I know, GCC only added support for -mpcu=neoverse-v2 in version 13. Am I mistaken or are you suggesting to use a different cpu target for 12.2?

Re NVHPC, IIRC neoverse-v2 supports in 23.3 was like a beta. We advise to go from 23.5 where we added a first wave of bug fixes and enhahncements. I will verify this.

That would be great, thanks.

Re CLANG, I am not sure the -msve-vector-bits=128 is necessary. The -mcpu=neoverse-v should already imply 128bits SIMD (it is a fixed uarch property). I will verify this.

The documentation of both GCC and CLANG state that the default is 'scalable'. I could not very if this is included in the mcpu option, so I've kept it when possible. I'll remove it if you can verify that it is redundant.

It is worth to add Arm HPC Compiler in this mix. Support for V2 starts from version 23.04.1. We can have someone from Arm Ltd double-check this.

According to the release notes, it was added to 23.04.0: https://developer.arm.com/documentation/107578/2304/?lang=en
I'll add the flags starting from that release.

In my working copy, anything that is below the reccomended versions for Grace goes to neoverse_n1 or armv8.4a. This is because, at least in my mind, whoever gets a Grace system will also have modern Linux Kernel (6.2+) and a recent OS and take the step to use modern compilers. Internally we are not testing for performancer or optimal code generation on older GNU on Grace. We know that, thanks to Arm ecosystem, any generic armv8 binary with NEON enabled will run.

I'd also expect that for any production workload, a recent compiler can be expected on Grace systems. But most other architectures specified here also add some tuning flags for older GCC versions as well. So in order to stay consistent, I took the 'neoverse_n1' specification as a guideline.
For these older versions, I mainly relied on documentation only, so these flags are not necessarily the optimum, but represent an educated guess only.

Copy link
Member

@alalazo alalazo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fspiga Can you double check the few items that are on hold here? It would be good to have v2 support in the upcoming major release of Spack 🙂

},
{
"versions": "8.0:8.99",
"flags" : "-march=armv8.4-a+sve -mtune=cortex-a72 -msve-vector-bits=128"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "-msve-vector-bits=128" should not be used in this entire PR. It is making vector-length specific (VLS) SVE which can only be guaranteed to run correctly if the platform has that SVE vector length. We need to produce binaries compiled on V2 that would not risk wrong behaviour on (eg) V1 where VL is 256 in SVE.

There is the warning in the GCC manual:

[..] At present ‘-msve-vector-bits=128’ also generates vector-length agnostic output for big-endian targets. All other values generate vector-length specific code. The behavior of these values may change in future releases and no value except ‘scalable’ should be relied on for producing code that is portable across different hardware SVE vector lengths.

Changes requested:

  • Remove any reference to -msve-vector-length - it's not needed.
  • For compilers with the V2 cost model (gcc 13, ACFL (arm) 23.04.0 and above) - replace whole line with "-mcpu=neoverse-v2". There is no need to specify "mtune" and "march" as "mcpu" sets both on aarch64 for Clang, ACfL, and GNU.

This PR for N1/V1 : #81 is the (simpler) pattern for compilers that do support a particular core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand the argument for having comparability with V1. These architectures differ in their supported instruction sets, so the incompatibility goes beyond the sve vector width. But this flag is no longer included now due to the comments below.

* For compilers with the V2 cost model (gcc 13, ACFL (arm) 23.04.0 and above) - replace whole line with "-mcpu=neoverse-v2".  There is no need to specify "mtune" and "march" as "mcpu" sets both on aarch64 for Clang, ACfL, and GNU.

The mcpu flag is used whenever possible for the neoverse V2 architecture with all compilers already. Only in older compiler versions like the one you singled out, the maximum supported arm architecture flag subset is used. So I'm not sure I understand your comment here, since this is very similar to the N1/V1 specifications you are pointing to.

@willlovett-arm
Copy link

Hi all,

(Simon - hi! I'm technology manager for the compiler teams at Arm).

The documentation of both GCC and CLANG state that the default is 'scalable'. I could not very if this is included in the mcpu option, so I've kept it when possible. I'll remove it if you can verify that it is redundant.

Correct - this example https://godbolt.org/z/a7a3En6TT shows that in practice.

Can I check: do we have a good reason for enabling width-specific flags here? If it's for a good reason (eg. we've explicitly written width-specific library code), then it's fine. If it's for performance reasons (eg. the compiler gets to assume another thing, so it should be faster, right?...), I'd be wary. We do almost all our optimization work on width-agnistic codegen. I'd go so far as to say: if you did have any examples where you saw a speed advantage with width-specific, please let me know, so we can put effort into fixing them.

I recognise that, in this particular case, this framework knows it's only ever building for neoverse-v2, so it's probably a moot point. But note that we (Arm) are keen to proliferate vector-length-agnostic code wherever possible, to reduce the impact of future vector length incompatibility.

Thanks!

Will.

@alalazo
Copy link
Member

alalazo commented Nov 6, 2023

@willlovett-arm @dslarm Can you suggest changesets to this PR (I assume they amount to removing -msve-vector-bits=128)? I'll accept those and get this PR merged. Then we could discuss if any improvement is needed wrt flags, but I'd like to get neoverse-v2 support in the next Spack release which is happening early this week...

@AdhocMan
Copy link
Contributor Author

AdhocMan commented Nov 6, 2023

Can I check: do we have a good reason for enabling width-specific flags here? If it's for a good reason (eg. we've explicitly written width-specific library code), then it's fine. If it's for performance reasons (eg. the compiler gets to assume another thing, so it should be faster, right?...), I'd be wary. We do almost all our optimization work on width-agnistic codegen. I'd go so far as to say: if you did have any examples where you saw a speed advantage with width-specific, please let me know, so we can put effort into fixing them.

I recognise that, in this particular case, this framework knows it's only ever building for neoverse-v2, so it's probably a moot point. But note that we (Arm) are keen to proliferate vector-length-agnostic code wherever possible, to reduce the impact of future vector length incompatibility.

Thanks!

Will.

Hi Will, thanks for providing feedback on this.

I think the fact that your compiler optimization work targets width-agnostic codegen is a convincing argument for removing the fixed sve vector length. The intention was to provide as much optimization opportunities as possible for the compiler. One of our applications could benefit from fixed vector width, but that should not translate into (potentially) worse optimization in general.

@alalazo
Copy link
Member

alalazo commented Nov 6, 2023

Thanks @AdhocMan and everyone involved in the discussion!

@alalazo alalazo merged commit d844bb3 into archspec:master Nov 6, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants