Fix half2 with HIP #146

Open · Engininja2 wants to merge 4 commits into master

Conversation

Engininja2

The current version of CUDA lets you access the component halves of a half2 through half2.x and half2.y, but in HIP x and y are unsigned shorts rather than half or _Float16. This PR replaces that member access with intrinsic functions.

With the changes I can use --force_half2 and still get sensible text with my AMD GPU.
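For readers who haven't hit this, here is a minimal sketch of the substitution being described (illustrative, not the exact PR diff): reading and writing half2 lanes through __low2half, __high2half, and __halves2half2, which behave the same on CUDA and HIP, instead of through the .x/.y members, which do not.

#include <cuda_fp16.h>  // hipify maps this to <hip/hip_fp16.h> under HIP

// On CUDA, v.x and v.y are __half; on HIP they are unsigned shorts, so
// reading them as halves reinterprets integer bits. The intrinsics below
// extract the lanes as __half on both platforms.
__device__ __forceinline__ __half2 swap_lanes(__half2 v)
{
    __half lo = __low2half(v);      // instead of v.x
    __half hi = __high2half(v);     // instead of v.y
    return __halves2half2(hi, lo);  // instead of writing v.x / v.y
}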

@ardfork
Contributor

ardfork commented Jul 10, 2023

This doesn't work for me. Still spouting gibberish with -fh2.
I also don't understand your change to __compat_h2rcp: it's just a backport from ROCm 5.6 for those using an outdated version of ROCm, and it can simply be removed when using ROCm >= 5.6.

@Engininja2
Author

That sucks. I thought I had figured it out. I'm using a 5700 XT with rocBLAS and PyTorch 2.0.1 recompiled on ROCm 5.4.3, so it's possible my setup avoids some other issue tied to feature differences, like the dot product opcodes that GPU doesn't have.

With __compat_h2rcp as it is in ROCm 5.6, I got gibberish from silu_mul_cuda_kernel with half2.
A basic test kernel that calls that __compat_h2rcp gives the wrong result, and its disassembly contains a v_cvt_f16_u16_e32, so it's converting from a short to a 16-bit float rather than reinterpreting the bits. Compiling it in a ROCm 5.6 docker gives the same result, including for the h2rcp() that's built into ROCm. Targeting gfx1030 produces the same object code as gfx1010, so I suspect there's a bug in AMD's h2rcp.
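For context, a hypothetical sketch of the kind of basic test kernel described (the name is mine): feed h2rcp a known half2 and check the lanes. With input (2.0, 4.0) the expected output is (0.5, 0.25); the miscompiled version converts the raw unsigned shorts to f16 first and returns garbage.

#include <cuda_fp16.h>

// Hypothetical repro kernel: out[0] should hold the lane-wise reciprocals
// of in[0]. The buggy lowering shows up as v_cvt_f16_u16_e32 in the
// disassembly, i.e. an integer-to-float conversion of the raw shorts.
__global__ void h2rcp_test(const __half2* in, __half2* out)
{
    out[0] = h2rcp(in[0]);
}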

@ardfork
Contributor

ardfork commented Jul 10, 2023

I spent a bit more time testing your patch. It seems to be a bit more coherent, or at least different from the output without it.

Without patch:

 -- Testing 8 chunks.
 ** Perplexity (reconstruct): 32000.1813
 -- Testing 8 chunks.
 ** Perplexity (quant, token): 31321.9178
 ** Generation: 'To be or not to be, that is theMDbrnлия Sud Beauoin Burg stick BourbourgrourutobreFD Stock ReserveMDbMDbMDbMDb'

With patch:

 -- Testing 8 chunks.
 ** Perplexity (reconstruct): 5.4959
 -- Testing 8 chunks.
 ** Perplexity (quant, token): 292330.3358
 ** Generation: 'To be or not to be, that is the............\n – – – –– – noiseokal drag next kv Ron Ep Brook future'

As you can see, Perplexity (reconstruct) seems correct with your patch. I looked a bit more, and with your patch applied, disabling fused MLP with --fused_mlp_thd 0 makes it work.

So, there are still some half2 problems in fused MLP.

@turboderp
Owner

So, there are still some half2 problems in fused MLP.

Do they actually matter, though? Maybe you could benchmark it with and without --silu_no_half2 to see if it's even worthwhile to keep fighting with h2exp and h2rcp. The activation kernel is very memory-bound anyway.
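For reference, the half2 activation math under discussion is roughly the following (a sketch of SiLU, not the repo's actual silu_mul_cuda_kernel): silu(x) = x / (1 + e^-x), which is where h2exp and h2rcp come in.

#include <cuda_fp16.h>

// Sketch: SiLU applied to both lanes of a half2 at once.
// silu(x) = x * sigmoid(x) = x / (1 + exp(-x))
__device__ __forceinline__ __half2 silu_half2(__half2 x)
{
    const __half2 one = __float2half2_rn(1.0f);
    __half2 e = h2exp(__hneg2(x));              // exp(-x), both lanes
    return __hmul2(x, h2rcp(__hadd2(one, e)));  // x / (1 + exp(-x))
}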

@Engininja2
Author

Engininja2 commented Jul 11, 2023

@ardfork can you try clearing the contents of the exllama_ext cache? In my case it was in ~/.cache/torch_extensions/py311_cpu/exllama_ext

I first ran exllama without the patch with the cache deleted, then ran with the patch without deleting the cache and got gibberish. After clearing the cache it worked.

Now that functions are used to access the component halves of half2, undefining HIP_NO_HALF_CONVERSIONS no longer seems to be necessary.
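My reading of why (an inference from this thread, not a verified explanation): the intrinsic accessors keep every value in __half, so the compiler is never asked for the implicit half<->integer conversions that HIP_NO_HALF_CONVERSIONS forbids. For example:

#include <cuda_fp16.h>

// No member access, no implicit conversions: each lane is extracted as a
// __half and widened explicitly.
__device__ __forceinline__ float sum_lanes(__half2 v)
{
    return __half2float(__low2half(v)) + __half2float(__high2half(v));
}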

@ardfork
Contributor

ardfork commented Jul 11, 2023

That works now. Looks like the changes in hip_compat.cuh weren't taken into account when rebuilding. I wonder if adding a trivial change to q4_mlp to force a rebuild would be useful (or finding another way to force a rebuild), since people are going to hit the same problem I did.

As for speed, on my machine it's the same with or without half2, but it's nice to be able to use everything.

It might be a good idea to raise an issue or a PR upstream to get that h2rcp fixed, as they already tried to fix it in 5.6.

If there's no performance drop on most HIP-supported platforms, I think we should revert #afc8b7cd, and also revert oobabooga/text-generation-webui@3c076c3 once it uses a version of exllama with your patch merged.

If there's no problem with using __low2half and __high2half on CUDA, I think it would be nice to merge this PR; having all the paths working on HIP is better.

@Engininja2
Author

Changing extra_cuda_cflags caused ninja to compile everything again too, so I don't think there's a need to change q4_mlp this time.

I raised an issue with hipamd for h2rcp(). ROCm/clr#8

Stop defaulting to --no_half2 with HIP, and add whitespace to q4_mlp to
help make sure it gets recompiled.
@Engininja2
Author

I went and added a newline to q4_mlp.cu anyway, in case someone using exllama downstream loads the extension with their own code. I also removed HIP defaulting to no_half2 like you suggested.

I would assume that performance should be fine for anyone with an RDNA, CDNA, or Vega GPU. Anything from the RX 580 era lacks the packed math instructions, so half2 might be slower there depending on what the compiler does.

@ardfork
Contributor

ardfork commented Jul 21, 2023

I raised an issue with hipamd for h2rcp()

They have moved hipamd to https://github.com/ROCm-Developer-Tools/clr; it might be a good idea to also file your issue on that new repo, as I fear it might get forgotten on the old one.

hip_compat.cuh

@@ -8,8 +8,8 @@ __device__ __forceinline__ __half __compat_hrcp(__half x) {
 }

 __device__ __forceinline__ __half2 __compat_h2rcp(__half2 x) {
-    return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(x.x)),
-                      static_cast<_Float16>(__builtin_amdgcn_rcph(x.y))};
+    return _Float16_2{static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.x)),
+                      static_cast<_Float16>(__builtin_amdgcn_rcph(static_cast<__half2_raw>(x).data.y))};

Do you know where I can read about this __builtin_amdgcn_rcph?
