Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AArch64 Linux test failures on LLVM 12 and newer #597

Open
elliottslaughter opened this issue Aug 16, 2022 · 23 comments
Open

AArch64 Linux test failures on LLVM 12 and newer #597

elliottslaughter opened this issue Aug 16, 2022 · 23 comments

Comments

@elliottslaughter
Copy link
Member

elliottslaughter commented Aug 16, 2022

I've seen this pop up a couple of times, so I'm documenting it here so people know about it. On ARM (AArch64), LLVM versions >= 12 seem to experience higher test failure rates. I've seen this happen on both Linux (Graviton, NVIDIA Jetson) and macOS (Apple M1), so it seems to be a feature of ARM processors and not of a specific OS.

Test pass rate on LLVM 14.0.0:

98% tests passed, 13 tests failed out of 549

Total Test time (real) =  30.15 sec

The following tests FAILED:
	 65 - class.t (Failed)
	 70 - class6.t (Failed)
	 90 - coverage3.t (Failed)
	121 - defercond.t (SEGFAULT)
	209 - let2.t (SEGFAULT)
	343 - simpleglobal.t (Failed)
	396 - teststd.t (SEGFAULT)
	412 - vars.t (SEGFAULT)
	413 - vars2.t (SEGFAULT)
	415 - vecarith.t (SEGFAULT)
	417 - vecobj.t (Failed)
	424 - zeroreturn.t (SEGFAULT)
	425 - zeroreturn2.t (SEGFAULT)
Errors while running CTest

Test pass rate on LLVM 11.1.0:

99% tests passed, 4 tests failed out of 549

Total Test time (real) =  28.59 sec

The following tests FAILED:
	 65 - class.t (Failed)
	 70 - class6.t (Failed)
	 90 - coverage3.t (Failed)
	417 - vecobj.t (Failed)
Errors while running CTest

This is with dcd2eff on the following machine:

$ uname -a
Linux gcc80 4.9.201-tegra #1 SMP PREEMPT Fri Jul 9 08:56:59 PDT 2021 aarch64 aarch64 aarch64 GNU/Linux

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           4
Vendor ID:           Nvidia
Model:               0
Model name:          ARMv8 Processor rev 0 (v8l)
Stepping:            0x0
CPU max MHz:         2265.6001
CPU min MHz:         115.2000
BogoMIPS:            62.50
L1d cache:           64K
L1i cache:           128K
L2 cache:            2048K
L3 cache:            4096K
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp

My best guess at a root cause is #485, because LLVM 12 is the version where we switched back to MCJIT after ORCv1 was removed. However, it's possible something else changed in LLVM that is breaking something, and we haven't accounted for it yet.

Right now the best workaround is to stick to LLVM 11 on ARM platforms.

@shoe42
Copy link
Contributor

shoe42 commented Aug 18, 2022

Not sure how useful this is, but here's some quick tests (using 23559ea), macOS on M1. Exporting both SDKROOT and INCLUDE_PATH, using /usr/bin/clang++ and LLVM 11 through 14 from Homebrew, I get the following:

  • LLVM 11
98% tests passed, 10 tests failed out of 549

Total Test time (real) =  16.96 sec

The following tests FAILED:
	 27 - atomicrmw.t (Subprocess aborted)
	 46 - bug372.t (SEGFAULT)
	 48 - bug372b.t (SEGFAULT)
	 49 - bug372c.t (SEGFAULT)
	 60 - cconv_array.t (Failed)
	 65 - class.t (SEGFAULT)
	 67 - class3.t (SEGFAULT)
	272 - fakeasm.t (Failed)
	423 - printfarray.t (SEGFAULT)
	533 - vararg.t (Subprocess aborted)
  • LLVM 12
98% tests passed, 10 tests failed out of 549

Total Test time (real) =  16.90 sec

The following tests FAILED:
	 27 - atomicrmw.t (SEGFAULT)
	 46 - bug372.t (SEGFAULT)
	 48 - bug372b.t (SEGFAULT)
	 49 - bug372c.t (SEGFAULT)
	 60 - cconv_array.t (Failed)
	 65 - class.t (SEGFAULT)
	 67 - class3.t (SEGFAULT)
	272 - fakeasm.t (Failed)
	423 - printfarray.t (SEGFAULT)
	533 - vararg.t (Subprocess aborted)
  • LLVM 13
98% tests passed, 11 tests failed out of 549

Total Test time (real) =  16.96 sec

The following tests FAILED:
	 27 - atomicrmw.t (Subprocess aborted)
	 46 - bug372.t (SEGFAULT)
	 48 - bug372b.t (SEGFAULT)
	 49 - bug372c.t (SEGFAULT)
	 50 - bug372d.t (SEGFAULT)
	 60 - cconv_array.t (Failed)
	 65 - class.t (SEGFAULT)
	 67 - class3.t (SEGFAULT)
	272 - fakeasm.t (Failed)
	423 - printfarray.t (SEGFAULT)
	533 - vararg.t (Subprocess aborted)
  • LLVM 14
98% tests passed, 11 tests failed out of 549

Total Test time (real) =  15.97 sec

The following tests FAILED:
	 27 - atomicrmw.t (SEGFAULT)
	 46 - bug372.t (SEGFAULT)
	 48 - bug372b.t (SEGFAULT)
	 49 - bug372c.t (SEGFAULT)
	 51 - bug4.t (SEGFAULT)
	 60 - cconv_array.t (Failed)
	 65 - class.t (SEGFAULT)
	 67 - class3.t (SEGFAULT)
	272 - fakeasm.t (Failed)
	423 - printfarray.t (SEGFAULT)
	533 - vararg.t (SEGFAULT)

@elliottslaughter
Copy link
Member Author

Interesting, thanks. I guess the issue identified in the title here is really a Linux problem, so that means the issue on Apple M1 is really somewhere else.

Thanks for taking the time to test this.

@elliottslaughter elliottslaughter changed the title AArch64 test failures on LLVM 12 and newer AArch64 Linux test failures on LLVM 12 and newer Aug 18, 2022
@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Here's some more (hopefully useful) information gleaned from running each failing test individually in lldb. As before, using commit 23559ea, macOS on M1, and LLVM 13 from Homebrew.

  • atomicrmw.t:
54321
atomicrmw.t:69:         terra atomic_add_and_return(x : &int32,y : int32) : int32
atomicrmw.t:70:             return atomicrmw("add", x, y, { syncscope = "", ordering = "acq_rel", align = native, isvolatile = false })
atomicrmw.t:69:         end
definition 	{&int32,int32} -> int32
define dso_local i32 @"$atomic_add_and_return"(i32* %0, i32 %1) {
entry:
  %2 = atomicrmw add i32* %0, i32 %1 acq_rel, align 4
  ret i32 %2
}
assembly for function at address 0x10978c000
0x10978c000(+0):		ldaxr	w8, [x0]
0x10978c004(+4):		add	w9, w8, w1
0x10978c008(+8):		stlxr	w10, w9, [x0]
0x10978c00c(+12):		cbnz	w10, #-12
0x10978c010(+16):		mov	x0, x8
0x10978c014(+20):		ret
22
atomicrmw.t:91:         terra atomic_fadd(x : &double,y : double) : {}
atomicrmw.t:92:             atomicrmw("fadd", x, y, { syncscope = "", ordering = "monotonic", align = native, isvolatile = false })
atomicrmw.t:91:         end
definition 	{&double,double} -> {}
define dso_local void @"$atomic_fadd"(double* %0, double %1) {
entry:
  %2 = atomicrmw fadd double* %0, double %1 monotonic, align 8
  ret void
}
assembly for function at address 0x109794000
0x109794000(+0):		ldr	d1, [x0]
0x109794004(+4):		b	#16
0x109794008(+8):		fmov	d1, x10
0x10979400c(+12):		cmp	x10, x9
0x109794010(+16):		b.eq	#40
0x109794014(+20):		fadd	d2, d1, d0
0x109794018(+24):		fmov	x8, d2
0x10979401c(+28):		fmov	x9, d1
0x109794020(+32):		ldaxr	x10, [x0]
0x109794024(+36):		cmp	x10, x9
0x109794028(+40):		b.ne	#-32
0x10979402c(+44):		stlxr	wzr, x8, [x0]
0x109794030(+48):		cbnz	wzr, #-16
0x109794034(+52):		b	#-44
0x109794038(+56):		ret
21
atomicrmw.t:69:         terra atomic_add_and_return(x : &int32,y : int32) : int32
atomicrmw.t:70:             return atomicrmw("add", x, y, { syncscope = "", ordering = "acq_rel", align = native, isvolatile = false })
atomicrmw.t:69:         end
definition 	{&int32,int32} -> int32
define dso_local i32 @"$atomic_add_and_return"(i32* %0, i32 %1) {
entry:
  %2 = atomicrmw add i32* %0, i32 %1 acq_rel, align 4
  ret i32 %2
}
assembly for function at address 0x10978c000
0x10978c000(+0):		ldaxr	w8, [x0]
0x10978c004(+4):		add	w9, w8, w1
0x10978c008(+8):		stlxr	w10, w9, [x0]
0x10978c00c(+12):		cbnz	w10, #-12
0x10978c010(+16):		mov	x0, x8
0x10978c014(+20):		ret
LLVM ERROR: Cannot select: intrinsic %llvm.aarch64.stlxr
Process 49454 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00000001a314ad98 libsystem_kernel.dylib`__pthread_kill + 8
  • bug372.t:
Process 49620 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3131676e69727450)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • bug372b.t:
Process 49681 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3131676e69727450)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • bug372c.t:
Expect this ouput:
Process 49745 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x697274530b008c00)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • bug372d.t:
Expect this ouput:
Process 61876 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • cconv_array.t:
running test for int8[0]
running test for int8[1]
running test for int8[2]
running test for int8[3]
running test for int8[4]
running test for int8[5]
running test for int8[6]
running test for int8[7]
running test for int8[8]
running test for int8[9]
cconv_array.t:53: NYI: cannot call this C function (yet)
stack traceback:
	[C]: in function 'caller'
	cconv_array.t:53: in function 'run_test_case'
	cconv_array.t:64: in main chunk
Process 49838 exited with status = 1 (0x00000001)
  • class.t:
metatype init of Animal
allocating slot 1 to method speak in type Animal
stub speak in Animal
metatype init of Dog
importing	speak	at slot	1	 in type 	Dog
override slot 1 with method speak in type Dog
stub speak in Dog
metatype init of Cat
importing	speak	at slot	1	 in type 	Cat
override slot 1 with method speak in type Cat
stub speak in Cat
meow! 1876945344
woof! 1876945344
metatype init of P
allocating slot 1 to method add in type P
stub add in P
metatype init of C$1
importing	add	at slot	1	 in type 	C$1
allocating slot 2 to method sub in type C$1
stub add in C$1
stub sub in C$1
defining interface Add for type P
defining interface Add for type C$1
defining interface Sub for type C$1
interface xA�
Process 49941 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • class3.t:
metatype init of Leaf
allocating slot 1 to method print in type Leaf
stub print in Leaf
metatype init of Node
importing	print	at slot	1	 in type 	Node
override slot 1 with method print in type Node
stub print in Node
defining interface Prints for type Leaf
defining interface Prints for type Node
Process 50072 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xd50)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • fakeasm.t:
fakeasm.t:32: assertion failed!
stack traceback:
	[C]: in function 'assert'
	fakeasm.t:32: in main chunk
Process 50169 exited with status = 1 (0x00000001)

Adding a print('"' ... s ... '"') before the assert shows the string s contains 1 3 4 6 what 1876945344\n when inside lldb, with the final integer varying seemingly randomly, except for the beginning 18 which seems to remain constant, when run outside of lldb. Bear in mind this is far from exhaistive testing, and just observations from a few runs.

  • printfarray.t:
Process 53567 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8000000060)
    frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
  • vararg.t:
Assertion failed: (Ty && "Invalid GetElementPtrInst indices for type!"), function checkGEPType, file Instructions.h, line 921.
Process 53786 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x0000000100038d1c terra`llvm::checkGEPType(Ty=0x0000000000000000) at Instructions.h:921:3
   918 	// message on bad indexes for a gep instruction.
   919 	//
   920 	inline Type *checkGEPType(Type *Ty) {
-> 921 	  assert(Ty && "Invalid GetElementPtrInst indices for type!");
   922 	  return Ty;
   923 	}
   924
Target 0: (terra) stopped.

@elliottslaughter
Copy link
Member Author

cconv_array.t seems to be straight up a case of something being not yet implemented in LuaJIT. I don't think there's anything we can do for that one except turn the test off for when we're on M1 macOS.

Based on the others, it looks like a lot of them are getting caught in strlen, which perhaps indicates an issue with memory layout. But beyond that it's not obvious to me what the issue is. Maybe LLVM is doing an optimization that breaks type punning, but that's a random guess.

This is a shot in the dark, but what happens when you change {fastmath=false} to false in:

terra.jitcompilationunit = terra.newcompilationunit(terra.nativetarget,true,{fastmath=false}) -- compilation unit used for JIT compilation, will eventually specify the native architecture

That should completely shut off optimizations in the JIT.

For these tests, I think what it will take to make progress is to minimize each test: i.e., delete lines from the test until the the failure behavior changes. Once we've got the smallest version of each test that fails with the same behavior as the original, then we can (hopefully) see what specifically it is about that test that is causing trouble, and then form a hypothesis.

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Replacing {fastmath=false} to just false seems to have no effect - looks like the same tests are failing. Backtraces for the strlen issues (in the below example, specifically running bug372d.t) traces to a point outside of any loaded sections:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
  * frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001a30436a0 libsystem_c.dylib`__vfprintf + 4544
    frame #2: 0x00000001a30531d8 libsystem_c.dylib`vfprintf_l + 68
    frame #3: 0x00000001a3070f38 libsystem_c.dylib`printf + 80
    frame #4: 0x00000001095e80f8
(lldb) image lookup -va 0x00000001095e80f8
(lldb)

Looks like the code at frame 4 might be generated from Terra's JIT. Setting a breakpoint on the address shows it's at least written to by llvm::RuntimeDyldImpl::emitSection:

(lldb) w s e 0x00000001095e80f8
Watchpoint created: Watchpoint 1: addr = 0x1095e80f8 size = 8 state = enabled type = w
(lldb) c
Process 62850 resuming

...

Process 62850 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=258, subcode=0x1095e80f0)
  * frame #0: 0x00000001a3197184 libsystem_platform.dylib`_platform_memmove + 180
    frame #1: 0x0000000101491784 terra`llvm::RuntimeDyldImpl::emitSection(llvm::object::ObjectFile const&, llvm::object::SectionRef const&, bool) + 756
    frame #2: 0x0000000101490a08 terra`llvm::RuntimeDyldImpl::findOrEmitSection(llvm::object::ObjectFile const&, llvm::object::SectionRef const&, bool, std::__1::map<llvm::object::SectionRef, unsigned int, std::__1::less<llvm::object::SectionRef>, std::__1::allocator<std::__1::pair<llvm::object::SectionRef const, unsigned int> > >&) + 120
    frame #3: 0x000000010148f704 terra`llvm::RuntimeDyldImpl::loadObjectImpl(llvm::object::ObjectFile const&) + 1952
    frame #4: 0x00000001014a4570 terra`llvm::RuntimeDyldMachO::loadObject(llvm::object::ObjectFile const&) + 52
    frame #5: 0x0000000101493824 terra`llvm::RuntimeDyld::loadObject(llvm::object::ObjectFile const&) + 800
    frame #6: 0x000000010141e43c terra`llvm::MCJIT::generateCodeForModule(llvm::Module*) + 208
    frame #7: 0x000000010141f128 terra`llvm::MCJIT::findSymbol(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 544
    frame #8: 0x000000010141ed98 terra`llvm::MCJIT::getSymbolAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 124
    frame #9: 0x000000010141f3ac terra`llvm::MCJIT::getGlobalValueAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 52
    frame #10: 0x0000000100050c14 terra`JITGlobalValue(CU=0x0000600002c04100, gv=0x0000600002c10d88) at tcompiler.cpp:3812:24
    frame #11: 0x0000000100009574 terra`terra_jit(L=0x000000010a800380) at tcompiler.cpp:3820:17
    frame #12: 0x0000000103070388 terra`lj_BC_FUNCC + 44
    frame #13: 0x0000000103017ddc terra`lua_pcall(L=<unavailable>, nargs=<unavailable>, nresults=<unavailable>, errfunc=<unavailable>) at lj_api.c:1145:12 [opt]
    frame #14: 0x00000001000052dc terra`docall(L=0x000000010a800380, narg=0, clear=0) at main.cpp:339:14
    frame #15: 0x0000000100004e84 terra`main(argc=2, argv=0x000000016fdfee90) at main.cpp:119:13
    frame #16: 0x000000010936108c dyld`start + 520
(lldb) w delete 1
1 watchpoints deleted.
(lldb) c
Process 62850 resuming

...

Expect this ouput:
Process 62850 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
  * frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001a30436a0 libsystem_c.dylib`__vfprintf + 4544
    frame #2: 0x00000001a30531d8 libsystem_c.dylib`vfprintf_l + 68
    frame #3: 0x00000001a3070f38 libsystem_c.dylib`printf + 80
    frame #4: 0x00000001095e80f8

@elliottslaughter
Copy link
Member Author

I think bug372.t would be the best version of this test to debug, assuming it still fails, since it's the shortest.

Unfortunately, debug info does not seem to be functioning on ARM. (You can try with terra -g bug372.t, but don't get your hopes up.)

So I think the next best thing we can do is "hardhat" debugging, which in this case probably is going to involve printf on all of the pointers in the program.

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Not sure if relevant, but e.g. in bug372d.t, dumping the following addresses:

C.printf("Config:               0x%0.16lx\n", [long](&(config)))
C.printf("xBCLeftInflowProfile: 0x%0.16lx\n", [long](&(config.BC.xBCLeftInflowProfile.u.File.FileDir)))
C.printf("xBCLeftHeat:          0x%0.16lx\n", [long](&(config.BC.xBCLeftHeat.u.File.FileDir)))

Gives the following output:

Config:               0xfff9000000000000
xBCLeftInflowProfile: 0xfff9000000000000
xBCLeftHeat:          0xfff9000000000000

I'm far from even remotely familiar with Terra and its syntax, but is this expected? Shouldn't the FileDirs be offset into the Config struct?

@elliottslaughter
Copy link
Member Author

Is a long 32 bits on M1? It may simply be that you're throwing away the relevant bits of the address.

Better to use %p and pass the pointer:

C.printf("Config:               0x%0.16p\n", &(config))
C.printf("xBCLeftInflowProfile: 0x%0.16p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir))
C.printf("xBCLeftHeat:          0x%0.16p\n", &(config.BC.xBCLeftHeat.u.File.FileDir))

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Using %p still outputs just 0x0xfff9000000000000. Something doesn't seem right - clang reports long to be 8 bytes.

@elliottslaughter
Copy link
Member Author

elliottslaughter commented Aug 19, 2022

Ok, well there is definitely something wrong because those pointers should be different addresses. E.g. here's output on x86_64:

Config:               0x0x00007ff7b331cd08
xBCLeftInflowProfile: 0x0x00007ff7b331cf40
xBCLeftHeat:          0x0x00007ff7b331cd0c

Maybe try this test? This should show us the LLVM IR before and after optimizations, plus the assembly, so maybe we'll spot something in here.

local C = terralib.includecstring
[[
#include "stdbool.h"
#include "stdint.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"

typedef int TempProfile;
#define TempProfile_File 0

typedef int InflowProfile;
#define InflowProfile_File 0
#define InflowProfile_SuctionAndBlowing 1

struct Config {
  struct  {
    struct  {
      int32_t type;
      union  {
        struct  {
          int8_t FileDir[256];
        } File;
      } u;
    } xBCLeftHeat;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } yBCLeftInflowProfile;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } xBCLeftInflowProfile;
  } BC;
};
]]

local Config = C.Config

terra main()
  var config : Config
  C.printf("Config:               0x%0.16p\n", &(config))
  C.printf("xBCLeftInflowProfile: 0x%0.16p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir))
  C.printf("xBCLeftHeat:          0x%0.16p\n", &(config.BC.xBCLeftHeat.u.File.FileDir))
end
print(terralib.saveobj(nil, "llvmir", {main=main}, nil, nil, false))
main:disas()

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Output from the above:

define dso_local void @"$main"() {
entry:
  %config = alloca %struct.Config, align 8
  %0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string", i64 0, i64 0), %struct.Config* nonnull %config)
  %1 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 2, i32 1, i32 0, i32 1
  %2 = bitcast %struct.anon.8* %1 to [256 x i8]*
  %3 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.1", i64 0, i64 0), [256 x i8]* nonnull %2)
  %4 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0
  %5 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.2", i64 0, i64 0), [256 x i8]* nonnull %4)
  ret void
}
; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"

%struct.Config = type { %struct.anon.1 }
%struct.anon.1 = type { %struct.anon.2, %struct.anon.4, %struct.anon.4 }
%struct.anon.2 = type { i32, %union.anon }
%union.anon = type { %struct.anon.3 }
%struct.anon.3 = type { [256 x i8] }
%struct.anon.4 = type { i32, %union.anon.5 }
%union.anon.5 = type { %struct.anon.7 }
%struct.anon.7 = type { double, %struct.anon.8, double, %struct.anon.8, %struct.anon.8 }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.6 = type { double, [256 x i8] }

@"$string" = private unnamed_addr constant [32 x i8] c"Config:               0x%0.16p\0A\00", align 1
@"$string.1" = private unnamed_addr constant [32 x i8] c"xBCLeftInflowProfile: 0x%0.16p\0A\00", align 1
@"$string.2" = private unnamed_addr constant [32 x i8] c"xBCLeftHeat:          0x%0.16p\0A\00", align 1

define dso_local void @main() {
entry:
  %config = alloca %struct.Config, align 8
  %0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
  %1 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %2 = getelementptr %struct.anon.1, %struct.anon.1* %1, i32 0, i32 2
  %3 = getelementptr %struct.anon.4, %struct.anon.4* %2, i32 0, i32 1
  %4 = getelementptr %union.anon.5, %union.anon.5* %3, i32 0, i32 0
  %5 = bitcast %struct.anon.7* %4 to %struct.anon.6*
  %6 = getelementptr %struct.anon.6, %struct.anon.6* %5, i32 0, i32 1
  %7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
  %8 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %9 = getelementptr %struct.anon.1, %struct.anon.1* %8, i32 0, i32 0
  %10 = getelementptr %struct.anon.2, %struct.anon.2* %9, i32 0, i32 1
  %11 = getelementptr %union.anon, %union.anon* %10, i32 0, i32 0
  %12 = getelementptr %struct.anon.3, %struct.anon.3* %11, i32 0, i32 0
  %13 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
  ret void
}

declare dso_local i32 @printf(i8*, ...)

definition 	{} -> {}
assembly for function at address 0x10d674000
0x10d674000(+0):		stp	x28, x27, [sp, #-48]!
0x10d674004(+4):		stp	x20, x19, [sp, #16]
0x10d674008(+8):		stp	x29, x30, [sp, #32]
0x10d67400c(+12):		sub	sp, sp, #848
0x10d674010(+16):		adrp	x0, #0
0x10d674014(+20):		ldr	x0, [x0, #120]
0x10d674018(+24):		adrp	x19, #0
0x10d67401c(+28):		ldr	x19, [x19, #112]
0x10d674020(+32):		add	x20, sp, #8
0x10d674024(+36):		add	x1, sp, #8
0x10d674028(+40):		blr	x19
0x10d67402c(+44):		add	x1, x20, #568
0x10d674030(+48):		adrp	x0, #0
0x10d674034(+52):		ldr	x0, [x0, #104]
0x10d674038(+56):		blr	x19
0x10d67403c(+60):		orr	x1, x20, #0x4
0x10d674040(+64):		adrp	x0, #0
0x10d674044(+68):		ldr	x0, [x0, #96]
0x10d674048(+72):		blr	x19
0x10d67404c(+76):		add	sp, sp, #848
0x10d674050(+80):		ldp	x29, x30, [sp, #32]
0x10d674054(+84):		ldp	x20, x19, [sp, #16]
0x10d674058(+88):		ldp	x28, x27, [sp], #48
0x10d67405c(+92):		ret

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

So running the test from #597 (comment) gives:

Config:               0x0x0000000000000000
xBCLeftInflowProfile: 0x0x0000000000000000
xBCLeftHeat:          0x0x0000000000000000

Which is strange. Even stranger, re-adding the two memcpy calls to copy the strings before printing addresses gives:

Config:               0x0x0000000000007ffc
xBCLeftInflowProfile: 0x0x0000000000007ffc
xBCLeftHeat:          0x0x0000000000007ffc

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Removing the call to terralib.saveobj and disas seems to flip the above behaviour with 0x0000000000000000/0x0000000000007ffc. Not a clue what's going on.

@shoe42
Copy link
Contributor

shoe42 commented Aug 19, 2022

Could this somehow be an issue with vararg behaviour on aarch64 macOS being different to aarch64 Linux/general ARMv8 or aarch64 ABI? https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms

@elliottslaughter
Copy link
Member Author

Hm, maybe. I'd expect LLVM to handle most differences in the C ABI, at least when calling functions with scalar arguments (including, I believe, variadic functions).

But let's test that! We can run Clang to see what code it generates:

#include "stdbool.h"
#include "stdint.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"

typedef int TempProfile;
#define TempProfile_File 0

typedef int InflowProfile;
#define InflowProfile_File 0
#define InflowProfile_SuctionAndBlowing 1

struct Config {
  struct  {
    struct  {
      int32_t type;
      union  {
        struct  {
          int8_t FileDir[256];
        } File;
      } u;
    } xBCLeftHeat;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } yBCLeftInflowProfile;
    struct  {
      int32_t type;
      union  {
        struct  {
          double addedVelocity;
          int8_t FileDir[256];
        } File;
        struct  {
          double sigma;
          struct  {
            uint32_t length;
            double values[10];
          } beta;
          double Zw;
          struct  {
            uint32_t length;
            double values[10];
          } A;
          struct  {
            uint32_t length;
            double values[10];
          } omega;
        } SuctionAndBlowing;
      } u;
    } xBCLeftInflowProfile;
  } BC;
};

int main() {
  struct Config config;
  printf("Config:               %p\n", &(config));
  printf("xBCLeftInflowProfile: %p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir));
  printf("xBCLeftHeat:          %p\n", &(config.BC.xBCLeftHeat.u.File.FileDir));
  return 0;
}

Then run:

clang bug372d.c -o bug372.ll -S -emit-llvm

Should give you the unoptimized LLVM IR, which we can compare to what Terra produces.

You can also run the program to sanity check it's behavior:

(On x86:)

$ clang bug372d.c -o bug372
$ ./bug372 
Config:               0x7ff7babbb500
xBCLeftInflowProfile: 0x7ff7babbb738
xBCLeftHeat:          0x7ff7babbb504

For best results, it's important to use the same Clang that you used to build Terra.

Note in the C program I also used %p instead of %0.16p. It turns out the latter is undefined behavior for pointers. I guess we should try that out in Terra too in case it makes any difference.

We could potentially add back more of bug372d.t, and write the corresponding pieces of bug372d.c. By comparing the generated code, we should eventually be able to spot the problem. Not all differences are an issue (Terra intentionally does some of its code generation differently), but the problem will (likely) be contained in the differences in the generated code.

Another possible explanation is that we're doing something bad with the JIT, but if so I don't see the mechanism for how that would be going wrong so far.

@shoe42
Copy link
Contributor

shoe42 commented Aug 20, 2022

  • LLVM IR for bug372d.c:
; ModuleID = 'bug372d.c'
source_filename = "bug372d.c"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-macosx12.0.0"

%struct.Config = type { %struct.anon }
%struct.anon = type { %struct.anon.0, %struct.anon.2, %struct.anon.9 }
%struct.anon.0 = type { i32, %union.anon }
%union.anon = type { %struct.anon.1 }
%struct.anon.1 = type { [256 x i8] }
%struct.anon.2 = type { i32, %union.anon.3 }
%union.anon.3 = type { %struct.anon.5 }
%struct.anon.5 = type { double, %struct.anon.6, double, %struct.anon.7, %struct.anon.8 }
%struct.anon.6 = type { i32, [10 x double] }
%struct.anon.7 = type { i32, [10 x double] }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.9 = type { i32, %union.anon.10 }
%union.anon.10 = type { %struct.anon.12 }
%struct.anon.12 = type { double, %struct.anon.13, double, %struct.anon.14, %struct.anon.15 }
%struct.anon.13 = type { i32, [10 x double] }
%struct.anon.14 = type { i32, [10 x double] }
%struct.anon.15 = type { i32, [10 x double] }
%struct.anon.11 = type { double, [256 x i8] }

@.str = private unnamed_addr constant [26 x i8] c"Config:               %p\0A\00", align 1
@.str.1 = private unnamed_addr constant [26 x i8] c"xBCLeftInflowProfile: %p\0A\00", align 1
@.str.2 = private unnamed_addr constant [26 x i8] c"xBCLeftHeat:          %p\0A\00", align 1

; Function Attrs: noinline nounwind optnone ssp uwtable
define i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca %struct.Config, align 8
  store i32 0, i32* %1, align 4
  %3 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str, i64 0, i64 0), %struct.Config* %2)
  %4 = getelementptr inbounds %struct.Config, %struct.Config* %2, i32 0, i32 0
  %5 = getelementptr inbounds %struct.anon, %struct.anon* %4, i32 0, i32 2
  %6 = getelementptr inbounds %struct.anon.9, %struct.anon.9* %5, i32 0, i32 1
  %7 = bitcast %union.anon.10* %6 to %struct.anon.11*
  %8 = getelementptr inbounds %struct.anon.11, %struct.anon.11* %7, i32 0, i32 1
  %9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str.1, i64 0, i64 0), [256 x i8]* %8)
  %10 = getelementptr inbounds %struct.Config, %struct.Config* %2, i32 0, i32 0
  %11 = getelementptr inbounds %struct.anon, %struct.anon* %10, i32 0, i32 0
  %12 = getelementptr inbounds %struct.anon.0, %struct.anon.0* %11, i32 0, i32 1
  %13 = bitcast %union.anon* %12 to %struct.anon.1*
  %14 = getelementptr inbounds %struct.anon.1, %struct.anon.1* %13, i32 0, i32 0
  %15 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str.2, i64 0, i64 0), [256 x i8]* %14)
  ret i32 0
}

declare i32 @printf(i8*, ...) #1

attributes #0 = { noinline nounwind optnone ssp uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-m1" "target-features"="+aes,+crc,+crypto,+dotprod,+fp-armv8,+fp16fml,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+sha3,+sm4,+v8.5a,+zcm,+zcz" }
attributes #1 = { "frame-pointer"="non-leaf" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-m1" "target-features"="+aes,+crc,+crypto,+dotprod,+fp-armv8,+fp16fml,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+sha3,+sm4,+v8.5a,+zcm,+zcz" }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7, !8}
!llvm.ident = !{!9}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 12, i32 3]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 1, !"branch-target-enforcement", i32 0}
!3 = !{i32 1, !"sign-return-address", i32 0}
!4 = !{i32 1, !"sign-return-address-all", i32 0}
!5 = !{i32 1, !"sign-return-address-with-bkey", i32 0}
!6 = !{i32 7, !"PIC Level", i32 2}
!7 = !{i32 7, !"uwtable", i32 1}
!8 = !{i32 7, !"frame-pointer", i32 1}
!9 = !{!"Apple clang version 13.1.6 (clang-1316.0.21.2.5)"}
  • Sanity check of bug372d.c:
$ clang bug372d.c -o bug372 && ./bug372
Config:               0x16daf2bc0
xBCLeftInflowProfile: 0x16daf2df8
xBCLeftHeat:          0x16daf2bc4
  • Using %p in bug372d.t:
Expect this ouput:
Config:               0xfff9000000000000
xBCLeftInflowProfile: 0xfff9000000000000
xBCLeftHeat:          0xfff9000000000000
zsh: segmentation fault  ../build/bin/terra bug372d.t

@shoe42
Copy link
Contributor

shoe42 commented Aug 20, 2022

So, been doing a bit of digging, and this does seem to be a calling convention issue.

Here's the disassembled _main of the C version, bug372d.c:

0000000100003ec4 <_main>:
100003ec4: f4 4f be a9 	stp	x20, x19, [sp, #-32]!
100003ec8: fd 7b 01 a9 	stp	x29, x30, [sp, #16]
100003ecc: fd 43 00 91 	add	x29, sp, #16
100003ed0: ff 83 0d d1 	sub	sp, sp, #864
100003ed4: 1f 20 03 d5 	nop
100003ed8: 88 09 00 58 	ldr	x8, 0x100004008 <_printf+0x100004008>
100003edc: 08 01 40 f9 	ldr	x8, [x8]
100003ee0: a8 83 1e f8 	stur	x8, [x29, #-24]
100003ee4: f3 43 00 91 	add	x19, sp, #16
100003ee8: f3 03 00 f9 	str	x19, [sp]
100003eec: e0 03 00 10 	adr	x0, #124
100003ef0: 1f 20 03 d5 	nop
100003ef4: 1a 00 00 94 	bl	0x100003f5c <_printf+0x100003f5c>
100003ef8: 68 e2 08 91 	add	x8, x19, #568
100003efc: e8 03 00 f9 	str	x8, [sp]
100003f00: 00 04 00 50 	adr	x0, #130
100003f04: 1f 20 03 d5 	nop
100003f08: 15 00 00 94 	bl	0x100003f5c <_printf+0x100003f5c>
100003f0c: 68 02 7e b2 	orr	x8, x19, #0x4
100003f10: e8 03 00 f9 	str	x8, [sp]
100003f14: 40 04 00 10 	adr	x0, #136
100003f18: 1f 20 03 d5 	nop
100003f1c: 10 00 00 94 	bl	0x100003f5c <_printf+0x100003f5c>
100003f20: a8 83 5e f8 	ldur	x8, [x29, #-24]
100003f24: 1f 20 03 d5 	nop
100003f28: 09 07 00 58 	ldr	x9, 0x100004008 <_printf+0x100004008>
100003f2c: 29 01 40 f9 	ldr	x9, [x9]
100003f30: 3f 01 08 eb 	cmp	x9, x8
100003f34: c1 00 00 54 	b.ne	0x100003f4c <_main+0x88>
100003f38: 00 00 80 52 	mov	w0, #0
100003f3c: ff 83 0d 91 	add	sp, sp, #864
100003f40: fd 7b 41 a9 	ldp	x29, x30, [sp, #16]
100003f44: f4 4f c2 a8 	ldp	x20, x19, [sp], #32
100003f48: c0 03 5f d6 	ret
100003f4c: 01 00 00 94 	bl	0x100003f50 <_printf+0x100003f50>

Note how it stores the first argument to printf in x0, and the second argument (the vararg one) on the stack (str x8, [sp]). I was under the impression x0-x7 were used for variable arguments where possible, and overflow would be placed onto the stack, but I could be wrong.

Anyway, disassembly of the Terra-generated _main (not _$main), built from the LLVM IR output from #597 (comment), is as follows:

0000000100003e4c <_main>:
100003e4c: f4 4f be a9 	stp	x20, x19, [sp, #-32]!
100003e50: fd 7b 01 a9 	stp	x29, x30, [sp, #16]
100003e54: ff 43 0d d1 	sub	sp, sp, #848
100003e58: 00 05 00 10 	adr	x0, #160
100003e5c: 1f 20 03 d5 	nop
100003e60: e1 23 00 91 	add	x1, sp, #8
100003e64: f3 23 00 91 	add	x19, sp, #8
100003e68: 21 00 00 94 	bl	0x100003eec <_printf+0x100003eec>
100003e6c: 20 05 00 50 	adr	x0, #166
100003e70: 61 e2 08 91 	add	x1, x19, #568
100003e74: 1f 20 03 d5 	nop
100003e78: 1d 00 00 94 	bl	0x100003eec <_printf+0x100003eec>
100003e7c: 80 05 00 10 	adr	x0, #176
100003e80: 61 02 7e b2 	orr	x1, x19, #0x4
100003e84: 1f 20 03 d5 	nop
100003e88: 19 00 00 94 	bl	0x100003eec <_printf+0x100003eec>
100003e8c: ff 43 0d 91 	add	sp, sp, #848
100003e90: fd 7b 41 a9 	ldp	x29, x30, [sp, #16]
100003e94: f4 4f c2 a8 	ldp	x20, x19, [sp], #32
100003e98: c0 03 5f d6 	ret

Note how it's using both x0 and x1 for passing arguments to printf. It's even calculating the correct addresses: the 2nd and 3rd addresses should be 568 bytes and 4 bytes greater than &config, respectively. However, there's no stack.

Running the first _main gives the following output:

Config:               0x16f0de690
xBCLeftInflowProfile: 0x16f0de8c8
xBCLeftHeat:          0x16f0de694

And the Terra-generated version gives:

Config:               0x100b58310
xBCLeftInflowProfile: 0x100b58310
xBCLeftHeat:          0x100b58310

This looked strange initially as multiple runs of the C version always seemed to have the stack somewhere in the 0x16xxxxxxx range. So, running the Terra-generated _main through lldb shows that x1 is having the correct address calculated:

...
(lldb) 
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000100003e68 bug372_terra`main + 28
bug372_terra`main:
->  0x100003e68 <+28>: bl     0x100003eec               ; symbol stub for: printf
    0x100003e6c <+32>: adr    x0, #0xa6                 ; "xBCLeftInflowProfile: %p\n"
    0x100003e70 <+36>: add    x1, x19, #0x238
    0x100003e74 <+40>: nop    
Target 0: (bug372_terra) stopped.
(lldb) reg read x0 x1
      x0 = 0x0000000100003ef8  "Config:               %p\n"
      x1 = 0x000000016fdfe648
...
(lldb) 
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000100003e78 bug372_terra`main + 44
bug372_terra`main:
->  0x100003e78 <+44>: bl     0x100003eec               ; symbol stub for: printf
    0x100003e7c <+48>: adr    x0, #0xb0                 ; "xBCLeftHeat:          %p\n"
    0x100003e80 <+52>: orr    x1, x19, #0x4
    0x100003e84 <+56>: nop    
Target 0: (bug372_terra) stopped.
(lldb) reg r x0 x1
      x0 = 0x0000000100003f12  "xBCLeftInflowProfile: %p\n"
      x1 = 0x000000016fdfe880
...
(lldb) 
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000100003e88 bug372_terra`main + 60
bug372_terra`main:
->  0x100003e88 <+60>: bl     0x100003eec               ; symbol stub for: printf
    0x100003e8c <+64>: add    sp, sp, #0x350
    0x100003e90 <+68>: ldp    x29, x30, [sp, #0x10]
    0x100003e94 <+72>: ldp    x20, x19, [sp], #0x20
Target 0: (bug372_terra) stopped.
(lldb) reg r x0 x1
      x0 = 0x0000000100003f2c  "xBCLeftHeat:          %p\n"
      x1 = 0x000000016fdfe64c
(lldb) x/2x $sp
0x16fdfe640: 0x00084310 0x00000001

So, Terra-generated IR is, for some reason, placing the printf vararg into x1, while the C-generated IR is placing it on the stack, which is where printf is reading it from - note the last command run above, showing 0x100084310 on the stack.

I can see in the LLVM IR that Terra for some reason is defining a different set of arguments for printf:

  • Terra call:
%7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
  • C call:
%9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str.1, i64 0, i64 0), [256 x i8]* %8)

I'm not very familiar with AAPCS64, however the docs at https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#id82 seem to imply that variadic arguments can first be passed in registers, as Terra is doing. However, Apple's documentation at https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms seems to differ in that it assigns variadic arguments to 8-byte stack slots instead of registers. This could explain the issues being seen here.

@shoe42
Copy link
Contributor

shoe42 commented Aug 20, 2022

Looks like this issue is with Terra's generation of the printf calls. Replacing the calls to i32 (i8*, %struct.Config*, ...)/i32 (i8*, [256 x i8]*, ...) with calls to i32 (i8*, ...) allows LLVM to properly pass the varargs:

; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"

%struct.Config = type { %struct.anon.1 }
%struct.anon.1 = type { %struct.anon.2, %struct.anon.4, %struct.anon.4 }
%struct.anon.2 = type { i32, %union.anon }
%union.anon = type { %struct.anon.3 }
%struct.anon.3 = type { [256 x i8] }
%struct.anon.4 = type { i32, %union.anon.5 }
%union.anon.5 = type { %struct.anon.7 }
%struct.anon.7 = type { double, %struct.anon.8, double, %struct.anon.8, %struct.anon.8 }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.6 = type { double, [256 x i8] }

@"$string" = private unnamed_addr constant [26 x i8] c"Config:               %p\0A\00", align 1
@"$string.1" = private unnamed_addr constant [26 x i8] c"xBCLeftInflowProfile: %p\0A\00", align 1
@"$string.2" = private unnamed_addr constant [26 x i8] c"xBCLeftHeat:          %p\0A\00", align 1

define dso_local void @main() {
entry:
  %config = alloca %struct.Config, align 8
  ;%0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
  %0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)

  %1 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %2 = getelementptr %struct.anon.1, %struct.anon.1* %1, i32 0, i32 2
  %3 = getelementptr %struct.anon.4, %struct.anon.4* %2, i32 0, i32 1
  %4 = getelementptr %union.anon.5, %union.anon.5* %3, i32 0, i32 0
  %5 = bitcast %struct.anon.7* %4 to %struct.anon.6*
  %6 = getelementptr %struct.anon.6, %struct.anon.6* %5, i32 0, i32 1
  ;%7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
  %7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
  
  %8 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
  %9 = getelementptr %struct.anon.1, %struct.anon.1* %8, i32 0, i32 0
  %10 = getelementptr %struct.anon.2, %struct.anon.2* %9, i32 0, i32 1
  %11 = getelementptr %union.anon, %union.anon* %10, i32 0, i32 0
  %12 = getelementptr %struct.anon.3, %struct.anon.3* %11, i32 0, i32 0
  ;%13 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
  %13 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
  ret void
}

declare dso_local i32 @printf(i8*, ...)

; definition      {} -> {}
define dso_local void @"$main"() {
entry:
  %config = alloca %struct.Config, align 8
  %0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i64 0, i64 0), %struct.Config* nonnull %config)
  %1 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 2, i32 1, i32 0, i32 1
  %2 = bitcast %struct.anon.8* %1 to [256 x i8]*
  %3 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i64 0, i64 0), [256 x i8]* nonnull %2)
  %4 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0
  %5 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i64 0, i64 0), [256 x i8]* nonnull %4)
  ret void
}

Result:

Config:               0x16f8c6678
xBCLeftInflowProfile: 0x16f8c68b0
xBCLeftHeat:          0x16f8c667c

Not sure why it's treating printf as a i32 (i8*, %struct.Config*, ...)/i32 (i8*, [256 x i8]*, ...) function, however. Is this due to Terra's glue generating incorrect types for functions with varargs somehow?

I'd guess that the Linux test failures, which this issue was initially opened for, are unrelated to this, so maybe this belongs better in a separate issue?

@elliottslaughter
Copy link
Member Author

Wow, thanks for doing all this digging.

Yes, I think this is a separate issue: why is Terra thinking that printf is a i32 (i8*, T*, ...) (where T is the second argument) instead of i32 (i8*, ...). This could be in Terra's code generation, or it could be somewhere in the glue Terra uses to extract these definitions from Clang.

@elliottslaughter
Copy link
Member Author

Just one more note: the place to look for issues is likely going to be EmitCall, or the Classify call that it immediately does. This is a bit tricky because it's not just a matter of taking the function as-is and calling it. (E.g., structs have to be decomposed into scalar/vector/array arguments in a way that's architecture-specific.) So that's probably why there's an opportunity for Terra to mess this up.

terra/src/tcompiler.cpp

Lines 1467 to 1470 in 0cf6be6

Value *EmitCall(IRBuilder<> *B, Obj *ftype, Obj *paramtypes, Value *callee,
std::vector<Value *> *actuals) {
Classification info;
Classify(ftype, paramtypes, &info);

Hopefully it's just a matter of teaching Terra not to add arguments beyond those in the original function to the formal type it returns, but this may potentially go deeper into the compiler plumbing.

@shoe42
Copy link
Contributor

shoe42 commented Aug 21, 2022

I think trying to track this down is beyond me at the moment, I'm not familiar with Terra's (or LLVM's) compilation process, but might be able to take a look at somepoint in the future. Probably worth opening a separate issue for it.

@elliottslaughter
Copy link
Member Author

@shoe42 I have a potential fix for the varargs issue at #602. Can you please check if this works for you, and if so what do the test suite results look like after this?

@elliottslaughter
Copy link
Member Author

Coming back to the issue of Linux AArch64, I see that the following tests are still failing:

98% tests passed, 9 tests failed out of 549
Total Test time (real) =  32.29 sec
The following tests FAILED:
	121 - defercond.t (SEGFAULT)
	209 - let2.t (SEGFAULT)
	343 - simpleglobal.t (Failed)
	396 - teststd.t (SEGFAULT)
	412 - vars.t (SEGFAULT)
	413 - vars2.t (SEGFAULT)
	415 - vecarith.t (SEGFAULT)
	424 - zeroreturn.t (SEGFAULT)
	425 - zeroreturn2.t (SEGFAULT)

From #625

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants