Notes 2019 10 31


Marc-Andre Hermanns edited this page Nov 4, 2019 · 4 revisions

Participants

Marc-Andre Hermanns
Bengisu Elis
Chris Chambreau
Joachim Protze
Josh Cottingham
Martin Schulz

Topics

QMPI

In OMPT the address given to the tool is the address where to jump back to
- Usually still in a register within the first call outside of the application
  - Cheap to get
- For a wrapper this is trivial to get
  - for GCC: __builtin_return_address(0) (see GCC Docs)
  - for Visual C/C++: _ReturnAddress (see MS Docs)
- From a PMPI to PnMPI
  - Access to this address has recently been added to PnMPI
  - On x86 architectures you can
    1. subtract 1 from this address
    2. feed this to addr2line
    3. get source line info from where the function was called
  - cheapest way to obtain such information
    - no stack tracing needed
    - you can store just the address and resolve on demand (or only once)
  - Helps with address-space randomisation
  - Can the address help with stack walking?
    - Not directly, as it is a pointer to the next instruction
      - no stack information
    - Frame address of the first frame in the runtime would be interesting for this
      - Pointer to the stack frame where the application entered MPI
      - Should also be easy for the MPI implementation to obtain
- Should we expose this information through a semi-opaque type like MPI_Status?
  - Quick access to known parameters
  - Allow implementation to provide internal information as well
  - Just a single additional argument to the QMPI callbacks (future proof)
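The two ideas above (capturing the return and frame address in a wrapper, and handing them to a tool in a semi-opaque record) could be sketched roughly as follows. This is only an illustration under assumptions: `call_info_t`, `tool_callback_t`, `example_tool` and `wrapped_call` are hypothetical names, not part of QMPI, PnMPI or any MPI implementation, and `wrapped_call` merely stands in for an intercepted MPI function. It relies on the GCC/Clang builtins `__builtin_return_address` and `__builtin_frame_address`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical semi-opaque call-site record, in the spirit of MPI_Status:
 * a few known fields plus space reserved for the implementation. */
typedef struct {
    void *return_address;  /* next instruction after the call into the wrapper */
    void *frame_address;   /* frame of the wrapper, i.e. where the application entered */
    void *internal[2];     /* reserved for implementation-specific information */
} call_info_t;

/* Hypothetical tool callback: the record is the single additional argument. */
typedef void (*tool_callback_t)(const call_info_t *info);

/* What the example tool recorded most recently. */
call_info_t last_seen;

void example_tool(const call_info_t *info)
{
    /* Store only the addresses; resolving to source lines can happen later. */
    last_seen = *info;
    /* Subtract 1 so the address falls inside the call instruction itself. */
    printf("address to resolve: %p\n",
           (void *)((char *)info->return_address - 1));
}

/* Stand-in for an intercepted MPI function (assumption for this sketch).
 * noinline keeps the return address pointing into the caller. */
__attribute__((noinline))
int wrapped_call(tool_callback_t cb)
{
    assert(cb != NULL);
    call_info_t info = {
        .return_address = __builtin_return_address(0),
        .frame_address  = __builtin_frame_address(0),
    };
    cb(&info);
    return 0;
}
```

Calling `wrapped_call(example_tool)` from application code prints one address that can be passed to `addr2line -e <binary>`; for position-independent executables the load base may need to be subtracted first.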
Thread-safety to register and de-register tools at runtime
- Dynamic registration/deregistration at runtime may become problematic
- Global registry would be needed
  - needs to be locked every time to look into the table
  - runtime overhead is not worth the additional flexibility
- In OMPT a tool should just return (in the callback) instead of trying to deregister
- For QMPI: query the table once at the beginning; the data can then live in a local variable
- atomics would not really help
  - memory fences prevent caching and slow down performance
  - all threads look at same data structure (no copies possible)
    - access across NUMA boundaries (incurs performance hit)
  - Maybe less of a problem for MPI as calls are less frequent?
    - This would be good to verify on a broader set of platforms
- What do we want to optimize for?
  - OMPT -> optimize for no tool in the chain
    - Some additional penalty for adding a tool
    - Do static branch prediction (assume code-path without tool to be likely)
    - Would probably be favoured by implementations
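A minimal sketch of the approach discussed above: the tool table is queried once, kept in a local variable, and the dispatch is hinted so that the path without a tool is treated as the likely one. All names here (`tool_table_t`, `registered_tools`, `dispatch_send`, `run_calls`) are hypothetical and chosen for illustration only; the sketch assumes the table is not modified after start-up, which is what makes the lock-free snapshot safe. `__builtin_expect` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hint to the compiler that the condition is expected to be false. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

typedef int (*send_fn_t)(int value);

/* Hypothetical tool table, filled before any call and read-only afterwards. */
typedef struct {
    send_fn_t send;   /* NULL means no tool intercepts this call */
} tool_table_t;

tool_table_t registered_tools = { NULL };
int tool_calls = 0;   /* counts how often the example tool was invoked */

/* Stand-in for the real operation (assumption for this sketch). */
int base_send(int value) { return value; }

/* Example tool: counts the call, then forwards to the real operation. */
int tool_send(int value)
{
    tool_calls++;
    return base_send(value);
}

/* Dispatch optimised for the case that no tool is in the chain. */
int dispatch_send(const tool_table_t *table, int value)
{
    if (UNLIKELY(table->send != NULL))
        return table->send(value);
    return base_send(value);
}

/* Query the table once, then work only on the local copy: no lock,
 * no atomics, and no shared data structure on the hot path. */
int run_calls(int n)
{
    assert(n >= 0);
    const tool_table_t local = registered_tools;
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += dispatch_send(&local, i);
    return sum;
}
```

The trade-off matches the notes: a tool registered after the snapshot was taken is not seen by calls already in flight, which is acceptable if registration only happens at start-up.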
Static linking/loading
- Always both present or not?
- What about extensions rather than tools?
- Can you make the same static library active at the same time?
  - Tool needs to handle this
- Dynamic tool may need
  - Dynamic library linked at link time (loaded)
  - Dynamic library opened via dlopen
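The two dynamic cases above could be handled by a lookup along the following lines. This is a sketch under assumptions, not a proposed QMPI interface: `find_tool_entry` is a hypothetical helper and any symbol name or library path passed to it is chosen by the caller. It uses only the POSIX calls `dlopen` and `dlsym`.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Look up a tool entry point by name.
 * 1. Search what is already loaded into the process; this covers a tool
 *    library that was linked at link time and loaded at start-up.
 * 2. If that fails and a path is given, open the library with dlopen
 *    and search it.
 * Returns NULL if the symbol cannot be found either way. */
void *find_tool_entry(const char *symbol, const char *library_path)
{
    assert(symbol != NULL);

    /* dlopen(NULL, ...) yields a handle for the running program and the
     * libraries loaded with it. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self != NULL) {
        void *entry = dlsym(self, symbol);
        if (entry != NULL)
            return entry;
    }

    if (library_path == NULL)
        return NULL;

    void *handle = dlopen(library_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return NULL;

    /* The handle is deliberately kept open so that the returned
     * pointer stays valid for the lifetime of the process. */
    return dlsym(handle, symbol);
}
```

On older glibc versions this needs to be linked with `-ldl`. Symbols defined in the main executable itself are only found by the first step if they were exported, for example by linking with `-rdynamic`.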