Notes 2016 06 17
Kathryn Mohror edited this page Jun 17, 2016
·
2 revisions
- Jean-Baptiste Besnard
- John DelSignore
- Marc-Andre Hermanns
- Kathryn Mohror
- Bob Moench
- Anh Vo
- At the Forum face-to-face meeting, we discussed how the functionality currently implemented with the being_debugged variable will be supported in MPIR-2
- Since we changed the definition of being_debugged such that it can be defined in all MPI processes and not just the starter process, do we want to follow that model in MPIR-2?
- The consensus during the forum discussion was no, we don't want to have a callback for every process each time attach/detach occurs
- Could we keep this symbol based in MPI processes, and use an API-based mechanism in the starter process for MPIR-2?
- What is the difference between debug_gate variable and being_debugged variable? Are they just variations of the same thing?
- No, debug_gate is a synchronization mechanism so that the MPI processes don't run away (and potentially terminate) before the debugger has a chance to attach to them
- There are other implementations of this besides debug_gate. Some use a barrier, some use exit from execv
- Are we dictating implementation by defining these variables?
- Yes
- Should we change how this is managed so that implementations can choose for themselves what makes sense (variable vs. barrier vs. something else)?
- Yes
- We should make this part of the dll, and leave details out of the MPIR spec
- Debugger calls into dll and indicates which processes it wants to debug, and it happens somehow
- Does this mean we would need per process initialization and per process dll?
- No, the mechanism is up to the implementation
- Need an entry point into the dll that says: these are the processes I want to debug
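The idea of one entry point with an implementation-chosen hold mechanism can be sketched roughly as follows. This is only an illustration in Python: the class and method names (`DebugSupport`, `debug_processes`, `attached`, `held`) are hypothetical and are not taken from MPIR-1 or from any MPIR-2 draft, and the release semantics shown are an assumption made for the example.

```python
from abc import ABC, abstractmethod


class DebugSupport(ABC):
    """Hypothetical DLL entry points; all names here are illustrative only."""

    @abstractmethod
    def debug_processes(self, pids):
        """Debugger says: these are the OS processes I want to debug."""

    @abstractmethod
    def attached(self, pid):
        """Debugger reports that it has attached to one selected process."""

    @abstractmethod
    def held(self):
        """Return the sorted PIDs that are currently prevented from running."""


class GateSupport(DebugSupport):
    """Per-process flag, in the spirit of a debug_gate variable."""

    def __init__(self, all_pids):
        self._gate_open = {pid: False for pid in all_pids}

    def debug_processes(self, pids):
        wanted = set(pids)
        # Assumption: processes the debugger did not ask for are released.
        for pid in self._gate_open:
            self._gate_open[pid] = pid not in wanted

    def attached(self, pid):
        self._gate_open[pid] = True

    def held(self):
        return sorted(p for p, is_open in self._gate_open.items() if not is_open)


class BarrierSupport(DebugSupport):
    """One shared barrier: everyone waits until all selected PIDs are attached."""

    def __init__(self, all_pids):
        self._all = set(all_pids)
        self._pending = set(all_pids)

    def debug_processes(self, pids):
        self._pending = set(pids) & self._all

    def attached(self, pid):
        self._pending.discard(pid)

    def held(self):
        return sorted(self._all) if self._pending else []


def attach_to(support, pids):
    """Debugger-side logic; identical whichever mechanism is behind it."""
    support.debug_processes(pids)
    snapshots = [support.held()]
    for pid in pids:
        support.attached(pid)
        snapshots.append(support.held())
    return snapshots
```

The debugger-side function `attach_to` never learns whether a variable or a barrier is used, which is the property the discussion above is asking for.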
- At Forum, we decided that if the MPI implementation makes the MPIR-1 symbols available, then it needs to adhere to the MPIR-1 spec
- If an MPI only wants to support MPIR-2, then the symbols are not available
- Currently, the process table is a table of all MPI processes indexed by their rank in COMM_WORLD
- In MPIR-2 we want to change this to support things we see in existence now, or expect to come in the future
- MPI implementations where MPI processes are implemented as threads (e.g. MPC)
- MPI endpoints
- MPI sessions
- dynamic processes
- ?
- The index being the rank in WORLD needs to go
- What if WORLD is not defined?
- What if different WORLDS combine?
- What if processes come and go?
- From the debugger point of view, all it needs is a list of OS processes because that's what it attaches to
- However, users think of MPI processes in terms of ranks. What if they want to attach to one rank or a subset of ranks?
- Many debugging cases involve a problem always occurring in rank X
- Does the starter process know where the ranks are?
- Yes, probably, but ranks only make sense with respect to a communicator.
- Can we always assume that WORLD will exist as we know it today in MPI? Probably not...
- Historically, tools have relied on the useful assumption that a process's rank matches its position in the table
- Going to need another interface to get that information
- We want a more general model that can describe where a process lives in some context
- Is the context a communicator? A session? An instant in time?
- WORLD could have multiple MPI processes per OS PID
- How can a user make sense of what they are debugging?
- What about a generic labeling mechanism?
- MPI gives a list of OS processes with some label, e.g. "Ranks 0-7", e.g. "progress thread"
- How does a user figure out what the labels mean?
- Well, hopefully they would be intuitive
- Does the labeling scheme limit scalability?
- E.g. tools that group processes based on behavior, like STAT
- If the ranks aren't ints, then how can an aggregation tool scalably represent groups of processes?
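The scalability concern can be made concrete with a small sketch. It assumes nothing about how STAT or any other tool is actually implemented, and the function names are hypothetical. Integer ranks can be collapsed into ranges, whereas opaque labels can only be grouped by exact equality.

```python
def compress_ranks(ranks):
    """Collapse integer ranks into a compact label such as '0-3,7'."""
    ordered = sorted(set(ranks))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next rank is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def group_by_label(entries):
    """Group (pid, label) pairs by label; labels are treated as opaque strings."""
    groups = {}
    for pid, label in entries:
        groups.setdefault(label, []).append(pid)
    return {label: sorted(pids) for label, pids in groups.items()}
```

For example, `compress_ranks([0, 1, 2, 3, 7])` yields `"0-3,7"`, so a million consecutive ranks still fit in one short string. With free-form labels such as "progress thread", a tool can only bucket processes that carry the identical label.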
- Thread-based MPIs
- Same PID could appear multiple times in the table
- Want a mechanism that says "I have this thread. Is it an MPI task or not?"
- What about a mechanism that tells the mapping?
- Either: this is a one-to-one mapping (like MPIR-1) or it is not
- Would that work?
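A minimal sketch of the two queries raised above, assuming a hypothetical table whose entries are `(pid, tid, task_id)` tuples. This layout is invented for illustration and is not the MPIR-1 process table format.

```python
from collections import defaultdict


def tasks_by_pid(table):
    """Map each OS PID to the MPI task ids that live inside it."""
    mapping = defaultdict(list)
    for pid, _tid, task_id in table:
        mapping[pid].append(task_id)
    return dict(mapping)


def is_one_to_one(table):
    """True if every PID hosts exactly one MPI task (the MPIR-1 style case)."""
    return all(len(tasks) == 1 for tasks in tasks_by_pid(table).values())


def is_mpi_task(table, pid, tid):
    """Answer: 'I have this thread. Is it an MPI task or not?'"""
    return any(p == pid and t == tid for p, t, _task in table)
```

With a process-based table such as `[(100, 100, 0), (101, 101, 1)]`, `is_one_to_one` returns `True`. With a thread-based table such as `[(200, 201, 0), (200, 202, 1)]`, the same PID appears twice and it returns `False`, which tells a tool that it needs the finer-grained thread query.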
- This is hard since we are trying to design for moving targets in many cases
- What will happen with endpoints, sessions, dynamic processes?
- We need more input from implementors
- Let's reach out to some folks and see if they are interested in participating
- Don't want to design something that is incompatible with a particular implementation