libhio with internal json-c build fails on power9 nodes #49
Comments
@floquet you don't see this problem because, if you're building with spack, the libhio recipe uses an external json-c.
Easy enough to fix: untar the json-c tarball, run autoreconf -ivf, and tar it back up.
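A minimal sketch of that untar / autoreconf / re-tar sequence, written as a shell function. The function name regen_tarball and the tarball path are placeholders, and the sketch assumes the bundled tarball is gzip-compressed and unpacks to a single top-level directory; neither is confirmed in this thread.

```shell
# Regenerate the autotools files inside a source tarball, in place.
# Usage: regen_tarball path/to/json-c.tar.gz   (path is a placeholder)
regen_tarball() {
    local tarball workdir srcdir
    # Resolve to an absolute path so later directory changes do not matter.
    tarball=$(cd "$(dirname "$1")" && pwd)/$(basename "$1") || return 1
    workdir=$(mktemp -d) || return 1

    # Untar it.
    tar -xzf "$tarball" -C "$workdir" || return 1
    # Assumption: the tarball unpacks to exactly one top-level directory.
    srcdir=$(ls "$workdir")

    # -i installs missing auxiliary files, -v is verbose, -f forces
    # regeneration of files that already exist in the tree.
    (cd "$workdir/$srcdir" && autoreconf -ivf) || return 1

    # Tar it back up over the original tarball.
    tar -czf "$tarball" -C "$workdir" "$srcdir" || return 1
    rm -rf "$workdir"
}
```

It would be called as regen_tarball followed by the path of the json-c tarball shipped in the libhio source tree. Keeping a copy of the original tarball first is a sensible precaution, since the function overwrites it.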
@hppritcha: Darwin Power9: I look for Spack created these modules when it built
I was able to run autoreconf as @hjelmn suggests. However, when testing the resulting build I get ucx errors. It looks like this is common with power9s. I need to spend more time with it to solve the ucx error.
To remove the ucx error, I built ucx and then built openmpi to specifically include ucx. I started with the instructions I found here: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
However, my newly built openmpi encountered an error with MPI_Win_allocate_shared. I found that hio already has an alternative code path, guarded by "HIO_CRAY_BUFFER_BUG_1", that does not use MPI_Win_allocate_shared. That path is already used on the Cray systems to handle an issue encountered when using HIO with mpich. After switching to the alternative code path and recompiling libhio, the test cases pass on the darwin power9. I am not sure whether there is a performance reason for using MPI_Win_allocate_shared rather than the alternative path.
Hmm, try running with --mca osc rdma --mca btl_uct_memory_domains mlx5_0
The reason for the alternate path is Cray's insistence on using XPMEM underneath MPI_Win_allocate_shared. XPMEM memory regions can't be used for mutexes, condition variables, etc.
I am not sure which version of HIO your suggestion of adding "--mca osc rdma --mca btl_uct_memory_domains mlx5_0" was aimed at.
I tried it for a hio built against the gcc/7.3.0 and openmpi/2.1.5-gcc_7.3.0 modules on darwin. That version segfaults with and without these --mca options.
For my hio built against an openmpi that I personally built with ucx, I tried with and without the MPI_Win_allocate_shared call. With MPI_Win_allocate_shared, I get the same error I saw previously, indicating a failure during the allocate_shared call. Without MPI_Win_allocate_shared, I get the following warning: "Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue." The test seems to run correctly. If I add the suggested "--mca opal_common_ucx_opal_mem_hooks 1" option, I get this warning: "Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption." The test seems to run correctly in this case as well.
I have experimented with --mca options to mpirun on my own while working on this issue, never successfully. Can you explain why you suggested these particular options and what you hoped they would accomplish? It is interesting that you added the allocate_shared workaround because XPMEM didn't work for mutexes, but I believe I am seeing an error during the call to MPI_Win_allocate_shared itself, not when the window is later used for a mutex or the like.
The json-c tarball included in libhio is too old to recognize the system type (ppc64le), so the build fails when it tries to build the bundled json-c library.
The json-c tarball needs to be updated to one of the 0.13.1 releases. These releases work on darwin power9 nodes, for example.
The tarball also needs to be patched to support the doc gen removal and function renaming; see the json-c.patch file. Note that the current patch fails to apply cleanly to either json-c master or the 0.13.1 tags, so the patch will have to be redone manually.
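One way to see which hunks of json-c.patch still apply to a newer json-c tree before redoing it by hand is a dry run. The sketch below is only an illustration: the function name check_patch is hypothetical, and the -p1 strip level is an assumption about how the patch paths are written.

```shell
# Report whether a patch applies to a source tree, without modifying the tree.
# Usage: check_patch <source-dir> <patch-file>
check_patch() {
    local srcdir=$1 patchfile
    # Resolve the patch to an absolute path before changing directory.
    patchfile=$(cd "$(dirname "$2")" && pwd)/$(basename "$2") || return 1
    # --dry-run prints what would happen but changes no files;
    # --batch suppresses interactive questions;
    # -p1 strips the leading path component (an assumption, adjust as needed).
    if (cd "$srcdir" && patch -p1 --dry-run --batch < "$patchfile" > /dev/null); then
        echo "patch applies"
    else
        echo "patch does not apply"
        return 1
    fi
}
```

Dropping the redirect to /dev/null shows the per-hunk output, which identifies the hunks that need to be rewritten. Note that patch also reports success when hunks apply with an offset or fuzz, so a passing dry run is not a guarantee that the result is what the original patch intended.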