libhio with internal json-c build fails on power9 nodes #49

Open
hppritcha opened this issue Apr 26, 2019 · 8 comments

@hppritcha
Member

The json-c tarball bundled with libhio is too old to recognize the ppc64le system type, so the build fails when it tries to build the json-c library.

The json-c tarball needs to be updated to one of the 0.13.1 releases. These releases work on darwin power9 nodes, for example.

The new tarball also needs to be patched to remove doc generation and rename functions. See the json-c.patch file. Note that the current patch does not apply cleanly to either json-c master or the 0.13.1 tags, so it will have to be redone manually.

@hppritcha
Member Author

@floquet you don't see this problem because, when you build with spack, the libhio recipe uses an external json-c.

@hjelmn
Collaborator

hjelmn commented Apr 26, 2019

Easy enough to fix. Untar the json-c tarball, run autoreconf -ivf, and tar it back up.
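The three steps above could be scripted roughly as follows. This is a minimal sketch, not the project's actual procedure: the function name `refresh_tarball`, the gzip compression, and both arguments are assumptions, since the thread does not name the tarball or its top-level directory. With `-i` and `-f`, autoreconf typically replaces helper scripts such as config.guess and config.sub with the copies from the locally installed autotools, which is presumably why this lets the build recognize ppc64le.

```shell
#!/bin/sh
# Sketch only: refresh the autotools files inside a bundled source tarball.
# Usage: refresh_tarball <tarball.tar.gz> <top-level-directory-inside-it>
# Both names are placeholders; check the libhio tree for the real ones.
refresh_tarball() {
    tarball=$1
    srcdir=$2

    # Unpack the bundled sources into the current directory.
    tar -xzf "$tarball" || return 1

    # -i installs missing auxiliary files, -v is verbose, -f forces regeneration.
    (cd "$srcdir" && autoreconf -ivf) || return 1

    # Repack over the original tarball.
    tar -czf "$tarball" "$srcdir"
}
```

Running it on a copy of the tarball first keeps the original available in case the regenerated files cause other problems.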

hppritcha added this to the v1.4.1.4 milestone Apr 26, 2019
@floquet
Collaborator

floquet commented Apr 26, 2019

@hppritcha: On Darwin Power9, I look for /usr/lib64/json-c. If it is not found, I build the latest version.

Spack created these modules when it built json-c:
/scratch/users/dantopa/new-spack/libraries/darwin-power9.libhio/share/spack/modules/linux-rhel7-ppc64le/json-c/0.13.1-gcc-4.8.5, json-c/0.13.1-gcc-6.4.0
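The check-then-build logic described in this comment might look something like the sketch below. The function name `choose_json_c` and the strings it prints are made up for illustration; only the default path comes from the comment, and the test used here (that the path exists) is an assumption about what "look for" means.

```shell
#!/bin/sh
# Sketch only: prefer a system json-c if one is present, otherwise fall back
# to building it. Output strings are illustrative, not from any real tool.
choose_json_c() {
    system_path=${1:-/usr/lib64/json-c}

    if [ -e "$system_path" ]; then
        # A system copy exists, so use it as the external json-c.
        echo "external $system_path"
    else
        # Nothing found at that path, so a fresh build is needed.
        echo "build"
    fi
}
```

A caller could branch on the first word of the output to decide whether to load an existing module or start a build.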

@plamborn
Contributor

plamborn commented May 3, 2019

I was able to run autoreconf as @hjelmn suggested. However, when testing the resulting build I get ucx errors, which appear to be common on power9 systems. I need to spend more time on this to resolve the ucx error.

@plamborn
Contributor

To get rid of the ucx error, I built ucx and then built openmpi with ucx support explicitly enabled. I started from the instructions I found here: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
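For readers who want to reproduce the build order described above, here is a hedged sketch. The function name, the three arguments, and the install layout under the prefix are all hypothetical; the thread does not say which versions or paths were used. It assumes a UCX release tarball (which ships `contrib/configure-release`) and uses Open MPI's `--with-ucx` configure option to point at the UCX install. Pass an absolute prefix, since the function changes directory before configuring.

```shell
#!/bin/sh
# Sketch only: build UCX, then build Open MPI against that UCX install.
# Usage: build_ompi_with_ucx <ucx-source-dir> <ompi-source-dir> <absolute-prefix>
# All three arguments are placeholders, not values taken from the thread.
build_ompi_with_ucx() {
    ucx_src=$1
    ompi_src=$2
    prefix=$3

    # Step 1: configure, build, and install UCX into <prefix>/ucx.
    (cd "$ucx_src" \
        && ./contrib/configure-release --prefix="$prefix/ucx" \
        && make -j4 \
        && make install) || return 1

    # Step 2: build Open MPI and tell configure where UCX was installed.
    (cd "$ompi_src" \
        && ./configure --prefix="$prefix/ompi" --with-ucx="$prefix/ucx" \
        && make -j4 \
        && make install) || return 1
}
```

libhio would then need to be rebuilt with the new installation's compiler wrappers first on `PATH`.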

However, my newly built openmpi encountered an error with MPI_Win_allocate_shared. I found that hio already has an alternative code path, used to work around "HIO_CRAY_BUFFER_BUG_1", that does not call MPI_Win_allocate_shared. The alternative path is already used on the Cray systems to handle an issue encountered when using HIO with mpich.

After switching to the alternative code path and recompiling libhio, the test cases pass on the darwin power9. I am not sure if there is a performance reason for using MPI_Win_allocate_shared versus the alternative path.

@hjelmn
Collaborator

hjelmn commented May 21, 2019

Hmm, try running with --mca osc rdma --mca btl_uct_memory_domains mlx5_0
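Spelled out as a full launch line, the suggestion would look roughly like the sketch below. The wrapper name `run_with_rdma_osc`, the process count, and the test program are placeholders. In Open MPI, `osc` is the one-sided communication framework and `btl_uct_memory_domains` is a parameter of the uct BTL; the thread itself does not explain why these values were chosen.

```shell
#!/bin/sh
# Sketch only: launch a program with the MCA options suggested above.
# Usage: run_with_rdma_osc <num-procs> <program> [args...]
# The process count and program are placeholders supplied by the caller.
run_with_rdma_osc() {
    np=$1
    shift

    # Select the rdma one-sided component and name the uct memory domain.
    mpirun -np "$np" \
        --mca osc rdma \
        --mca btl_uct_memory_domains mlx5_0 \
        "$@"
}
```

The device name `mlx5_0` is specific to the node's hardware, so it may need to change on other systems.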

@hjelmn
Collaborator

hjelmn commented May 21, 2019

The reason for the alternate path is Cray's insistence on using XPMEM underneath MPI_Win_allocate_shared. XPMEM memory regions can't be used for mutexes, condition variables, etc.

@plamborn
Contributor

I am not sure which version of HIO your suggestion of adding "--mca osc rdma --mca btl_uct_memory_domains mlx5_0" was meant for.

I tried it with an hio built against the gcc/7.3.0 and openmpi/2.1.5-gcc_7.3.0 modules on darwin. That version segfaults with and without these --mca options.

For my hio, which uses an openmpi I built myself with ucx, I tried both with and without the MPI_Win_allocate_shared call.

With MPI_Win_allocate_shared, I get the same error I saw previously, indicating a failure during the allocate_shared call.

Without MPI_Win_allocate_shared, I get the following warning: "Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue." The test seems to run correctly.

If I add the suggested "--mca opal_common_ucx_opal_mem_hooks 1" option, I get this warning "Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption." The test seems to run correctly in this case as well.
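A side-by-side comparison like the one described in the last two paragraphs could be automated along these lines. This is only a sketch: the function name `compare_mem_hooks` and the report format are invented for illustration, and the only option taken from the thread is `--mca opal_common_ucx_opal_mem_hooks 1`.

```shell
#!/bin/sh
# Sketch only: run the same MPI test with and without the memory-hooks option
# named in the UCX warning, and report the exit status of each run.
# Usage: compare_mem_hooks <num-procs> <program> [args...]
compare_mem_hooks() {
    np=$1
    shift

    for hooks in off on; do
        if [ "$hooks" = on ]; then
            mpirun -np "$np" --mca opal_common_ucx_opal_mem_hooks 1 "$@"
        else
            mpirun -np "$np" "$@"
        fi
        # $? here is the status of the mpirun that just ran in the branch above.
        echo "mem_hooks=$hooks exit=$?"
    done
}
```

An exit status of zero does not mean the warning went away, so saving each run's output to a file is still needed to compare the messages.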

I have played with --mca options to mpirun on my own while working on this issue, never successfully. Can you explain why you suggested these particular options and what you hoped they would accomplish?

It is interesting that you added the allocate_shared workaround because XPMEM didn't work for mutexes. However, I believe I am seeing an error during the call to MPI_Win_allocate_shared itself, not when the allocated memory is later used for a mutex or the like.
