Add AWS GPU Runner #107
Conversation
Got a few segfaults:
On the OpenCL stuff, so we might need to look into what we might be missing. Also saw this:
Okay, switching to CUDA 11.7 made the warning go away, but I am still getting a segfault. @peastman do you have any ideas what could be going on?
Forgot to run the CUDA tests!
Okay, so just OpenCL is failing. I wonder if we need to add something to the runner host?
Hmmm
Looks like it might be an issue for where
Maybe not, since the output of the GitHub runner has the same error:
I am guessing it has to do with the OpenCL driver on the Amazon Linux box... I will spin up an instance to try and debug it interactively.
So when I spun up an EC2 instance with the same AMI, I didn't get any tests to fail, but I did use micromamba to set up the env, so I will switch the self-hosted CI to micromamba and see if that fixes our problems.
Switching to micromamba worked!
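For reference, a minimal sketch of recreating the build environment with micromamba outside the workflow; the env name is a placeholder and the env file path is assumed from how the workflow selects devtools/conda-envs/build-${{ env.os }}.yml:
# Assuming micromamba is already installed on the host.
$ micromamba create -y -n gpu-ci -f devtools/conda-envs/build-ubuntu-22.04.yml
# Inspect what got solved into the env (useful for the comparisons below).
$ micromamba list -n gpu-ci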
push:
  branches:
    - master
    - feat/add_aws_gpu_runner
- feat/add_aws_gpu_runner
Remove before merge
@peastman how does this look? I think we just need to decide when it runs, maybe on merge to master, and I will keep the part that lets you run it on-demand?
NVCC_VERSION: ${{ env.nvcc-version }}
PYTORCH_VERSION: ${{ env.pytorch-version }}
- uses: mamba-org/provision-with-micromamba@main
Switching to micromamba is not a solution. It just hides some issue with the dependencies or mamba.
Well here is the diff between the envs:
2c2
< _openmp_mutex 4.5 2_gnu conda-forge
---
> _openmp_mutex 4.5 2_kmp_llvm conda-forge
33d32
< intel-openmp 2022.1.0 h9e868ea_3769
69a69
> llvm-openmp 16.0.2 h4dfa4b3_0 conda-forge
72c72
< mkl 2022.1.0 hc2b9512_224
---
> mkl 2022.2.1 h84fe81f_16997 conda-forge
92c92
< python 3.10.10 he550d4f_0_cpython conda-forge
---
> python 3.10.0 h543edf9_3_cpython conda-forge
103a104
> sqlite 3.40.0 h4ff8645_1 conda-forge
105a107
> tbb 2021.8.0 hf52228f_0 conda-forge
Where > is the working one. So it looks like the sqlite & tbb packages are added, and llvm-openmp was chosen instead of Intel's implementation. I am not sure which pins need to be adjusted. Why only the OpenCL tests would fail with a different mutex flavor, mkl version, and openmp implementation, and none of the others, I have no idea.
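A diff like the one above can be produced by listing both environments and comparing the output; a sketch, with placeholder env names:
# "mamba-env" and "micromamba-env" are placeholders for the two environments.
$ mamba list -n mamba-env > mamba.txt
$ micromamba list -n micromamba-env > micromamba.txt
$ diff mamba.txt micromamba.txt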
My bet is on the ocl-icd package being somehow overridden by the system one.
So now do you want me to add pins to the environment yaml? Previously, you didn't want me to do that.
@mikemhenry I say if pinning some versions is what it takes, so be it. We can relax these later, after merging. AFAIK I cannot play around with this at the moment.
I would start by trying whether installing llvm-openmp is enough. Another option could be to use the OCL library that comes with CUDA.
Maybe skipping these two is enough for cmake to pick the CUDA ones?
-DOPENCL_INCLUDE_DIR=${CONDA_PREFIX}/include \
-DOPENCL_LIBRARY=${CONDA_PREFIX}/lib/libOpenCL${SHLIB_EXT}
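A sketch of what that could look like; the CUDA toolkit path below is an assumption, and the other configure flags from the workflow are omitted:
# Option 1: drop the conda OpenCL hints entirely and let FindOpenCL search on its own.
$ cmake .. -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX}
# Option 2: point the hints at the copy that ships with the CUDA toolkit.
$ cmake .. \
    -DOPENCL_INCLUDE_DIR=/usr/local/cuda/include \
    -DOPENCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so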
I would really like this merged ASAP so we can also merge #106
Resolves #93
Going to see if I can figure this out interactively. For reference, this is what mambaforge reports:
And now micromamba:
Using a lock file (so the envs with mamba and micromamba are the same) gives me the same result: all the tests pass except these when using mamba; everything works with micromamba. What is also really confusing is that the Python OpenCL tests pass for mamba...
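For reproducing this quickly, the suspect subset can be selected with the same -k style used elsewhere in this thread; a sketch:
# Run just the OpenCL subset of the Python tests.
$ pytest -v -s -k "OpenCL" Test*py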
This is what the segfault looks like when I use
Here is the env file, if you make the env with
At this point, I am inclined to just move forward and use
This is the output of ldd on the different binaries (
What version of OpenCL are you compiling against, and which are you running against?
(force-pushed 7bb3469 to 552c4f0)
Looks like whatever ships with conda. I do get this warning:
What header files are you compiling against?
Inspect libOpenCL.so.1 (with the file command) in both cases; maybe you will see that in one instance it is just a symlink to the system one?
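For example, something along these lines (the test binary name is a placeholder):
# Is the conda loader a regular file or a symlink to the system one?
$ file ${CONDA_PREFIX}/lib/libOpenCL.so.1
$ readlink -f ${CONDA_PREFIX}/lib/libOpenCL.so.1
# Which libOpenCL does a test binary actually resolve at run time?
$ ldd ./TestBinary | grep -i opencl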
Looks like they are not pointing to the system libopencl:
Interestingly, they are not the same file:
Will keep investigating.
Looks like they downloaded the same package...
And just for completeness, every
And just to double check, both tests point to a different libopencl
What OpenCL headers are you compiling against?
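One way to check (a sketch; the conda prefix, CUDA path, and build directory are assumptions):
# List candidate OpenCL headers in the two likely locations.
$ ls ${CONDA_PREFIX}/include/CL/ /usr/local/cuda/include/CL/
# The include dir and library the build actually picked are recorded in the CMake cache.
$ grep -i opencl build/CMakeCache.txt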
Replacing the
Looking at ccmake,
So the ones in
See this in your previous answer:
Maybe you are linking against this one via torch/cuda.
Or this (micro):
So you have two different versions of the same library in the binary, right? Another guess is the ocl-icd package https://github.com/OCL-dev/ocl-icd
Perhaps this is working incorrectly, or the installation order matters.
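If it is the ICD loader, one quick check is which vendor .icd files each loader sees; a sketch (paths and the test binary name are assumptions):
# System ICD registry vs. the one inside the conda environment.
$ ls /etc/OpenCL/vendors/
$ ls ${CONDA_PREFIX}/etc/OpenCL/vendors/ 2>/dev/null
# ocl-icd honors OCL_ICD_VENDORS, so a specific registry can be forced for a test run.
$ OCL_ICD_VENDORS=/etc/OpenCL/vendors ./TestBinary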
One thing that makes no sense is if I copy the
which is just from conda patching the path into the binary.
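The patched path can be confirmed by looking at the RPATH/RUNPATH entries directly, e.g. (the test binary name is a placeholder):
# Show the run path baked into the library and into a test binary.
$ readelf -d ${CONDA_PREFIX}/lib/libOpenCL.so.1 | grep -iE 'rpath|runpath'
$ readelf -d ./TestBinary | grep -iE 'rpath|runpath'
# patchelf can print (or rewrite) it as well.
$ patchelf --print-rpath ${CONDA_PREFIX}/lib/libOpenCL.so.1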
Maybe this path inside libopencl.so makes the binary also load other stuff from micromamba.
Revisiting this: we have been using this PR to test a few other PRs and it has been helpful. I understand that switching to micromamba may be a bit of a hack, but I would rather have something that works than nothing at all.
I agree @mikemhenry, we ran out of things to try. We cannot say we understand what is making the OpenCL tests fail, but I think it is safe to say that it is caused by something in the environment.
OK! Let's use Micromamba if there is no other option.
Sounds good, sorry I couldn't quite figure it out!
name: Do the job on the runner
needs: start-runner # required to start the main job when the runner is ready
runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
timeout-minutes: 1200 # 20 hrs
Given that it costs money, I would lower this. On my machine the tests take ~1 minute. Maybe 120 mins should give enough room for compilation, etc.?
Forgot to take into account the Python tests, which do take a while on the CPU. Although 2 hours should be enough, perhaps we can skip CPU tests in this runner:
$ pytest -v -s -k "not Reference and not CPU" Test*py
Yes, this is a great idea, I didn't think about that. I am trying to think if there is a case where it is useful to run the CPU tests on this runner... Something else to play with is pytest-xdist, since these boxes are MUCH more powerful than what GHA gives us.
So do you think there is any value in running the CPU tests on the GPU runner?
I mean, on one hand they are already being run on the normal CI; OTOH I can see bugs arising in the CPU version only on a GPU env, or the other way around. To give some examples: conda-forge/openmm-torch-feedstock#37, or the problems we could not eventually crack on this very PR.
The safe thing to do is probably just give it 3 hours and run every test.
Another option could be to run CPU tests in parallel, which pytest allows AFAIK. Something like:
$ pytest -n 4 -v -s -k "Reference or CPU" Test*py &
$ pytest -v -s -k "not Reference and not CPU" Test*py
$ wait
With -n auto the tests take ~11 minutes to run, so I think it is worth running the CPU tests, since for troubleshooting one of the first things I would want to do is see if the CPU tests work.
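i.e. something like this, assuming pytest-xdist is installed in the env:
# Run the whole suite in parallel across the available CPU cores.
$ pytest -n auto -v Test*py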
timeout-minutes: 1200 # 20 hrs
env:
  HOME: /home/ec2-user
  os: ubuntu-22.04
Above it says ubuntu-latest, but here it is ubuntu-22.04. Is this intentional?
That is to make the way the env file is chosen (devtools/conda-envs/build-${{ env.os }}.yml) match how we do the CI on GHA.
For the start-runner block, we just need a VM to spin up and turn on the GPU runner at AWS, so I chose to use ubuntu-latest, since the version doesn't really matter and I would rather not have to worry about it; it should work fine when 24.04 comes out, for example.
Just a couple of minor comments, but LGTM. Thanks @mikemhenry !
"OK! Let's use Micromamba if there is no other option."
No description provided.