
Add AWS GPU Runner #107

Merged: 8 commits merged into master on Sep 6, 2023
Conversation

@mikemhenry (Collaborator):

No description provided.

@mikemhenry:

Got a few segfaults:

Warning: Version of installed CUDA didn't match package
Test project /actions-runner/_work/openmm-torch/openmm-torch/build
    Start 1: TestSerializeTorchForce
1/5 Test #1: TestSerializeTorchForce ..........   Passed    0.27 sec
    Start 2: TestReferenceTorchForce
2/5 Test #2: TestReferenceTorchForce ..........   Passed    0.61 sec
    Start 3: TestOpenCLTorchForceSingle
3/5 Test #3: TestOpenCLTorchForceSingle .......***Exception: SegFault  0.45 sec

    Start 4: TestOpenCLTorchForceMixed
4/5 Test #4: TestOpenCLTorchForceMixed ........***Exception: SegFault  0.45 sec

    Start 5: TestOpenCLTorchForceDouble
5/5 Test #5: TestOpenCLTorchForceDouble .......***Exception: SegFault  0.45 sec


40% tests passed, 3 tests failed out of 5

Total Test time (real) =   2.22 sec

The following tests FAILED:
	  3 - TestOpenCLTorchForceSingle (SEGFAULT)
	  4 - TestOpenCLTorchForceMixed (SEGFAULT)
	  5 - TestOpenCLTorchForceDouble (SEGFAULT)
Errors while running CTest

The failures are all on the OpenCL side, so we might need to look into what we are missing there. I also saw this: "Warning: Version of installed CUDA didn't match package", so I am going to see if we can use CUDA 11.7, which is what the host has by default.
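For reference, a quick way to check what CUDA the host actually provides before pinning the conda package version (a sketch; assumes nvidia-smi and nvcc are on the PATH):

$ nvidia-smi | head -n 4          # driver version and the CUDA version it supports
$ nvcc --version | grep release   # toolkit version installed on the host, e.g. release 11.7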

@mikemhenry:

Okay, switching to CUDA 11.7 made the warning go away, but I am still getting a segfault. @peastman, do you have any ideas about what could be going on?

@mikemhenry:

Forgot to run the CUDA tests!

Test project /actions-runner/_work/openmm-torch/openmm-torch/build
    Start 1: TestSerializeTorchForce
1/8 Test #1: TestSerializeTorchForce ..........   Passed    0.32 sec
    Start 2: TestReferenceTorchForce
2/8 Test #2: TestReferenceTorchForce ..........   Passed    0.62 sec
    Start 3: TestOpenCLTorchForceSingle
3/8 Test #3: TestOpenCLTorchForceSingle .......***Exception: SegFault  0.50 sec

    Start 4: TestOpenCLTorchForceMixed
4/8 Test #4: TestOpenCLTorchForceMixed ........***Exception: SegFault  0.50 sec

    Start 5: TestOpenCLTorchForceDouble
5/8 Test #5: TestOpenCLTorchForceDouble .......***Exception: SegFault  0.42 sec

    Start 6: TestCudaTorchForceSingle
6/8 Test #6: TestCudaTorchForceSingle .........   Passed   12.85 sec
    Start 7: TestCudaTorchForceMixed
7/8 Test #7: TestCudaTorchForceMixed ..........   Passed    4.67 sec
    Start 8: TestCudaTorchForceDouble
8/8 Test #8: TestCudaTorchForceDouble .........   Passed    4.57 sec

Okay, so it is just OpenCL failing. I wonder if we need to add something to the runner host?
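One thing worth checking on the runner host is whether an OpenCL ICD for the NVIDIA driver is actually registered (a sketch; clinfo may need to be installed separately):

$ ls /etc/OpenCL/vendors/            # ICD files the loader picks up, e.g. nvidia.icd
$ clinfo | grep -i 'platform name'   # should list the NVIDIA OpenCL platform if the ICD is visible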

@mikemhenry:

Hmmm

-- Found OPENCL: /actions-runner/_work/openmm-torch/openmm-torch/3/envs/build/lib/libOpenCL.so
CMake Warning (dev) at 3/envs/build/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:438 (message):
  The package name passed to `find_package_handle_standard_args` (OPENCL)
  does not match the name of the calling package (OpenCL).  This can lead to
  problems in calling code that expects `find_package` result variables
  (e.g., `_FOUND`) to follow a certain pattern.
Call Stack (most recent call first):
  FindOpenCL.cmake:85 (find_package_handle_standard_args)
  CMakeLists.txt:124 (FIND_PACKAGE)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Configuring done (26.8s)
CMake Warning at platforms/opencl/CMakeLists.txt:59 (ADD_LIBRARY):
  Cannot generate a safe runtime search path for target OpenMMTorchOpenCL
  because files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libOpenCL.so.1] in /actions-runner/_work/openmm-torch/openmm-torch/3/envs/build/lib may be hidden by files in:
      /usr/local/cuda/lib64

  Some of these libraries may not be found correctly.


CMake Warning at platforms/opencl/tests/CMakeLists.txt:13 (ADD_EXECUTABLE):
  Cannot generate a safe runtime search path for target TestOpenCLTorchForce
  because files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libOpenCL.so.1] in /actions-runner/_work/openmm-torch/openmm-torch/3/envs/build/lib may be hidden by files in:
      /usr/local/cuda/lib64

  Some of these libraries may not be found correctly.

Looks like it might be an issue with where libOpenCL.so.1 is coming from.

@mikemhenry:

Maybe not, since the output from the GitHub-hosted runner has the same warning:

CMake Warning at platforms/opencl/CMakeLists.txt:59 (ADD_LIBRARY):
  Cannot generate a safe runtime search path for target OpenMMTorchOpenCL
  because files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libOpenCL.so.1] in /usr/share/miniconda3/envs/build/lib may be hidden by files in:
      /usr/local/cuda-11.2/lib64

I am guessing it has to do with the OpenCL driver on the Amazon Linux box... I will spin up an instance to try to debug it interactively.

@mikemhenry:

When I spun up an EC2 instance with the same AMI, none of the tests failed, but I did use micromamba to set up the env, so I will switch the self-hosted CI to micromamba and see if that fixes our problems.

@mikemhenry:

Switching to micromamba worked!

push:
  branches:
    - master
    - feat/add_aws_gpu_runner
Collaborator Author:

Suggested change: remove the "- feat/add_aws_gpu_runner" branch from the trigger list before merge.

@mikemhenry:

@peastman, how does this look? I think we just need to decide when it runs: maybe on merge to master, and I will keep the part that lets you run it on-demand?
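For the on-demand part, keeping a workflow_dispatch trigger means the job can also be started manually with the GitHub CLI; a sketch (the workflow file name here is hypothetical):

$ gh workflow run self-hosted-gpu-test.yml --ref master   # hypothetical workflow file name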

@raimis previously requested changes (Apr 26, 2023)
NVCC_VERSION: ${{ env.nvcc-version }}
PYTORCH_VERSION: ${{ env.pytorch-version }}

- uses: mamba-org/provision-with-micromamba@main
Contributor:

Switching to micromamba is not a solution. It just hides some issue with the dependencies or mamba.

Collaborator Author:

Well here is the diff between the envs:

2c2
< _openmp_mutex 4.5 2_gnu conda-forge
---
> _openmp_mutex 4.5 2_kmp_llvm conda-forge
33d32
< intel-openmp 2022.1.0 h9e868ea_3769 
69a69
> llvm-openmp 16.0.2 h4dfa4b3_0 conda-forge
72c72
< mkl 2022.1.0 hc2b9512_224 
---
> mkl 2022.2.1 h84fe81f_16997 conda-forge
92c92
< python 3.10.10 he550d4f_0_cpython conda-forge
---
> python 3.10.0 h543edf9_3_cpython conda-forge
103a104
> sqlite 3.40.0 h4ff8645_1 conda-forge
105a107
> tbb 2021.8.0 hf52228f_0 conda-forge

Where > is the working one. It looks like the sqlite and tbb packages were added, and llvm-openmp was chosen instead of Intel's implementation. I am not sure which pins need to be adjusted. I have no idea why only the OpenCL tests would fail with a different mutex flavor, MKL version, and OpenMP implementation while none of the others do.

Contributor:

My bet is on the ocl-icd package being somehow overridden by the system one.

Collaborator Author:

So now do you want me to add pins to the environment yaml? Previously, you didn't want me to do that.

Contributor:

@mikemhenry, I say if pinning some versions is what it takes, so be it. We can relax the pins later, after merging. I cannot play around with this myself at the moment.

I would start by checking whether installing llvm-openmp is enough. Another option could be to use the OpenCL library that comes with CUDA. Maybe skipping these two flags is enough for CMake to pick up the CUDA ones?

-DOPENCL_INCLUDE_DIR=${CONDA_PREFIX}/include \
-DOPENCL_LIBRARY=${CONDA_PREFIX}/lib/libOpenCL${SHLIB_EXT}

I would really like this merged ASAP so we can also merge #106
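For concreteness, the explicit form of that alternative would be to point the configure step at the OpenCL that ships with the CUDA toolkit instead of the conda one; a rough sketch, assuming the usual /usr/local/cuda layout on this host:

$ cmake .. \
    -DOPENCL_INCLUDE_DIR=/usr/local/cuda/include \
    -DOPENCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so   # paths are assumptions, not verified on this AMI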

@RaulPPelaez mentioned this pull request on Apr 28, 2023
@mikemhenry:

Resolves #93

@mikemhenry mentioned this pull request on May 19, 2023
@mikemhenry:

Going to see if I can figure this out interactively. For reference, this is what mambaforge reports:

(base) [ec2-user@ip-10-0-142-194 ~]$ mamba info

                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (1.4.1) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


     active environment : base
    active env location : /home/ec2-user/mambaforge
            shell level : 1
       user config file : /home/ec2-user/.condarc
 populated config files : /home/ec2-user/mambaforge/.condarc
                          /home/ec2-user/.condarc
          conda version : 23.1.0
    conda-build version : not installed
         python version : 3.10.10.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=11.7=0
                          __glibc=2.26=0
                          __linux=4.14.304=0
                          __unix=0=0
       base environment : /home/ec2-user/mambaforge  (writable)
      conda av data dir : /home/ec2-user/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/pytorch/linux-64
                          https://conda.anaconda.org/pytorch/noarch
          package cache : /home/ec2-user/mambaforge/pkgs
                          /home/ec2-user/.conda/pkgs
       envs directories : /home/ec2-user/mambaforge/envs
                          /home/ec2-user/.conda/envs
               platform : linux-64
             user-agent : conda/23.1.0 requests/2.28.2 CPython/3.10.10 Linux/4.14.304-226.531.amzn2.x86_64 amzn/2 glibc/2.26
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

@mikemhenry:

And now micromamba:

[ec2-user@ip-10-0-142-194 ~]$ micromamba info


            environment : None (not found)
           env location : -
      user config files : /home/ec2-user/.mambarc
 populated config files : /home/ec2-user/.condarc
       libmamba version : 1.4.3
     micromamba version : 1.4.3
           curl version : libcurl/7.88.1 OpenSSL/3.1.0 zlib/1.2.13 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.52.0
     libarchive version : libarchive 3.6.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.2
       virtual packages : __unix=0=0
                          __linux=4.14.304=0
                          __glibc=2.26=0
                          __archspec=1=x86_64
                          __cuda=11.7=0
               channels : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/pytorch/linux-64
                          https://conda.anaconda.org/pytorch/noarch
       base environment : /home/ec2-user/micromamba
               platform : linux-64

@mikemhenry:

Using a lock file (so the envs created with mamba and micromamba are identical) gives me the same result: with mamba, all the tests pass except these

          3 - TestOpenCLTorchForceSingle (SEGFAULT)
          4 - TestOpenCLTorchForceMixed (SEGFAULT)
          5 - TestOpenCLTorchForceDouble (SEGFAULT)

while with micromamba everything works. What is also really confusing is that the Python OpenCL tests pass even with mamba...
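For reference, one way to produce such a lock file and feed it to both front ends so they install identical environments (a sketch using conda-lock, which may not be exactly what was used here; the env file path is taken from the workflow and the env names are placeholders):

$ conda-lock -f devtools/conda-envs/build-ubuntu-22.04.yml -p linux-64 --kind explicit   # writes conda-linux-64.lock
$ mamba create -n torch-mamba --file conda-linux-64.lock
$ micromamba create -n torch-micro -f conda-linux-64.lock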

@mikemhenry:

This is what the segfault looks like when I compile with debug symbols and run it under gdb:

Thread 1 "TestOpenCLTorch" received signal SIGSEGV, Segmentation fault.
0x00007ffff7de2acd in _dl_lookup_symbol_x (undef_name=0x7fffc0946069 "clGetExtensionFunctionAddress", undef_map=0x16, ref=0x7fffffffa568, symbol_scope=0x39e, version=0x0, type_class=0, flags=2, skip_map=0x0)
    at dl-lookup.c:825
825       if (__glibc_unlikely (skip_map != NULL))
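A backtrace like that can be reproduced roughly as follows (a sketch; the Debug build type and the precision argument passed to the test binary are assumptions):

$ cmake .. -DCMAKE_BUILD_TYPE=Debug && make -j   # rebuild with debug symbols
$ gdb --args ./TestOpenCLTorchForce single       # "single" selects the precision (assumed argument)
(gdb) run
(gdb) bt                                         # print the stack once SIGSEGV is hit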

@mikemhenry:

Here is the env file. If you make the env with micromamba the tests will pass; if you make it with mambaforge then these and only these tests fail (the Python OpenCL tests still pass):

3 - TestOpenCLTorchForceSingle (SEGFAULT)
4 - TestOpenCLTorchForceMixed (SEGFAULT)
5 - TestOpenCLTorchForceDouble (SEGFAULT)

openmm-torch-env.txt

At this point, I am inclined to just move forward and use micromamba for the GPU tests on our self-hosted runner. I do not know why I am getting those segfaults.

@mikemhenry:

This is the output of ldd on the two TestOpenCLTorchForce binaries:
mamba_ldd.txt
micromamba_ldd.txt

@peastman (Member):

What version of OpenCL are you compiling against, and which are you running against? clGetExtensionFunctionAddress() has been deprecated since 1.2.

@mikemhenry:

Looks like whatever ships with conda:

-- Found OPENCL: /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so

I do get this warning:

CMake Warning at platforms/opencl/tests/CMakeLists.txt:13 (ADD_EXECUTABLE):
  Cannot generate a safe runtime search path for target TestOpenCLTorchForce
  because files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libOpenCL.so.1] in /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib may be hidden by files in:
      /usr/local/cuda/lib64

  Some of these libraries may not be found correctly.

@mikemhenry:

libOpenCL.so.1 => /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1 (0x00007f10fc427000)

@peastman (Member):

What header files are you compiling against?

@RaulPPelaez (Contributor):

Inspect libOpenCL.so.1 (with the file command) in both cases; maybe in one instance you will see that it is just a symlink to the system one?
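Concretely, something along these lines (a sketch; env paths as in the earlier ldd output):

$ file /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so.1
$ file /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1
$ readlink -f /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so.1   # resolve the full symlink chain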

@mikemhenry:

Looks like they are not pointing to the system libOpenCL:

(base) [ec2-user@ip-10-0-142-194 ~]$ ll -a  /home/ec2-user/micromamba/envs/*/lib/libOpenCL.so.1
lrwxrwxrwx 1 ec2-user ec2-user 18 May 23 19:51 /home/ec2-user/micromamba/envs/mamaba-torch-112/lib/libOpenCL.so.1 -> libOpenCL.so.1.0.0
lrwxrwxrwx 1 ec2-user ec2-user 18 May 23 15:56 /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1 -> libOpenCL.so.1.0.0
(base) [ec2-user@ip-10-0-142-194 ~]$ ll -a  /home/ec2-user/micromamba/envs/*/lib/libOpenCL.so.1.0.0
-rwxrwxr-x 1 ec2-user ec2-user 230240 May 23 19:51 /home/ec2-user/micromamba/envs/mamaba-torch-112/lib/libOpenCL.so.1.0.0
-rwxrwxr-x 1 ec2-user ec2-user 230240 May 23 15:56 /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1.0.0

Interestingly, they are not the same file:

(base) [ec2-user@ip-10-0-142-194 ~]$ md5sum /home/ec2-user/micromamba/envs/*/lib/libOpenCL.so.1.0.0
8bd407a0bfd8a438d8dc6114e513812b  /home/ec2-user/micromamba/envs/mamaba-torch-112/lib/libOpenCL.so.1.0.0
b0e6a48c65b7591b3d22844c3e638a12  /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1.0.0

Will keep investigating

@mikemhenry:

Looks like they downloaded the same package...

(openmm-torch-pytorch-112) [ec2-user@ip-10-0-142-194 ~]$ find /home/ec2-user/*/pkgs/* -not -path "/proc/*" -not -path "/sys/*" -type f -name libOpenCL.so.1.0.0 -exec md5sum {} \; | sort
121636c1b1d396ffe4467c7a3588cd8b  /home/ec2-user/mambaforge/pkgs/ocl-icd-2.3.1-h7f98852_0/lib/libOpenCL.so.1.0.0
121636c1b1d396ffe4467c7a3588cd8b  /home/ec2-user/micromamba/pkgs/ocl-icd-2.3.1-h7f98852_0/lib/libOpenCL.so.1.0.0

@mikemhenry:

And just for completeness, every libOpenCL.so.1.0.0 on the system:

(openmm-torch-pytorch-112) [ec2-user@ip-10-0-142-194 ~]$ sudo find / -not -path "/proc/*" -not -path "/sys/*" -type f -name libOpenCL.so.1.0.0 -exec md5sum {} \; | sort
121636c1b1d396ffe4467c7a3588cd8b  /home/ec2-user/mambaforge/pkgs/ocl-icd-2.3.1-h7f98852_0/lib/libOpenCL.so.1.0.0
121636c1b1d396ffe4467c7a3588cd8b  /home/ec2-user/micromamba/pkgs/ocl-icd-2.3.1-h7f98852_0/lib/libOpenCL.so.1.0.0
68354165be3952cb6bf401df62c0dce3  /usr/lib64/libOpenCL.so.1.0.0
6ecc1dcf9ecc3ea22658d9c2c3f70165  /usr/local/cuda-11.7/targets/x86_64-linux/lib/libOpenCL.so.1.0.0
8645aa66c8a3a074364872e5c62ebffe  /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so.1.0.0
88765fa2da13b998974447e06bc2bd35  /opt/conda/envs/pytorch/lib/libOpenCL.so.1.0.0
88765fa2da13b998974447e06bc2bd35  /opt/conda/pkgs/cuda-cudart-11.7.99-0/lib/libOpenCL.so.1.0.0
b0e6a48c65b7591b3d22844c3e638a12  /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1.0.0
c429eb2328aa74536f104aa2938843b0  /usr/lib/libOpenCL.so.1.0.0

@mikemhenry:

And just to double-check, the two test binaries point to different libOpenCL copies:

(openmm-torch-pytorch-112) [ec2-user@ip-10-0-142-194 ~]$ ldd /home/ec2-user/openmm-torch/*/TestOpenCLTorchForce | grep libOpenCL.so
libOpenCL.so.1 => /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1 (0x00007fb0d9af0000)
libOpenCL.so.1 => /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so.1 (0x00007f0025071000)

@peastman (Member):

What OpenCL headers are you compiling against?

@mikemhenry:

Replacing the libOpenCL.so.1 in the mambaforge install with the one from the micromamba install fixed the segfault, so I need to figure out where that library is coming from in the mambaforge install.

> What OpenCL headers are you compiling against?

Looking at ccmake,

OPENCL_INCLUDE_DIR               /home/ec2-user/mambaforge/envs/mamaba-torch-112/include
OPENCL_LIBRARY                   /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so

So the ones in /home/ec2-user/mambaforge/envs/mamaba-torch-112/include/OpenCL/

(openmm-torch-pytorch-112) [ec2-user@ip-10-0-142-194 build-torch-mamba-112]$ ll /home/ec2-user/mambaforge/envs/mamaba-torch-112/include/OpenCL/
total 328
-rw-rw-r-- 2 ec2-user ec2-user  4374 Aug 18  2021 cl_d3d10.h
-rw-rw-r-- 2 ec2-user ec2-user  4368 Aug 18  2021 cl_d3d11.h
-rw-rw-r-- 2 ec2-user ec2-user  9079 Aug 18  2021 cl_dx9_media_sharing.h
-rw-rw-r-- 2 ec2-user ec2-user   959 Aug 18  2021 cl_dx9_media_sharing_intel.h
-rw-rw-r-- 2 ec2-user ec2-user  4434 Aug 18  2021 cl_egl.h
-rw-rw-r-- 2 ec2-user ec2-user 69009 Aug 18  2021 cl_ext.h
-rw-rw-r-- 2 ec2-user ec2-user   902 Aug 18  2021 cl_ext_intel.h
-rw-rw-r-- 2 ec2-user ec2-user   905 Aug 18  2021 cl_gl_ext.h
-rw-rw-r-- 2 ec2-user ec2-user  6767 Aug 18  2021 cl_gl.h
-rw-rw-r-- 2 ec2-user ec2-user 81345 Aug 18  2021 cl.h
-rw-rw-r-- 2 ec2-user ec2-user 10430 Aug 18  2021 cl_half.h
-rw-rw-r-- 2 ec2-user ec2-user 52277 Aug 18  2021 cl_icd.h
-rw-rw-r-- 2 ec2-user ec2-user 43260 Aug 18  2021 cl_platform.h
-rw-rw-r-- 2 ec2-user ec2-user  5434 Aug 18  2021 cl_va_api_media_sharing_intel.h
-rw-rw-r-- 2 ec2-user ec2-user  3125 Aug 18  2021 cl_version.h
-rw-rw-r-- 2 ec2-user ec2-user   970 Aug 18  2021 opencl.h

@RaulPPelaez (Contributor) commented on May 25, 2023:

See this in your previous answer:

88765fa2da13b998974447e06bc2bd35 /opt/conda/envs/pytorch/lib/libOpenCL.so.1.0.0
88765fa2da13b998974447e06bc2bd35 /opt/conda/pkgs/cuda-cudart-11.7.99-0/lib/libOpenCL.so.1.0.0

Maybe you are linking against this one via torch/cuda.
Then you are also linking explicitly with either this (mamba):

8645aa66c8a3a074364872e5c62ebffe /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so.1.0.0

Or this (micro):

b0e6a48c65b7591b3d22844c3e638a12 /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1.0.0

So you have two different versions of the same library in the binary, right?
Maybe the micro one just happens to be close enough to work in these tests. I believe you should point CMake at the PyTorch one.

Another guess is the ocl-icd package https://github.com/OCL-dev/ocl-icd

This package aims at creating an Open Source alternative to vendor specific
OpenCL ICD loaders.

The main difficulties to create such software is that the order of
function pointers in a structure is not publicy available.
This software maintains a YAML database of all known and guessed
entries.

Perhaps this is working incorrectly, or the installation order matters.

@mikemhenry:

One thing that makes no sense: if I copy the libOpenCL.so from the working micromamba install into the mamba install, the segfault goes away. I am going to nuke everything and try again to make sure I didn't make a mistake, but if that holds up, it is quite strange, since the binary diff of the files was just:

< 00023190: 616d 6261 666f 7267 652f 656e 7673 2f6d  ambaforge/envs/m
< 000231a0: 616d 6162 612d 746f 7263 682d 3131 322f  amaba-torch-112/
< 000231b0: 6574 632f 4f70 656e 434c 2f76 656e 646f  etc/OpenCL/vendo
< 000231c0: 7273 0000 0000 0000 0000 0000 0000 0000  rs..............
---
> 00023190: 6963 726f 6d61 6d62 612f 656e 7673 2f6f  icromamba/envs/o
> 000231a0: 7065 6e6d 6d2d 746f 7263 682d 7079 746f  penmm-torch-pyto
> 000231b0: 7263 682d 3131 322f 6574 632f 4f70 656e  rch-112/etc/Open
> 000231c0: 434c 2f76 656e 646f 7273 0000 0000 0000  CL/vendors......

which is just conda patching its install prefix (the etc/OpenCL/vendors path) into the binary
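A byte-level comparison like that can be produced with something like this (a sketch; paths as in the earlier md5sum output):

$ diff <(xxd /home/ec2-user/mambaforge/envs/mamaba-torch-112/lib/libOpenCL.so.1.0.0) \
       <(xxd /home/ec2-user/micromamba/envs/openmm-torch-pytorch-112/lib/libOpenCL.so.1.0.0)
# the only hunk is the hard-coded <prefix>/etc/OpenCL/vendors string that conda rewrites at install time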

@RaulPPelaez (Contributor):

Maybe this path inside libOpenCL.so makes the binary also load other things from micromamba.
Also, chasing ghosts here, but that is not the only change; copying the file may be changing the timestamp too.

@mikemhenry:

@raimis @RaulPPelaez

Revisiting this: we have been using this PR to test a few other PRs and it has been helpful. I understand that switching to micromamba may be a bit of a hack, but I would rather have something that works than nothing at all.

@RaulPPelaez (Contributor):

I agree, @mikemhenry, we ran out of things to try. We cannot say we understand what is making the OpenCL tests fail, but I think it is safe to say that it is caused by something in the environment.
For some unknown reason micromamba avoids the issue, so let's go with it for now.
I say we merge this and then revisit if and when the need arises.
cc @raimis @peastman

@raimis (Contributor) commented on Sep 4, 2023:

OK! Let's use Micromamba if there is no other option.

@mikemhenry:

Sounds good, sorry I couldn't quite figure it out!

name: Do the job on the runner
needs: start-runner # required to start the main job when the runner is ready
runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
timeout-minutes: 1200 # 20 hrs
Contributor:

Given that it costs money, I would lower this. On my machine the tests take ~1 minute. Maybe 120 minutes would give enough room for compilation, etc.?

Contributor:

Forgot to take into account the Python tests, which do take a while on the CPU. Although 2 hours should be enough, perhaps we can skip the CPU tests on this runner:

$ pytest -v -s -k "not Reference and not CPU" Test*py

Collaborator Author:

Yes, this is a great idea; I didn't think about that. I am trying to think whether there is a case where it is useful to run the CPU tests on this runner... Something else to play with is pytest-xdist, since these boxes are MUCH more powerful than what GHA gives us.

So do you think there is any value in running the CPU tests on the GPU runner?

Contributor:

I mean, on one hand they are already being run on the normal CI; on the other hand, I can see bugs arising in the CPU version only in a GPU env, or the other way around. For example, conda-forge/openmm-torch-feedstock#37, or the problems we could not eventually crack on this very PR.
The safe thing to do is probably just give it 3 hours and run every test.
Another option could be to run the CPU tests in parallel, which pytest allows AFAIK. Something like:

$ pytest -n 4 -v -s -k "Reference or CPU" Test*py &
$ pytest -v -s -k "not Reference and not CPU" Test*py
$ wait

Collaborator Author:

With -n auto the tests take ~11 minutes to run, so I think it is worth running the CPU tests: for troubleshooting, one of the first things I would want to do is check whether the CPU tests work.

timeout-minutes: 1200 # 20 hrs
env:
  HOME: /home/ec2-user
  os: ubuntu-22.04
Contributor:

Above it says ubuntu-latest, but here it is ubuntu-22.04; is this intentional?

Collaborator Author:

That is so that the way the env file is chosen, devtools/conda-envs/build-${{ env.os }}.yml, matches how we do the CI on GHA.

For the start-runner block, we just need a VM to spin up and turn on the GPU runner at AWS, so I chose ubuntu-latest: the version doesn't really matter and I would rather not have to worry about it. It should keep working when 24.04 comes out, for example.

@RaulPPelaez (Contributor) left a review:

Just a couple of minor comments, but LGTM. Thanks @mikemhenry!

@mikemhenry dismissed raimis's stale review on September 6, 2023 at 16:58:

"OK! Let's use Micromamba if there is no other option."

@mikemhenry merged commit a0b51a9 into master on Sep 6, 2023. 4 checks passed.