Replies: 1 comment
-
Been experiencing the same issue.
-
Hi, I have some experience with using Slurm and Optuna + joblib launchers for HP sweeps. However, in my current setup I would like to start the main sweeping process on the login machine and have it submit jobs to the cluster in batches.
I have a training script that works fine if I just `sbatch` it to the cluster, but when I attempt to do something like this:
(I pruned a bunch of parameters)
and then:
`bash hp_sweep.sh`
the script fails after a few seconds, and there is no way to even access the stack trace, because `HYDRA_FULL_ERROR` doesn't get passed (as you can see, I tried every possible way I know). I was installing the submitit launcher with
`python -m pip install 'git+https://github.com/facebookresearch/hydra.git#egg=hydra-submitit-launcher&subdirectory=plugins/hydra_submitit_launcher'`
since the latest version is not released on PyPI. Is there any way to fix the stack trace issue?
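For context, the launch looks roughly like this (a simplified sketch, not my exact command; `train.py` and the sweep values are illustrative). Setting the variable on the login machine only affects the local process, so I also tried re-exporting it inside the generated sbatch script via the launcher's `setup` list:

```shell
#!/bin/sh
# hp_sweep.sh -- simplified sketch of the sweep launch.
# HYDRA_FULL_ERROR=1 here only reaches the login-node process;
# hydra.launcher.setup prepends commands to the generated sbatch script,
# which is where I tried to re-export it for the remote jobs.
HYDRA_FULL_ERROR=1 python train.py --multirun \
    hydra/launcher=submitit_slurm \
    'hydra.launcher.setup=["export HYDRA_FULL_ERROR=1"]' \
    model.lr=1e-4,1e-3
```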
EDIT: I got to the core of the problem. It wasn't actually an issue in my code; `submitit` for some reason can't decide which environment to use:

`RuntimeError: Could not figure out which environment the job is runnning in. Known environments: slurm, local, debug.`

I tried to hack around this by specifying `_TEST_CLUSTER_` in both `setup` and `srun_args`, but this boils down to the same problem as with passing `HYDRA_FULL_ERROR`: those values get ignored. There are multiple issues regarding this specific problem, but I don't see any clear resolution. @omry
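As far as I can tell, the check behind that error boils down to looking for scheduler-specific environment variables on the node the job runs on. A rough sketch of that logic (not submitit's actual code; `SUBMITIT_LOCAL_JOB_ID` is my guess at the local-executor marker):

```shell
#!/bin/sh
# Sketch of the kind of environment detection behind the RuntimeError
# (illustrative only, not submitit's real implementation).
detect_cluster_env() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Slurm sets SLURM_JOB_ID inside sbatch/srun jobs
        echo "slurm"
    elif [ -n "${SUBMITIT_LOCAL_JOB_ID:-}" ]; then
        # assumed marker for submitit's local executor
        echo "local"
    else
        # on a login node neither is set -- this is where submitit
        # raises "Could not figure out which environment ..."
        echo "unknown"
    fi
}
```

Which would explain why the failure happens immediately when the sweep process itself runs on the login machine rather than inside a Slurm allocation.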