Commit

fix: resolve fork failures during training runs
torchrun jobs spawn multiple child processes per GPU, which together
can often exceed the container's 2k PID limit.

Signed-off-by: Jason T. Greene <jason.greene@redhat.com>
n1hility committed Aug 12, 2024
1 parent 2249889 commit ac7a55d
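
To check whether a job is close to the PID limit the commit message describes, one can compare the cgroup v2 files `pids.current` and `pids.max`. The following is a minimal sketch that assumes cgroup v2. The helper name `pids_headroom` and its directory argument are hypothetical and not part of the ilab wrapper. Inside a container the directory would typically be `/sys/fs/cgroup`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ilab wrapper): report how many more
# PIDs a cgroup v2 group may create before fork() starts failing.
pids_headroom() {
    local cgroup_dir="${1:-/sys/fs/cgroup}"
    local max current
    max="$(cat "${cgroup_dir}/pids.max")" || return 1
    current="$(cat "${cgroup_dir}/pids.current")" || return 1
    if [ "${max}" = "max" ]; then
        # "max" means no limit, the expected state under "--pids-limit -1".
        echo "unlimited"
    else
        echo $(( max - current ))
    fi
}
```

Running `pids_headroom` inside the training container before and after this change should show a finite number turning into `unlimited`, assuming the container has its own cgroup namespace.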
Showing 2 changed files with 2 additions and 0 deletions.
1 change: 1 addition & 0 deletions training/ilab-wrapper/ilab
@@ -128,6 +128,7 @@ PODMAN_COMMAND=("sudo" "--preserve-env=$PRESERVE_ENV" "podman" "run" "--rm" "-it
 "--device" "${CONTAINER_DEVICE}"
 "--security-opt" "label=disable" "--net" "host"
 "--shm-size" "10G"
+"--pids-limit" "-1"
 "-v" "$HOME:$HOME"
 "${ADDITIONAL_MOUNT_OPTIONS[@]}"
 # This is intentionally NOT using "--env" "HOME" because we want the HOME
1 change: 1 addition & 0 deletions training/nvidia-bootc/duplicated/ilab-wrapper/ilab
@@ -128,6 +128,7 @@ PODMAN_COMMAND=("sudo" "--preserve-env=$PRESERVE_ENV" "podman" "run" "--rm" "-it
 "--device" "${CONTAINER_DEVICE}"
 "--security-opt" "label=disable" "--net" "host"
 "--shm-size" "10G"
+"--pids-limit" "-1"
 "-v" "$HOME:$HOME"
 "${ADDITIONAL_MOUNT_OPTIONS[@]}"
 # This is intentionally NOT using "--env" "HOME" because we want the HOME
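
Both hunks add one flag/value pair to a bash array that is later expanded into the `podman run` invocation. The sketch below illustrates that pattern in isolation. The function `build_podman_args` and the `PIDS_LIMIT` override variable are hypothetical and do not appear in the wrapper. The script only builds and prints the argument list and does not run podman.

```shell
#!/usr/bin/env bash
# Sketch only: build_podman_args and PIDS_LIMIT are hypothetical names,
# not part of the ilab wrapper.
build_podman_args() {
    # Default to -1, which asks podman for an unlimited number of PIDs.
    local pids_limit="${PIDS_LIMIT:--1}"
    PODMAN_ARGS=("run" "--rm"
        "--shm-size" "10G"
        "--pids-limit" "${pids_limit}")
}

build_podman_args
# Print one argument per line; quoting the expansion keeps each element intact.
printf '%s\n' "${PODMAN_ARGS[@]}"
```

Keeping the flag and its value as separate array elements, as the wrapper does, means no extra quoting is needed when the array is expanded with `"${PODMAN_ARGS[@]}"`.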
