Description:
When running a 'SlurmPipelineExecutor' pipeline on my HPC cluster, I encounter dependency issues that result in a failed execution. The problem arises during the stats collection step after a stage has completed (observed during initial testing).
The issue originates from the 'launch_slurm_job' function in 'src/datatrove/executor/slurm.py':
import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args):
    """
    Small helper function to save an sbatch script and call it.
    Args:
        launch_file_contents: Contents of the sbatch script
        *args: any other arguments to pass to the sbatch command
    Returns: the id of the launched slurm job
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        # Assumes the job ID is the last whitespace-separated token of
        # sbatch's output ("Submitted batch job <job_id>"), which does not
        # hold on clusters that append extra text after the job ID.
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
The issue arises because my cluster's job submission output takes the following form:
Submitted batch job <job_id> on cluster <cluster_name>
With this format, 'launch_slurm_job' incorrectly returns the cluster name as the job ID, leading to errors in subsequent stages that depend on the correct job ID.
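For illustration, a minimal reproduction of the parsing behavior (the job ID and cluster name below are made up):

output = "Submitted batch job 123456 on cluster mycluster"

print(output.split()[-1])  # "mycluster" -- the cluster name, not the job ID
print(output.split()[3])   # "123456"    -- the actual job ID (4th token)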
Reproducibility:
1. Set up a pipeline using 'SlurmPipelineExecutor'.
2. Run the pipeline on a cluster whose job submission output matches the format described above.
3. Observe the dependency error and the failed execution.
Proposed Solution:
To handle clusters with non-standard job submission output formats, I propose adding an argument or flag to the 'SlurmPipelineExecutor' class. This would allow users to specify the correct position of the job ID in the submission output.
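A minimal sketch of what this could look like, assuming a hypothetical 'job_id_position' keyword argument (the name and default are my assumptions, not part of the current API):

import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args, job_id_position=-1):
    """
    Save an sbatch script and submit it, returning the job ID.
    Args:
        launch_file_contents: Contents of the sbatch script
        *args: any other arguments to pass to the sbatch command
        job_id_position: index of the job ID among the whitespace-separated
            tokens of sbatch's output (hypothetical parameter); -1 keeps the
            current behavior of taking the last token
    Returns: the id of the launched slurm job
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        output = subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8")
        # On clusters printing "Submitted batch job <job_id> on cluster <name>",
        # callers would pass job_id_position=3 to select the 4th token.
        return output.split()[job_id_position]

'SlurmPipelineExecutor' could then expose this as a constructor argument and forward it to the helper.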
In my case, I temporarily worked around the issue by hardcoding the function to take the 4th whitespace-separated token of the sbatch output as the job ID.
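Concretely, the temporary workaround amounts to changing the return line to select the 4th token (index 3):

return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[3]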
Additional Information:
Environment: HPC cluster with custom Slurm job submission output
Version: Most recent version of datatrove installed from source.
Impact: This issue will affect users on systems with similar output formats.
Please let me know if further details or a pull request would be helpful.