Wrapper tool to simplify dispatching jobs via HTCondor

hutchresearch/fly

Overview:

fly.py is a command-line tool that simplifies dispatching condor jobs to the WWU CSCI department cluster.

Usage Examples

fly.py --command "/bin/echo test"
fly.py --command "/path/to/train_cool_model.py train.npy dev.npy" --gpus 1 --gpu_mem 11 --cores 2
fly.py --commands_fn commands.txt --venv /cluster/home/$(whoami)/venv --J 10
fly.py --commands_fn commands.txt --conda /cluster/home/$(whoami)/anaconda3 --conda_name CondaEnvName

Call fly.py -h to see all of the options available.
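
A commands file for --commands_fn can be generated rather than written by hand. The sketch below assumes the file holds one complete command per line, which this README does not state, so check fly.py -h before relying on it. The script path and the --lr flag are hypothetical placeholders.

```python
import shlex


def build_commands(script, learning_rates):
    """Build one shell-quoted command string per learning rate.

    `script` should be an absolute path; `--lr` is a hypothetical flag
    of a hypothetical training script.
    """
    return [shlex.join([script, "--lr", str(lr)]) for lr in learning_rates]


def write_commands(path, commands):
    """Write the commands to `path`, one per line (assumed file format)."""
    with open(path, "w") as handle:
        for command in commands:
            handle.write(command + "\n")


if __name__ == "__main__":
    write_commands("commands.txt", build_commands("/path/to/train.py", [0.1, 0.01]))
```

The resulting commands.txt could then be passed as fly.py --commands_fn commands.txt.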

Tips:

  • Condor must be able to run the commands on the remote machine, so please:
    • Make sure you have a proper #! line at the top of any script you wish to run via condor.
      • E.g. #! /usr/bin/env python3
    • Make sure your user has execute permission on any binary or script you wish to run via condor.
    • Provide the absolute path of the command you wish to run.
      • You can use which <command> to find the absolute path of a command.
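
The tips above can be checked before submitting. This is a minimal sketch, not part of fly.py: the preflight function name is made up for illustration, and it only inspects the owner execute bit, so it assumes you own the file.

```python
import os
import stat


def preflight(command_path):
    """Return a list of problems likely to stop condor from running the file."""
    problems = []
    if not os.path.isabs(command_path):
        problems.append("path is not absolute")
    if not os.path.isfile(command_path):
        problems.append("file does not exist")
        return problems
    # Only the owner execute bit is checked (assumes you own the file).
    mode = os.stat(command_path).st_mode
    if not mode & stat.S_IXUSR:
        problems.append("owner execute bit is not set")
    # Compiled ELF binaries need no #! line; anything else is treated as a script.
    with open(command_path, "rb") as handle:
        head = handle.read(4)
    is_elf_binary = head == b"\x7fELF"
    if not is_elf_binary and not head.startswith(b"#!"):
        problems.append("script has no #! line")
    return problems
```

An empty list means none of the three tips is violated. It does not guarantee that the job will run on the remote machine.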

Helpful Condor Commands

  • condor_q -> check current queue
    • You can watch your status with watch -n 5 condor_q (but do not leave this running for too long; it bogs down the job scheduler)
  • condor_q -hold -> check what error caused your job to be placed into the holding queue
  • condor_q -better-analyze -> see how many machines can run the job you submitted and why
  • condor_status -> check which computers are being used
  • condor_ssh_to_job (job_id_number) -> ssh to the machine a given job is on (e.g. to check top or nvidia-smi)
  • condor_rm (job_id_number) -> cancel a running job

fly Output Files

In the condor directory (defaults to .condor_jobs), fly will create an output directory named USER_YYYYMMDD_HHmmss_ff, where USER is your username and ff is the microseconds component of the timestamp. Within that directory, it will produce a set of files named NUM.EXT.

  • NUM is the job number. If you only submit one job, it will be 0. If you submit N jobs, you will have sets of files for NUM=0,...,N-1.
  • EXT is one of the following
    • err - the contents of your command's standard error and any condor error output
    • job - the job file automatically created for you
    • log - condor's logging
    • out - the contents of your command's standard out
    • sh - the wrapper script that actually calls your command(s)
  • The *.out files are buffered and may only be written once the job has completed.
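
To inspect a run, the NUM.EXT files can be grouped by job number. The helper below is an illustrative sketch, not something fly provides. It assumes file names match the NUM.EXT pattern described above exactly and ignores everything else in the directory.

```python
import os
import re

# Extensions listed in this README for fly's output files.
KNOWN_EXTENSIONS = {"err", "job", "log", "out", "sh"}
FILE_PATTERN = re.compile(r"^(\d+)\.([a-z]+)$")


def group_outputs(run_dir):
    """Map each job number to a dict of {extension: file path}."""
    jobs = {}
    for name in sorted(os.listdir(run_dir)):
        match = FILE_PATTERN.match(name)
        if match and match.group(2) in KNOWN_EXTENSIONS:
            number, extension = int(match.group(1)), match.group(2)
            jobs.setdefault(number, {})[extension] = os.path.join(run_dir, name)
    return jobs
```

For example, group_outputs(run_dir)[0]["err"] would give the path of the first job's err file, if that file exists.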

I recommend you design your scripts to write any output logging directly to a file that you specify, instead of relying on standard out or standard error. If you do so, you may not ever need to inspect any of the files produced for you in the output directory.
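
One way to follow this recommendation in a Python script is the standard logging module with a file handler. This is a minimal sketch: the function name, logger name, and log path are placeholders you would choose yourself.

```python
import logging


def make_file_logger(log_path, name="job"):
    """Return a logger that writes to log_path instead of standard out."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Keep messages out of the root logger so nothing lands on standard error.
    logger.propagate = False
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_file_logger("train.log")  # placeholder path
    log.info("training started")
```

The file handler flushes after every message, so the log can be read while the job is still running, unlike the buffered *.out files.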

Condor Documentation:
