Issue with/question about dynamic scheduling method #46
Comments
I forgot to add that the "block" and "interleaved" scheduling methods work just fine with multi-node jobs. Are there any advantages to using any one of the three over the other two?
Hi Marc, It sounds like The other two scheduling methods are static methods, so they do not have to communicate with
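To illustrate what "static" means here, the following is a rough Python sketch of how a fixed assignment can be computed up front, with no server involved. The function names and the exact chunking rule are assumptions made for illustration; this is not Launcher's actual code, and its real block sizes may be computed differently.

```python
def block_schedule(num_jobs, num_workers):
    """Static 'block' assignment: each worker gets one contiguous chunk."""
    base, extra = divmod(num_jobs, num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional job each.
        size = base + (1 if worker < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def interleaved_schedule(num_jobs, num_workers):
    """Static 'interleaved' assignment: jobs are dealt out round-robin."""
    return [list(range(worker, num_jobs, num_workers))
            for worker in range(num_workers)]
```

In both cases every worker can derive its own job list from its index alone, which is why no network communication is needed. A dynamic scheme instead has each worker ask a central process for the next unclaimed job, which balances uneven job lengths better but depends on that connection working.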
Hi Lucas, Thank you for the prompt response! I'm running Python 2.7.5. I think that Thank you!
Hi Lucas, I've made a terrible error. When I run However, it still doesn't appear to work within a job, and now I'm really at a loss to figure out why that is.
Hello,
I'm having a bit of trouble getting dynamic scheduling to work properly on my cluster. Using a single node, it works fine. However, every time I attempt to use more than one node, it appears as if the launcher process hangs on the second node, and the first node (the node that has the task server running on it) completes all of the jobs alone.
I have verified that the correct ports are open between the compute nodes. For instance, I can run `nc -l localhost 9471` on node1, connect to that process using `nc -4 node1 9471` on node2, and successfully pass arbitrary text back and forth. When I try to run the `tskserver` manually, however (i.e. `./tskserver 5 localhost 9471`) on node1, the above `nc` command on node2 fails with connection refused.
I have also verified that the `launcher` script is actually getting started on node2 (via `top`), but it doesn't appear to be doing any of the work. When I look at the job output, I just see tasks being executed on node1. When they're done, the output is just a bunch of "connection refused" messages from netcat.
I'm on CentOS 7.3, if that has any bearing. Please let me know if you need any additional information from me. Any help would be appreciated.
Thank you,
Marc