Issue with/question about dynamic scheduling method #46

Open
ironbars opened this issue Nov 16, 2017 · 4 comments

Comments

@ironbars

Hello,

I'm having a bit of trouble getting dynamic scheduling to work properly on my cluster. Using a single node, it works fine. However, every time I attempt to use more than one node, it appears as if the launcher process hangs on the second node, and the first node (the node that has the task server running on it) completes all of the jobs alone.

I have verified that the correct ports are open between the compute nodes. For instance, I can run nc -l localhost 9471 on node1, connect to that process using nc -4 node1 9471 on node2, and successfully pass arbitrary text back and forth. However, when I run tskserver manually on node1 (i.e. ./tskserver 5 localhost 9471), the same nc command on node2 fails with "connection refused".
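
The nc check above can also be scripted. The following is a minimal sketch, not part of Launcher: the helper name port_open is hypothetical, and node1 and 9471 are simply the host and port from the test described above. It only reports whether a TCP connection can be opened, which is the same thing the nc client tests.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; it is not part of Launcher.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Covers "connection refused", timeouts and name-resolution failures.
        return False
    conn.close()
    return True


# Example call from node2, using the host and port from the nc test:
#     port_open("node1", 9471)
```

A False result does not say why the connection failed. A closed port, a firewall rule, and a server listening on a different interface all look the same from the client side.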

I have also verified that the launcher script is actually getting started on node2 (via top), but it doesn't appear to be doing any of the work. When I look at the job output, I just see tasks being executed on node1. When they're done, the output is just a bunch of "connection refused" messages from netcat.

I'm on CentOS 7.3, if that has any bearing. Please let me know if you need any additional information from me. Any help would be appreciated.

Thank you,
Marc

@ironbars
Author

ironbars commented Nov 16, 2017

I forgot to add that the "block" and "interleaved" scheduling methods work just fine with multi-node jobs. Are there any advantages to using any one of the three over the other two?

@lwilson
Contributor

lwilson commented Nov 16, 2017

Hi Marc,

It sounds like tskserver is not binding to the port on node1, which is why the connection refused error is occurring. What version of Python are you running?

The other two scheduling methods are static, so they do not have to communicate with tskserver (which doesn't even run). These two methods work really well if you know the runtimes of all jobs are approximately the same. They are faster than dynamic and more scalable, but if you have variability in runtimes for your individual jobs, these two methods can leave cores idle.
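
To illustrate the trade-off, here is a toy simulation. It is a sketch under stated assumptions, not Launcher's implementation: it assumes "block" means contiguous chunks of the task list and "interleaved" means round-robin assignment (inferred from the names only), it models "dynamic" as each worker taking the next task when it becomes free, and it ignores the cost of talking to the task server. The runtimes are made up.

```python
import heapq


def block(runtimes, workers):
    """Static: worker w gets one contiguous chunk of the task list (assumed)."""
    chunk = -(-len(runtimes) // workers)  # ceiling division
    return [sum(runtimes[w * chunk:(w + 1) * chunk]) for w in range(workers)]


def interleaved(runtimes, workers):
    """Static: worker w gets tasks w, w + workers, w + 2 * workers, ... (assumed)."""
    return [sum(runtimes[w::workers]) for w in range(workers)]


def dynamic(runtimes, workers):
    """Dynamic: whichever worker is free first takes the next task in the list."""
    busy_until = [0] * workers  # min-heap of the times at which workers free up
    for runtime in runtimes:
        free_at = heapq.heappop(busy_until)
        heapq.heappush(busy_until, free_at + runtime)
    return sorted(busy_until)


# Made-up runtimes with two long tasks among many short ones.
runtimes = [10, 1, 10, 1, 1, 1, 1, 1]

for name, schedule in [("block", block), ("interleaved", interleaved), ("dynamic", dynamic)]:
    loads = schedule(runtimes, 2)
    print("%-12s per-worker time: %s  finish time: %d" % (name, loads, max(loads)))
```

With these runtimes and two workers, both static splits put the two long tasks on the same worker and finish at 22 time units while the other worker sits idle after 4. The dynamic model finishes at 13. When all runtimes are equal, the three give the same finish time and the static methods avoid the server round trips.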

@ironbars
Author

Hi Lucas,

Thank you for the prompt response!

I'm running Python 2.7.5. I think that tskserver is binding to the port; if I run it, I can see it show up in the output of netstat -nlp. The state is LISTEN and it is tied to the Python process that tskserver starts. Also, if I run nc localhost 9471 on node1, it responds with a number as expected (which would be a Launcher job ID, if I understand the code correctly). This would indicate that tskserver is binding to the port correctly, right? It's perfectly possible that I'm misunderstanding how such things work.

Thank you!

@ironbars
Author

ironbars commented Nov 17, 2017

Hi Lucas,

I've made a terrible error. When I run ./tskserver 5 localhost 9471, it is binding and listening on the loopback interface (so only connections coming from 127.0.0.0/8 will be accepted!). When I run it on the actual network interface (using ./tskserver 5 $HOSTNAME 9471) it will serve the integers as normal, even to a remote host.
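
The effect of the bind address can be reproduced with a plain Python socket. This is a minimal sketch and not tskserver's actual code: the helper name listen_address is hypothetical, and it binds to an OS-chosen free port rather than 9471 so it cannot collide with a running server.

```python
import socket


def listen_address(host):
    """Bind a TCP socket to `host` on an OS-chosen free port and return
    the IPv4 address the socket ends up bound to.

    Hypothetical helper for illustration; it is not taken from tskserver.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind((host, 0))  # port 0 asks the OS for any free port
        srv.listen(1)
        return srv.getsockname()[0]
    finally:
        srv.close()


# Loopback only: reachable from the same machine, refused from other nodes.
print(listen_address("127.0.0.1"))

# An empty host means the wildcard address, i.e. every IPv4 interface.
print(listen_address(""))
```

Passing a hostname instead binds to whatever address that name resolves to on the node, so the result depends on the local resolver configuration (for example /etc/hosts).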

However, it still doesn't appear to work within a job, and now I'm really at a loss to figure out why that is.
