Check process health #7

Closed
jacobtomlinson opened this issue Nov 7, 2023 · 1 comment
Labels: enhancement (New feature or request)

Comments

@jacobtomlinson (Collaborator) commented Nov 7, 2023

When starting the scheduler and workers we just call them and move on.

https://github.com/jacobtomlinson/dask-databricks/blob/a6ff0d4ff5ca91c29b174386474a69b9454c48d4/dask_databricks/cli.py#L38

https://github.com/jacobtomlinson/dask-databricks/blob/a6ff0d4ff5ca91c29b174386474a69b9454c48d4/dask_databricks/cli.py#L51

It would be a better user experience if we watched their health for a little while. Eventually we need to exit the init script and leave the Dask processes running, and we don't want to watch them forever, otherwise the init script would never exit. But in the overall timeline of a Databricks cluster starting we can afford to spend a few seconds watching the Dask components to make sure they are healthy and don't exit prematurely.

Here are a few ideas for health checks we could implement:

  • Scheduler startup
    • For a few seconds after starting the scheduler check the process hasn't exited
    • Poll the network socket like the workers do and wait for the scheduler to start (see the sketch after this list)
    • Watch the stdout for the Scheduler at: ... log line
  • Worker startup
    • Poll the worker socket (requires starting all workers on the same port)
    • Watch the stdout for Start worker at: log line
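
As a rough illustration of the socket-polling idea above, something like the following could run right after launching the scheduler. This is a sketch only; the localhost address, the default scheduler port 8786, and the retry cadence are assumptions, not what cli.py does today.

import socket
import sys
import time

def wait_for_port(host="127.0.0.1", port=8786, timeout=30.0):
    """Poll a TCP port until something is listening or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # If the scheduler (or worker) is up, this connect succeeds.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# Hypothetical usage in the init flow: fail fast if the scheduler never binds.
if not wait_for_port(port=8786, timeout=30.0):
    print("Dask scheduler did not start listening within 30s", file=sys.stderr)
    sys.exit(1)

The same helper would also cover the worker bullet above, once all workers are started on a known port.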

We should add at least one check for the scheduler and one for the workers.

We could add a configurable timeout, and if the components don't start up in that time we would exit with a non-zero return code and some useful logs about what went wrong.

#!/bin/bash

# Install Dask Databricks
/databricks/python/bin/pip install dask-databricks

# Start Dask cluster
dask databricks run --timeout 30s

When an init script exits with a non-zero code the whole cluster provisioning fails, so we want to be cautious about when we do this; but if the scheduler or worker processes fail to start up cleanly, failing the provisioning seems like the right call.
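
On the CLI side, here is a minimal sketch of how a configurable --timeout could drive that behaviour. The click option, the hard-coded port 8786, and the dask scheduler command line are assumptions for illustration, not the current dask-databricks implementation; parsing a unit suffix like 30s is also omitted.

import socket
import subprocess
import sys
import time

import click

def port_open(host, port):
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False

@click.command()
@click.option("--timeout", default=30.0, type=float,
              help="Seconds to wait for the Dask scheduler to become healthy.")
def run(timeout):
    # Launch the scheduler as a child process (hypothetical command line).
    scheduler = subprocess.Popen(["dask", "scheduler"])

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if scheduler.poll() is not None:
            # Exited prematurely: fail the init script with a useful message.
            click.echo("Dask scheduler exited during startup", err=True)
            sys.exit(scheduler.returncode or 1)
        if port_open("127.0.0.1", 8786):
            # Healthy: stop watching, leave the scheduler running, and let
            # the init script carry on.
            return
        time.sleep(1)

    click.echo(f"Dask scheduler not reachable after {timeout}s", err=True)
    sys.exit(1)

if __name__ == "__main__":
    run()

The same watch loop could be repeated (or run concurrently) for the workers before the command returns.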

jacobtomlinson added the enhancement label on Nov 7, 2023
@jacobtomlinson (Collaborator, Author) commented
Closed by #20
