Check process health #7

Closed
jacobtomlinson opened this issue Nov 7, 2023 · 1 comment
Labels: enhancement (New feature or request)

Comments

@jacobtomlinson (Collaborator) commented Nov 7, 2023

When starting the scheduler and workers we just call them and move on.

https://github.com/jacobtomlinson/dask-databricks/blob/a6ff0d4ff5ca91c29b174386474a69b9454c48d4/dask_databricks/cli.py#L38

https://github.com/jacobtomlinson/dask-databricks/blob/a6ff0d4ff5ca91c29b174386474a69b9454c48d4/dask_databricks/cli.py#L51

It would be a better user experience if we watched their health for a little while. Eventually we need to exit the init script and leave the Dask processes running, and we don't want to watch them forever, otherwise the init script would never exit. But in the overall timeline of a Databricks cluster starting we can afford to spend a few seconds watching the Dask components to make sure they are healthy and don't exit prematurely.

Here are a few ideas for health checks we could implement:

  • Scheduler startup
    • For a few seconds after starting the scheduler check the process hasn't exited
    • Poll the network socket like the workers do and wait for the scheduler to start (see the sketch after this list)
    • Watch the stdout for the Scheduler at: ... log line
  • Worker startup
    • Poll the worker socket (requires starting all workers on the same port)
    • Watch the stdout for Start worker at: log line
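
As a rough illustration of the socket-polling idea above, something like the following could run right after launching the scheduler. This is a sketch only; the localhost address, the default scheduler port 8786, and the retry cadence are assumptions, not what cli.py does today.

import socket
import sys
import time

def wait_for_port(host="127.0.0.1", port=8786, timeout=30.0):
    """Poll a TCP port until something is listening or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # If the scheduler (or worker) is up, this connect succeeds.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# Hypothetical usage in the init flow: fail fast if the scheduler never binds.
if not wait_for_port(port=8786, timeout=30.0):
    print("Dask scheduler did not start listening within 30s", file=sys.stderr)
    sys.exit(1)

The same helper would also cover the worker bullet above, once all workers are started on a known port.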

We should add at least one check for the scheduler and one for the workers.

We could add a configurable timeout, and if the components don't start up in that time we would exit with a non-zero return code and some useful logs about what went wrong.

#!/bin/bash

# Install Dask Databricks
/databricks/python/bin/pip install dask-databricks

# Start Dask cluster
dask databricks run --timeout 30s

When an init script exits with a non-zero code the whole cluster provisioning fails, so we want to be cautious about when we do this; but if the scheduler or worker processes fail to start up cleanly, failing the provisioning seems like the right call.
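
On the CLI side, here is a minimal sketch of how a configurable --timeout could drive that behaviour. The click option, the hard-coded port 8786, and the dask scheduler command line are assumptions for illustration, not the current dask-databricks implementation; parsing a unit suffix like 30s is also omitted.

import socket
import subprocess
import sys
import time

import click

def port_open(host, port):
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False

@click.command()
@click.option("--timeout", default=30.0, type=float,
              help="Seconds to wait for the Dask scheduler to become healthy.")
def run(timeout):
    # Launch the scheduler as a child process (hypothetical command line).
    scheduler = subprocess.Popen(["dask", "scheduler"])

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if scheduler.poll() is not None:
            # Exited prematurely: fail the init script with a useful message.
            click.echo("Dask scheduler exited during startup", err=True)
            sys.exit(scheduler.returncode or 1)
        if port_open("127.0.0.1", 8786):
            # Healthy: stop watching, leave the scheduler running, and let
            # the init script carry on.
            return
        time.sleep(1)

    click.echo(f"Dask scheduler not reachable after {timeout}s", err=True)
    sys.exit(1)

if __name__ == "__main__":
    run()

The same watch loop could be repeated (or run concurrently) for the workers before the command returns.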

jacobtomlinson added the enhancement label on Nov 7, 2023
@jacobtomlinson (Collaborator, Author) commented
Closed by #20
