
Add alternative worker commands, config options #20

Merged · 12 commits · Nov 9, 2023

Conversation

@skirui-source (Collaborator) commented Nov 8, 2023

Closes #10 (Add support for alternative worker commands and config options)

@jacobtomlinson (Collaborator) left a comment

Thanks for this, excited to give it a spin.

One quick thought: what happens if the options the user wants to provide contain spaces?

$ dask databricks run --worker-args "--foo 'bar baz'"

The above example wouldn't split up cleanly. I wonder if we also want to add optional JSON support, so that before calling worker_args.split() we first try json.loads(worker_args).

That way a user could specify a JSON list of arguments if they want to be explicit.

$ dask databricks run --worker-args '["--foo", "bar baz"]'
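
A minimal sketch of that fallback logic (parse_worker_args is a hypothetical helper illustrating the suggestion, not the PR's actual implementation):

import json

def parse_worker_args(worker_args: str) -> list[str]:
    """Accept either a JSON list or a plain space-separated string."""
    try:
        # Try JSON first so users can be explicit about quoting
        args = json.loads(worker_args)
        if isinstance(args, list):
            return [str(a) for a in args]
    except json.JSONDecodeError:
        pass
    # Fall back to naive whitespace splitting
    return worker_args.split()

print(parse_worker_args('["--foo", "bar baz"]'))  # ['--foo', 'bar baz']
print(parse_worker_args("--nthreads 2"))          # ['--nthreads', '2']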

@skirui-source (Collaborator, Author)

Tested my changes with the following script:

#!/bin/bash
set -e

pip install --upgrade pip dask[complete] git+https://github.com/skirui-source/dask-databricks.git@main dask-cuda==23.10.0 bokeh==3.2.2
pip install pyspark==3.5.0 numpy==1.23.5 scikit-learn==0.22.1
dask databricks run --worker-command "dask cuda worker" --worker-args "--nthreads 2"

Seeing this error on the Scheduler/Driver node:

Running command git clone --filter=blob:none --quiet https://github.com/skirui-source/dask-databricks.git /tmp/pip-req-build-pixm6hlg
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.1 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 1.26.1 which is incompatible.
scipy 1.9.1 requires numpy<1.25.0,>=1.18.5, but you have numpy 1.26.1 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imbalanced-learn 0.10.1 requires scikit-learn>=1.0.2, but you have scikit-learn 0.22.1 which is incompatible.
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
2023-11-09 07:32:18,796 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-09 07:32:19,132 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-11-09 07:32:19,168 - distributed.scheduler - INFO - State start
2023-11-09 07:32:19,174 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-09 07:32:19,174 - distributed.scheduler - INFO -   Scheduler at:  tcp://10.59.230.165:8786
2023-11-09 07:32:19,175 - distributed.scheduler - INFO -   dashboard at:  http://10.59.230.165:8787/status
2023-11-09 07:32:19,175 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2023-11-09 07:32:19,875 - distributed.comm.tcp - INFO - Connection from tcp://10.59.241.62:56834 closed before handshake completed
2023-11-09 07:32:19,877 - distributed.comm.tcp - INFO - Connection from tcp://10.59.249.7:34864 closed before handshake completed
2023-11-09 07:34:32,415 - distributed.scheduler - INFO - Receive client connection: Client-6c879bb7-7ed2-11ee-8e0b-00163e5e434a
2023-11-09 07:34:32,417 - distributed.core - INFO - Starting established connection to tcp://10.59.230.165:56762

@skirui-source (Collaborator, Author)

And this error from the Dask worker:

Running command git clone --filter=blob:none --quiet https://github.com/skirui-source/dask-databricks.git /tmp/pip-req-build-vvt7ougt
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.1 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 1.26.1 which is incompatible.
scipy 1.9.1 requires numpy<1.25.0,>=1.18.5, but you have numpy 1.26.1 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imbalanced-learn 0.10.1 requires scikit-learn>=1.0.2, but you have scikit-learn 0.22.1 which is incompatible.
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
Usage: dask [OPTIONS] COMMAND [ARGS]...
Try 'dask -h' for help.

Error: No such command 'cuda'.

@jacobtomlinson (Collaborator) commented Nov 9, 2023

I think the core of the problem is in this line:

/databricks/python/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.

I reproduced the error in a Databricks notebook to get the full traceback:

import importlib_metadata
[ep] = [ep for ep in importlib_metadata.entry_points(group="dask_cli") if ep.name == "cuda"]
ep.load()
AttributeError: 'function' object has no attribute 'command'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File <command-1479567383531443>, line 3
      1 import importlib_metadata
      2 [ep] = [ep for ep in importlib_metadata.entry_points(group="dask_cli") if ep.name == "cuda"]
----> 3 ep.load()

File /databricks/python/lib/python3.10/site-packages/importlib_metadata/__init__.py:209, in EntryPoint.load(self)
    204 """Load the entry point from its definition. If only a module
    205 is indicated by the value, return that module. Otherwise,
    206 return the named object.
    207 """
    208 match = self.pattern.match(self.value)
--> 209 module = import_module(match.group('module'))
    210 attrs = filter(None, (match.group('attr') or '').split('.'))
    211 return functools.reduce(getattr, attrs, module)

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File /databricks/python/lib/python3.10/site-packages/dask_cuda/cli.py:61
     56 @click.group
     57 def cuda():
     58     """Subcommands to launch or query distributed workers with GPUs."""
---> 61 @cuda.command(name="worker", context_settings=dict(ignore_unknown_options=True))
     62 @scheduler
     63 @preload_argv
     64 @click.option(
     65     "--host",
     66     type=str,
     67     default=None,
     68     help="""IP address of serving host; should be visible to the scheduler and other
     69     workers. Can be a string (like ``"127.0.0.1"``) or ``None`` to fall back on the
     70     address of the interface specified by ``--interface`` or the default interface.""",
     71 )
   (...)
    322 def worker(
   (...)
    357     **kwargs,
    358 ):
    359     """Launch a distributed worker with GPUs attached to an existing scheduler.
    360 
    361     A scheduler can be specified either through a URI passed through the ``SCHEDULER``
   (...)
    366     for info.
    367     """
    368     if multiprocessing_method == "forkserver":

AttributeError: 'function' object has no attribute 'command'

I can also reproduce this by importing the submodule.

>>> import dask_cuda.cli
AttributeError: 'function' object has no attribute 'command'

@jacobtomlinson (Collaborator)

Looks like Databricks gives us click==8.0.4, which causes this bug. It's fixed in click>=8.1, so I've bumped the minimum version here and opened PRs upstream to do the same.
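
For context, the failure can be reproduced with any click 8.0.x install (a minimal sketch mirroring the dask_cuda/cli.py code in the traceback above, not code from this repo): in click < 8.1, @click.group cannot be used bare, without parentheses.

import click

@click.group  # on click < 8.1 this returns an inner decorator function,
def cuda():   # not a Group, because the bare form isn't supported yet
    """Subcommands to launch or query distributed workers with GPUs."""

@cuda.command(name="worker")  # AttributeError on click 8.0.x:
def worker():                 # 'function' object has no attribute 'command'
    pass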

@jacobtomlinson (Collaborator) commented Nov 9, 2023

Ok I got things working! I pushed a couple of extra commits to this PR but ultimately reverted one.

The main change I've made is to bump the minimum version of click to >=8.1.

I used g4dn.xlarge instances for the driver and workers, with Photon disabled and the custom container image databricksruntime/gpu-tensorflow:cuda11.8.

Then I used this init script, which makes a couple of small tweaks to account for the databricksruntime/gpu-tensorflow:cuda11.8 container and also installs cudf and dask-cudf.

#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in 
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install git just so that we can install dask-databricks from source
# as it's not included in databricksruntime/gpu-tensorflow:cuda11.8.
# We can remove this when installing dask-databricks from PyPI.
apt-get update && apt-get install git -y

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
      bokeh==3.2.2 \
      cudf-cu11 \
      dask[complete] \
      dask-cudf-cu11 \
      dask-cuda \
      git+https://github.com/skirui-source/dask-databricks.git@main 

# Start the Dask cluster with CUDA workers
dask databricks run --worker-command "dask cuda worker"
[Two screenshots attached, taken 2023-11-09, showing the running cluster.]

@jacobtomlinson (Collaborator)

I cleaned things up a little further, including adding a --cuda flag for convenience.

dask databricks run --cuda
# is equivalent to
dask databricks run --worker-command "dask cuda worker"
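
Roughly, the flag just maps onto the existing worker-command option (a hypothetical sketch of the idea with made-up names, not the PR's exact code):

import click

@click.command()
@click.option("--cuda", is_flag=True, default=False,
              help="Shorthand for --worker-command 'dask cuda worker'.")
@click.option("--worker-command", default="dask worker",
              help="Command used to launch each worker process.")
def run(cuda: bool, worker_command: str) -> None:
    # --cuda is sugar for selecting the dask-cuda worker command
    if cuda:
        worker_command = "dask cuda worker"
    click.echo(f"Starting workers with: {worker_command}")

if __name__ == "__main__":
    run()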

I also noticed that not pinning dask-cuda occasionally results in getting a super old version of dask-cuda (0.18.0).

So things work nicely now with this init script.

#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in 
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install git just so that we can install dask-databricks from source
# as it's not included in databricksruntime/gpu-tensorflow:cuda11.8.
# We can remove this when installing dask-databricks from PyPI.
apt-get update && apt-get install git -y

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
      bokeh==3.2.2 \
      cudf-cu11 \
      dask[complete] \
      dask-cudf-cu11 \
      dask-cuda==23.10.0 \
      git+https://github.com/skirui-source/dask-databricks.git@main 

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

@jacobtomlinson merged commit 2e42701 into dask-contrib:main on Nov 9, 2023
4 checks passed
@jacobtomlinson mentioned this pull request on Nov 9, 2023