
Add alternative worker commands, config options #20

Merged · 12 commits · Nov 9, 2023

Conversation

@skirui-source (Collaborator) commented Nov 8, 2023

Closes #10 (Add support for alternative worker commands and config options)

@jacobtomlinson (Collaborator) left a comment

Thanks for this, excited to give it a spin.

One quick thought: what happens if the options the user wants to provide contain spaces?

$ dask databricks run --worker-args "--foo 'bar baz'"

The above example wouldn't split up cleanly. I wonder if we also want to add optional JSON support, so that before calling worker_args.split() we first try json.loads(worker_args).

That way a user could specify a JSON list of arguments if they want to be explicit.

$ dask databricks run --worker-args '["--foo", "bar baz"]'
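
A minimal sketch of that fallback logic (parse_worker_args is a hypothetical helper illustrating the suggestion, not the PR's actual implementation):

import json

def parse_worker_args(worker_args: str) -> list[str]:
    """Accept either a JSON list or a plain space-separated string."""
    try:
        # Try JSON first so users can be explicit about quoting
        args = json.loads(worker_args)
        if isinstance(args, list):
            return [str(a) for a in args]
    except json.JSONDecodeError:
        pass
    # Fall back to naive whitespace splitting
    return worker_args.split()

print(parse_worker_args('["--foo", "bar baz"]'))  # ['--foo', 'bar baz']
print(parse_worker_args("--nthreads 2"))          # ['--nthreads', '2']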

@skirui-source (Collaborator, Author)

Tested my changes with the following script:

#!/bin/bash
set -e

pip install --upgrade pip dask[complete] git+https://github.com/skirui-source/dask-databricks.git@main dask-cuda==23.10.0 bokeh==3.2.2
pip install pyspark==3.5.0 numpy==1.23.5 scikit-learn==0.22.1
dask databricks run --worker-command "dask cuda worker" --worker-args "--nthreads 2"

Seeing this error on the Scheduler/Driver node:

Running command git clone --filter=blob:none --quiet https://github.com/skirui-source/dask-databricks.git /tmp/pip-req-build-pixm6hlg
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.1 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 1.26.1 which is incompatible.
scipy 1.9.1 requires numpy<1.25.0,>=1.18.5, but you have numpy 1.26.1 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imbalanced-learn 0.10.1 requires scikit-learn>=1.0.2, but you have scikit-learn 0.22.1 which is incompatible.
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
2023-11-09 07:32:18,796 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-09 07:32:19,132 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-11-09 07:32:19,168 - distributed.scheduler - INFO - State start
2023-11-09 07:32:19,174 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-09 07:32:19,174 - distributed.scheduler - INFO -   Scheduler at:  tcp://10.59.230.165:8786
2023-11-09 07:32:19,175 - distributed.scheduler - INFO -   dashboard at:  http://10.59.230.165:8787/status
2023-11-09 07:32:19,175 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2023-11-09 07:32:19,875 - distributed.comm.tcp - INFO - Connection from tcp://10.59.241.62:56834 closed before handshake completed
2023-11-09 07:32:19,877 - distributed.comm.tcp - INFO - Connection from tcp://10.59.249.7:34864 closed before handshake completed
2023-11-09 07:34:32,415 - distributed.scheduler - INFO - Receive client connection: Client-6c879bb7-7ed2-11ee-8e0b-00163e5e434a
2023-11-09 07:34:32,417 - distributed.core - INFO - Starting established connection to tcp://10.59.230.165:56762

@skirui-source (Collaborator, Author)

And this error from the Dask worker:

Running command git clone --filter=blob:none --quiet https://github.com/skirui-source/dask-databricks.git /tmp/pip-req-build-vvt7ougt
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.1 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 1.26.1 which is incompatible.
scipy 1.9.1 requires numpy<1.25.0,>=1.18.5, but you have numpy 1.26.1 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imbalanced-learn 0.10.1 requires scikit-learn>=1.0.2, but you have scikit-learn 0.22.1 which is incompatible.
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
/databricks/python3/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.
  warnings.warn(
Usage: dask [OPTIONS] COMMAND [ARGS]...
Try 'dask -h' for help.

Error: No such command 'cuda'.

@jacobtomlinson (Collaborator) commented Nov 9, 2023

I think the core of the problem is in this line:

/databricks/python/lib/python3.10/site-packages/dask/cli.py:100: UserWarning: While registering the command with name 'cuda', an exception ocurred; 'function' object has no attribute 'command'.

I reproduced the error in a Databricks notebook to get the full traceback:

import importlib_metadata
[ep] = [ep for ep in importlib_metadata.entry_points(group="dask_cli") if ep.name == "cuda"]
ep.load()
AttributeError: 'function' object has no attribute 'command'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File <command-1479567383531443>, line 3
      1 import importlib_metadata
      2 [ep] = [ep for ep in importlib_metadata.entry_points(group="dask_cli") if ep.name == "cuda"]
----> 3 ep.load()

File /databricks/python/lib/python3.10/site-packages/importlib_metadata/__init__.py:209, in EntryPoint.load(self)
    204 """Load the entry point from its definition. If only a module
    205 is indicated by the value, return that module. Otherwise,
    206 return the named object.
    207 """
    208 match = self.pattern.match(self.value)
--> 209 module = import_module(match.group('module'))
    210 attrs = filter(None, (match.group('attr') or '').split('.'))
    211 return functools.reduce(getattr, attrs, module)

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File /databricks/python/lib/python3.10/site-packages/dask_cuda/cli.py:61
     56 @click.group
     57 def cuda():
     58     """Subcommands to launch or query distributed workers with GPUs."""
---> 61 @cuda.command(name="worker", context_settings=dict(ignore_unknown_options=True))
     62 @scheduler
     63 @preload_argv
     64 @click.option(
     65     "--host",
     66     type=str,
     67     default=None,
     68     help="""IP address of serving host; should be visible to the scheduler and other
     69     workers. Can be a string (like ``"127.0.0.1"``) or ``None`` to fall back on the
     70     address of the interface specified by ``--interface`` or the default interface.""",
     71 )
   (...)
    322 def worker(
   (...)
    357     **kwargs,
    358 ):
    359     """Launch a distributed worker with GPUs attached to an existing scheduler.
    360 
    361     A scheduler can be specified either through a URI passed through the ``SCHEDULER``
   (...)
    366     for info.
    367     """
    368     if multiprocessing_method == "forkserver":

AttributeError: 'function' object has no attribute 'command'

I can also reproduce this by importing the submodule.

>>> import dask_cuda.cli
AttributeError: 'function' object has no attribute 'command'

@jacobtomlinson (Collaborator)

Looks like Databricks gives us click==8.0.4, which causes this bug. It's fixed in click>=8.1, so I've bumped the minimum version here and opened PRs upstream to do the same.
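
For context, the failure can be reproduced with any click 8.0.x install (a minimal sketch mirroring the dask_cuda/cli.py code in the traceback above, not code from this repo): in click < 8.1, @click.group cannot be used bare, without parentheses.

import click

@click.group  # on click < 8.1 this returns an inner decorator function,
def cuda():   # not a Group, because the bare form isn't supported yet
    """Subcommands to launch or query distributed workers with GPUs."""

@cuda.command(name="worker")  # AttributeError on click 8.0.x:
def worker():                 # 'function' object has no attribute 'command'
    pass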

@jacobtomlinson (Collaborator) commented Nov 9, 2023

Ok I got things working! I pushed a couple of extra commits to this PR but ultimately reverted one.

The main change I've made is to bump the minimum version of click to >=8.1.

I used g4dn.xlarge instances for the driver and workers, with Photon disabled and the custom container image databricksruntime/gpu-tensorflow:cuda11.8.

Then I used this init script, which makes a couple of small tweaks to account for the databricksruntime/gpu-tensorflow:cuda11.8 container and also installs cudf and dask-cudf.

#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in 
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install git just so that we can install dask-databricks from source
# as it's not included in databricksruntime/gpu-tensorflow:cuda11.8.
# We can remove this when installing dask-databricks from PyPI.
apt-get update && apt-get install git -y

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
      bokeh==3.2.2 \
      cudf-cu11 \
      dask[complete] \
      dask-cudf-cu11 \
      dask-cuda \
      git+https://github.com/skirui-source/dask-databricks.git@main 

# Start the Dask cluster with CUDA workers
dask databricks run --worker-command "dask cuda worker"
[Two screenshots attached, taken 2023-11-09, showing the running cluster.]

@jacobtomlinson (Collaborator)

I cleaned things up a little further, including adding a --cuda flag for convenience.

dask databricks run --cuda
# is equivalent to
dask databricks run --worker-command "dask cuda worker"
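
Roughly, the flag just maps onto the existing worker-command option (a hypothetical sketch of the idea with made-up names, not the PR's exact code):

import click

@click.command()
@click.option("--cuda", is_flag=True, default=False,
              help="Shorthand for --worker-command 'dask cuda worker'.")
@click.option("--worker-command", default="dask worker",
              help="Command used to launch each worker process.")
def run(cuda: bool, worker_command: str) -> None:
    # --cuda is sugar for selecting the dask-cuda worker command
    if cuda:
        worker_command = "dask cuda worker"
    click.echo(f"Starting workers with: {worker_command}")

if __name__ == "__main__":
    run()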

I also noticed that not pinning dask-cuda occasionally results in getting a super old version of dask-cuda (0.18.0).

So things work nicely now with this init script.

#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in 
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install git just so that we can install dask-databricks from source
# as it's not included in databricksruntime/gpu-tensorflow:cuda11.8.
# We can remove this when installing dask-databricks from PyPI.
apt-get update && apt-get install git -y

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
      bokeh==3.2.2 \
      cudf-cu11 \
      dask[complete] \
      dask-cudf-cu11 \
      dask-cuda==23.10.0 \
      git+https://github.com/skirui-source/dask-databricks.git@main 

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

@jacobtomlinson merged commit 2e42701 into dask-contrib:main on Nov 9, 2023
4 checks passed
@jacobtomlinson mentioned this pull request on Nov 9, 2023