Skip to content

Commit

Permalink
release: v0.2.0 (#206)
Browse files Browse the repository at this point in the history
* fix manifest.in

* remove tools dir

* __version__.py: 0.2.0

* update readme

* fix docs

* fix best-practice.md

* readme: nccl path

* readme: improve

* readme: improve

* add changelog.rst

* fix changelog

* readme: add pypi badge and news

* improve readme and changelog
  • Loading branch information
ymjiang authored Feb 19, 2020
1 parent af6fd58 commit 536dd85
Show file tree
Hide file tree
Showing 11 changed files with 81 additions and 207 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Changelog for BytePS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

0.2.0 (2020-02)
------------------
* Largely improve RDMA performance by enforcing page aligned memory.
* Add IPC support for RDMA. Now support colocating servers and workers without sacrificing much performance.
* Fix a hanging bug in BytePS server.
* Fix RDMA-related segmentation fault problem during fork() (e.g., used by PyTorch data loader).
* New feature: Enable mixing use of colocate and non-colocate servers, along with a smart tensor allocation strategy.
* New feature: Add ``bpslaunch`` as the command to launch tasks.
* Add support for pip install: ``pip3 install byteps``


0.1.0 (2019-12)
------------------
* First official release.
3 changes: 2 additions & 1 deletion MANIFEST.in
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
include */*
include */* LICENSE byteps.lds byteps.exp
exclude .git/*
recursive-include * *.cc *.h
graft 3rdparty/ps-lite
28 changes: 21 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,18 @@

[![Build Status](https://travis-ci.org/bytedance/byteps.svg?branch=master)](https://travis-ci.org/bytedance/byteps)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
![Pypi](https://img.shields.io/pypi/v/byteps.svg)

BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on either TCP or RDMA network.

BytePS outperforms existing open-sourced distributed training frameworks by a large margin. For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than [Horovod](https://github.com/horovod/horovod)+[NCCL](https://github.com/NVIDIA/nccl). In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.

## News

- [BytePS-0.2.0](CHANGELOG.rst) has been released.
- Now pip install is available, refer to the [install tutorial](https://github.com/bytedance/byteps#quick-start).
- [Largely improve RDMA performance](https://github.com/bytedance/byteps/pull/184). Now support colocating servers and workers with high performance.
- Fix [RDMA fork problem](https://github.com/bytedance/byteps/pull/192) caused by multi-processing.
- [New Server](https://github.com/bytedance/byteps/pull/151): We improve the server performance by a large margin, and it is now independent of MXNet KVStore. Try our [new docker images](docker/).
- Use [the ssh launcher](launcher/) to launch your distributed jobs
- [Improved key distribution strategy for better load-balancing](https://github.com/bytedance/byteps/pull/116)
Expand Down Expand Up @@ -41,21 +46,30 @@ BytePS also incorporates many acceleration techniques such as hierarchical strat

We provide a [step-by-step tutorial](docs/step-by-step-tutorial.md) for you to run benchmark training tasks. The simplest way to start is to use our [docker images](docker). Refer to [Documentations](docs) for how to [launch distributed jobs](docs/running.md) and more [detailed configurations](docs/env.md). After you can start BytePS, read [best practice](docs/best-practice.md) to get the best performance.

Below, we explain how to build and run BytePS by yourself. BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet. BytePS depends on CUDA and NCCL, and requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try `yum install devtoolset-7` before everything else.
Below, we explain how to install BytePS by yourself. There are two options.

### Install by pip

```
pip3 install byteps
```

### Build from source code

If the above does not contain your desired wheel resource, or you want to try building from source code:
You can try out the latest features by directly installing from master branch:

```
git clone --recurse-submodules https://github.com/bytedance/byteps
git clone --recursive https://github.com/bytedance/byteps
cd byteps
python setup.py install
python3 setup.py install
```

Notes:
- For best compatibility, please pin your gcc to 4.9 before building, [here](https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.pytorch#L72-L80) is an example.
- You may set `BYTEPS_USE_RDMA=1` to install with RDMA support. Before this, make sure your RDMA drivers have been properly installed and tested.
Notes for above two options:
- BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
- BytePS depends on CUDA and NCCL. You should specify the NCCL path with `export BYTEPS_NCCL_HOME=/path/to/nccl`. By default it points to `/usr/local/nccl`.
- The installation requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try `yum install devtoolset-7` before everything else. In general, we recommend using gcc 4.9 for best compatibility ([an example](https://github.com/bytedance/byteps/blob/3fba75def0d81c1d3225f8f397cc985200f57de7/docker/Dockerfile.mxnet#L72-L80) to pin gcc).
- RDMA support: During setup, the script will automatically detect the RDMA header file. If you want to use RDMA, make sure your RDMA environment has been properly installed and tested before install ([an example](https://github.com/bytedance/byteps/blob/3fba75def0d81c1d3225f8f397cc985200f57de7/docker/Dockerfile.mxnet#L29-L33) for Ubuntu-18.04).


## Use BytePS in Your Code

Expand Down
2 changes: 1 addition & 1 deletion byteps/__version__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
VERSION = (0, 1, 0)
VERSION = (0, 2, 0)

__version__ = '.'.join(map(str, VERSION))
12 changes: 10 additions & 2 deletions docs/best-practice.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,15 +13,23 @@ If you have NVLinks, leave `BYTEPS_PCIE_SWITCH_SIZE` unmodified. If you don't kn

## Multi-machine (distributed mode)

This mode requires at least **4** physical machines, otherwise you won't see any benefits of BytePS. Two of the machines should have GPUs and run as workers. The other two run as servers and do not need GPUs. The scheduler can run on any machine.
### With additional CPU servers

This mode requires at least **4** physical machines. Two of the machines should have GPUs and run as workers. The other two run as CPU servers and do not need GPUs. The scheduler can run on any machine.

The key here is to make sure the following:
* Servers must be on different physical machines from workers.
* The total bandwidth of the servers must be equal or larger than the total bandwidth of workers.

If you are using RDMA, this should be sufficient. However, with TCP and >=25Gbps networks, it's possible that BytePS cannot fully utilize the bandwidth because a single TCP connection usually cannot run up to 25Gbps.

To address this, you can try running more BytePS server instances on the server machines. For example, you can try running two server instances per server machines. This effectively doubles the number of TCP connections and should be sufficient for 25Gbps networks. For 40Gbps/50Gbps networks, you need three server instances per server machine, and so on. When doing this, you probably need to set `MXNET_OMP_MAX_THREADS` as: your CPU cores number divided by number of server instances per machine. For example, one machine has 32 cores and you put 4 server instances on it, then you need to `export MXNET_OMP_MAX_THREADS=8`. The idea is to reduce the CPU contention of different server instances.
To address this, you can try running more BytePS server instances on the server machines. For example, you can try running two server instances per server machines. This effectively doubles the number of TCP connections and should be sufficient for 25Gbps networks. For 40Gbps/50Gbps networks, you need three server instances per server machine, and so on.

### No additional CPU servers

When you don't have additional CPU servers, then for each physical machine, you should launch a worker and a server process. We call this *co-locate* mode, and the resource consumption is the same with Horovod (no additional servers).

If you are using TCP, you will probably get near-identical performance with Horovod-TCP. However, if you are using RDMA, you can set `BYTEPS_ENABLE_IPC=1` to enable the IPC communication between the co-located worker and server. And eventually you will get higher end-to-end performance than Horovod.

## The expected performance

Expand Down
8 changes: 4 additions & 4 deletions docs/env.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,16 +107,16 @@ export BYTEPS_NCCL_GROUP_SIZE=w
```

Servers can also be the performance bottleneck, e.g., when there are only one server but multiple workers.
You can try to increase the number of push threads on the servers (default is 1):
You can try to increase the number of processing threads on the servers (default is 4):

```
export SERVER_PUSH_NTHREADS=v
export BYTEPS_SERVER_ENGINE_THREAD=v
```

Increasing the number of engine CPU threads may also improves server performance:
Or enable scheduling at the server side to prioritize tensors with higher priority:

```
export MXNET_CPU_WORKER_NTHREADS=p
export BYTEPS_SERVER_ENABLE_SCHEDULE=1
```

## Asynchronous training
Expand Down
8 changes: 4 additions & 4 deletions docs/running.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,15 +11,15 @@ On worker 0, run:
```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
bpslaunch YOUR_COMMAND
```

On worker 1, run (only DMLC_WORKER_ID is different from above):

```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=1 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
bpslaunch YOUR_COMMAND
```

**For servers and schedulers, we highly recommend you use the docker image we build:**
Expand All @@ -32,14 +32,14 @@ Start server and scheduler docker instances with this image. In the server, run

```
DMLC_ROLE=server DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 bpslaunch
```

On the scheduler, run (we also remove DMLC_WORKER_ID, and set role to scheduler):

```
DMLC_ROLE=scheduler DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 bpslaunch
```

In this example, your scheduler must be able to bind to `10.0.0.1:9000`.
Expand Down
52 changes: 15 additions & 37 deletions docs/step-by-step-tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,9 +23,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```

### PyTorch
Expand All @@ -47,9 +45,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
--model resnet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```

### MXNet
Expand All @@ -70,9 +66,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
--benchmark 1 --batch-size=32
bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
```

## Distributed Training (TCP)
Expand All @@ -95,7 +89,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```

For the server:
Expand All @@ -111,7 +105,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```


Expand All @@ -129,9 +123,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```

For worker-1:
Expand All @@ -149,26 +141,20 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```


If your workers use PyTorch, you need to change the image name to `bytepsimage/pytorch`, and replace the python script of the workers with

```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
--model resnet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```


If your workers use MXNet, you need to change the image name to `bytepsimage/mxnet`, and replace the python script of the workers with
```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
--benchmark 1 --batch-size=32
bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
```

## Distributed Training with RDMA
Expand Down Expand Up @@ -198,7 +184,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```

For the server:
Expand All @@ -222,7 +208,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```

For worker-0:
Expand Down Expand Up @@ -250,9 +236,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```

For worker-1:
Expand Down Expand Up @@ -281,25 +265,19 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```



If your workers use PyTorch, you need to change the image name to `bytepsimage/pytorch`, and replace the python script of the workers with

```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
--model resnet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```


If your workers use MXNet, you need to change the image name to `bytepsimage/mxnet`, and replace the python script of the workers with
```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
--benchmark 1 --batch-size=32
bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
```
12 changes: 6 additions & 6 deletions docs/troubleshooting.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ When launching distributed jobs, if you see hanging at the beginning, one possib
Install ps-lite:

```
git clone --branch byteps https://github.com/bytedance/ps-lite.git
git clone -b byteps https://github.com/bytedance/ps-lite.git
cd ps-lite
make -j
```
Expand All @@ -25,7 +25,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
export DMLC_INTERFACE=eth0
./ps-lite/tests/test_kv_app_benchmark
./ps-lite/tests/test_benchmark
```

For the server
Expand All @@ -36,7 +36,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
export DMLC_INTERFACE=eth0
./ps-lite/tests/test_kv_app_benchmark
./ps-lite/tests/test_benchmark
```

For the worker:
Expand All @@ -47,13 +47,13 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
export DMLC_INTERFACE=eth0
./ps-lite/tests/test_kv_app_benchmark 1024000 100 0
./ps-lite/tests/test_benchmark 1024000 100 0
```

If it succeed, you should be able to see something like this on the worker.
```
tests/test_kv_app_benchmark.cc:77: push_byte=4096000, repeat=100, total_time=128.842ms
tests/test_kv_app_benchmark.cc:91: pull_byte=4096000, repeat=100, total_time=353.38ms
push_byte=4096000, repeat=100, total_time=128.842ms
pull_byte=4096000, repeat=100, total_time=353.38ms
```

(Note: for RDMA networks, use `make -j USE_RDMA=1` to build, and `export DMLC_ENABLE_RDMA=1` for running the scheduler / server / worker)
Expand Down
Loading

0 comments on commit 536dd85

Please sign in to comment.