diff --git a/GETTING_STARTED.md b/GETTING_STARTED.md
index 0894d2d4..eeb3563d 100644
--- a/GETTING_STARTED.md
+++ b/GETTING_STARTED.md
@@ -48,27 +48,32 @@ to understand their behavior. Some common arguments are:
 ```
-* To train a model on 8 NPUs/GPUs:
-  ```
-  mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
-  ```
-
 * To train a model on 1 NPU/GPU/CPU:
   ```
   python train.py --config ./configs/yolov7/yolov7.yaml
   ```
-
+* To train a model on 8 NPUs/GPUs:
+  ```
+  msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
+  ```
 * To evaluate a model's performance on 1 NPU/GPU/CPU:
   ```
   python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt
   ```
 * To evaluate a model's performance on 8 NPUs/GPUs:
   ```
-  mpirun --allow-run-as-root -n 8 python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
+  msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
   ```

 *Notes: (1) The default hyper-parameters are tuned for 8-device training, so some of them need to be adjusted when running on a single device. (2) The default device is Ascend; you can change it by setting 'device_target' to Ascend/GPU/CPU, which are the currently supported targets.*

 * For more options, see `train/test.py -h`.

+* Note that when launching with `msrun` on 2 devices, please add `--bind_core=True` to improve performance. For example:
+  ```
+  msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
+        --log_dir=msrun_log --join=True --cluster_time_out=300 \
+        python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
+  ```
+> For more information, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html).

 ### Deployment

diff --git a/GETTING_STARTED_CN.md b/GETTING_STARTED_CN.md
index 5daeb397..35405a36 100644
--- a/GETTING_STARTED_CN.md
+++ b/GETTING_STARTED_CN.md
@@ -45,18 +45,15 @@ python demo/predict.py --config ./configs/yolov7/yolov7.yaml --weight=/path_to_c
 ```
-* To run distributed training on multiple NPUs/GPUs (taking 8 devices as an example):
-
-  ```shell
-  mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
-  ```
-
 * To train a model on a single NPU/GPU/CPU:

   ```shell
   python train.py --config ./configs/yolov7/yolov7.yaml
   ```
-
+* To run distributed training on multiple NPUs/GPUs (taking 8 devices as an example):
+  ```shell
+  msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
+  ```
 * To evaluate a model's accuracy on a single NPU/GPU/CPU:

   ```shell
@@ -65,12 +62,20 @@ python demo/predict.py --config ./configs/yolov7/yolov7.yaml --weight=/path_to_c
   ```
 * To run distributed evaluation of a model's accuracy on multiple NPUs/GPUs:

   ```shell
-  mpirun --allow-run-as-root -n 8 python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
+  msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
   ```

 *Note: the default hyper-parameters are tuned for 8-device training, so some of them need to be adjusted when running on a single device. The default device is Ascend; you can set 'device_target' to Ascend/GPU/CPU (see the example below).*

 * For more options, see `train/test.py -h`.
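+* To switch the target device for the single-card commands above, override `device_target` on the command line. A minimal example (Ascend is the default; GPU and CPU are the other supported values):
+
+  ```shell
+  # run the same training recipe on GPU instead of the default Ascend
+  python train.py --config ./configs/yolov7/yolov7.yaml --device_target GPU
+  ```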
-* To train on the cloud platform, see [here](./tutorials/cloud/modelarts_CN.md)
+* To train on the cloud platform, see [here](./tutorials/cloud/modelarts_CN.md).
+
+*Note: if you launch with `msrun` on 2 devices, please add `--bind_core=True` to improve performance. For example:*
+
+```
+msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
+      --log_dir=msrun_log --join=True --cluster_time_out=300 \
+      python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
+```
+> For more information, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html).

 ### Deployment

diff --git a/configs/yolov3/README.md b/configs/yolov3/README.md
index 58fcc1c1..ff4959bf 100644
--- a/configs/yolov3/README.md
+++ b/configs/yolov3/README.md
@@ -56,11 +56,11 @@ python mindyolo/utils/convert_weight_darknet53.py
 It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config ./configs/yolov3/yolov3.yaml --device_target Ascend --is_parallel True
+msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov3_log python train.py --config ./configs/yolov3/yolov3.yaml --device_target Ascend --is_parallel True
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
-Similarly, you can train the model on multiple GPU devices with the above mpirun command.
+Similarly, you can train the model on multiple GPU devices with the above msrun command.
+**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

diff --git a/configs/yolov4/README.md b/configs/yolov4/README.md
index eab33a86..d9255029 100644
--- a/configs/yolov4/README.md
+++ b/configs/yolov4/README.md
@@ -70,11 +70,11 @@ python mindyolo/utils/convert_weight_cspdarknet53.py
 It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
+msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov4_log python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
-Similarly, you can train the model on multiple GPU devices with the above mpirun command.
+Similarly, you can train the model on multiple GPU devices with the above msrun command (a two-node sketch follows below).
+**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).
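+As a further illustration of msrun configuration, the same recipe could be scaled to two nodes with 8 devices each by running one command per node. This is only a sketch: `--master_addr` and `--node_rank` are taken from the msrun launcher tutorial linked above, so please verify them there before use.
+
+```shell
+# run once on each node; node 0 hosts the scheduler, 16 workers in total
+msrun --worker_num=16 --local_worker_num=8 --master_addr=<node0_ip> --master_port=8118 \
+      --node_rank=<0 or 1> --bind_core=True --log_dir=./yolov4_log \
+      python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
+```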
diff --git a/configs/yolov5/README.md b/configs/yolov5/README.md
index 1a003cd9..a80b2589 100644
--- a/configs/yolov5/README.md
+++ b/configs/yolov5/README.md
@@ -50,11 +50,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
 It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config ./configs/yolov5/yolov5n.yaml --device_target Ascend --is_parallel True
+msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov5_log python train.py --config ./configs/yolov5/yolov5n.yaml --device_target Ascend --is_parallel True
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
-Similarly, you can train the model on multiple GPU devices with the above mpirun command.
+Similarly, you can train the model on multiple GPU devices with the above msrun command.
+**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

diff --git a/configs/yolov7/README.md b/configs/yolov7/README.md
index d8ff6862..ba7a91df 100644
--- a/configs/yolov7/README.md
+++ b/configs/yolov7/README.md
@@ -51,11 +51,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
 It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --device_target Ascend --is_parallel True
+msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --device_target Ascend --is_parallel True
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
-Similarly, you can train the model on multiple GPU devices with the above mpirun command.
+Similarly, you can train the model on multiple GPU devices with the above msrun command.
+**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

diff --git a/configs/yolov8/README.md b/configs/yolov8/README.md
index 95a6d3ce..1ea0386e 100644
--- a/configs/yolov8/README.md
+++ b/configs/yolov8/README.md
@@ -60,11 +60,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
 It is easy to reproduce the reported results with the pre-defined training recipe.
 For distributed training on multiple Ascend 910 devices, please run
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --is_parallel True
+msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov8_log python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --is_parallel True
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
-Similarly, you can train the model on multiple GPU devices with the above mpirun command.
+Similarly, you can train the model on multiple GPU devices with the above msrun command.
+**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

diff --git a/configs/yolox/README.md b/configs/yolox/README.md
index 5812198c..18d8a8c2 100644
--- a/configs/yolox/README.md
+++ b/configs/yolox/README.md
@@ -50,13 +50,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
 It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config ./configs/yolox/yolox-s.yaml --device_target Ascend --is_parallel True
+msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolox_log python train.py --config ./configs/yolox/yolox-s.yaml --device_target Ascend --is_parallel True
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
-
-
-Similarly, you can train the model on multiple GPU devices with the above mpirun command.
+Similarly, you can train the model on multiple GPU devices with the above msrun command.
+**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).
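+After a distributed launch, each process writes its output to the directory given by `--log_dir`, which is the first place to check on the progress or failure of a run. A quick way to follow training (the `worker_0.log` file name assumes msrun's default per-rank naming; check your log directory for the exact names):
+
+```shell
+# follow the rank-0 worker log of the yolox run above
+tail -f ./yolox_log/worker_0.log
+```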
diff --git a/examples/finetune_SHWD/README.md b/examples/finetune_SHWD/README.md
index 89ecdec5..bfaaf78c 100644
--- a/examples/finetune_SHWD/README.md
+++ b/examples/finetune_SHWD/README.md
@@ -114,7 +114,7 @@ optimizer:
 * To run distributed training on multiple NPUs/GPUs (taking 8 devices as an example):

   ```shell
-  mpirun --allow-run-as-root -n 8 python train.py --config ./examples/finetune_SHWD/yolov7-tiny_shwd.yaml --is_parallel True
+  msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7-tiny_log python train.py --config ./examples/finetune_SHWD/yolov7-tiny_shwd.yaml --is_parallel True
   ```

 * To train a model on a single NPU/GPU/CPU:

diff --git a/tutorials/configuration_CN.md b/tutorials/configuration_CN.md
index cd7df649..18aa4150 100644
--- a/tutorials/configuration_CN.md
+++ b/tutorials/configuration_CN.md
@@ -38,7 +38,7 @@ __BASE__: [
 These parameters are usually passed in on the command line, for example:

 ```shell
-  mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True --log_interval 50
+  msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True --log_interval 50
 ```

 ## Dataset