To run algorithms such as SP-NAS, you need to install the open-source software mmdetection. For details, see the installation guide of the software.
Before running the benchmark, install the open-source software NASBench. For details, see the installation guide of the software.
The possible causes are as follows:
- The network is not registered with Vega. Before invoking the network, use @ClassFactory.register to register it. For details, see https://github.com/huawei-noah/vega/tree/master/examples/fully_train/fmd.
- The model description file of the network is incorrect. You can locate the fault based on <model desc> in the exception information.
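The first cause can be illustrated with a minimal sketch of how a class registry resolves the type name given in a model description. MiniFactory, build, and the placeholder SimpleCnn class are hypothetical stand-ins written for this example. They are not Vega's ClassFactory, whose import path and decorator arguments depend on the installed Vega version.

```python
# Minimal stand-in for a class registry; NOT Vega's ClassFactory implementation.
class MiniFactory:
    """Hypothetical registry that maps a type name to a class."""

    _registry = {}

    @classmethod
    def register(cls, target):
        """Decorator that records the class under its own name."""
        cls._registry[target.__name__] = target
        return target

    @classmethod
    def get(cls, name):
        """Look up a class by the type name used in a model description."""
        if name not in cls._registry:
            raise KeyError(f"network '{name}' is not registered")
        return cls._registry[name]


@MiniFactory.register
class SimpleCnn:
    """Placeholder network; a real one would define layers."""

    def __init__(self, num_class=10):
        self.num_class = num_class


def build(desc):
    """Instantiate a network from a description dict with a 'type' key."""
    params = {key: value for key, value in desc.items() if key != "type"}
    return MiniFactory.get(desc["type"])(**params)
```

In this sketch, build({"type": "SimpleCnn", "num_class": 10}) succeeds because the class was registered, while a description naming an unregistered type raises KeyError. This mirrors the failure that occurs when the decorator is missing.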
1.5 Exception ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory
A system library required by opencv-python is missing. Run the following command to install it:
sudo apt install libglib2.0-0
1.6 Exception ModuleNotFoundError: No module named 'skbuild', or installation stuck at Running setup.py bdist_wheel for opencv-python-headless...
The likely cause is that the installed pip version is too old. Run the following command to upgrade it:
pip3 install --user --upgrade pip
1.7 Exception PermissionError: [Errno 13] Permission denied: 'dask-scheduler', FileNotFoundError: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler', or vega: command not found
This type of exception is usually caused by the failure to find dask-scheduler in PATH. Generally, the file is installed in /<user home path>/.local/bin.
After Vega is installed, /<user home path>/.local/bin/ is automatically added to the PATH environment variable. The setting does not take effect immediately. Run source ~/.profile or log in again to make it take effect.
If the problem persists, check whether the dask-scheduler file exists in the /<user home path>/.local/bin directory.
If the file exists, manually add /<user home path>/.local/bin to the PATH environment variable.
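The checks above can be scripted. The following is a minimal sketch that uses only the Python standard library. The function name diagnose and the keys of the returned dictionary are chosen for this example and are not part of Vega.

```python
import os
import shutil
from pathlib import Path


def diagnose(command="dask-scheduler"):
    """Report whether `command` resolves on PATH and whether ~/.local/bin is listed there."""
    local_bin = Path.home() / ".local" / "bin"
    path_dirs = [Path(p) for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    return {
        # Full path of the executable, or None if it is not found on PATH.
        "resolved": shutil.which(command),
        # True if ~/.local/bin appears as an entry in PATH.
        "local_bin_in_path": local_bin in path_dirs,
        # True if the file exists in ~/.local/bin, even when PATH does not include it.
        "file_in_local_bin": (local_bin / command).is_file(),
    }
```

If diagnose() reports that the file is in ~/.local/bin but resolved is None, the PATH setting has not taken effect yet or needs to be added manually.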
1.8 Exception during PyTorch model evaluation: FileNotFoundError: [Errno 2] No such file or directory: '<path>/torch2caffe.prototxt'
For details, see section 6.1 in Evaluate Service.
If multiple GPUs or NPUs are deployed on the host running Vega, you can set the following configuration items to support multiple GPUs or NPUs:
general:
    parallel_search: True
    parallel_fully_train: True
    devices_per_trainer: 1
Where:
- parallel_search: Controls whether multiple models are searched in parallel during the model search phase. Each model uses one or more GPUs/NPUs.
- parallel_fully_train: Controls whether multiple models are trained in parallel during the Fully Train phase. Each model uses one or more GPUs/NPUs.
- devices_per_trainer: If either of the preceding parameters is set to True, this parameter specifies the number of GPUs/NPUs allocated to each model.
Note: The CARS and DARTS algorithms do not support parallel search.
If the running environment has multiple GPUs, set CUDA_VISIBLE_DEVICES when launching Vega to control which GPUs it uses:
Using a single GPU:
CUDA_VISIBLE_DEVICES=1 python3 -m vega.pipeline ./nas/backbone_nas/backbone_nas.yml
Using multiple GPUs:
CUDA_VISIBLE_DEVICES=2,3 python3 -m vega.pipeline ./nas/backbone_nas/backbone_nas.yml
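When Vega is launched from a script instead of a shell, the same environment variable can be set programmatically. The sketch below is one possible way to do this with the standard library. The helper names build_vega_launch and run_vega are hypothetical; only the python3 -m vega.pipeline command and the CUDA_VISIBLE_DEVICES variable come from the commands above.

```python
import os
import subprocess


def build_vega_launch(gpu_ids, config_path):
    """Return (argv, env) for running a Vega pipeline on the given GPU indices."""
    if not gpu_ids:
        raise ValueError("gpu_ids must contain at least one GPU index")
    # Copy the current environment so the parent process is left unchanged.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(index) for index in gpu_ids)
    argv = ["python3", "-m", "vega.pipeline", config_path]
    return argv, env


def run_vega(gpu_ids, config_path):
    """Launch the pipeline and return its exit code."""
    argv, env = build_vega_launch(gpu_ids, config_path)
    return subprocess.run(argv, env=env, check=False).returncode
```

For example, run_vega([2, 3], "./nas/backbone_nas/backbone_nas.yml") corresponds to the multi-GPU command shown above.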
You can load a pre-trained model by modifying the configuration item. For example, to load the pre-trained model simple_cnn.pth:
model:
    model_desc:
        modules: [backbone]
        backbone:
            type: SimpleCnn
            num_class: 10
            fp16: False
    pretrained_model_file: "./simple_cnn.pth"
By default, Vega logs are stored in the following path:
./tasks/<task id>/logs
To configure the log level, modify the following configuration items:
general:
    logger:
        level: info # debug|info|warn|error|
Vega provides a visualized view of the progress of the model search process. To enable it, set VisualCallBack in USER.yml as follows:
trainer:
    type: Trainer
    callbacks: [VisualCallBack, ]
The output directory of the visualized information is as follows:
./tasks/<task id>/visual
Run the tensorboard --logdir PATH command on the active node to start the service, then view the progress in a browser. For details, see TensorBoard commands and instructions.
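The log and visualization paths above follow the same per-task layout, so the PATH argument can be derived from the task ID. The following is a small sketch under that assumption. The helper names task_dirs and tensorboard_argv, and the task ID used in the usage note, are hypothetical.

```python
from pathlib import Path


def task_dirs(task_id, root="./tasks"):
    """Return the log and visualization directories of a Vega task."""
    base = Path(root) / task_id
    return {"logs": base / "logs", "visual": base / "visual"}


def tensorboard_argv(task_id, root="./tasks"):
    """Build the tensorboard command line for a task's visualization output."""
    return ["tensorboard", "--logdir", str(task_dirs(task_id, root)["visual"])]
```

The list returned by tensorboard_argv("0101") can be passed to subprocess.run to start the service for the task whose ID is 0101.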
If only the main Vega process is killed, some processes will not be stopped in time, and the resources occupied by the processes will not be released.
The Vega application can be terminated using the following command:
# Query the process ID of the running Vega main program.
vega-process
# Terminate a Vega main program and related processes by process ID.
vega-kill -p <pid>
# Terminate a Vega main program and related processes by task ID.
vega-kill -t <task id>
# Or stop all Vega-related processes at a time.
vega-kill -a
# If the main program is shut down normally and there are remaining related processes, you can forcibly clear the process.
vega-kill -f
In the multi-GPU/NPU scenario, Vega starts the dask scheduler, dask worker, and trainer. If only the main Vega process is killed, some processes are not stopped in time and the resources occupied by these processes are not released.
Run the following command to stop the Vega application:
# Query the process ID of the running Vega main program.
vega-kill -l
# Stop a Vega main program and related processes.
vega-kill -p <pid>
# Or stop all Vega processes at a time.
vega-kill -a
# If the main program is closed normally and there are still residual processes, you can forcibly clear the process.
vega-kill -f