Error using single gpu for training #21

Open
Adnan-Khan7 opened this issue Nov 9, 2022 · 5 comments

Comments

@Adnan-Khan7

Adnan-Khan7 commented Nov 9, 2022

Thanks for the work you have done.

I get the following error when training on a single GPU:
ValueError: num_samples should be a positive integer value, but got num_samples=-67108864

The command I am using is: python train.py --rank 0 --gpu 0

Can you please assist?

Thanks

@LeeDoYup
Owner

LeeDoYup commented Nov 9, 2022

Hello @Adnan-Khan7, could you share the error details, such as the full traceback and the line of code that fails?

@Adnan-Khan7
Author

Sure, please have a look at the traceback:

Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main(args)
  File "train.py", line 67, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 194, in main_worker
    loader_dict['train_lb'] = get_data_loader(dset_dict['train_lb'],
  File "/home/adnan.khan/FixMatch-pytorch/datasets/data_utils.py", line 120, in get_data_loader
    data_sampler = data_sampler(dset, replacement, num_samples, generator)
  File "/home/adnan.khan/.conda/envs/fixmatch/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=-67108864
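
For readers debugging a similar traceback: the reported value is exactly -(2 ** 26), so a single non-positive factor in whatever product yields num_samples would be enough to produce it. The thread does not show how num_samples is computed, so the sketch below is only an assumption: it supposes a product of a batch size, an iteration count, and a replica count, and the helper name compute_num_samples and its parameters are hypothetical, not part of the repository. It validates each factor separately so the failing one is named before the sampler is built.

```python
def compute_num_samples(batch_size: int, num_iters: int, num_replicas: int) -> int:
    """Hypothetical helper: check every factor before multiplying.

    Assumption (not confirmed in the thread): num_samples is the product
    of these three values.
    """
    factors = {
        "batch_size": batch_size,
        "num_iters": num_iters,
        "num_replicas": num_replicas,
    }
    for name, value in factors.items():
        # Name the offending factor instead of failing later inside the sampler.
        if not isinstance(value, int) or value <= 0:
            raise ValueError(
                f"{name} must be a positive integer, but got {name}={value}"
            )
    return batch_size * num_iters * num_replicas


# The value from the traceback is a power of two with a flipped sign.
assert -67108864 == -(2 ** 26)
```

With such a check in place, a stray -1 would surface as an error naming the argument at fault rather than as a negative sample count.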

@LeeDoYup
Owner

LeeDoYup commented Nov 9, 2022

Have you changed any of the default arguments?
There is no logic in the code that should make num_samples negative.

Do you get the same error with the command python train.py --world-size 1 --rank 0 ?

@Adnan-Khan7
Author

I didn't change any other default arguments. Adding --world-size 1 now produces a ZeroDivisionError instead. Here is the command I am running and its output:

python train.py --world-size 1 --rank 0 --overwrite

train.py:40: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
  warnings.warn('You have chosen to seed training. '
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main(args)
  File "train.py", line 67, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 102, in main_worker
    if args.rank % ngpus_per_node == 0:
ZeroDivisionError: integer division or modulo by zero

Adding --gpu 0, as in
python train.py --world-size 1 --rank 0 --gpu 0 --overwrite
produces the same error, with one additional warning:

train.py:40: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
  warnings.warn('You have chosen to seed training. '
train.py:47: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main(args)
  File "train.py", line 67, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 102, in main_worker
    if args.rank % ngpus_per_node == 0:
ZeroDivisionError: integer division or modulo by zero
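
The ZeroDivisionError at `args.rank % ngpus_per_node` means that ngpus_per_node was 0 when that line ran. The thread does not say where the value comes from. If it is a count of visible GPUs (for example from `torch.cuda.device_count()`), a result of 0 would suggest that no CUDA device is visible to the process, but that is an assumption. The sketch below, with the hypothetical helper name is_first_rank_on_node, shows one way to guard the modulo so the failure is reported in plain terms.

```python
def is_first_rank_on_node(rank: int, ngpus_per_node: int) -> bool:
    """Hypothetical guard around the `rank % ngpus_per_node == 0` check."""
    if ngpus_per_node <= 0:
        # Fail early with a readable message instead of a ZeroDivisionError.
        raise RuntimeError(
            f"ngpus_per_node={ngpus_per_node}: no GPUs were detected. "
            "Check that CUDA devices are visible to this process."
        )
    return rank % ngpus_per_node == 0


# Unguarded, the same expression fails exactly as in the traceback.
try:
    _ = 0 % 0
except ZeroDivisionError:
    print("rank % ngpus_per_node raises ZeroDivisionError when ngpus_per_node is 0")
```

If the guard fires, a useful next step would be to print the detected GPU count on the training machine before launching train.py.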

@Adnan-Khan7
Author

Dear Lee, any comments on the above-stated error?
