
having problems with OOM #25

Open
mustardlove opened this issue Jul 11, 2019 · 5 comments

@mustardlove

Hello Mr. Volk,
Thank you very much for your nice code!
I have one question for you.

I'm new to deep learning, have only a basic understanding of Keras code, and am currently trying to run your DSOD_train.py.
The problem is that I keep getting OOM errors while executing the "Train" section of the code (error message below).

I tried using only one of my two GPUs and enabling the 'allow_growth' option in TensorFlow, but neither worked.
I believe I need to reduce the minibatch size (I guess your code uses a batch size of 128, am I right?), but I have no idea where to make this change. Just lowering batch_size = 26 didn't solve the problem, and searching your .py files left me with no clue.
I'd really appreciate your help with this.

By the way, I'm using Ubuntu 16.04 and the latest tensorflow-keras.
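
For reference, here is roughly how I enabled allow_growth (a minimal sketch for a TF 1.x / standalone Keras setup; the exact imports may differ in other versions):

import tensorflow as tf
import keras.backend as K

# Let TensorFlow grow GPU memory on demand instead of reserving it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))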

Error message:

ResourceExhaustedError Traceback (most recent call last)
in
49 workers=1,
50 #use_multiprocessing=False,
---> 51 initial_epoch=initial_epoch)

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your ' + object_name + ' call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1416 use_multiprocessing=use_multiprocessing,
1417 shuffle=shuffle,
-> 1418 initial_epoch=initial_epoch)
1419
1420 @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
215 outs = model.train_on_batch(x, y,
216 sample_weight=sample_weight,
--> 217 class_weight=class_weight)
218
219 outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1215 ins = x + y + sample_weights
1216 self._make_train_function()
-> 1217 outputs = self.train_function(ins)
1218 return unpack_singleton(outputs)
1219

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
2673 fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
2674 else:
-> 2675 fetched = self._callable_fn(*array_vals)
2676 return fetched[:len(self.outputs)]
2677

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1456 ret = tf_session.TF_SessionRunCallable(self._session._session,
1457 self._handle, args,
-> 1458 run_metadata_ptr)
1459 if run_metadata:
1460 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[6,1376,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node batch_normalization_302/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[loss_5/mul/_21899]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[6,1376,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node batch_normalization_302/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

@mvoelk (Owner) commented Jul 11, 2019

In DSOD_train.ipynb, the batch size is actually 6 and the gradients get accumulated with AdamAccumulate for 128//6 batches before a gradient update is performed. This results in a virtual batch size of 126, but the log is updated after each batch.
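
For intuition, here is a rough, self-contained toy sketch of the accumulation idea (this is not the actual AdamAccumulate implementation, just an illustration of the arithmetic):

import numpy as np

# Toy gradient accumulation: the weights are updated only once every
# `accum_iters` mini-batches, so the effective (virtual) batch size is
# batch_size * accum_iters while memory usage stays at `batch_size`.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)                        # toy linear "model"
lr, batch_size = 0.1, 6
accum_iters = 128 // batch_size        # 21 -> virtual batch size 126

accum = np.zeros_like(w)
for step in range(30 * accum_iters):   # 30 virtual batches
    X = rng.normal(size=(batch_size, 3))
    y = X @ w_true
    grad = 2.0 * X.T @ (X @ w - y) / batch_size   # MSE gradient on this mini-batch
    accum += grad
    if (step + 1) % accum_iters == 0:
        w -= lr * accum / accum_iters              # one update with the averaged gradient
        accum[:] = 0.0

print(w)   # converges towards w_true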

Setting the batch size to 4 or even 2 should solve the issue. How large is your GPU memory?

@mustardlove (Author)

Thank you so much for your kind help!
I changed the batch size of the 512 model to 4 and the training code is running!

I'm using two Titan Xp GPUs; the memory spec is as follows:
Memory Speed: 11.4 Gbps
Standard Memory Config: 12 GB GDDR5X
Memory Interface Width: 384-bit
Memory Bandwidth: 547.7 GB/s

Currently the execution is using only one GPU, though... I don't know why.

I have one more question!

In your data_coco.py there is a convert_to_voc function.
I'm only using the COCO dataset, so in DSOD_train I commented out the code related to the VOC dataset and did
gt_util_train = gt_util_coco.convert_to_voc()
gt_util_val = gt_util_coco_val.convert_to_voc()

Does this make DSOD_train train on only 21 categories? I figured you only have 21 initial weights.

@mvoelk (Owner) commented Jul 12, 2019

I've always used one GPU for training a model, but it should work with multiple GPUs as well. The documentation of Model.fit_generator() explains how to do this.

convert_to_voc in the COCO case returns a new GTUtility with the COCO data, but with the 20 VOC classes (21 including background), which leads to a model with 21 categories.

The weights you mentioned are not trainable parameters... See #14 for more details.
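
Conceptually, the conversion just relabels the COCO annotations with VOC class indices and drops boxes that have no VOC counterpart. A standalone toy sketch of that idea (not the repo's actual convert_to_voc code; the name mapping here is an assumption):

# Toy COCO -> VOC class remapping, for illustration only.
voc_classes = ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
               'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog',
               'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
               'train', 'tvmonitor']   # 20 classes + background = 21

# A few COCO names differ from their VOC counterparts (assumed mapping).
coco_to_voc_name = {'airplane': 'aeroplane', 'motorcycle': 'motorbike',
                    'couch': 'sofa', 'tv': 'tvmonitor',
                    'dining table': 'diningtable', 'potted plant': 'pottedplant'}

def remap(coco_name):
    # Return the VOC class index for a COCO class name, or None if it has no counterpart.
    name = coco_to_voc_name.get(coco_name, coco_name)
    return voc_classes.index(name) if name in voc_classes else None

print(remap('airplane'))   # 1
print(remap('giraffe'))    # None -> such boxes would be dropped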

@mustardlove (Author) commented Jul 15, 2019

Thank you for the reply!

I played with some parameters in fit_generator() (use_multiprocessing=True, workers=2), but still only one GPU was active.

I also tried using multi_gpu_model from keras.utils, but it failed with "_TfDeviceCaptureOp does not have method _set_device_from_string".
I found that the _TfDeviceCaptureOp class in tensorflow/python/keras/backend.py does have _set_device_from_string, but the one in keras/backend/tensorflow_backend.py does not.

If anyone has solved this issue, please share your knowledge.
Thank you!

@mvoelk (Owner) commented Sep 5, 2019

Search for keras.utils.multi_gpu_model; use_multiprocessing=True and workers=2 only refer to data loading.
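
Roughly, with standalone Keras the multi-GPU pattern looks like this (a minimal sketch with a toy model; whether it works depends on matching Keras and TensorFlow versions, which seems to be what the _TfDeviceCaptureOp error above is about):

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Toy model just to show the pattern; in practice this would be the DSOD model.
model = Sequential([Dense(10, activation='relu', input_shape=(4,)), Dense(1)])

# Replicate the model on 2 GPUs; each replica processes half of every batch
# and the results are merged on the CPU.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='mse')

# Then train parallel_model (e.g. parallel_model.fit_generator(...)) instead of model.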

mvoelk mentioned this issue on Feb 5, 2020.