Has anybody trained a FAST detection model with pytorch on Windows? #1712

chmaz · 2024-08-31T08:42:07Z

Hello,

During my first attempt with a very small training dataset on Windows 11, to check that I had the training format right, I got the following error stack trace. I looked for a workaround in the repo and on the internet but did not find anything.

Thanks in advance for any insight you may have,

Best,

Chris

Trace:

(base) C:\deepLearning\doctr\doctr-main>python references/detection/train_pytorch.py E:/training E:/validation fast_base --name fast_base1 --device 0 --epochs 20 --batch_size 8 --lr 0.001 --amp --early-stop --early-stop-epochs 5 --early-stop-delta 0.01
Namespace(train_path='E:/training', val_path='E:/validation', arch='fast_base', name='fast_base1', epochs=20, batch_size=8, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=True, find_lr=False, early_stop=True, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.001002s (1 samples in 1 batches)
Train set loaded in 0.008007s (15 samples in 1 batches)
Traceback (most recent call last): | 0/1 [00:00<?, ?it/s]
File "C:\deepLearning\doctr\doctr-main\references\detection\train_pytorch.py", line 481, in <module>
main(args)
File "C:\deepLearning\doctr\doctr-main\references\detection\train_pytorch.py", line 388, in main
fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
File "C:\deepLearning\doctr\doctr-main\references\detection\train_pytorch.py", line 108, in fit_one_epoch
pbar = tqdm(train_loader, position=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\site-packages\tqdm\asyncio.py", line 33, in __init__
self.iterable_iterator = iter(iterable)
^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
return self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
w.start()
File "C:\Users\chris\anaconda3\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\chris\anaconda3\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.<lambda>'
0%| | 0/1 [00:00<?, ?it/s]

(base) C:\deepLearning\doctr\doctr-main>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\chris\anaconda3\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\chris\anaconda3\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
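
The last frames of the first traceback are in multiprocessing's pickling code, which suggests the failure is about sending an object to a DataLoader worker rather than about the progress bar itself. On Windows, worker processes are started with the spawn method, so everything handed to them must be picklable, and a lambda defined inside a function is not. The second traceback (EOFError: Ran out of input) is consistent with the child process reading from a pipe that the parent stopped writing to. The sketch below reproduces the pickling behavior with the standard library only; scale_by_two and build_transforms are hypothetical names, and which object in the training script triggers the error is not identified in this thread.

```python
import pickle


def scale_by_two(x):
    """Module-level function: pickled by reference to its importable name."""
    return 2 * x


def build_transforms():
    """Mimics a main() that builds transforms, one of them a local lambda."""
    # Qualified name of this lambda: build_transforms.<locals>.<lambda>
    local_lambda = lambda x: 2 * x
    return {"module_level": scale_by_two, "local_lambda": local_lambda}


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, as the spawn method requires."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # The exception type for local objects differs between Python versions.
        return False


results = {name: can_pickle(fn) for name, fn in build_transforms().items()}
print(results)
```

Both callables compute the same thing; only the place where they are defined decides whether they can be sent to a spawned worker.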

felixdittrich92 · 2024-09-02T06:06:46Z

Maintainer

Hi @chmaz 👋,

Looks like a problem with tqdm on your machine.
Try changing the import from "from tqdm.auto import tqdm" to "from tqdm import tqdm",
or remove the position arg here:

doctr/references/detection/train_pytorch.py

Line 108 in 9045dcf

pbar = tqdm(train_loader, position=1)

Best regards,
Felix

2 replies

chmaz Sep 2, 2024
Author

Hi @felixdittrich92,

Thanks!
Unfortunately, I tried both options and still got the same error. I completely reinstalled PyTorch (with Anaconda) and docTR (including tqdm), and the behavior did not change. Which version of tqdm are you using? Maybe I could force it. Mine is:

(base) C:\deepLearning\doctr\doctr-main>pip show tqdm
Name: tqdm
Version: 4.66.5
Summary: Fast, Extensible Progress Meter
Home-page:
Author:
Author-email:
License: MPL-2.0 AND MIT
Location: C:\Users\chris\anaconda3\Lib\site-packages
Requires: colorama
Required-by: anaconda-client, anaconda-project, conda, conda-build, huggingface-hub, nltk, panel, peft, python-doctr, transformers

Thanks a lot,

Chris

felixT2K Sep 2, 2024

Have you tried installing tqdm with conda instead of pip?
If not, uninstall the pip-installed one and install it with conda install tqdm.

I don't think it's a version issue. The issue seems to come from Python's multiprocessing on your machine and is raised in tqdm.

The last option would be to remove the tqdm calls from the script (they are only for the progress bar).

chmaz · 2024-09-03T15:11:47Z

Author

@felixT2K
Thanks a lot! I have not yet found how to install tqdm with conda. However, I also ran a quick experiment replacing tqdm with simple prints, and I got the same error, raised by the PyTorch DataLoader and multiprocessing later in the training process. If I find a solution, I will share it, but for the moment I will try on a Linux machine.

1 reply

felixdittrich92 Sep 3, 2024
Maintainer

You could pass --workers 1 but this will slow down the training a lot :)
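
Any worker count above zero still starts a separate process, so a single worker would be expected to go through the same pickling step; in PyTorch's DataLoader, only num_workers=0 loads data in the main process. Whether the training script forwards --workers 0 unchanged is an assumption worth checking. A more durable option, sketched below, is to replace a function-local lambda with an object that pickles by reference. Scale and build_transforms are hypothetical names for illustration and are not part of docTR.

```python
import functools
import operator
import pickle


class Scale:
    """Hypothetical picklable stand-in for `lambda x: factor * x`."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return self.factor * x


def build_transforms():
    """Builds transforms inside a function, as a script's main() might."""
    return [
        Scale(2),                            # instance of a module-level class
        functools.partial(operator.mul, 2),  # partial over an importable function
    ]


# Round-trip through pickle, which is what a spawned worker would receive.
restored = [pickle.loads(pickle.dumps(t)) for t in build_transforms()]
outputs = [t(21) for t in restored]
print(outputs)  # [42, 42]
```

Either form keeps the transform's behavior while letting the spawn start method copy it into a worker process.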

chmaz · 2024-09-04T12:18:20Z

Author

@felixdittrich92 Unfortunately, I tried it but still get the error on Windows, even with the switch --workers 1 :(. As it works fine on Linux, I will switch to that environment for my text detection training experiments and hope that inference will work smoothly on Windows (or that a fix will have been found in the meantime).

I have two follow-up questions:
(1) In terms of training set size, to train a good text detector from scratch with fast_base, would you have a ballpark recommendation for how many representative A4 text pages I should prepare?
(2) When training the text detector, what validation precision and recall percentages are considered good? What kind of percentages did you get in your trainings?

Thanks a lot for the great feedback,

Best,

Chris

1 reply

felixdittrich92 Sep 6, 2024
Maintainer

Hi @chmaz 👋,

It really depends on how complex the documents are and how much they differ. For simple documents that all look relatively similar, you can train a solid model from ~400 samples. The pre-trained models were trained on ~250K real samples, which cover a wide range of different documents.
If I remember correctly, the last runs reached roughly 80% recall, 82% precision, and 70% mIoU (a high mIoU is really hard to reach, so the focus should be on recall and precision).

Best,
Felix 🤗
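
To make the three metrics concrete, here is a simplified sketch of how recall, precision, and mean IoU can be computed for axis-aligned boxes. It assumes greedy one-to-one matching at an IoU threshold of 0.5 and a mean IoU averaged over predictions; these choices and the function names are assumptions for illustration, and docTR's own metric implementation may match boxes differently.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_scores(gts, preds, iou_thresh=0.5):
    """Greedy one-to-one matching; returns (recall, precision, mean_iou)."""
    unmatched = list(range(len(gts)))
    matches, iou_sum = 0, 0.0
    for pred in preds:
        # Best overlap with any ground truth feeds the mean IoU.
        iou_sum += max((iou(gt, pred) for gt in gts), default=0.0)
        # A prediction counts as a match only against a still-unmatched box.
        candidates = [(iou(gts[i], pred), i) for i in unmatched]
        if candidates:
            score, idx = max(candidates)
            if score >= iou_thresh:
                matches += 1
                unmatched.remove(idx)
    recall = matches / len(gts) if gts else None
    precision = matches / len(preds) if preds else None
    mean_iou = iou_sum / len(preds) if preds else None
    return recall, precision, mean_iou


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (21, 21, 30, 30), (50, 50, 60, 60)]
recall, precision, mean_iou = detection_scores(gts, preds)
print(recall, precision, mean_iou)
```

In this toy case both ground-truth boxes are found, so recall is 1.0, while the spurious third prediction lowers precision to 2/3 and pulls the mean IoU down, which illustrates why mIoU tends to trail the other two numbers.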
