Training usually doesn't start #44
Comments
hm, that's odd, can you remove tee and check the output?
What do you mean by tee?
If what you mean is to change this line in train_cifar.sh:
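For context, removing tee means dropping the pipe at the end of the training command so the trainer writes directly to the terminal. A hypothetical before/after sketch (the exact line in train_cifar.sh may differ, and $save is just an illustrative variable name):

# before (hypothetical): output is piped through tee into a log file
#   th main.lua "$@" | tee "$save/log.txt"
# after, with tee removed, so output and any error messages go straight to the terminal:
th main.lua "$@"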
hm, I'd assume that would be threads then, but these issues should have been fixed years ago. can you update threads and torchnet?
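For anyone following along: assuming a standard Torch distro install where luarocks points at Torch's tree, updating those two packages would typically look like this:

# reinstall the latest versions of the threads and torchnet rocks
luarocks install threads
luarocks install torchnet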
I updated threads and torchnet, but I'm still getting the issue.
@soumith maybe you've seen issues like that with the latest lua torch?
lua-torch hasn't updated its packages since July 2017 (https://github.com/torch/distro/commits/master), so I'm not sure what changed.
I'm running this command:
model=wide-resnet widen_factor=4 depth=40 dropout=0.3 ./scripts/debug_cifar.sh
Most of the time (80%+), the program will reach the point where it prints this:
Network has 40 convolutions
Will save at logs/wide-resnet_1639021580
tput: No value for $TERM and no -T specified
...then it will do nothing. The other 20% of the time, it will begin training and printing out each epoch and its progress.
After a bit of debugging, I found that the stall occurs at engine:train in train.lua.
How can I fix this?
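One way to narrow down a hang at engine:train is to check whether torchnet's threaded data loading stalls on its own, since the engine blocks on the dataset iterator. The snippet below is not from this repository; it is a standalone isolation test following the pattern in the torchnet README, with an illustrative thread count and dummy CIFAR-shaped data:

-- standalone check: does a ParallelDatasetIterator hang on this machine,
-- independently of the wide-resnet training code?
local tnt = require 'torchnet'

local iterator = tnt.ParallelDatasetIterator{
   nthread = 4,                                  -- worker thread count; adjust to match the training script
   init    = function() require 'torchnet' end,  -- runs once in each worker thread
   closure = function()
      -- tiny dummy dataset; shapes mimic CIFAR but the data is random
      return tnt.ListDataset{
         list = torch.range(1, 16):long(),
         load = function(idx)
            return { input = torch.randn(3, 32, 32), target = torch.random(10) }
         end,
      }
   end,
}

-- if this loop never prints anything, the hang is in the threads/torchnet layer,
-- not in the model definition or the training loop itself
local n = 0
for sample in iterator() do
   n = n + 1
   print(('got sample %d'):format(n))
end
print('iterator finished without stalling')

If this test also stalls, the problem is in the threads/torchnet installation rather than in this repository's code.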