
change in ImageNet format after being hosted on kaggle. #3

Open
Spandan-Madan opened this issue Apr 18, 2021 · 1 comment

Spandan-Madan commented Apr 18, 2021

Hi,

I'm very excited about support for large-scale, fast I/O in PyTorch. I am trying to run the example and have downloaded ImageNet. As you might be aware, ImageNet is no longer available for download from http://www.image-net.org/download and is now hosted on Kaggle. I downloaded the dataset from there, but its format has changed from the previous version, and it can no longer be loaded with PyTorch's built-in ImageNet dataset class. This leads to errors when creating shards.

Here's the error I get:

The archive ILSVRC2012_devkit_t12.tar.gz is not present in the root directory or is corrupted. You need to download it externally and place it in ./data

The downloaded dataset has the following structure:

.
├── Annotations
│   └── CLS-LOC
│       ├── train
│       └── val
├── Data
│   └── CLS-LOC
│       ├── test
│       ├── train
│       └── val
└── ImageSets
    └── CLS-LOC
        ├── test.txt
        ├── train_cls.txt
        ├── train_loc.txt
        └── val.txt

Can we come up with a workaround that works out of the box with the current distribution of ImageNet? The original PyTorch ImageNet example works with it because it only needs the image files. I think the error originates from parsing the metadata while making shards, so a workaround should be possible. Happy to help with this.
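To make the mismatch concrete, here is a minimal sketch of a check that tells the two layouts apart. The function name `detect_imagenet_layout` and its return values are hypothetical; the archive name is taken from the error message above, and the `Data/CLS-LOC/train` path from the directory tree above.

```python
from pathlib import Path

# File name copied from the error message; the older layout expects it in the root.
DEVKIT_ARCHIVE = "ILSVRC2012_devkit_t12.tar.gz"


def detect_imagenet_layout(root):
    """Return "devkit", "kaggle", or "unknown" for the dataset directory `root`.

    Hypothetical helper: "devkit" means the devkit archive is present,
    "kaggle" means the Data/CLS-LOC/train folder from the tree above exists.
    """
    root = Path(root)
    if (root / DEVKIT_ARCHIVE).is_file():
        return "devkit"
    if (root / "Data" / "CLS-LOC" / "train").is_dir():
        return "kaggle"
    return "unknown"
```

A shard-writing script could call this first and choose a loader accordingly instead of failing on the missing archive.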

Best,
Spandan

Spandan-Madan changed the title from "ImageNet downloaded from kaggle does not contain the right format" to "change in ImageNet format after being hosted on kaggle." on Apr 18, 2021
@Spandan-Madan
Author

I found a solution: we can fall back on the ImageFolder dataset class that ships with PyTorch (in torchvision). The ImageNet class inherits from it anyway, so the fix is straightforward.
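A minimal sketch of that fallback, assuming `root` is the directory shown in the tree above and that each subfolder of `Data/CLS-LOC/train` holds the images of one class. The helper names `kaggle_train_dir` and `load_train_set` are hypothetical and not part of this repository; `torchvision.datasets.ImageFolder` is the only library call used.

```python
from pathlib import Path


def kaggle_train_dir(root):
    """Path to the training images in the Kaggle layout (from the tree above)."""
    return Path(root) / "Data" / "CLS-LOC" / "train"


def load_train_set(root, transform=None):
    """Load the training split with ImageFolder instead of the ImageNet class.

    ImageFolder needs only the image files, one subfolder per class, so the
    devkit archive is not required.
    """
    # Imported lazily so the path helper above works without torchvision installed.
    from torchvision.datasets import ImageFolder

    return ImageFolder(str(kaggle_train_dir(root)), transform=transform)
```

ImageFolder assigns class indices from the sorted subfolder names, so the class names here are the folder names rather than human-readable labels. This sketch covers only the train split; the val folder may need to be regrouped into per-class subfolders before ImageFolder can read it, which is an assumption to verify against the downloaded data.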

Happy to create a Pull Request with this fix. Let me know!
