[Question / Feature Request] PyTorch dataset abstraction #1627

HalkScout opened this issue Oct 3, 2024 · 0 comments
All of the documentation I have sifted through essentially describes re-saving the data whenever the format changes, but is there a way to use this library without that step? A good use case is when you have a very large amount of data: you read the data in a supported format, some conversion happens on the fly, and then you pass it into your pipeline. It adds to the cost of data loading, but that can be worth it if it saves terabytes of disk space.

I am loading a dataset like this:

from datumaro.components.dataset import Dataset
dataset = Dataset.import_from("./data", "yolo")
print(dataset)
Dataset
	...
subsets
	test: # of items=...
	train: # of items=...
	val: # of items=...
infos
	...

I would then want to implement a PyTorch Lightning data module along these lines:

import lightning as L
from torch.utils.data import DataLoader

from datumaro.components.dataset import Dataset

class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage: str):
        # Being able to load only specific subsets would be nice here too,
        # but that sounds like a large undertaking
        dataset = Dataset.import_from("./data", "yolo")
        if stage == "fit":
            self.dataset_train = dataset.get_subset("train")
            self.dataset_val = dataset.get_subset("val")
        if stage == "test":
            self.dataset_test = dataset.get_subset("test")

    def train_dataloader(self):
        return DataLoader(self.dataset_train, batch_size=self.batch_size)

    # and so on for "test" and "val"
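
For reference, this is roughly the wrapper I imagine putting around a subset so that the DataLoader calls above would work, assuming a subset can at least be iterated (the DatumaroSubsetWrapper class and the tensor conversion are hypothetical, not existing Datumaro API):

import torch
from torch.utils.data import Dataset as TorchDataset

class DatumaroSubsetWrapper(TorchDataset):
    # Hypothetical map-style wrapper: materialize the items once so that
    # integer indexing works for a torch DataLoader.
    def __init__(self, subset):
        self.items = list(subset)  # assumes the subset is iterable

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image = torch.as_tensor(item.media.data)  # assumption: numpy image data
        return image, item.annotations            # annotation-to-tensor conversion omitted

Ideally setup() would then boil down to something like self.dataset_train = DatumaroSubsetWrapper(dataset.get_subset("train")).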

Is this possible? Neither of the approaches I have tried works:

train = dataset.get_subset("train")
print(train.__getitem__(0))
AttributeError: 'DatasetSubset' object has no attribute '__getitem__'

Or even an attempt to make a wrapper:

train = dataset.get_subset("train")
print(train.get(0))
---> [96]     assert (subset or DEFAULT_SUBSET_NAME) == (self.name or DEFAULT_SUBSET_NAME)
AssertionError: 

The problem I am running into is that the subsets cannot be separated from the main dataset and are not treated as datasets in their own right. Could I be doing anything differently?
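
The closest thing I have spotted so far is DatasetSubset.as_dataset(), which looks like it turns a subset back into a standalone dataset, but this is an untested assumption on my part and I do not know if it is the intended usage:

train = dataset.get_subset("train").as_dataset()  # untested assumption on my part
for item in train:
    print(item.id)
    break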

This is the main thing stopping me from using this really useful library in my pipeline: I can see a lot of potential, but it does not offer the data-loading features I am looking for (which might be by design). If anyone knows of a good method or tool for this, I would love to hear about it! Thank you 😄
