All of the documentation I have sifted through essentially references re-saving data when the format changes, but is there a way to use this library without that? A good use case for this is a very large amount of data: if the data is in a supported format, it can be read as-is, converted on the fly, and passed into your pipeline. That would add to the cost of data loading, but it can be worth it if it saves terabytes of disk space.

I am loading a dataset like this:
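The shape of what I'm after, in library-agnostic terms (a minimal sketch; `lazy_convert` is a made-up helper, not a Datumaro API, and the conversion shown is just a stand-in):

```python
def lazy_convert(samples, convert):
    # Yield converted samples one at a time; nothing is re-saved to disk.
    for sample in samples:
        yield convert(sample)


# Hypothetical conversion: normalize a label field while loading.
raw = [{"label": "Cat"}, {"label": "Dog"}]
converted = list(lazy_convert(raw, lambda s: {**s, "label": s["label"].lower()}))
# converted is [{"label": "cat"}, {"label": "dog"}]
```

The point is that the conversion happens per-sample at load time, so no converted copy of the dataset ever hits disk.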
```
Dataset
    ...
    subsets
        test: # of items=...
        train: # of items=...
        val: # of items=...
    infos
    ...
```
Where I would want to implement a PyTorch Lightning data module like:
```python
import lightning as L
from torch.utils.data import DataLoader

from datumaro.components.dataset import Dataset


class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage: str):
        # Being able to load only specific subsets would be nice here too,
        # but that sounds like a large undertaking
        dataset = Dataset.import_from("./data", "yolo")
        if stage == "fit":
            self.dataset_train = dataset.get_subset("train")
            self.dataset_val = dataset.get_subset("val")
        if stage == "test":
            self.dataset_test = dataset.get_subset("test")

    def train_dataloader(self):
        return DataLoader(self.dataset_train, batch_size=self.batch_size)

    # and so on for "test" and "val"
```
Is this possible? Neither of the solutions I have tried (including an attempt to make a wrapper) works:
```
---> [96] assert (subset or DEFAULT_SUBSET_NAME) == (self.name or DEFAULT_SUBSET_NAME)

AssertionError:
```
The problem I am running into is that the subsets cannot be separated from the main dataset and are not treated as datasets in their own right. Could I be doing anything differently?
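For reference, the kind of wrapper I was attempting looks roughly like this (a sketch only; `SubsetWrapper` is a hypothetical name, and the only assumption about the Datumaro subset is that it is iterable). Since `torch.utils.data.DataLoader` accepts any map-style object implementing `__len__` and `__getitem__`, the wrapper itself doesn't need to import torch:

```python
class SubsetWrapper:
    """Map-style wrapper so a dataset subset can be handed to a DataLoader."""

    def __init__(self, subset, transform=None):
        # Materialize the item handles once; media could still load lazily.
        self.items = list(subset)
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        return self.transform(item) if self.transform else item


# Stand-in for a subset (any iterable of items works):
wrapped = SubsetWrapper(["a", "b", "c"])
# len(wrapped) == 3, wrapped[1] == "b"
```

A `transform` hook is included because, in practice, each Datumaro item would need converting into tensors before batching.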
This is the main thing stopping me from using this really useful library in my pipeline; I can see its potential, but it doesn't offer the data-loading features I am looking for (which might be by design). If anyone knows of a good method or tool for this, I would love to hear it! Thank you 😄