Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PublicBenchmarkDataset & SecretDataset #747

Closed
wants to merge 34 commits into from

Conversation

RasmusOrsoe
Copy link
Collaborator

@RasmusOrsoe RasmusOrsoe commented Sep 13, 2024

This PR adds extensions of ERDAHostedDataset that allows us to build and share public benchmarking datasets, and secret ones! It also introduces functionality to ParquetDataset that removes chunk ids from selection that doesn't exist.

Below is an example of the syntax of SecretDataset - a way for us to share datasets using ERDA sharelinks:

from graphnet.data import SecretDataset

dm = SecretDataset(secret= "secret-erda-sharelink",
                   graph_definition= ... ,
                    download_dir="/home/cool-datasets/",
                    backend = 'parquet',
                    mode = 'train')

training_dataloader = dm.train_dataloader
validation_dataloader = dm.val_dataloader
test_dataloader = dm.test_dataloader

The idea here is that we can distribute datasets "secretly" to colleagues, and once the data is ready to be made public, the data can be made available through the PublicBenchmarkDataset by subclassing, providing a similar syntax:

from graphnet.datasets import ABenchmarkDataset 

dm = ABenchmarkDataset(
                    graph_definition= ... ,
                    download_dir="/home/cool-datasets/",
                    backend = 'parquet',
                    mode = 'train')

training_dataloader = dm.train_dataloader
validation_dataloader = dm.val_dataloader
test_dataloader = dm.test_dataloader

Copy link
Collaborator

@Aske-Rosted Aske-Rosted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. Also seems like some of the unit tests are failing.

isinstance(value, (int, float))
for value in values_to_merge
):
# alculate the mean for all attributes except charge
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calculate*

assert pulses[0] == input_features.shape[0]
# Merging window (2 ns) is large enough to merge two of the pulses
assert pulses[1] == (input_features.shape[0] - 1)
# Merging window (2 ns) is large enough to merge four of the pulses
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be 8 ns? Also maybe the test should check that grouping more than 2 of the pulses in one is handled correctly.

graph_definition: GraphDefinition,
download_dir: str,
backend: str = "parquet",
mode: str = "train",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not quite understand why the naming of the different modes are "train/test/test-no-noise".

os.path.join(
self.dataset_dir,
"selections",
"test_noise_selection.parquet",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this one "test_noise" instead of calling the no noise mode "test-no-noise" after having specified "no-noise" throughout the code is a bit confusing

@RasmusOrsoe
Copy link
Collaborator Author

@Aske-Rosted thanks for taking a look. Looks like I by mistake managed to merge another branch into this one, causing the checks to fail. I think your comments on the toggles between "test", "train" and "no-noise" is fair - and is granted quite specific to what I intend to use it for. I'll close this PR and make a new one in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants