
Update and fix gnn model factory and models #2177

Closed
wants to merge 3 commits into from

Conversation

@JasonMts (Contributor) commented on Feb 21, 2024

This PR addresses several issues with the GNN canary models. Currently the models:

  • sage
  • gcn
  • gat

fail at the installation step because they are missing the required data file, sub_reddit.pt.

This PR fetches the data archive Reddit_minimal.tar.gz from S3 for all three models. It also updates the requirements and installation files, for example adding pyg_lib, since running the models without it causes NeighborSampler to emit a deprecation warning.

Lastly, this PR updates the GNN model factory to be consistent with both model.py and _invoke_staged_train_test(), since it is a multi-batch model: the factory now implements forward(), backward(), optimizer_step(), and get_input_iter(). This also brings it in line with other model factories, such as the vision one.
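As a rough sketch of how those four methods fit together (all class and method names here are illustrative, not the actual TorchBench implementation), a staged multi-batch train loop in the style of _invoke_staged_train_test() drives them like this:

```python
# Minimal sketch of a staged multi-batch train loop, in the style of
# _invoke_staged_train_test(). All names here are illustrative stand-ins,
# not the real TorchBench code; the "model" is a toy scalar weight.
class StagedModel:
    def __init__(self, batches):
        self.batches = batches          # pre-loaded mini-batches
        self.example_inputs = None
        self.weight = 1.0               # stand-in for model parameters
        self.grad = 0.0

    def get_input_iter(self):
        # Yield one mini-batch at a time, like the vision factory does.
        for batch in self.batches:
            yield batch

    def forward(self):
        # Toy "loss": weighted sum of the current batch.
        return sum(self.weight * x for x in self.example_inputs)

    def backward(self, loss):
        self.grad = loss                # toy gradient

    def optimizer_step(self, lr=0.01):
        self.weight -= lr * self.grad

    def invoke_staged_train(self, num_batch):
        it = self.get_input_iter()
        for _ in range(num_batch):
            self.example_inputs = next(it)
            loss = self.forward()
            self.backward(loss)
            self.optimizer_step()
        return self.weight

model = StagedModel(batches=[[1.0, 2.0], [3.0, 4.0]])
final_weight = model.invoke_staged_train(num_batch=2)
```

Splitting the train step into these stages is what lets the harness measure (or swap out) the forward, backward, and optimizer phases independently across batches.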

These changes allow the models to be trained with run.py:

python benchmark/run.py sage -d cpu -t train --metrics model_flops,cpu_peak_mem,ttfb
Warning: The model sage cannot be found at core set.
Running train method from sage on cpu in eager mode with input batch size 64 and precision fp32.
3054644320
Module              FLOP    % Total
-------------  ---------  ---------
Global         3054.644M    100.00%
 - aten.addmm   763.661M     25.00%
 - aten.mm     2290.983M     75.00%
CPU Wall Time per batch:   1.654 milliseconds
CPU Wall Time:       201.804 milliseconds
Time to first batch:          321.6082 ms
Model Flops:         0.0151 TFLOPs per second
CPU Peak Memory:                0.3770 GB
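As a sanity check on the sage numbers above, the reported Model Flops figure is just the total FLOP count divided by the total CPU wall time:

```python
# Reproduce the reported "Model Flops" line from the sage run above:
# total FLOPs divided by total CPU wall time, expressed in TFLOPs/s.
total_flops = 3_054_644_320          # "Global 3054.644M" from the FLOP table
wall_time_s = 201.804e-3             # "CPU Wall Time: 201.804 milliseconds"
tflops_per_s = total_flops / wall_time_s / 1e12
print(f"{tflops_per_s:.4f} TFLOPs per second")  # → 0.0151 TFLOPs per second
```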
python benchmark/run.py gat -d cpu -t train --metrics cpu_peak_mem,ttfb
Warning: The model gat cannot be found at core set.
Running train method from gat on cpu in eager mode with input batch size 64 and precision fp32.
CPU Wall Time per batch:   2.721 milliseconds
CPU Wall Time:       331.996 milliseconds
Time to first batch:          174.6178 ms
CPU Peak Memory:                0.3594 GB
python benchmark/run.py gcn -d cpu -t train --metrics cpu_peak_mem,ttfb
Warning: The model gcn cannot be found at core set.
Running train method from gcn on cpu in eager mode with input batch size 64 and precision fp32.
CPU Wall Time per batch:   1.795 milliseconds
CPU Wall Time:       219.015 milliseconds
Time to first batch:          220.0093 ms
CPU Peak Memory:                0.3350 GB

NOTE: gat and gcn cannot collect the model_flops metric because of a bug when running these models under the FlopCounterMode context manager (here).

NOTE 2: eval is not supported yet, as there is no _invoke_staged_eval_test() function; implementing one would be a good follow-up for completeness.

@JasonMts JasonMts marked this pull request as ready for review February 21, 2024 15:07
@xuzhao9 (Contributor) commented on Feb 23, 2024

For eval we don't need a staged eval test, because the inference test does not include a backward pass.

Also, curious what the error message is when running with FlopCounterMode.

@JasonMts (Contributor, Author)

For eval we don't need a staged eval test, because the inference test does not include a backward pass.

Yes, agreed. I was just thinking of a multi-batch evaluation to have some results, but indeed it doesn't make much sense.

Also, curious what the error message is when running with FlopCounterMode.

It was a runtime error:
RuntimeError: Creating a new Tensor subclass EdgeIndex but the raw Tensor object is already associated to a python object of type Tensor
which occurred during the forward pass while running some torch_geometric functions.

Full output for gat below; it's very similar for gcn:

Traceback (most recent call last):
  File "/benchmark/run.py", line 623, in <module>
    main()  # pragma: no cover
  File "/benchmark/run.py", line 593, in main
    run_one_step(
  File "/benchmark/run.py", line 261, in run_one_step
    model_flops = get_model_flops(model)
  File "/benchmark/torchbenchmark/util/experiment/metrics.py", line 107, in get_model_flops
    work_func()
  File "/benchmark/torchbenchmark/util/experiment/metrics.py", line 105, in work_func
    model.invoke()
  File "/benchmark/torchbenchmark/util/model.py", line 310, in invoke
    return self._invoke_staged_train_test(num_batch=self.num_batch)
  File "/benchmark/torchbenchmark/util/model.py", line 300, in _invoke_staged_train_test
    losses = self.forward()
  File "/benchmark/torchbenchmark/util/framework/gnn/model_factory.py", line 115, in forward
    pred = self.model(**self.example_inputs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch_geometric/nn/models/basic_gnn.py", line 254, in forward
    x = conv(x, edge_index, edge_attr=edge_attr)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch_geometric/nn/conv/gat_conv.py", line 322, in forward
    edge_index, edge_attr = remove_self_loops(
  File "/lib/python3.10/site-packages/torch_geometric/utils/loop.py", line 113, in remove_self_loops
    edge_index = edge_index[:, mask]
  File "/lib/python3.10/site-packages/torch_geometric/edge_index.py", line 1057, in __torch_function__
    return HANDLED_FUNCTIONS[func](*args, **(kwargs or {}))
  File "/lib/python3.10/site-packages/torch_geometric/edge_index.py", line 1353, in getitem
    out = out.as_subclass(EdgeIndex)
RuntimeError: Creating a new Tensor subclass EdgeIndex but the raw Tensor object is already associated to a python object of type Tensor

@xuzhao9 (Contributor) commented on Feb 26, 2024

For eval we don't need a staged eval test, because the inference test does not include a backward pass.

Yes, agreed. I was just thinking of a multi-batch evaluation to have some results, but indeed it doesn't make much sense.

The --num-batch option can be used to run a model with multiple batches. By default, we run --num-batch 1.
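For example (command sketched from the runs above; flags assumed unchanged), a multi-batch run would look like:

```shell
# Run the sage canary model for 8 batches instead of the default --num-batch 1.
python benchmark/run.py sage -d cpu -t train --num-batch 8 --metrics cpu_peak_mem,ttfb
```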

@facebook-github-bot (Contributor):
@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):
@xuzhao9 merged this pull request in 7f76813.
