
Update and fix gnn model factory and models #2177

Closed
wants to merge 3 commits into from

Conversation

@JasonMts (Contributor) commented on Feb 21, 2024

This PR addresses several issues with the GNN canary models. Currently the models:

  • sage
  • gcn
  • gat

fail at the installation step because they are missing the required data file, sub_reddit.pt.

This PR fetches the data archive Reddit_minimal.tar.gz from S3 for all three models. It also updates the requirements and installation files, for example adding pyg_lib, since running the models without it causes NeighborSampler to emit a deprecation warning.

Lastly, this PR updates the GNN model factory to be consistent with both model.py and _invoke_staged_train_test(), since it is a multi-batch model: the factory now implements forward(), backward(), optimizer_step(), and get_input_iter(). This also brings it in line with other model factories, such as the vision one.
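As a rough sketch of how those four methods fit together (all class and method names here are illustrative, not the actual TorchBench implementation), a staged multi-batch train loop in the style of _invoke_staged_train_test() drives them like this:

```python
# Minimal sketch of a staged multi-batch train loop, in the style of
# _invoke_staged_train_test(). All names here are illustrative stand-ins,
# not the real TorchBench code; the "model" is a toy scalar weight.
class StagedModel:
    def __init__(self, batches):
        self.batches = batches          # pre-loaded mini-batches
        self.example_inputs = None
        self.weight = 1.0               # stand-in for model parameters
        self.grad = 0.0

    def get_input_iter(self):
        # Yield one mini-batch at a time, like the vision factory does.
        for batch in self.batches:
            yield batch

    def forward(self):
        # Toy "loss": weighted sum of the current batch.
        return sum(self.weight * x for x in self.example_inputs)

    def backward(self, loss):
        self.grad = loss                # toy gradient

    def optimizer_step(self, lr=0.01):
        self.weight -= lr * self.grad

    def invoke_staged_train(self, num_batch):
        it = self.get_input_iter()
        for _ in range(num_batch):
            self.example_inputs = next(it)
            loss = self.forward()
            self.backward(loss)
            self.optimizer_step()
        return self.weight

model = StagedModel(batches=[[1.0, 2.0], [3.0, 4.0]])
final_weight = model.invoke_staged_train(num_batch=2)
```

Splitting the train step into these stages is what lets the harness measure (or swap out) the forward, backward, and optimizer phases independently across batches.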

These changes allow the models to be trained with run.py:

python benchmark/run.py sage -d cpu -t train --metrics model_flops,cpu_peak_mem,ttfb
Warning: The model sage cannot be found at core set.
Running train method from sage on cpu in eager mode with input batch size 64 and precision fp32.
3054644320
Module              FLOP    % Total
-------------  ---------  ---------
Global         3054.644M    100.00%
 - aten.addmm   763.661M     25.00%
 - aten.mm     2290.983M     75.00%
CPU Wall Time per batch:   1.654 milliseconds
CPU Wall Time:       201.804 milliseconds
Time to first batch:          321.6082 ms
Model Flops:         0.0151 TFLOPs per second
CPU Peak Memory:                0.3770 GB
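As a sanity check on the sage numbers above, the reported Model Flops figure is just the total FLOP count divided by the total CPU wall time:

```python
# Reproduce the reported "Model Flops" line from the sage run above:
# total FLOPs divided by total CPU wall time, expressed in TFLOPs/s.
total_flops = 3_054_644_320          # "Global 3054.644M" from the FLOP table
wall_time_s = 201.804e-3             # "CPU Wall Time: 201.804 milliseconds"
tflops_per_s = total_flops / wall_time_s / 1e12
print(f"{tflops_per_s:.4f} TFLOPs per second")  # → 0.0151 TFLOPs per second
```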
python benchmark/run.py gat -d cpu -t train --metrics cpu_peak_mem,ttfb
Warning: The model gat cannot be found at core set.
Running train method from gat on cpu in eager mode with input batch size 64 and precision fp32.
CPU Wall Time per batch:   2.721 milliseconds
CPU Wall Time:       331.996 milliseconds
Time to first batch:          174.6178 ms
CPU Peak Memory:                0.3594 GB
python benchmark/run.py gcn -d cpu -t train --metrics cpu_peak_mem,ttfb
Warning: The model gcn cannot be found at core set.
Running train method from gcn on cpu in eager mode with input batch size 64 and precision fp32.
CPU Wall Time per batch:   1.795 milliseconds
CPU Wall Time:       219.015 milliseconds
Time to first batch:          220.0093 ms
CPU Peak Memory:                0.3350 GB

NOTE: gat and gcn cannot collect the model_flops metric because of a bug when running these models under the FlopCounterMode context manager (here).

NOTE 2: eval is not supported yet, as there is no _invoke_staged_eval_test() function; implementing one would be a good follow-up for completeness.

@JasonMts JasonMts marked this pull request as ready for review February 21, 2024 15:07
@xuzhao9 (Contributor) commented on Feb 23, 2024

For eval we don't need a staged eval test, because the inference test does not include a backward pass.

Also, curious what the error message is when running with FlopCounterMode.

@JasonMts (Contributor, Author)

For eval we don't need a staged eval test, because the inference test does not include a backward pass.

Yes, agreed. I was just thinking of a multi-batch evaluation to have some results, but indeed it doesn't make much sense.

Also, curious what the error message is when running with FlopCounterMode.

It was a runtime error:
RuntimeError: Creating a new Tensor subclass EdgeIndex but the raw Tensor object is already associated to a python object of type Tensor
which occurred during the forward pass while running some torch_geometric functions.

Full output for gat below; it's very similar for gcn:

Traceback (most recent call last):
  File "/benchmark/run.py", line 623, in <module>
    main()  # pragma: no cover
  File "/benchmark/run.py", line 593, in main
    run_one_step(
  File "/benchmark/run.py", line 261, in run_one_step
    model_flops = get_model_flops(model)
  File "/benchmark/torchbenchmark/util/experiment/metrics.py", line 107, in get_model_flops
    work_func()
  File "/benchmark/torchbenchmark/util/experiment/metrics.py", line 105, in work_func
    model.invoke()
  File "/benchmark/torchbenchmark/util/model.py", line 310, in invoke
    return self._invoke_staged_train_test(num_batch=self.num_batch)
  File "/benchmark/torchbenchmark/util/model.py", line 300, in _invoke_staged_train_test
    losses = self.forward()
  File "/benchmark/torchbenchmark/util/framework/gnn/model_factory.py", line 115, in forward
    pred = self.model(**self.example_inputs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch_geometric/nn/models/basic_gnn.py", line 254, in forward
    x = conv(x, edge_index, edge_attr=edge_attr)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch_geometric/nn/conv/gat_conv.py", line 322, in forward
    edge_index, edge_attr = remove_self_loops(
  File "/lib/python3.10/site-packages/torch_geometric/utils/loop.py", line 113, in remove_self_loops
    edge_index = edge_index[:, mask]
  File "/lib/python3.10/site-packages/torch_geometric/edge_index.py", line 1057, in __torch_function__
    return HANDLED_FUNCTIONS[func](*args, **(kwargs or {}))
  File "/lib/python3.10/site-packages/torch_geometric/edge_index.py", line 1353, in getitem
    out = out.as_subclass(EdgeIndex)
RuntimeError: Creating a new Tensor subclass EdgeIndex but the raw Tensor object is already associated to a python object of type Tensor

@xuzhao9 (Contributor) commented on Feb 26, 2024

For eval we don't need a staged eval test, because the inference test does not include a backward pass.

Yes, agreed. I was just thinking of a multi-batch evaluation to have some results, but indeed it doesn't make much sense.

The --num-batch option can be used to run a model with multiple batches. By default, we run --num-batch 1.
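For example (command sketched from the runs above; flags assumed unchanged), a multi-batch run would look like:

```shell
# Run the sage canary model for 8 batches instead of the default --num-batch 1.
python benchmark/run.py sage -d cpu -t train --num-batch 8 --metrics cpu_peak_mem,ttfb
```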

@facebook-github-bot (Contributor):
@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):
@xuzhao9 merged this pull request in 7f76813.
