CC: @zewenli98, @narendasan, @laikhtewari

**Overview**

We compare the difference in performance between the PyTorch-decomposed and non-decomposed versions of `torch.ops.aten.linear.default`.

**Not Torch-Decomposed**

```
graph():
%l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
%l_y_ : torch.Tensor [num_users=1] = placeholder[target=l_y_]
%l_z_ : torch.Tensor [num_users=1] = placeholder[target=l_z_]
%linear_default : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%l_x_, %l_y_, %l_z_), kwargs = {})
return linear_default
```

**Torch-Decomposed**

```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%view : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%arg0_1, [128, 32]), kwargs = {})
%permute : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%arg1_1, [1, 0]), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%view, %permute), kwargs = {})
%mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mm, 1), kwargs = {})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%mul, %mul_1), kwargs = {})
%view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%add, [4, 32, 64]), kwargs = {})
return (view_1,)
```
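To make the comparison concrete: the two graphs are numerically equivalent, and the decomposed form can be replayed op by op. A minimal sketch verifying this (shapes taken from the graphs above; the scalar multiplies by 1 come from the `addmm` decomposition with `alpha = beta = 1`):

```python
import torch

x = torch.rand(4, 32, 32)  # input
y = torch.rand(64, 32)     # weight
z = torch.rand(64)         # bias

# Reference: the un-decomposed op
ref = torch.ops.aten.linear.default(x, y, z)

# Replay the Torch-Decomposed graph step by step
view = torch.ops.aten.view.default(x, [128, 32])     # flatten batch dims
permute = torch.ops.aten.permute.default(y, [1, 0])  # transpose weight
mul = torch.ops.aten.mul.Tensor(z, 1)                # bias * beta (beta = 1)
mm = torch.ops.aten.mm.default(view, permute)
mul_1 = torch.ops.aten.mul.Tensor(mm, 1)             # matmul * alpha (alpha = 1)
add = torch.ops.aten.add.Tensor(mul, mul_1)
out = torch.ops.aten.view.default(add, [4, 32, 64])  # restore batch dims

torch.testing.assert_close(out, ref)
```

The extra reshapes and no-op scalar multiplies are the overhead whose cost is being compared here.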
**Methods**

Note: For the performance comparison, we used the PyTorch model below and disabled the `aten.linear` lowering:

```python
import torch
import torch_tensorrt


class Linear(torch.nn.Module):
    def forward(self, x, y, z):
        return torch.ops.aten.linear.default(x, y, z)


opt_model = torch.compile(
    Linear().cuda(),
    backend="torch_tensorrt",
    options={"debug": True, "min_block_size": 1, "optimization_level": 5},
)
inputs = [torch.rand((4, 32, 32)).cuda(), torch.rand((64, 32)).cuda(), torch.rand((64,)).cuda()]
opt_model(*inputs)
```
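For the latency numbers, a timing harness along the following lines can be used; this is a sketch using CUDA-event timing, not necessarily the exact script behind the results below:

```python
import torch


def benchmark(model, inputs, warmup=10, iters=100):
    """Median CUDA latency of model(*inputs) in milliseconds."""
    for _ in range(warmup):
        model(*inputs)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(*inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]


print(f"median latency: {benchmark(opt_model, inputs):.3f} ms")
```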
**Results**

**Recommendation**
**Replies**

- [going to move this to discussions]
- @gs-olive Do you have any data on compilation times? Ideally for level 3 without lowering, level 3 with lowering, and level 5 without lowering.
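As a starting point for that, here is a sketch of how per-configuration compilation time could be measured, reusing the `Linear` module and `inputs` from the Methods section. The exact flag for toggling the linear lowering is not shown in this thread, so only the optimization levels are swept:

```python
import time

import torch
import torch_tensorrt  # noqa: F401  (registers the "torch_tensorrt" backend)

# Linear and inputs as defined in the Methods section above.
# Compile time is approximated as the wall-clock time of the first call,
# which is when torch.compile actually builds the TensorRT engine.
for level in (3, 5):
    torch._dynamo.reset()  # drop cached graphs so each config recompiles
    model = torch.compile(
        Linear().cuda(),
        backend="torch_tensorrt",
        options={"min_block_size": 1, "optimization_level": level},
    )
    start = time.perf_counter()
    model(*inputs)  # first call triggers compilation
    print(f"optimization_level={level}: {time.perf_counter() - start:.2f} s")
```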