A question about the calculation of attention weights #1007

MN-Guan · 2022-04-10T07:50:04Z

Thank you in advance for reading this question. In the original Transformer paper, the method for computing attention weights is called 'Scaled Dot-Product Attention'. Why can't I find the 'scaled' operation, i.e. division by the square root of d_k, in the source code of the Transformers library or in this [code](https://github.com/google-research/t5x/blob/main/t5x/examples/t5/layers.py)? When I read the T5 paper, I did not see any statement about this change. Was this calculation removed deliberately? I look forward to your reply. Thanks!
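
For reference, the operation in question, softmax(QK^T / sqrt(d_k)), can be sketched as follows. This is a minimal pure-Python illustration and not code from t5x or the Transformers library; the function names (`softmax`, `attention_weights`), the `scale` flag, and the toy values are all made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, scale=True):
    """Row-wise softmax(Q K^T / sqrt(d_k)); the division is skipped if scale=False."""
    d_k = len(keys[0])
    factor = 1.0 / math.sqrt(d_k) if scale else 1.0
    weights = []
    for q in queries:
        logits = [factor * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(logits))
    return weights


# Toy inputs (hypothetical values): 2 queries, 3 keys, d_k = 4.
Q = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0]]
K = [[1.0, 1.0, 0.0, 0.0], [0.0, 2.0, 1.0, 1.0], [-1.0, 0.5, 0.5, 2.0]]

scaled = attention_weights(Q, K, scale=True)     # with the 1/sqrt(d_k) factor
unscaled = attention_weights(Q, K, scale=False)  # plain dot-product attention

# Dividing the queries by sqrt(d_k) beforehand and then skipping the
# explicit division produces the same weights as the scaled version.
Q_prescaled = [[x / math.sqrt(len(K[0])) for x in q] for q in Q]
folded = attention_weights(Q_prescaled, K, scale=False)
```

The last lines show only that the factor can, in principle, be applied to the queries before the dot product rather than inside the attention function. Whether T5 does something along these lines is exactly what the question asks, and the sketch does not answer it.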
