Question about causal attention in decoder #31

Open
Porthoos opened this issue Mar 19, 2024 · 1 comment

@Porthoos

Hello, I'm interested in your work, and I have a question that isn't explained in the paper.
I noticed that the causal attention in the decoder uses a different structure from a normal transformer (sketched in the code below):

  1. MAT's causal attention uses the encoder output as the 'Query' and the decoder self-attention output as the 'Key' and 'Value', whereas a normal transformer's cross-attention uses the decoder self-attention output as the 'Query' and the encoder output as the 'Key' and 'Value'.
  2. MAT's residual connection after the causal attention adds the encoder output, whereas a normal transformer's residual connection adds the decoder self-attention output.

I have circled this in the figure. Is there a reason for changing the structure like this?
[screenshot: decoder attention block circled in the architecture figure]
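For concreteness, here is a minimal PyTorch-style sketch of the two wirings as I understand them (the module and variable names are placeholders of mine, not taken from the MAT code):

```python
import torch.nn as nn

class SecondDecoderAttention(nn.Module):
    """Sketch of the decoder's second attention block, in both wirings."""

    def __init__(self, n_embd, n_head, mat_style=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln = nn.LayerNorm(n_embd)
        self.mat_style = mat_style

    def forward(self, act_rep, obs_rep, attn_mask=None):
        # act_rep: output of the decoder's masked self-attention
        # obs_rep: output of the encoder
        if self.mat_style:
            # MAT-style: encoder output is the query, decoder output is
            # key/value, and the residual adds the encoder output.
            out, _ = self.attn(query=obs_rep, key=act_rep, value=act_rep,
                               attn_mask=attn_mask)
            return self.ln(obs_rep + out)
        # Standard transformer: decoder output is the query, encoder output
        # is key/value, and the residual adds the decoder output.
        out, _ = self.attn(query=act_rep, key=obs_rep, value=obs_rep,
                           attn_mask=attn_mask)
        return self.ln(act_rep + out)
```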

@morning9393
Collaborator

Hiya, thanks for your attention. Here we exchange the usage of obs_rep, which differs from the classical transformer. The intuitive reason is that, for the MARL problems considered here, the obs_rep from the encoder contains more information than the act_rep from the first attention block in the decoder (unlike in traditional NLP tasks such as translation, where the two carry roughly the same amount of information). That said, the second attention block in the decoder could in principle be removed, and the results should be theoretically the same with respect to the advantage decomposition.
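To illustrate one possible reading of that last remark (a hypothetical simplification, not the released implementation; all names are illustrative), the decoder block could keep only the masked self-attention and fold obs_rep in with a plain residual sum:

```python
import torch.nn as nn

class SimplifiedDecodeBlock(nn.Module):
    """Hypothetical decoder block with the second (cross) attention removed."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, act_rep, obs_rep, causal_mask):
        # Masked self-attention over the (shifted) action sequence only.
        out, _ = self.attn(act_rep, act_rep, act_rep, attn_mask=causal_mask)
        x = self.ln1(act_rep + out)
        # obs_rep is folded in by a simple residual sum instead of a second
        # attention block; the per-agent action head then reads from x.
        x = self.ln2(obs_rep + x)
        return x + self.mlp(x)
```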
