Below is the `forward` function of the `MultiHeadedAttention` class:

```python
def forward(self, query, key, value, mask=None):
    "Implements Figure 2"
    if mask is not None:
        # Same mask applied to all h heads.
        mask = mask.unsqueeze(1)
    nbatches = query.size(0)

    # 1) Do all the linear projections in batch from d_model => h x d_k
    query, key, value = [
        lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        for lin, x in zip(self.linears, (query, key, value))
    ]

    # 2) Apply attention on all the projected vectors in batch.
    x, self.attn = attention(
        query, key, value, mask=mask, dropout=self.dropout
    )

    # 3) "Concat" using a view and apply a final linear.
    x = (
        x.transpose(1, 2)
        .contiguous()
        .view(nbatches, -1, self.h * self.d_k)
    )
    del query
    del key
    del value
    return self.linears[-1](x)
```
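For concreteness, here is a minimal shape trace of step 1 (my own sketch, not part of the class; it assumes `d_model = 512` and `h = 8`, hence `d_k = 64`, with a toy batch of 2 sequences of 10 tokens):

```python
import torch
import torch.nn as nn

nbatches, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h

lin = nn.Linear(d_model, d_model)          # stands in for one of self.linears
x = torch.randn(nbatches, seq_len, d_model)

projected = lin(x)                          # (2, 10, 512)
split = projected.view(nbatches, -1, h, d_k)  # (2, 10, 8, 64): split d_model into h heads
per_head = split.transpose(1, 2)              # (2, 8, 10, 64): head dim moved ahead of token dim
print(projected.shape, split.shape, per_head.shape)
```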
I notice that `query`, `key`, and `value` are transposed (`lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)`) after passing through the linear layers. After the attention is computed, `x` is transposed back (`x.transpose(1, 2)`).
May I know why this processing is needed? Can we simply use `lin(x).view(nbatches, -1, self.h, self.d_k)` and
`x = x.contiguous().view(nbatches, -1, self.h * self.d_k)` instead?
I removed all the transposes and the result is different, so I am wondering which one is correct: the original version with the transposes, or the version without.
Note that the `-1` in the `view` corresponds to the sequence length $N_{token}$ (the number of tokens, i.e. time steps) of the current batch. The `transpose(1, 2)` turns the shape from `(nbatches, N_token, h, d_k)` into `(nbatches, h, N_token, d_k)`, so that `torch.matmul` inside `attention` (shown below), which contracts over the last two dimensions, produces an $N_{token} \times N_{token}$ score matrix for every head:
```python
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
```
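To see why the transpose matters, here is a small standalone sketch (not from the notebook; the batch size, head count, sequence length, and `d_k` are arbitrary illustrative values) that contrasts the score shapes `torch.matmul` produces for the two layouts:

```python
import math
import torch

nbatches, h, seq_len, d_k = 2, 8, 10, 64

# With the transpose: (batch, head, token, d_k).
q = torch.randn(nbatches, h, seq_len, d_k)
k = torch.randn(nbatches, h, seq_len, d_k)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
print(scores.shape)   # torch.Size([2, 8, 10, 10]) -> token-to-token scores, per head

# Without the transpose: (batch, token, head, d_k).
q2 = torch.randn(nbatches, seq_len, h, d_k)
k2 = torch.randn(nbatches, seq_len, h, d_k)
scores2 = torch.matmul(q2, k2.transpose(-2, -1)) / math.sqrt(d_k)
print(scores2.shape)  # torch.Size([2, 10, 8, 8]) -> head-to-head scores within each position
```

Both layouts happen to end up with the same output shape after the final concatenating `view`, but only the transposed layout `(nbatches, h, N_token, d_k)` computes attention across tokens; the non-transposed layout compares heads within each position instead. That is why your results differ, and why the original code with the transposes is the correct one.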