Multiplication positional encoding seems to work better than the original division one? #131

Mightlaus · 2024-10-31T11:58:27Z

Thank you for providing such a well-organized and comprehensive Transformer tutorial.
As a beginner, I’ve learned a lot from this repository☺️!

When I was building the positional encoding block, I mistakenly implemented it as:

pe[:, 0::2] = torch.sin(position / div_term) pe[:, 1::2] = torch.cos(position / div_term)

that is to multiply the position with the dominator, instead of the intended division form

pe[:, 0::2] = torch.sin(position * div_term pe[:, 1::2] = torch.cos(position * div_term))

However, in the first example where the model is trained to repeat the input words as the output, this incorrect implementation seems to converge much faster and nearly reaches zero loss.

I’m a bit confused—is it possible that this incorrect implementation actually performs better than the intended version?

position / div_term implementation outputs:
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.10 | Tokens / Sec:  1460.0 | Learning Rate: 5.5e-06
tensor([[0, 7, 7, 9, 7, 6, 8, 8, 8, 2]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.08 | Tokens / Sec:  1637.4 | Learning Rate: 6.1e-05
tensor([[0, 7, 2, 8, 5, 6, 8, 7, 3, 5]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.59 | Tokens / Sec:  1610.8 | Learning Rate: 1.2e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.50 | Tokens / Sec:  1661.7 | Learning Rate: 1.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.01 | Tokens / Sec:  1691.5 | Learning Rate: 2.3e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.00 | Tokens / Sec:  1654.0 | Learning Rate: 2.8e-04
...

position / div_term iimplementation outputs:
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.07 | Tokens / Sec:  1499.9 | Learning Rate: 5.5e-06
tensor([[0, 3, 6, 2, 2, 6, 3, 3, 4, 2]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.07 | Tokens / Sec:  1679.4 | Learning Rate: 6.1e-05
tensor([[0, 3, 2, 6, 5, 4, 8, 7, 6, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.76 | Tokens / Sec:  1664.8 | Learning Rate: 1.2e-04
tensor([[0, 3, 2, 6, 5, 4, 7, 9, 8, 3]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.45 | Tokens / Sec:  1662.9 | Learning Rate: 1.7e-04
tensor([[0, 2, 3, 6, 5, 4, 7, 8, 9, 7]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.93 | Tokens / Sec:  1643.7 | Learning Rate: 2.3e-04
tensor([[0, 2, 3, 5, 4, 6, 5, 9, 7, 8]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.55 | Tokens / Sec:  1684.9 | Learning Rate: 2.8e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.32 | Tokens / Sec:  1656.6 | Learning Rate: 3.4e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.24 | Tokens / Sec:  1673.1 | Learning Rate: 3.9e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.13 | Tokens / Sec:  1646.1 | Learning Rate: 4.5e-04
tensor([[0, 2, 3, 4, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.17 | Tokens / Sec:  1684.4 | Learning Rate: 5.0e-04
tensor([[0, 2, 3, 4, 5, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.10 | Tokens / Sec:  1655.1 | Learning Rate: 5.6e-04
tensor([[0, 1, 2, 3, 4, 4, 5, 6, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.11 | Tokens / Sec:  1645.2 | Learning Rate: 6.1e-04
tensor([[0, 2, 3, 2, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.08 | Tokens / Sec:  1682.0 | Learning Rate: 6.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.12 | Tokens / Sec:  1666.5 | Learning Rate: 7.2e-04
tensor([[0, 2, 1, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.04 | Tokens / Sec:  1622.6 | Learning Rate: 7.8e-04
tensor([[0, 2, 3, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.17 | Tokens / Sec:  1672.7 | Learning Rate: 8.3e-04
...

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiplication positional encoding seems to work better than the original division one? #131

Multiplication positional encoding seems to work better than the original division one? #131

Mightlaus commented Oct 31, 2024

Multiplication positional encoding seems to work better than the original division one? #131

Multiplication positional encoding seems to work better than the original division one? #131

Comments

Mightlaus commented Oct 31, 2024