
The output of BitLinear is quite abnormal #35

Closed
Jiangxg opened this issue Mar 5, 2024 · 6 comments


Jiangxg commented Mar 5, 2024

Describe the bug
I printed the mean and variance of the tensor y in example.py.
They are abnormal, as follows:

mean and var of BitLinear output:
-0.567935049533844
1149.9969482421875

To double-check, I printed the mean and variance of the outputs from Linear and BitLinear simultaneously.

mean and var of Linear output:
0.012186492793262005
0.33256232738494873
mean and var of BitLinear output:
0.9070871472358704
992.69384765625

I believe there are mistakes in the implementation of BitLinear in bitnet/bitlinear.py.

To Reproduce
Steps to reproduce the behavior:

  1. Print the mean and variance of y in example.py.
  2. Insert output_linear = torch.nn.functional.linear(x, self.weight, self.bias) at bitnet/bitlinear.py line 129, then print the mean and variance of output_linear (a comparison sketch is shown below).
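
For reference, here is a minimal sketch of the comparison, assuming BitLinear is imported as in example.py (from bitnet import BitLinear) and takes (in_features, out_features) like torch.nn.Linear; the input shape is illustrative:

```python
import torch
from torch import nn
from bitnet import BitLinear  # as used in example.py

torch.manual_seed(0)
x = torch.randn(10, 512)  # illustrative input; adjust to match example.py

bit_linear = BitLinear(512, 512)
linear = nn.Linear(512, 512)

y_bit = bit_linear(x)
y_fp = linear(x)

print("mean and var of Linear output:")
print(y_fp.mean().item(), y_fp.var().item())
print("mean and var of BitLinear output:")
print(y_bit.mean().item(), y_bit.var().item())
```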

Jiangxg added the bug Something isn't working label Mar 5, 2024
Jiangxg changed the title from "[BUG] The output of BitLinear is quite abnormal" to "The output of BitLinear is quite abnormal" Mar 5, 2024

suzuke commented Mar 16, 2024

The implementation of this BitLinear is completely wrong: not only does it not follow the process outlined in the BitNet paper, it also misunderstands all of the computational principles. I don't understand why it still receives so many stars.


suzuke commented Mar 16, 2024

Gamma, beta, and alpha are calculated from the weights and input before quantization. These parameters are then used for weight binarization and input quantization. The binarized weights and quantized input go through a linear operation to produce the output, which is then dequantized using the previously calculated gamma and beta. It is not meaningful to calculate gamma and beta separately for the quantization and dequantization stages, and the grouping implementation here is entirely nonsensical.
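
For reference, a minimal sketch of the forward pass described in the BitNet paper (absmax activation quantization, sign binarization of zero-centered weights, and a single dequantization by beta * gamma / Qb at the end). Variable names are illustrative, not the repository's API, and the paper's SubLN normalization of x before quantization is omitted:

```python
import torch
import torch.nn.functional as F

def bitlinear_forward(x, weight, bits=8, eps=1e-5):
    Qb = 2 ** (bits - 1)

    # weight binarization (BitNet paper): center, take sign, keep the scale beta
    alpha = weight.mean()
    beta = weight.abs().mean()          # beta = ||W||_1 / (n * m)
    w_bin = torch.sign(weight - alpha)  # roughly {-1, +1}; exact zeros map to 0 here

    # absmax activation quantization to b bits, keep the scale gamma
    gamma = x.abs().max()               # gamma = ||x||_inf
    x_q = torch.clamp(x * Qb / (gamma + eps), -Qb + eps, Qb - eps)

    # low-precision linear, then a single dequantization with beta and gamma
    y = F.linear(x_q, w_bin)
    return y * beta * gamma / Qb
```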


2020zyc commented Mar 16, 2024

> Gamma, beta, and alpha are calculated from the weights and input before quantization. These parameters are then used for weight binarization and input quantization. The binarized weights and quantized input go through a linear operation to produce the output, which is then dequantized using the previously calculated gamma and beta. It is not meaningful to calculate gamma and beta separately for the quantization and dequantization stages, and the grouping implementation here is entirely nonsensical.

Hi, I don't understand what you are saying. Could you explain more?
The code just calculates gamma/beta dynamically in the quantization stage, then uses those two statistics to dequantize the activation.
There is no extra calculation of gamma/beta in the dequantization stage.
You can of course move that calculation out of the quantization stage, but you still need to compute gamma/beta dynamically.


2020zyc commented Mar 16, 2024

> Gamma, beta, and alpha are calculated from the weights and input before quantization. These parameters are then used for weight binarization and input quantization. The binarized weights and quantized input go through a linear operation to produce the output, which is then dequantized using the previously calculated gamma and beta. It is not meaningful to calculate gamma and beta separately for the quantization and dequantization stages, and the grouping implementation here is entirely nonsensical.

Another implementation is BIT-Transformers. I don't know how its BitLinear works, especially the forward function: there is no obvious beta/gamma and no dequantization of the output. Can you make sense of this code? Thanks.

[screenshot of the BIT-Transformers BitLinear forward function]
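
Not having seen that forward function in detail, one common pattern that would explain the missing beta/gamma and the missing output dequantization is to fold the scales back into the quantized tensors and use a straight-through estimator, so forward() only ever sees already-rescaled values. A hedged sketch of that pattern (not necessarily what BIT-Transformers does; function names are illustrative):

```python
import torch
import torch.nn.functional as F

def weight_quant(w):
    # binarize around the mean, then fold the scale beta straight back in
    beta = w.abs().mean()
    return torch.sign(w - w.mean()) * beta

def activation_quant(x, bits=8):
    Qb = 2 ** (bits - 1)
    gamma = x.abs().max().clamp(min=1e-5)
    # quantize to integer levels and immediately rescale, so no later dequant is needed
    return (x * Qb / gamma).round().clamp(-Qb, Qb - 1) * gamma / Qb

def bitlinear_forward(x, w):
    # straight-through estimator: forward uses the quantized values,
    # backward sees the identity, and the output is already at full-precision scale
    x_q = x + (activation_quant(x) - x).detach()
    w_q = w + (weight_quant(w) - w).detach()
    return F.linear(x_q, w_q)
```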


suzuke commented Mar 16, 2024

The issues I mentioned have been addressed in commit 6cdb2ea.


Jiangxg commented Mar 18, 2024

> The issues I mentioned have been addressed in commit 6cdb2ea.

Yes, most of the problems have been addressed. There is still a bug in the grouping implementation; I am working on that.
