How could I improve the inference performance? #154
I printed the weights of quant_pytorch_model.bin, but the dtypes varied: some were int8, some were float, and some were int32. Why aren't they all int8?
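A quick way to check this yourself is to load the checkpoint and group its tensors by dtype. This is only a minimal sketch: it assumes quant_pytorch_model.bin is a plain PyTorch state dict (the file name is taken from the question above), and the comment about what the non-int8 tensors usually are is a general observation about int8 schemes, not a statement about this repository's exact layout.

```python
import torch
from collections import Counter

# Load the checkpoint as a plain state dict (assumption: no custom unpickling needed).
state_dict = torch.load("quant_pytorch_model.bin", map_location="cpu")

# Count how many tensors exist per dtype.
dtype_counts = Counter(str(t.dtype) for t in state_dict.values() if torch.is_tensor(t))
print(dtype_counts)

# List the non-int8 entries; in int8 schemes these are typically quantization
# scales, biases (often kept in int32 or float), embeddings, and LayerNorm parameters.
for name, t in state_dict.items():
    if torch.is_tensor(t) and t.dtype != torch.int8:
        print(name, t.dtype, tuple(t.shape))
```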
Hi, our quantization scheme is:

For more information, please refer to our published paper on this model: Q8BERT: Quantized 8Bit BERT. Regarding the flag, I would like to note that in order to get a speedup from the quantized model, you must run it with supporting hardware and software.
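To illustrate why int8 weights alone do not make inference faster, here is a rough sketch of symmetric linear quantization (a simplified, per-tensor variant; it is not the repository's code and the Q8BERT details may differ). Without int8 GEMM kernels in the hardware/software stack, the stored int8 weights are dequantized back to fp32 before each matmul, so the arithmetic cost is the same as the unquantized model.

```python
import torch

def quantize_symmetric(w: torch.Tensor, num_bits: int = 8):
    """Quantize a float tensor to signed int8 with a single per-tensor scale (simplified)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp((w / scale).round(), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp32 approximation of the original weight."""
    return q.to(torch.float32) * scale

w = torch.randn(768, 768)          # a BERT-base sized weight, for illustration
q, scale = quantize_symmetric(w)
x = torch.randn(1, 768)

# Without int8 kernels the forward pass still runs in fp32, so latency is unchanged:
y = x @ dequantize(q, scale).t()   # same cost as x @ w.t()
print((x @ w.t() - y).abs().max()) # only a small quantization error remains
```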
I have a similar problem: inference with the quantized model takes 1 min 54 s, while inference with the unquantized one takes 1 min 58 s. My version is up to date.
P.S. I ran inference on CPU, and my CPU is as follows:
Did you get an answer?
I used the command

to train the model, and

to do inference, but got the same performance as without the --load_quantized_model flag. How could I improve the inference performance?
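One way to make the comparison precise is to time the forward pass directly rather than the whole run. This is just a generic measurement sketch; the model and batch names in the usage comment are placeholders, not objects from this repository.

```python
import time
import torch

def time_inference(fn, runs: int = 10, warmup: int = 3) -> float:
    """Average seconds per call for a zero-argument inference closure."""
    with torch.no_grad():
        for _ in range(warmup):      # warm up caches / lazy initialization
            fn()
        start = time.perf_counter()
        for _ in range(runs):
            fn()
    return (time.perf_counter() - start) / runs

# Hypothetical usage: wrap the quantized and baseline forward passes and compare.
# Nearly identical averages suggest the int8 path is not actually being used.
# print(time_inference(lambda: quant_model(**batch)))
# print(time_inference(lambda: fp32_model(**batch)))
```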