How could I improve the inference performance? #154
I printed the weights of quant_pytorch_model.bin, but the dtypes varied: some were int8, some were float, and some were int32. Why aren't they all int8?
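A quick way to check this yourself is to load the checkpoint and group its tensors by dtype. This is only a minimal sketch: it assumes quant_pytorch_model.bin is a plain PyTorch state dict (the file name is taken from the question above), and the comment about what the non-int8 tensors usually are is a general observation about int8 schemes, not a statement about this repository's exact layout.

```python
import torch
from collections import Counter

# Load the checkpoint as a plain state dict (assumption: no custom unpickling needed).
state_dict = torch.load("quant_pytorch_model.bin", map_location="cpu")

# Count how many tensors exist per dtype.
dtype_counts = Counter(str(t.dtype) for t in state_dict.values() if torch.is_tensor(t))
print(dtype_counts)

# List the non-int8 entries; in int8 schemes these are typically quantization
# scales, biases (often kept in int32 or float), embeddings, and LayerNorm parameters.
for name, t in state_dict.items():
    if torch.is_tensor(t) and t.dtype != torch.int8:
        print(name, t.dtype, tuple(t.shape))
```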
Hi, our quantization scheme is:

For more information, please refer to our published paper on this model: Q8BERT: Quantized 8Bit BERT. Regarding the flag, I would like to note that in order to get a speedup from the quantized model, you must run it with supporting hardware and software.
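To illustrate why int8 weights alone do not make inference faster, here is a rough sketch of symmetric linear quantization (a simplified, per-tensor variant; it is not the repository's code and the Q8BERT details may differ). Without int8 GEMM kernels in the hardware/software stack, the stored int8 weights are dequantized back to fp32 before each matmul, so the arithmetic cost is the same as the unquantized model.

```python
import torch

def quantize_symmetric(w: torch.Tensor, num_bits: int = 8):
    """Quantize a float tensor to signed int8 with a single per-tensor scale (simplified)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp((w / scale).round(), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp32 approximation of the original weight."""
    return q.to(torch.float32) * scale

w = torch.randn(768, 768)          # a BERT-base sized weight, for illustration
q, scale = quantize_symmetric(w)
x = torch.randn(1, 768)

# Without int8 kernels the forward pass still runs in fp32, so latency is unchanged:
y = x @ dequantize(q, scale).t()   # same cost as x @ w.t()
print((x @ w.t() - y).abs().max()) # only a small quantization error remains
```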
I have a similar problem: inference with the quantized model takes 1 min 54 s, while inference with the unquantized one takes 1 min 58 s. My version is up to date.
P.S. I ran inference on CPU, and my CPU is as follows:
Did you get an answer?
I used the command

to train the model, and

to do inference, but got the same performance as without the --load_quantized_model flag. How could I improve the inference performance?
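One way to make the comparison precise is to time the forward pass directly rather than the whole run. This is just a generic measurement sketch; the model and batch names in the usage comment are placeholders, not objects from this repository.

```python
import time
import torch

def time_inference(fn, runs: int = 10, warmup: int = 3) -> float:
    """Average seconds per call for a zero-argument inference closure."""
    with torch.no_grad():
        for _ in range(warmup):      # warm up caches / lazy initialization
            fn()
        start = time.perf_counter()
        for _ in range(runs):
            fn()
    return (time.perf_counter() - start) / runs

# Hypothetical usage: wrap the quantized and baseline forward passes and compare.
# Nearly identical averages suggest the int8 path is not actually being used.
# print(time_inference(lambda: quant_model(**batch)))
# print(time_inference(lambda: fp32_model(**batch)))
```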