From 74562ee42d79b6e14357b92675aba598abc7b57b Mon Sep 17 00:00:00 2001
From: Alexander
Date: Fri, 6 Sep 2024 11:21:16 +0400
Subject: [PATCH 1/5] Updated LLM compression related information

---
 .../llm_inference_guide/llm-inference-hf.rst  |  4 ++--
 .../weight-compression.rst                    | 24 +++++++++++++------
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
index f8023165b8f74c..1dd554ff101dfb 100644
--- a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
+++ b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
@@ -165,8 +165,8 @@ parameters.
    such as ``meta-llama/Llama-2-7b`` or ``Qwen/Qwen-7B-Chat``. These parameters are used by
    default only when ``bits=4`` is specified in the config.
 
-   For more details on compression options, refer to the
-   :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
+   For more details on compression options, refer to the correspoding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
+   For native NNCF weight quantization options, refer to :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
 
 OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
 library optimized with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there
diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index 67cd51a9554439..40c4e49ba262cb 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -182,9 +182,18 @@ trade-offs after optimization:
          ratio=0.9,
       )
 
+* ``scale_estimation`` - boolean parameter that enables the more accurate estimation of
+  quantization scales. Especially helpful when the weights of all the layers are quantized to
+  4 bits. Requires dataset.
+
+* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
+  quantization. Especially helpful when the weights of all the layers are quantized to
+  4 bits. The method can sometimes result in reduced accuracy when used with
+  Dynamic Quantization of activations. Requires dataset.
+
 * ``dataset`` - calibration dataset for data-aware weight compression. It is required
-  for some compression options, for example, some types ``sensitivity_metric`` can use
-  data for precision selection.
+  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  of ``sensitivity_metric`` can use data for precision selection.
 
 * ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing layers
   in the bit-width selection algorithm. Some of the metrics require dataset to be
@@ -212,14 +221,15 @@ trade-offs after optimization:
 * ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
   Fully-Connected and Embedding layers, including the first and last layers in the model.
 
-* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
-  quantization. Especially helpful when the weights of all the layers are quantized to
-  4 bits. The method can sometimes result in reduced accuracy when used with
-  Dynamic Quantization of activations. Requires dataset.
-
 For data-aware weight compression refer to the following
 `example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.
 
+.. note::
+
+   Some of the methods can be stacked one on top of another to achieve a better
+   accuracy-performance trade-off after weight quantization. For example, Scale Estimation
+   method can be applied along with AWQ and mixed-precision quantization (``ratio`` parameter).
+
 The example below shows data-free 4-bit weight quantization
 applied on top of OpenVINO IR. Before trying the example, make sure Optimum Intel
 is installed in your environment by running the following command:
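
Taken together, the options this patch documents stack in a single ``nncf.compress_weights()`` call.
Below is a minimal sketch of data-aware INT4 compression with AWQ and Scale Estimation enabled
on top of mixed-precision quantization. The IR path, prompts, tokenizer, and config values are
illustrative, and it assumes the model accepts the tokenizer's outputs directly (stateful LLM IRs
may require extra inputs); none of this is taken from the patched docs.

.. code-block:: python

   import nncf
   import openvino as ov
   from transformers import AutoTokenizer

   core = ov.Core()
   model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR

   # Placeholder calibration prompts; a real calibration set would be larger.
   calibration_prompts = ["What is OpenVINO?", "Explain weight quantization."]

   tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

   def transform_fn(prompt):
       # Turn a raw prompt into the dict of inputs NNCF feeds to the model.
       return dict(tokenizer(prompt, return_tensors="np"))

   dataset = nncf.Dataset(calibration_prompts, transform_fn)

   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.9,               # mixed precision: ~90% of layers in INT4, the rest in INT8
       group_size=128,
       dataset=dataset,         # required by the data-aware methods below
       awq=True,                # AWQ for more accurate INT4 weights
       scale_estimation=True,   # refine quantization scales on the calibration data
   )
   ov.save_model(compressed, "model_int4.xml")
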
From 8dfb998132122af7120ab1ff9d21349ea7801eb2 Mon Sep 17 00:00:00 2001
From: Alexander Kozlov
Date: Tue, 10 Sep 2024 11:30:16 +0400
Subject: [PATCH 2/5] Update
 docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst

Co-authored-by: Tatiana Savina
---
 .../learn-openvino/llm_inference_guide/llm-inference-hf.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
index 1dd554ff101dfb..23166da5c1cf83 100644
--- a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
+++ b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
@@ -165,7 +165,7 @@ parameters.
    such as ``meta-llama/Llama-2-7b`` or ``Qwen/Qwen-7B-Chat``. These parameters are used by
    default only when ``bits=4`` is specified in the config.
 
-   For more details on compression options, refer to the correspoding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
+   For more details on compression options, refer to the corresponding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
    For native NNCF weight quantization options, refer to :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
 
 OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
From 367748958b08bfb17cb10e926b6f335f892fcb7e Mon Sep 17 00:00:00 2001
From: Alexander Kozlov
Date: Tue, 10 Sep 2024 11:30:24 +0400
Subject: [PATCH 3/5] Update
 docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst

Co-authored-by: Tatiana Savina
---
 .../learn-openvino/llm_inference_guide/llm-inference-hf.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
index 23166da5c1cf83..77cd0aca62021d 100644
--- a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
+++ b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
@@ -166,7 +166,7 @@ parameters.
    default only when ``bits=4`` is specified in the config.
 
    For more details on compression options, refer to the corresponding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
-   For native NNCF weight quantization options, refer to :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
+   For native NNCF weight quantization options, refer to the :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
 
 OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
 library optimized with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there
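
For the Optimum Intel route these two patches link to, 4-bit weight compression is configured
through ``OVWeightQuantizationConfig``. A sketch under assumptions: the model id and the config
values below are illustrative, not defaults mandated by the guide.

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # 4-bit weight-only quantization config; values are illustrative.
   quantization_config = OVWeightQuantizationConfig(
       bits=4,
       sym=False,      # asymmetric quantization
       ratio=0.8,      # share of layers compressed to 4 bits, the rest stay 8-bit
       group_size=128,
   )

   model = OVModelForCausalLM.from_pretrained(
       "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
       export=True,                           # convert from Transformers to OpenVINO IR
       quantization_config=quantization_config,
   )
   model.save_pretrained("tinyllama-1.1b-int4-ov")
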
From 1ff1040c1f6c5021463cd19d62f55bfe0757a89f Mon Sep 17 00:00:00 2001
From: Alexander Kozlov
Date: Tue, 10 Sep 2024 11:30:45 +0400
Subject: [PATCH 4/5] Update
 docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst

Co-authored-by: Tatiana Savina
---
 .../model-optimization-guide/weight-compression.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index 40c4e49ba262cb..fb9d196f6f25fe 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -182,8 +182,8 @@ trade-offs after optimization:
          ratio=0.9,
       )
 
-* ``scale_estimation`` - boolean parameter that enables the more accurate estimation of
-  quantization scales. Especially helpful when the weights of all the layers are quantized to
+* ``scale_estimation`` - boolean parameter that enables more accurate estimation of
+  quantization scales. Especially helpful when the weights of all layers are quantized to
   4 bits. Requires dataset.
 
 * ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
From e01cc5add67e40cadb7ec80be30d40da1977b228 Mon Sep 17 00:00:00 2001
From: Alexander Kozlov
Date: Tue, 10 Sep 2024 11:30:58 +0400
Subject: [PATCH 5/5] Update
 docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst

Co-authored-by: Tatiana Savina
---
 .../model-optimization-guide/weight-compression.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index fb9d196f6f25fe..62350d04ace4ec 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -226,9 +226,9 @@ For data-aware weight compression refer to the following
 
 .. note::
 
-   Some of the methods can be stacked one on top of another to achieve a better
-   accuracy-performance trade-off after weight quantization. For example, Scale Estimation
-   method can be applied along with AWQ and mixed-precision quantization (``ratio`` parameter).
+   Some methods can be stacked on top of one another to achieve a better
+   accuracy-performance trade-off after weight quantization. For example, the Scale Estimation
+   method can be applied along with AWQ and mixed-precision quantization (the ``ratio`` parameter).
 
 The example below shows data-free 4-bit weight quantization
 applied on top of OpenVINO IR. Before trying the example, make sure Optimum Intel