From a72b4ef8c2d2b97965adfa4489f05f9a085f4168 Mon Sep 17 00:00:00 2001
From: Alexander Kozlov
Date: Tue, 10 Sep 2024 15:01:38 +0400
Subject: [PATCH] Updated LLM compression related information (#26460)

Co-authored-by: Tatiana Savina
---
 .../llm_inference_guide/llm-inference-hf.rst  |  4 ++--
 .../weight-compression.rst                    | 24 +++++++++++++------
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
index f8023165b8f74c..77cd0aca62021d 100644
--- a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
+++ b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
@@ -165,8 +165,8 @@ parameters.
    such as ``meta-llama/Llama-2-7b`` or ``Qwen/Qwen-7B-Chat``.
    These parameters are used by default only when ``bits=4`` is specified in the config.
 
-   For more details on compression options, refer to the
-   :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
+   For more details on compression options, refer to the corresponding `Optimum documentation `__.
+   For native NNCF weight quantization options, refer to the :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
 
 OpenVINO also supports 4-bit models from Hugging Face `Transformers `__ library
 optimized with `GPTQ `__. In this case, there
diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index 73e2ccf9ea5351..2b67dc0ddf5cbb 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -180,9 +180,18 @@ trade-offs after optimization:
         ratio=0.9,
     )
 
+* ``scale_estimation`` - boolean parameter that enables a more accurate estimation of
+  quantization scales. Especially helpful when the weights of all layers are quantized to
+  4 bits. Requires a dataset.
+
+* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
+  quantization. Especially helpful when the weights of all the layers are quantized to
+  4 bits. The method can sometimes result in reduced accuracy when used with
+  Dynamic Quantization of activations. Requires a dataset.
+
 * ``dataset`` - calibration dataset for data-aware weight compression. It is required
-  for some compression options, for example, some types ``sensitivity_metric`` can use
-  data for precision selection.
+  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  of ``sensitivity_metric`` can use data for precision selection.
 
 * ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing
   layers in the bit-width selection algorithm. Some of the metrics require dataset to be
@@ -210,14 +219,15 @@ trade-offs after optimization:
 * ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
   Fully-Connected and Embedding layers, including the first and last layers in the model.
 
-* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
-  quantization. Especially helpful when the weights of all the layers are quantized to
-  4 bits. The method can sometimes result in reduced accuracy when used with
-  Dynamic Quantization of activations. Requires dataset.
-
-For data-aware weight compression refer to the following
-`example `__.
+.. note::
+
+   Some methods can be stacked on top of one another to achieve a better
+   accuracy-performance trade-off after weight quantization. For example, the Scale Estimation
+   method can be applied along with AWQ and mixed-precision quantization (the ``ratio`` parameter).
 
 The example below shows data-free 4-bit weight quantization applied on top of OpenVINO IR.
 Before trying the example, make sure Optimum Intel is installed in your environment
 by running the following command:
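Assuming Optimum Intel with its OpenVINO extras is already installed, a minimal sketch of the data-free 4-bit flow this section introduces could look as follows. The model id is reused from the patch text above; the output directory is an illustrative assumption, and the example actually shipped with the guide may differ:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # Data-free 4-bit weight-only quantization applied while exporting to OpenVINO IR.
   quantization_config = OVWeightQuantizationConfig(bits=4)

   model = OVModelForCausalLM.from_pretrained(
       "meta-llama/Llama-2-7b",   # model id mentioned earlier in the patch
       export=True,               # convert the Transformers model to OpenVINO IR
       quantization_config=quantization_config,
   )
   model.save_pretrained("llama-2-7b-int4-ov")  # illustrative output directory

No calibration data is needed here: with ``bits=4`` alone, weights are quantized using only their own statistics, which is what makes the flow data-free.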
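To make the stacking described in the note above concrete, here is a hedged sketch of data-aware 4-bit compression with NNCF that combines ``awq``, ``scale_estimation``, and mixed precision via ``ratio``. The IR path and the ``calibration_items`` iterable are placeholders, not part of the patch:

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

   # AWQ and Scale Estimation are data-aware: both require a calibration dataset.
   calibration_items = [...]  # stand-in for an iterable of real model inputs
   calibration_dataset = nncf.Dataset(calibration_items)

   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit symmetric weight quantization
       ratio=0.9,              # mixed precision: ~90% of weights in INT4, the rest in 8 bit
       awq=True,               # activation-aware weight quantization
       scale_estimation=True,  # more accurate quantization scales
       dataset=calibration_dataset,
   )

   ov.save_model(compressed_model, "model_int4.xml")

Enabling ``awq`` and ``scale_estimation`` together, as the note suggests, trades longer compression time for better accuracy of the resulting INT4 model.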