Updated LLM compression related information (#26460)
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
AlexKoff88 and tsavina authored Sep 10, 2024
1 parent 52c9ae7 commit a72b4ef
Showing 2 changed files with 19 additions and 9 deletions.
@@ -165,8 +165,8 @@ parameters.
such as ``meta-llama/Llama-2-7b`` or ``Qwen/Qwen-7B-Chat``. These parameters are used by
default only when ``bits=4`` is specified in the config.
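
As a minimal sketch of what this looks like with Optimum Intel (the model ID is only an example; any supported causal LM is loaded the same way):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # bits=4 triggers 4-bit weight compression; for the models listed above,
   # the remaining parameters fall back to model-specific defaults.
   config = OVWeightQuantizationConfig(bits=4)
   model = OVModelForCausalLM.from_pretrained(
       "meta-llama/Llama-2-7b",  # example model ID from the text above
       export=True,
       quantization_config=config,
   )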

-For more details on compression options, refer to the
-:doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
+For more details on compression options, refer to the corresponding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
+For native NNCF weight quantization options, refer to the :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
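
As a hedged sketch of that native NNCF path (the IR path is hypothetical; ``nncf.compress_weights`` is NNCF's entry point for weight-only quantization):

.. code-block:: python

   import nncf
   import openvino as ov

   # Hypothetical path to an exported OpenVINO IR model.
   model = ov.Core().read_model("llama-2-7b/openvino_model.xml")

   # Data-free 4-bit weight compression: ~90% of eligible weights in INT4,
   # the rest kept in INT8 for accuracy.
   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.9,
       group_size=128,
   )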

OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
library optimized with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there
@@ -180,9 +180,18 @@ trade-offs after optimization:
ratio=0.9,
)
+* ``scale_estimation`` - boolean parameter that enables more accurate estimation of
+  quantization scales. Especially helpful when the weights of all layers are quantized to
+  4 bits. Requires a dataset.
+
+* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
+  quantization. Especially helpful when the weights of all the layers are quantized to
+  4 bits. The method can sometimes result in reduced accuracy when used with
+  Dynamic Quantization of activations. Requires a dataset.
+
* ``dataset`` - calibration dataset for data-aware weight compression. It is required
-for some compression options, for example, some types ``sensitivity_metric`` can use
-data for precision selection.
+  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  of ``sensitivity_metric`` can use data for precision selection.

* ``sensitivity_metric`` - controls the metric used to estimate the sensitivity of compressing
  layers in the bit-width selection algorithm. Some of the metrics require a dataset to be
@@ -210,14 +219,15 @@ trade-offs after optimization:
* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
  Fully-Connected and Embedding layers, including the first and last layers in the model.

-* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
-quantization. Especially helpful when the weights of all the layers are quantized to
-4 bits. The method can sometimes result in reduced accuracy when used with
-Dynamic Quantization of activations. Requires dataset.

For data-aware weight compression, refer to the following
`example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.
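
A minimal hedged sketch of such a data-aware call through NNCF (the sample data and transform are placeholders; in practice the transform would tokenize text into the model's inputs):

.. code-block:: python

   import nncf
   import openvino as ov

   model = ov.Core().read_model("model.xml")  # hypothetical IR path

   raw_samples = ["An example calibration sentence."]  # placeholder data

   def transform_fn(sample):
       # Placeholder: map a raw sample to the model's input format
       # (for an LLM, tokenization would happen here).
       return sample

   calibration = nncf.Dataset(raw_samples, transform_fn)

   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       dataset=calibration,        # enables data-aware methods
       scale_estimation=True,      # refines quantization scales on the data
   )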

+.. note::
+
+   Some methods can be stacked on top of one another to achieve a better
+   accuracy-performance trade-off after weight quantization. For example, the Scale Estimation
+   method can be applied along with AWQ and mixed-precision quantization (the ``ratio`` parameter).
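
Continuing the sketch above, stacking these options through ``nncf.compress_weights`` might look like this (parameter values are illustrative; ``model`` and ``calibration`` come from the previous sketch):

.. code-block:: python

   # AWQ and Scale Estimation stacked with mixed-precision quantization:
   # ratio=0.8 keeps roughly 80% of eligible weights in INT4, the rest in INT8.
   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,
       awq=True,                   # AWQ weight adjustment
       scale_estimation=True,      # refined quantization scales
       dataset=calibration,        # both methods need calibration data
   )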

The example below shows data-free 4-bit weight quantization
applied on top of OpenVINO IR. Before trying the example, make sure Optimum Intel
is installed in your environment by running the following command:
