Updated LLM compression related information (#26460)
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
AlexKoff88 and tsavina authored Sep 10, 2024
1 parent 52c9ae7 commit a72b4ef
Showing 2 changed files with 19 additions and 9 deletions.
@@ -165,8 +165,8 @@ parameters.
such as ``meta-llama/Llama-2-7b`` or ``Qwen/Qwen-7B-Chat``. These parameters are used by
default only when ``bits=4`` is specified in the config.
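
As a minimal sketch of what this looks like with Optimum Intel (the model ID is only an example; any supported causal LM is loaded the same way):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # bits=4 triggers 4-bit weight compression; for the models listed above,
   # the remaining parameters fall back to model-specific defaults.
   config = OVWeightQuantizationConfig(bits=4)
   model = OVModelForCausalLM.from_pretrained(
       "meta-llama/Llama-2-7b",  # example model ID from the text above
       export=True,
       quantization_config=config,
   )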

-For more details on compression options, refer to the
-:doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
+For more details on compression options, refer to the corresponding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
+For native NNCF weight quantization options, refer to the :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.
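
As a hedged sketch of that native NNCF path (the IR path is hypothetical; ``nncf.compress_weights`` is NNCF's entry point for weight-only quantization):

.. code-block:: python

   import nncf
   import openvino as ov

   # Hypothetical path to an exported OpenVINO IR model.
   model = ov.Core().read_model("llama-2-7b/openvino_model.xml")

   # Data-free 4-bit weight compression: ~90% of eligible weights in INT4,
   # the rest kept in INT8 for accuracy.
   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.9,
       group_size=128,
   )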

OpenVINO also supports 4-bit models from Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
library optimized with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there
@@ -180,9 +180,18 @@ trade-offs after optimization:
ratio=0.9,
)
+* ``scale_estimation`` - boolean parameter that enables more accurate estimation of
+  quantization scales. Especially helpful when the weights of all layers are quantized to
+  4 bits. Requires a dataset.
+
+* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
+  quantization. Especially helpful when the weights of all the layers are quantized to
+  4 bits. The method can sometimes result in reduced accuracy when used with
+  Dynamic Quantization of activations. Requires a dataset.
+
* ``dataset`` - calibration dataset for data-aware weight compression. It is required
-for some compression options, for example, some types ``sensitivity_metric`` can use
-data for precision selection.
+  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  of ``sensitivity_metric`` can use data for precision selection.

* ``sensitivity_metric`` - controls the metric used to estimate the sensitivity of compressing
  layers in the bit-width selection algorithm. Some of the metrics require a dataset to be
@@ -210,14 +219,15 @@ trade-offs after optimization:
* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
  Fully-Connected and Embedding layers, including the first and last layers in the model.

-* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
-quantization. Especially helpful when the weights of all the layers are quantized to
-4 bits. The method can sometimes result in reduced accuracy when used with
-Dynamic Quantization of activations. Requires dataset.

For data-aware weight compression, refer to the following
`example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.
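
A minimal hedged sketch of such a data-aware call through NNCF (the sample data and transform are placeholders; in practice the transform would tokenize text into the model's inputs):

.. code-block:: python

   import nncf
   import openvino as ov

   model = ov.Core().read_model("model.xml")  # hypothetical IR path

   raw_samples = ["An example calibration sentence."]  # placeholder data

   def transform_fn(sample):
       # Placeholder: map a raw sample to the model's input format
       # (for an LLM, tokenization would happen here).
       return sample

   calibration = nncf.Dataset(raw_samples, transform_fn)

   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       dataset=calibration,        # enables data-aware methods
       scale_estimation=True,      # refines quantization scales on the data
   )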

+.. note::
+
+   Some methods can be stacked on top of one another to achieve a better
+   accuracy-performance trade-off after weight quantization. For example, the Scale Estimation
+   method can be applied along with AWQ and mixed-precision quantization (the ``ratio`` parameter).
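
Continuing the sketch above, stacking these options through ``nncf.compress_weights`` might look like this (parameter values are illustrative; ``model`` and ``calibration`` come from the previous sketch):

.. code-block:: python

   # AWQ and Scale Estimation stacked with mixed-precision quantization:
   # ratio=0.8 keeps roughly 80% of eligible weights in INT4, the rest in INT8.
   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,
       awq=True,                   # AWQ weight adjustment
       scale_estimation=True,      # refined quantization scales
       dataset=calibration,        # both methods need calibration data
   )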

The example below shows data-free 4-bit weight quantization
applied on top of OpenVINO IR. Before trying the example, make sure Optimum Intel
is installed in your environment by running the following command:
