Updated LLM compression related information #26460

Merged

Changes from all commits

@@ -165,8 +165,8 @@ parameters.
such as ``meta-llama/Llama-2-7b`` or ``Qwen/Qwen-7B-Chat``. These parameters are used by
default only when ``bits=4`` is specified in the config.
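
For illustration, a minimal sketch of triggering these defaults through Optimum Intel
(assuming ``optimum-intel`` with OpenVINO support is installed; output handling is omitted):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # With bits=4, the model-specific default parameters (ratio, group size,
   # etc.) are applied automatically when defined for this architecture.
   model = OVModelForCausalLM.from_pretrained(
       "meta-llama/Llama-2-7b",
       export=True,
       quantization_config=OVWeightQuantizationConfig(bits=4),
   )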

For more details on compression options, refer to the corresponding `Optimum documentation <https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit>`__.
For native NNCF weight quantization options, refer to the :doc:`weight compression guide <../../openvino-workflow/model-optimization-guide/weight-compression>`.

OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__
library optimized with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there
is no need for an additional model optimization step because the conversion automatically
preserves the INT4 optimization results.
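
As a sketch, such a model can be exported directly; the GPTQ checkpoint ID below is illustrative:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # The INT4 GPTQ weights are preserved during conversion, so no extra
   # compression step is applied here.
   model = OVModelForCausalLM.from_pretrained(
       "TheBloke/Llama-2-7B-GPTQ",  # illustrative GPTQ-optimized checkpoint
       export=True,
   )
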
@@ -182,9 +182,18 @@ trade-offs after optimization:
ratio=0.9,
)

* ``scale_estimation`` - boolean parameter that enables a more accurate estimation of
  quantization scales. Especially helpful when the weights of all layers are quantized to
  4 bits. Requires a dataset.

* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight
  quantization. Especially helpful when the weights of all layers are quantized to
  4 bits. The method can sometimes reduce accuracy when used together with
  Dynamic Quantization of activations. Requires a dataset.

* ``dataset`` - calibration dataset for data-aware weight compression. It is required
  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
  of ``sensitivity_metric`` can use data for precision selection.

* ``sensitivity_metric`` - controls the metric used to estimate the sensitivity of compressing
  layers in the bit-width selection algorithm. Some of the metrics require a dataset to be
  provided.
@@ -212,14 +221,15 @@ trade-offs after optimization:
* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
  Fully-Connected and Embedding layers, including the first and last layers in the model.


For data-aware weight compression, refer to the following
`example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__;
a minimal sketch is also shown below.
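
As a rough illustration (not the linked example itself), the sketch below runs data-aware
compression with NNCF directly; the IR path, calibration sample, and chosen metric are
placeholders:

.. code-block:: python

   import numpy as np
   import nncf
   import openvino as ov

   model = ov.Core().read_model("model.xml")  # placeholder IR path

   # A dummy calibration sample; real calibration data should be
   # representative model inputs.
   samples = [{"input_ids": np.ones((1, 128), dtype=np.int64)}]
   calibration_data = nncf.Dataset(samples)

   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.9,
       dataset=calibration_data,
       # data-aware metric for choosing which layers stay in INT8
       sensitivity_metric=nncf.SensitivityMetric.HESSIAN_INPUT_ACTIVATION,
   )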

.. note::

   Some methods can be stacked on top of one another to achieve a better
   accuracy-performance trade-off after weight quantization. For example, the Scale Estimation
   method can be applied along with AWQ and mixed-precision quantization (the ``ratio`` parameter).
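
For instance, a sketch of such stacking with NNCF, reusing the placeholder ``model`` and
``calibration_data`` from the sketch above:

.. code-block:: python

   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,                 # mixed precision: ~80% of weights in INT4
       awq=True,                  # AWQ for more accurate INT4 quantization
       scale_estimation=True,     # Scale Estimation stacked on top of AWQ
       dataset=calibration_data,  # both methods are data-aware
   )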

The example below shows data-free 4-bit weight quantization
applied on top of OpenVINO IR. Before trying the example, make sure Optimum Intel
is installed in your environment by running the following command:
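
.. code-block:: console

   pip install optimum[openvino]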