From c625680efbae6c26c3878351c3c296477c8edea6 Mon Sep 17 00:00:00 2001
From: Harish Subramony <81822986+hsubramony@users.noreply.github.com>
Date: Sun, 20 Oct 2024 02:53:08 -0700
Subject: [PATCH] Update text-gen README.md to add auto-gptq fork install steps (#1442)

---
 examples/text-generation/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 4257827596..8aaccfd124 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -282,7 +282,7 @@ You will also need to add `--torch_compile` and `--parallel_strategy="tp"` in yo
 Here is an example:
 ```bash
 PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py --world_size 8 run_generation.py \
---model_name_or_path meta-llama/Llama-2-70b-hf \
+--model_name_or_path meta-llama/Llama-2-7b-hf \
 --trim_logits \
 --use_kv_cache \
 --attn_softmax_bf16 \
@@ -593,6 +593,10 @@ For more details see [documentation](https://docs.habana.ai/en/latest/PyTorch/Mo
 Llama2-7b in UINT4 weight only quantization is enabled using [AutoGPTQ Fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
 Currently, the support is for UINT4 inference of pre-quantized models only.
 
+```bash
+BUILD_CUDA_EXT=0 python -m pip install -vvv --no-build-isolation git+https://github.com/HabanaAI/AutoGPTQ.git
+```
+
 You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the following environment variables:
 `SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
 and by adding the argument `--load_quantized_model_with_autogptq`.