Minimal GPU configuration #19
Hello,
Replies: 1 comment
Hello @CyprienRicque, thank you for showing interest in the LLMLingua project.

In fact, you can use any language model from Hugging Face as the small language model in the LLMLingua pipeline. By default, LLMLingua uses llama-2-7b, which requires approximately 17-20GB of GPU memory for inference. However, by using a quantized version of the model, you can significantly reduce GPU memory usage. For example, TheBloke/Llama-2-7b-Chat-GPTQ requires less than 8GB of GPU memory. You can even use smaller models, such as models the size of GPT-2-small.

You can refer to the following code to use TheBloke/Llama-2-7b-Chat-GPTQ. But before that, make sure to update LLMLingua and install optimum auto-…

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
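If even the GPTQ checkpoint is too large, a minimal sketch along the lines of the GPT-2-small suggestion above could look like the following. The model name "gpt2" and the result key used in the print statement are illustrative assumptions, not something stated in this thread; any Hugging Face causal language model should plug in the same way.

from llmlingua import PromptCompressor

# Sketch: use plain GPT-2 (~124M parameters) as the small language model.
# It fits in a few GB of GPU memory, at the cost of somewhat lower
# compression quality than llama-2-7b.
llm_lingua = PromptCompressor("gpt2")

prompt = "Your long prompt goes here ..."
compressed = llm_lingua.compress_prompt(
    prompt, instruction="", question="", target_token=200
)

# compress_prompt returns a dict; the compressed text is under "compressed_prompt".
print(compressed["compressed_prompt"])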