
Minimal GPU configuration #19

Answered by iofu728
CyprienRicque asked this question in Q&A

Hello @CyprienRicque, thank you for your interest in the LLMLingua project.

In fact, you can use any language model from Hugging Face as the small language model in the LLMLingua pipeline. By default, LLMLingua uses Llama-2-7B, which requires approximately 17-20 GB of GPU memory for inference.

However, by using a quantized version of the model, you can significantly reduce GPU memory usage. For example, TheBloke/Llama-2-7b-Chat-GPTQ requires less than 8 GB of GPU memory. You can even use much smaller models, down to the size of GPT-2 small, as sketched below.
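For instance, here is a minimal sketch of swapping in GPT-2 small as the compressor, assuming the PromptCompressor entry point from the llmlingua package (the prompt text and target_token budget are illustrative):

```python
from llmlingua import PromptCompressor

# GPT-2 small (~124M parameters) needs well under 1 GB of GPU memory.
llm_lingua = PromptCompressor(model_name="gpt2")

result = llm_lingua.compress_prompt(
    "Your long prompt text goes here ...",  # context to compress
    target_token=200,                       # illustrative token budget
)
print(result["compressed_prompt"])
```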

You can refer to the following code to use TheBloke/Llama-2-7b-Chat-GPTQ. But before that, make sure to update LLMLingua and install optimum auto-…
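The referenced snippet is truncated in this view; the cut-off package name presumably completes to auto-gptq, the standard GPTQ backend for Hugging Face transformers. A minimal sketch of the setup under that assumption (the "main" revision is also an assumption):

```python
# Assumed prerequisites: pip install -U llmlingua optimum auto-gptq
from llmlingua import PromptCompressor

# Load the 4-bit GPTQ checkpoint as the small compression model;
# "revision" selects a branch of the Hugging Face repo ("main" assumed here).
llm_lingua = PromptCompressor(
    model_name="TheBloke/Llama-2-7b-Chat-GPTQ",
    model_config={"revision": "main"},
)
```

From here, compress_prompt is called exactly as with the default model.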
