CPU OOM when inferencing Llama3-70B-Chinese-Chat #904

GORGEOUSLCX · 2024-05-20T09:06:42Z

Code: text-generation demo
Command:
deepspeed --num_gpus 2 inference-test.py --dtype float16 --batch_size 4 --max_new_tokens 200 --model ../Llama3-70B-Chinese-Chat
Hardware: two A100 80GB GPUs, 250GB of CPU memory
Problem: When I use DeepSpeed to load the float16 model, it consumes too much CPU memory, and 250GB of CPU memory is not enough to load the 70B model. When I instead load the model directly with Transformers, using model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto"), inference runs without any significant CPU memory usage.
How can I reduce CPU memory usage when loading with DeepSpeed?
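
For reference, here is a rough back-of-the-envelope estimate of the host memory involved. It is only a sketch: it assumes a nominal 70e9 parameters at 2 bytes each (float16), and it assumes that each of the two processes started by --num_gpus 2 loads its own full copy of the weights into CPU memory before they are moved to the GPUs. I have not confirmed that this is what the demo script actually does; the function names are made up for illustration.

```python
# Rough host-memory estimate for loading a float16 model.
# Assumption (not confirmed): every launched process materializes a full
# copy of the weights in CPU RAM before the weights reach the GPUs.

def fp16_weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes) at the given precision."""
    return num_params * bytes_per_param / 1e9


def host_gb_if_each_rank_loads_full_copy(num_params: float, num_ranks: int) -> float:
    """Peak CPU memory in GB if each rank holds its own full copy of the weights."""
    return fp16_weights_gb(num_params) * num_ranks


NOMINAL_PARAMS = 70e9   # nominal parameter count for a "70B" model
NUM_RANKS = 2           # matches --num_gpus 2 in the command above
HOST_RAM_GB = 250       # CPU memory reported above

per_copy = fp16_weights_gb(NOMINAL_PARAMS)
total = host_gb_if_each_rank_loads_full_copy(NOMINAL_PARAMS, NUM_RANKS)

print(f"one float16 copy: {per_copy:.0f} GB")
print(f"{NUM_RANKS} copies:       {total:.0f} GB")
print(f"fits in {HOST_RAM_GB} GB of CPU memory: {total <= HOST_RAM_GB}")
```

Under these assumptions a single copy would fit in 250GB but two copies would not, which would be consistent with the OOM I see. The real peak may be higher because of temporary buffers during checkpoint loading.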
