vLLM help pls :(

#6
by fsaudm - opened

I'm working with vLLM to deploy this. Maybe someone could help me figure some things out :)

I have access to only A100s, and I have 9 nodes with 2 GPUs each (so 18 A100s of 80 GB, 1440 GB in total). I was trying to serve a bf16 version I found here on HF, but I am getting CUDA OOM... even though 685B params in bf16 should be somewhere around 1370 GB (2 bytes per param), plus some overhead. Any thoughts? I am also trying to offload to CPU, but that is not working either...
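As a rough sanity check on those numbers (685B params, bf16 at 2 bytes per param, 18 x 80 GB A100s, all taken from the post; KV cache, activations, and CUDA graph memory not counted):

# Weights-only back-of-envelope; serving overhead comes on top of this.
echo "bf16 weights: $(( 685 * 2 )) GB"            # ~1370 GB
echo "total VRAM:   $(( 18 * 80 )) GB"            # 1440 GB
echo "headroom:     $(( 18 * 80 - 685 * 2 )) GB"  # ~70 GB left for everything else

The command in question: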

vllm serve opensourcerelease/DeepSeek-V3-bf16 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 5000 \
    --gpu-memory-utilization 0.7 \
    --cpu-offload-gb 540 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 9 \
    --trust-remote-code
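One thing worth noting for a setup like this: with --tensor-parallel-size 2 and --pipeline-parallel-size 9, vLLM expects 2 x 9 = 18 GPUs visible across the nodes, and for multi-node runs it typically relies on a Ray cluster spanning all 9 machines, started before vllm serve. A minimal sketch, with the head-node IP and port as placeholders (the exact cluster setup is an assumption, not from the post):

# On the head node (pick any of the 9):
ray start --head --port=6379

# On each of the other 8 nodes, join the same cluster:
ray start --address='<HEAD_NODE_IP>:6379'

# Then launch the vllm serve command above on the head node.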

any thoughts? pls help :(

@fsaudm
It seems that the value for --gpu-memory-utilization should be changed from 0.7 to 0.99.

  • 1440 x 0.7 = 1008 GB
  • 1440 x 0.99 = 1425.6 GB

If it works well, I hope you'll let me know.
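Applied to the command above, that suggestion changes just the one flag (a sketch; every other flag stays exactly as in the original post):

vllm serve opensourcerelease/DeepSeek-V3-bf16 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 5000 \
    --gpu-memory-utilization 0.99 \
    --cpu-offload-gb 540 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 9 \
    --trust-remote-code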

Has this been solved? How did it work?

No, not yet... :( Even with

--gpu-memory-utilization 0.99

I was running out of memory. I'm pretty sure it's the memory overhead. I saw somewhere that a bf16 version would need 1500+ GB, and I am 1 or 2 GPUs short lol

One thing I cannot figure out is why it wouldn't work with --cpu-offload-gb 540.
