Software and environment
- DeepSeek-R1: https://github.com/deepseek-ai/DeepSeek-R1
- vLLM: https://docs.vllm.ai/en/latest/index.html, image version used: vllm-openai:v0.7.2
- Apptainer
- CUDA version: NVIDIA-SMI 550.90.07, Driver Version: 550.90.07, CUDA Version: 12.4
- GPU model: Tesla V100-PCIE-32GB (3x V100 on node c5, 4x V100 on node c6)
Before starting, check that GPU P2P connectivity and the IB NIC links are working properly. For how to check, see: https://wuyea.top/posts/1873922600.html
wuyeblog-pw: vllm-deepseek
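For a quick programmatic look at P2P reachability between the GPUs on one node, a minimal sketch (my own addition, not from the linked post; it only needs PyTorch installed on the node):
# p2p_check.py - rough peer-access check between local GPUs
import torch

n = torch.cuda.device_count()
print(f"visible GPUs: {n}")
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'OK' if ok else 'not available'}")
For the IB links, nvidia-smi topo -m and ibstat remain the more complete checks.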
Environment preparation
Pulling the model
[root@shield-head slurm]# export HF_ENDPOINT=https://hf-mirror.com
[root@shield-head slurm]# pip install huggingface_hub
[root@shield-head slurm]# python
>>> from huggingface_hub import snapshot_download
>>> snapshot_download(repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B", local_dir_use_symlinks=False, local_dir=r"/root/huggingface_model/DeepSeek-R1-Distill-Llama-70B") # about 145 GB in total
Downloading from the official Hugging Face site is slow; you can download from a mirror inside China instead. For details, see the model download section of this article: use the hfd tool to quickly download Hugging Face model weights.
The downloaded model should match the file list shown on Hugging Face, i.e.:
[root@c6 huggingface_model]# ls ./*
./DeepSeek-R1-Distill-Llama-70B:
config.json model-00005-of-000017.safetensors model-00014-of-000017.safetensors
figures model-00006-of-000017.safetensors model-00015-of-000017.safetensors
generation_config.json model-00007-of-000017.safetensors model-00016-of-000017.safetensors
hub model-00008-of-000017.safetensors model-00017-of-000017.safetensors
LICENSE model-00009-of-000017.safetensors model.safetensors.index.json
model-00001-of-000017.safetensors model-00010-of-000017.safetensors README.md
model-00002-of-000017.safetensors model-00011-of-000017.safetensors tokenizer_config.json
model-00003-of-000017.safetensors model-00012-of-000017.safetensors tokenizer.json
model-00004-of-000017.safetensors model-00013-of-000017.safetensors
Building the image
Build the vLLM image (test it as a sandbox first; convert it to an image once everything works):
apptainer build --sandbox vllm-openai-v0.7.2 docker://vllm/vllm-openai:v0.7.2
Running DeepSeek on a single node with multiple GPUs
# Enter the container
apptainer shell --nv -B /share/home/hpcadmin/wuye/huggingface_model:/root/.cache/huggingface --env VLLM_HOST_IP=10.240.214.72 vllm-openai-v0.7.2/
# Run inside the container
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B/ --tensor-parallel-size 4 --dtype half --enable-reasoning --reasoning-parser deepseek_r1 --enable-chunked-prefill=False --max-model-len 27000
......
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-12 17:15:50 launcher.py:29] Route: /version, Methods: GET
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /score, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /invocations, Methods: POST
INFO: Started server process [683166]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
# Parameter notes:
# --tensor-parallel-size 4  [number of tensor-parallel replicas] use 4 GPUs
# --dtype half  [data type for model weights and activations] the V100 does not support Bfloat16, so use half precision
# --enable-chunked-prefill=False  [disable chunked prefill] works around the mma layout conversion error
# --max-model-len 27000  [model context length; if unspecified, derived automatically from the model config] fixes "the model's max seq len is larger than the maximum number of tokens that can be stored". 27000 is the maximum value the error log told me to use; on a single node the limit was 27322, and on multiple nodes it said the limit should be 46288, so any value at or below that limit works.
For the exact reasons behind these parameters, see the section on errors encountered while running the model. Choose parameters according to your own setup.
Once the model is running, you can connect to the API from Open WebUI or any other UI.
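Without a UI, you can also test the OpenAI-compatible endpoint directly. A minimal sketch (my own addition; it assumes the requests package, the host/port from the Uvicorn line above, and that the model name defaults to the path passed to vllm serve, so adjust all of these to your setup):
# chat_test.py - send one chat request to the vLLM OpenAI-compatible server
import requests

payload = {
    "model": "/root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B/",  # must match the served path/name
    "messages": [{"role": "user", "content": "你是谁?"}],
    "max_tokens": 512,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
msg = resp.json()["choices"][0]["message"]
# With --enable-reasoning --reasoning-parser deepseek_r1, the chain of thought is expected
# in a separate reasoning_content field instead of <think> tags inside content.
print(msg.get("reasoning_content"))
print(msg["content"])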
Running DeepSeek on multiple nodes with multiple GPUs
Official vLLM docs on multi-node, multi-GPU inference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
According to the official docs, we first need to get Ray running; only once Ray is up can the model run properly. We translate the docker commands given in the docs into apptainer commands.
Starting the Ray cluster
On the head node (c5), run:
apptainer shell --nv -B /share/home/hpcadmin/wuye/huggingface_model:/root/.cache/huggingface --env VLLM_HOST_IP=192.168.112.21 vllm-openai-v0.7.2/
# Set some environment variables inside the container:
export VLLM_LOGGING_LEVEL=DEBUG # enable debug output
export NCCL_DEBUG=TRACE # enable debug output
export CUDA_VISIBLE_DEVICES=0,1 # use only the specified GPUs; ideally the two GPUs have a P2P connection
export NCCL_SOCKET_IFNAME=ibp101s0 # use the IB NIC
export GLOO_SOCKET_IFNAME=ibp101s0 # use the IB NIC
# To force NCCL to use TCP:
# export NCCL_P2P_DISABLE=1
# export NCCL_IB_DISABLE=1
# export NCCL_NET_GDR_LEVEL=0
Apptainer> ray start --head --node-ip-address 192.168.112.21 --port=6379 --dashboard-host='0.0.0.0' # add --block if you want Ray to run in blocking mode
On the worker node (c6), run:
apptainer shell --nv -B /share/home/hpcadmin/wuye/huggingface_model:/root/.cache/huggingface --env VLLM_HOST_IP=192.168.112.22 vllm-openai-v0.7.2/
# Set some environment variables inside the container:
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_SOCKET_IFNAME=ibs20
# To force NCCL to use TCP:
# export NCCL_P2P_DISABLE=1
# export NCCL_IB_DISABLE=1
# export NCCL_NET_GDR_LEVEL=0
Apptainer> ray start --address=192.168.112.21:6379
# Once Ray is up, run ray status to check the cluster state
Apptainer> ray status
======== Autoscaler status: 2025-02-12 11:03:32.806000 ========
Node status
---------------------------------------------------------------
Active:
1 node_ee3df702dda3a995b820f007634adbd683d5e4905dccd97f5e6c313f
1 node_118394a667ce619f38566fa0691c0b6aee4ebc902156272e0d67e578
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/304.0 CPU
0.0/4.0 GPU # c5 has 3 GPUs and c6 has 4 (7 in total), but because we set CUDA_VISIBLE_DEVICES the two nodes together expose 4
0B/665.73GiB memory
0B/289.31GiB object_store_memory
Demands:
(no resource demands)
Apptainer>
Once the Ray cluster status looks normal, you can run the model. At this point you can also use it just like the single-node multi-GPU setup.
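If you prefer to check the cluster from Python rather than ray status, a small sketch (my own addition; run it inside either container):
# ray_check.py - programmatic view of the Ray cluster resources
import ray

ray.init(address="auto")          # attach to the already-running cluster
print(ray.cluster_resources())    # with the settings above, expect 4.0 GPUs in total
print(ray.available_resources())  # whatever is currently free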
Running the model
Measured in practice: the 70B model takes about 132 GB on disk after download. Memory usage when running it with vLLM across nodes and GPUs: GPU memory 25026 MiB * 8 ≈ 195 GB, host memory 6.6 GB * 8 ≈ 52 GB (3 machines with GPUs distributed 3/3/2, using the command
vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --dtype half --kv-cache-dtype fp8 --enable-chunked-prefill=False --enforce-eager --max_model_len 55000
)
Run the following command on any Ray node; once you see the routes being listed, the deployment has started successfully:
# Model names to run: DeepSeek-R1-Distill-Qwen-32B DeepSeek-R1-Distill-Llama-70B
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --pipeline-parallel-size 2 --dtype half --cpu-offload-gb 20 --enable-reasoning --reasoning-parser deepseek_r1 --kv-cache-dtype fp8_e4m3 --gpu-memory-utilization 0.95 --enable-chunked-prefill=False --max-model-len 106000 --enforce-eager
......
(VllmWorkerProcess pid=683686) INFO 02-12 17:15:50 model_runner.py:1562] Graph capturing finished in 501 secs, took 1.24 GiB
INFO 02-12 17:15:50 model_runner.py:1562] Graph capturing finished in 502 secs, took 1.24 GiB
INFO 02-12 17:15:50 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 529.37 seconds
INFO 02-12 17:15:50 api_server.py:756] Using supplied chat template:
INFO 02-12 17:15:50 api_server.py:756] None
INFO 02-12 17:15:50 launcher.py:21] Available routes are:
INFO 02-12 17:15:50 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 02-12 17:15:50 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 02-12 17:15:50 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-12 17:15:50 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 02-12 17:15:50 launcher.py:29] Route: /health, Methods: GET
INFO 02-12 17:15:50 launcher.py:29] Route: /ping, Methods: POST, GET
INFO 02-12 17:15:50 launcher.py:29] Route: /tokenize, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /detokenize, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-12 17:15:50 launcher.py:29] Route: /version, Methods: GET
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /score, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-12 17:15:50 launcher.py:29] Route: /invocations, Methods: POST
INFO: Started server process [683166]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
# Parameter notes:
# --tensor-parallel-size 2  [number of tensor-parallel replicas] use 2 GPUs per node
# --pipeline-parallel-size 2  [number of pipeline-parallel stages] use 2 nodes
# --dtype half  [data type for model weights and activations] the V100 does not support Bfloat16, so use half precision
# --cpu-offload-gb 20  [space per GPU offloaded to the CPU, in GiB] each GPU additionally uses 20 GB of host memory
# --kv-cache-dtype fp8_e4m3  [data type for KV cache storage] quantizing the KV cache to FP8 reduces its memory footprint
# --gpu-memory-utilization 0.95  [fraction of GPU memory for the model executor, 0 to 1] set GPU memory utilization to 95%; the default is 0.9
# --enable-chunked-prefill=False  [disable chunked prefill] works around the mma layout conversion error
# --max-model-len 106000  [model context length; if unspecified, derived automatically from the model config] set the context length to 106000
# --enforce-eager  [always use eager-mode PyTorch] enabling it helps memory stability
# In general you only need one of max-model-len and cpu-offload-gb. Setting cpu-offload-gb effectively extends GPU memory; as long as it is extended far enough, a long model context no longer causes OOM.
# About the value of max-model-len: with the same 32B model on a single node, without fp8_e4m3 KV-cache quantization the context length could only be set to 27000; after quantizing the KV cache it could go up to 54960, and with GPU memory utilization raised to 0.95 it can reach 106000. (With the 32B model you can skip cpu-offload-gb as long as the context length is set appropriately; if you want the model's maximum context length and memory is still short, cpu-offload-gb is needed after all.)
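The jump from 27000 to roughly 55000 usable tokens after switching the KV cache to fp8 is simply the halving of bytes per cached element. A rough back-of-the-envelope sketch (my own addition; it reads the layer/head counts from the model's config.json and treats free_gib as whatever memory is left for the cache, ignoring tensor-parallel sharding, so the numbers are only an estimate):
# kv_budget.py - rough KV-cache token budget for a given amount of free memory
import json

def max_tokens(config_path, free_gib, bytes_per_elem):
    cfg = json.load(open(config_path))
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
    # per token: K and V, for every layer and every KV head
    per_token = 2 * cfg["num_hidden_layers"] * kv_heads * head_dim * bytes_per_elem
    return int(free_gib * 1024**3 // per_token)

cfg = "/root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B/config.json"
print(max_tokens(cfg, free_gib=10, bytes_per_elem=2))  # fp16 KV cache
print(max_tokens(cfg, free_gib=10, bytes_per_elem=1))  # fp8 KV cache: roughly double the tokens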
A small trick for multi-node, multi-GPU runs on a Ray cluster: if, say, you want 2 nodes with 2 GPUs each, prefer --tensor-parallel-size 4 over --tensor-parallel-size 2 --pipeline-parallel-size 2. The effect is the same, but the error messages are not! Once the run works with only --tensor-parallel-size, switching to the other form is entirely feasible. For example, when an OOM was caused by max-model-len being too large, the run with only --tensor-parallel-size told me what max-model-len must not exceed, whereas the run with --tensor-parallel-size 2 --pipeline-parallel-size 2 just died with OOM. If this helped, remember to follow and like to support me, thanks!
Tips: there is no need to burn CPU memory chasing the model's maximum context length, because it lowers inference efficiency and is not worth it!!! Whenever --max-model-len can solve the problem, avoid --cpu-offload-gb.
Errors encountered while running the model
I hit many errors while debugging. For each one I include the log, the command that triggered it, and how it was resolved.
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0.
Error log:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 2 --pipeline-parallel-size 2 ...... ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half. ERROR 02-12 10:52:35 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 617604 died, exit code: -15 INFO 02-12 10:52:35 multiproc_worker_utils.py:128] Killing local vLLM worker processes
Solution: use the --dtype flag to change the data type of the model weights.
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 2 --pipeline-parallel-size 2 --dtype half
ValueError: Total number of attention heads (64) must be divisible by tensor parallel size (6).
Error log:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 3 --pipeline-parallel-size 2 --dtype half ...... ValueError: Total number of attention heads (64) must be divisible by tensor parallel size (3).
Cause: num_attention_heads in the model's config.json is not divisible by tensor_parallel_size. Solutions:
Method 1: change vLLM's tensor_parallel_size so that it divides the number of attention heads of the deployed model; the head count is the num_attention_heads field in the model's config.json (see the small helper after Method 3). Editing the head count itself is not actually workable; it just triggers another not-divisible error.
Method 2: since the goal is to run on 6 GPUs, swap the factors, e.g. use --tensor-parallel-size 2 --pipeline-parallel-size 3. That is still 6 GPUs, and 64 is divisible by a tensor-parallel size of 2. (Verified to work; note that here I used 2 nodes, c5 with 2 GPUs and c6 with 4, and forced the TCP connection mode [see error 6].)
Method 3: the root cause is that with 4 GPUs we hit torch.OutOfMemoryError: CUDA out of memory., so we added two more GPUs, which then could not be used. Another way out, when there is enough host memory, is to use CPU memory via --cpu-offload-gb. Final command:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 2 --pipeline-parallel-size 2 --dtype half --cpu-offload-gb 20
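A tiny helper for checking compatible sizes before launching (my own addition; it just reads num_attention_heads from the model's config.json):
# tp_check.py - list tensor-parallel sizes that divide the model's attention heads
import json

cfg = json.load(open("/root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B/config.json"))
heads = cfg["num_attention_heads"]  # 64 for this model
print("attention heads:", heads)
print("usable --tensor-parallel-size values up to 8:",
      [tp for tp in range(1, 9) if heads % tp == 0])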
Sending a request makes the model crash and exit: mma layout conversion is only supported on Ampere.
The error says that MMA-to-MMA layout conversion is only supported on Ampere-or-newer hardware (e.g. NVIDIA A10, A100, H100) and not on earlier architectures, which our V100s obviously are.
Error log:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 2 --pipeline-parallel-size 2 --dtype half --cpu-offload-gb 20 ...... INFO: 10.240.214.77:58610 - "GET /v1/models HTTP/1.1" 200 OK INFO 02-12 17:21:29 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this. INFO 02-12 17:21:29 logger.py:39] Received request chatcmpl-78676df7895048c2b88e6084efb85083: prompt: '<|begin▁of▁sentence|><|User|>你是谁?<|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131063, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None. INFO: 10.240.214.77:59914 - "POST /v1/chat/completions HTTP/1.1" 200 OK INFO 02-12 17:21:29 async_llm_engine.py:211] Added request chatcmpl-78676df7895048c2b88e6084efb85083. python3: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed. python3: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
Solution:
Add --enable-chunked-prefill=False and the problem goes away (while running, the log actually prints a warning telling you this flag can be used). According to the official docs (https://docs.vllm.ai/en/stable/features/reasoning_outputs.html), you can also add --enable-reasoning --reasoning-parser deepseek_r1 to run the DeepSeek reasoning model; in my testing, though, it makes little difference, and with it enabled the DeepSeek thinking markers disappear from the output, so you have to work out yourself which part is the reasoning and which is the summary, which is actually less convenient.
The model's max sequence length is larger than the maximum number of tokens that can be used.
Error log:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --pipeline-parallel-size 2 --dtype half --cpu-offload-gb 20 --enable-chunked-prefill=False ...... ERROR 02-13 13:48:33 engine.py:389] ValueError: The model’s max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (46288). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. ERROR 02-13 13:48:36 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 902872 died, exit code: -15
Solution: add --max-model-len 46200 as the message suggests. Any value at or below the reported limit works; keeping it close to that maximum is recommended, otherwise it may limit how much context the model can use.
When running the distributed model, it hangs before the model comes up?
Error log:
......
WARNING 02-13 16:52:15 ray_utils.py:180] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 54df84f5aec0d46965a184ca55e7fe95ec8d5b19a9055d817e7920c4. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
......
c5:2498015:2498015 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
c5:2498015:2498015 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
c5:2498015:2498015 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
c5:2498015:2498015 [0] NCCL INFO ncclCommInitRank comm 0xd060c10 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0x1214e3555fe7d049 - Init COMPLETE
(RayWorkerWrapper pid=2498397) WARNING 02-13 16:53:12 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=2498397) INFO 02-13 16:53:12 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='192.168.112.21', local_reader_ranks=[], buffer_handle=None, local_subscribe_port=None, remote_subscribe_port=40851)
......
# A long wait here (about half an hour). It is not actually stuck; keep waiting and eventually you will see the error below:
ERROR 02-13 17:23:12 worker_base.py:574] Error executing method 'init_device'. This might cause deadlock in distributed execution.
ERROR 02-13 17:23:12 worker_base.py:574] Traceback (most recent call last):
ERROR 02-13 17:23:12 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 566, in execute_method
ERROR 02-13 17:23:12 worker_base.py:574] return run_method(target, method, args, kwargs)
ERROR 02-13 17:23:12 worker_base.py:574] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......
ERROR 02-13 17:23:12 worker_base.py:574] work.wait()
ERROR 02-13 17:23:12 worker_base.py:574] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
Solutions:
Option 1: use the IB NIC, and use GPUs that have a P2P connection.
There are actually several possible causes here and it is hard to say which one it was: the uneven GPU allocation across nodes (3 GPUs used on one node, 1 on the other), a CUDA P2P connectivity problem, or an RDMA problem.
c5 has 3 GPUs and c6 has 4. I ran serve on c5, which by default claimed all of c5's GPUs, with one GPU on c6 making up the rest, so CUDA_VISIBLE_DEVICES is also needed to make both nodes use the same number of GPUs. Several things got fixed at this point. First, use nvidia-smi topo -m to check GPU P2P connectivity; taking node c6 as an example:
Apptainer> nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     0-37,76-113     0               N/A
GPU1    NODE     X      SYS     SYS     0-37,76-113     0               N/A
GPU2    SYS     SYS      X      NODE    38-75,114-151   1               N/A
GPU3    SYS     SYS     NODE     X      38-75,114-151   1               N/A
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
# From this output, GPU0 is paired with GPU1 and GPU2 with GPU3.
Once the GPU connectivity is known, re-enter the container and set the following environment variables (adjust to each node's actual hardware; both nodes need them):
export VLLM_LOGGING_LEVEL=DEBUG # debug logging only needs to be set on the node that starts the server
export NCCL_DEBUG=TRACE # debug logging only needs to be set on the node that starts the server
export CUDA_VISIBLE_DEVICES=0,1 # device IDs of the GPUs to use; set on every node, according to how many you actually need
export NCCL_SOCKET_IFNAME=ibp101s0 # the NIC on the same IP segment on each node; set on every node!
export GLOO_SOCKET_IFNAME=ibp101s0 # the NIC on the same IP segment on each node; set on every node! (prefer the IB NIC; check with ifconfig)
Then start Ray again with the Ray head node IP set to the IB NIC's IP, and finally start the model server:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --pipeline-parallel-size 2 --dtype half --cpu-offload-gb 20 --enable-chunked-prefill=False
Option 2: force NCCL to use TCP [see the solution to error 6]. A quick cross-node communication sanity check is sketched below.
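Before digging deeper into vLLM, it can help to confirm that cross-node GPU communication works at all. A minimal sketch with plain torch.distributed (my own addition, a generic NCCL sanity test rather than anything from the vLLM docs; it assumes torchrun is available inside the container and relies on the same NCCL_*/GLOO_SOCKET_IFNAME variables as above):
# comm_check.py - launch on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0 or 1> \
#            --master_addr=192.168.112.21 --master_port=29500 comm_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# every rank contributes 1.0; after all_reduce each rank should see the world size
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()} (expected {dist.get_world_size()})")
dist.destroy_process_group()
If this hangs or fails in the same way, the problem is in NCCL or the network setup, not in vLLM or Ray.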
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Error log:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 2 --pipeline-parallel-size 3 --dtype half --kv-cache-dtype fp8_e4m3 --enable-chunked-prefill=False --max-model-len 106000 --enforce-eager ...... (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] Error executing method 'init_device'. This might cause deadlock in distributed execution. (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] Traceback (most recent call last): (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 566, in execute_method (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] return run_method(target, method, args, kwargs) (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] return func(*args, **kwargs) (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ^^^^^^^^^^^^^^^^^^^^^ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 166, in init_device (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] init_worker_distributed_environment(self.vllm_config, self.rank, (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 506, in init_worker_distributed_environment (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size, (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1103, in ensure_model_parallel_initialized (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] initialize_model_parallel(tensor_model_parallel_size, (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1064, in initialize_model_parallel (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] _PP = init_model_parallel_group(group_ranks, (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 876, in init_model_parallel_group (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] return GroupCoordinator( (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ^^^^^^^^^^^^^^^^^ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File 
"/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 218, in __init__ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] self.pynccl_comm = PyNcclCommunicator( (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ^^^^^^^^^^^^^^^^^^^ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] self.comm: ncclComm_t = self.nccl.ncclCommInitRank( (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm), (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] raise RuntimeError(f"NCCL error: {error_str}") (RayWorkerWrapper pid=81149, ip=192.168.112.16) ERROR 02-18 14:45:16 worker_base.py:574] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details) c5:3941396:3941396 [0] NCCL INFO ncclCommInitRank comm 0x15111fa0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 17000 commId 0xa591fd24b4e7294c - Init COMPLETE
Cause: either an NCCL version compatibility problem, or a communication problem because InfiniBand (IB) is not configured correctly on the cluster. NCCL relies on IB/RDMA (remote direct memory access) for efficient GPU-to-GPU communication.
Solution:
Force NCCL to use TCP instead of RDMA. Before starting the Ray cluster, run the following on every Ray node:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_NET_GDR_LEVEL=0
When running the distributed model, OOM during Capturing cudagraphs for decoding.
Error log:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 4 --dtype half --enable-reasoning --reasoning-parser deepseek_r1 --enable-chunked-prefill=False --kv-cache-dtype fp8_e4m3 --cpu-offload-gb 5 ...... (RayWorkerWrapper pid=1104765, ip=192.168.112.22) INFO 02-14 10:31:42 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. Capturing CUDA graph shapes: 3%|█▉ | 1/35 [00:03<01:51, 3.28s/it]DEBUG 02-14 10:31:46 client.py:191] Waiting for output from MQLLMEngine. Capturing CUDA graph shapes: 14%|█████████▊ | 5/35 [00:13<01:18, 2.63s/it]DEBUG 02-14 10:31:56 client.py:191] Waiting for output from MQLLMEngine. Capturing CUDA graph shapes: 26%|█████████████████▋ | 9/35 [00:23<01:06, 2.55s/it]DEBUG 02-14 10:32:06 client.py:191] Waiting for output from MQLLMEngine. Capturing CUDA graph shapes: 37%|█████████████████████████▎ | 13/35 [00:33<00:55, 2.51s/it]DEBUG 02-14 10:32:16 client.py:191] Waiting for output from MQLLMEngine. Capturing CUDA graph shapes: 43%|█████████████████████████████▏ | 15/35 [00:38<00:50, 2.51s/it](RayWorkerWrapper pid=2714234) *** SIGSEGV received at time=1739500342 on cpu 52 *** (RayWorkerWrapper pid=2714234) PC: @ 0x143e35049b8a (unknown) addProxyOpIfNeeded() ...... (raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffd68f510cb721ef5e6429668006000000 Worker ID: c6cf07079464c166417eed790f152c97e60d7a43371740cf7c226718 Node ID: 6b61278fe903b00b3cad2ca35dd8d89b7743402dee08bb56ad5d4dfd Worker IP address: 192.168.112.21 Worker port: 10017 Worker PID: 2714234 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Cause: without --cpu-offload-gb 5 the only complaint was that max-model-len was too long. To be able to use the full model context I added 5 GB with --cpu-offload-gb, so in theory it should not have exited with OOM. Over repeated tests it sometimes succeeded and sometimes did not, which points to memory instability. This kind of error occasionally shows up mid-run as well, and in most cases it comes down to memory blowing up.
Tips: there is no need to burn CPU memory chasing the maximum context length, because it lowers inference efficiency and is not worth it!!! Whenever --max-model-len can solve the problem, avoid --cpu-offload-gb.
Solutions:
Option 1 (best): add --enforce-eager to improve stability. The final command after the change:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 4 --dtype half --enable-reasoning --reasoning-parser deepseek_r1 --enable-chunked-prefill=False --kv-cache-dtype fp8_e4m3 --cpu-offload-gb 5 --enforce-eager
Option 2: the log shows several other parameters that can also help. For example, here we enlarge the CPU offload and lower the GPU memory utilization; the final command after the change:
Apptainer> vllm serve /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 4 --dtype half --enable-reasoning --reasoning-parser deepseek_r1 --enable-chunked-prefill=False --kv-cache-dtype fp8_e4m3 --cpu-offload-gb 10 --gpu-memory-utilization 0.8