Software and Environment
- DeepSeek-R1: https://github.com/deepseek-ai/DeepSeek-R1
- SGLang: https://github.com/sgl-project/sglang, image version used: lmsysorg/sglang:v0.4.3-cu124
- Apptainer: https://apptainer.org/documentation/
- CUDA version: NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4
- GPU model: Quadro RTX 5000 16GB. (The Tesla V100-PCIE-32GB is not supported by SGLang because its compute capability is too low; see the sm75 error section below.)
Environment Preparation
Pulling the Model
[root@shield-head slurm]# export HF_ENDPOINT=https://hf-mirror.com
[root@shield-head slurm]# pip install huggingface_hub
[root@shield-head slurm]# python
>>> from huggingface_hub import snapshot_download
>>> snapshot_download(repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B", local_dir_use_symlinks=False, local_dir=r"/root/huggingface_model/DeepSeek-R1-Distill-Llama-70B") # about 145 GB
After downloading, the model files should match the file list on Hugging Face, i.e.:
[root@c6 huggingface_model]# ls ./*
./DeepSeek-R1-Distill-Llama-70B:
config.json model-00005-of-000017.safetensors model-00014-of-000017.safetensors
figures model-00006-of-000017.safetensors model-00015-of-000017.safetensors
generation_config.json model-00007-of-000017.safetensors model-00016-of-000017.safetensors
hub model-00008-of-000017.safetensors model-00017-of-000017.safetensors
LICENSE model-00009-of-000017.safetensors model.safetensors.index.json
model-00001-of-000017.safetensors model-00010-of-000017.safetensors README.md
model-00002-of-000017.safetensors model-00011-of-000017.safetensors tokenizer_config.json
model-00003-of-000017.safetensors model-00012-of-000017.safetensors tokenizer.json
model-00004-of-000017.safetensors model-00013-of-000017.safetensors
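Optionally, you can also compare the on-disk size against the ~145 GB noted above (a quick sanity check; the path matches the local_dir used earlier):
[root@c6 huggingface_model]# du -sh ./DeepSeek-R1-Distill-Llama-70B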
Building the Image
Build the SGLang image (test in a sandbox first; convert it to an image once everything works):
apptainer build --sandbox sglang-v0.4.3-cu124 docker://lmsysorg/sglang:v0.4.3-cu124
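Once the sandbox has been verified, it can be converted into a regular image file (standard Apptainer usage; the .sif file name here is just an example):
apptainer build sglang-v0.4.3-cu124.sif sglang-v0.4.3-cu124/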
Running on a Single Node with Multiple GPUs
# Enter the container
singularity shell --nv -B /share/home/hpcadmin/llz/huggingface_model:/root/.cache/huggingface sglang-v0.4.3-cu124/
# After entering the container, run:
# Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Llama-8B/ --tp 2 --host 10.240.214.89 --port 30000 # two 16G cards are just enough for the 8B model
Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-14B/ --tp 2 --host 10.240.214.89 --port 30000 --cpu-offload-gb 25 --mem-fraction-static 0.7 # the 14B model does not fit on two 16G cards, so part of it must be offloaded to CPU
......
[2025-02-28 15:36:29 TP1] Registering 260 cuda graph addresses
[2025-02-28 15:36:29 TP0] Registering 260 cuda graph addresses
[2025-02-28 15:36:29 TP1] Capture cuda graph end. Time elapsed: 3.94 s
[2025-02-28 15:36:29 TP0] Capture cuda graph end. Time elapsed: 3.94 s
[2025-02-28 15:36:30 TP0] max_total_num_tokens=96811, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-28 15:36:30 TP1] max_total_num_tokens=96811, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-28 15:36:30] INFO: Started server process [1629242]
[2025-02-28 15:36:30] INFO: Waiting for application startup.
[2025-02-28 15:36:30] INFO: Application startup complete.
[2025-02-28 15:36:30] INFO: Uvicorn running on http://10.240.214.89:30000 (Press CTRL+C to quit)
[2025-02-28 15:36:31] INFO: 10.240.214.89:36314 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-28 15:36:31 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-28 15:36:33] INFO: 10.240.214.89:36316 - "POST /generate HTTP/1.1" 200 OK
[2025-02-28 15:36:33] The server is fired up and ready to roll!
# Parameter explanations:
# --tp 2 [number of tensor-parallel replicas] use 2 GPUs
# --cpu-offload-gb [space offloaded to CPU per GPU, in GiB] each card additionally uses 25 GB of CPU memory
# --mem-fraction-static [fraction of available GPU memory used for static memory such as model weights and the KV cache; increase it if building the KV cache fails, decrease it if CUDA runs out of memory]
# --port and --host [host and port of the HTTP server]; the defaults are host: str = "127.0.0.1" and port: int = 30000
Adjust the parameters to your own situation; for the official explanations, see: https://docs.sglang.ai/backend/server_arguments.html
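As a rough worked example of the memory budget (FP16 weights assumed): a 14B-parameter model needs about 14B × 2 bytes ≈ 28 GB for the weights alone. With --tp 2 on two 16 GB cards and --mem-fraction-static 0.7, roughly 16 GB × 0.7 ≈ 11.2 GB per card (≈ 22.4 GB total) is reserved for weights plus KV cache, which is already less than the 28 GB of weights; the shortfall is what --cpu-offload-gb has to absorb.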
Once the model is running, you can connect Open WebUI or other UIs to the API.
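For a minimal smoke test (assuming the host and port from the launch command above), you can call SGLang's native /generate endpoint directly:
curl http://10.240.214.89:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, who are you?", "sampling_params": {"max_new_tokens": 64, "temperature": 0.7}}'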
Running on Multiple Nodes with Multiple GPUs (not actually tested due to insufficient resources!!!)
See my other blog post: deploying DeepSeek large models with vLLM across multiple nodes and GPUs (with debugging tips).
SGLang official documentation on multi-node inference: https://docs.sglang.ai/references/multi_node.html
Running the Model
Run on the master node (c5):
apptainer shell --nv -B /share/home/hpcadmin/llz/huggingface_model:/root/.cache/huggingface sglang-v0.4.3-cu124/
# After entering the container, run:
export NCCL_DEBUG=TRACE # enable debug output
export CUDA_VISIBLE_DEVICES=0,1 # use the specified GPU cards; ideally the two GPUs have a P2P link
export NCCL_SOCKET_IFNAME=ibp101s0 # use the InfiniBand NIC
export GLOO_SOCKET_IFNAME=ibp101s0
Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B/ --tp 4 --dist-init-addr 192.168.12.21:5000 --nnodes 2 --node-rank 0 --host 0.0.0.0 --port 40000
Run on the worker node (c6):
apptainer shell --nv -B /share/home/hpcadmin/llz/huggingface_model:/root/.cache/huggingface sglang-v0.4.3-cu124/
# After entering the container, run:
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_SOCKET_IFNAME=ibs20
Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B/ --tp 4 --dist-init-addr 192.168.12.21:5000 --nnodes 2 --node-rank 1 --host 0.0.0.0 --port 40000
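Once both nodes report that the server is up, clients talk only to the master node (node-rank 0) on its --port. A quick check against the addresses used above (assuming the master is reachable at 192.168.12.21; /get_model_info is the same endpoint seen in the single-node log earlier):
curl http://192.168.12.21:40000/get_model_info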
Errors Encountered While Running the Model
Errors hit during debugging, and how they were resolved.
RuntimeError: SGLang only supports sm75 and above.
The error log is as follows:
Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-32B/ --tp 4 --dist-init-addr 192.168.12.21:5000 --nnodes 2 --node-rank 0 --trust-remote-code
......
[2025-02-17 17:06:32 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 194, in __init__
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 301, in load_model
    raise RuntimeError("SGLang only supports sm75 and above.")
RuntimeError: SGLang only supports sm75 and above.
This is because FlashInfer, the default attention kernel backend, only supports sm75 and above.
Issue 1146 explicitly states that sm70 is not supported, so use a different deployment method, such as vLLM.
Solution: switch to a different deployment method.
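To check a card's compute capability before deploying (a minimal sketch; the compute_cap query field requires a reasonably recent NVIDIA driver):
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Quadro RTX 5000 reports 7.5 (sm75, supported); Tesla V100 reports 7.0 (sm70, below SGLang's minimum)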
Not enough memory, or CUDA out of memory
The error log is as follows (running the 14B model on two 16 GB RTX 5000 cards):
Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-14B/ --tp 2 --host 10.240.214.89 --port 30000
......
[2025-02-28 16:08:48 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=4.52 GB
[2025-02-28 16:08:48 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=4.52 GB
[2025-02-28 16:08:48 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 215, in __init__
    self.init_memory_pool(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 628, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
[2025-02-28 16:08:48 TP0] Scheduler hit an exception: (same traceback as TP1 above)
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
Solution: since GPU memory is insufficient, offload part of the model to CPU memory; if the same error persists, increase the amount offloaded to CPU.
One odd thing about SGLang here: for a model of a bit over 20 GB, around 50 GB of combined memory ought to be roughly enough, yet offloading required 25 GiB per GPU. With two cards, total usage is 16 GB × 2 of VRAM plus 25 GB × 2 of CPU RAM; it looks as if the portion that does not fit on the GPUs is copied once per rank into CPU memory.
With vLLM's offload, by contrast, the part of the model that does not fit on the GPUs appears to be split across CPU memory, and a cpu-offload of only 15 GB was needed.
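Rough arithmetic (FP16 assumed) makes the discrepancy concrete: 28 GB of weights minus the ≈ 22.4 GB that two cards budget at --mem-fraction-static 0.7 leaves only about 6 GB that must live off-GPU in the ideal case, yet 25 GiB per GPU (50 GiB total) was needed; this is consistent with each tensor-parallel rank keeping its own CPU copy of the offloaded portion rather than sharing one.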
Apptainer> python3 -m sglang.launch_server --model-path /root/.cache/huggingface/DeepSeek-R1-Distill-Qwen-14B/ --tp 2 --host 10.240.214.89 --port 30000 --cpu-offload-gb 25 --mem-fraction-static 0.7