Deploying a vLLM Service on the Hygon BW Accelerator Card
Mar 03, 2026 · 4037 characters
I recently got free access to a Hygon BW accelerator card on the Supercomputing Internet (超算互联网) platform and tried deploying a vLLM service on it.
Hygon BW is a Chinese-made AI accelerator card based on the Hygon DCU, developed by Chengdu Hygon Integrated Circuit Design Co., Ltd. It is compatible with the ROCm software ecosystem and supports mainstream deep learning frameworks, making it a domestic alternative to NVIDIA CUDA GPUs for workloads such as large-model inference and scientific computing.
The server in use:
root@crdnotebook-2028865973678886914-accol683th-15971:~# fastfetch --logo none
-----------------------------------------------------
OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish) x86_64
Host: CY62-T58
Kernel: Linux 5.10.134-17.1.3.sga8.x86_64
Uptime: 20 hours, 17 mins
Packages: 1070 (dpkg)
Shell: bash 5.1.16
Theme: Yaru [GTK3]
Icons: Yaru [GTK3]
Cursor: Adwaita
Terminal: /dev/pts/0 8.9p1 Ubuntu-3ubuntu0.13
CPU: 2 x Hygon C86 (128) @ 2.50 GHz
GPU: ASPEED Technology, Inc. ASPEED Graphics Family
Memory: 38.27 GiB / 503.19 GiB (8%)
Swap: Disabled
Disk (/): 335.82 GiB / 436.09 GiB (77%) - overlay
Local IP (eth0): 172.20.129.106/32
Locale: C
Accelerator card info:
root@crdnotebook-2028865973678886914-accol683th-15971:~# rocm-smi --showproductname --showmeminfo vram
================================= System Management Interface ==================================
================================================================================================
HCU[0] : Card Series: BW
HCU[0] : Card Vendor: C-3000 IC Design Co., Ltd.
================================================================================================
================================================================================================
HCU[0] : vram Total Memory (MiB): 65520
HCU[0] : vram Total Used Memory (MiB): 57766
================================================================================================
======================================== End of SMI Log ========================================
Here we simply pick an image preinstalled with PyTorch 2.4.1 + DTK 25.04 + Python 3.10. The corresponding model files can be downloaded directly on the platform, under /root/private_data/SothisAI/model/Aihub/*. In this post we use the Qwen3-0.6B model.
Then start the vLLM service:
vllm serve /root/private_data/SothisAI/model/Aihub/Qwen3-0.6B/main/Qwen3-0.6B --served-model-name Qwen3-0.6B --tensor-parallel-size 1 --max-model-len 4096 --dtype bfloat16 --enforce-eager
...
INFO 03-04 01:10:16 [loader.py:460] Loading weights took 3.35 seconds
INFO 03-04 01:10:16 [model_runner.py:1155] Model loading took 1.1201 GiB and 3.505950 seconds
INFO 03-04 01:10:21 [worker.py:287] Memory profiling takes 5.16 seconds
INFO 03-04 01:10:21 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (0.90) = 57.59GiB
INFO 03-04 01:10:21 [worker.py:287] model weights take 1.12GiB; non_torch_memory takes 0.42GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 54.66GiB.
INFO 03-04 01:10:22 [executor_base.py:112] # rocm blocks: 31981, # CPU blocks: 2340
INFO 03-04 01:10:22 [executor_base.py:117] Maximum concurrency for 4096 tokens per request: 124.93x
INFO 03-04 01:10:27 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 11.10 seconds
WARNING 03-04 01:10:28 [config.py:1241] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 03-04 01:10:28 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 03-04 01:10:28 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 03-04 01:10:28 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8020
...
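The "Maximum concurrency" figure in the log follows directly from the block count. A quick sanity check of the arithmetic, assuming vLLM's default KV-cache block size of 16 tokens per block:

```python
# Reproduce the numbers vLLM printed in the startup log above.
# Assumption: the default KV-cache block size of 16 tokens per block.
BLOCK_SIZE = 16

gpu_blocks = 31981        # "# rocm blocks" from the log
max_model_len = 4096      # --max-model-len

# Total token slots available in the GPU KV cache.
kv_cache_tokens = gpu_blocks * BLOCK_SIZE
# How many maximum-length requests those slots could hold at once.
concurrency = kv_cache_tokens / max_model_len

print(kv_cache_tokens)        # 511696
print(round(concurrency, 2))  # 124.93 -> matches "Maximum concurrency ... 124.93x"
```

So the 54.66 GiB reserved for the KV cache translates into roughly 512 K token slots, enough for about 125 concurrent requests at the full 4096-token context.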
For how to install Open WebUI, see Quick Start | Open WebUI.
Then start Open WebUI:
open-webui serve
In the admin panel, add an external OpenAI-compatible connection pointing at http://localhost:8020/v1 (the port the vLLM API server is listening on, per the startup log above), and it is ready to use.
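The endpoint can also be called directly, without going through Open WebUI. A minimal sketch assuming the server is listening locally on port 8020 (as in the startup log) and the model was registered as Qwen3-0.6B via --served-model-name; build_chat_request and send are hypothetical helper names:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # matches the model's default sampling params from the log
    }

def send(payload: dict, base_url: str = "http://localhost:8020/v1") -> dict:
    """POST the payload to the vLLM server and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Qwen3-0.6B", "Hello!")
# reply = send(payload)  # requires the vLLM server from above to be running
# print(reply["choices"][0]["message"]["content"])
```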
Metrics: Avg prompt throughput: 75.1 tokens/s, Avg generation throughput: 39.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
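These periodic stats lines are handy to scrape for simple monitoring. A minimal parser sketch, assuming the exact log format shown above (parse_vllm_metrics is a hypothetical helper, not part of vLLM):

```python
import re

def parse_vllm_metrics(line: str) -> dict:
    """Extract the numeric fields from a vLLM periodic stats log line."""
    return {k.strip(): float(v)
            for k, v in re.findall(r"([A-Za-z ]+): ([\d.]+)", line)}

line = ("Avg prompt throughput: 75.1 tokens/s, Avg generation throughput: 39.0 tokens/s, "
        "Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, "
        "GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.")
metrics = parse_vllm_metrics(line)
print(metrics["Avg generation throughput"])  # 39.0
```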