Large Model Deployment

300I DUO

Reference: https://blog.csdn.net/weixin_45724963/article/details/149979566?utm_medium=distribute.pc_relevant.none-task-blog-2defaultbaidujs_utm_term~default-0-149979566-blog-148839465.235

Base image
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:1.0.0-300I-Duo-py311-openeuler24.03-lts

1) Start a container from the image

docker run -it  --net=host --shm-size=1g \
    --name Qwen2.5-14B-Instruct \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /root/hw/Qwen/Qwen2.5-14B-Instruct:/model:ro \
    -v /root/hw:/root/hw \
harbor.huaweisoft.com/huaweisoft/ai/mindie:1.0.0-300I-Duo-py311-openeuler24.03-lts bash

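Note two things here. First, -v /root/hw/Qwen/Qwen2.5-14B-Instruct:/model:ro maps the model directory directly to /model.
Second, use -v /root/hw:/root/hw. Do not write -v /root:/root, because that overwrites /root inside the container.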
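
Before running the command above, it can help to confirm that every device node passed with --device actually exists on the host. The following is a minimal sketch, not part of the original procedure: the helper name missing_devices is made up for illustration, and the list simply mirrors the --device flags above.

```shell
# Hypothetical helper: print every path from the argument list that does not exist.
missing_devices() {
    for dev in "$@"; do
        [ -e "$dev" ] || printf '%s\n' "$dev"
    done
}

# Device nodes taken from the docker run command above.
missing=$(missing_devices /dev/davinci_manager /dev/hisi_hdc /dev/devmm_svm \
                          /dev/davinci6 /dev/davinci7)

if [ -n "$missing" ]; then
    echo "missing device nodes:"
    echo "$missing"
else
    echo "all device nodes present"
fi
```

If a node is reported missing, running npu-smi info on the host shows which NPUs the driver actually sees.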

2) Modify the configuration

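  1. In the model weights' config.json, change the torch_dtype field to float16.
  2. chmod 750 /model/config.json
     (/model is mounted read-only above, so if steps 1 and 2 fail inside the container, apply them on the host to /root/hw/Qwen/Qwen2.5-14B-Instruct/config.json.)
  3. Edit the MindIE config file: vim /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
    ① Change the 3 listening ports; otherwise they conflict when several models are deployed on the same Huawei server.
    ② Set the NPU devices: host devices 6 and 7 are mapped in and are visible inside the container as 0 and 1. Set worldSize to 2.
    ③ Set the model path (here /model, the mount target used above).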
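
The edits in this step can also be scripted instead of made by hand in vim. The sketch below uses sed to rewrite single values in place, which leaves every other key in the file untouched. The function names are invented here. The key names port, managementPort, metricsPort and worldSize in the usage comments are assumptions about the MindIE config.json layout and should be checked against your own file, and the port numbers are placeholders. It assumes GNU sed and that each key sits on one line and occurs once.

```shell
# Set torch_dtype to float16 in a model config.json (step 1).
set_torch_dtype_fp16() {
    sed -i -E 's/("torch_dtype"[[:space:]]*:[[:space:]]*)"[^"]*"/\1"float16"/' "$1"
}

# Set a numeric JSON value. Usage: set_json_number FILE KEY VALUE
# The leading quote in the pattern keeps "port" from matching "managementPort".
set_json_number() {
    sed -i -E "s/(\"$2\"[[:space:]]*:[[:space:]]*)[0-9]+/\\1$3/" "$1"
}

# Usage sketch (key names and port values are assumptions, see above):
#   On the host, because /model is mounted read-only in the container:
#     set_torch_dtype_fp16 /root/hw/Qwen/Qwen2.5-14B-Instruct/config.json
#   Inside the container:
#     cfg=/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
#     cp -p "$cfg" "$cfg.bak"
#     set_json_number "$cfg" port 1035
#     set_json_number "$cfg" managementPort 1036
#     set_json_number "$cfg" metricsPort 1037
#     set_json_number "$cfg" worldSize 2
```

The device ID list and the model path are not simple numbers, so they are still easiest to change in vim as described above.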

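3) After the changes, persist the image and start the service

Use docker commit to save the modified container as a new image, then recreate the container from that image.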
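
A minimal sketch of the commit step follows. The container name and the target tag (the original tag with -qwen appended) are the ones that appear in the commands in this post; the function name and the DOCKER override for dry runs are made up for illustration. The old container is removed because the new one reuses the same --name.

```shell
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to print the commands instead of running them

# commit_and_remove CONTAINER IMAGE: snapshot the container as IMAGE, then remove it.
commit_and_remove() {
    "$DOCKER" commit "$1" "$2" &&
    "$DOCKER" rm -f "$1"
}

# Usage:
#   commit_and_remove Qwen2.5-14B-Instruct \
#       harbor.huaweisoft.com/huaweisoft/ai/mindie:1.0.0-300I-Duo-py311-openeuler24.03-lts-qwen
```

docker commit captures the container's own filesystem, so the edited MindIE config.json is kept. Bind-mounted directories such as /model and /root/hw are not part of the image and must be mounted again.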

docker run -itd  --restart  always --net=host --shm-size=1g \
--name Qwen2.5-14B-Instruct \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /usr/local/sbin:/usr/local/sbin:ro \
-v /root/hw/Qwen/Qwen2.5-14B-Instruct:/model:ro \
-v /root/hw:/root/hw \
harbor.huaweisoft.com/huaweisoft/ai/mindie:1.0.0-300I-Duo-py311-openeuler24.03-lts-qwen bash /root/hw/run_qwen2.sh
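
The command above runs /root/hw/run_qwen2.sh, which this post does not show. What follows is a purely hypothetical sketch of such a script, not the author's file. The daemon location is inferred from the config path and from the shell prompt in the error logs further down. Sourcing the set_env.sh files is an assumption (a non-interactive bash does not read ~/.bashrc), and each file is sourced only if it exists.

```shell
# Hypothetical: write a candidate start script to the path given as $1.
write_start_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
# Assumed environment scripts; adjust to what your image actually provides.
for f in /usr/local/Ascend/ascend-toolkit/set_env.sh \
         /usr/local/Ascend/nnal/atb/set_env.sh \
         /usr/local/Ascend/atb-models/set_env.sh \
         /usr/local/Ascend/mindie/set_env.sh; do
    [ -f "$f" ] && . "$f"
done
cd /usr/local/Ascend/mindie/latest/mindie-service/bin || exit 1
# exec keeps the daemon as the container's main process.
exec ./mindieservice_daemon
EOF
    chmod +x "$1"
}

# Usage (on the host, since /root/hw is bind-mounted into the container):
#   write_start_script /root/hw/run_qwen2.sh && bash -n /root/hw/run_qwen2.sh
```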

Common errors

1) Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
[Cause] -v /root:/root overwrote /root inside the container.
[Fix] Remove that mount.

2)

[root@localhost bin]# ./mindieservice_daemon
LogConfig: [json.exception.out_of_range.403] key 'LogConfig' not found
ERR: Failed to init endpoint! Please check the service log or console output.
Killed

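[Cause] The container's config.json was overwritten with cp /root/hw/config.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json. Judging by the error message, the copied file does not contain the LogConfig key that this MindIE version expects.
[Fix] Edit the file inside the container with vim instead of replacing it with cp.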
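
If a prepared config file really has to be copied in, a guard along the following lines can catch this failure before the daemon starts. This is a sketch under assumptions: the function names are invented, and the check is purely textual (it only looks for the key name from the error message above; it does not validate the JSON structure).

```shell
# has_json_key FILE KEY: succeed if "KEY": appears somewhere in FILE (textual check).
has_json_key() {
    grep -q "\"$2\"[[:space:]]*:" "$1"
}

# safe_replace_config NEW CURRENT: back up CURRENT, then copy NEW over it,
# but only if NEW contains the LogConfig key.
safe_replace_config() {
    if has_json_key "$1" LogConfig; then
        cp -p "$2" "$2.bak" && cp "$1" "$2"
    else
        echo "refusing: $1 has no LogConfig key" >&2
        return 1
    fi
}

# Usage:
#   safe_replace_config /root/hw/config.json \
#       /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
```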

3)

EE1001: [PID: 202] 2025-08-19-10:43:30.610.367 The argument is invalid.Reason: Set device failed, invalid device, set device=3, valid device range is [0, 2)
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        rtSetDevice execute failed, reason=[device id error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        open device 3 failed, runtime result = 107001.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

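[Cause] The NPUs visible inside the container are numbered 0 and 1, but the configuration names other IDs (the log shows device 3 against a valid range of [0, 2)).
[Fix] Set the NPU device IDs in the MindIE config to 0 and 1.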
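
To see which IDs are valid, count the davinciN nodes inside the container: per the error above, N mapped devices are addressed as 0 to N-1 regardless of their host numbers. The sketch below does that count; the function name is made up for illustration.

```shell
# visible_npu_ids DIR: count the davinciN nodes under DIR and print the
# logical ids 0..N-1 as a comma-separated list.
visible_npu_ids() {
    ids=""
    i=0
    for node in "$1"/davinci[0-9]*; do
        [ -e "$node" ] || continue      # the glob matched nothing
        ids="$ids${ids:+,}$i"
        i=$((i + 1))
    done
    echo "$ids"
}

# Run inside the container:
echo "valid device ids: $(visible_npu_ids /dev)"
```

With /dev/davinci6 and /dev/davinci7 mapped in, as in this post, the function prints 0,1.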

4) The model loads successfully, but the MindIE service fails while starting its HTTP listening ports.

[2025-08-19 10:44:57,046] [802] [281456198218080] [llm] [INFO] [logging.py-331] : >>>>>>id of kcache is 281462178701744 id of vcache is 281462178700016
2025-08-19 10:45:10,179 [INFO] standard_model.py:155 - >>>rank:0 done ibis manager to device
2025-08-19 10:45:10,179 [INFO] npu_compile.py:20 - 310P,some op does not support
2025-08-19 10:45:10,179 [INFO] standard_model.py:172 - >>>rank:0: return initialize success result: {'status': 'ok', 'npuBlockNum': '1695', 'cpuBlockNum': '426', 'maxPositionEmbeddings': '32768'}
2025-08-19 10:45:10,191 [INFO] standard_model.py:155 - >>>rank:1 done ibis manager to device
2025-08-19 10:45:10,192 [INFO] npu_compile.py:20 - 310P,some op does not support
2025-08-19 10:45:10,192 [INFO] standard_model.py:172 - >>>rank:1: return initialize success result: {'status': 'ok', 'npuBlockNum': '1656', 'cpuBlockNum': '426', 'maxPositionEmbeddings': '32768'}
ERR: Failed to init endpoint! Please check the service log or console output.
Killed

[Cause] Listening port conflict.
[Fix] Change the listening ports in the MindIE config.json to ports that are not in use.
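
Because the container runs with --net=host, it shares the host's ports, so a conflict can be checked on the host before starting the service. The sketch below parses the output of ss -ltn; the function names are invented, the port numbers in the usage line are placeholders, and it assumes ss (iproute2) is installed.

```shell
# listening_ports: read `ss -ltn` output on stdin, print the local ports, one per line.
listening_ports() {
    awk 'NR > 1 { n = split($4, a, ":"); print a[n] }' | sort -un
}

# port_conflicts "PORTS": read `ss -ltn` output on stdin and print each of the
# space-separated PORTS that something is already listening on.
port_conflicts() {
    used=$(listening_ports)
    for p in $1; do
        echo "$used" | grep -qx "$p" && echo "$p"
    done
    return 0
}

# Usage (replace the numbers with the three ports from your config.json):
#   ss -ltn | port_conflicts "1025 1026 1027"
```

Any port printed is already taken and needs a different value in the MindIE config.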
