[Audio Notes] deepseek-R1 Full-Strength 1.58-bit Model Deployment (Part 4)
This post records the process of deploying the deepseek-R1 1.58-bit large model inside a Docker container. The author first cleans out unrelated Docker containers, then creates a new container with the necessary paths mounted, but hits a proxy configuration error while installing Anaconda and building the virtual environment. Troubleshooting turns up an invalid proxy setting in the environment variables; clearing it fixes the apt update problem. A subsequent attempt to install NVIDIA driver 535 for CUDA 12.1 support ends with a "Failed to initialize NVML" error.
Background
Tencent Yuanbao's deepseek nonsense finally broke me. This time I'm taking it one solid step at a time, using the most reliable approach, to get the model deployed before I leave work today. I am utterly exhausted.
Constraints
The CUDA and PyTorch versions are both pinned by flash-attn:
- cuda toolkit: 11.8
- pytorch: 2.2
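These pins can be read straight off the flash-attn wheel that shows up in the mounted volume later in this post. An illustrative sketch of decoding its filename tags (not part of the install itself):

```shell
# Decode the build tags in the flash-attn wheel name: cu118 = built against
# CUDA 11.8, torch2.2 = PyTorch 2.2, cp311 = CPython 3.11 only.
whl="flash_attn-2.5.8+cu118torch2.2cxx11abiTRUE-cp311-cp311-linux_x86_64.whl"
case "$whl" in
  *cu118*torch2.2*cp311*) constraint="CUDA 11.8 toolkit, torch 2.2, Python 3.11" ;;
  *) constraint="unexpected wheel tags" ;;
esac
echo "wheel requires: $constraint"
```

A mismatched toolkit, torch, or Python version makes pip reject the wheel outright, which is why the constraints above come first.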
Make sure Docker mounts both the large-model files and the ktransformers installation files.
Check whether Docker still has enough disk space.
Build an anaconda3 virtual environment inside Docker; that is the only way it stays stable and portable. Anything else falls apart.
Overall workflow
Step 1: clean out all unrelated Docker containers
Step 2: enter the container and set up the virtual environment
Note: make sure the container has everything it needs mounted
Mount path: /opt/KTransformers+Unsloth
- Clean up Docker
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth/code/ktransformers$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c0f85d5ab457 nvidia/cuda:12.4.0-base-ubuntu22.04 "bash" 51 minutes ago Exited (0) 36 minutes ago deepseek-llm
1e5b78870661 nvidia/cuda:12.4.0-base-ubuntu22.04 "bash" 2 hours ago Up About an hour deepseek-models
d7149cadab4a llm-base:latest "bash" 3 hours ago Up 3 hours LLM
dd70e90a0c20 nvidia/cuda:12.4.0-base-ubuntu22.04 "bash" 5 days ago Up 25 hours deepseek-step
- Only deepseek-step remains
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
dd70e90a0c20 nvidia/cuda:12.4.0-base-ubuntu22.04 "bash" 5 days ago Up 25 hours deepseek-step
- Create a new container that copies deepseek-step's contents, mount path: /opt/KTransformers+Unsloth (to soothe my restlessness, frustration, and anger)
Remove the old container: docker rm deepseek-01
Start it interactively: docker run -it --name deepseek-01 -v "/opt/KTransformers+Unsloth":/opt/KTransformers+Unsloth new-deepseek-image bash
Once inside, run ls /opt/KTransformers+Unsloth to verify the mount.
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth$ docker rm deepseek-01
deepseek-01
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth$ docker run -it --name deepseek-01 -v "/opt/KTransformers+Unsloth":/opt/KTransformers+Unsloth new-deepseek-image bash
root@2109b966a8cf:/# ls /opt/KTransformers+Unsloth
bin code conf dist flash-attention flash_attn-2.5.8+cu118torch2.2cxx11abiTRUE-cp311-cp311-linux_x86_64.whl ktransformers model opt
- Download anaconda3: error
root@2109b966a8cf:/# wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
Error parsing proxy URL http://your-proxy:port: Bad port number.
root@2109b966a8cf:/#
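"Bad port number" points at the real problem before any network I/O happens: "port" in http://your-proxy:port is not a number, so the proxy URL cannot even be parsed. An illustrative reproduction of the same parse failure (using Python's urllib, not wget itself):

```shell
# Reproduce the failure wget hit: a non-numeric port makes the proxy URL
# invalid before any connection is attempted.
parse_msg=$(python3 -c "
from urllib.parse import urlsplit
try:
    urlsplit('http://your-proxy:port').port
except ValueError as e:
    print('bad proxy URL:', e)
")
echo "$parse_msg"
```

So no amount of retrying the download helps; the proxy setting itself has to go, which is what the rest of this section tracks down.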
- Install the virtual environment kt
- Continuing with the installation, another error:
Err:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Could not resolve 'your-proxy'
Err:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Could not resolve 'your-proxy'
Err:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Could not resolve 'your-proxy'
Err:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Could not resolve 'your-proxy'
Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease
Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease
Could not resolve 'your-proxy'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy/InRelease Could not resolve 'your-proxy'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease Could not resolve 'your-proxy'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease Could not resolve 'your-proxy'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease Could not resolve 'your-proxy'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/InRelease Could not resolve 'your-proxy'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Something fishy here; the error is glaringly absurd: http://your-proxy:port
(kt) root@2109b966a8cf:~/autodl-tmp# echo $http_proxy $https_proxy
http://your-proxy:port http://your-proxy:port
Root cause located: the http_proxy and https_proxy environment variables were set to the invalid placeholder proxy http://your-proxy:port.
Find where it is set and delete it.
# Remove proxy from environment variables
unset http_proxy
unset https_proxy
unset ftp_proxy
# Remove proxy from APT configuration
sudo rm -f /etc/apt/apt.conf.d/95proxies
sudo rm -f /etc/apt/apt.conf.d/proxy
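A quick sanity check after the cleanup. This sketch clears every common proxy variable (both cases; the names are the usual conventions, adjust if your image uses others) and confirms nothing survives:

```shell
# Clear every common proxy variable and verify the shell is clean.
unset http_proxy https_proxy ftp_proxy all_proxy no_proxy
unset HTTP_PROXY HTTPS_PROXY FTP_PROXY ALL_PROXY NO_PROXY
status="clean"
# grep finds nothing once no proxy variable remains; -i also catches uppercase
env | grep -qiE '^(http|https|ftp|all|no)_proxy=' && status="dirty"
echo "proxy environment: $status"
```

Note that unset only affects the current shell; if the variable comes back in a new session, look for an export in /root/.bashrc, /etc/profile.d/, or /etc/environment.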
- Installing the driver inside Docker
nvidia-smi gives no response: no driver installed, and neither CUDA nor the GPU is recognized.
Per Doubao: installing nvidia-driver-535 is what lets the system recognize and drive the NVIDIA GPU, and it is the foundation CUDA 12.1 runs on. That driver version matches CUDA 12.1; only once the driver is in place does nvidia-smi produce output, and only then can CUDA tap the GPU for AI workloads.
After installing it, everything looked dead:
(kt) root@2109b966a8cf:~/autodl-tmp# nvidia-smi
Failed to initialize NVML: Unknown Error
(kt) root@2109b966a8cf:~/autodl-tmp# lsmod | grep nvidia
nvidia_uvm 2076672 0
nvidia_drm 135168 0
drm_ttm_helper 16384 1 nvidia_drm
nvidia_modeset 1638400 1 nvidia_drm
video 77824 1 nvidia_modeset
nvidia 104071168 2 nvidia_uvm,nvidia_modeset
(kt) root@2109b966a8cf:~/autodl-tmp# dmesg | grep -i nvidia | tail -20
dmesg: read kernel buffer failed: Operation not permitted
(kt) root@2109b966a8cf:~/autodl-tmp# nvidia-detector
nvidia-driver-580
The NVIDIA driver cannot work properly inside the container. The underlying issue: a container shares the host's kernel (here 6.14.0-35-generic), so kernel modules have to come from the driver installed on the host; installing a driver inside the container cannot load them. We need to check the host and act accordingly. The container has no nvidia-container-cli, so the container runtime configuration cannot be inspected directly from inside; the options are to install diagnostic tools in the container or rebuild it. The missing tools suggest the container environment is incomplete. The key finding: the NVIDIA device nodes are incomplete (only /dev/nvidiactl exists; /dev/nvidia0 and friends are missing).
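A minimal check for that device-node symptom, runnable inside any container (it only diagnoses; the fix is granting GPU access from the host side with --gpus and nvidia-container-toolkit):

```shell
# Sketch: check from inside a container whether the NVIDIA device nodes were
# passed through. A container started without --gpus (or on a host missing
# nvidia-container-toolkit) typically has none, matching the symptom above.
if ls /dev/nvidia0 > /dev/null 2>&1; then
    gpu_nodes="present"
else
    gpu_nodes="missing"
fi
echo "GPU device nodes: $gpu_nodes"
```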
Run the following on the host to fix the image-pull timeouts:
# Create the Docker registry-mirror configuration
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json << 'EOF'
{
"registry-mirrors": [
"https://docker.mirrors.ustc.edu.cn",
"https://hub-mirror.c.163.com",
"https://registry.docker-cn.com"
]
}
EOF
# Restart the Docker service to apply the configuration
sudo systemctl daemon-reload
sudo systemctl restart docker
# Check the Docker service status
sudo systemctl status docker
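Before restarting Docker it is worth validating daemon.json: a JSON syntax error there stops dockerd from starting at all. A sketch that checks a temp copy so it can run anywhere (on the host, point it at /etc/docker/daemon.json instead):

```shell
# Write a copy of the mirror config and validate it as JSON before letting
# dockerd read it; json.tool exits non-zero on any syntax error.
cat > /tmp/daemon.json << 'EOF'
{
  "registry-mirrors": [
    "https://docker.mirrors.ustc.edu.cn",
    "https://hub-mirror.c.163.com"
  ]
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json OK"
```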
Verify that the host GPU and Docker's GPU support are working:
# Test basic NVIDIA functionality
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
# If the above fails, try another tag
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
hyt@user-H3C-UniServer-R4900-G5:~$ # Test basic NVIDIA functionality
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
# If the above fails, try another tag
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:12.0-base' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
Run 'docker run --help' for more information
Thu Nov 27 08:10:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:8A:00.0 Off | Off |
| 30% 35C P8 22W / 450W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Why the first command failed:
The nvidia/cuda:12.0-base image exists neither locally nor on the mirrors.
The timeout is likely because that tag is obsolete or no longer available.
Why the second command succeeded:
nvidia/cuda:12.4.0-base-ubuntu22.04 is a valid, available image tag.
The mirror configuration took effect, and the pull came through a domestic mirror.
docker run -it --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 bash
This command runs on the host and starts a GPU-enabled Docker container: it creates an interactive environment from the CUDA 12.4 image, attaches all GPU resources, and drops into a bash shell.
- Create the new container: GPU 0 attached, memory capped at 90 GB, a CPU limit (note: Docker attaches whole GPUs rather than reserving "30 GB of VRAM", and --cpus=0.5 allows only half a core)
docker run -itd --name deepseek-02 \
--gpus '"device=0"' \
--memory=90g \
--memory-swap=92g \
--cpus=0.5 \
-v /opt/KTransformers+Unsloth/:/app/models \
deepseek-custom:latest \
bash
My original understanding of Docker was wrong: I assumed any mistake (forgetting a volume, forgetting to set memory / GPU / CPU limits) meant tearing everything down and starting over. In fact Docker lets you set resource limits for GPU, CPU, and memory, and there is a sane recovery path: commit the current container (e.g. deepseek-02) to an image (e.g. deepseek-custom:latest), stop or remove the old container, then create a new container from that image. Everything installed in the old container carries over, including virtual environments (though a venv seems path-dependent and breaks easily, while an anaconda3 environment is far more robust and self-contained). As the command above shows, the new container gets its GPU, CPU, memory, mounted volume, and base image all at creation time.
After creating and entering the new container I accidentally installed miniconda3, and later, when I spotted the kt environment and tried to activate it, it failed:
CondaError: Run 'conda init' before 'conda activate'
It looks as if the original anaconda3 environments were orphaned, presumably because I had pointlessly installed miniconda3:
root@0335e05fabec:/# # List all available conda environments
conda info --envs
# Activate the anaconda3 environment
conda activate /root/anaconda3
#conda environments:
#* -> active
#+ -> frozen
base /opt/miniconda3
/root/anaconda3
/root/anaconda3/envs/kt
So I re-initialized conda from the anaconda3 installation and entered the kt virtual environment:
/root/anaconda3/bin/conda init bash
root@0335e05fabec:/# source ~/.bashrc
(base) root@0335e05fabec:/# conda activate /root/anaconda3
(base) root@0335e05fabec:/# conda info --envs
# conda environments:
#
/opt/miniconda3
base * /root/anaconda3
kt /root/anaconda3/envs/kt
(base) root@0335e05fabec:/# conda activate kt
(kt) root@0335e05fabec:/# pip list
Checking the installed packages: everything is there. Excellent; I have finally learned to use Docker freely:
(kt) root@0335e05fabec:/# pip list
Package Version
------------------------ ------------
certifi 2022.12.7
charset-normalizer 2.1.1
cpufeature 0.2.1
einops 0.8.1
filelock 3.20.0
flash-attn 2.5.8
fsspec 2025.10.0
idna 3.4
Jinja2 3.1.6
MarkupSafe 3.0.3
mpmath 1.3.0
networkx 3.6
ninja 1.13.0
numpy 1.24.3
nvidia-cublas-cu11 11.11.3.6
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.7.0.84
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.3.0.86
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.5.86
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.19.3
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.9.86
nvidia-nvtx-cu11 11.8.86
nvidia-nvtx-cu12 12.1.105
packaging 25.0
pillow 11.3.0
pip 25.3
requests 2.28.1
setuptools 80.9.0
sympy 1.14.0
torch 2.2.0+cu121
torchaudio 2.2.0+cu121
torchvision 0.17.0+cu121
triton 2.2.0
typing_extensions 4.15.0
urllib3 1.26.13
wheel 0.45.1
With that, ktransformers can go in next, but I forgot how to reach the mounted volume. Checking the original creation command, the volume maps to /app/models inside the container, and sure enough everything is there:
(kt) root@0335e05fabec:/app/models/code# cd ..
(kt) root@0335e05fabec:/app/models# ll
total 119172
drwxrwxr-x 10 1002 1002 4096 Nov 26 06:41 ./
drwxr-xr-x 3 root root 4096 Nov 27 09:01 ../
-rw-r--r-- 1 root root 16 Nov 6 02:37 .webui_secret_key
drwxrwxr-x 2 1002 1002 4096 Nov 3 04:58 bin/
drwxrwxr-x 4 1002 1002 4096 Nov 13 09:19 code/
drwxrwxr-x 3 1002 1002 4096 Nov 3 04:58 conf/
drwxrwxr-x 2 1002 1002 4096 Nov 17 06:56 dist/
drwxrwxr-x 12 1002 1002 4096 Nov 17 07:48 flash-attention/
-rw-rw-r-- 1 1002 1002 121978927 Nov 17 08:05 flash_attn-2.5.8+cu118torch2.2cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
drwxr-xr-x 2 root root 4096 Nov 26 06:41 ktransformers/
drwxr-xr-x 4 root root 4096 Nov 6 01:29 model/
drwxrwxr-x 2 1002 1002 4096 Nov 3 04:58 opt/
- Install ktransformers
According to the PDF, a few tools still seemed to be missing, so I installed them, and that promptly broke:
(kt) root@0335e05fabec:/app/models/code/ktransformers# conda install -c conda-forge libstdcxx-ng
Retrieving notices: ...working... ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <Failed to parse: http://your-proxy:port> for channel: pkgs/main url: https://repo.anaconda.com/pkgs/main/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <Failed to parse: http://your-proxy:port> for channel: conda-forge url: https://conda.anaconda.org/conda-forge/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <Failed to parse: http://your-proxy:port> for channel: pkgs/r url: https://repo.anaconda.com/pkgs/r/notices.json
done
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.12/site-packages/urllib3/util/url.py", line 425, in parse_url
host, port = _HOST_PORT_RE.match(host_port).groups() # type: ignore[union-attr]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'groups'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.12/site-packages/requests/adapters.py", line 633, in send
conn = self.get_connection_with_tls_context(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/requests/adapters.py", line 476, in get_connection_with_tls_context
proxy = prepend_scheme_if_needed(proxy, "http")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/requests/utils.py", line 995, in prepend_scheme_if_needed
parsed = parse_url(url)
^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/urllib3/util/url.py", line 451, in parse_url
raise LocationParseError(source_url) from e
urllib3.exceptions.LocationParseError: Failed to parse: http://your-proxy:port
Looks like the same old recipe, same old problem.
# Set the container to restart automatically (recommended for production)
docker update --restart=unless-stopped deepseek-02
# Verify the restart policy
docker inspect deepseek-02 | grep -A5 -B5 RestartPolicy
Still no good: it complains the underlying driver is wrong and everything has to start over again. I'm drained.