[Audio Annotation] - Deploying the Full-Strength DeepSeek-R1 1.58-Bit Model (Part 4)

Background

Yuanbao's DeepSeek, that piece of garbage, finally broke me. This time I'm going to take it one solid step at a time and, in the most reliable way possible, get the model deployed before I leave work today. I'm so damn tired of this.

Constraints

The CUDA and PyTorch versions are pinned by flash-attn:
CUDA toolkit: 11.8
PyTorch: 2.2
Make sure Docker mounts the model files and the ktransformers install files.
Check whether Docker still has enough disk space.
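A quick way to confirm the pinned stack from inside the container; this is just a sketch, and it falls back to a message wherever torch is not importable:

```shell
# Print the torch build and the CUDA toolkit it was compiled against;
# the constraints above expect a torch 2.2 build that flash-attn matches.
VER=$(python -c 'import torch; print(torch.__version__, torch.version.cuda)' 2>/dev/null \
      || echo "torch not importable here")
echo "$VER"
```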

Set up an Anaconda3 virtual environment inside Docker - that's the only way to keep things stable and easy to migrate. Anything else is asking for trouble.
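A minimal sketch of that env setup, assuming Anaconda3 is already installed in the container (python=3.11 matches the cp311 flash-attn wheel that shows up in the mount later):

```shell
# Create an isolated env named kt; guarded so the snippet degrades
# gracefully on machines without conda.
if command -v conda >/dev/null 2>&1; then
  conda create -n kt python=3.11 -y
  KT_MSG="env kt created; activate with: conda activate kt"
else
  KT_MSG="conda not found - install Anaconda3 first"
fi
echo "$KT_MSG"
```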

Overall workflow

Step 1: clean out all unrelated Docker containers

Step 2: enter Docker and create the virtual environment

Note: make sure Docker has the required content mounted.
Mount path: /opt/KTransformers+Unsloth

  1. Clean up Docker
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth/code/ktransformers$ docker ps -a
CONTAINER ID   IMAGE                                 COMMAND   CREATED          STATUS                      PORTS     NAMES
c0f85d5ab457   nvidia/cuda:12.4.0-base-ubuntu22.04   "bash"    51 minutes ago   Exited (0) 36 minutes ago             deepseek-llm
1e5b78870661   nvidia/cuda:12.4.0-base-ubuntu22.04   "bash"    2 hours ago      Up About an hour                      deepseek-models
d7149cadab4a   llm-base:latest                       "bash"    3 hours ago      Up 3 hours                            LLM
dd70e90a0c20   nvidia/cuda:12.4.0-base-ubuntu22.04   "bash"    5 days ago       Up 25 hours                           deepseek-step
  2. Only deepseek-step remains
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth$ docker ps
CONTAINER ID   IMAGE                                 COMMAND   CREATED      STATUS        PORTS     NAMES
dd70e90a0c20   nvidia/cuda:12.4.0-base-ubuntu22.04   "bash"    5 days ago   Up 25 hours             deepseek-step
  3. Create a new container copying deepseek-step's contents, mount path: /opt/KTransformers+Unsloth (to soothe my inner restlessness, frustration, and fury)
Remove the old container: docker rm deepseek-01
Start interactively: docker run -it --name deepseek-01 -v "/opt/KTransformers+Unsloth":/opt/KTransformers+Unsloth new-deepseek-image bash
Once inside, run ls /opt/KTransformers+Unsloth to verify the mount.

(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth$ docker rm deepseek-01
deepseek-01
(base) hyt@user-H3C-UniServer-R4900-G5:/opt/KTransformers+Unsloth$ docker run -it --name deepseek-01 -v "/opt/KTransformers+Unsloth":/opt/KTransformers+Unsloth new-deepseek-image bash
root@2109b966a8cf:/# ls /opt/KTransformers+Unsloth
bin  code  conf  dist  flash-attention  flash_attn-2.5.8+cu118torch2.2cxx11abiTRUE-cp311-cp311-linux_x86_64.whl  ktransformers  model  opt

  4. Download Anaconda3 - error
root@2109b966a8cf:/# wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
Error parsing proxy URL http://your-proxy:port: Bad port number.
root@2109b966a8cf:/#
  5. Install the virtual environment kt
  6. Continuing the install, another error
Err:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
  Could not resolve 'your-proxy'
Err:3 http://archive.ubuntu.com/ubuntu jammy InRelease
  Could not resolve 'your-proxy'
Err:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
  Could not resolve 'your-proxy'
Err:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
  Could not resolve 'your-proxy'
Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
  Could not resolve 'your-proxy'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy/InRelease  Could not resolve 'your-proxy'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease  Could not resolve 'your-proxy'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease  Could not resolve 'your-proxy'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease  Could not resolve 'your-proxy'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/InRelease  Could not resolve 'your-proxy'
W: Some index files failed to download. They have been ignored, or old ones used instead.

Then I spotted the catch - the error is glaringly absurd: http://your-proxy:port

(kt) root@2109b966a8cf:~/autodl-tmp# echo $http_proxy $https_proxy
http://your-proxy:port http://your-proxy:port

Root cause located: the environment variables http_proxy and https_proxy are set to the invalid proxy http://your-proxy:port.
Find them and remove them.

# Remove proxy from environment variables
unset http_proxy
unset https_proxy
unset ftp_proxy

# Remove proxy from APT configuration
sudo rm -f /etc/apt/apt.conf.d/95proxies
sudo rm -f /etc/apt/apt.conf.d/proxy
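The unset above only fixes the current shell, and misses the uppercase variants. A sketch to track down wherever the placeholder persists (the searched files are the usual suspects, an assumption on my part rather than something confirmed here):

```shell
# Clear lowercase AND uppercase proxy variables for this shell...
unset http_proxy https_proxy ftp_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY
# ...then search common config files for the bogus value so it cannot
# come back in the next login shell.
grep -Rns "your-proxy" /etc/environment /etc/apt/apt.conf.d \
  ~/.bashrc ~/.profile ~/.condarc 2>/dev/null \
  || echo "no lingering proxy entries found"
```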
  7. Install the driver inside Docker
    nvidia-smi gives no response - no driver installed, so neither CUDA nor the GPU is recognized

Per Doubao's advice, nvidia-driver-535 is installed so the system can recognize and drive the NVIDIA GPU; it is the foundation CUDA 12.1 runs on. That driver version matches CUDA 12.1: only with the driver in place will nvidia-smi produce output, and only then can CUDA tap the GPU for AI workloads.
After installing it, everything seemed dead:

(kt) root@2109b966a8cf:~/autodl-tmp# nvidia-smi
Failed to initialize NVML: Unknown Error
(kt) root@2109b966a8cf:~/autodl-tmp# lsmod | grep nvidia
nvidia_uvm           2076672  0
nvidia_drm            135168  0
drm_ttm_helper         16384  1 nvidia_drm
nvidia_modeset       1638400  1 nvidia_drm
video                  77824  1 nvidia_modeset
nvidia              104071168  2 nvidia_uvm,nvidia_modeset
(kt) root@2109b966a8cf:~/autodl-tmp# dmesg | grep -i nvidia | tail -20
dmesg: read kernel buffer failed: Operation not permitted
(kt) root@2109b966a8cf:~/autodl-tmp# nvidia-detector
nvidia-driver-580

The NVIDIA driver doesn't work inside the container, most likely because the container's kernel modules don't match the host. We installed the NVIDIA driver in the container, but the container's kernel version (6.14.0-35-generic) may differ from the host's, so the modules fail to load. The next move is to check the host and act accordingly. Since the container has no nvidia-container-cli, the container runtime configuration can't be inspected directly; we could try installing diagnostic tools inside the container, or rebuild it. The container also lacks basic diagnostic tools, which suggests the environment is incomplete. The key symptom: the NVIDIA device nodes are incomplete (only /dev/nvidiactl exists; /dev/nvidia0 and the rest are missing).
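For the record, the usual fix is host-side, not in-container: a container shouldn't carry its own driver; the host's driver gets exposed through NVIDIA Container Toolkit plus `--gpus all` at `docker run`. A hedged sketch for an Ubuntu host (the command names are the toolkit's own, but verify against NVIDIA's install guide for your distro):

```shell
# Wire Docker up to the NVIDIA runtime on the host; guarded so the
# sketch only acts when the toolkit is actually installed.
if command -v nvidia-ctk >/dev/null 2>&1; then
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker
  CTK_MSG="nvidia runtime configured for docker"
else
  CTK_MSG="toolkit missing - try: sudo apt-get install -y nvidia-container-toolkit"
fi
echo "$CTK_MSG"
```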
Run the following on the host to fix the image-pull timeouts:

# Create the Docker registry-mirror configuration
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json << 'EOF'
{
  "registry-mirrors": [
    "https://docker.mirrors.ustc.edu.cn",
    "https://hub-mirror.c.163.com",
    "https://registry.docker-cn.com"
  ]
}
EOF
# Restart the Docker service so the configuration takes effect
sudo systemctl daemon-reload
sudo systemctl restart docker
# Verify the Docker service status
sudo systemctl status docker

Verify that the host GPU and Docker's GPU support are working:

# Test basic NVIDIA functionality
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
# If the above fails, try another tag
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi


hyt@user-H3C-UniServer-R4900-G5:~$ # Test basic NVIDIA functionality
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
# If the above fails, try another tag
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:12.0-base' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
Run 'docker run --help' for more information
Thu Nov 27 08:10:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:8A:00.0 Off |                  Off |
| 30%   35C    P8             22W /  450W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Why the first command failed:
the nvidia/cuda:12.0-base image exists neither locally nor on the mirrors
the network timeout is likely because that tag is outdated or no longer available
Why the second command succeeded:
nvidia/cuda:12.4.0-base-ubuntu22.04 is a valid, available image tag
the registry-mirror configuration took effect, so the pull succeeded from a domestic mirror

docker run -it --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 bash

This command runs on the host and starts an NVIDIA-GPU-enabled Docker container: it creates an interactive environment from the CUDA 12.4 image, attaches all GPU resources, and drops into a bash shell.

  8. Create a new Docker container with 30 GB of GPU memory, 90 GB of RAM, and some CPU
docker run -itd --name deepseek-02 \
  --gpus '"device=0"' \
  --memory=90g \
  --memory-swap=92g \
  --cpus=0.5 \
  -v /opt/KTransformers+Unsloth/:/app/models \
  deepseek-custom:latest \
  bash
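
The `deepseek-custom:latest` image referenced above comes from committing an earlier container. A sketch of that freeze-and-recreate cycle (container/image names follow this log, and each call is guarded so the snippet is a no-op where docker is absent):

```shell
# Snapshot the old container's filesystem into an image; the old
# container can then be removed and recreated with the docker run above.
if command -v docker >/dev/null 2>&1; then
  docker commit deepseek-step deepseek-custom:latest || true
  COMMIT_MSG="image committed; recreate the container from it"
else
  COMMIT_MSG="docker not available in this shell"
fi
echo "$COMMIT_MSG"
```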

My original understanding of Docker was wrong: I assumed any mistake meant tearing everything down and starting over (say, forgetting to mount a volume, or forgetting to set memory/GPU/CPU limits). Yes, I had it backwards. Docker can cap resource usage - GPU, CPU, memory. I can first commit a current container (e.g. deepseek-02) to an image (e.g. deepseek-custom:latest), stop or remove the old container, then create a new one (e.g. deepseek-02) from that image: everything already installed in the old image is inherited, including installed virtual environments (though venv environments seem path-dependent and break easily, whereas Anaconda3 environments are tough and fairly self-contained). As the command above shows, the new container gets its GPU, CPU, memory, mounted volume, and base image.

After creating and entering the new container, I accidentally installed miniconda3, and later, when I tried to activate the kt environment, I hit an error:

CondaError: Run 'conda init' before 'conda activate'

It looks like the original anaconda3 environments got "frozen", presumably because I had pointlessly installed miniconda3:

root@0335e05fabec:/# # list all available conda environments
conda info --envs
# activate the anaconda3 environment
conda activate /root/anaconda3
#conda environments:
#*  -> active
#+ -> frozen
base                     /opt/miniconda3
                       /root/anaconda3
                       /root/anaconda3/envs/kt

So I re-initialized the anaconda3 install and stepped into the kt virtual environment:

/root/anaconda3/bin/conda init bash
root@0335e05fabec:/# source ~/.bashrc
(base) root@0335e05fabec:/# conda activate /root/anaconda3
(base) root@0335e05fabec:/# conda info --envs
# conda environments:
#
                         /opt/miniconda3
base                  *  /root/anaconda3
kt                       /root/anaconda3/envs/kt
(base) root@0335e05fabec:/# conda activate kt
(kt) root@0335e05fabec:/# pip list

A look at the installed packages - all still there. Excellent; I have finally learned to use Docker at will:

(kt) root@0335e05fabec:/# pip list
Package                  Version
------------------------ ------------
certifi                  2022.12.7
charset-normalizer       2.1.1
cpufeature               0.2.1
einops                   0.8.1
filelock                 3.20.0
flash-attn               2.5.8
fsspec                   2025.10.0
idna                     3.4
Jinja2                   3.1.6
MarkupSafe               3.0.3
mpmath                   1.3.0
networkx                 3.6
ninja                    1.13.0
numpy                    1.24.3
nvidia-cublas-cu11       11.11.3.6
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu11   11.8.87
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu11   11.8.89
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11        8.7.0.84
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu11        10.9.0.58
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu11       10.3.0.86
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu11     11.4.1.48
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu11     11.7.5.86
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu11         2.19.3
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvtx-cu11         11.8.86
nvidia-nvtx-cu12         12.1.105
packaging                25.0
pillow                   11.3.0
pip                      25.3
requests                 2.28.1
setuptools               80.9.0
sympy                    1.14.0
torch                    2.2.0+cu121
torchaudio               2.2.0+cu121
torchvision              0.17.0+cu121
triton                   2.2.0
typing_extensions        4.15.0
urllib3                  1.26.13
wheel                    0.45.1

With that, ktransformers can be installed right away - except I forgot how to reach the mounted volume. Checking the original create command, the volume maps to /app/models inside the container; looking there, everything is present:

(kt) root@0335e05fabec:/app/models/code# cd ..
(kt) root@0335e05fabec:/app/models# ll
total 119172
drwxrwxr-x 10 1002 1002      4096 Nov 26 06:41 ./
drwxr-xr-x  3 root root      4096 Nov 27 09:01 ../
-rw-r--r--  1 root root        16 Nov  6 02:37 .webui_secret_key
drwxrwxr-x  2 1002 1002      4096 Nov  3 04:58 bin/
drwxrwxr-x  4 1002 1002      4096 Nov 13 09:19 code/
drwxrwxr-x  3 1002 1002      4096 Nov  3 04:58 conf/
drwxrwxr-x  2 1002 1002      4096 Nov 17 06:56 dist/
drwxrwxr-x 12 1002 1002      4096 Nov 17 07:48 flash-attention/
-rw-rw-r--  1 1002 1002 121978927 Nov 17 08:05 flash_attn-2.5.8+cu118torch2.2cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
drwxr-xr-x  2 root root      4096 Nov 26 06:41 ktransformers/
drwxr-xr-x  4 root root      4096 Nov  6 01:29 model/
drwxrwxr-x  2 1002 1002      4096 Nov  3 04:58 opt/
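That listing includes the prebuilt flash-attn wheel, so the next step can start from it. A sketch (the editable ktransformers install is my assumption - follow the project's own install docs):

```shell
# Install the matching flash-attn wheel from the mounted volume first,
# then ktransformers from its source checkout.
WHEEL=/app/models/flash_attn-2.5.8+cu118torch2.2cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
if [ -f "$WHEEL" ]; then
  pip install "$WHEEL"
  pip install -e /app/models/code/ktransformers   # assumption, not confirmed
  KTR_MSG="flash-attn wheel + ktransformers installed"
else
  KTR_MSG="wheel not found - is the volume mounted at /app/models?"
fi
echo "$KTR_MSG"
```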
  9. Install ktransformers
    According to the PDF, some tools still seemed to be missing, so I installed them - and ran into trouble.
(kt) root@0335e05fabec:/app/models/code/ktransformers# conda install -c conda-forge libstdcxx-ng
Retrieving notices: ...working... ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <Failed to parse: http://your-proxy:port> for channel: pkgs/main url: https://repo.anaconda.com/pkgs/main/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <Failed to parse: http://your-proxy:port> for channel: conda-forge url: https://conda.anaconda.org/conda-forge/notices.json
ERROR conda.notices.fetch:get_channel_notice_response(73): Request error <Failed to parse: http://your-proxy:port> for channel: pkgs/r url: https://repo.anaconda.com/pkgs/r/notices.json
done
Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): failed

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/root/anaconda3/lib/python3.12/site-packages/urllib3/util/url.py", line 425, in parse_url
        host, port = _HOST_PORT_RE.match(host_port).groups()  # type: ignore[union-attr]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'groups'

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/root/anaconda3/lib/python3.12/site-packages/requests/adapters.py", line 633, in send
        conn = self.get_connection_with_tls_context(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/anaconda3/lib/python3.12/site-packages/requests/adapters.py", line 476, in get_connection_with_tls_context
        proxy = prepend_scheme_if_needed(proxy, "http")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/anaconda3/lib/python3.12/site-packages/requests/utils.py", line 995, in prepend_scheme_if_needed
        parsed = parse_url(url)
                 ^^^^^^^^^^^^^^
      File "/root/anaconda3/lib/python3.12/site-packages/urllib3/util/url.py", line 451, in parse_url
        raise LocationParseError(source_url) from e
    urllib3.exceptions.LocationParseError: Failed to parse: http://your-proxy:port

Same old recipe, same old problem.

# Set the container to restart automatically (recommended for production)
docker update --restart=unless-stopped deepseek-02

# Verify the restart policy
docker inspect deepseek-02 | grep -A5 -B5 RestartPolicy

Still no luck - now it complains the underlying driver is wrong and everything has to start over. I'm drained.
